Rotating Tor Proxy

At Fathom Data we have a few projects which require us to send HTTP requests from an evolving selection of IP addresses. This post details one approach which uses Tor (The Onion Router).

What is a Proxy Server? 

A proxy server acts as an intermediary between a client and a server. When a request goes through a proxy server there is no direct connection between the client and the server. The client connects to the proxy and the proxy then connects to the server. Requests and responses pass through the proxy.

There are various types of proxy server.

HTTP & SOCKS Proxies 

HTTP (HyperText Transfer Protocol) is the dominant protocol for information exchange on the web. HTTP is stateless: the client (often a browser) sends a request to a server, the server replies with a response, and each such exchange is independent of the ones before it. In the simplest case the connection is closed once the exchange is over, so any further interactions require new connections, although HTTP/1.1 and later can reuse one connection for several requests.

An HTTP proxy uses the HTTP protocol: the client-proxy and proxy-server interactions all occur via HTTP. An HTTP proxy is able to filter or modify the content of the requests and responses passing through it. An HTTP proxy is only able to handle HTTP and HTTPS requests.

SOCKS (usually expanded as Socket Secure) is another internet protocol. SOCKS relays TCP (Transmission Control Protocol) connections on behalf of the client. TCP is a lower level protocol than HTTP.

A SOCKS proxy uses the SOCKS protocol. Despite the name, SOCKS does not encrypt traffic. A SOCKS proxy simply relays bytes between the client and the server without interpreting them, so it does not modify or filter the contents of requests or responses. Since SOCKS operates at a lower level in the networking hierarchy, a SOCKS proxy can carry any TCP-based protocol (not only HTTP) and adds little overhead because it does no content processing.
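
To make the "lower level" point concrete, here is a minimal sketch of the first two messages a SOCKS5 client sends, following the layout in RFC 1928. The function names are hypothetical and nothing is sent over the network. The point is only that the proxy is told a destination host and port, after which it relays whatever bytes follow.

```python
import struct


def socks5_greeting() -> bytes:
    """Client greeting: version 5, one method offered, 0x00 = no authentication."""
    return bytes([0x05, 0x01, 0x00])


def socks5_connect_request(host: str, port: int) -> bytes:
    """CONNECT request that names the destination by domain name."""
    name = host.encode("idna")
    if len(name) > 255:
        raise ValueError("domain name too long for a SOCKS5 request")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain name), then length-prefixed name.
    header = bytes([0x05, 0x01, 0x00, 0x03, len(name)])
    # The port is a 16-bit unsigned integer in network (big-endian) byte order.
    return header + name + struct.pack("!H", port)


request = socks5_connect_request("httpbin.org", 80)
print(request.hex(" "))
```

For httpbin.org on port 80 the request is 18 bytes long. Nothing in it describes the HTTP request that will follow, which is why a SOCKS proxy has no basis for filtering or rewriting content.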

Tor Proxy Docker Image 

We have constructed a Docker image which uses the Tor network to expose both SOCKS and HTTP proxies. The image uses the following components:

Tor 

Tor provides an anonymous SOCKS proxy. The container will run multiple (by default 5) Tor instances, each of which will (in general) have a different exit node. This means that requests being routed through each instance will appear to come from a distinct IP address.

HAProxy 

HAProxy is a high availability proxy server and load balancer, which means that it spreads requests across multiple backend services.

The container uses HAProxy to distribute SOCKS requests across the Tor instances. It uses a round robin scheduling strategy.
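
Round robin scheduling is easy to picture in a few lines of Python. This is only an illustration of the strategy, not HAProxy's implementation. The port numbers 10000 to 10004 are taken from the container logs shown later in this post.

```python
from itertools import cycle

# One local SOCKS port per Tor instance (default of 5 instances).
tor_ports = [10000 + i for i in range(5)]

# cycle() yields the ports in order and wraps around forever.
picker = cycle(tor_ports)

# Assign six consecutive requests to backends.
assignments = [next(picker) for _ in range(6)]
print(assignments)
```

The sixth request lands on the same backend as the first, which matches the wrap-around behaviour seen in the testing section.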

Privoxy 

Many services prefer to communicate via HTTP or cannot communicate via SOCKS. To cater for this the container uses Privoxy to accept HTTP requests and forward them as SOCKS requests to HAProxy.

Running the Docker Image 

The image can be used to launch containers which serve an HTTP or SOCKS proxy.

Environment Variables 

There are a few environment variables which can be used to configure the container:

  • TORS — Number of Tor instances (default: 5)
  • HAPROXY_LOGIN — Username for HAProxy (default: “admin”)
  • HAPROXY_PASSWORD — Password for HAProxy (default: “admin”)

Ports 

The container exposes the following ports:

  • 1080 — HAProxy port (SOCKS protocol)
  • 2090 — HAProxy statistics port
  • 8888 — Privoxy port (HTTP protocol)

You can map one or more of these ports onto the host when you launch a container.

# HTTP proxy on port 8888
docker run -p 8888:8888 datawookie/tor-proxy-rotating
# SOCKS proxy on port 1080
docker run -p 1080:1080 datawookie/tor-proxy-rotating
# Both HTTP and SOCKS proxies
docker run -p 8888:8888 -p 1080:1080 datawookie/tor-proxy-rotating
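
If you launch containers from a script, the environment variables and port mappings can be assembled programmatically. The sketch below is one way to do that. The helper name `docker_run_command` is hypothetical, and it relies only on Docker's standard `-e` and `-p` options. It builds the command but does not run it.

```python
import shlex


def docker_run_command(tors=5, login="admin", password="admin", ports=(8888, 1080)):
    """Return the argument list for launching the proxy container."""
    args = ["docker", "run"]
    settings = {
        "TORS": tors,
        "HAPROXY_LOGIN": login,
        "HAPROXY_PASSWORD": password,
    }
    # Each environment variable is passed with its own -e option.
    for name, value in settings.items():
        args += ["-e", f"{name}={value}"]
    # Map each container port onto the same port on the host.
    for port in ports:
        args += ["-p", f"{port}:{port}"]
    args.append("datawookie/tor-proxy-rotating")
    return args


command = docker_run_command(tors=10)
print(shlex.join(command))
```

The returned list can be handed directly to `subprocess.run()`, which avoids shell quoting problems if the password contains special characters.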

Browser 

Set up your browser to use the HTTP proxy at address 127.0.0.1 and port 8888 (assuming that you mapped port 8888 onto the host as above). You could equally choose to use the SOCKS proxy at address 127.0.0.1 and port 1080.

Testing 

To illustrate how the proxy works we’ll send requests to http://httpbin.org/ip to retrieve our effective IP address.

First let’s look at the set of IP addresses reported by the container.

2021-07-29 04:24:40 [INFO] Testing proxy (port 10000): 195.154.35.52.
2021-07-29 04:24:40 [INFO] Testing proxy (port 10001): 178.20.55.18.
2021-07-29 04:24:40 [INFO] Testing proxy (port 10002): 195.176.3.24.
2021-07-29 04:24:40 [INFO] Testing proxy (port 10003): 185.220.100.252.
2021-07-29 04:24:40 [INFO] Testing proxy (port 10004): 185.220.101.198.

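Now send out a series of requests through the HTTP proxy.

curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "195.154.35.52"
}
curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "178.20.55.18"
}
curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "195.176.3.24"
}
curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "185.220.100.252"
}
curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "185.220.101.198"
}
curl --proxy http://127.0.0.1:8888 http://httpbin.org/ip
{
  "origin": "195.154.35.52"
}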

Notice that the first five requests each appear to originate from a different IP address and that once we’ve cycled through all of the Tor instances, we wrap back to the first one.
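
The same check can be scripted. The following is a minimal sketch using only the Python standard library. It assumes that the container is running with port 8888 mapped onto the host, and the function names are hypothetical. Nothing is sent until `origins_via_proxy()` is called.

```python
import json
import urllib.request

# Assumed address of the Privoxy HTTP proxy on the host.
PROXY_URL = "http://127.0.0.1:8888"


def make_proxy_opener(proxy_url=PROXY_URL):
    """Build an opener which routes plain HTTP requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def origins_via_proxy(count, url="http://httpbin.org/ip", timeout=30.0):
    """Send `count` requests and return the IP address reported for each."""
    opener = make_proxy_opener()
    origins = []
    for _ in range(count):
        with opener.open(url, timeout=timeout) as response:
            origins.append(json.load(response)["origin"])
    return origins
```

Calling `origins_via_proxy(6)` against a running container should return a list in which the addresses repeat after one pass through the Tor instances, although the actual addresses depend on the exit nodes in use at the time.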

Rotating 

The exit nodes are periodically rotated. About 15 minutes later the container reports a different set of IP addresses.

2021-07-29 04:39:13 [INFO] Testing proxy (port 10000): 171.25.193.20.
2021-07-29 04:39:13 [INFO] Testing proxy (port 10001): 199.249.230.87.
2021-07-29 04:39:14 [INFO] Testing proxy (port 10002): 185.220.101.132.
2021-07-29 04:39:14 [INFO] Testing proxy (port 10003): 199.249.230.184.
2021-07-29 04:39:15 [INFO] Testing proxy (port 10004): 23.129.64.161.

Proxy List 

A proxy list is served as a plain text file on port 8800. This can be used to configure rotating proxies in clients.

Statistics 

You can monitor the performance of the proxies via the statistics interface, which each HAProxy instance exposes via a port numbered sequentially from 2090.

Technical Details 

The image is derived from the official Alpine image, onto which Python 3, Tor, HAProxy and Privoxy are installed. The configuration files for each of the services are generated from Jinja templates.
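
To give a feel for how templated configuration works, here is a hedged sketch which renders a HAProxy fragment with one `server` line per Tor instance. The template text, the `render_haproxy_config` helper and the backend names are all hypothetical (the image's actual templates are not reproduced here), and the example needs the `jinja2` package to be installed.

```python
from jinja2 import Template

# Hypothetical template: a TCP listener on the SOCKS port which balances
# round robin across the local Tor instances.
HAPROXY_TEMPLATE = Template(
    "listen socks\n"
    "    bind *:1080\n"
    "    mode tcp\n"
    "    balance roundrobin\n"
    "{% for port in tor_ports %}"
    "    server tor{{ loop.index0 }} 127.0.0.1:{{ port }} check\n"
    "{% endfor %}"
)


def render_haproxy_config(instances=5, base_port=10000):
    """Render the fragment for the given number of Tor instances."""
    tor_ports = [base_port + i for i in range(instances)]
    return HAPROXY_TEMPLATE.render(tor_ports=tor_ports)


config = render_haproxy_config()
print(config)
```

Generating the file this way means that changing the number of Tor instances only changes the input to the template, not the template itself.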