Medusa: A Multi-Headed Tor Proxy

At Fathom Data we have a few projects which require us to send HTTP requests from an evolving selection of IP addresses. This post details the Medusa proxy Docker image, which uses Tor (The Onion Router) as a proxy.

What is a Proxy Server? 

A proxy server acts as an intermediary between a client and a server. When a request goes through a proxy server there is no direct connection between the client and the server. The client connects to the proxy and the proxy then connects to the server. Requests and responses pass through the proxy.

Thanks to @axiematic for the figure.

HTTP & SOCKS Proxies 

HTTP (HyperText Transfer Protocol) is the dominant protocol for information exchange on the internet. HTTP is stateless: the client (often a browser) sends a request to a server and the server replies with a response, and each exchange is independent of the ones before it. In the simplest case there is no persistent connection between the client and the server once the interaction is over, and any further interactions require new connections (although HTTP/1.1 can keep a connection open for reuse).

An HTTP proxy uses the HTTP protocol for all interactions with the client and server. As a result, an HTTP proxy is only able to handle HTTP and HTTPS requests. An HTTP proxy is also able to filter or modify the content of the requests and responses passing through it.

SOCKS is another internet protocol. Whereas HTTP is an application layer protocol (at the top of the OSI Model), SOCKS is a lower level protocol in the session layer.

A SOCKS proxy uses the SOCKS protocol. The name is commonly expanded as “SOCKet Secure”, although the protocol does not itself encrypt traffic. A SOCKS proxy simply relays traffic between the client and the server without interpreting it, so it does not modify or filter the contents of requests or responses. Since it operates at a lower level in the networking hierarchy, a SOCKS proxy is also more flexible than an HTTP proxy (it is not limited to HTTP traffic) and typically adds less overhead.
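
To make the contrast concrete, here is a minimal sketch of what a client sends to each kind of proxy. The function names are hypothetical, and the byte layout follows my reading of the SOCKS5 specification (RFC 1928), so treat it as an illustration rather than a complete client. The HTTP proxy receives a readable HTTP request, while the SOCKS proxy only receives a destination host and port.

```python
import struct


def http_proxy_request(url, host):
    """Request text a client sends to an HTTP proxy (absolute-form target)."""
    return f"GET {url} HTTP/1.1\r\nHost: {host}\r\n\r\n".encode("ascii")


def socks5_greeting():
    """SOCKS5 greeting: version 5, one auth method offered, 0x00 = no auth."""
    return bytes([0x05, 0x01, 0x00])


def socks5_connect_request(host, port):
    """SOCKS5 CONNECT request naming the destination by domain name."""
    name = host.encode("ascii")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain name), then name length
    header = bytes([0x05, 0x01, 0x00, 0x03, len(name)])
    # The port is sent as two bytes in network (big-endian) byte order
    return header + name + struct.pack("!H", port)


print(http_proxy_request("http://httpbin.org/ip", "httpbin.org"))
print(socks5_connect_request("httpbin.org", 80))
```

After the SOCKS handshake the proxy just shuttles bytes in both directions, which is why it can carry protocols other than HTTP.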

Tor Proxy Docker Image 

We constructed a Docker image which uses the Tor network to expose both SOCKS and HTTP proxies. The image uses the following components:

  • Tor
  • HAProxy and
  • Privoxy.

The relationship between these components is detailed in the figure below.

Components in the Medusa Proxy (single headed configuration).

Tor 

Tor provides an anonymous SOCKS proxy. The image will run multiple Tor instances, each of which will (in general) have a different exit node. This means that requests being routed through each instance will appear to come from a distinct IP address.

HAProxy 

HAProxy is a high availability proxy server and load balancer (spreads requests across multiple services). HAProxy is used to distribute SOCKS requests across the Tor instances using a round robin scheduling strategy.
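
Round robin scheduling simply hands each new request to the next backend in a fixed order, wrapping around at the end. The sketch below illustrates the idea in Python. It is not how HAProxy is implemented, and the port numbers are placeholders standing in for five Tor instances.

```python
from itertools import cycle


def round_robin(backends):
    """Return an endless iterator that visits the backends in order."""
    return cycle(backends)


# Placeholder ports for five Tor instances (assumed for illustration)
tor_ports = [10000, 10001, 10002, 10003, 10004]

scheduler = round_robin(tor_ports)
# Six requests: the sixth wraps back to the first backend
assigned = [next(scheduler) for _ in range(6)]
print(assigned)
```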

Privoxy 

To provide for services that prefer to communicate via HTTP or cannot communicate via SOCKS, Privoxy is used to accept HTTP requests and forward them as SOCKS requests to HAProxy.

Running the Docker Image 

Let’s spin up a container and take a look.

docker run \
    -p 8800:8800 \
    -p 8888:8888 \
    -p 1080:1080 \
    -p 2090:2090 \
    datawookie/medusa-proxy

We’ve mapped a lot of ports.

  • 8800 — list of proxy URLs (as text/plain)
  • 8888 — Privoxy port (HTTP protocol)
  • 1080 — HAProxy port (SOCKS protocol) and
  • 2090 — HAProxy statistics port.

Not all of them are required, but they all fulfill a distinct purpose. Below are some alternative ways to invoke the image.

# HTTP proxy on port 8888
docker run -p 8888:8888 datawookie/medusa-proxy
# SOCKS proxy on port 1080
docker run -p 1080:1080 datawookie/medusa-proxy
# Both HTTP and SOCKS proxies
docker run -p 8888:8888 -p 1080:1080 datawookie/medusa-proxy

Once we’ve got a running container we can set up a client to use the proxy. We’ll start with a browser and then look at curl on the command line.

Browser 

Set up your browser to use the HTTP proxy, which will have an IP address of 127.0.0.1 and port 8888. You could equally choose to use the SOCKS proxy, which will have the same IP address and port 1080.

Once you’ve configured the proxy settings in your browser, head over to What Is My IP Address to check on your effective IP address. Refresh the page to confirm that the IP address changes.

Testing 

Further testing is easier to do (and record) on the command line. To illustrate how the proxy works we’ll send requests to http://httpbin.org/ip to retrieve our effective IP address.

First let’s look at the set of IP addresses reported by the container. Below is an extract from the Docker logs. Each exit node is tested periodically and its effective IP address is logged.

2021-09-29 04:24:40 [INFO] Testing proxy (port 10000): 195.154.35.52.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10001): 178.20.55.18.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10002): 195.176.3.24.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10003): 185.220.100.252.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10004): 185.220.101.198.

Don’t worry about the port numbers (those are only relevant within the container). Note that there are five proxies (one for each Tor instance) and that each has a different IP address.

Set the http_proxy environment variable so that curl uses the proxy.

export http_proxy=http://127.0.0.1:8888

Now send out a series of requests.

curl http://httpbin.org/ip
{
  "origin": "195.154.35.52"
}
curl http://httpbin.org/ip
{
  "origin": "178.20.55.18"
}
curl http://httpbin.org/ip
{
  "origin": "195.176.3.24"
}
curl http://httpbin.org/ip
{
  "origin": "185.220.100.252"
}
curl http://httpbin.org/ip
{
  "origin": "185.220.101.198"
}
curl http://httpbin.org/ip
{
  "origin": "195.154.35.52"
}

Notice that each request appears to originate from a distinct IP address and that once we’ve cycled through all of the Tor instances, we wrap back to the first one.
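
The same check can be done from Python using only the standard library. This is a minimal sketch, assuming the container is running with the HTTP proxy mapped to port 8888 as above. The helper names make_opener and effective_ip are my own and are not part of Medusa.

```python
import json
import urllib.request

# Privoxy port mapped in the docker run command above
PROXY_URL = "http://127.0.0.1:8888"


def make_opener(proxy_url=PROXY_URL):
    """Build an opener that sends plain HTTP requests via the given proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


def effective_ip(opener, url="http://httpbin.org/ip", timeout=30):
    """Return the IP address that the target server sees for this request."""
    with opener.open(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))["origin"]


opener = make_opener()
```

Calling effective_ip(opener) repeatedly should then cycle through the exit node addresses in the same way as the curl requests.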

Rotating 

The exit nodes are periodically rotated. Some time later we see that we’re using a different set of IP addresses.

2021-09-29 04:39:13 [INFO] Testing proxy (port 10000): 171.25.193.20.
2021-09-29 04:39:13 [INFO] Testing proxy (port 10001): 199.249.230.87.
2021-09-29 04:39:14 [INFO] Testing proxy (port 10002): 185.220.101.132.
2021-09-29 04:39:14 [INFO] Testing proxy (port 10003): 199.249.230.184.
2021-09-29 04:39:15 [INFO] Testing proxy (port 10004): 23.129.64.161.
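
If you want to confirm the rotation programmatically rather than by eye, you can pull the addresses out of the logs. The sketch below assumes the log line format shown in this post. The function name exit_addresses is hypothetical.

```python
import re

# Matches e.g. "Testing proxy (port 10000): 195.154.35.52."
LINE = re.compile(r"Testing proxy \(port (\d+)\): ([\d.]+)\.$")


def exit_addresses(log_text):
    """Map each internal Tor port to the exit IP address found in the logs."""
    addresses = {}
    for line in log_text.splitlines():
        match = LINE.search(line.strip())
        if match:
            addresses[int(match.group(1))] = match.group(2)
    return addresses


before = exit_addresses("""
2021-09-29 04:24:40 [INFO] Testing proxy (port 10000): 195.154.35.52.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10001): 178.20.55.18.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10002): 195.176.3.24.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10003): 185.220.100.252.
2021-09-29 04:24:40 [INFO] Testing proxy (port 10004): 185.220.101.198.
""")
after = exit_addresses("""
2021-09-29 04:39:13 [INFO] Testing proxy (port 10000): 171.25.193.20.
2021-09-29 04:39:13 [INFO] Testing proxy (port 10001): 199.249.230.87.
2021-09-29 04:39:14 [INFO] Testing proxy (port 10002): 185.220.101.132.
2021-09-29 04:39:14 [INFO] Testing proxy (port 10003): 199.249.230.184.
2021-09-29 04:39:15 [INFO] Testing proxy (port 10004): 23.129.64.161.
""")

# Ports whose exit address differs between the two log extracts
rotated = {port for port in before if before[port] != after.get(port)}
print(sorted(rotated))
```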

A Weakness 

So this is great, but there’s one major weakness: if a request via any one of the Tor instances fails then the proxy would be marked as broken (regardless of whether the other Tor instances were fine or not). To get around this I have been running multiple containers, each assigned to a different port. With this setup, even if a few of the containers are marked as broken, there are still others which are considered healthy and able to accept requests. Although easy enough to automate, the logistics associated with this setup are a little onerous.

Wouldn’t it be convenient if there were just a single container which exposed multiple proxies, each of which was hooked up to a distinct set of Tor instances?

Why would a Tor instance fail? I can think of at least two plausible reasons:

  1. There is actually a problem with the Tor exit node and it’s not currently processing requests.
  2. The exit node might be blocked by the target site. This might either be based on geolocation of the exit node or its IP address (there’s an actively maintained list of Tor exit nodes).

Beast with Many Heads 

So, rather than exposing just a single unit per container, the Medusa proxy can cater for multiple proxy units or heads. What this means is that there can be multiple copies of the components illustrated in the diagram above. All heads are served from the same network location but on different ports.

There are a few environment variables which can be used to tweak the configuration:

  • HEADS — Number of heads (default: 2)
  • TORS — Number of Tor instances (default: 5)
  • HAPROXY_LOGIN — Username for HAProxy (default: “admin”)
  • HAPROXY_PASSWORD — Password for HAProxy (default: “admin”)

Let’s give this a try. We’ll launch Medusa with 4 heads (each linking to 3 Tor instances) and only map the ports for the HTTP proxies.

docker run \
    -e HEADS=4 \
    -e TORS=3 \
    -p 8800:8800 \
    -p 8888:8888 -p 8889:8889 -p 8890:8890 -p 8891:8891 \
    datawookie/medusa-proxy

So now we have 4 heads at ports 8888, 8889, 8890 and 8891. If we look at the Docker logs then we see that there are 3 Tor endpoints for each head.

2021-09-30 08:05:34,606 [   INFO] Testing proxies.
2021-09-30 08:05:34,606 [   INFO] * Privoxy 0
2021-09-30 08:05:35,295 [   INFO] Testing proxy (port 10000): 185.220.101.2.
2021-09-30 08:05:35,753 [   INFO] Testing proxy (port 10001): 198.98.62.74.
2021-09-30 08:05:36,041 [   INFO] Testing proxy (port 10002): 185.220.100.245.
2021-09-30 08:05:36,041 [   INFO] * Privoxy 1
2021-09-30 08:05:36,652 [   INFO] Testing proxy (port 10003): 37.123.163.58.
2021-09-30 08:05:37,304 [   INFO] Testing proxy (port 10004): 37.187.196.70.
2021-09-30 08:05:37,634 [   INFO] Testing proxy (port 10005): 77.68.20.217.
2021-09-30 08:05:37,634 [   INFO] * Privoxy 2
2021-09-30 08:05:37,909 [   INFO] Testing proxy (port 10006): 89.163.143.8.
2021-09-30 08:05:38,954 [   INFO] Testing proxy (port 10007): 185.220.101.10.
2021-09-30 08:05:39,721 [   INFO] Testing proxy (port 10008): 195.206.105.217.
2021-09-30 08:05:39,721 [   INFO] * Privoxy 3
2021-09-30 08:05:40,050 [   INFO] Testing proxy (port 10009): 185.220.100.243.
2021-09-30 08:05:41,010 [   INFO] Testing proxy (port 10010): 185.220.101.43.
2021-09-30 08:05:41,415 [   INFO] Testing proxy (port 10011): 185.185.170.27.
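
The port numbering follows a simple pattern, which the sketch below reproduces. This is inferred from the logs and the port list in this post rather than taken from the image's source, and the function name head_layout is hypothetical. The Tor ports are only relevant within the container.

```python
def head_layout(heads=2, tors=5, http_base=8888, stats_base=2090, tor_base=10000):
    """Describe the ports used by each head, numbered sequentially."""
    layout = []
    for head in range(heads):
        # Each head owns a contiguous block of internal Tor ports
        first_tor = tor_base + head * tors
        layout.append({
            "head": head,
            "http_port": http_base + head,    # Privoxy (HTTP proxy)
            "stats_port": stats_base + head,  # HAProxy statistics
            "tor_ports": list(range(first_tor, first_tor + tors)),
        })
    return layout


layout = head_layout(heads=4, tors=3)
for entry in layout:
    print(entry)
```

With HEADS=4 and TORS=3 this gives HTTP proxies on ports 8888 to 8891 and internal Tor ports 10000 to 10011, matching the logs above.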

Proxy List 

A proxy list is served as a plain text file on port 8800. This can be used to configure rotating proxies in clients.
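
Here is a minimal sketch of how a client might read that list. It assumes the file contains one proxy URL per line, which you should verify against your own container, and the function names are hypothetical. The fetch bypasses any proxy configured in the environment, since the list itself is served locally.

```python
import urllib.request


def parse_proxy_list(text):
    """Return the non-empty lines of the proxy list (assumed one URL per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def fetch_proxy_list(url="http://127.0.0.1:8800", timeout=10):
    """Download the proxy list directly, ignoring http_proxy and friends."""
    # An empty ProxyHandler disables proxies picked up from the environment
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=timeout) as response:
        return parse_proxy_list(response.read().decode("utf-8"))


# Example with a made-up response body for two heads
sample = "http://127.0.0.1:8888\nhttp://127.0.0.1:8889\n"
print(parse_proxy_list(sample))
```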

Statistics 

You can monitor the performance of the proxies via the statistics interface, which each HAProxy instance exposes via a port numbered sequentially from 2090.

Using Medusa with Scrapy 

We developed Medusa Proxy with the primary intention of using it with the scrapy-rotating-proxies Scrapy package.
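
A minimal sketch of the relevant Scrapy settings is given below, assuming 4 heads with HTTP proxies on ports 8888 to 8891 on the local machine. The setting names and middleware priorities are the ones I recall from the scrapy-rotating-proxies README, so check them against the version you install.

```python
# settings.py (fragment)

# One entry per Medusa head (assumes HEADS=4 and ports mapped from 8888)
ROTATING_PROXY_LIST = [f"http://127.0.0.1:{8888 + head}" for head in range(4)]

DOWNLOADER_MIDDLEWARES = {
    # Picks a proxy for each request and tracks which proxies are alive
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    # Decides whether a response indicates that the proxy has been banned
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```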

Here are some logs extracted from a problematic crawler using 4 heads. If there had just been a single proxy then this would not have worked well at all.

2021-10-01 03:05:29 INFO:  Proxies(good: 0, dead: 0, unchecked: 4, reanimated: 0, mean backoff time: 0s)
2021-10-01 03:05:29 DEBUG: Proxy <http://127.0.0.1:8890> is GOOD
2021-10-01 03:05:36 DEBUG: Proxy <http://127.0.0.1:8889> is DEAD
2021-10-01 03:05:51 DEBUG: GOOD proxy became DEAD: <http://127.0.0.1:8890>
2021-10-01 03:05:59 INFO:  Proxies(good: 0, dead: 2, unchecked: 2, reanimated: 0, mean backoff time: 47s)
2021-10-01 03:06:00 DEBUG: Proxy <http://127.0.0.1:8889> is DEAD
2021-10-01 03:06:13 DEBUG: Proxy <http://127.0.0.1:8890> is DEAD
2021-10-01 03:06:24 DEBUG: Proxy <http://127.0.0.1:8888> is DEAD
2021-10-01 03:06:29 INFO:  Proxies(good: 0, dead: 3, unchecked: 1, reanimated: 0, mean backoff time: 29s)
2021-10-01 03:06:29 DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2021-10-01 03:06:34 DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2021-10-01 03:06:36 DEBUG: Proxy <http://127.0.0.1:8890> is DEAD
2021-10-01 03:06:52 DEBUG: Proxy <http://127.0.0.1:8890> is DEAD
2021-10-01 03:06:59 INFO:  Proxies(good: 0, dead: 1, unchecked: 1, reanimated: 2, mean backoff time: 175s)
2021-10-01 03:07:06 DEBUG: Proxy <http://127.0.0.1:8891> is DEAD
2021-10-01 03:07:22 DEBUG: Proxy <http://127.0.0.1:8890> is GOOD
2021-10-01 03:07:29 INFO:  Proxies(good: 1, dead: 1, unchecked: 0, reanimated: 2, mean backoff time: 46s)
2021-10-01 03:07:37 DEBUG: Proxy <http://127.0.0.1:8888> is DEAD
2021-10-01 03:07:50 DEBUG: Proxy <http://127.0.0.1:8891> is DEAD
2021-10-01 03:07:59 INFO:  Proxies(good: 1, dead: 2, unchecked: 0, reanimated: 1, mean backoff time: 42s)
2021-10-01 03:08:04 DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2021-10-01 03:08:04 DEBUG: Proxy <http://127.0.0.1:8888> is DEAD
2021-10-01 03:08:18 DEBUG: Proxy <http://127.0.0.1:8891> is DEAD
2021-10-01 03:08:29 INFO:  Proxies(good: 1, dead: 2, unchecked: 0, reanimated: 1, mean backoff time: 63s)
2021-10-01 03:08:33 DEBUG: Proxy <http://127.0.0.1:8891> is DEAD
2021-10-01 03:08:43 DEBUG: Proxy <http://127.0.0.1:8888> is DEAD
2021-10-01 03:08:55 DEBUG: Proxy <http://127.0.0.1:8891> is DEAD
2021-10-01 03:08:59 INFO:  Proxies(good: 1, dead: 2, unchecked: 0, reanimated: 1, mean backoff time: 171s)
2021-10-01 03:08:59 DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2021-10-01 03:09:09 DEBUG: Proxy <http://127.0.0.1:8888> is DEAD
2021-10-01 03:09:23 DEBUG: Proxy <http://127.0.0.1:8889> is GOOD
2021-10-01 03:09:29 INFO:  Proxies(good: 2, dead: 1, unchecked: 0, reanimated: 1, mean backoff time: 62s)
2021-10-01 03:09:36 DEBUG: GOOD proxy became DEAD: <http://127.0.0.1:8890>
2021-10-01 03:09:52 DEBUG: Proxy <http://127.0.0.1:8890> is DEAD
2021-10-01 03:09:59 INFO:  Proxies(good: 1, dead: 2, unchecked: 0, reanimated: 1, mean backoff time: 68s)
2021-10-01 03:10:12 DEBUG: GOOD proxy became DEAD: <http://127.0.0.1:8889>
2021-10-01 03:10:14 DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2021-10-01 03:10:24 DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2021-10-01 03:10:29 INFO:  Proxies(good: 0, dead: 1, unchecked: 0, reanimated: 3, mean backoff time: 73s)
2021-10-01 03:10:59 INFO:  Proxies(good: 0, dead: 1, unchecked: 0, reanimated: 3, mean backoff time: 73s)

The logs reflect the current state of each of the proxies, which can be either:

  • good — working fine
  • dead — broken
  • unchecked — has not yet been used; or
  • reanimated — previously broken but now ready to be tried again. A reanimated proxy might become either good or dead depending on how it functions when next used.
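
If you want to track these states over time, the summary lines are easy to parse. The sketch below assumes the summary format shown in the logs above. The function name parse_summary is hypothetical.

```python
import re

SUMMARY = re.compile(
    r"Proxies\(good: (\d+), dead: (\d+), unchecked: (\d+), "
    r"reanimated: (\d+), mean backoff time: (\d+)s\)"
)


def parse_summary(line):
    """Extract the proxy counts from a summary log line, or None if absent."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    good, dead, unchecked, reanimated, backoff = map(int, match.groups())
    return {
        "good": good,
        "dead": dead,
        "unchecked": unchecked,
        "reanimated": reanimated,
        "mean_backoff_seconds": backoff,
    }


line = (
    "2021-10-01 03:09:29 INFO:  Proxies(good: 2, dead: 1, unchecked: 0, "
    "reanimated: 1, mean backoff time: 62s)"
)
summary = parse_summary(line)
print(summary)
```

The four state counts always add up to the number of heads, which is 4 in the logs above.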

Technical Details 

The image is derived from the official Alpine image, onto which Python 3, Tor, HAProxy and Privoxy are installed. The configuration files for each of the services are created using Jinja templates.