This post shows an approach to using a rotating Tor proxy with Scrapy.
I’m using the scrapy-rotating-proxies downloader middleware package to rotate through a set of proxies, so that my requests originate from a selection of IP addresses. However, I also need those IP addresses to change over time, so I’m using the Tor network.
Setup
I’ve got the following in the settings.py for my Scrapy project:
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
ROTATING_PROXY_LIST_PATH = 'proxy-list.txt'
ROTATING_PROXY_PAGE_RETRY_TIMES = 5
This (1) specifies where the package's middlewares fit into the pipeline for processing requests, (2) points to a file, proxy-list.txt, which contains a list of proxies, and (3) sets how many times a page is retried via a different proxy before it is treated as a failed page. There are other settings for the package, but they are not important right now.
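As an aside, the package also has a ROTATING_PROXY_LIST setting that takes the proxies inline rather than from a file. A minimal sketch, assuming the four local proxies on ports 9990 to 9993 that are described in the rest of this post (I use the file-based setting myself, so treat this as an untested alternative):

```python
# settings.py fragment: an inline alternative to ROTATING_PROXY_LIST_PATH.
# Assumes four local proxies listening on ports 9990 to 9993.
ROTATING_PROXY_LIST = [f"http://127.0.0.1:{port}" for port in range(9990, 9994)]

print(ROTATING_PROXY_LIST)
```

It is probably safest to set only one of the two settings, so that there is no doubt about which list is in use.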
Proxy List
The contents of proxy-list.txt look like this:
# Generated by create-proxies script.
http://127.0.0.1:9990
http://127.0.0.1:9991
http://127.0.0.1:9992
http://127.0.0.1:9993
I’m running four local proxies. How? Well, with Docker, of course!
The scrapy-rotating-proxies package ensures that
- requests are sent out via these proxies and
- the proxies are used in rotation, so that successive requests are spread across different proxies.
The reason for rotating through a list of proxies is to ensure that at any given time there are multiple proxies (each with a different IP address) available for sending requests.
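To make the idea of rotation concrete, here is a toy round-robin sketch. It is not how the package is implemented (the package also tracks dead proxies and backoff, and its selection logic may well differ). The parse_proxy_list helper is hypothetical, and skipping '#' comment lines is a choice made for this sketch.

```python
from itertools import cycle, islice

SAMPLE = """# Generated by create-proxies script.
http://127.0.0.1:9990
http://127.0.0.1:9991
http://127.0.0.1:9992
http://127.0.0.1:9993
"""


def parse_proxy_list(text):
    """Return the proxy URLs in text, skipping blank lines and '#' comments."""
    proxies = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            proxies.append(line)
    return proxies


proxies = parse_proxy_list(SAMPLE)

# Walk the list in a loop: the fifth request wraps around to the first proxy.
first_five = list(islice(cycle(proxies), 5))
for proxy in first_five:
    print(proxy)
```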
Tor Proxies
In order to access a truly diverse set of IP addresses I’m tapping into the Tor network via the pickapp/tor-proxy Docker image.
Using Docker Compose it’s easy to spin up a cluster of Tor proxies. This is my docker-compose.yml:
# Generated by create-proxies script.

version: '3'

services:
  tor-bart:
    container_name: 'tor-bart'
    image: 'pickapp/tor-proxy:latest'
    ports:
      - '9990:8888'
    environment:
      - IP_CHANGE_SECONDS=60
    restart: always
  tor-homer:
    container_name: 'tor-homer'
    image: 'pickapp/tor-proxy:latest'
    ports:
      - '9991:8888'
    environment:
      - IP_CHANGE_SECONDS=60
    restart: always
  tor-marge:
    container_name: 'tor-marge'
    image: 'pickapp/tor-proxy:latest'
    ports:
      - '9992:8888'
    environment:
      - IP_CHANGE_SECONDS=60
    restart: always
  tor-lisa:
    container_name: 'tor-lisa'
    image: 'pickapp/tor-proxy:latest'
    ports:
      - '9993:8888'
    environment:
      - IP_CHANGE_SECONDS=60
    restart: always
There are four services defined, each of which maps port 8888 on the container to a specific host port (a sequence of ports starting at 9990 and corresponding to the ports listed in proxy-list.txt). Once the services are up, the running containers are listed like this:
CONTAINER ID   IMAGE                    PORTS                    NAMES
98feb5a034e6   datawookie/tor-privoxy   0.0.0.0:9990->8888/tcp   tor-bart
26f05b1deb17   datawookie/tor-privoxy   0.0.0.0:9991->8888/tcp   tor-homer
b856ded83585   datawookie/tor-privoxy   0.0.0.0:9992->8888/tcp   tor-marge
c352aea63eed   datawookie/tor-privoxy   0.0.0.0:9993->8888/tcp   tor-lisa
Setting the IP_CHANGE_SECONDS environment variable to 60 causes the Tor exit node used by a proxy to change every minute.
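One way to see the exit node changing is to ask an external service which IP address it sees when a request goes through one of the proxies. This is a rough sketch using only the standard library. The make_proxy_opener and exit_ip helpers are hypothetical names, and https://api.ipify.org is a third-party service picked purely for illustration (any endpoint that echoes the caller's IP address would do).

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def exit_ip(proxy_url, check_url="https://api.ipify.org", timeout=30):
    """Return the public IP address that check_url reports for proxy_url.

    Needs network access and a running proxy, so nothing calls it on import.
    """
    opener = make_proxy_opener(proxy_url)
    with opener.open(check_url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def main():
    """Print the current exit IP for each of the four local proxies."""
    for port in range(9990, 9994):
        proxy_url = f"http://127.0.0.1:{port}"
        print(proxy_url, exit_ip(proxy_url))
```

With the containers running, calling main() twice a bit more than a minute apart should normally print different addresses for each proxy.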
Generating Configuration
To make this setup more flexible I have a script, create-proxies, which generates the contents of proxy-list.txt and docker-compose.yml.
#!/usr/bin/env python3

NAMES = ['bart', 'homer', 'marge', 'lisa']

WARNING = "# Generated by create-proxies script.\n\n"

# Generate docker-compose.yml.
#
with open("docker-compose.yml", "w") as f:
    f.write(WARNING)
    f.write("version: '3'\n\nservices:\n")
    for index, name in enumerate(NAMES):
        f.write(f"  tor-{name}:\n")
        f.write(f"    container_name: 'tor-{name}'\n")
        f.write("    image: 'pickapp/tor-proxy:latest'\n")
        f.write("    ports:\n")
        f.write(f"      - '{9990+index}:8888'\n")
        f.write("    environment:\n")
        f.write("      - IP_CHANGE_SECONDS=60\n")
        f.write("    restart: always\n")

# Generate proxy-list.txt.
#
with open("proxy-list.txt", "w") as f:
    f.write(WARNING)
    for index, name in enumerate(NAMES):
        f.write(f'http://127.0.0.1:{9990+index}\n')
If I want to add or remove proxies then I simply edit the NAMES list, run the script again, restart Docker Compose and voilà!
Results
This is what an extract from the crawler logs looks like:
Proxies(good: 0, dead: 0, unchecked: 4, reanimated: 0, mean backoff: 0s)
Proxy <http://127.0.0.1:9993> is GOOD
Proxy <http://127.0.0.1:9992> is GOOD
Proxies(good: 2, dead: 0, unchecked: 2, reanimated: 0, mean backoff: 0s)
Proxy <http://127.0.0.1:9991> is GOOD
Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)
Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)
Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)
Proxy <http://127.0.0.1:9990> is GOOD
Proxies(good: 4, dead: 0, unchecked: 0, reanimated: 0, mean backoff: 0s)
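If you want to keep an eye on these counts programmatically, the summary lines are regular enough to parse. A small sketch, assuming the line format is exactly as shown in the extract above (parse_proxy_stats is a hypothetical helper, not part of the package):

```python
import re

# Matches the summary lines in the log extract, e.g.
# "Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)"
STATS_PATTERN = re.compile(
    r"Proxies\(good: (?P<good>\d+), dead: (?P<dead>\d+), "
    r"unchecked: (?P<unchecked>\d+), reanimated: (?P<reanimated>\d+), "
    r"mean backoff: (?P<mean_backoff>\d+)s\)"
)


def parse_proxy_stats(line):
    """Return the counts in a summary line as a dict, or None for other lines."""
    match = STATS_PATTERN.search(line)
    if match is None:
        return None
    stats = {}
    for key, value in match.groupdict().items():
        stats[key] = int(value)
    return stats


line = "Proxies(good: 3, dead: 0, unchecked: 1, reanimated: 0, mean backoff: 0s)"
stats = parse_proxy_stats(line)
print(stats["good"], stats["unchecked"])  # prints "3 1"
```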
The addresses for the proxies are fixed (sampled from the list in proxy-list.txt). However, each Tor proxy refreshes its exit node every minute. Here are the logs from a slightly updated version of the Tor proxy Docker image:
🔁 HUP → Tor.
📌 exit IP: 109.70.100.50.
🔁 HUP → Tor.
📌 exit IP: 31.7.61.190.
🔁 HUP → Tor.
📌 exit IP: 178.20.55.18.
🔁 HUP → Tor.
📌 exit IP: 185.220.102.242.
🔁 HUP → Tor.
📌 exit IP: 109.70.100.51.
This is happening for each of the proxies, so requests are effectively being sent from a constantly changing set of IP addresses. A good way to stay under the radar!
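To put a number on how much the exit IP varies, the container logs can be tallied. A quick sketch, assuming the "exit IP: ..." line format shown above (the sample is copied from that log with the emoji dropped, and exit_ips is a hypothetical helper):

```python
import re

# An IPv4 address following "exit IP: "; the trailing full stop is not captured.
EXIT_IP_PATTERN = re.compile(r"exit IP: (\d{1,3}(?:\.\d{1,3}){3})")

LOG = """exit IP: 109.70.100.50.
exit IP: 31.7.61.190.
exit IP: 178.20.55.18.
exit IP: 185.220.102.242.
exit IP: 109.70.100.51.
"""


def exit_ips(log_text):
    """Return every exit IP address found in log_text, in order of appearance."""
    return EXIT_IP_PATTERN.findall(log_text)


ips = exit_ips(LOG)
print(len(ips), len(set(ips)))  # prints "5 5": five changes, five distinct IPs
```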