The scrapy-rotating-proxies package makes it simple to use rotating proxies with Scrapy.
One issue I’ve run into, though, is that pages returning a 404 error are retried, and the corresponding proxy is marked as dead. This doesn’t make sense to me: a 404 generally means the requested page simply isn’t available. It’s not a proxy problem; it’s a URL problem.
So to get around this I defined a slightly enhanced ban detection policy.
In settings.py, add the following:

```python
ROTATING_PROXY_BAN_POLICY = 'project.policy.BanPolicy'
```

Then in the project folder create a policy.py file with the following content:
```python
from rotating_proxies.policy import BanDetectionPolicy


class BanPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # A 404 means the page is missing, not that the proxy is banned.
        if response.status == 404:
            return False
        return super(BanPolicy, self).response_is_ban(request, response)
That’s it. Now when a page returns a 404 error your crawler will simply move on to another page without modifying the health of any of your proxies.
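If you want to sanity-check the policy’s logic without running a live crawl, here is a minimal, self-contained sketch. The `BanDetectionPolicy` stand-in and `FakeResponse` below are illustrative assumptions, not the library’s actual internals — the real base class lives in `rotating_proxies.policy`:

```python
from collections import namedtuple


# Stand-in for rotating_proxies.policy.BanDetectionPolicy (assumption:
# the real class treats most non-2xx statuses as bans by default).
class BanDetectionPolicy:
    NOT_BAN_STATUSES = {200, 301, 302}

    def response_is_ban(self, request, response):
        return response.status not in self.NOT_BAN_STATUSES


class BanPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # A 404 means the page is missing, not that the proxy is banned.
        if response.status == 404:
            return False
        return super(BanPolicy, self).response_is_ban(request, response)


# Hypothetical minimal response object for testing, with only a status field.
FakeResponse = namedtuple("FakeResponse", "status")

policy = BanPolicy()
print(policy.response_is_ban(None, FakeResponse(404)))  # False: move on, proxy untouched
print(policy.response_is_ban(None, FakeResponse(403)))  # True: still treated as a ban
```

Running this confirms that only the 404 case is exempted; every other status is still judged by the base policy.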