Scrapy Ban Policies with Rotating Proxies

The scrapy-rotating-proxies package makes it simple to use rotating proxies with Scrapy.

One issue I’ve run into, though, is that pages returning a 404 error are treated as bans: the request is retried and the corresponding proxy is marked as dead. That doesn’t make sense to me, because a 404 generally means the requested page simply isn’t available. It’s not a proxy problem; it’s a URL problem.

To get around this I defined a slightly enhanced ban detection policy.

First, in settings.py, add the following (replacing project with the name of your own project package):

ROTATING_PROXY_BAN_POLICY = 'project.policy.BanPolicy'
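
For context, here is roughly how that setting sits alongside the rest of the rotating proxy configuration. This is only a sketch: the proxy addresses are placeholders, project stands for your own package name, and the middleware paths and priorities are the ones I recall from the package README, so check them against the version you have installed.

```python
# settings.py (sketch; the proxy addresses below are placeholders)
ROTATING_PROXY_LIST = [
    'proxy1.example.com:8000',
    'proxy2.example.com:8031',
]

# Middleware paths and priorities as I recall them from the package README.
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# Dotted path to the custom policy class defined in policy.py.
ROTATING_PROXY_BAN_POLICY = 'project.policy.BanPolicy'
```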

Then in the project folder create a policy.py file with the following content:

from rotating_proxies.policy import BanDetectionPolicy


class BanPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # A 404 means the URL is missing, not that the proxy is banned.
        if response.status == 404:
            return False

        # Defer to the default policy for every other response.
        return super(BanPolicy, self).response_is_ban(request, response)

That’s it. Now when a page returns a 404 error, your crawler will simply move on to the next page without marking the proxy as dead.
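
If you want to check the logic without running a crawl, a small stand-alone sketch like the one below can help. It does not import the real package: the BanDetectionPolicy here is a stand-in that mimics what I understand the default behaviour to be (anything other than 200, 301 or 302 is a ban, as is an empty 200 body), and FakeResponse is a hypothetical helper that only carries the two attributes the policy reads.

```python
class BanDetectionPolicy:
    """Stand-in for rotating_proxies.policy.BanDetectionPolicy.

    Assumption: this mirrors the default rules as I understand them.
    In a real project, import the class from the package instead.
    """
    NOT_BAN_STATUSES = {200, 301, 302}

    def response_is_ban(self, request, response):
        if response.status not in self.NOT_BAN_STATUSES:
            return True
        # An empty 200 body is also treated as a ban.
        return response.status == 200 and not response.body


class BanPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # A 404 is a URL problem, not a proxy problem.
        if response.status == 404:
            return False
        return super().response_is_ban(request, response)


class FakeResponse:
    """Hypothetical minimal response: just a status code and a body."""
    def __init__(self, status, body=b"ok"):
        self.status = status
        self.body = body


policy = BanPolicy()
for status in (200, 404, 503):
    # The request argument is unused by these rules, so None is enough here.
    print(status, policy.response_is_ban(None, FakeResponse(status)))
```

With the stand-in rules above, 200 and 404 are reported as not banned while 503 still counts as a ban, which is the behaviour the custom policy is meant to produce.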