The scrapy-rotating-proxies
package makes it simple to use rotating proxies with Scrapy.
One issue that I’ve run into, though, is that pages which return a 404 error are retried, and the corresponding proxy is marked as dead. This does not make sense to me: a 404 generally means that the requested page is just not available. It’s not a proxy problem; it’s a URL problem.
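To see why this happens, it helps to look at the package’s default policy. Paraphrased from the package source (details may vary between versions), it treats any status outside a small whitelist as a ban:

class BanDetectionPolicy:
    NOT_BAN_STATUSES = {200, 301, 302}

    def response_is_ban(self, request, response):
        # Any status outside the whitelist (including 404) counts as a ban.
        if response.status not in self.NOT_BAN_STATUSES:
            return True
        # An empty 200 response is also treated as a ban.
        if response.status == 200 and not len(response.body):
            return True
        return False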
To get around this, I defined a slightly enhanced ban detection policy.
First, in settings.py, add the following:
ROTATING_PROXY_BAN_POLICY = 'project.policy.BanPolicy'
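This assumes that the rest of the rotating-proxies setup already lives in settings.py alongside the line above. If not, a minimal sketch looks like this; the proxy URLs are placeholders, and the middleware priorities are the ones suggested in the package’s README:

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# Placeholder proxies; replace these with your own.
ROTATING_PROXY_LIST = [
    'http://proxy1.example.com:8000',
    'http://proxy2.example.com:8031',
]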
Then, in the project folder, create a policy.py file with the following content:
from rotating_proxies.policy import BanDetectionPolicy


class BanPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # A 404 means the requested page does not exist; it says nothing
        # about the health of the proxy, so never treat it as a ban.
        if response.status == 404:
            return False
        # Defer to the default ban detection rules for everything else.
        return super().response_is_ban(request, response)
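To sanity-check the policy without running a full crawl, you can hand it a fake 404 response built from Scrapy’s standard Request and HtmlResponse objects (the project.policy import assumes the layout above):

from scrapy.http import Request, HtmlResponse

from project.policy import BanPolicy

policy = BanPolicy()
request = Request('https://example.com/does-not-exist')
response = HtmlResponse(
    url=request.url,
    status=404,
    body=b'not found',
    request=request,
)

# A 404 is no longer reported as a ban, so the proxy stays healthy.
assert policy.response_is_ban(request, response) is False

Any other status simply falls through to the default rules, so genuine bans are still detected.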
That’s it. Now when a page returns a 404 error, your crawler will simply move on to the next page without affecting the health of any of your proxies.