Cloudflare is a service that aims to improve the performance and security of websites. It operates as a content delivery network (CDN) to ensure faster load times and, consequently, a better user experience. It also protects against online threats by filtering “malicious” traffic.
Web scraping requests are often deemed to be malicious (certainly by Cloudflare!) and thus blocked. There are various approaches to circumventing this, most of which involve running a live browser instance. For some applications, though, this is a big hammer for a small nail. The cloudscraper package provides a lightweight option for dealing with Cloudflare and has an API similar to that of the requests package.
Sites using Cloudflare
Take a look at the list of sites using Cloudflare. We’ll pick the first item on the list, OpenAI, as a test target.
Setup
Install the cloudscraper and requests Python packages.
beautifulsoup4==4.12.3
cloudscraper==1.2.71
requests==2.32.3
Also throw in beautifulsoup4 so that we can parse a response… when we get one!
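Before going further it can be worth confirming that the installed versions match the pins above. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `report` are made up for illustration and are not part of any of the three packages.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional

# Versions pinned in the setup section above.
PINNED = {
    "beautifulsoup4": "4.12.3",
    "cloudscraper": "1.2.71",
    "requests": "2.32.3",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report(pins: Dict[str, str]) -> List[str]:
    """Build one human-readable line per pinned package."""
    lines = []
    for package, wanted in pins.items():
        found = installed_version(package)
        status = "ok" if found == wanted else f"expected {wanted}"
        lines.append(f"{package}: {found or 'not installed'} ({status})")
    return lines


for line in report(PINNED):
    print(line)
```

A mismatch is not necessarily a problem, but it is a useful first thing to check if the results further down differ from what you see.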
Using Requests
Let’s try retrieving the content of the OpenAI homepage using a GET request via requests.
import requests
response = requests.get("https://openai.com/")
print(response.status_code)
Not surprisingly, Cloudflare intervenes and we get a 403 “Forbidden” response.
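A 403 on its own doesn't prove that Cloudflare was responsible. Responses served through Cloudflare normally carry a `Server: cloudflare` header and a `CF-RAY` header, so a rough check is possible. The sketch below is a heuristic, not a definitive test: the function name `blocked_by_cloudflare` is hypothetical, and treating both 403 and 503 as "blocked" is an assumption.

```python
from typing import Mapping


def blocked_by_cloudflare(status_code: int, headers: Mapping[str, str]) -> bool:
    """Heuristic: an error status from a server that identifies as Cloudflare."""
    # Header names are case-insensitive, so normalise them before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    from_cloudflare = (
        lowered.get("server", "").lower() == "cloudflare" or "cf-ray" in lowered
    )
    # Assumption: 403 and 503 are the statuses used for blocks and challenges.
    return status_code in (403, 503) and from_cloudflare
```

With the response from above you would call it as `blocked_by_cloudflare(response.status_code, response.headers)`.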
Using Cloudscraper
Maybe we’ll have more success using the cloudscraper package?
import cloudscraper
from bs4 import BeautifulSoup
scraper = cloudscraper.CloudScraper()
response = scraper.get("http://openai.com")
print(response.status_code)
# Parse the HTML and print the first <h2>, if the page has one.
soup = BeautifulSoup(response.text, "html.parser")
banner = soup.select_one("h2")
print(banner.text if banner else "No <h2> found")
And the result?
200
ChatGPT on your desktop
Success!
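Since cloudscraper mirrors the requests API, one option is to try plain requests first and only fall back to cloudscraper when the first attempt is blocked. The sketch below illustrates that idea. The helper `first_ok` is hypothetical, and it assumes that a 200 status is a good enough signal of success.

```python
from typing import Any, Callable, Iterable, Optional, Tuple


def first_ok(
    url: str, clients: Iterable[Tuple[str, Callable[[str], Any]]]
) -> Optional[Tuple[str, Any]]:
    """Try each (name, get) pair in order; return the first 200 response."""
    for name, get in clients:
        response = get(url)
        if response.status_code == 200:
            return name, response
    # Every client was blocked or failed.
    return None


def main() -> None:
    # Imported here so that first_ok stays usable without these packages.
    import cloudscraper
    import requests

    clients = [
        ("requests", requests.get),
        ("cloudscraper", cloudscraper.CloudScraper().get),
    ]
    result = first_ok("https://openai.com/", clients)
    print(result[0] if result else "every client was blocked")
```

Calling `main()` runs the two clients against the live site, so the outcome depends on how Cloudflare treats the request at that moment.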
There are lots of options for tweaking the Cloudscraper configuration. For example, you can specify the browser type (Chrome or Firefox), the platform (Linux, Windows, Darwin, Android or iOS) and whether it’s a desktop or mobile browser. You can also choose from a selection of JavaScript engines.
import cloudscraper
URL = "http://openai.com"
scraper = cloudscraper.CloudScraper(
    # Browser specifications.
    browser={
        "browser": "firefox",
        "platform": "linux",
        "desktop": True,
        "mobile": False,
    },
    # JavaScript engine.
    interpreter="nodejs",
    # debug=True
)
response = scraper.get(URL)
print(response.status_code)
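If you build these browser settings in more than one place, a small helper can catch typos early. This is only a sketch: `browser_profile` is a hypothetical function, the allowed values are taken from the options listed above, and treating desktop and mobile as mutually exclusive is a simplifying assumption. Not every browser and platform combination is guaranteed to have a matching user agent in cloudscraper.

```python
from typing import Dict, Union

# Options named in the text above, in the lowercase form used by the example.
BROWSERS = {"chrome", "firefox"}
PLATFORMS = {"linux", "windows", "darwin", "android", "ios"}


def browser_profile(
    browser: str, platform: str, mobile: bool = False
) -> Dict[str, Union[str, bool]]:
    """Build a dict suitable for the browser= argument shown above."""
    browser, platform = browser.lower(), platform.lower()
    if browser not in BROWSERS:
        raise ValueError(f"unsupported browser: {browser!r}")
    if platform not in PLATFORMS:
        raise ValueError(f"unsupported platform: {platform!r}")
    # Assumption: a profile is either desktop or mobile, never both.
    return {
        "browser": browser,
        "platform": platform,
        "desktop": not mobile,
        "mobile": mobile,
    }
```

The result can then be passed straight through, for example `cloudscraper.CloudScraper(browser=browser_profile("firefox", "linux"))`.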
Various third-party CAPTCHA solvers, like 2captcha, anticaptcha and CapSolver, are also supported.
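As far as I can tell from the cloudscraper README, a solver is configured by passing a `captcha` dictionary with a `provider` name and an `api_key`. The sketch below builds such a dictionary with the key read from the environment. The helper `captcha_settings` and the variable name `CAPTCHA_API_KEY` are hypothetical, and the provider strings are assumptions that should be checked against the cloudscraper documentation, since some providers may need extra fields.

```python
import os
from typing import Dict, Mapping

# Assumption: provider names as spelled in the cloudscraper README.
SOLVERS = {"2captcha", "anticaptcha", "capsolver"}


def captcha_settings(
    provider: str,
    env: Mapping[str, str] = os.environ,
    key_var: str = "CAPTCHA_API_KEY",  # hypothetical variable name
) -> Dict[str, str]:
    """Build a captcha= dict, reading the API key from the environment."""
    if provider not in SOLVERS:
        raise ValueError(f"unknown CAPTCHA provider: {provider!r}")
    api_key = env.get(key_var)
    if not api_key:
        raise RuntimeError(f"set {key_var} before using {provider}")
    return {"provider": provider, "api_key": api_key}
```

You would then create the scraper with something like `cloudscraper.CloudScraper(captcha=captcha_settings("2captcha"))`. Keeping the key in the environment avoids committing it alongside the scraping code.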