There’s one major problem with ChromeDriver: anti-bot services are able to detect that a browser session is being automated (as opposed to being used by a regular meat sack) and will often impose restrictions or deny connections altogether. The Undetected ChromeDriver (undetected-chromedriver) Python package is a patched version of ChromeDriver which avoids triggering a selection of anti-bot services, allowing it to glide under the anti-bot radar.
What is ChromeDriver?
ChromeDriver is used for testing websites and apps, as well as web scraping. It is often used via Selenium, which provides a consistent, high-level interface for controlling a browser. It’s useful to understand the relationship between client programming languages, Selenium, ChromeDriver and the controlled browser.
Browsers
It’s useful to be able to choose from a selection of browsers. If you’re testing an app or website then you’ll want to be confident that it works on a variety of browsers. If you’re web scraping then your choice of browser might be based on subtle changes in the way that a site is rendered on different browsers, differences in performance and memory footprint, or just personal preference.
WebDriver
The WebDriver specification defines a protocol for remotely inspecting and controlling user agents (which in this context is just a general term for “browsers”). It’s a general specification, which means that it is language and browser agnostic. ChromeDriver and GeckoDriver are implementations of WebDriver for browsers built on the Chromium and Mozilla codebases respectively. They provide the mechanism for controlling a specific browser.
Selenium
The WebDriver specification provides a low-level protocol for communicating with a browser. Using this protocol directly would be hard work. Selenium provides a high-level interface to WebDriver, which makes writing client code easier and more efficient.
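To get a feel for what “low level” means here, the sketch below builds (but does not send) three raw WebDriver requests: create a session, navigate to a page and take a screenshot. The endpoint paths follow the W3C WebDriver specification and are relative to the server’s base URL. The helper function names and the session ID are made up for illustration.

```python
import json


def new_session_request(browser_name="chrome"):
    """Build the 'New Session' request: POST /session."""
    body = {"capabilities": {"alwaysMatch": {"browserName": browser_name}}}
    return "POST", "/session", json.dumps(body)


def navigate_request(session_id, url):
    """Build the 'Navigate To' request: POST /session/{session id}/url."""
    return "POST", f"/session/{session_id}/url", json.dumps({"url": url})


def screenshot_request(session_id):
    """Build the 'Take Screenshot' request: GET /session/{session id}/screenshot."""
    return "GET", f"/session/{session_id}/screenshot", None


# "hypothetical-session-id" stands in for the ID returned by the server
# in response to the New Session request.
requests_to_send = [
    new_session_request(),
    navigate_request("hypothetical-session-id", "https://nowsecure.nl"),
    screenshot_request("hypothetical-session-id"),
]
```

With Selenium each of these becomes a single method call, and the session bookkeeping is handled for you.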
Clients
There are wrappers for the Selenium library which make it accessible from a variety of languages. Possibly the most frequently used languages for this purpose are (IMHO) Java, Python and R, but you could also use C#, Ruby or JavaScript.
Undetected ChromeDriver in Docker
You can install the undetected-chromedriver package using pip.
pip install undetected-chromedriver
Many applications get wrapped up in a Docker image, so it’s rather useful to have Python, the undetected-chromedriver package, ChromeDriver and a browser all neatly enclosed in a single image.
There’s an Undetected ChromeDriver Docker image. However, the corresponding Dockerfile is not available and I like to understand what’s gone into an image. I rolled my own, which can be found here.
Example
We’re going to access two sites:
- https://nowsecure.nl — a test site with “max anti-bot protection” and
- https://datadome.co — a provider of “bot management software”.
💡 If you’re trying this out yourself then you might want to run the Undetected ChromeDriver examples first and only then come back to Selenium, because the latter will likely result in your IP address being flagged.
Using Selenium
To run these examples I launched a Selenium Docker container exposing VNC on port 7900 and the Selenium hub on port 4444.
docker run -p 4444:4444 -p 7900:7900 selenium/standalone-chrome-debug:3.141.59
First we’ll visit https://nowsecure.nl and take a screenshot.
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
driver = webdriver.Remote("http://127.0.0.1:4444/wd/hub", DesiredCapabilities.CHROME)
driver.get("https://nowsecure.nl")
driver.set_window_size(1000, 900)
driver.save_screenshot('selenium-nowsecure.png')
This is what the screenshot looks like.
It’s a little underwhelming, but it indicates that one of the anti-bot mechanisms on the site is blocking us. Hang on, you’ll see shortly what it should look like. Or just visit the site now. If you experience a sensory assault then it’s confirmation that you’re not a bot.
Now let’s take a swing at https://datadome.co. We’ll take another screenshot to record the result.
driver.get("https://datadome.co")
driver.save_screenshot('selenium-datadome.png')
Aha! It looks like we’ve been spotted. A CAPTCHA indicates that the site regards the request as suspicious and would normally scupper our attempts to browse the site.
Using Undetected ChromeDriver
Now we’ll try the same sites using Undetected ChromeDriver. These examples were run in an interactive session using the Undetected ChromeDriver Docker image. Again VNC is exposed on port 7900.
docker run -it -p 7900:7900 --shm-size=2gb datawookie/undetected-chromedriver:latest
The --shm-size option is not always necessary. Depending on the resource use associated with specific web pages it might or might not be required. If in doubt, use it! See this post for an explanation of shared memory and Docker.
Let’s visit https://nowsecure.nl.
import undetected_chromedriver as uc
driver = uc.Chrome()
driver.get("https://nowsecure.nl")
A screenshot indicates that we have penetrated the anti-bot measures.
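If you want to save that screenshot from a script, something along these lines should work. It’s a minimal sketch: the capture() helper and its make_driver parameter are my own invention (the parameter only exists so that the helper can be exercised with a stand-in driver), and the window size simply mirrors the Selenium example above. The set_window_size(), get(), save_screenshot() and quit() methods are the standard Selenium driver methods, which the Undetected ChromeDriver Chrome object inherits.

```python
def capture(url, path, make_driver=None, size=(1000, 900)):
    """Open url, save a screenshot to path and always close the browser."""
    if make_driver is None:
        # Imported lazily so that the helper can be defined without the package.
        import undetected_chromedriver as uc
        make_driver = uc.Chrome
    driver = make_driver()
    try:
        driver.set_window_size(*size)
        driver.get(url)
        # save_screenshot() returns True if the file was written.
        return driver.save_screenshot(path)
    finally:
        # Shut down Chrome even if the page load or screenshot fails.
        driver.quit()


# Usage inside the container (the file name is hypothetical):
# capture("https://nowsecure.nl", "undetected-nowsecure.png")
```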
What about DataDome?
driver.get("https://datadome.co")
Looks good!
🚨 If your IP address has already been flagged by an anti-bot mechanism then using Undetected ChromeDriver is probably not going to help you. Well, not from the compromised IP address. If you can get a fresh IP address then you’re back in business.
Extending the Undetected ChromeDriver Image
The benefit of having a Docker image with the Undetected ChromeDriver functionality is that you can easily create a derived image with additional capabilities. Suppose, for example, that I wanted an Undetected ChromeDriver script that also used the pyjokes package (because why wouldn’t you?). The script, doit.py, might look like this:
import undetected_chromedriver as uc
import pyjokes
driver = uc.Chrome()
driver.get("https://nowsecure.nl")
print(driver.page_source)
print(pyjokes.get_joke())
And the corresponding Dockerfile would be:
FROM datawookie/undetected-chromedriver:latest
RUN pip3 install pyjokes
COPY doit.py .
CMD ["python", "doit.py"]
This is based on the Undetected ChromeDriver image but adds the pyjokes package and includes the script itself (a container will automatically run the script).
🚨 Don’t install the following Python packages into the derived image because the correct versions are already in the base image: selenium, requests, urllib3 or undetected_chromedriver.
Logging
It can be useful to record the ChromeDriver logs. This is especially handy if you have trouble with launching Chrome. To do this simply pass the service_log_path argument when you instantiate a Chrome object.
driver = uc.Chrome(service_log_path="chromedriver.log")
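Once you have a log, the next job is finding the interesting lines. Here’s a rough sketch that pulls out warnings and errors. It assumes that each log line carries a level tag like [SEVERE] or [WARNING], so check that against your own log before relying on it. The problem_lines() helper and the sample lines in the comments are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the log marks each line with a level tag such as [INFO],
# [WARNING] or [SEVERE]. Adjust the pattern if your log looks different.
LEVEL_TAG = re.compile(r"\[(SEVERE|WARNING)\]")


def problem_lines(log_text):
    """Return only the log lines tagged as warnings or errors."""
    return [line for line in log_text.splitlines() if LEVEL_TAG.search(line)]


def problem_lines_from_file(path="chromedriver.log"):
    """Read a log file and filter it, tolerating odd bytes in the log."""
    return problem_lines(Path(path).read_text(errors="replace"))


# Usage, after a session that was started with service_log_path set:
# for line in problem_lines_from_file("chromedriver.log"):
#     print(line)
```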
Troubleshooting
Sometimes you might get the following WebDriverException:
unknown error: session deleted because of page crash
This is most likely due to the process running out of memory in the Docker container. To get around this use --shm-size="2g" when running the image.
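If you’re not sure how much shared memory a container actually has, you can check from inside it. This is a small sketch using only the standard library. It assumes that shared memory is mounted at /dev/shm (the usual location in a Linux container), and the shm_total_bytes() helper is just an illustrative name.

```python
import os
import shutil


def shm_total_bytes(path="/dev/shm"):
    """Return the size of the shared memory mount in bytes, or None if absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total


total = shm_total_bytes()
if total is None:
    print("No /dev/shm found on this system.")
else:
    print(f"/dev/shm size: {total / 2**20:.0f} MiB")
```

Docker’s default is 64 MB, so if that’s roughly what you see then the --shm-size option didn’t make it onto your docker run command.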