
Sometimes a site will work fine with Selenium or Playwright until you try headless mode. Then it might fling up some anti-bot mechanism, or just stop responding altogether. Fortunately there are some simple things that you can do to work around this.
These are the approaches that I usually take.
Explicit User Agent
Using an explicit User Agent (rather than the one used by default with Selenium or Playwright) is often enough to persuade a site that you are a legitimate browser.
Get a User Agent string from a recently updated browser.
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0"
That’s from Firefox. You might have more success with a Chrome User Agent. Try various options.
With Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument(f"user-agent={USER_AGENT}")
driver = webdriver.Chrome(options=options)
With Playwright:
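from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(user_agent=USER_AGENT)
    page = context.new_page()
    # Use the page here, inside the with block.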
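To see why this helps, it is worth checking what the browser reports by default. Chromium in headless mode has typically advertised itself with a HeadlessChrome token in its User Agent, which is an easy thing for a site to match on. The sketch below is a minimal check for that token. The looks_headless helper and the DEFAULT_HEADLESS_UA sample string are my own illustrative assumptions, not part of Selenium or Playwright.

```python
# Minimal sketch: flag User Agent strings that give away a headless browser.
# looks_headless and DEFAULT_HEADLESS_UA are hypothetical names for illustration.

USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0"

# Illustrative example of the kind of string headless Chromium tends to send.
DEFAULT_HEADLESS_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36"
)


def looks_headless(user_agent: str) -> bool:
    """Return True if the string contains the 'HeadlessChrome' token."""
    return "headlesschrome" in user_agent.lower()


if __name__ == "__main__":
    for ua in (DEFAULT_HEADLESS_UA, USER_AGENT):
        print(looks_headless(ua), ua)
```

To feed it the real value, driver.execute_script("return navigator.userAgent") in Selenium or page.evaluate("navigator.userAgent") in Playwright should return the string that the site actually sees.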
Virtual Framebuffer
Another approach is to simply not use a headless browser at all.
Normally I’ll use a headless browser under the following conditions:
- a crawler that’s working reliably and I don’t need to monitor its progress;
- a crawler that’s running on a remote server; or
- a crawler that’s running in a container.
Presumably the first case is not applicable here because we’re talking about crawlers that don’t run well in headless mode. In both of the remaining cases using a Virtual Framebuffer is a good option.
First install the xvfb package. The instructions below are for Debian-based machines.
sudo apt-get update -q
sudo apt-get install xvfb
Now start the framebuffer as a background process.
Xvfb :99 -screen 0 1024x768x16 &
That will effectively launch a virtual X11 server. It’s “virtual” in the sense that it doesn’t require any physical hardware. Breaking down the arguments:
- :99 is the display number;
- -screen 0 is the screen number; and
- 1024x768x16 is the width, height and colour depth of the virtual display.
Set the DISPLAY environment variable to correspond to the display number specified when launching the framebuffer.
export DISPLAY=:99
Now you can run the crawler without headless mode but it will not launch a browser window onto a physical display. Rather the browser will be rendered onto the framebuffer.
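If you would rather not manage the framebuffer from the shell, the same two steps (start Xvfb, then set DISPLAY) can be done from the crawler itself. This is a rough sketch using only the standard library. The function names xvfb_command and start_xvfb are my own, and it assumes that Xvfb is installed and that display :99 is free.

```python
import os
import shutil
import subprocess


def xvfb_command(display=":99", width=1024, height=768, depth=16):
    """Build the same Xvfb command line used above."""
    return ["Xvfb", display, "-screen", "0", f"{width}x{height}x{depth}"]


def start_xvfb(display=":99"):
    """Start Xvfb in the background and point DISPLAY at it."""
    if shutil.which("Xvfb") is None:
        raise RuntimeError("Xvfb not found; install the xvfb package first")
    process = subprocess.Popen(xvfb_command(display))
    # Browsers launched from this process will now render to the framebuffer.
    os.environ["DISPLAY"] = display
    return process


if __name__ == "__main__":
    xvfb = start_xvfb()
    try:
        # Launch Selenium or Playwright here, without headless mode.
        pass
    finally:
        # Stop the framebuffer once the crawler is done.
        xvfb.terminate()
        xvfb.wait()
```

Xvfb may need a moment before it accepts connections, so a short pause or a retry before launching the browser might be necessary.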