Headless Browser Hacks

Sometimes a site will work fine with Selenium or Playwright until you try headless mode. Then it might throw up some anti-bot mechanism, or just stop responding altogether. Fortunately there are some simple things that you can do to work around this.

These are the approaches that I usually take.

Realistic Explicit User Agent

Using an explicit User Agent (rather than the one used by default with Selenium or Playwright) is often enough to persuade a site that you are a legitimate browser.

Get a User Agent string from a recently updated browser.

# Chrome
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36"
# Firefox
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0"

Choose a User Agent that’s appropriate to the browser that you’re using. Don’t choose a Firefox User Agent if you are using a Chrome browser!
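
A quick sanity check can catch a mismatched User Agent before the crawler runs. The sketch below is only a rough heuristic based on substrings that normally appear in User Agent strings. The function names (user_agent_family, check_user_agent) and the family labels are hypothetical, chosen here to mirror Playwright's browser names.

```python
CHROME_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36"
)
FIREFOX_UA = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:137.0) "
    "Gecko/20100101 Firefox/137.0"
)


def user_agent_family(user_agent: str) -> str:
    """Guess a coarse browser family from a User Agent string (heuristic)."""
    # Firefox User Agents do not contain "Chrome/", so test for Firefox first.
    if "Firefox/" in user_agent:
        return "firefox"
    # Chrome User Agents also contain "Safari/", so test Chrome before Safari.
    if "Chrome/" in user_agent or "Chromium/" in user_agent:
        return "chromium"
    if "Safari/" in user_agent:
        return "webkit"
    return "unknown"


def check_user_agent(user_agent: str, browser_name: str) -> None:
    """Raise ValueError if the User Agent does not look like browser_name."""
    family = user_agent_family(user_agent)
    if family != browser_name:
        raise ValueError(
            f"User Agent looks like {family!r} but the browser is {browser_name!r}"
        )
```

Calling check_user_agent(USER_AGENT, "chromium") before launching a Chrome or Chromium browser will fail early if a Firefox string has been pasted in by mistake.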

With Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument(f"user-agent={USER_AGENT}")

driver = webdriver.Chrome(options=options)

With Playwright:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(user_agent=USER_AGENT)
    page = context.new_page()

Realistic Window Size

In the same way that a realistic User Agent can improve the browser fingerprint, so too can setting a plausible window size that differs from the tool's default.

With Selenium:

driver.set_window_size(1250, 750)

With Playwright:

context = browser.new_context(viewport={"width": 1250, "height": 750})
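
Since the User Agent and window size usually get set together, it can help to keep them in one place. The following is a minimal sketch with two hypothetical helpers (chrome_arguments and playwright_context_options). It assumes Chrome's --user-agent and --window-size command line switches and Playwright's user_agent and viewport context options.

```python
def chrome_arguments(user_agent: str, width: int = 1250, height: int = 750) -> list:
    """Build Chrome command line switches for the User Agent and window size."""
    return [
        f"--user-agent={user_agent}",
        f"--window-size={width},{height}",
    ]


def playwright_context_options(
    user_agent: str, width: int = 1250, height: int = 750
) -> dict:
    """Build keyword arguments for Playwright's browser.new_context()."""
    return {
        "user_agent": user_agent,
        "viewport": {"width": width, "height": height},
    }


# Intended use (not executed here, since it needs a browser):
#
#   for argument in chrome_arguments(USER_AGENT):
#       options.add_argument(argument)
#
#   context = browser.new_context(**playwright_context_options(USER_AGENT))
```

With Selenium this sets the window size at launch rather than after the driver has started, which may or may not matter for a given site.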

Appearing to be Less Headless

There are other settings that can be applied to make the browser less likely to be flagged as automated. You’ll need to apply some trial and error to see which (if any) of these are effective.

With Selenium:

# Chrome
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--disable-popup-blocking")
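
With a Chromium-based driver it should also be possible to override navigator.webdriver before any page script runs, using the Chrome DevTools Protocol. This is a sketch that assumes Selenium 4's execute_cdp_cmd() method and the CDP command Page.addScriptToEvaluateOnNewDocument. The helper name hide_webdriver_flag is hypothetical, and the approach does not apply to Firefox.

```python
# JavaScript that makes navigator.webdriver report undefined.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"
)


def hide_webdriver_flag(driver):
    """Register the override so that it runs at the start of every new document.

    `driver` is assumed to be a Chromium-based Selenium WebDriver, since
    execute_cdp_cmd() is only available on those drivers.
    """
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )


# Intended use (not executed here, since it needs a browser):
#
#   driver = webdriver.Chrome(options=options)
#   hide_webdriver_flag(driver)
#   driver.get(url)
```

The script needs to be registered before the first call to driver.get(), otherwise the first page will load without it.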

With Playwright:

context = browser.new_context(
    device_scale_factor=1,
    is_mobile=False,
    has_touch=False,
)

context.add_init_script("""
    Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
""")

Virtual Framebuffer

Another approach is to simply not use a headless browser at all.

Normally I’ll use a headless browser under the following conditions:

  • a crawler that’s working reliably and I don’t need to monitor its progress;
  • a crawler that’s running on a remote server; or
  • a crawler that’s running in a container.

Presumably the first case is not applicable here because we’re talking about crawlers that don’t run well in headless mode. In both of the remaining cases using a Virtual Framebuffer is a good option.

First install the xvfb package. The instructions below are for Debian-based machines.

sudo apt-get update -q
sudo apt-get install xvfb

Now start the framebuffer as a background process.

Xvfb :99 -screen 0 1024x768x16 &

That will effectively launch a virtual X11 server. It’s “virtual” in the sense that it doesn’t require any physical hardware. Breaking down the arguments:

  • :99 — the display number;
  • -screen 0 — the screen number;
  • 1024x768x16 — the width, height and colour depth of the virtual display.

Set the DISPLAY environment variable to correspond to the display number specified when launching the framebuffer.

export DISPLAY=:99

Now you can run the crawler without headless mode but it will not launch a browser window onto a physical display. Rather the browser will be rendered onto the framebuffer.
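
If the crawler is written in Python then the framebuffer can also be started and stopped from within the script. The sketch below wraps the same Xvfb command in a context manager. The names xvfb_command and virtual_display are hypothetical, and the one second pause is a crude assumption about how long the server takes to start.

```python
import os
import subprocess
import time
from contextlib import contextmanager


def xvfb_command(display=99, width=1024, height=768, depth=16):
    """Build the Xvfb command line for the given display and screen geometry."""
    return ["Xvfb", f":{display}", "-screen", "0", f"{width}x{height}x{depth}"]


@contextmanager
def virtual_display(display=99, **screen):
    """Run Xvfb for the duration of the block and point DISPLAY at it."""
    # Raises FileNotFoundError if the Xvfb binary is not installed.
    process = subprocess.Popen(xvfb_command(display, **screen))
    previous = os.environ.get("DISPLAY")
    os.environ["DISPLAY"] = f":{display}"
    try:
        time.sleep(1)  # Crude wait for the X server to start accepting clients.
        yield process
    finally:
        # Stop the server and restore the original DISPLAY setting.
        process.terminate()
        process.wait()
        if previous is None:
            os.environ.pop("DISPLAY", None)
        else:
            os.environ["DISPLAY"] = previous


# Intended use (not executed here, since it needs Xvfb and a browser):
#
#   with virtual_display(width=1250, height=750, depth=24):
#       driver = webdriver.Chrome(options=options)  # no headless option
```

Anything launched inside the with block inherits the DISPLAY setting, so the browser is rendered onto the framebuffer just as it would be after running export DISPLAY=:99 in the shell.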