Headless Browser Hacks

Sometimes a site will work fine with Selenium or Playwright. Until you try headless mode… Then it might fling up some anti-bot mechanism. Or just stop responding altogether. Fortunately there are some simple things that you can do to work around this.

These are the approaches that have worked for me, in order of increasing complexity.

Realistic Explicit User Agent

Using an explicit User Agent (rather than the default from Selenium or Playwright) is often enough to persuade a site that you are a legitimate browser.

Get a User Agent string from a recently updated browser.

USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0"

That’s from Firefox. You might have more success with a Chrome User Agent. Try various options.
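Trying various options can be automated. The sketch below is one possible way to do it: it walks a list of candidate User Agents and returns the first one that passes a check. The names CANDIDATE_USER_AGENTS, first_accepted_user_agent and is_accepted are hypothetical. In practice is_accepted would open the site with the given User Agent and decide whether the response looks normal.

```python
from typing import Callable, Iterable, Optional

# Candidates to try, in order. Replace these with strings copied from
# browsers that you have updated recently.
CANDIDATE_USER_AGENTS = [
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
]


def first_accepted_user_agent(
    candidates: Iterable[str],
    is_accepted: Callable[[str], bool],
) -> Optional[str]:
    """Return the first User Agent for which is_accepted() is true, else None."""
    for user_agent in candidates:
        # is_accepted is a caller-supplied (hypothetical) check against the site.
        if is_accepted(user_agent):
            return user_agent
    return None
```

Keeping the check as a separate callable means the same loop works with either Selenium or Playwright.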

With Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument(f"--user-agent={USER_AGENT}")

driver = webdriver.Chrome(options=options)

With Playwright:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(user_agent=USER_AGENT)
    page = context.new_page()

If this does the trick then you might want to consider just updating to a more recent browser version.

Realistic Window Size

In the same way that a realistic User Agent can improve the browser fingerprint, so too can setting a realistic, non-default window size.

With Selenium:

driver.set_window_size(1250, 750)

With Playwright:

context = browser.new_context(viewport={"width": 1250, "height": 750})
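If you switch between the two tools, it can help to keep the size in one place. The following is a minimal sketch with hypothetical helper names (WINDOW_SIZE, chrome_window_size_arg, playwright_viewport). It relies on Chrome's --window-size switch as an alternative to calling set_window_size() after launch. Bear in mind that a window size and a viewport are not quite the same thing, since a headed window also includes the browser's own toolbars.

```python
from typing import Dict, Tuple

# (width, height) in pixels, matching the examples above.
WINDOW_SIZE: Tuple[int, int] = (1250, 750)


def chrome_window_size_arg(size: Tuple[int, int]) -> str:
    """Format a size as Chrome's --window-size command-line switch."""
    width, height = size
    return f"--window-size={width},{height}"


def playwright_viewport(size: Tuple[int, int]) -> Dict[str, int]:
    """Format a size as the viewport dict accepted by browser.new_context()."""
    width, height = size
    return {"width": width, "height": height}
```

With Selenium the first helper would be passed to options.add_argument(), and with Playwright the second would be passed as viewport= to browser.new_context().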

Appearing to be Less Headless

There are other settings which can be applied to make the browser less likely to be flagged as automated. You’ll need to use trial and error to see which (if any) of these are effective.

With Selenium:

# Chrome
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--disable-popup-blocking")

With Playwright (and a lot more tweaks!):

browser = p.chromium.launch(
    channel="chrome",
    headless=True,
    args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-dev-shm-usage",
        "--disable-extensions",
        "--disable-gpu",
        "--disable-infobars",
        "--disable-popup-blocking",
        "--no-sandbox",
        "--start-maximized",
    ]
)

context = browser.new_context(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    viewport={"width": 1280, "height": 720},
    locale="en-US",
    timezone_id="Europe/London",
    # Use a location that's suitable for site and consistent with timezone.
    geolocation={"longitude": -0.1125172, "latitude": 51.325689},
    # Automatically allow sites access to geolocation information.
    permissions=["geolocation"],
    java_script_enabled=True,
    device_scale_factor=1,
    is_mobile=False,
    has_touch=False,
)

# Override the webdriver property, which is often set under automation.
context.add_init_script("""
    Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
""")

page = context.new_page()
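Since this is a trial and error process, it helps to see what the page can actually observe. The sketch below is an assumption about how you might do that with Playwright: it evaluates a few JavaScript expressions through page.evaluate() and collects the results. The names FINGERPRINT_PROBES and fingerprint_report are hypothetical, and the probes are only a small sample of what a site might inspect.

```python
from typing import Any, Dict

# JavaScript expressions evaluated in the page. The keys are labels chosen
# for this sketch; they are not part of any API.
FINGERPRINT_PROBES = {
    "user_agent": "navigator.userAgent",
    "webdriver": "navigator.webdriver",
    "languages": "navigator.languages",
    "inner_width": "window.innerWidth",
    "inner_height": "window.innerHeight",
}


def fingerprint_report(page: Any) -> Dict[str, Any]:
    """Evaluate each probe in the page and collect the results by label."""
    return {
        label: page.evaluate(expression)
        for label, expression in FINGERPRINT_PROBES.items()
    }
```

Calling print(fingerprint_report(page)) after page.goto() in both headed and headless mode gives a quick way to compare the two.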

📌 Some other things that I observed for Playwright:

  • Using the Chrome channel is better than using Chromium.
  • You might be penalised for using Chrome from Chrome for Testing. Use the current production version of Chrome instead.
  • Firefox and WebKit browsers might fare better than Chrome for headless. In fact, for some sites where I would need an elaborate Chrome setup like the one above, I could get through with an unmodified instance of Firefox or WebKit.
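Comparing engines can be scripted too. This is a rough sketch rather than a definitive recipe: it launches each Playwright engine in headless mode, loads a URL and returns the name of the first engine whose page passes a check. The names ENGINES, first_working_engine and looks_ok are hypothetical, and looks_ok stands in for whatever test tells you the site responded normally.

```python
from typing import Any, Callable, Optional, Sequence

# Attribute names of the browser types on the Playwright object.
ENGINES = ("chromium", "firefox", "webkit")


def first_working_engine(
    p: Any,
    url: str,
    looks_ok: Callable[[Any], bool],
    engines: Sequence[str] = ENGINES,
) -> Optional[str]:
    """Return the name of the first engine whose headless page passes looks_ok()."""
    for name in engines:
        browser = getattr(p, name).launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            if looks_ok(page):
                return name
        finally:
            # Always close the browser, whether or not the check passed.
            browser.close()
    return None
```

Here p is the object returned by sync_playwright(). Note that page.goto() raises if the site never responds, so you may want to catch that and move on to the next engine.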

I often flip back and forth between a headed and headless browser. A setup like this can make that easier. I especially like to set the no_viewport option because that makes the headed browser experience a bit friendlier.

kwargs = {
    "user_agent": USER_AGENT,
    "locale": "en-US",
    "timezone_id": "Europe/London",
    "java_script_enabled": True,
    "is_mobile": False,
    "has_touch": False,
    "ignore_https_errors": True,
}

if headless:
    kwargs.update({
        "viewport": {"width": 1280, "height": 720},
        "device_scale_factor": 1,
    })
else:
    kwargs.update({
        "no_viewport": True,
    })

context = browser.new_context(**kwargs)
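The snippet above assumes that a headless flag has already been defined. One possible way to set it without editing the script is an environment variable. The sketch below assumes a hypothetical HEADLESS variable and a hypothetical helper called headless_from_env; neither is part of Selenium or Playwright.

```python
import os


def headless_from_env(default: bool = True, variable: str = "HEADLESS") -> bool:
    """Read a boolean flag from the environment, falling back to a default."""
    value = os.environ.get(variable)
    if value is None:
        return default
    # Treat common "off" spellings as False and anything else as True.
    return value.strip().lower() not in {"0", "false", "no", "off"}


headless = headless_from_env()
```

Running the crawler with HEADLESS=0 in the environment would then give a headed browser, and the same value can be passed to launch(headless=headless).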

Virtual Framebuffer

Another approach is to simply not use a headless browser at all. Normally I’d go headless under the following conditions:

  • a crawler that’s working reliably and I don’t need to monitor its progress;
  • a crawler that’s running on a remote server; or
  • a crawler that’s running in a container.

Presumably the first case is not applicable here because we’re talking about crawlers that don’t run well in headless mode. In the remaining cases using a Virtual Framebuffer is a good option.

First install the xvfb package. The instructions below are for Debian-based machines.

sudo apt-get update -q
sudo apt-get install xvfb

Now start the framebuffer as a background process.

Xvfb :99 -screen 0 1024x768x16 &

That will effectively launch a virtual X11 server. It’s “virtual” in the sense that it doesn’t require any physical hardware. Breaking down the arguments:

  • :99 — the display number;
  • -screen 0 — the screen number;
  • 1024x768x16 — the width, height and colour depth of the virtual display.

Set the DISPLAY environment variable to correspond to the display number specified when launching the framebuffer.

export DISPLAY=:99

Now you can run the crawler without headless mode, but it will not launch a browser window onto a physical display. Rather, the browser will be rendered onto the framebuffer. Take a look here to see how you can view what’s being rendered onto the framebuffer.
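The shell steps above can also be driven from Python, which is handy when the crawler is launched by a script. The following is a minimal sketch, assuming Xvfb is installed and on the PATH. The helper names xvfb_command and virtual_display are hypothetical.

```python
import os
import subprocess
from contextlib import contextmanager
from typing import Iterator, List


def xvfb_command(
    display: int = 99, width: int = 1024, height: int = 768, depth: int = 16
) -> List[str]:
    """Build the Xvfb command line: display number, screen 0 and its geometry."""
    return ["Xvfb", f":{display}", "-screen", "0", f"{width}x{height}x{depth}"]


@contextmanager
def virtual_display(display: int = 99) -> Iterator[str]:
    """Start Xvfb, point DISPLAY at it, then stop it and restore DISPLAY on exit."""
    process = subprocess.Popen(xvfb_command(display))
    previous = os.environ.get("DISPLAY")
    os.environ["DISPLAY"] = f":{display}"
    try:
        yield os.environ["DISPLAY"]
    finally:
        process.terminate()
        process.wait()
        if previous is None:
            os.environ.pop("DISPLAY", None)
        else:
            os.environ["DISPLAY"] = previous
```

The crawler would then run inside a with virtual_display(): block, launching the browser with headless mode turned off. Xvfb may need a moment to start, so a short wait before launching the browser could be necessary.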