Headless Browser Hacks

Sometimes a site will work fine with Selenium or Playwright until you try headless mode. Then it might fling up some anti-bot mechanism. Or just stop responding altogether. Fortunately there are some simple things that you can do to work around this.

These are the approaches that I usually take.

Realistic Explicit User Agent

Using an explicit User Agent (rather than the one used by default with Selenium or Playwright) is often enough to persuade a site that you are a legitimate browser.
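
Part of the reason this works, at least for Chromium-based browsers, is that the default User Agent in headless mode typically advertises itself with a HeadlessChrome token, which is trivial for a site to spot. The sketch below shows the kind of check a site could make. The function name and the headless sample string are illustrative assumptions, not taken from any real detection script.

```python
def looks_headless(user_agent: str) -> bool:
    """Return True if the User Agent carries the HeadlessChrome token."""
    return "headlesschrome" in user_agent.lower()


# Illustrative default User Agent of a headless Chromium build (assumed sample).
DEFAULT_HEADLESS_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/113.0.0.0 Safari/537.36"
)

# Explicit User Agent copied from a regular desktop browser.
EXPLICIT_UA = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:137.0) "
    "Gecko/20100101 Firefox/137.0"
)

print(looks_headless(DEFAULT_HEADLESS_UA))  # True
print(looks_headless(EXPLICIT_UA))  # False
```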

Get a User Agent string from a recently updated browser.

USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:137.0) Gecko/20100101 Firefox/137.0"

That’s from Firefox. You might have more success with a Chrome User Agent. Try various options.

With Selenium:

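from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Override the default User Agent.
options.add_argument(f"--user-agent={USER_AGENT}")

driver = webdriver.Chrome(options=options)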

With Playwright:

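from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    # Set the User Agent on the context so that every page inherits it.
    context = browser.new_context(user_agent=USER_AGENT)
    page = context.new_page()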

Realistic Window Size

In the same way that a realistic User Agent can improve the browser fingerprint, so too can setting a window size that differs from the default used by the automation tool.

With Selenium:

driver.set_window_size(1250, 750)

With Playwright:

context = browser.new_context(viewport={"width": 1250, "height": 750})

Appearing to be Less Headless

There are other settings that can be applied to make the browser less likely to be flagged as automated. You’ll need to apply some trial and error to see which (if any) of these are effective.

With Selenium:

# Chrome
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--disable-popup-blocking")

With Playwright (and a lot more tweaks!):

browser = p.chromium.launch(
    channel="chrome",
    headless=True,
    args=[
        "--disable-blink-features=AutomationControlled",
        "--disable-dev-shm-usage",
        "--disable-extensions",
        "--disable-gpu",
        "--disable-infobars",
        "--disable-popup-blocking",
        "--no-sandbox",
        "--start-maximized",
    ]
)

context = browser.new_context(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    viewport={"width": 1280, "height": 720},
    locale="en-US",
    timezone_id="Europe/London",
    # Use a location that's suitable for site and consistent with timezone.
    geolocation={"longitude": -0.1125172, "latitude": 51.325689},
    # Automatically allow sites access to geolocation information.
    permissions=["geolocation"],
    java_script_enabled=True,
    device_scale_factor=1,
    is_mobile=False,
    has_touch=False,
)

# Override the webdriver property, which is often set under automation.
context.add_init_script("""
    Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
""")

page = context.new_page()
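
To check whether these tweaks have any effect, it helps to ask the page what it can actually see. The following is a minimal sketch, assuming a Playwright page object like the one created above. The helper names fingerprint_report and red_flags are hypothetical, and the list of signals is far from exhaustive.

```python
# JavaScript evaluated in the page; returns values that sites commonly inspect.
FINGERPRINT_JS = """() => ({
    userAgent: navigator.userAgent,
    webdriver: navigator.webdriver === true,
    language: navigator.language,
    width: window.innerWidth,
    height: window.innerHeight,
})"""


def fingerprint_report(page) -> dict:
    """Return a small dictionary of browser properties as seen by the page."""
    return page.evaluate(FINGERPRINT_JS)


def red_flags(report: dict, expected_user_agent: str) -> list[str]:
    """List obvious problems: the automation flag or an override that did not stick."""
    flags = []
    if report.get("webdriver"):
        flags.append("navigator.webdriver is true")
    if report.get("userAgent") != expected_user_agent:
        flags.append("User Agent override was not applied")
    return flags
```

After a page.goto() call, something like print(red_flags(fingerprint_report(page), USER_AGENT)) gives a quick indication of whether a particular setting made a difference.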

ANOTHER OPTION (EXANE ONE)

from playwright.sync_api import sync_playwright

# This works for headless.

with sync_playwright() as p:
    browser = p.chromium.launch(
        # Need to use Chrome (Chromium doesn't work!).
        # Might need to be "proper" Chrome (not Chrome for Testing!).
        channel="chrome",
        headless=True,
        args=[
            "--disable-blink-features=AutomationControlled",
            "--no-sandbox",
            "--disable-dev-shm-usage",
            "--disable-gpu",
        ]
    )
    context = browser.new_context(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
        viewport={"width": 1280, "height": 720},
        locale="en-US",
        timezone_id="Europe/London",
        java_script_enabled=True,
        device_scale_factor=1,
        is_mobile=False,
        has_touch=False,
    )
    page = context.new_page()

    def handler(route, request):
        print(request.url)
        route.continue_()

    page.route("**/*", handler)

    page.goto("https://cube.cib.bnpparibas/home")
    print(page.locator(".textDevices").text_content())

    browser.close()
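
The route handler above prints every URL, which gets noisy quickly. A variation is to tally requests per host instead. This is a minimal sketch, assuming the same two-argument handler signature used above. The name make_counting_handler is a hypothetical helper, not part of Playwright.

```python
from collections import Counter
from urllib.parse import urlsplit


def make_counting_handler(counts: Counter):
    """Build a route handler that counts requests per host, then lets them through."""

    def handler(route, request):
        counts[urlsplit(request.url).netloc] += 1
        # Pass the request on unchanged, just like the printing handler.
        route.continue_()

    return handler


hosts = Counter()
# With a Playwright page (not created in this sketch):
#   page.route("**/*", make_counting_handler(hosts))
#   page.goto(...)
#   print(hosts.most_common(5))
```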

📌 Some other things that I observed for Playwright:

  • Using the Chrome channel is better than using Chromium.
  • You might be penalised for using Chrome from Chrome for Testing. Rather use the current production version of Chrome.
  • Firefox and WebKit browsers might fare better than Chrome for headless. In fact, for some sites where I would need an elaborate Chrome setup like the one above, I could get through with an unmodified instance of Firefox or WebKit.

I often flip back and forth between a headed and headless browser. A setup like this can make that easier. I especially like to set the no_viewport option because that makes the headed browser experience a bit friendlier.

kwargs = {
    "user_agent": USER_AGENT,
    "locale": "en-US",
    "timezone_id": "Europe/London",
    "java_script_enabled": True,
    "is_mobile": False,
    "has_touch": False,
    "ignore_https_errors": True,
}

if headless:
    kwargs.update({
        "viewport": {"width": 1280, "height": 720},
        "device_scale_factor": 1,
    })
else:
    kwargs.update({
        "no_viewport": True,
    })

context = browser.new_context(**kwargs)
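
The same idea can be wrapped in a function so that the headed or headless choice is made in one place. This is a sketch only: context_kwargs and fetch_title are hypothetical helper names, and Playwright is imported inside fetch_title so that the dictionary helper can be used on its own.

```python
def context_kwargs(headless: bool, user_agent: str) -> dict:
    """Return new_context() options suited to headed or headless runs."""
    kwargs = {
        "user_agent": user_agent,
        "locale": "en-US",
        "timezone_id": "Europe/London",
    }
    if headless:
        # A fixed viewport keeps headless runs reproducible.
        kwargs["viewport"] = {"width": 1280, "height": 720}
        kwargs["device_scale_factor"] = 1
    else:
        # Let the page follow the size of the real browser window.
        kwargs["no_viewport"] = True
    return kwargs


def fetch_title(url: str, user_agent: str, headless: bool = True) -> str:
    """Open url with the matching context options and return the page title."""
    from playwright.sync_api import sync_playwright  # requires Playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=headless)
        context = browser.new_context(**context_kwargs(headless, user_agent))
        page = context.new_page()
        page.goto(url)
        title = page.title()
        browser.close()
        return title
```

Switching between modes is then a matter of changing one argument, for example fetch_title(url, USER_AGENT, headless=False) while debugging.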

Virtual Framebuffer

Another approach is to simply not use a headless browser at all.

Normally I’ll use a headless browser under the following conditions:

  • a crawler that’s working reliably and I don’t need to monitor its progress;
  • a crawler that’s running on a remote server; or
  • a crawler that’s running in a container.

Presumably the first case is not applicable here because we’re talking about crawlers that don’t run well in headless mode. In both of the remaining cases using a Virtual Framebuffer is a good option.

First install the xvfb package. The instructions below are for Debian-based machines.

sudo apt-get update -q
sudo apt-get install xvfb

Now start the framebuffer as a background process.

Xvfb :99 -screen 0 1024x768x16 &

That will effectively launch a virtual X11 server. It’s “virtual” in the sense that it doesn’t require any physical hardware. Breaking down the arguments:

  • :99 — the display number;
  • -screen 0 — the screen number;
  • 1024x768x16 — the width, height and colour depth of the virtual display.

Set the DISPLAY environment variable to correspond to the display number specified when launching the framebuffer.

export DISPLAY=:99
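
If the crawler is launched from Python, the same two steps (start Xvfb, then set DISPLAY) can be done from the script itself. This is a minimal sketch using only the standard library, assuming the Xvfb binary is on the PATH. The names xvfb_command and virtual_display are hypothetical helpers, and the one second pause is a crude stand-in for properly waiting until the server is ready.

```python
import os
import subprocess
import time
from contextlib import contextmanager


def xvfb_command(display: int = 99, width: int = 1024, height: int = 768, depth: int = 16) -> list[str]:
    """Build the Xvfb command line for a single virtual screen."""
    return ["Xvfb", f":{display}", "-screen", "0", f"{width}x{height}x{depth}"]


@contextmanager
def virtual_display(display: int = 99, **screen):
    """Run Xvfb for the duration of the block and point DISPLAY at it."""
    process = subprocess.Popen(xvfb_command(display, **screen))
    previous = os.environ.get("DISPLAY")
    os.environ["DISPLAY"] = f":{display}"
    try:
        time.sleep(1)  # crude wait for the server to come up
        yield process
    finally:
        # Stop the server and restore the previous DISPLAY value.
        process.terminate()
        process.wait()
        if previous is None:
            os.environ.pop("DISPLAY", None)
        else:
            os.environ["DISPLAY"] = previous
```

The crawler would then run inside a with virtual_display(): block, launching the browser with headless mode switched off.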

Now you can run the crawler without headless mode but it will not launch a browser window onto a physical display. Rather the browser will be rendered onto the framebuffer. Take a look here to see how you can view what’s being rendered onto the framebuffer.