Web Scraping with Class Name Mangling

Class name mangling (or hashing) is becoming increasingly prevalent. There’s no need to let it slow you down though. This is how you can deal with it.

Class Name What?

Class name mangling is the practice of appending a random-looking suffix (often a hash) to class names. There are several reasons why it might be applied:

  1. Namespace isolation: Ensures that the styles defined in one part of a website or application do not unintentionally affect other parts. This is particularly useful in large projects or when integrating third-party libraries.
  2. Minification and optimization: When mangling replaces long class names with short hashes, the names take up less space. This can improve load times and overall performance.
  3. Cache busting: Class names that change with each build keep a stale cached stylesheet from being applied to the new markup.
  4. Scraping deterrent: Can make web scraping slightly more difficult.

Consider the following two chunks of HTML. The first has nice deterministic class names.

<div class="people-item-card">
  <div class="item-card">
    <div class="item-card-header">
      <div class="item-card-people">
        <div class="item-card-body">
          <div class="imageWrapper">
            <img class="image-item">
          </div>
          <div class="card-hover-item">
            <p class="text-black people-name"></p>
            <p class="text-black people-title"></p>
            <p class="text-black people-location"></p>
          </div>
        </div>
      </div>
    </div>
  </div>
</div>

The second has class names polluted by randomised gunk.

<div class="people-item-card___Xma9s">
  <div class="item-card___0HOmd">
    <div class="item-card-header___IP-Zw">
      <div class="item-card-people___6Jm8O">
        <div class="item-card-body___uSiw4">
          <div class="imageWrapper___LFfv-">
            <img class="image-item___p50d5">
          </div>
          <div class="card-hover-item___JdoHD">
            <p class="text-black people-name___AT6yb"></p>
            <p class="text-black people-title___mYz85"></p>
            <p class="text-black people-location___3Bgbo"></p>
          </div>
        </div>
      </div>
    </div>
  </div>
</div>

CSS Selectors for Mangled Class Names

With deterministic class names it’s easy to write a CSS selector. For example, the outermost <div> in the first code sample above could be selected with .people-item-card. You could certainly use .people-item-card___Xma9s for the second code sample, and that might work for a while. But if the site gets rebuilt then the class names will all get new randomised suffixes and your carefully crafted crawler will cease to work. ☹️ Bummer.

Attribute Selectors

So is there a robust way around this? Yes, indeed there is! You can use an attribute selector with the ^= operator. For example, div[class^="people-item-card"] will match the <div> with class people-item-card___Xma9s and will also match one with class people-item-card___b8WW9. 💡 Although a class selector begins with a ., the value inside an attribute selector does not.

How does this work? The square brackets in the CSS selector introduce attribute matching and the ^= operator does a “starts with” match against the whole value of the class attribute.
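To see why a prefix match survives a rebuild, here’s a minimal Python sketch. It doesn’t use a real CSS engine: it pulls the class attributes out with the standard library’s html.parser and emulates ^= with str.startswith. The names ClassAttrCollector, select_class_prefix, build_one and build_two are made up for illustration, and the suffixes in the second build are invented.

```python
from html.parser import HTMLParser


class ClassAttrCollector(HTMLParser):
    """Collect (tag, class attribute value) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class")
        if class_value is not None:
            self.elements.append((tag, class_value))


def select_class_prefix(html, tag, prefix):
    """Emulate tag[class^="prefix"]: the whole class value must start with prefix."""
    parser = ClassAttrCollector()
    parser.feed(html)
    return [value for name, value in parser.elements
            if name == tag and value.startswith(prefix)]


# Two hypothetical builds of the same page; only the suffixes differ.
build_one = ('<div class="people-item-card___Xma9s">'
             '<div class="item-card___0HOmd"></div></div>')
build_two = ('<div class="people-item-card___b8WW9">'
             '<div class="item-card___Qq1zR"></div></div>')

for html in (build_one, build_two):
    print(select_class_prefix(html, "div", "people-item-card"))
# ['people-item-card___Xma9s']
# ['people-item-card___b8WW9']
```

One thing to watch: a short prefix such as item-card would also match a class like item-card-header___IP-Zw, since both start the same way. If the site consistently uses ___ as the separator (as the sample above does), including it in the prefix, for example div[class^="item-card___"], keeps the match tight.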

Multiple Classes

The above solution will only work if the class name that you are targeting is listed first in the class attribute. If it isn’t then you need to use the *= operator, which does a substring match and so will match anywhere in the attribute value (as opposed to only at the beginning). So, for example, to match the <p> tag that holds the person’s name you’d use p[class*="people-name"].
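The difference between the two operators is easy to check with a few lines of Python. This is only a sketch of the matching rules, not a CSS engine: ^= is emulated with str.startswith and *= with the in operator. The helper names (class_values, starts_with, contains, token_starts_with) are hypothetical, and the last one goes a step beyond what’s described above by checking each space-separated class on its own.

```python
from html.parser import HTMLParser


class ClassAttrCollector(HTMLParser):
    """Collect (tag, class attribute value) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class")
        if class_value is not None:
            self.elements.append((tag, class_value))


def class_values(html, tag):
    """Return the class attribute of every <tag> element in the fragment."""
    parser = ClassAttrCollector()
    parser.feed(html)
    return [value for name, value in parser.elements if name == tag]


def starts_with(values, needle):
    """Emulate [class^="needle"]: match at the start of the attribute value."""
    return [v for v in values if v.startswith(needle)]


def contains(values, needle):
    """Emulate [class*="needle"]: match anywhere in the attribute value."""
    return [v for v in values if needle in v]


def token_starts_with(values, needle):
    """Stricter variant: some individual class name must start with needle."""
    return [v for v in values
            if any(token.startswith(needle) for token in v.split())]


html = ('<p class="text-black people-name___AT6yb"></p>'
        '<p class="text-black people-title___mYz85"></p>'
        '<p class="text-black people-location___3Bgbo"></p>')
values = class_values(html, "p")

print(starts_with(values, "people-name"))        # []
print(contains(values, "people-name"))           # ['text-black people-name___AT6yb']
print(token_starts_with(values, "people-name"))  # ['text-black people-name___AT6yb']
```

Because *= is a plain substring match, it would also pick up an unrelated class that merely contains the text, such as a hypothetical other-people-name___abc12. The per-class check in token_starts_with avoids that; in pure CSS the same effect can be had with the selector list p[class^="people-name"], p[class*=" people-name"].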

If you run into a site with mangled class names, don’t be dismayed. Just use suitably crafted attribute selectors.