
I’ve noticed that some of my scraper tests are significantly slower than others. On closer examination I discovered that most of the delay is being incurred when BeautifulSoup is parsing HTML. And a significant proportion of that time is spent checking character encoding.
What is Character Encoding?
Computers store and transmit text as a sequence of bytes. Character encodings are the rules that map those raw bytes to human-readable text. To get the correct letters, punctuation, and symbols, we need to know what encoding was used.
In the early days of computing the rules were simple. Everyone used the ASCII character set, which defined 128 characters including basic English letters, digits and punctuation. One byte per character, no ambiguity.
But as computing evolved and people needed to represent accented letters, symbols and non-Latin scripts, ASCII was no longer enough. New encodings were developed to support these characters. And then you had to know which encoding was being used or risk turning perfectly good text into complete nonsense.
These are some well known encodings:
- ascii — the original 7-bit (128 character) ASCII encoding (still used widely, but not often for HTML);
- utf-8 — the 8-bit UTF (Unicode Transformation Format) encoding, which is most widely used today and handles all Unicode characters; and
- iso-8859-1 — Latin-1 encoding, used for Western European languages.
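To make this concrete, here’s a small illustrative example of what happens when the same bytes are decoded with each of these encodings (any short accented string will do):
# "café" encoded as UTF-8 occupies five bytes (two for the "é").
data = "café".encode("utf-8")
print(data)                       # b'caf\xc3\xa9'

# Decoding with the correct encoding recovers the original text.
print(data.decode("utf-8"))       # café

# Decoding with the wrong encoding silently produces mojibake.
print(data.decode("iso-8859-1"))  # cafÃ©

# ASCII can't represent these bytes at all.
try:
    data.decode("ascii")
except UnicodeDecodeError as error:
    print(error)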
Detecting Character Encoding
HTML documents are found with a variety of different encodings. BeautifulSoup needs to detect the appropriate encoding to ensure that its output is not a garbled mess.
Depending on how it’s done, detecting character encoding can be relatively time consuming. To get some intuition around this we’ll use the chardet package, which gathers statistics to infer character encoding. On larger HTML documents, identifying the character encoding can often take significantly longer than actually parsing the HTML.
CLI Simple Test
Let’s see how this works with a few simple HTML documents. First, a tiny HTML5 document that uses a <meta> tag in the <head> section to specify the character set.
<!DOCTYPE html>
<html lang="fr">
  <head>
    <meta charset="UTF-8">
  </head>
  <body>
    <p>Résumé, naïve, café.</p>
  </body>
</html>
The legacy equivalent uses a <meta> tag to simulate a Content-Type HTTP header.
<!DOCTYPE html>
<html lang="fr">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
    <p>Résumé, naïve, café.</p>
  </body>
</html>
The chardet package has a CLI client that can be used to quickly check encoding. The result is the same with either of the above files.
$ chardet with-charset-header.html
with-charset-header.html: utf-8 with confidence 0.938125
It identifies UTF-8 encoding with high confidence. What about the same document but without the header?
<!DOCTYPE html>
<html>
  <body>
    <p>Résumé, naïve, café.</p>
  </body>
</html>
$ chardet without-header.html
without-header.html: utf-8 with confidence 0.938125
For a simple document the presence of a <meta> tag giving the character encoding doesn’t seem to impact the results from chardet. I suspect that chardet ignores the header because it might not provide the correct information anyway (for example, the header says utf-8 but the document is actually encoded with iso-8859-1), in which case chardet would get better results from a statistical analysis of the HTML contents. The chardet documentation says as much:
Sometimes you receive text with verifiably inaccurate encoding information.
chardet FAQ
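To see why relying on the bytes themselves makes sense, you can hand chardet the same snippet encoded in two different ways and compare what it reports. A quick sketch (the detected encodings and confidences will depend on the text):
import chardet

text = "<p>Résumé, naïve, café.</p>"

# The same markup, encoded two different ways. A <meta> tag could claim
# anything; chardet only ever sees the bytes.
print(chardet.detect(text.encode("utf-8")))
print(chardet.detect(text.encode("iso-8859-1")))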
CLI Realistic Test
Let’s try the CLI out with a more realistic HTML file.
$ time chardet cashmere-interior-acrylic-latex.html
cashmere-interior-acrylic-latex.html: utf-8 with confidence 0.99
real 0m0.936s
user 0m0.922s
sys 0m0.013s
The CLI client takes just under 1 second to detect the encoding for this file. Not long. However, if this was happening repeatedly across an extensive suite of tests then this delay would accumulate. Slow tests are a problem because they’re an impediment to rapid development.
Python Tests
Let’s repeat these checks from Python.
import chardet

def find_encoding(filename: str):
    with open(filename, "rb") as f:
        html = f.read()
    print(chardet.detect(html))

find_encoding("with-charset-header.html")
find_encoding("without-header.html")
find_encoding("cashmere-interior-acrylic-latex.html")
The results are consistent with what we got from the CLI.
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.938125, 'language': ''}
{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}
Unicode, Dammit!
BeautifulSoup doesn’t actually use chardet directly. Instead it uses a sub-library called UnicodeDammit (Unicode, Dammit!) to detect the encoding and, if necessary, convert to Unicode. It depends on either chardet (implemented in Python) or the quicker cchardet (implemented in C) to do the actual encoding detection.
from bs4 import UnicodeDammit

with open("cashmere-interior-acrylic-latex.html", "rb") as f:
    html = f.read()

dammit = UnicodeDammit(html)
print(dammit.original_encoding)
The .original_encoding attribute of the UnicodeDammit object gives the document’s original encoding.
utf-8
This takes a little over half a second.
real 0m0.661s
user 0m0.649s
sys 0m0.012s
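As noted above, UnicodeDammit also converts the markup to Unicode; the decoded document is exposed via its .unicode_markup attribute. A small extension of the snippet above:
from bs4 import UnicodeDammit

with open("cashmere-interior-acrylic-latex.html", "rb") as f:
    html = f.read()

dammit = UnicodeDammit(html)

# The markup, decoded to a Unicode string using the detected encoding.
markup = dammit.unicode_markup
print(type(markup))              # <class 'str'>
print(dammit.original_encoding)  # utf-8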
HTML Parsing and Encoding Detection
Let’s try parsing the realistic HTML file using BeautifulSoup.
from bs4 import BeautifulSoup

with open("cashmere-interior-acrylic-latex.html", "rb") as f:
    html = f.read()

# Create soup from bytes.
soup = BeautifulSoup(html, "lxml")
print(soup.original_encoding)
I’m reading the file as bytes because this most closely emulates my normal scraping workflow, where I persist downloaded HTML as bytes. Again the .original_encoding attribute gives the document’s original encoding.
utf-8
This takes slightly longer than simply determining the encoding, with some of the extra time spent parsing the HTML.
real 0m0.704s
user 0m0.687s
sys 0m0.017s
I used the cProfile module to generate profiling data, retaining only the biggest contributors. The total execution time is a little longer because of the overhead of running the profiler. However, it’s clear that most of the time is being spent figuring out the correct encoding.
ncalls tottime percall cumtime percall filename:lineno(function)
68/1 0.000 0.000 1.117 1.117 {built-in method builtins.exec}
1 0.000 0.000 1.117 1.117 html-parse-naive.py:1(<module>)
1 0.000 0.000 1.040 1.040 __init__.py:122(__init__)
1 0.000 0.000 0.923 0.923 _lxml.py:149(prepare_markup)
1 0.000 0.000 0.923 0.923 dammit.py:407(encodings)
1 0.000 0.000 0.923 0.923 dammit.py:43(chardet_dammit)
1 0.000 0.000 0.923 0.923 __init__.py:24(detect)
1 0.000 0.000 0.923 0.923 universaldetector.py:111(feed)
2 0.000 0.000 0.703 0.352 charsetgroupprober.py:65(feed)
14 0.082 0.006 0.469 0.034 sbcharsetprober.py:77(feed)
13 0.000 0.000 0.387 0.030 charsetprober.py:66(filter_international_words)
1274 0.387 0.000 0.387 0.000 {method 'findall' of 're.Pattern' objects}
13 0.000 0.000 0.386 0.030 __init__.py:209(findall)
1 0.010 0.010 0.218 0.218 latin1prober.py:116(feed)
1 0.116 0.116 0.211 0.211 utf8prober.py:57(feed)
1 0.157 0.157 0.209 0.209 charsetprober.py:103(filter_with_english_letters)
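For reference, a profile like the one above can be captured with something along these lines (a minimal sketch; the sort key and the number of rows shown are my own choices):
import cProfile
import pstats

from bs4 import BeautifulSoup

with open("cashmere-interior-acrylic-latex.html", "rb") as f:
    html = f.read()

# Profile just the parse (which includes the encoding detection).
# (Using cProfile.Profile as a context manager requires Python 3.8+.)
with cProfile.Profile() as profiler:
    BeautifulSoup(html, "lxml")

# Report the biggest cumulative contributors.
stats = pstats.Stats(profiler)
stats.sort_stats("cumtime").print_stats(15)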
Since most HTML documents are UTF-8 encoded this seems like wasted effort. Surely if we know (or are pretty certain of) the correct encoding then we don’t need to check each time?
How can we make that more efficient? Here are some options.
Open as Text with Explicit Encoding
Open the file with the correct encoding.
from bs4 import BeautifulSoup

# The mode is implicitly "rt".
with open("cashmere-interior-acrylic-latex.html", encoding="utf-8") as f:
    html = f.read()

# Create soup from str.
soup = BeautifulSoup(html, "lxml")
print(soup.original_encoding)
A Unicode string (rather than bytes) is being passed to BeautifulSoup, so it doesn’t need to guess the encoding. And the .original_encoding attribute is consequently empty.
None
It’s also much faster.
real 0m0.109s
user 0m0.090s
sys 0m0.019s
Read as Bytes then Decode
Read the file as bytes and then decode using the correct encoding.
from bs4 import BeautifulSoup

with open("cashmere-interior-acrylic-latex.html", "rb") as f:
    html = f.read()

# Decode to str using appropriate encoding.
html = html.decode("utf-8")

# Create soup from str.
soup = BeautifulSoup(html, "lxml")
print(soup.original_encoding)
The .original_encoding attribute is empty again because BeautifulSoup receives a decoded string and doesn’t have to guess the encoding.
None
Execution time is essentially the same as the previous example.
Read as Bytes then Parse with Explicit Encoding
Read the file as bytes and then parse using the correct encoding.
from bs4 import BeautifulSoup

with open("cashmere-interior-acrylic-latex.html", "rb") as f:
    html = f.read()

# Create soup from bytes with specific encoding.
soup = BeautifulSoup(html, "lxml", from_encoding="utf-8")
print(soup.original_encoding)
Now the .original_encoding attribute is populated with the provided encoding. No guessing required.
utf-8
Execution time is similar again.
Much ado about nothing?
Whether or not encoding overhead is a problem depends on your workflow.
If you retrieve the HTML content (using requests.get() or httpx.get()) and immediately parse it then you probably don’t need to worry about encodings because requests or httpx will do it for you. The .text attribute on the response object will automatically apply the appropriate encoding. 🤞 If the result is not quite what you expected then you can intervene manually by either setting the .encoding attribute on the response object or decoding explicitly.
# response is the object returned by requests.get() or httpx.get().

# Set the encoding.
response.encoding = "utf-8"
response.text

# Decode explicitly.
response.content.decode("utf-8")
It’s really more of an issue where your workflow consists of multiple steps like this:
- Download the raw HTML content and then persist to file as bytes.
- Load bytes from the file then parse.
In this case BeautifulSoup may need to work harder to determine the appropriate encoding to apply. But using any of the three approaches illustrated above should sort this out!
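As a concrete sketch of that two-step workflow (the URL and filename here are made up, and UTF-8 is assumed):
import httpx
from bs4 import BeautifulSoup

URL = "https://example.com/product.html"  # Hypothetical URL.

# Step 1: download the raw HTML and persist it as bytes.
response = httpx.get(URL)
with open("product.html", "wb") as f:
    f.write(response.content)

# Step 2 (later, perhaps in a test): load the bytes and parse with an
# explicit encoding so that BeautifulSoup doesn't have to guess.
with open("product.html", "rb") as f:
    html = f.read()

soup = BeautifulSoup(html, "lxml", from_encoding="utf-8")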