What if your text data is contaminated with Unicode characters and HTML entities? Ideally you want your persisted data to be pristine. Metaphorically it should be prêt à manger (ready to eat). In principle I also want my text to be as simple as possible: ASCII characters, nothing else. This is sometimes achievable without the loss of too much information.
HTML Entities
HTML entities are codes used to represent characters that either have a special meaning in HTML (like <, > and &) or are difficult to type directly (such as non-breaking spaces or special symbols). They are written as an ampersand (&), followed by a code name or number, and end with a semicolon (;). There’s a remarkable array of these symbols (see table here).
For example, both < and > have special significance in HTML (they denote the beginning and end of a tag, like <p>, <div> or <blockquote>). To get a literal less-than or greater-than symbol you’d use the corresponding HTML entity: &lt; or &gt;.
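As a quick illustration of the idea, here is a minimal sketch using only the html module from the Python standard library. The sample strings are made up for this example. It shows escaping special characters into entities, and that named, decimal and hexadecimal references all decode to the same character.

```python
import html

# Escaping turns the special characters into entities...
escaped = html.escape("if a < b && b > c")
print(escaped)

# ...and unescaping reverses the process. Named, decimal and hexadecimal
# references all decode to the same character.
same_char = html.unescape("&lt; &#60; &#x3C;")
print(same_char)

# Entities for characters that are awkward to type.
symbols = html.unescape("&copy; 1954 &mdash; caf&eacute;")
print(symbols)
```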
Example: HTML Quote
Suppose, for example, that you had the following HTML content:
<blockquote>
<p>
“All that is gold does not glitter,
Not all those who wander are lost; <br>
The old that is strong does not wither,
Deep roots are not reached by the frost.”
</p>
— J.R.R. Tolkien
© 1954
• <em>The Lord of the Rings: The Fellowship of the Ring</em>
</blockquote>
The HTML renders like this:
“All that is gold does not glitter, Not all those who wander are lost;
The old that is strong does not wither, Deep roots are not reached by the frost.”
— J.R.R. Tolkien © 1954 • The Lord of the Rings: The Fellowship of the Ring
We’ll look at attacking this in both R and Python. In both cases the raw HTML string will be stored initially in quote.
Let’s attack it in R first.
library(xml2)
library(stringr)
# Handle HTML entities.
quote <- xml_text(read_html(quote))
# Clean up whitespace.
quote <- quote |> str_squish() |> str_trim()
The result is squeaky clean (printed using the strwrap() function). All of the HTML entities have been replaced.
“All that is gold does not glitter, Not all those who wander are lost; The old that is strong does
not wither, Deep roots are not reached by the frost.” — J.R.R. Tolkien © 1954 • The Lord of the
Rings: The Fellowship of the Ring
Now flipping over to Python (where we’ll stay for the rest of the post).
import re
import html
from bs4 import BeautifulSoup
soup = BeautifulSoup(quote, "html.parser")
# Extract text and convert HTML entities.
quote = html.unescape(soup.get_text())
# Clean up whitespace.
quote = re.sub(r'\s+', ' ', quote).strip()
Again the result is nice and clean (printed using the textwrap.fill() function).
“All that is gold does not glitter, Not all those who wander are lost; The old that is strong does
not wither, Deep roots are not reached by the frost.” — J.R.R. Tolkien © 1954 • The Lord of the
Rings: The Fellowship of the Ring
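If you’d rather not install BeautifulSoup, something similar can be sketched with the standard library alone. This is not the approach used above, just an alternative under stated assumptions: TextExtractor and html_to_text are hypothetical helper names, and the sample markup is illustrative rather than the exact HTML from this post.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and discard tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse runs of whitespace (including no-break spaces) to one space.
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


# Illustrative input, not the exact markup from the post.
sample = (
    "<p>&ldquo;Not all those who wander are lost;&rdquo;<br>\n"
    "&mdash;&nbsp;J.R.R. Tolkien &copy; 1954</p>"
)
print(html_to_text(sample))
```

For messy real-world HTML a dedicated parser like BeautifulSoup is likely to be more forgiving, so treat this as a fallback rather than a replacement.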
Example: JSON-LD from Job Post
Here’s a more realistic example: parsing HTML embedded in JSON-LD content. In the previous post on gathering JSON-LD data I cunningly omitted the description field from the JSON data for the Research Chemist job post. You can access the full JSON data here. Take a quick look at that to appreciate the quagmire in the description field.
In the interests of clarity I have truncated the description text at the end of the job summary. It is otherwise unaltered. This is what the raw text looks like.
&lt;strong&gt;Adesis Inc. is a subsidiary of Universal Display Corporation (Nasdaq:
OLED)&lt;/strong&gt;, a global leader in organic light emitting diodes. Within our OLED Chemistry
organization, we engage in research and development of new materials for use in the fabrication of
OLED devices leading to the next generation of OLED displays.&lt;br /&gt;\n&lt;br /&gt;\nOur
R&amp;amp;D footprint encompasses almost 100,000 square feet of state-of-the-art laboratory and
manufacturing space in two U.S. locations - New Castle and Wilmington, DE, centrally located in the
Mid-Atlantic biotech/pharmaceutical hub.&lt;br /&gt;\n&lt;br /&gt;\nWe are an extraordinary company
looking for extraordinary talent! If you would like to be a part of a multi-disciplinary team of
scientists and engineers working on fast-paced and innovative OLED research programs that will
impact the next generation of consumer electronics, then please come and join us!\u00a0&lt;br
/&gt;\n&lt;br /&gt;\n&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;b&gt;&lt;span&gt;&lt;span&gt;Job
Summary&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;br
/&gt;\n&lt;br /&gt;\n&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;Responsible for assisting
chemists in performing various purifications of target compounds using various methods, including,
but not limited to preparative high performance liquid chromatography (prep HPLC), normal phase and
reverse phase chromatography, and recrystallization.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
There are a lot of HTML entities in there. In this case the content has been doubly encoded: the HTML itself has also been converted into HTML entities. So we first need to decode that. Then parse the HTML. Then finally decode the remaining entities.
# Convert HTML entities.
description = html.unescape(description)
<strong>Adesis Inc. is a subsidiary of Universal Display Corporation (Nasdaq: OLED)</strong>, a
global leader in organic light emitting diodes. Within our OLED Chemistry organization, we engage in
research and development of new materials for use in the fabrication of OLED devices leading to the
next generation of OLED displays.<br />\n<br />\nOur R&amp;D footprint encompasses almost 100,000
square feet of state-of-the-art laboratory and manufacturing space in two U.S. locations - New
Castle and Wilmington, DE, centrally located in the Mid-Atlantic biotech/pharmaceutical hub.<br
/>\n<br />\nWe are an extraordinary company looking for extraordinary talent! If you would like to
be a part of a multi-disciplinary team of scientists and engineers working on fast-paced and
innovative OLED research programs that will impact the next generation of consumer electronics, then
please come and join us!\u00a0<br />\n<br />\n<span><span><span><b><span><span>Job
Summary</span></span></b></span></span></span><br />\n<br />\n<span><span><span><span>Responsible
for assisting chemists in performing various purifications of target compounds using various
methods, including, but not limited to preparative high performance liquid chromatography (prep
HPLC), normal phase and reverse phase chromatography, and recrystallization.
</span></span></span></span>
We have revealed the HTML tags. The result still has residual HTML entities though: if you squint hard you might see &amp; in there. We’ll parse the HTML, extract the text and convert the remaining HTML entity to text.
soup = BeautifulSoup(description, "html.parser")
description = html.unescape(soup.get_text())
Adesis Inc. is a subsidiary of Universal Display Corporation (Nasdaq: OLED), a global leader in
organic light emitting diodes. Within our OLED Chemistry organization, we engage in research and
development of new materials for use in the fabrication of OLED devices leading to the next
generation of OLED displays.\n\nOur R&D footprint encompasses almost 100,000 square feet of state-
of-the-art laboratory and manufacturing space in two U.S. locations - New Castle and Wilmington, DE,
centrally located in the Mid-Atlantic biotech/pharmaceutical hub.\n\nWe are an extraordinary company
looking for extraordinary talent! If you would like to be a part of a multi-disciplinary team of
scientists and engineers working on fast-paced and innovative OLED research programs that will
impact the next generation of consumer electronics, then please come and join us!\u00a0\n\nJob
Summary\n\nResponsible for assisting chemists in performing various purifications of target
compounds using various methods, including, but not limited to preparative high performance liquid
chromatography (prep HPLC), normal phase and reverse phase chromatography, and recrystallization.
The &amp; entity has been replaced with &. We’re almost done. There are two remaining problems:
- The text contains raw escape sequences, encoded with double backslashes (for example, \\n rather than \n). This is not apparent in the printed text above, but in the underlying string each printed backslash is a literal backslash character (written as \\ in a Python string literal) rather than the start of an escape sequence.
- There’s a Unicode character (represented by \u00a0, which you’ll find on the fourth line of text from the bottom).
We’ll fix these shortly.
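Before moving on, a brief aside on the double encoding. If you don’t know in advance how many layers of entity encoding there are, one option is to keep unescaping until the text stops changing. This is only a sketch: unescape_fully and its max_rounds parameter are hypothetical names of my own. Note that it will also decode text that was deliberately left escaped (for instance a literal &amp;lt; meant to display as &lt;), so it’s probably only appropriate when you know the content has been over-encoded.

```python
import html


def unescape_fully(text, max_rounds=5):
    """Apply html.unescape() until the text stops changing (hypothetical helper)."""
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


# Doubly encoded fragment in the style of the job description.
fragment = "&lt;b&gt;R&amp;amp;D&lt;/b&gt;"
print(html.unescape(fragment))   # one round still leaves &amp;
print(unescape_fully(fragment))  # repeated rounds remove it
```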
Unicode
Unicode characters are often written in Unicode escape notation, which represents these (potentially multi-byte) characters using readable ASCII characters. The escape notation could be either:
- \u followed by a four digit hexadecimal number or
- \U followed by an eight digit hexadecimal number.
Here are some examples of escape codes and the corresponding Unicode characters:
- \u2014 → — (em dash)
- \u23E9 → ⏩
- \u2705 → ✅
- \U0001F525 → 🔥
- \U0001F692 → 🚒
- \U0001F9E8 → 🧨
firecracker = "\U0001F9E8"
print(firecracker)
🧨
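To connect the escape codes with the underlying code points, here’s a small sketch using the standard unicodedata module. It shows that the \u notation, the \N{...} notation and chr() all give the same character, and prints the code point and official name for a few of the examples above.

```python
import unicodedata

# The three spellings below all produce the same one-character string.
dash = "\u2014"
assert dash == "\N{EM DASH}" == chr(0x2014)

examples = ["\u2014", "\u2705", "\U0001F525"]
for char in examples:
    # ord() gives the code point; unicodedata.name() gives the official name.
    print(f"U+{ord(char):04X} {unicodedata.name(char)} {char}")
```

Note that the eight digit \U form still yields a single character: len("\U0001F525") is 1.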
Digression on Raw Strings & Escape Sequences
📢 Feel free to skip this sub-section.
Let’s digress briefly on the topic of raw escape sequences. Suppose that we create a string using the “White Heavy Check Mark” Unicode symbol.
tick = "\u2705"
Printing yields the corresponding Unicode symbol, which is, I suppose, the expected result.
print(tick)
✅
To expect the unexpected shows a thoroughly modern intellect. Oscar Wilde, An Ideal Husband
However, my experience with scraped data (as opposed to data you neatly enter via the keyboard) is that it sometimes doesn’t behave quite the way you expect. What if we created a raw string?
tick = r"\u2705"
The r prefix prevents Python from interpreting backslash escape sequences in the string. Instead it’s stored precisely as it’s written. Printing it simply returns the Unicode escape sequence.
print(tick)
\u2705
This is just what we saw earlier with the job posting, where \u00a0 occurred in the text. Let’s fiddle around with this a bit. First let’s establish that it’s equivalent to a string with an escaped backslash.
r"\u2705" == "\\u2705"
True
So, when creating a raw string with a backslash we are inserting a literal backslash. No escaping required.
What if we apply repr()?
repr(tick)
"'\\\\u2705'"
Whoa! Somehow the backslashes have multiplied like wire coat hangers in a cupboard. What happened? To make sense of this we actually need to print() the result.
print(repr(tick))
'\\u2705'
Okay, that’s more manageable and makes sense. Each backslash needs escaping in a string. So two backslashes actually represent only one backslash internally. Let’s drive this home by looking at the length of the internal representation.
len(tick)
6
Aha! That’s one character for the \, another character for the u and four more characters for 2705. 👍
Converting Raw Unicode
If you have a Unicode escape code in a raw string, how do you get a regular string with the actual Unicode character? Use a combination of encode() and decode():
- encode() converts the raw string into bytes; and
- decode("unicode_escape") interprets Unicode escape sequences and converts them into Unicode characters.
tick.encode().decode("unicode_escape")
'✅'
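One caveat: the unicode_escape codec treats the bytes as Latin-1, so any non-ASCII characters that are already in the string get mangled along the way. If your text mixes real Unicode characters with escaped ones, a regular expression that only touches the escape sequences may be safer. This is a sketch and decode_unicode_escapes is a hypothetical helper name. It deliberately ignores other escapes like \n and doesn’t attempt to combine surrogate pairs.

```python
import re

# Matches a literal backslash followed by u + 4 hex digits or U + 8 hex digits.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{8})")


def decode_unicode_escapes(text):
    """Replace literal \\uXXXX and \\UXXXXXXXX sequences only (hypothetical helper)."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), text)


raw = r"café \u2705"
safe = decode_unicode_escapes(raw)
mangled = raw.encode().decode("unicode_escape")
print(safe)     # the é survives
print(mangled)  # the é is mangled
```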
Example: JSON-LD from Job Post (Continued)
We left the job description with some escaped special characters. Let’s fix those.
description = description.encode().decode('unicode_escape')
# Clean up whitespace (not strictly necessary but removes embedded newlines).
description = re.sub(r'\s+', ' ', description).strip()
Adesis Inc. is a subsidiary of Universal Display Corporation (Nasdaq: OLED), a global leader in
organic light emitting diodes. Within our OLED Chemistry organization, we engage in research and
development of new materials for use in the fabrication of OLED devices leading to the next
generation of OLED displays. Our R&D footprint encompasses almost 100,000 square feet of state-of-
the-art laboratory and manufacturing space in two U.S. locations - New Castle and Wilmington, DE,
centrally located in the Mid-Atlantic biotech/pharmaceutical hub. We are an extraordinary company
looking for extraordinary talent! If you would like to be a part of a multi-disciplinary team of
scientists and engineers working on fast-paced and innovative OLED research programs that will
impact the next generation of consumer electronics, then please come and join us! Job Summary
Responsible for assisting chemists in performing various purifications of target compounds using
various methods, including, but not limited to preparative high performance liquid chromatography
(prep HPLC), normal phase and reverse phase chromatography, and recrystallization.
If you squint then you’ll find that the Unicode escape sequence is gone. 💡 The Unicode character \u00a0 is actually a No-Break Space, so you wouldn’t see the result of the conversion anyway. In fact the whitespace cleanup then replaced it with an ordinary space, since \s also matches a No-Break Space. But at least you don’t have any unsightly escape codes.
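Here’s a quick check of what happens to the No-Break Space, using a fragment of the job description. It’s a sketch using only the standard library, showing that the character is distinct from an ordinary space but is still treated as whitespace by the regular expression.

```python
import re
import unicodedata

nbsp = "\u00a0"
print(unicodedata.name(nbsp))  # official name of the character
print(nbsp == " ")             # it is not an ordinary space
print(nbsp.isspace())          # but it does count as whitespace

# For str patterns, \s matches Unicode whitespace, so the no-break space
# is collapsed along with ordinary spaces and newlines.
cleaned = re.sub(r"\s+", " ", "join us!\u00a0\n\nJob Summary")
print(repr(cleaned))
```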
Decoding Unicode
Another thing that you might want to do is convert Unicode text characters into ASCII.
📢 Warning! This is a lossy process. Since the number of Unicode characters dwarfs the number of ASCII characters, any translation from Unicode to ASCII is likely to lose information!
You can translate from Unicode to ASCII with the unidecode package, which attempts a context-free character-by-character mapping from Unicode to ASCII.
For accented characters it reliably does the right thing.
from unidecode import unidecode
# Unicode accented characters (using mixture of literal and escaped representations).
accented = "Café, Résum\u00E9, Crème, P\u00E8re, Jalapeño"
unidecode(accented)
'Cafe, Resume, Creme, Pere, Jalapeno'
Ligatures are also treated correctly.
ligatures = "Æsir, Œuvre, Bærs\u00E6rk, f\u0153tus"
unidecode(ligatures)
'AEsir, OEuvre, Baersaerk, foetus'
Does the right thing for Greek symbols.
unidecode("αβγΔΩ")
'abgDO'
Currency symbols are okay.
unidecode("€100, £50, ¥2000")
'EUR100, PS50, Y=2000'
Non-Latin script is questionable.
unidecode("你好, Привет, مرحبا")
'Ni Hao , Privet, mrHb'
And many symbols are not translated at all.
unidecode("🔥 ✔ ❌ ")
' '
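If accents are all you care about, a more conservative alternative to unidecode can be sketched with the standard unicodedata module: decompose each character and drop the combining marks. This is not what unidecode does internally (as far as I know it uses lookup tables), and strip_accents is a hypothetical helper name. Characters without a decomposition, like the ligatures, are left untouched rather than transliterated.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after NFKD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = strip_accents("Café, Résumé, Crème, Père, Jalapeño")
print(plain)

# Ligatures have no decomposition, so they pass through unchanged.
unchanged = strip_accents("Æsir, Œuvre")
print(unchanged)
```

The result is not guaranteed to be ASCII, but nothing is silently replaced with a guess either.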
Conclusion
I would suggest always converting HTML entities before persisting content.
In most cases you are probably better off not translating Unicode into ASCII. However, if you have text with escaped Unicode characters then it’s a good idea to convert them into the corresponding Unicode characters. Also if the Unicode characters are simply accented versions of Latin characters then you could argue that the ASCII equivalents will capture most of the information. But, hey, why not just keep the Unicode in this case?