Dealing with a Byte Order Mark (BOM)

How to deal with byte order mark (BOM) characters at the beginning of a file when reading data into R. The key is to specify the correct character encoding!

Published

11 Mar 2015 09:00

I have just been trying to import some data into R. The data were exported from a SQL Server client in tab-separated value (TSV) format. However, reading the data into R the “usual” way produced unexpected results:

data <- read.delim("sample-query.tsv", header = FALSE, stringsAsFactors = FALSE)
head(data)

                                   V1    V2
1 ï»¿7E51B3EC4263438B22811BE78391A823  2129
2    8617E5E557903C7FAF011FBE2DFCED1D  3518
3    1E8B37DFB143BEEEE052516D2F3B58F5  6018
4    60B8AA536CFD26C5B5CF5BA6D7B7893C  7811
5    5A3BA8589DCD62B31948DC2715CA3ED9 12850
6    3552BF8AF58A58C794A43D4AA21F4FBA 13284

Those weird characters in the first record… where did they come from? They don’t show up in a text editor, so they’re not easy to edit out.

Googling ensued and revealed that those weird characters were in fact the byte order mark (BOM), special characters which indicate the endianness of the file. This was quickly confirmed using CYGWIN. (Yes, shamefully, I am working under Windows at the moment!)

CYGWIN window showing results from running the file command on a TSV file.

The solution is remarkably simple: just specify the correct character encoding.

data <- read.delim("sample-query.tsv", header = FALSE, stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
head(data)

                                V1    V2
1 7E51B3EC4263438B22811BE78391A823  2129
2 8617E5E557903C7FAF011FBE2DFCED1D  3518
3 1E8B37DFB143BEEEE052516D2F3B58F5  6018
4 60B8AA536CFD26C5B5CF5BA6D7B7893C  7811
5 5A3BA8589DCD62B31948DC2715CA3ED9 12850
6 3552BF8AF58A58C794A43D4AA21F4FBA 13284

Problem solved.