Pandas on Linux importing UTF 8 (BOM) csv with BOM header as cleartext in column names

Difference on importing csv data on Linux and MacOS

Hello everyone,

When I import a CSV file encoded as UTF-8 (BOM) with pandas.read_csv under Linux, the first column name contains the BOM as cleartext, e.g. \\xEF\\xBB\\xBFColumnName.

When I do the same under macOS, everything works fine. Why does this happen?

I use python-3.10.12 and pandas-2.1.2.

There are 1 best solutions below

It seems this is not an issue with pandas but with how printf behaves when it is run from a Makefile:

Given a file target.csv in UTF-8 (no-BOM) with contents:

colA;colB;...

and a Makefile containing a target like:

target.csv:
    python3 somescript.py
    mv -v $@ $@~
    printf '\xEF\xBB\xBF' | cat - $@~ > $@

running make target.csv

leads to a target.csv containing

\xEF\xBB\xBFcolA;colB;...

as plain text, and the editor (e.g. VS Code or Vim) shows the encoding as UTF-8.
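To tell the two cases apart without relying on an editor, one can look at the first bytes of the file. The following is a minimal sketch, not part of the original setup: the names classify_bom and classify_file are hypothetical helpers, and it assumes the broken file starts with the twelve ASCII characters \xEF\xBB\xBF exactly as shown above.

```python
import codecs

# The twelve ASCII characters that end up in the file when the escape
# sequence is written out literally instead of being interpreted.
LITERAL_ESCAPE = rb"\xEF\xBB\xBF"


def classify_bom(head: bytes) -> str:
    """Classify the start of a file (hypothetical helper)."""
    if head.startswith(codecs.BOM_UTF8):
        # The three real BOM bytes EF BB BF.
        return "real BOM"
    if head.startswith(LITERAL_ESCAPE):
        # Backslash, 'x', 'E', 'F', ... written as plain text.
        return "literal escape text"
    return "no BOM"


def classify_file(path) -> str:
    """Read just enough bytes from the file to classify it."""
    with open(path, "rb") as fh:
        return classify_bom(fh.read(len(LITERAL_ESCAPE)))
```

Calling classify_file("target.csv") after make target.csv should then report "literal escape text" in the broken case described here, and "real BOM" after the command is run from a bash prompt.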

But when I run printf '\xEF\xBB\xBF' | cat - target.csv~ > target.csv directly at a bash prompt, the file is correctly encoded as UTF-8 (BOM):

colA;colB;...

On macOS, this works correctly.
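One likely explanation, which is an assumption and not confirmed in this thread: make runs recipes with /bin/sh by default, and on many Linux distributions /bin/sh is dash, whose printf does not interpret \xHH escapes (POSIX only specifies octal escapes such as \357\273\277). A way to avoid depending on the shell at all would be to add the BOM from Python. The sketch below uses a hypothetical helper name, prepend_bom, and assumes the file is already valid UTF-8.

```python
import codecs
from pathlib import Path


def prepend_bom(path) -> None:
    """Prepend the UTF-8 BOM to a file unless it is already there.

    Hypothetical helper: it works on raw bytes, so the file content
    itself is never re-encoded.
    """
    p = Path(path)
    data = p.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        p.write_bytes(codecs.BOM_UTF8 + data)
```

Such a function could be called at the end of somescript.py, for example as prepend_bom("target.csv"), which would replace the mv and printf lines of the recipe. Reading the result back with encoding="utf-8-sig" strips the BOM again.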

I will post this as a new question, because it is an issue with make / printf rather than with pandas or Python.