Fix encoding errors in a CSV with mixed encodings

There are some other questions here about this problem, but none of them has solved my problem so far.

I have a large (40 MB) CSV file. Most of the file is encoded in ISO-8859-1 (latin1), but some entries (just individual entries!) are in UTF-8.

If I try to open the file with UTF-8 encoding, Python throws an encoding error right away. If I open it as ISO-8859-1, the file can be read, but some entries remain mojibake.
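
I think this is the classic UTF-8-read-as-latin1 mojibake. As a quick illustration (my own repro, not from the actual file): encoding Ü as UTF-8 and decoding the bytes as latin1 yields Ã followed by an invisible C1 control character (0x9C):

>>> "Überarbeitung".encode("utf-8")
b'\xc3\x9cberarbeitung'
>>> b'\xc3\x9cberarbeitung'.decode("iso-8859-1")
'Ã\x9cberarbeitung'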

I tried the following code to fix the issue line by line, but obviously I am missing something, because the UTF-8 entries stay mojibake.


import os
import ftfy

# There are issues with file encoding with this file. Try to convert it to proper utf-8
sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"
sourceFileName = os.path.join(folder, file + config.CSV_EXTENSION)
targetFileName = os.path.join(folder, file + "_temp" + config.CSV_EXTENSION)

# input file
csvfile = open(sourceFileName, "r", encoding = sourceEncoding)
Lines = csvfile.readlines()
for line in Lines:
    line = ftfy.fix_encoding(line) 

# output stream
outfile = open(targetFileName, "w", encoding = targetEncoding) # Windows doesn't like utf8
outfile.writelines(Lines)

# close files
csvfile.close()
outfile.close()

os.remove(sourceFileName)
os.rename(targetFileName, sourceFileName)

One specific string I have looks like this:

Ãberarbeitung A6 Heft

I want it to look like this:

Überarbeitung A6 Heft

Edit:

Some clarifications.

I assume there are encoding issues in the file because I know the entries in the CSV come from two different sources. Most of them are typed into a GUI. Some values come from a self-written script with god knows what encoding.

If I open the CSV in VSCode, it guesses ISO-8859-1. But then some entries look like the one I mentioned above:

Ãberarbeitung A6 Heft

If I change the encoding to UTF-8, this entry becomes 'right':

Überarbeitung A6 Heft

But then other entries change for the worse:

Testdurchf�hrung

The error message when trying to open the file with UTF-8 encoding is:

Exception has occurred: UnicodeDecodeError
'utf-8' codec can't decode byte 0xe4 in position 56: invalid continuation byte
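
That byte fits a latin1 umlaut: 0xE4 is ä in ISO-8859-1, and in UTF-8 a 0xE4 lead byte must be followed by continuation bytes, so a latin1 ä in front of a plain ASCII letter fails in exactly this way (a minimal repro of my own):

>>> b'f\xe4hrt'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 1: invalid continuation byte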

I will try to read the CSV in binary mode and decode it line by line. Maybe this will do the trick.

3 Answers

Accepted Answer (from @gog)

You can read the file in binary mode and decode each line manually:

def try_decode(b, encodings):
    # Try each encoding in order; the first one that decodes cleanly wins.
    for enc in encodings:
        try:
            return b.decode(enc)
        except UnicodeDecodeError:
            pass
    raise ValueError('no matching encoding!')

with open(YOUR_FILE, 'rb') as fp:
    for b in fp:  # iterating a binary file still yields one line (as bytes) at a time
        line = try_decode(b, ['utf8', 'latin1'])

Answer 2

It's possible for a file to be written with "mixed encodings", and based on your comment on Panagiotis_Kanavos' answer, that even seems likely:

# Process 1
with open("input.csv", "w", encoding="utf-8") as f:
    f.write("Überarbeitung 1\n")

# Process 2 (not aware of Process 1's intent)
with open("input.csv", "a", encoding="iso-8859-1") as f:
    f.write("Testdurchführung 2\n")

If your file is a mix of ISO-8859-1 (latin1) and UTF-8, then @gog's answer gives you a template for fixing it.

Be sure to specify UTF-8 before ISO-8859-1, otherwise a valid UTF-8 multibyte sequence will be decoded as a sequence of individual single-byte latin1 characters.
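
A quick demonstration of why the order matters (my own snippet): latin1 assigns a character to every possible byte value, so decoding with it never raises, and it would happily mangle valid UTF-8 instead of rejecting it:

>>> b'\xc3\x9cberarbeitung'.decode('iso-8859-1')  # "succeeds", but produces mojibake
'Ã\x9cberarbeitung'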

# Adapt @gog's solution
def try_decode(b: bytes) -> str | None:
    for enc in ["utf-8", "iso-8859-1"]:
        try:
            s = b.decode(enc)
            print(f"{enc:<10}: {b}")  # log the encoding
            return s
        except UnicodeDecodeError:
            pass

    return None


with (
    open("input.csv", "rb") as f_in,
    open("output.csv", "w", encoding="utf-8") as f_out,  # write uniform UTF-8 explicitly
):
    for i, line in enumerate(f_in, start=1):
        s = try_decode(line)
        if s:
            f_out.write(s)
        else:
            print(f"could not decode line {i}: {line}")

That prints:

utf-8     : b'\xc3\x9cberarbeitung 1\n'
iso-8859-1: b'Testdurchf\xfchrung 2\n'

and output.csv is uniform UTF-8.
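
As a sanity check (my addition, reusing the file names from above), reading the result back with an explicit UTF-8 encoding should not raise:

with open("output.csv", encoding="utf-8") as f:
    print(f.read())  # would raise UnicodeDecodeError if any line had stayed latin1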

Answer 3

Try iso-8859-9, i.e. Turkish, instead of Latin1. Or use chardet on the actual file to detect possible encodings.


There's no such thing as "mixed encodings". That file isn't UTF-8 either. In UTF-8, any character outside the 7-bit US-ASCII range uses two or more bytes, while the US-ASCII characters are encoded identically to Latin1. Both Ã and Ü would appear as two characters if you used Latin1 to read a UTF-8 file.

This isn't a Windows issue either: Windows is a native Unicode OS. Since the very first Windows NT version back in 1993, Windows has been built around Unicode; the ANSI (8-bit) APIs were kept for compatibility with legacy applications until those, too, were recompiled for Unicode. The "system locale" isn't really the system locale either: as its settings page explains, it is just the default encoding used by non-Unicode applications.

You can use the chardet package to try to detect the encoding used in a file. It works by ruling out encodings that would produce invalid byte sequences and by statistical analysis, picking the codepage whose typical byte frequencies best match the bytes of the file.

With just a single phrase, detection is rather chancy, but this:

import codecs
import chardet

chardet.detect_all(codecs.encode("Ãberarbeitung A6 Heft", encoding="latin1"))

produces:

[
    {'encoding': 'ISO-8859-9', 'confidence': 0.8516926783630553, 'language': 'Turkish'}, 
    {'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}, 
    {'encoding': 'MacRoman', 'confidence': 0.494516129032258, 'language': ''}
]

We already know it's not Latin1, and I highly doubt it's MacRoman.

UTF-8 would encode either Ã or Ü as two bytes, while all the characters in the US-ASCII range keep the same byte values as US-ASCII:

>>> codecs.encode("Ãberarbeitung A6 Heft")  # codecs.encode defaults to UTF-8
b'\xc3\x83berarbeitung A6 Heft'
>>> codecs.encode("Überarbeitung A6 Heft")
b'\xc3\x9cberarbeitung A6 Heft'