Why does Python string concatenation work with Russian text but str.format() does not?

I'm trying to parse (and escape) rows of a CSV file that is stored in the Windows-1251 character encoding. Following an excellent answer on how to deal with this encoding, I've ended up with this one line to test the output. For some reason, this works:

print(row[0]+','+row[1])

Outputting:

Тяжелый Уборщик Обязанности,1 литр

While this line doesn't work:

print("{0},{1}".format(*row))

Outputting this error:

Name,Variant

Traceback (most recent call last):
  File "Russian.py", line 26, in <module>
    print("{0},{1}".format(*row))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

Here are the first 2 lines of the CSV:

Name,Variant
Тяжелый Уборщик Обязанности,1 литр

and in case it helps, here is the full source of Russian.py:

import csv
import cgi
from chardet.universaldetector import UniversalDetector
chardet_detector = UniversalDetector()

def charset_detect(f, chunk_size=4096):
    global chardet_detector
    chardet_detector.reset()
    while 1:
        chunk = f.read(chunk_size)
        if not chunk: break
        chardet_detector.feed(chunk)
        if chardet_detector.done: break
    chardet_detector.close()
    return chardet_detector.result

with open('Russian.csv', 'rb') as csv_file:  # Python 2's csv module expects a binary-mode file
    cd_result = charset_detect(csv_file)
    encoding = cd_result['encoding']
    csv_file.seek(0)
    csv_reader = csv.reader(csv_file)
    for bytes_row in csv_reader:
        row = [x.decode(encoding) for x in bytes_row]
        if len(row) >= 6:
            #print(row[0]+','+row[1])
            print("{0},{1}".format(*row))

Zizouz212 On BEST ANSWER

The items in row are already unicode objects (you decoded them yourself), so concatenating them with ',' produces a unicode result and you don't hit the problem:

print(row[0]+','+row[1])
Тяжелый Уборщик Обязанности,1 литр

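But here the template "{0},{1}" is a plain byte string (str), so in Python 2 str.format() has to encode each unicode argument to bytes using the default ASCII codec. The Cyrillic characters can't be encoded that way, and that's why you get the UnicodeEncodeError.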

print("{0},{1}".format(*row))

So just change it to:

print(u"{0},{1}".format(*row))
Hetzroni On

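The + operator works fine between a unicode string and a str: the str operand is implicitly decoded as ASCII and the result is unicode. str.format(), on the other hand, has to return a str, so it implicitly encodes unicode arguments with the ASCII codec, which fails for Cyrillic text.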

Thus, you can simply replace the problematic line with the following:

print(u"{0},{1}".format(*row))

That should do the trick.

Martijn Pieters On

You are using str.format(), which implicitly converts unicode() arguments to str() using the ASCII codec. It has to do so to be able to interpolate the values into the byte string template provided.

Use unicode.format() instead:

print(u"{0},{1}".format(*row))

Note the u before the format literal. unicode.format() has to decode str inputs to fit in the resulting Unicode output.
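
As a rough illustration of that implicit step, the sketch below performs the same ASCII encoding explicitly, so it behaves the same on Python 2.7 and Python 3.3+. The helper name try_ascii_encode and the sample row (the values from the question's CSV, written with escapes) are assumptions made for the example, not part of the original script.

```python
# -*- coding: utf-8 -*-
# Sample row from the question's CSV, written with escapes so the
# source file encoding does not matter.
row = [
    u"\u0422\u044f\u0436\u0435\u043b\u044b\u0439",  # first word of the Name column
    u"1 \u043b\u0438\u0442\u0440",                  # the Variant column
]


def try_ascii_encode(text):
    """Hypothetical helper: encode text as ASCII, or report why it failed."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        return "failed: {0}".format(exc.reason)


# This is the conversion Python 2's str.format() attempts behind the scenes.
ascii_result = try_ascii_encode(row[0])

# A unicode template never needs that conversion, so formatting succeeds.
formatted = u"{0},{1}".format(*row)
```

Here ascii_result holds the failure message rather than encoded bytes, while formatted is a unicode string containing both fields.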

Concatenation, on the other hand, can implicitly decode the str operand to produce a final unicode() object as the result. Had your ',' value contained non-ASCII bytes, that implicit decoding would also fail.
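
The decoding direction can be sketched the same way. The snippet below does explicitly what Python 2 does implicitly when a byte string is concatenated with unicode; try_ascii_decode and the sample byte strings are hypothetical names chosen for illustration.

```python
# -*- coding: utf-8 -*-
# A plain ASCII separator, like the ',' in the question.
separator_ascii = b","

# A byte string holding one Cyrillic letter in Windows-1251,
# i.e. a single non-ASCII byte.
separator_cp1251 = u"\u0422".encode("cp1251")


def try_ascii_decode(raw):
    """Hypothetical helper: decode bytes as ASCII, or report why it failed."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        return u"failed: {0}".format(exc.reason)


ok = try_ascii_decode(separator_ascii)    # decodes cleanly
bad = try_ascii_decode(separator_cp1251)  # cannot be decoded as ASCII
```

The ASCII separator decodes without trouble, which is why the concatenation in the question works; the Windows-1251 byte does not.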

Moral of the story: use Unicode string literals throughout your code when handling text.
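
For comparison, here is a minimal sketch of the same "decode once at the boundary" idea, assuming Python 3 (which the question does not use), where every string literal is already Unicode. The in-memory sample stands in for Russian.csv, and the encoding is hard-coded to cp1251 instead of being detected with chardet.

```python
import csv
import io

# Stand-in for the file on disk: the two CSV lines from the question,
# encoded as Windows-1251 bytes.
sample = "Name,Variant\r\nТяжелый Уборщик Обязанности,1 литр\r\n"
raw = sample.encode("cp1251")

# Decode at the input boundary; newline="" is what the csv module expects.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="cp1251", newline="")

rows = list(csv.reader(text_stream))

# Every field is already text, so a plain template works without a u prefix.
lines = ["{0},{1}".format(*row) for row in rows]
```

With a real file, open('Russian.csv', encoding='cp1251', newline='') would take the place of the TextIOWrapper line.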