Encoding in an in-memory stream, or how does TextIOBase work?


I am currently reading the documentation for the io module: https://docs.python.org/3.5/library/io.html?highlight=stringio#io.TextIOBase

Maybe it is because I don't know Python well enough, but in most cases I just don't understand their documentation.

I need to save the data in addresses_list to a CSV file and serve it to the user via HTTPS, so all of this must happen in memory. This is the code for it, and it currently works fine.

import csv
import io
import zipfile

addresses = Abonnent.objects.filter(exemplare__gt=0)
addresses_list = list(addresses.values_list(*fieldnames))

csvfile = io.StringIO()
csvwriter_unicode = csv.writer(csvfile)
csvwriter_unicode.writerow(fieldnames)

for a in addresses_list:
    csvwriter_unicode.writerow(a)
csvfile.seek(0)

export_data = io.BytesIO()
myzip = zipfile.ZipFile(export_data, "w", zipfile.ZIP_DEFLATED)
myzip.writestr("output.csv", csvfile.read())
myzip.close()
csvfile.close()
export_data.close()

# serve the file via https

Now the problem is that I need the content of the CSV file to be encoded in cp1252 and not in UTF-8. With a file on disk I would just write f = open("output.csv", "w", encoding="cp1252") and then dump all the data into it. But with in-memory streams it doesn't work that way: neither io.StringIO() nor io.BytesIO() takes an encoding= parameter.

This is where I have trouble understanding the documentation:

The text stream API is described in detail in the documentation of TextIOBase.

And the documentation of TextIOBase says this:

encoding=

The name of the encoding used to decode the stream’s bytes into strings, and to encode strings into bytes.

But io.StringIO(encoding="cp1252") just throws: TypeError: 'encoding' is an invalid keyword argument for this function.

So how can I use TextIOBase's encoding parameter with StringIO? Or how does this work in general? I am so confused.


Answer by Tom Dalton

StringIO deals only with strings/text. It doesn't know anything about encodings or bytes. The easiest way to do what you want is probably something like:

from io import StringIO

f = StringIO()
f.write("Some text")

# Old-ish way:
f.seek(0)
my_bytes = f.read().encode("cp1252")

# Alternatively
my_bytes = f.getvalue().encode("cp1252")
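
Applied to the CSV and zip case from the question, the same idea could look roughly like the sketch below. The field names and rows are made-up sample data standing in for the values_list() query result, so treat them as assumptions rather than part of the original code.

```python
import csv
import io
import zipfile

# Hypothetical sample data standing in for the query result in the question.
fieldnames = ["name", "city"]
addresses_list = [("Jörg Müller", "Köln"), ("Zoë Weiß", "Zürich")]

# Write the CSV as text first; StringIO knows nothing about encodings.
csvfile = io.StringIO()
writer = csv.writer(csvfile)
writer.writerow(fieldnames)
writer.writerows(addresses_list)

# Encode once, at the boundary between text and bytes.
csv_bytes = csvfile.getvalue().encode("cp1252")

export_data = io.BytesIO()
with zipfile.ZipFile(export_data, "w", zipfile.ZIP_DEFLATED) as myzip:
    myzip.writestr("output.csv", csv_bytes)

# Grab the zip content before closing export_data; close() discards the buffer.
zip_bytes = export_data.getvalue()
```

Because writestr() accepts bytes as well as str, the archive member then contains exactly the cp1252 bytes. A character that cp1252 cannot represent would raise UnicodeEncodeError at the encode() step unless an errors= argument is passed.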
Answer by farax

Reading text from io.BytesIO (an in-memory stream) using io.TextIOWrapper, including encoding and error handling (Python 3).

This does what io.StringIO can't.

Sample code:

>>> import io
>>> import chardet
>>> # my bytes, single german umlaut
... bts = b'\xf6'
>>> 
>>> # try reading as utf-8 text and on error replace
... my_encoding = 'utf-8'
>>> fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='replace')
>>> fh.read()
'�'
>>> 
>>> # try reading as utf-8 text with strict error handling
... fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='strict')
>>> # catch exception
... try:
...     fh.read()
... except UnicodeDecodeError as err:
...     print('"%s"' % err)
...     # try to get encoding
...     my_encoding = chardet.detect(err.object)['encoding']
...     print("correct encoding is %s" % my_encoding)
... 
"'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte"
correct encoding is windows-1252
>>> # retry with detected encoding
... fh_bytes = io.BytesIO(bts)
>>> fh = io.TextIOWrapper(fh_bytes, encoding=my_encoding, errors='strict')
>>> fh.read()
'ö'
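
The same wrapper also works in the writing direction, which is closer to what the question asks for. The following is only a minimal sketch with made-up sample rows: io.TextIOWrapper is the class that accepts encoding=, and it is layered over a BytesIO so the encoded bytes end up in memory.

```python
import csv
import io

buffer = io.BytesIO()

# TextIOWrapper takes the encoding; newline="" leaves line endings to the csv module.
text = io.TextIOWrapper(buffer, encoding="cp1252", newline="")

writer = csv.writer(text)
writer.writerow(["name", "city"])  # hypothetical sample rows
writer.writerow(["Jörg", "Köln"])

# Push any buffered text down into the BytesIO, then detach the wrapper
# so that discarding it later does not close the underlying buffer.
text.flush()
text.detach()

csv_bytes = buffer.getvalue()
```

Here csv_bytes holds the cp1252-encoded CSV, so it can be handed to something that expects bytes, such as zipfile's writestr().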