I get the error below when writing a Dask dataframe to S3, and I can't figure out why. Does anybody know how to fix it?
import dask.dataframe as dd

dd.from_pandas(pred, npartitions=npart).to_parquet(out_path)
The error is:

Error converting column "team_nm" to bytes using encoding UTF8. Original error: bad argument type for built-in operation
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/fastparquet/writer.py", line 175, in convert
    out = array_encode_utf8(data)
  File "fastparquet/speedups.pyx", line 60, in fastparquet.speedups.array_encode_utf8
TypeError: bad argument type for built-in operation
During handling of the above exception, another exception occurred:
I tried encoding the "team_nm" column to "latin-1" before writing to parquet, but it doesn't work:
pred['team_nm'] = pred['team_nm'].str.encode("Latin-1")
I also tried upgrading fastparquet from 0.4.1 to 0.7.1, but that didn't help either.
Parquet assumes UTF-8 encoding for text and no other encoding is possible, so if your text is in some other encoding, writing will fail. If you encode the column to bytes yourself, you can indeed choose any encoding you like, so long as whatever reads the data back is prepared to do the decoding manually too.
If you have a column of bytes (because you encoded manually), then fastparquet will generally be able to guess this unless your column starts with some NULL/None values. To help it along, you can use the argument
object_encoding='bytes' (all object columns are interpreted as bytes) or object_encoding={'team_nm': 'bytes'} (only the one specific column, if it is known to be bytes).
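Putting the two together, something like the sketch below should work. It reuses the pred, npart and out_path names from your question, and relies on dask's to_parquet forwarding extra keyword arguments (here object_encoding) to the fastparquet writer:

import dask.dataframe as dd

# Encode the text column to bytes yourself, with whatever encoding you need.
pred['team_nm'] = pred['team_nm'].str.encode("latin-1")

# Tell fastparquet explicitly that this object column holds bytes,
# so it does not try to UTF-8-encode it again.
dd.from_pandas(pred, npartitions=npart).to_parquet(
    out_path,
    engine="fastparquet",
    object_encoding={"team_nm": "bytes"},
)

# Whatever reads the file back gets raw bytes and must decode them itself,
# e.g. df['team_nm'].str.decode("latin-1")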