import pandas as pd
from flatten_json import flatten

actual_column_list = ["_id", "external_id", "email", "created_at", "updated_at", "dob.timestamp", "dob_1.timestamp", "column_10"]

data = [{'_id': '60efe3333333445', 'external_id': 'ID2', 'dob': {'timestamp': 412214400}, 'email': '[email protected]', 'created_at': 1626334203, 'updated_at': 1629338900},
        { 'external_id': 'ID3', '_id': '60efe3333333487', 'email': '[email protected]', 'created_at': 1626334203, 'updated_at': 1629338900, 'dob_1': {'timestamp': 'oops'}}]

df = pd.DataFrame(data=[flatten(row, ".") for row in data], dtype='str', columns=actual_column_list)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(df)

df.to_parquet("test.parquet", engine='fastparquet', compression="snappy", index=False)

Error displayed:

root = parquet_thrift.SchemaElement(name=b'schema',
AttributeError: module 'fastparquet.parquet_thrift' has no attribute 'SchemaElement'

Environment: Python 3.6.9, pyarrow=5.0.0, fastparquet=0.8.0, numpy=1.19.5, pandas=1.1.5. I tried the exact same snippet with Python 3.7.13, pyarrow=7.0.0, fastparquet=0.8.0, numpy=1.21.5, pandas=1.3.5 and it worked, but I need it to work with Python 3.6.9. I tried to install those newer versions explicitly under Python 3.6.9, but the dependencies failed to install.

What I want is to make the above code snippet compatible with Python 3.6.9.

Best answer:

Use fastparquet 0.7.2. Even though fastparquet 0.8.0 can be installed on Python 3.6, it looks like it requires a pyarrow version greater than 5.0.0 to function properly. I therefore had to downgrade fastparquet to 0.7.2 to make it compatible with pyarrow 5.0.0.
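
The downgrade itself is an ordinary pip pin, e.g. pip install "fastparquet==0.7.2". As a rough sketch of how to confirm that an environment matches the combination reported here, the snippet below compares a set of versions against those pins. The names KNOWN_GOOD, find_mismatches and installed_versions are made up for this illustration and are not part of pandas or fastparquet. The pins are simply the versions quoted in the question plus the fastparquet version from this answer, not an official compatibility matrix.

```python
# Hypothetical helper for illustration only; not part of any library.
# Pins: versions reported in the question, with fastparquet downgraded
# to 0.7.2 as suggested in this answer.
KNOWN_GOOD = {
    "fastparquet": "0.7.2",
    "pyarrow": "5.0.0",
    "pandas": "1.1.5",
    "numpy": "1.19.5",
}


def find_mismatches(installed, expected=KNOWN_GOOD):
    """Return {package: (installed_version, expected_version)} for each pin that differs."""
    mismatches = {}
    for name, want in expected.items():
        have = installed.get(name)  # None when the package is missing
        if have != want:
            mismatches[name] = (have, want)
    return mismatches


def installed_versions(names):
    """Look up installed versions via pkg_resources (ships with setuptools, usable on Python 3.6)."""
    import pkg_resources  # imported lazily so the comparison above has no dependencies

    found = {}
    for name in names:
        try:
            found[name] = pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            found[name] = None
    return found


if __name__ == "__main__":
    # The environment described in the question: only fastparquet differs.
    reported = {
        "fastparquet": "0.8.0",
        "pyarrow": "5.0.0",
        "pandas": "1.1.5",
        "numpy": "1.19.5",
    }
    print(find_mismatches(reported))
    # {'fastparquet': ('0.8.0', '0.7.2')}
```

To check a live environment instead of the hard-coded dictionary, pass installed_versions(KNOWN_GOOD) to find_mismatches.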

Note: This code snippet can be used to write a Parquet file in which every column is a string, including columns that contain only nulls. Those all-null columns are not converted to float, which is the default behavior when pandas uses pyarrow to save a DataFrame to Parquet.
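
To make the shape of the data concrete, here is a plain-Python sketch of the idea behind the snippet in the question: nested keys are joined with "." to form column names, every present value becomes a string, and columns absent from a row stay null. The functions flatten_sketch and to_string_row are hypothetical stand-ins written for this illustration. They do not reproduce the internals of flatten_json or pandas, and the sketch only handles nested dicts, not lists.

```python
def flatten_sketch(nested, sep="."):
    """Flatten nested dicts into a single level, joining keys with sep.

    Illustrative only: handles nested dicts, leaves any other value untouched.
    """
    flat = {}
    for key, value in nested.items():
        if isinstance(value, dict):
            for sub_key, sub_value in flatten_sketch(value, sep).items():
                flat[f"{key}{sep}{sub_key}"] = sub_value
        else:
            flat[key] = value
    return flat


def to_string_row(nested, columns):
    """Project a flattened row onto a fixed column list.

    Present values are converted to str; missing columns are kept as None
    so they remain null rather than turning into a numeric placeholder.
    """
    flat = flatten_sketch(nested)
    return {
        col: (None if flat.get(col) is None else str(flat[col]))
        for col in columns
    }


if __name__ == "__main__":
    columns = ["_id", "dob.timestamp", "dob_1.timestamp", "column_10"]
    row = {"_id": "60efe3333333445", "dob": {"timestamp": 412214400}}
    print(to_string_row(row, columns))
    # {'_id': '60efe3333333445', 'dob.timestamp': '412214400', 'dob_1.timestamp': None, 'column_10': None}
```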