Binary format for storing multiple pandas DataFrames with different columns, widths, and row counts

I have about 200 pandas DataFrames. Each one has some unique columns, and some have completely different columns. For example:

import pandas as pd

df1 = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Orange', 'Mango'],
    'Quantity': [10, 15, 12, 8],
    'Price': [2.5, 1.5, 2, 3],
    'Category': ['Fruit', 'Fruit', 'Fruit', 'Fruit']
})
df2 = pd.DataFrame({
    'Student Name': ['John', 'Emma', 'Lisa', 'Tom'],
    'Age': [18, 17, 19, 18],
    'Grade': ['A', 'B', 'A', 'B'],
    'City': ['New York', 'London', 'Paris', 'Sydney']
})
df3 = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'Company': ['AAPL', 'GOOG', 'AMZN', 'MSFT'],
    'Price': [132.69, 1760.33, 3187.50, 215.41]
})
# and many more

I thought I could simply switch to Parquet and write everything into one folder, but it turns out this does not work when the Parquet files have different schemas (I haven't tried it yet, so I may be wrong).

I have already read this post: Storing multiple dataframes of different widths with Parquet?

So, which formats allow storing multiple DataFrames in one file, other than Excel?

Note: I'm looking into to_orc() and the ORC format, but I don't know whether I can merge different schemas and drop the NA values.

Note 2: this may not be an answerable question, but pointers to relevant topics and links would help.

There is 1 best solution below

So, which formats allow storing multiple DataFrames in one file, other than Excel?

You can use HDF5. Install PyTables first with pip install tables, then write each DataFrame under its own key:

with pd.HDFStore('dataframes.hdf') as store:
    df1.to_hdf(store, key='df1')
    df2.to_hdf(store, key='df2')
    df3.to_hdf(store, key='df3')

Check:

>>> store = pd.HDFStore('dataframes.hdf')
>>> store.keys()
['/df1', '/df2', '/df3']

>>> print(store.info())
<class 'pandas.io.pytables.HDFStore'>
File path: dataframes.hdf
/df1            frame        (shape->[4,4])
/df2            frame        (shape->[4,4])
/df3            frame        (shape->[1,3])
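
Reading the frames back is the other half of the round trip. The following is a minimal sketch, assuming PyTables is installed; the two small frames and the temporary file path are illustrative stand-ins and not the asker's real data. It writes frames with unrelated schemas under separate keys, then loads them all into a dict and loads one directly with pd.read_hdf.

```python
import os
import tempfile

import pandas as pd

# Two small frames with unrelated schemas, mirroring the question's setup.
df1 = pd.DataFrame({"Product": ["Apple", "Banana"], "Quantity": [10, 15]})
df3 = pd.DataFrame({"Company": ["AAPL", "GOOG"], "Price": [132.69, 1760.33]})

# Hypothetical location: a temporary directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "dataframes.hdf")

# Write each frame under its own key; the schemas never need to match.
with pd.HDFStore(path, mode="w") as store:
    store.put("df1", df1)
    store.put("df3", df3)

# Read everything back into a dict keyed by the stored name.
with pd.HDFStore(path, mode="r") as store:
    frames = {key.lstrip("/"): store[key] for key in store.keys()}

# A single frame can also be loaded without opening the store explicitly.
df1_again = pd.read_hdf(path, key="df1")
```

Using the store as a context manager closes the file automatically, which avoids leaving an open handle behind as a bare pd.HDFStore(...) call would.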