I am trying to save dataframes iteratively into an HDF5 datastore. However, the size increase between the first and second saves is enormous: after the first dataframe the store is 6.8 MB, but after the second it jumps to 1.4 GB, even though both dataframes are only about 1.4 MB each when exported to CSV.
Could anyone shed light on why this happens and how it could be remedied?
if j == stueli_data_raster_spec.shape[0] - 1:
    print(f'{i[-6:]} module done')
    store_training_data = pd.HDFStore('dataframe_training_data.h5')
    with store_training_data as hdf_t:
        # only write this module's dataframe if its key is not already in the store
        if f'/train_df_{i[-6:]}' not in hdf_t.keys():
            training_data_df = pd.DataFrame.from_dict(training_data_dict).reset_index(drop=True)
            # note: mode='w' appears to have no effect here, since an already-open store is passed
            training_data_df.to_hdf(store_training_data, f'train_df_{i[-6:]}', mode='w')
        else:
            print(f'/train_df_{i[-6:]} already saved.')
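One thing I have been experimenting with (not sure yet whether it helps) is enabling compression on the write. A minimal standalone sketch of what I mean, assuming the blosc compressor is available and using a toy df in place of my real training_data_df:

import pandas as pd

# toy stand-in for training_data_df
df = pd.DataFrame({'a': range(1000), 'b': [0.5] * 1000})

# table format plus blosc compression; complevel/complib are standard
# pandas to_hdf options and apply to everything written by this call
df.to_hdf('dataframe_training_data.h5', key='train_df_example',
          format='table', complevel=9, complib='blosc')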
This is the output of printing the HDFStore object:
<class 'pandas.io.pytables.HDFStore'>
File path: dataframe_training_data.h5
/train_df_020030 frame (shape->[13861,30])
/train_df_020099 frame (shape->[11935,30])
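For anyone trying to reproduce this, a quick way to compare each dataframe's footprint in isolation would be something like the sketch below (report_footprint is a hypothetical helper; the df argument stands in for either of the two dataframes):

import os
import pandas as pd

def report_footprint(df, key, path='size_check.h5'):
    # write the dataframe alone to a fresh file and compare its on-disk
    # size with its in-memory size (deep=True counts object-column strings)
    if os.path.exists(path):
        os.remove(path)
    df.to_hdf(path, key=key)
    on_disk = os.path.getsize(path) / 1e6
    in_mem = df.memory_usage(deep=True).sum() / 1e6
    print(f'{key}: {on_disk:.1f} MB on disk, {in_mem:.1f} MB in memory')

If the second dataframe's in-memory size turns out to be far larger than its CSV export suggests (for instance because of object-dtype columns), that might account for some of the gap.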