I am using Pandas and ArcticDB, and I found that creating a library with library_options=LibraryOptions(dynamic_schema=True) allows new columns to be added with the append method. However, the outcome isn't what I anticipated.
Import necessary libraries and create the library:
from arcticdb import Arctic
from arcticdb import LibraryOptions
import pandas as pd
ac = Arctic('lmdb://D:/Database?map_size=100GB')
if 'test' not in ac.list_libraries():
    ac.create_library(name='test', library_options=LibraryOptions(dynamic_schema=True))
Generate data and write it to the library:
df = pd.DataFrame(
    {'column_1': range(1, 6)},
    index=pd.date_range(start='1/1/2018', end='1/5/2018', name='date')
).reset_index()
lib = ac['test']
lib.write('symbol', df, prune_previous_versions=True)
The dataframe I want to add the new column (column_2) to, alongside column_1:
date column_1
0 2018-01-01 1
1 2018-01-02 2
2 2018-01-03 3
3 2018-01-04 4
4 2018-01-05 5
Generate the data for column_2:
to_append_df = pd.DataFrame(
    {'column_2': range(6, 11)},
    index=pd.date_range(start='1/1/2018', end='1/5/2018', name='date')
).reset_index()
The resulting dataframe:
date column_2
0 2018-01-01 6
1 2018-01-02 7
2 2018-01-03 8
3 2018-01-04 9
4 2018-01-05 10
I use lib.append('symbol', to_append_df) to append it to the symbol. I expected the result to be:
date column_1 column_2
0 2018-01-01 1 6
1 2018-01-02 2 7
2 2018-01-03 3 8
3 2018-01-04 4 9
4 2018-01-05 5 10
However, I receive:
date column_2 column_1
0 2018-01-01 0 1
1 2018-01-02 0 2
2 2018-01-03 0 3
3 2018-01-04 0 4
4 2018-01-05 0 5
5 2018-01-01 6 0
6 2018-01-02 7 0
7 2018-01-03 8 0
8 2018-01-04 9 0
9 2018-01-05 10 0
I get this because append only accepts data whose first index value is equal to or greater than the last index value of the existing data, so the new rows are stacked below the old ones instead of being joined column-wise (no surprise there).
I can read the data back with df = lib.read("symbol").data, use pandas.merge to combine df and to_append_df, and then use write to store the merged result in the database. But with large datasets, reading from the library and merging consume considerable memory and time. Columns are added continuously, so once the data outgrows available memory this approach is no longer possible. Is there a solution to this?
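For reference, the read-merge-write approach I'm currently using looks roughly like this (the merge options are illustrative):

# read the full symbol into memory
df = lib.read('symbol').data
# combine the existing data with the new column on the date column
merged = pd.merge(df, to_append_df, on='date', how='left')
# write the merged frame back, replacing the previous version
lib.write('symbol', merged, prune_previous_versions=True)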
This feature has not been added to ArcticDB yet, but you can track the status in the issue here.
For now, there's a workaround using lib.update on row batches to save reading the whole dataframe stored in the ArcticDB symbol into memory. Please read the notes below about using dynamic schema.
Steps
1. Read a batch of rows into memory using the lib.read('symbol', date_range=...) syntax.
2. Add the new column to that batch using the df[new_col] = ... syntax, rather than pd.concat or pd.merge, when adding/concatenating your new column, because we want to override anything that's already there.
3. Use lib.update to update the relevant date range for this row batch in the ArcticDB symbol.
Doing it in batches means you can concatenate a column to an ArcticDB symbol which is too large to fit in memory, as you describe.
Extras
- Your code uses .reset_index() to store a RangeIndex rather than a DatetimeIndex. Since ArcticDB only supports lib.update operations with datetime-indexed data, it is necessary to omit this .reset_index() step.
- The note in the issue about bucketize_dynamic being disabled by default is relevant. Since we do not fully support this yet, the default when dynamic schema is enabled is to stop "column slicing" the dataframe. This means that previously, for more than 127 columns, we would split the dataframe, e.g. 200 columns would be split into segments of up to [100k rows x 127 columns] and [100k rows x 73 columns]. With dynamic schema enabled, this currently looks like [100k rows x 200 columns]. For a dataframe with a large number of columns (aka "wide"), this will slow down reads when you ask for just a particular set of columns, compared to static schema, since we are forced to pull the whole segment containing all 200 columns (or 1M columns, etc.) from storage. Note that we then decode just the relevant columns, so this isn't as inefficient as it sounds. The bucketize_dynamic method would fix this but, as mentioned, it is not production-ready yet.
Code
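A minimal sketch of the batched approach described in the steps above. The batch boundaries below are illustrative, and it assumes the symbol was written with its DatetimeIndex kept (i.e. without the .reset_index() calls), as noted in the Extras:

import pandas as pd
from arcticdb import Arctic

ac = Arctic('lmdb://D:/Database?map_size=100GB')
lib = ac['test']

# New column data, kept on a DatetimeIndex so it aligns with the stored rows.
new_col = pd.DataFrame(
    {'column_2': range(6, 11)},
    index=pd.date_range(start='1/1/2018', end='1/5/2018', name='date')
)

# Illustrative batches; in practice each batch would cover a much larger date range.
batches = [
    (pd.Timestamp('2018-01-01'), pd.Timestamp('2018-01-03')),
    (pd.Timestamp('2018-01-04'), pd.Timestamp('2018-01-05')),
]

for start, end in batches:
    # 1. Read only this date range of the symbol into memory.
    chunk = lib.read('symbol', date_range=(start, end)).data
    # 2. Add the new column directly, overriding anything already there.
    chunk['column_2'] = new_col.loc[start:end, 'column_2']
    # 3. Update just this date range of the ArcticDB symbol.
    lib.update('symbol', chunk)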
Output: reading the symbol back with lib.read('symbol').data should now show column_1 and column_2 side by side on the date index, matching the expected result in the question.