I am using Pandas and ArcticDB, and I found that creating a library with library_options=LibraryOptions(dynamic_schema=True) allows new columns to be added with the append method. However, the outcome isn't what I anticipated.
Import necessary libraries and create the library:
from arcticdb import Arctic
from arcticdb import LibraryOptions
import pandas as pd
ac = Arctic('lmdb://D:/Database?map_size=100GB')
if 'test' not in ac.list_libraries():
    ac.create_library(name='test', library_options=LibraryOptions(dynamic_schema=True))
Generate data and write it to the library:
df = pd.DataFrame(
    {'column_1': range(1, 6)},
    index=pd.date_range(start='1/1/2018', end='1/5/2018', name='date')
).reset_index()
lib = ac['test']
lib.write('symbol', df, prune_previous_versions=True)
The dataframe I want to add the new column (column_2) to, alongside column_1:
date column_1
0 2018-01-01 1
1 2018-01-02 2
2 2018-01-03 3
3 2018-01-04 4
4 2018-01-05 5
Generate the data for column_2:
to_append_df = pd.DataFrame(
    {'column_2': range(6, 11)},
    index=pd.date_range(start='1/1/2018', end='1/5/2018', name='date')
).reset_index()
The resulting dataframe:
date column_2
0 2018-01-01 6
1 2018-01-02 7
2 2018-01-03 8
3 2018-01-04 9
4 2018-01-05 10
I use lib.append('symbol', to_append_df) to append it to the symbol. I expected the result to be:
date column_1 column_2
0 2018-01-01 1 6
1 2018-01-02 2 7
2 2018-01-03 3 8
3 2018-01-04 4 9
4 2018-01-05 5 10
However, I receive:
date column_2 column_1
0 2018-01-01 0 1
1 2018-01-02 0 2
2 2018-01-03 0 3
3 2018-01-04 0 4
4 2018-01-05 0 5
5 2018-01-01 6 0
6 2018-01-02 7 0
7 2018-01-03 8 0
8 2018-01-04 9 0
9 2018-01-05 10 0
I get this because append only accepts data whose first index value is equal to or greater than the last index value of the existing data, so the new rows are stacked below the old ones instead of being joined column-wise (no surprise there).
I can read the data back with df = lib.read("symbol").data, use pandas.merge to combine df and to_append_df, and then use write to store the merged result in the database. But with large datasets, reading from the library and merging consume considerable memory and time. Columns are added continuously, so once the data outgrows available memory this approach is no longer possible. Is there a solution to this?
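For reference, the read-merge-write approach I'm currently using looks roughly like this (the merge options are illustrative):

# read the full symbol into memory
df = lib.read('symbol').data
# combine the existing data with the new column on the date column
merged = pd.merge(df, to_append_df, on='date', how='left')
# write the merged frame back, replacing the previous version
lib.write('symbol', merged, prune_previous_versions=True)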
This feature has not been added to ArcticDB yet, but you can track the status in the issue here.
For now, there's a workaround using lib.update on row batches to save reading the whole dataframe stored in the ArcticDB symbol into memory. Please read the notes below about using dynamic schema.
Steps
1. Read a batch of rows into memory using the lib.read('symbol', date_range=...) syntax.
2. Add the new column to that batch using the df[new_col] = ... syntax, rather than pd.concat or pd.merge, when adding/concatenating your new column, because we want to override anything that's already there.
3. Use lib.update to update the relevant date range for this row batch in the ArcticDB symbol.
Doing it in batches means you can concatenate a column to an ArcticDB symbol which is too large to fit in memory, as you describe.
Extras
- Your code uses .reset_index() to store a RangeIndex rather than a DatetimeIndex. Since ArcticDB only supports lib.update operations with datetime-indexed data, it is necessary to omit this .reset_index() step.
- The note in the issue about bucketize_dynamic being disabled by default is relevant. Since we do not fully support this yet, the default when dynamic schema is enabled is to stop "column slicing" the dataframe. This means that previously, for more than 127 columns, we would split the dataframe, e.g. 200 columns would be split into segments of up to [100k rows x 127 columns] and [100k rows x 73 columns]. With dynamic schema enabled, this currently looks like [100k rows x 200 columns]. For a dataframe with a large number of columns (aka "wide"), this will slow down reads when you ask for just a particular set of columns, compared to static schema, since we are forced to pull the whole segment containing all 200 columns (or 1M columns, etc.) from storage. Note that we then decode just the relevant columns, so this isn't as inefficient as it sounds. The bucketize_dynamic method would fix this but, as mentioned, it is not production-ready yet.
Code
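A minimal sketch of the batched approach described in the steps above. The batch boundaries below are illustrative, and it assumes the symbol was written with its DatetimeIndex kept (i.e. without the .reset_index() calls), as noted in the Extras:

import pandas as pd
from arcticdb import Arctic

ac = Arctic('lmdb://D:/Database?map_size=100GB')
lib = ac['test']

# New column data, kept on a DatetimeIndex so it aligns with the stored rows.
new_col = pd.DataFrame(
    {'column_2': range(6, 11)},
    index=pd.date_range(start='1/1/2018', end='1/5/2018', name='date')
)

# Illustrative batches; in practice each batch would cover a much larger date range.
batches = [
    (pd.Timestamp('2018-01-01'), pd.Timestamp('2018-01-03')),
    (pd.Timestamp('2018-01-04'), pd.Timestamp('2018-01-05')),
]

for start, end in batches:
    # 1. Read only this date range of the symbol into memory.
    chunk = lib.read('symbol', date_range=(start, end)).data
    # 2. Add the new column directly, overriding anything already there.
    chunk['column_2'] = new_col.loc[start:end, 'column_2']
    # 3. Update just this date range of the ArcticDB symbol.
    lib.update('symbol', chunk)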
Output: reading the symbol back with lib.read('symbol').data should now show column_1 and column_2 side by side on the date index, matching the expected result in the question.