I have a DataFrame test with shape (1138812, 57), and an array indices with shape (1138812, 25). indices is a 2D array: for each of the 1138812 rows of the DataFrame, the corresponding subarray holds 25 row indices into that same DataFrame. I want to create a new DataFrame with 25 x 1138812 rows based on this array. For example:
I have a 2D array, something like:
[[0,2,3],
[1,0,3],
[2,1,0],
[3,1,2]]
and I have a pandas DataFrame, something like:
id val
0 a 9
1 b 8
2 c 3
3 d 7
Now I want to get a new DataFrame based on the indexes listed in the 2D array; for this example, it would look like:
id val id_2 val_2
0 a 9 a 9
0 a 9 c 3
0 a 9 d 7
1 b 8 b 8
1 b 8 a 9
1 b 8 d 7
2 c 3 c 3
2 c 3 b 8
2 c 3 a 9
3 d 7 d 7
3 d 7 b 8
3 d 7 c 3
I tried many approaches including:
import pandas as pd
import numpy as np

index = [[0, 2, 3],
         [1, 0, 3],
         [2, 1, 0],
         [3, 1, 2]]
idse = ['a', 'b', 'c', 'd']
vals = [9, 8, 3, 7]
data = {'id': idse, 'val': vals}
df = pd.DataFrame(data=data)

# repeat every row of df once per entry in its index subarray
newdf = pd.DataFrame(np.repeat(df.values, len(index[0]), axis=0),
                     columns=df.columns)
# flatten the 2D index list and pull the matching rows as new columns
flat_list = [item for sublist in index for item in sublist]
newdf['id_2'] = df.id[flat_list].values
newdf['val_2'] = df.val[flat_list].values
and
fdf = pd.DataFrame()
for i, ir in enumerate(index):       # ir holds the indices for row i
    temp_df = df.iloc[ir].copy()     # rows selected by this subarray
    temp_df['id'] = df.iloc[i]['id']
    temp_df = pd.merge(df, temp_df, how="outer", on="id")
    temp_df = temp_df[temp_df['id'] == df.iloc[i]['id']]
    fdf = pd.concat([fdf, temp_df])
fdf
Both of them work the way I want, but they are very slow on the original DataFrame with 1.1M rows, and they take up all the RAM, which crashes the notebook. I am using the RAPIDS libraries (cuDF, cuPy, and cuML, which mirror pandas, numpy/scipy, and sklearn respectively), so I need a pure numpy/pandas solution that translates to those GPU libraries and makes this operation quicker and more memory-efficient.
Thanks
Assuming df and a are the input dataframe and array, you can repeat the indices of your dataframe and concat it with the dataframe indexed from the flattened array:
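A minimal sketch of that repeat-and-concat approach, using the question's small example as df and a (this reconstruction is an assumption about the exact code, not necessarily the answer's original):

import numpy as np
import pandas as pd

# example input from the question
a = np.array([[0, 2, 3],
              [1, 0, 3],
              [2, 1, 0],
              [3, 1, 2]])
df = pd.DataFrame({'id': ['a', 'b', 'c', 'd'], 'val': [9, 8, 3, 7]})

# repeat each row of df once per column of a, then concat with the
# rows selected by the flattened index array, suffixed with "_2"
out = pd.concat(
    [df.loc[np.repeat(df.index, a.shape[1])].reset_index(drop=True),
     df.loc[a.ravel()].add_suffix('_2').reset_index(drop=True)],
    axis=1,
)
print(out)

Everything here is a single vectorized indexing pass plus one concat, so there is no Python-level loop over the 1.1M rows.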
NB: a quick test shows that it takes about 900 ms to process 1M rows.
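Since you mention RAPIDS: as an untested assumption, the same pattern should translate almost directly to cuDF/CuPy, since take, reset_index, and concat exist there as well. A hypothetical port (gdf and a are placeholder names):

import cupy as cp
import cudf

# untested sketch: a as a CuPy index array, gdf as a cuDF DataFrame
gdf = cudf.DataFrame({'id': ['a', 'b', 'c', 'd'], 'val': [9, 8, 3, 7]})
a = cp.array([[0, 2, 3],
              [1, 0, 3],
              [2, 1, 0],
              [3, 1, 2]])

# repeat each row once per column of a, and take the rows selected
# by the flattened index array, renamed with a "_2" suffix
left = gdf.take(cp.repeat(cp.arange(len(gdf)), a.shape[1])).reset_index(drop=True)
right = gdf.take(a.ravel()).reset_index(drop=True)
right.columns = [f'{c}_2' for c in right.columns]
out = cudf.concat([left, right], axis=1)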