Efficient mapping of a large pandas dataframe (by index)


I'm currently optimising my code and I have found a bottleneck. I have a dataframe df with a column 'Numbers' containing integers from 1 to 100. I would like to map those numbers through a lookup table. I know that I can use the .map() or .replace() functions, but both seem slow and neither takes into account that the values in 'Numbers' are direct indices into my lookup table (here a plain list), i.e. I would like to perform the following:

dict_simple = []
for i in range(100):
    dict_simple.append('a' + str(i))

df['Numbers_with_a'] = df['Numbers'].apply(lambda x: dict_simple[x])

Unfortunately, the apply function is also very slow. Is there any other way to do this faster? The dataframe has 50M+ records.

I have tried the .map(), .replace() and .apply() functions from the pandas package, but performance is very poor. I would like to improve the calculation time.
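Roughly, the .map() and .replace() variants I tried looked like this (a sketch; dict_lookup is a hypothetical name, and its keys are assumed to match the integers in 'Numbers'):

# hypothetical reconstruction of the slow attempts
dict_lookup = {i: 'a' + str(i) for i in range(100)}

df['Numbers_with_a'] = df['Numbers'].map(dict_lookup)      # slow on 50M rows
df['Numbers_with_a'] = df['Numbers'].replace(dict_lookup)  # also slow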


There are 4 answers below.

Answer from tdelaney (score 2):

A pandas.Series has an index that can be used to map one value to another natively in pandas, without the extra expense of calling apply for each row or converting values to Python int. Since the numbers you want to map start from zero and a Series is indexed from 0 by default, you can:

import pandas as pd

df = pd.DataFrame({"numbers":[1,4,22,7,99]})
str_map = pd.Series([f'a{i}' for i in range(100)])
df['numbers_with_a'] = str_map.iloc[df.numbers].reset_index(drop=True)
print(df)

str_map is a Series created from your "a0"... strings. str_map.iloc[df.numbers] uses your numbers as indices, giving you a new Series of the mapped values. That series is indexed by your numbers, so you drop that index and assign the result back to the original dataframe.
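An equivalent variant (a sketch, not benchmarked) drops the index by converting to a NumPy array instead of calling reset_index:

# .to_numpy() discards the Series index, so no alignment
# happens when the result is assigned back to df
df['numbers_with_a'] = str_map.iloc[df.numbers].to_numpy()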

Answer from yashaswi k (score 0):

Convert your list to a NumPy array and map the values as below:

import numpy as np

dict_simple = []
for i in range(100):
    dict_simple.append('a' + str(i))

dict_array = np.array(dict_simple)
df['Numbers_with_a'] = dict_array[df['Numbers'].values]
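For example, on a small frame (a sketch; the expected output is shown in comments):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Numbers': [1, 4, 22, 7, 99]})
dict_array = np.array(['a' + str(i) for i in range(100)])
df['Numbers_with_a'] = dict_array[df['Numbers'].values]
print(df)
#    Numbers Numbers_with_a
# 0        1             a1
# 1        4             a4
# 2       22            a22
# 3        7             a7
# 4       99            a99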

Answer from wojteka (score 2):

Thanks for all the answers. I have done some comparisons:

import pandas as pd
import time
import numpy as np

df=pd.DataFrame(np.random.randint(1,10,size=(10000000,1)), columns=list('N'))

dict_dictionary={}
dict_list=[]
for i in range(10):
    dict_dictionary[i]='a' + str(i)
    dict_list.append('a' + str(i))
dict_array=np.array(dict_list)
dict_series=pd.Series([f'a{i}' for i in range(10)])

print('map')
start_time=time.time()
df['Numbers_map']=df['N'].map(dict_dictionary)
print(time.time()-start_time)

print('replace')
start_time=time.time()
df['Numbers_replace']=df['N'].replace(dict_dictionary)
print(time.time()-start_time)

print('array')
start_time=time.time()
df['Numbers_array']=dict_array[df['N'].values]
print(time.time()-start_time)

print('series')
start_time=time.time()
df['Numbers_series']=dict_series.iloc[df.N].reset_index(drop=True)
print(time.time()-start_time)

print('end')

Results are as follows:

map
1.424480676651001
replace
3.657830238342285
array
1.4687621593475342
series
0.4687619209289551
end

"replace" gains some performance for small dictionaries, but overall approach with series is the fastest.

Answer from wojteka (score 0):

I have updated numpy and pandas to the newest versions, and right now "map" is very close to the "series" approach. Computing time depends on the size of the dictionary: sometimes "map" is better and sometimes "series". I have also tried parallel pandas to use all CPUs, but performance was worse than "map". Quite probably p_map performance is worse because I'm using only 2 CPUs.

import pandas as pd
from parallel_pandas import ParallelPandas
import time
import numpy as np
ParallelPandas.initialize(n_cpu=2, split_factor=2, disable_pr_bar=True)

df=pd.DataFrame(np.random.randint(1,99,size=(1000000,1)), columns=list('N'))

dict_dictionary={}
dict_list=[]
for i in range(100):
    dict_dictionary[i]='a' + str(i)
    dict_list.append('a' + str(i))
dict_array=np.array(dict_list)
dict_series=pd.Series([f'a{i}' for i in range(100)])
print('p_map')
start_time=time.time()
df['Numbers_p_map']=df['N'].p_map(dict_dictionary)
print(time.time()-start_time)
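
For reference, the matching timings for the plain "map" and "series" approaches on the same frame follow the same pattern as the earlier benchmark (a sketch reusing dict_dictionary and dict_series from above):

print('map')
start_time=time.time()
df['Numbers_map']=df['N'].map(dict_dictionary)
print(time.time()-start_time)

print('series')
start_time=time.time()
df['Numbers_series']=dict_series.iloc[df.N].reset_index(drop=True)
print(time.time()-start_time)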