%%time
import numpy as np
import pandas as pd
from datetime import datetime
from functools import partial
from multiprocessing import Pool

def get_id(x):
    return x['motor_id']

def parallel_load(df, get_id):
    # Runs inside each worker process; prints when the worker actually starts working.
    print('parallel_load: ', datetime.now())
    return df.apply(lambda x: get_id(x), axis=1)

def parallel_df(df, func, n_cores):
    # Split the DataFrame into n_cores chunks and process them in a worker pool.
    print('start: ', datetime.now())
    df_split = np.array_split(df, n_cores)
    print('splitted: ', datetime.now())
    pool = Pool(n_cores)
    # df = pd.concat(pool.map(func, df_split))
    s = datetime.now()
    print('pool start: {}'.format(s))
    data = pool.map(func, df_split)
    print('pool end: {}, {}s'.format(datetime.now(), (datetime.now() - s).seconds))
    pool.close()
    pool.join()
    print('pool join: {}s'.format((datetime.now() - s).seconds))
    return data

new_func = partial(parallel_load, get_id=get_id)
ndf = pd.DataFrame({'motor_id': np.arange(10000)})
parallel_df(ndf, new_func, 4)
With the sample code above, I get this result:
start: 2024-01-02 16:58:09.390751
splitted: 2024-01-02 16:58:09.391897
parallel_load: parallel_load: parallel_load: parallel_load: parallel_load: parallel_load: 2024-01-02 16:58:12.453533parallel_load: parallel_load: 2024-01-02 16:58:12.456901parallel_load: parallel_load:
2024-01-02 16:58:12.456053 2024-01-02 16:58:12.4578302024-01-02 16:58:12.457674 2024-01-02 16:58:12.457189
2024-01-02 16:58:12.457663
2024-01-02 16:58:12.4583492024-01-02 16:58:12.458657
2024-01-02 16:58:12.457537
pool start: 2024-01-02 16:58:12.448087
pool end: 2024-01-02 16:58:12.504469, 0s
pool join: 0s
CPU times: user 52.5 ms, sys: 3.48 s, total: 3.53 s
Wall time: 3.39 s
As you can see, it seems to take about 3 seconds for each process to start running the actual code once the pool is set up and pool.map is called (compare the timestamps of "splitted" and "parallel_load").
Is there a way to shorten this time? pool.map does not seem well suited to small jobs like this, since the startup overhead is much larger than the actual processing time.
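One experiment that might help pin down where the 3 seconds go is to create the Pool once and reuse it for several map calls: if only the first call is slow, the cost is worker startup (process creation and, with the spawn start method, re-importing the parent module) rather than pool.map itself, and a long-lived pool would amortize it. Below is a rough sketch of that experiment, reusing the same get_id / parallel_load setup as above; the conclusion is an assumption to verify, not a measured result.

import numpy as np
import pandas as pd
from datetime import datetime
from functools import partial
from multiprocessing import Pool

def get_id(x):
    return x['motor_id']

def parallel_load(df, get_id):
    return df.apply(lambda x: get_id(x), axis=1)

if __name__ == '__main__':
    func = partial(parallel_load, get_id=get_id)
    ndf = pd.DataFrame({'motor_id': np.arange(10000)})
    chunks = np.array_split(ndf, 4)

    # Worker processes are started here, so this line should absorb
    # most of the one-time startup cost.
    with Pool(4) as pool:
        for i in range(3):
            s = datetime.now()
            pool.map(func, chunks)
            # If only the first iteration is slow, the overhead is
            # process startup / pickling, not the map call itself.
            print('map call {}: {}'.format(i, datetime.now() - s))

If the pool can be kept alive for the lifetime of the program instead of being created per call, the per-call cost of pool.map should drop to roughly the time needed to pickle the chunks and run the apply.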