I am using AWS Batch to run a Python script with a few modules that run in parallel (in a Docker container stored on AWS ECR). When I manually invoke the script on a 16-core Linux machine, I see 16 Python processes executing the code in parallel.
Hoping to speed up the run further, I wanted to use AWS Batch to run the same script, scaling up to 64 cores. However, this approach spins up only one Python process, which is obviously slower than my initial approach.
Other details: the parallel Python method I am running is pairwise_distances (built on the joblib library). I built the Docker image on a Windows 10 machine, pushed it to ECR, and ran it via AWS Batch.
Am I missing something critical needed to invoke Python's parallel backend, or is there a Docker configuration setting that I failed to set? Thanks a lot in advance for your help.
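A quick way to check (a minimal diagnostic sketch; the counts it prints depend entirely on the Batch compute environment) is to log how many CPUs the container actually sees before calling pairwise_distances:

import os
import joblib

# If AWS Batch allotted only 1 vCPU to the job, n_jobs=-1 resolves to a single worker.
print("os.cpu_count():", os.cpu_count())
print("CPU affinity:", len(os.sched_getaffinity(0)))  # Linux only
print("joblib.cpu_count():", joblib.cpu_count())

If this prints 1 inside the container, the problem is the resources granted to the job, not the Python code.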
Sample Python Code: script.py
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

X = pd.DataFrame(np.random.randint(0, 100, size=(1000, 4)), columns=list('ABCD'))
Y = pd.DataFrame(np.random.randint(0, 100, size=(10000, 4)), columns=list('ABCD'))

# n_jobs=-1 asks joblib to fan out across every core it can detect
output = pd.DataFrame(
    pairwise_distances(
        X.to_numpy(), Y.to_numpy(),
        metric=lambda u, v: round((np.sum(np.minimum(u, v), axis=0) / np.sum(u, axis=0)) * 100, 2),
        n_jobs=-1,
    ),
    columns=Y.index,
    index=X.index,
)

output.to_csv('outputData.csv', sep=',', na_rep='', index=False)
Dockerfile:
FROM python:3.7
ADD script.py /
COPY requirements.txt /tmp/
RUN pip install -r /tmp/requirements.txt
CMD ["python", "./script.py"]
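Note that the cores a Batch job may use are granted by the job definition, not the Dockerfile. A hedged boto3 sketch of registering one (the name, sizes, and image URI are placeholders, not my actual setup):

import boto3

batch = boto3.client("batch")

# Hypothetical job definition; 'vcpus' is what caps the cores the container sees.
batch.register_job_definition(
    jobDefinitionName="pairwise-distances",  # placeholder name
    type="container",
    containerProperties={
        "image": "<account>.dkr.ecr.<region>.amazonaws.com/myrepo:latest",  # placeholder URI
        "vcpus": 64,      # cores the container is allowed to use
        "memory": 16384,  # MiB
        "command": ["python", "./script.py"],
    },
)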
requirements.txt:
pandas
numpy
scikit-learn
joblib
Does it make a difference if you wrap the code-to-be-parallelized in a joblib.Parallel() context manager?
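For instance, a minimal sketch using joblib's parallel_backend context manager (a sibling of joblib.Parallel; X, Y, and the metric below are stand-ins for the question's data). When n_jobs is left unset, scikit-learn's pairwise_distances inherits the worker count from the active backend:

import numpy as np
from joblib import parallel_backend
from sklearn.metrics.pairwise import pairwise_distances

X = np.random.randint(0, 100, size=(100, 4))
Y = np.random.randint(0, 100, size=(1000, 4))

# Explicitly request the process-based (loky) backend with a fixed worker count;
# inside the context, pairwise_distances uses n_jobs=4 since none is passed.
with parallel_backend("loky", n_jobs=4):
    D = pairwise_distances(X, Y, metric=lambda u, v: np.abs(u - v).sum())

If forcing the backend this way restores multi-process execution, the issue is joblib's CPU auto-detection inside the container rather than the Batch setup.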