My data looks like this:
id1,id2,similarity
CHEMBL1,CHEMBL1,1
CHEMBL2,CHEMBL1,0.18
CHEMBL3,CHEMBL1,0.56
CHEMBL4,CHEMBL1,0.64
CHEMBL5,CHEMBL1,0.12
CHEMBL1,CHEMBL2,0.18
CHEMBL2,CHEMBL2,1
CHEMBL3,CHEMBL2,0.26
CHEMBL4,CHEMBL2,0.78
CHEMBL5,CHEMBL2,0.33
CHEMBL1,CHEMBL3,0.56
CHEMBL2,CHEMBL3,0.26
CHEMBL3,CHEMBL3,1
CHEMBL4,CHEMBL3,0.04
CHEMBL5,CHEMBL3,0.85
CHEMBL1,CHEMBL4,0.64
CHEMBL2,CHEMBL4,0.78
CHEMBL3,CHEMBL4,0.04
CHEMBL4,CHEMBL4,1
CHEMBL5,CHEMBL4,0.49
CHEMBL1,CHEMBL5,0.12
CHEMBL2,CHEMBL5,0.33
CHEMBL3,CHEMBL5,0.85
CHEMBL4,CHEMBL5,0.49
CHEMBL5,CHEMBL5,1
The whole file is around 197 million lines (10 GB). My goal is to compare the distributions of column 3 (similarity) for each compound in column 1. After a lot of refactoring I ended up with this piece of code:
import pandas as pd
from scipy.stats import ks_2samp
import re

with open('example.csv', 'r') as f, open('Metrics.tsv', 'a') as f_out:
    f_out.write('compound_1' + '\t' + 'compound_2' + '\t' + 'Similarity' + '\t' + 'KS Distance' + '\n')
    df = pd.read_csv(f, delimiter = ',', lineterminator = '\n', header = None)
    d = {}
    l_id1 = []
    l_id2 = []
    l_sim = []
    uniq_comps = df.iloc[:, 0].unique().tolist()
    for i in uniq_comps:
        d[i] = []
    for j in range(df.shape[0]):
        d[df.iloc[j, 0]].append(df.iloc[j, 2])
        l_id1.append(df.iloc[j, 0])
        l_id2.append(df.iloc[j, 1])
        l_sim.append(df.iloc[j, 2])
    for k in range(len(l_id1)):
        sim = round(l_sim[k]*100, 0)/100
        ks = re.findall(r"statistic=(.*)\,.*$", str(ks_2samp(d[l_id1[k]], d[l_id2[k]])))
        f_out.write(l_id1[k] + '\t' + l_id2[k] + '\t' + str(sim) + '\t' + str(''.join(ks)) + '\n')
which runs, but, as expected, is extremely slow. Does anyone have ideas on how it could be made faster? My desired output looks like this:
compound_1,compound_2,Similarity,KS Distance
CHEMBL1,CHEMBL1,1.0,0.0
CHEMBL2,CHEMBL1,0.18,0.4
CHEMBL3,CHEMBL1,0.56,0.2
CHEMBL4,CHEMBL1,0.64,0.2
CHEMBL5,CHEMBL1,0.12,0.4
CHEMBL1,CHEMBL2,0.18,0.4
CHEMBL2,CHEMBL2,1.0,0.0
CHEMBL3,CHEMBL2,0.26,0.2
CHEMBL4,CHEMBL2,0.78,0.4
CHEMBL5,CHEMBL2,0.33,0.2
CHEMBL1,CHEMBL3,0.56,0.2
CHEMBL2,CHEMBL3,0.26,0.2
CHEMBL3,CHEMBL3,1.0,0.0
CHEMBL4,CHEMBL3,0.04,0.2
CHEMBL5,CHEMBL3,0.85,0.2
CHEMBL1,CHEMBL4,0.64,0.2
CHEMBL2,CHEMBL4,0.78,0.4
CHEMBL3,CHEMBL4,0.04,0.2
CHEMBL4,CHEMBL4,1.0,0.0
CHEMBL5,CHEMBL4,0.49,0.2
CHEMBL1,CHEMBL5,0.12,0.4
CHEMBL2,CHEMBL5,0.33,0.2
CHEMBL3,CHEMBL5,0.85,0.2
CHEMBL4,CHEMBL5,0.49,0.2
CHEMBL5,CHEMBL5,1.0,0.0
Would it be wiser to run it in PySpark due to the size of the data? If so, how could a similar result be achieved?
Code check
There are some key points that should be highlighted and may improve the performance:

- The code builds several copies of the same data (the dictionary d plus the lists l_id1, l_id2, etc.). Avoid, whenever possible, having multiple copies of data; it undermines performance and makes the code harder to debug. Pandas can gather the rows per identifier directly with groupby.
- The result of ks_2samp exposes statistic and pvalue members; use them instead of casting the result into a string and then extracting the value with a regular expression (see the sketch below).
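For instance, a minimal sketch of reading the statistic directly from the result object (the two sample arrays here are made up for illustration):

from scipy.stats import ks_2samp

a = [0.18, 0.56, 0.64, 0.12, 1.0]   # hypothetical sample values
b = [0.18, 0.26, 0.78, 0.33, 1.0]

result = ks_2samp(a, b)
print(result.statistic, result.pvalue)   # no string conversion or regex needed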
Refactoring

As you seem to be able to open the CSV with pandas, I will assume the complete file fits in your memory; checking twice, the numerical data should fit within about 2 GB of RAM.
It is not clear what you want to compute. I will assume you want to collect the data by the id1 column, and then check whether the distributions are equivalent using the two-sample Kolmogorov-Smirnov test for each possible pair of identifiers. If this is not what you want to do, please update your post to detail what you intend to compute.

Let's create a trial DataFrame:
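A minimal sketch, assuming the sample above is saved as example.csv (it includes the header row id1,id2,similarity):

import pandas as pd

# Read the sample CSV; the header row provides the column names.
df = pd.read_csv('example.csv')
# For the real 10 GB file you may want usecols/dtype options or chunked reading.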
The trial DataFrame then simply contains the 25 sample rows shown above.
Once the DataFrame is created, it is easy to group the data by identifier using the groupby method. Then we can apply the statistical test to all possible pairs of identifiers. If we assemble everything in a generator and then apply that function to the DataFrame, it looks roughly like the sketch below.
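A sketch under the assumptions above (the function name ks_by_pair, the column names, and the use of itertools.product are mine, not from the original post):

from itertools import product

import pandas as pd
from scipy.stats import ks_2samp


def ks_by_pair(df):
    """Yield (id1, id2, KS statistic) for every ordered pair of identifiers,
    comparing the similarity values gathered per id1 group."""
    groups = {key: grp['similarity'].to_numpy() for key, grp in df.groupby('id1')}
    for id_a, id_b in product(groups, repeat=2):
        result = ks_2samp(groups[id_a], groups[id_b])
        # result.statistic is the two-sample KS distance; no string parsing needed.
        yield id_a, id_b, result.statistic

Collecting the generator into a DataFrame of pairwise distances:

ks = pd.DataFrame(ks_by_pair(df), columns=['id1', 'id2', 'ks_distance'])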
For the trial dataset this yields one KS distance per ordered pair of identifiers (25 rows for the 5 compounds).
Then we can merge those results with the original DataFrame to recover the desired output.
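A sketch of that merge, also writing the tab-separated file from the question (the how='left' choice and the header aliases are my assumptions):

out = df.merge(ks, on=['id1', 'id2'], how='left')

# Rename the columns on output to match the desired header.
out.to_csv('Metrics.tsv', sep='\t', index=False,
           header=['compound_1', 'compound_2', 'Similarity', 'KS Distance'])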