How to make this piece of code more efficient (KS Test)


My data look like these

id1,id2,similarity
CHEMBL1,CHEMBL1,1
CHEMBL2,CHEMBL1,0.18
CHEMBL3,CHEMBL1,0.56
CHEMBL4,CHEMBL1,0.64
CHEMBL5,CHEMBL1,0.12
CHEMBL1,CHEMBL2,0.18
CHEMBL2,CHEMBL2,1
CHEMBL3,CHEMBL2,0.26
CHEMBL4,CHEMBL2,0.78
CHEMBL5,CHEMBL2,0.33
CHEMBL1,CHEMBL3,0.56
CHEMBL2,CHEMBL3,0.26
CHEMBL3,CHEMBL3,1
CHEMBL4,CHEMBL3,0.04
CHEMBL5,CHEMBL3,0.85
CHEMBL1,CHEMBL4,0.64
CHEMBL2,CHEMBL4,0.78
CHEMBL3,CHEMBL4,0.04
CHEMBL4,CHEMBL4,1
CHEMBL5,CHEMBL4,0.49
CHEMBL1,CHEMBL5,12
CHEMBL2,CHEMBL5,0.33
CHEMBL3,CHEMBL5,0.85
CHEMBL4,CHEMBL5,0.49
CHEMBL5,CHEMBL5,1

The whole file is around 197 million lines (10 GB). My goal is to compare the distributions of column 3 for each compound in column 1. After a lot of refactoring I managed to arrive at this piece of code:

import pandas as pd
from scipy.stats import ks_2samp
import re

with open('example.csv', 'r') as f, open('Metrics.tsv', 'a') as f_out:
    f_out.write('compound_1' + '\t' + 'compound_2' + '\t' + 'Similarity' + '\t' + 'KS Distance' + '\n')
    df = pd.read_csv(f, delimiter = ',', lineterminator = '\n', header = None)
    d = {}
    l_id1 = []
    l_id2 = []
    l_sim = []
    uniq_comps = df.iloc[:, 0].unique().tolist()
    for i in uniq_comps:
        d[i] = []
    for j in range(df.shape[0]):
        d[df.iloc[j, 0]].append(df.iloc[j, 2])
        l_id1.append(df.iloc[j, 0])
        l_id2.append(df.iloc[j, 1])
        l_sim.append(df.iloc[j, 2])
    for k in range(len(l_id1)):
        sim = round(l_sim[k]*100, 0)/100
        ks = re.findall(r"statistic=(.*)\,.*$", str(ks_2samp(d[l_id1[k]], d[l_id2[k]])))
        f_out.write(l_id1[k] + '\t' + l_id2[k] + '\t' + str(sim) + '\t' + str(''.join(ks)) + '\n')

which runs, but as expected it is extremely slow. Does anyone have ideas on how it could be made faster? My desired output looks like this:

compound_1,compound_2,Similarity,KS Distance
CHEMBL1,CHEMBL1,1.0,0.0
CHEMBL2,CHEMBL1,0.18,0.4
CHEMBL3,CHEMBL1,0.56,0.2
CHEMBL4,CHEMBL1,0.64,0.2
CHEMBL5,CHEMBL1,0.12,0.4
CHEMBL1,CHEMBL2,0.18,0.4
CHEMBL2,CHEMBL2,1.0,0.0
CHEMBL3,CHEMBL2,0.26,0.2
CHEMBL4,CHEMBL2,0.78,0.4
CHEMBL5,CHEMBL2,0.33,0.2
CHEMBL1,CHEMBL3,0.56,0.2
CHEMBL2,CHEMBL3,0.26,0.2
CHEMBL3,CHEMBL3,1.0,0.0
CHEMBL4,CHEMBL3,0.04,0.2
CHEMBL5,CHEMBL3,0.85,0.2
CHEMBL1,CHEMBL4,0.64,0.2
CHEMBL2,CHEMBL4,0.78,0.4
CHEMBL3,CHEMBL4,0.04,0.2
CHEMBL4,CHEMBL4,1.0,0.0
CHEMBL5,CHEMBL4,0.49,0.2
CHEMBL1,CHEMBL5,12.0,0.4
CHEMBL2,CHEMBL5,0.33,0.2
CHEMBL3,CHEMBL5,0.85,0.2
CHEMBL4,CHEMBL5,0.49,0.2
CHEMBL5,CHEMBL5,1.0,0.0

Would it be wiser to run it in PySpark because of the size of the data? If so, how could a similar effect be achieved?


There is 1 answer below.


Code check

There are some key points worth highlighting that may improve performance:

  • You already have all the data loaded in RAM once the CSV is read into pandas, so there is no need to copy it into separate lists (e.g. l_id1, l_id2, etc.). Avoid keeping multiple copies of the data whenever possible; it hurts performance and makes the code harder to debug.
  • When dealing with a Pandas DataFrame, avoid writing explicit loops as much as possible; there is usually a method that does the work for you, such as groupby.
  • The scipy.stats functions return a result object that always exposes statistic and pvalue attributes. Use them directly instead of casting the result to a string and extracting the value with a regular expression (see the sketch after this list).
  • Avoid unnecessary calls to expensive functions. In the last loop you recompute the same Two-Sample KS Test many times; instead, compute each test once and merge the results back into the dataset afterwards.
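For instance, a minimal sketch of the third point (the two sample lists here are made up, not taken from the real data):

from scipy.stats import ks_2samp

# ks_2samp returns a result object exposing `statistic` and `pvalue`,
# so the values can be read directly instead of parsing str(result) with a regex.
result = ks_2samp([0.18, 0.56, 0.64, 0.12], [0.18, 0.26, 0.78, 0.33])
print(result.statistic)  # the KS distance
print(result.pvalue)     # the associated p-value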

Refactoring

Since you seem to be able to open the CSV with pandas, I will assume the complete file fits in your memory. Double-checking, the numerical data should fit well within 2 GB of RAM:

8 bytes * 197e6 rows / 1024**3 ≈ 1.47 GiB
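If memory turns out to be tight, the identifier columns can be stored as categoricals at load time. A minimal sketch, assuming the file is named example.csv and starts with the id1,id2,similarity header shown in the question:

import pandas as pd

# Read only the needed columns with compact dtypes: categorical codes for the
# repeated identifiers and plain 64-bit floats for the similarity values.
df = pd.read_csv(
    "example.csv",
    usecols=["id1", "id2", "similarity"],
    dtype={"id1": "category", "id2": "category", "similarity": "float64"},
)
print(df.memory_usage(deep=True).sum() / 1024**3, "GiB")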

It is not clear exactly what you want to compute. I will assume you want to collect the data by the id1 column and then check whether the distributions are equivalent, using the Two-Sample Kolmogorov-Smirnov test on each possible pair of identifiers. If this is not what you want to do, please update your post to detail what you intend to compute.

Let's create a trial DataFrame:

import itertools
import numpy as np
import pandas as pd
from scipy import stats

N = 10**6
df = pd.DataFrame({
    "id1": np.random.choice([f"CHEMBL{i:d}" for i in np.arange(1, 6)], N),
    "id2": np.random.choice([f"CHEMBL{i:d}" for i in np.arange(1, 6)], N),
    "value": np.random.uniform(0, 12, N)
})

The trial dataset looks like:

       id1      id2      value
0  CHEMBL4  CHEMBL3  10.719870
1  CHEMBL2  CHEMBL5   2.911339
2  CHEMBL4  CHEMBL4   0.001595
3  CHEMBL2  CHEMBL3   0.148120
4  CHEMBL5  CHEMBL2   4.683689

Once the DataFrame is created, it is easy to group the data by identifier using the groupby method. Then we can apply the statistical test to each possible pair of identifiers. Assembled into a generator, it looks like this:

def apply_test(df, idkey="id", valuekey="value", test=stats.ks_2samp):
    """
    Apply statistical test to each possible pair of identifier
    """
    # Group by identifier:
    g = df.groupby(idkey)
    # Generate all 2-combination of identifier:
    for k1, k2 in itertools.combinations(g.groups.keys(), 2):
        # Apply Statistical Test to grouped data:
        t = test(df.loc[g.groups[k1],valuekey], df.loc[g.groups[k2],valuekey])
        # Store Identifier pair:
        res = {"id1": k1, "id2": k2}
        # Store statistics and p-value:
        res.update({k: getattr(t, k) for k in t._fields})
        # Yield result:
        yield res

At this point, just apply the function to the dataframe:

r = pd.DataFrame([x for x in apply_test(df)])

For the trial dataset it returns:

       id1      id2  statistic    pvalue
0  CHEMBL1  CHEMBL2   0.002312  0.657859
1  CHEMBL1  CHEMBL3   0.002125  0.756018
2  CHEMBL1  CHEMBL4   0.001701  0.934290
3  CHEMBL1  CHEMBL5   0.002560  0.527594
4  CHEMBL2  CHEMBL3   0.002155  0.741524
5  CHEMBL2  CHEMBL4   0.001766  0.914602
6  CHEMBL2  CHEMBL5   0.003035  0.315677
7  CHEMBL3  CHEMBL4   0.001668  0.944053
8  CHEMBL3  CHEMBL5   0.002603  0.507482
9  CHEMBL4  CHEMBL5   0.002661  0.479805

Then we can merge those results with the original dataframe:

df.merge(r)

            id1      id2      value  statistic    pvalue
0       CHEMBL2  CHEMBL5   2.911339   0.003035  0.315677
1       CHEMBL2  CHEMBL5   6.583948   0.003035  0.315677
2       CHEMBL2  CHEMBL5  10.237092   0.003035  0.315677
3       CHEMBL2  CHEMBL5   8.049175   0.003035  0.315677
4       CHEMBL2  CHEMBL5   3.977925   0.003035  0.315677
...         ...      ...        ...        ...       ...
400776  CHEMBL4  CHEMBL5   4.339528   0.002661  0.479805
400777  CHEMBL4  CHEMBL5   5.353133   0.002661  0.479805
400778  CHEMBL4  CHEMBL5  10.599985   0.002661  0.479805
400779  CHEMBL4  CHEMBL5   9.701375   0.002661  0.479805
400780  CHEMBL4  CHEMBL5   7.951454   0.002661  0.479805
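Finally, if the tab-separated layout from the question is needed, the merged frame can be written out directly. A minimal sketch based on the merged result above; note that itertools.combinations skips identical pairs such as (CHEMBL1, CHEMBL1) and yields each pair in only one order, so those rows would have to be added separately if required:

# Hypothetical final step: rename the columns to match the requested layout
# and dump the frame to a tab-separated file.
out = df.merge(r).rename(columns={
    "id1": "compound_1",
    "id2": "compound_2",
    "value": "Similarity",
    "statistic": "KS Distance",
})
out[["compound_1", "compound_2", "Similarity", "KS Distance"]].to_csv(
    "Metrics.tsv", sep="\t", index=False
)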