Calculate Tanimoto coefficient for dataframe

I have a table of compound names and canonical SMILES strings (the data is listed below):

[image: input table]

I want to calculate the Tanimoto coefficient (a molecular similarity measure) with RDKit in Python to get the result below:

[image: desired result table]

but I failed.

My data:

{'name': ['16β-hydro-ent-kauran-17-oic acid ',
  '16α-hydro-entkauran-17-oic acid ',
  'ent-kaur-16-en-19-oic acid',
  '16β,17-dihydroxy-ent-kauran-19-oic acid ',
  'annomontacin'],
 'canonical_smile': ['CC1(CCCC2(C1CCC34C2CCC(C3)C(C4)C(=O)O)C)C',
  'CC1(CCCC2(C1CCC34C2CCC(C3)C(C4)C(=O)O)C)C',
  'CC12CCCC(C1CCC34C2CCC(C3)C(=C)C4)(C)C(=O)O',
  'CC12CCCC(C1CCC34C2CCC(C3)C(C4)(CO)O)(C)C(=O)O',
  'CCCCCCCCCCCCC(C1CCC(O1)C(CCCCCCC(CCCCCC(CC2=CC(OC2=O)C)O)O)O)O']}

Here is my code:

import pandas as pd
import itertools
import matplotlib.pyplot as plt
from rdkit import Chem, DataStructs
from rdkit.Chem import (
    PandasTools,
    Draw,
    Descriptors,
    MACCSkeys,
    rdFingerprintGenerator)

# Build every pair of unique SMILES strings as two columns.
df3 = pd.DataFrame(list(itertools.combinations(df['canonical_smile'].unique(), 2)),
                   columns=['canonical_smile1', 'canonical_smile2']).dropna()
# Create two columns of ROMol objects from the two SMILES columns.
PandasTools.AddMoleculeColumnToFrame(df3, 'canonical_smile1', 'ROMol1', includeFingerprints=True)
PandasTools.AddMoleculeColumnToFrame(df3, 'canonical_smile2', 'ROMol2', includeFingerprints=True)
# Calculate the circular Morgan fingerprints of the two ROMol columns.
df3["morgan1"] = rdFingerprintGenerator.GetFPs(df3["ROMol1"].tolist())
df3["morgan2"] = rdFingerprintGenerator.GetFPs(df3["ROMol2"].tolist())
# Add the Tanimoto similarities between the Morgan fingerprints.
df3["tanimoto_morgan"] = DataStructs.BulkTanimotoSimilarity(df3["morgan1"], df3["morgan2"])

and this is my error:

ArgumentError: Python argument types in
    rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity(Series, Series)
did not match C++ signature:
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<unsigned __int64> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<unsigned int> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<__int64> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<int> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class ExplicitBitVect const * __ptr64 bv1, class boost::python::api::object bvList, bool returnDistance=0)
    BulkTanimotoSimilarity(class SparseBitVect const * __ptr64 bv1, class boost::python::api::object bvList, bool returnDistance=0)

Answer by Derek O

Disclaimer: I don't have much chemistry background, but my understanding is that BulkTanimotoSimilarity computes the similarity between one query fingerprint and each fingerprint in a list of targets (based on this article).

According to the error message, you are passing two pd.Series objects to BulkTanimotoSimilarity, but it expects a single fingerprint (for example an ExplicitBitVect or a SparseIntVect) as the first argument and a list (or list-like object) of fingerprints as the second.

So if we take each bit vector in column morgan1 to be your query fingerprint, and take the entire column morgan2 to be your list of target fingerprints, we can do something like the following:

df3["tanimoto_morgan"] = df3['morgan1'].map(lambda morgan1: DataStructs.BulkTanimotoSimilarity(morgan1, df3['morgan2']))

This runs and adds the following column to df3, where each row holds a list of similarities (one per target fingerprint) rather than a single value:

>>> df3['tanimoto_morgan']
0    [0.42592592592592593, 0.4107142857142857, 0.07...
1    [0.42592592592592593, 0.4107142857142857, 0.07...
2    [0.42592592592592593, 0.4107142857142857, 0.07...
3    [1.0, 0.5272727272727272, 0.0875, 0.5272727272...
4    [1.0, 0.5272727272727272, 0.0875, 0.5272727272...
5    [0.5272727272727272, 1.0, 0.08536585365853659,...
Name: tanimoto_morgan, dtype: object
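
To see why every row ends up holding a list, here is a minimal pure-Python sketch of the one-against-many behaviour. It is only an illustration: the fingerprints are stood in for by sets of on-bit indices with made-up values, and `tanimoto` and `bulk_tanimoto` are hypothetical helpers, not RDKit's implementation.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    # Two empty fingerprints have no bits at all; returning 0.0 is an assumption here.
    return len(a & b) / union if union else 0.0


def bulk_tanimoto(query, targets):
    """Compare one query fingerprint against every target, like a bulk call."""
    return [tanimoto(query, t) for t in targets]


# Toy fingerprints (made-up bit indices, not real Morgan bits).
col1 = [{1, 2, 3}, {1, 2, 3}, {4, 5}]
col2 = [{1, 2}, {4, 5, 6}, {4, 5}]

# Mapping the bulk call over col1 gives one list per row: a full N x N matrix.
matrix = [bulk_tanimoto(q, col2) for q in col1]

# The value for the pair in row i is the diagonal entry matrix[i][i].
row_wise = [matrix[i][i] for i in range(len(col1))]
print(row_wise)
```

If only one similarity per row is wanted, the off-diagonal entries are wasted work, so the full matrix is mainly useful when every molecule in one column should be compared with every molecule in the other.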

Answer by jacobdavis

I think the problem is this line, which passes two whole columns to BulkTanimotoSimilarity:

df3["tanimoto_morgan"] = DataStructs.BulkTanimotoSimilarity(df3["morgan1"], df3["morgan2"])

I fixed it with the code below, which compares the fingerprints row by row with TanimotoSimilarity, and it now runs normally:

df3["tanimoto_morgan"] = [DataStructs.TanimotoSimilarity(fp1, fp2) for fp1, fp2 in zip(df3["morgan1"], df3["morgan2"])]
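
The row-by-row pairing can be sketched without RDKit to show what it produces. In this illustration each fingerprint is a Python integer used as a bitmask, the values are toy data, and `tanimoto_bits` is a hypothetical helper rather than an RDKit function.

```python
def tanimoto_bits(fp1, fp2):
    """Tanimoto coefficient of two bitmasks: shared on-bits / on-bits set in either."""
    either = bin(fp1 | fp2).count("1")
    shared = bin(fp1 & fp2).count("1")
    # Returning 0.0 for two all-zero fingerprints is an assumption of this sketch.
    return shared / either if either else 0.0


# Toy columns standing in for df3["morgan1"] and df3["morgan2"].
morgan1 = [0b1110, 0b1110, 0b0011]
morgan2 = [0b0110, 0b1110, 0b1100]

# zip walks both columns in step, so each row gets exactly one similarity.
similarities = [tanimoto_bits(a, b) for a, b in zip(morgan1, morgan2)]
print(similarities)
```

The result has the same length as the input columns, which is why it can be assigned directly as a new DataFrame column.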