Calculate Tanimoto coefficient for dataframe

Question

532 Views Asked by jacobdavis At 13 February 2023 at 04:12

I have a table that looks like this:

I want to calculate the Tanimoto coefficient (a molecular similarity measure) with RDKit in Python to get the result below:

but my attempt fails.

My data:

{'name': ['16β-hydro-ent-kauran-17-oic acid ',
  '16α-hydro-entkauran-17-oic acid ',
  'ent-kaur-16-en-19-oic acid',
  '16β,17-dihydroxy-ent-kauran-19-oic acid ',
  'annomontacin'],
 'canonical_smile': ['CC1(CCCC2(C1CCC34C2CCC(C3)C(C4)C(=O)O)C)C',
  'CC1(CCCC2(C1CCC34C2CCC(C3)C(C4)C(=O)O)C)C',
  'CC12CCCC(C1CCC34C2CCC(C3)C(=C)C4)(C)C(=O)O',
  'CC12CCCC(C1CCC34C2CCC(C3)C(C4)(CO)O)(C)C(=O)O',
  'CCCCCCCCCCCCC(C1CCC(O1)C(CCCCCCC(CCCCCC(CC2=CC(OC2=O)C)O)O)O)O']}

Here is my code:

import pandas as pd
import itertools
import matplotlib.pyplot as plt
from rdkit import Chem, DataStructs
from rdkit.Chem import (
    PandasTools,
    Draw,
    Descriptors,
    MACCSkeys,
    rdFingerprintGenerator)

# Build two SMILES columns from all pairwise combinations of the unique SMILES.
df3 = pd.DataFrame(
    list(itertools.combinations(df['canonical_smile'].unique(), 2)),
    columns=['canonical_smile1', 'canonical_smile2']).dropna()
# Create two columns of ROMol objects from the two SMILES columns.
PandasTools.AddMoleculeColumnToFrame(df3, 'canonical_smile1', 'ROMol1', includeFingerprints=True)
PandasTools.AddMoleculeColumnToFrame(df3, 'canonical_smile2', 'ROMol2', includeFingerprints=True)
# Calculate the circular Morgan fingerprints for the two ROMol columns.
df3["morgan1"] = rdFingerprintGenerator.GetFPs(df3["ROMol1"].tolist())
df3["morgan2"] = rdFingerprintGenerator.GetFPs(df3["ROMol2"].tolist())
# Add the Tanimoto similarities between the Morgan fingerprints.
df3["tanimoto_morgan"] = DataStructs.BulkTanimotoSimilarity(df3["morgan1"], df3["morgan2"])

and this is my error:

ArgumentError: Python argument types in
    rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity(Series, Series)
did not match C++ signature:
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<unsigned __int64> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<unsigned int> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<__int64> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class RDKit::SparseIntVect<int> v1, class boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(class ExplicitBitVect const * __ptr64 bv1, class boost::python::api::object bvList, bool returnDistance=0)
    BulkTanimotoSimilarity(class SparseBitVect const * __ptr64 bv1, class boost::python::api::object bvList, bool returnDistance=0)

There are 2 best solutions below

**Derek O** · Answer 1 · 2023-02-13T16:17:30.077000

Disclaimer: I don't have much of a chemistry background, but my understanding is that BulkTanimotoSimilarity computes the similarity between one query fingerprint and each fingerprint in a list of targets (based on this article).
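For reference, the Tanimoto coefficient of two bit-vector fingerprints A and B is usually written as below. The symbols are introduced here for illustration and do not come from the question: a and b are the numbers of bits set in A and B, and c is the number of bits set in both.

```latex
T(A, B) = \frac{c}{a + b - c} = \frac{|A \cap B|}{|A \cup B|}
```

The value is 1 when the two fingerprints have identical on-bits and 0 when they share none.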

From the error message, you are passing two pd.Series objects to BulkTanimotoSimilarity, whereas the function expects a single fingerprint (for example a SparseIntVect or an ExplicitBitVect) as its first argument and a list (or list-like object) of fingerprints as its second.

So if we take each bit vector in column morgan1 to be your query fingerprint, and take the entire column morgan2 to be your list of target fingerprints, we can do something like the following:

df3["tanimoto_morgan"] = df3['morgan1'].map(lambda morgan1: DataStructs.BulkTanimotoSimilarity(morgan1, df3['morgan2']))

This runs and adds the following column to df3, where each entry is a list of similarities between that row's morgan1 fingerprint and every fingerprint in morgan2:

>>> df3['tanimoto_morgan']
0    [0.42592592592592593, 0.4107142857142857, 0.07...
1    [0.42592592592592593, 0.4107142857142857, 0.07...
2    [0.42592592592592593, 0.4107142857142857, 0.07...
3    [1.0, 0.5272727272727272, 0.0875, 0.5272727272...
4    [1.0, 0.5272727272727272, 0.0875, 0.5272727272...
5    [0.5272727272727272, 1.0, 0.08536585365853659,...
Name: tanimoto_morgan, dtype: object
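To make the "one query against a list of targets" behaviour concrete, here is a minimal pure-Python sketch. It is not RDKit's implementation: the functions `tanimoto` and `bulk_tanimoto` are hypothetical helpers, and the fingerprints are made-up sets of on-bit positions rather than real Morgan fingerprints.

```python
# Toy fingerprints: each one is the set of "on" bit positions.
# The values are invented for illustration; they are not real Morgan bits.

def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices."""
    union = len(a | b)
    # Convention chosen for this sketch: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0

def bulk_tanimoto(query, targets):
    """Compare one query fingerprint with every fingerprint in a list."""
    return [tanimoto(query, t) for t in targets]

query = {1, 2, 3, 4}
targets = [{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8}]

# One score per target, in the same order as the targets list.
scores = bulk_tanimoto(query, targets)
print(scores)
```

The result has one score per target, which is why mapping a bulk comparison over a column produces a list in every row instead of a single number.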

**jacobdavis** · Answer 2 · 2023-02-14T07:44:58.657000

I think the problem is this line:

df3["tanimoto_morgan"] = DataStructs.BulkTanimotoSimilarity(df3["morgan1"], df3["morgan2"])

I fixed it with the code below, which compares the two fingerprints in each row with TanimotoSimilarity, and it now runs normally:

df3["tanimoto_morgan"] = [DataStructs.TanimotoSimilarity(fp1, fp2) for fp1, fp2 in zip(df3["morgan1"], df3["morgan2"])]
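The row-wise pattern can be sketched without RDKit or pandas. In this minimal example the molecule names, the fingerprint sets, and the `tanimoto` helper are all hypothetical stand-ins; only the structure (one row per unordered pair, one score per row) mirrors df3.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices."""
    union = len(a | b)
    # Convention chosen for this sketch: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0

# Hypothetical toy fingerprints keyed by made-up molecule names.
fingerprints = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2, 3, 5},
    "mol_c": {7, 8, 9},
}

# Same shape as df3: one row per unordered pair of molecules.
pairs = list(combinations(fingerprints, 2))
col1 = [fingerprints[m1] for m1, _ in pairs]
col2 = [fingerprints[m2] for _, m2 in pairs]

# Row-wise comparison: exactly one score per row, as in the zip-based fix.
pairwise = [tanimoto(fp1, fp2) for fp1, fp2 in zip(col1, col2)]

for (m1, m2), score in zip(pairs, pairwise):
    print(f"{m1} vs {m2}: {score:.2f}")
```

Because zip walks the two columns in step, the output has the same length as the number of pairs, so it can be assigned directly to a new column.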
