Compute Tanimoto similarity from a csv file of bitstrings

I have a csv file ("file.csv") containing bitstrings (structural fingerprints) of 1000 molecules, something like:

0,0,1,0,1,0,0,1,0,...

1,0,3,1,0,1,1,0,0,...

0,1,0,0,1,0,0,3,2,...

2,1,1,3,0,0,0,1,0,...

...

1,0,0,0,1,0,0,0,2,...

(1000 lines, each line corresponds to the structural fingerprint of a molecule, each fingerprint contains 2048 bits)

With this csv file, I am trying to compute the pairwise Tanimoto similarity of these 1000 molecules, using this code:

import pandas as pd
from rdkit import DataStructs

input_data = pd.read_csv("file.csv", delimiter=',', header=None)

def tanimoto_distance_matrix(fp_list):
    # Flat list of distances for every pair (i, j) with j < i
    dissimilarity_matrix = []
    for i in range(1, len(fp_list)):
        similarities = DataStructs.BulkTanimotoSimilarity(fp_list[i], fp_list[:i])
        dissimilarity_matrix.extend([1 - x for x in similarities])
    return dissimilarity_matrix

dist_matrix_raw = tanimoto_distance_matrix(input_data)

However, I receive the following error message:

ArgumentError: Python argument types in
    rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity(str, Series)
did not match C++ signature:
    BulkTanimotoSimilarity(RDKit::SparseIntVect<unsigned long> v1, boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(RDKit::SparseIntVect<unsigned int> v1, boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(RDKit::SparseIntVect<long> v1, boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(RDKit::SparseIntVect<int> v1, boost::python::list v2, bool returnDistance=False)
    BulkTanimotoSimilarity(ExplicitBitVect const* bv1, boost::python::api::object bvList, bool returnDistance=0)
    BulkTanimotoSimilarity(SparseBitVect const* bv1, boost::python::api::object bvList, bool returnDistance=0)

Could you please help me fix this error? Thank you very much in advance.

There is 1 best solution below

Shovalt On

There are several problems with your approach and/or code:

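  1. Your "bit" vectors are not actually bits (at least in the example you provided): they contain values such as 2 and 3, but a bit vector may contain only zeros and ones*.
  2. You are indexing a pandas DataFrame as if it were a list or NumPy array. On a DataFrame, an integer in square brackets selects the column with that label, not the row at that position, so fp_list[i] does not return the fingerprint of molecule i.
  3. RDKit's similarity functions accept only RDKit's own fingerprint types (ExplicitBitVect, SparseBitVect or SparseIntVect, as the signatures in your error message show), not strings, lists, or pandas objects. It's a bit annoying (and somewhat un-Pythonic), but that is the case.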
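
To make the second point concrete, here is a small sketch of how square-bracket indexing behaves on a DataFrame read with header=None. The three 4-value rows are made-up stand-ins for file.csv, not your data:

```python
import io
import pandas as pd

# Three tiny 4-value fingerprints standing in for file.csv (made-up data).
csv_text = "0,0,1,0\n1,0,3,1\n0,1,0,0\n"
df = pd.read_csv(io.StringIO(csv_text), header=None)

# With header=None the columns are labeled 0, 1, 2, ...
# so df[2] is the third COLUMN (one bit position across all molecules).
column_2 = df[2].tolist()

# A slice inside [] selects ROWS instead, so df[i] and df[:i]
# are not even the same kind of object.
first_two_rows = df[:2].values.tolist()

# .iloc[i] is the explicit way to get row i (one molecule's fingerprint).
row_1 = df.iloc[1].tolist()

print(column_2)        # [1, 3, 0]
print(first_two_rows)  # [[0, 0, 1, 0], [1, 0, 3, 1]]
print(row_1)           # [1, 0, 3, 1]
```

Either way, neither a pandas column nor a pandas row is a type that BulkTanimotoSimilarity accepts, which is why a conversion step is needed.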

You can address all three issues at once by converting your data to a list of RDKit ExplicitBitVect objects:

from rdkit.DataStructs import ExplicitBitVect

def to_bit_vector(arr):
    # Convert a sequence of integers to an RDKit ExplicitBitVect
    rdkit_fp = ExplicitBitVect(len(arr))
    for i, bit in enumerate(arr):
        if bit:  # Any nonzero value sets the bit
            rdkit_fp.SetBit(i)
    return rdkit_fp

# Convert each dataframe row to a bit vector
fp_list = input_data.apply(to_bit_vector, axis=1).to_list()

# Now your function will work unchanged
dist_matrix_raw = tanimoto_distance_matrix(fp_list)
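
Keep in mind that the list returned by tanimoto_distance_matrix is flat: it holds the distances for the pairs (1,0), (2,0), (2,1), (3,0), ... in that order, which is n(n-1)/2 values for n fingerprints. If a square matrix is easier to work with, a minimal sketch for expanding it could look like the following. Note that condensed_to_square is a hypothetical helper written for this answer, not an RDKit function, and the three distances are made up:

```python
def condensed_to_square(flat, n):
    """Expand the flat lower-triangle list into an n x n distance matrix."""
    if len(flat) != n * (n - 1) // 2:
        raise ValueError("flat list length does not match n")
    square = [[0.0] * n for _ in range(n)]
    k = 0
    # Same loop order as tanimoto_distance_matrix: row i, then columns 0..i-1
    for i in range(1, n):
        for j in range(i):
            square[i][j] = square[j][i] = flat[k]
            k += 1
    return square

# Made-up distances for 3 molecules, in the order d(1,0), d(2,0), d(2,1)
example = condensed_to_square([0.5, 0.8, 0.25], 3)
print(example)  # [[0.0, 0.5, 0.8], [0.5, 0.0, 0.25], [0.8, 0.25, 0.0]]
```

With your data the call would be condensed_to_square(dist_matrix_raw, len(fp_list)).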

Please note that the to_bit_vector function treats anything other than 0 as 1, so values such as 2 or 3 in your file are collapsed to 1 and the count information is lost.

* I think that Tanimoto similarity can be extended to integer vectors, but I don't know how.
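
On the footnote: one commonly used extension to non-negative integer vectors replaces the bit intersection and union with the element-wise minimum and maximum (sometimes called the Ruzicka or weighted Jaccard similarity). The sketch below is plain Python for illustration only, with made-up 6-value vectors; both function names are hypothetical, and returning 1.0 for two all-zero vectors is a convention chosen here, not something taken from RDKit:

```python
def tanimoto_binary(a, b):
    """Tanimoto on binarized vectors: shared on-bits / bits on in either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

def tanimoto_counts(a, b):
    """Min/max generalization for non-negative integer counts."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 1.0

fp_a = [1, 0, 3, 1, 0, 1]
fp_b = [2, 1, 1, 3, 0, 0]

print(tanimoto_binary(fp_a, fp_b))  # 0.6  (3 shared positions out of 5)
print(tanimoto_counts(fp_a, fp_b))  # 0.3  (sum of minima 3, sum of maxima 10)
```

For vectors that contain only zeros and ones the two functions agree, so the count version is a strict generalization. The SparseIntVect overloads listed in the error message suggest RDKit can work with count vectors directly, but I have not shown that route here.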