I have a csv file ("file.csv") containing bitstrings (structural fingerprints) of 1000 molecules, something like:
0,0,1,0,1,0,0,1,0,...
1,0,3,1,0,1,1,0,0,...
0,1,0,0,1,0,0,3,2,...
2,1,1,3,0,0,0,1,0,...
...
1,0,0,0,1,0,0,0,2,...
(1000 lines; each line is the structural fingerprint of one molecule, and each fingerprint contains 2048 bits)
Using this CSV file, I am trying to compute the pairwise Tanimoto similarity of these 1000 molecules with the following code:
import pandas as pd
from rdkit import DataStructs

input_data = pd.read_csv("file.csv", delimiter=',', header=None)

def tanimoto_distance_matrix(fp_list):
    dissimilarity_matrix = []
    for i in range(1, len(fp_list)):
        similarities = DataStructs.BulkTanimotoSimilarity(fp_list[i], fp_list[:i])
        dissimilarity_matrix.extend([1 - x for x in similarities])
    return dissimilarity_matrix

dist_matrix_raw = tanimoto_distance_matrix(input_data)
However, I receive the following error message:
ArgumentError: Python argument types in
rdkit.DataStructs.cDataStructs.BulkTanimotoSimilarity(str, Series)
did not match C++ signature:
BulkTanimotoSimilarity(RDKit::SparseIntVect<unsigned long> v1, boost::python::list v2, bool returnDistance=False)
BulkTanimotoSimilarity(RDKit::SparseIntVect<unsigned int> v1, boost::python::list v2, bool returnDistance=False)
BulkTanimotoSimilarity(RDKit::SparseIntVect<long> v1, boost::python::list v2, bool returnDistance=False)
BulkTanimotoSimilarity(RDKit::SparseIntVect<int> v1, boost::python::list v2, bool returnDistance=False)
BulkTanimotoSimilarity(ExplicitBitVect const* bv1, boost::python::api::object bvList, bool returnDistance=0)
BulkTanimotoSimilarity(SparseBitVect const* bv1, boost::python::api::object bvList, bool returnDistance=0)
Could you please help me fix this error? Thank you very much in advance.
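There are several problems with your approach and/or code:
- Your data contains values other than 0 and 1 (for example 2 and 3), so the rows are not really bitstrings.*
- input_data is a pandas DataFrame, not a list of fingerprints, so fp_list[i] and fp_list[:i] do not give you individual fingerprint objects.
- BulkTanimotoSimilarity expects an RDKit fingerprint object (for example an ExplicitBitVect) and a list of such objects, not a str and a pandas Series, which is what the error message is telling you.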
You can address all of these issues together with a single solution: convert your data to a list of RDKit's ExplicitBitVect objects.

Please note that the to_bit_vector function treats anything other than 0 as 1.

* I think that Tanimoto similarity can be extended to integer vectors, but I don't know how.
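
The conversion described above could look roughly like the sketch below. The name to_bit_vector matches the function mentioned in the answer, but its body here is a reconstruction under assumptions, not necessarily the original code: it assumes every row has the same length and that any nonzero value should become a set bit. The small in-memory DataFrame is a hypothetical stand-in for the contents of file.csv.

```python
import numpy as np
import pandas as pd
from rdkit import DataStructs
from rdkit.DataStructs import ExplicitBitVect


def to_bit_vector(row):
    """Convert one row of integers to an ExplicitBitVect (any nonzero value -> 1)."""
    bv = ExplicitBitVect(len(row))
    # Indices of nonzero entries, converted to plain Python ints for RDKit.
    on_bits = np.flatnonzero(np.asarray(row)).tolist()
    bv.SetBitsFromList(on_bits)
    return bv


def tanimoto_distance_matrix(fp_list):
    """Return the lower triangle of the Tanimoto distance matrix as a flat list."""
    dissimilarity_matrix = []
    for i in range(1, len(fp_list)):
        # Compare fingerprint i with all fingerprints before it.
        similarities = DataStructs.BulkTanimotoSimilarity(fp_list[i], fp_list[:i])
        dissimilarity_matrix.extend([1 - x for x in similarities])
    return dissimilarity_matrix


# Stand-in data; for the real file use:
# input_data = pd.read_csv("file.csv", delimiter=",", header=None)
input_data = pd.DataFrame([[0, 0, 1, 0, 1],
                           [1, 0, 3, 1, 0],
                           [0, 1, 0, 0, 1]])

# One ExplicitBitVect per row (per molecule), collected in a plain Python list.
fps = [to_bit_vector(row) for row in input_data.to_numpy()]
dist_matrix_raw = tanimoto_distance_matrix(fps)
print(dist_matrix_raw)
```

Because the counts are collapsed to 0/1, this computes the binary Tanimoto distance only. The overload list in the error message also includes SparseIntVect variants, which suggests count-based fingerprints may be accepted directly; that route is not shown here.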