How to convert a large SDF file to a DataFrame in RDKit

This is a cheminformatics question. I have a large SDF file (more than 700 MB, 200,000 molecules) that I want to convert to a DataFrame for analysis. I use the code below:

from rdkit.Chem import PandasTools
df = PandasTools.LoadSDF('Datatest/0 chemdiv_PPI.sdf')

As a result, memory (RAM) usage climbed to nearly 100%. My question is: is there an easy way to convert a large SDF file to a DataFrame and work with it in pandas without using so much RAM?

There is 1 solution below

I think your best bet is to handle this with RDKit's SDMolSupplier, which does not read the entire file into memory at once: https://www.rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html#rdkit.Chem.rdmolfiles.SDMolSupplier

From there you have a couple of options. One is to iterate over the supplier, write chunks of molecules to smaller SDF files, and then load those into pandas one at a time with PandasTools.LoadSDF. Beyond that, the right approach depends on what you are trying to accomplish with the PandasTools function.
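
As a rough illustration of the chunked approach, here is a minimal sketch that skips the intermediate SDF files and builds one small DataFrame per chunk directly from the supplier. The helper names (batched, iter_sdf_frames), the chunk size of 10,000, and the choice to store a SMILES string instead of the molecule object are my own assumptions, not part of RDKit or PandasTools, so adjust them to your use case.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def iter_sdf_frames(sdf_path, chunk_size=10_000):
    """Yield one pandas DataFrame per chunk of molecules in `sdf_path`.

    Hypothetical helper: each row holds the molecule's SD properties plus a
    SMILES string (an assumption made here to keep rows lightweight).
    """
    # Imported here so that `batched` stays usable without pandas or RDKit.
    import pandas as pd
    from rdkit import Chem

    supplier = Chem.SDMolSupplier(sdf_path)
    # The supplier yields None for records it cannot parse, so skip those.
    valid_mols = (mol for mol in supplier if mol is not None)

    for batch in batched(valid_mols, chunk_size):
        records = []
        for mol in batch:
            record = mol.GetPropsAsDict()
            record["SMILES"] = Chem.MolToSmiles(mol)
            records.append(record)
        yield pd.DataFrame(records)


# Example usage (path taken from the question):
# for frame in iter_sdf_frames("Datatest/0 chemdiv_PPI.sdf"):
#     ...  # filter, aggregate, or save each chunk, then let it go out of scope
```

Because only one chunk is alive at a time, peak memory should depend on the chunk size rather than on the size of the whole file, provided you reduce or save each frame instead of collecting them all in a list.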