How do I get molecular structural information from SMILES?


Is there an algorithm that can convert a SMILES string into a topological fingerprint, i.e. a count of the structural groups in the molecule? For example, if glycerol is the input, the answer would be 3 x -OH, 2 x -CH2 and 1 x -CH.

I'm trying to build a Python script that predicts the density of a mixture using an artificial neural network. As input, I want to use the structure/fingerprint of my molecules, derived from their SMILES strings.

I'm already familiar with RDKit and the Morgan fingerprint, but that is not what I'm looking for. I'm also aware that I can use the substructure matching search in RDKit, but then I would have to define all the different subgroups myself. Is there a more convenient or shorter way?


There are 2 best solutions below

On BEST ANSWER

For most structures, there is no built-in option for finding arbitrary fragments. However, RDKit has a module, rdkit.Chem.Fragments, that counts the occurrences of a fragment, particularly when the fragment is a functional group. As an example, say you want to find the number of aliphatic -OH groups in your molecule. You can simply call the following function:

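from rdkit import Chem
from rdkit.Chem.Fragments import fr_Al_OH
mol = Chem.MolFromSmiles("OCC(O)CO")  # glycerol, the example from the question
fr_Al_OH(mol)  # number of aliphatic hydroxyl groups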

or the following would return the number of aromatic -OH groups:

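from rdkit import Chem
from rdkit.Chem.Fragments import fr_Ar_OH
mol = Chem.MolFromSmiles("Oc1ccccc1")  # phenol, used here only as an illustration
fr_Ar_OH(mol)  # number of aromatic hydroxyl groups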

Similarly, there are 83 more functions available, and some of them should be useful for your task. Where no pre-written function exists, you can read the source code of these RDKit modules, see how the existing functions are implemented, and write your own along the same lines. But as you already mentioned, that comes down to defining a SMARTS string and then matching the fragment. The fragment matching module can be found here.
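To avoid calling each function by hand, one option is to loop over everything in rdkit.Chem.Fragments whose name starts with fr_ and collect the counts in a dictionary. This is only a sketch: the helper name fragment_counts and the discovery via dir() are my own assumptions, not an official RDKit recipe, and the exact number of functions depends on the RDKit version.

```python
from rdkit import Chem
from rdkit.Chem import Fragments


def fragment_counts(smiles):
    """Return {function name: count} for every fr_* function in rdkit.Chem.Fragments.

    Hypothetical helper for illustration; it is not part of RDKit.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    counts = {}
    for name in sorted(dir(Fragments)):
        func = getattr(Fragments, name)
        # Keep only the fragment-counting functions (fr_Al_OH, fr_Ar_OH, ...).
        if name.startswith("fr_") and callable(func):
            counts[name] = func(mol)
    return counts


glycerol = fragment_counts("OCC(O)CO")
# Show only the fragments that actually occur in glycerol.
print({name: n for name, n in glycerol.items() if n})
```

The values of the resulting dictionary, taken in sorted key order, give a fixed-length vector that could be fed to a neural network.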

On

If you want to predict the densities of pure components before predicting those of mixtures, I recommend the following paper: https://pubs.acs.org/doi/abs/10.1021/acs.iecr.6b03809

You can use the fragments predefined in RDKit, as mnis proposes, or you can specify the groups as SMARTS patterns and search for them with GetSubstructMatches, as you suggested.
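As a minimal sketch of the SMARTS route, the three groups from the glycerol example in the question can be counted as follows. The SMARTS strings and the names GROUP_SMARTS and count_groups are my own illustrative choices, not a validated group definition.

```python
from rdkit import Chem

# Illustrative SMARTS patterns (assumptions, not an established group scheme).
GROUP_SMARTS = {
    "OH": "[OX2H]",    # hydroxyl oxygen
    "CH2": "[CX4H2]",  # sp3 carbon carrying two hydrogens
    "CH": "[CX4H1]",   # sp3 carbon carrying one hydrogen
}


def count_groups(smiles):
    """Count how often each pattern in GROUP_SMARTS matches the molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    counts = {}
    for name, smarts in GROUP_SMARTS.items():
        pattern = Chem.MolFromSmarts(smarts)
        counts[name] = len(mol.GetSubstructMatches(pattern))
    return counts


print(count_groups("OCC(O)CO"))  # glycerol -> {'OH': 3, 'CH2': 2, 'CH': 1}
```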

Dissecting a molecule into specific groups is not as straightforward as it might first appear. You could also use an algorithm I published a while ago: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0382-3 It includes a list of SMARTS patterns for the UNIFAC model, but you could also use them for other purposes, such as density prediction.
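One reason it is not straightforward is that patterns overlap: in ethanol, a combined CH2OH group, a plain CH2 and a plain OH all match the same atoms. The sketch below shows a naive greedy way around this, assigning each atom to at most one group in a fixed priority order. It is not the algorithm from the paper linked above, and the group list, its ordering and the function name greedy_fragment are hypothetical.

```python
from rdkit import Chem

# Hypothetical group list, ordered from most to least specific.
GROUPS = [
    ("CH2OH", "[CX4H2][OX2H]"),
    ("OH", "[OX2H]"),
    ("CH3", "[CX4H3]"),
    ("CH2", "[CX4H2]"),
    ("CH", "[CX4H1]"),
]


def greedy_fragment(smiles):
    """Assign each heavy atom to at most one group, in priority order.

    Returns (counts, complete); complete is True only if every atom
    of the molecule ended up in some group.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    assigned = set()
    counts = {}
    for name, smarts in GROUPS:
        pattern = Chem.MolFromSmarts(smarts)
        for match in mol.GetSubstructMatches(pattern):
            # Accept a match only if none of its atoms is already used.
            if assigned.isdisjoint(match):
                assigned.update(match)
                counts[name] = counts.get(name, 0) + 1
    complete = len(assigned) == mol.GetNumAtoms()
    return counts, complete


print(greedy_fragment("CCO"))       # ethanol
print(greedy_fragment("OCC(O)CO"))  # glycerol
```

A greedy pass like this depends heavily on the pattern order and can leave atoms unassigned for molecules outside the group list, which is why checking the complete flag matters.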