Problem transforming a SEQUENCE into SMILES with RDKit


I have a data set of enzyme sequences and a target variable to predict.
My approach is to convert the sequences into SMILES and then derive numerical inputs for machine learning models.
The problem is that RDKit fails to convert some of the sequences, but not all of them. In this case the conversion stopped at index = 5, which corresponds to the following sequence: 'PQITLWQRPIVTIKIGGQLIEALLDTGADDTVLEXXNLPGRWKPKXIGGIGGFXKVRQYDQVPIEIXGHKTXSTVLVGPTPVNIIGRNLMTQIGCTLNFPISPIETVPVKLKPGMDGPKXKQWPLTEEKIKALMEICKELEEEGKISKIGPENPYNTPVFAIKKKNSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKRKKSVTVLDVGDAYFSIPLDKDFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYVDDLYVGSDLEIEQHRTKIKELRQYLWKWGFYTPDXKHQEEPPFHWXGYELHPDKWTVQPIVLPEKESWTVNDIQKLVGKLNWASQIYAGIKVKQLCKLLRG'

1 Answer


The issue appears to be the X characters in your sequence. X is not one of the 20 standard amino acid codes; it is a placeholder for an unknown or atypical amino acid. It seems that RDKit cannot process this case:

from rdkit import Chem

# One-letter codes of the 20 standard amino acids
amino_acids = {'G', 'A', 'L', 'M', 'F', 'W', 'K', 'Q', 'E', 'S', 'P', 'V', 'I', 'C', 'Y', 'H', 'R', 'N', 'D', 'T'}
seq = 'PQITLWQRPIVTIKIGGQLIEALLDTGADDTVLEXXNLPGRWKPKXIGGIGGFXKVRQYDQVPIEIXGHKTXSTVLVGPTPVNIIGRNLMTQIGCTLNFPISPIETVPVKLKPGMDGPKXKQWPLTEEKIKALMEICKELEEEGKISKIGPENPYNTPVFAIKKKNSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKRKKSVTVLDVGDAYFSIPLDKDFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYVDDLYVGSDLEIEQHRTKIKELRQYLWKWGFYTPDXKHQEEPPFHWXGYELHPDKWTVQPIVLPEKESWTVNDIQKLVGKLNWASQIYAGIKVKQLCKLLRG'

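# Build a copy of the sequence that keeps only the standard residues
edited_seq = ''
for aa in seq:
    if aa not in amino_acids:
        print('Non-standard/missing amino acid:', aa)
    else:
        edited_seq += aa

# Try to parse both the original and the edited sequence
m1 = Chem.MolFromSequence(seq)
m2 = Chem.MolFromSequence(edited_seq)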

print('Read seq successfully:', m1 is not None)
print('Read edited_seq successfully:', m2 is not None)

[Out]:

Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Non-standard/missing amino acid: X
Read seq successfully: False
Read edited_seq successfully: True

Once the Xs are removed, RDKit parses the sequence correctly. I am not saying that simply removing them is the right solution, since deleting residues changes the peptide being modelled; this just highlights the issue. There is probably a much better method for processing these cases.
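
Since the question is about a whole data set, it may help to find every affected sequence up front rather than stopping at the first failure. The sketch below is plain Python and does not call RDKit; the helper names `find_nonstandard` and `split_dataset` and the toy sequences are made up for illustration, and it assumes the same 20-letter alphabet used above.

```python
# Hypothetical helpers (not part of RDKit): locate residues outside the
# 20 standard one-letter codes before attempting any conversion.
STANDARD_AMINO_ACIDS = set("GALMFWKQESPVICYHRNDT")


def find_nonstandard(seq):
    """Return (position, character) pairs for every non-standard residue."""
    return [(i, aa) for i, aa in enumerate(seq) if aa not in STANDARD_AMINO_ACIDS]


def split_dataset(sequences):
    """Separate clean sequences from those that need attention.

    Returns two dicts keyed by data set index: the clean sequences, and
    the offending (position, character) pairs for the flagged ones.
    """
    clean, flagged = {}, {}
    for idx, seq in enumerate(sequences):
        bad = find_nonstandard(seq)
        if bad:
            flagged[idx] = bad
        else:
            clean[idx] = seq
    return clean, flagged


# Toy data, invented for this example
sequences = ["PQITLW", "GADDTVLEXXNLPG", "KIGGQLIEA"]
clean, flagged = split_dataset(sequences)
print("Clean:", clean)
print("Flagged:", flagged)
```

The sequences in `clean` can then be passed to Chem.MolFromSequence as in the code above. What to do with the flagged ones (drop them, trim them, or substitute a residue) depends on the modelling task and is left open here.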