I have a data set of enzyme sequences and a target variable to predict.
My workflow is to convert the sequences into SMILES and then derive numerical inputs from them for machine learning models.
Problem is: rdkit fails to transform some of the sequences but not all of them. In this case the transformation was stopped for index = 5 which corresponds to the following sequence: 'PQITLWQRPIVTIKIGGQLIEALLDTGADDTVLEXXNLPGRWKPKXIGGIGGFXKVRQYDQVPIEIXGHKTXSTVLVGPTPVNIIGRNLMTQIGCTLNFPISPIETVPVKLKPGMDGPKXKQWPLTEEKIKALMEICKELEEEGKISKIGPENPYNTPVFAIKKKNSTKWRKLVDFRELNKRTQDFWEVQLGIPHPAGLKRKKSVTVLDVGDAYFSIPLDKDFRKYTAFTIPSINNETPGIRYQYNVLPQGWKGSPAIFQSSMTKILEPFRKQNPDIVIYQYVDDLYVGSDLEIEQHRTKIKELRQYLWKWGFYTPDXKHQEEPPFHWXGYELHPDKWTVQPIVLPEKESWTVNDIQKLVGKLNWASQIYAGIKVKQLCKLLRG'

Problem transforming a SEQUENCE into SMILES with RDKit
Asked by Triki Sadok
It looks like the issue is the X characters in your sequence. X is not an amino acid code but a placeholder for an unknown or atypical amino acid, and it seems that RDKit cannot process it.
When we removed the Xs, RDKit parsed the sequence correctly. I am not saying that simply removing them is the correct solution, just highlighting the issue. There is probably a much better way to handle these cases.
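As a rough sketch of the check described above, the snippet below flags residues outside the 20 standard one-letter codes before anything is handed to RDKit. The helper names (`find_nonstandard`, `strip_nonstandard`, `sequence_to_smiles`) are made up for illustration, and treating every other character as "non-standard" is an assumption. Only `Chem.MolFromSequence` and `Chem.MolToSmiles` are actual RDKit calls.

```python
# The 20 standard amino acid one-letter codes. Anything else (X, B, Z, ...)
# is treated as non-standard here; that cutoff is an assumption of this sketch.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def find_nonstandard(seq):
    """Return (position, character) pairs for residues outside STANDARD_AA."""
    return [(i, ch) for i, ch in enumerate(seq) if ch not in STANDARD_AA]


def strip_nonstandard(seq):
    """Drop non-standard residues. This changes the peptide, so it is only
    useful for diagnosing the failure, not necessarily as a real fix."""
    return "".join(ch for ch in seq if ch in STANDARD_AA)


def sequence_to_smiles(seq):
    """Return a SMILES string, or None if RDKit cannot parse the sequence."""
    from rdkit import Chem  # imported lazily so the helpers above work without RDKit

    mol = Chem.MolFromSequence(seq)
    return Chem.MolToSmiles(mol) if mol is not None else None
```

Running `find_nonstandard` over each sequence in the data set should show which rows contain X before the conversion stops on them. Checking the result of `sequence_to_smiles` for `None` lets a loop record the failing indices and carry on with the rest instead of halting at the first bad row.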