Finding the complement of a DNA sequence

I have to translate the complement of a DNA sequence into amino acids:

TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA
TATATATATATATATGGGTCATGCTACGTAAGGGGGTTCCCACTTTGGTCATGCTAGTATTGAAA
+1 TyrIleTyrIleTyrGlySerCysTyrValArgGlyPheProLeuTrpSerCysStpTyrStp
+2 IleTyrIleTyrMetGlyHisAlaThrOc*GlyGlySerHisPheGlyHisAlaSerIleGlu
+3 TyrIleTyrIleTrpValMetLeuArgLysGlyValProThrLeuValMetLeuValLeuLys
  • The first sequence is the normal sequence,
  • The second one is the complementary sequence,
  • The one with +1 is the amino acid sequence corresponding to my complementary sequence
  • The one with +2 is the amino acid sequence corresponding to my complementary sequence starting at the second base
  • The one with +3 is the amino acid sequence corresponding to my complementary sequence beginning with the third base

I have tried the following code to get these results, but so far I only get the complementary sequence, without any split.

seq = "CCGGAAGAGCTTACTTAG"
basecomplement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def translate(seq):

    x = 0
    aaseq = []
    while True:
        try:
            aaseq.append(basecomplement[seq[x:x+1]])
            x += 1

        except (IndexError, KeyError):
            break
    return aaseq

for frame in range(1):
    #print(translate(seq[frame:]))

    rseqn= (''.join(item.split('|')[0] for item in translate(seq[frame:])))

    rseqn = list(rseqn)
    rseqn.reverse()

    print( rseqn)

Can someone help me get these results?

There are 3 best solutions below

1
On

Use:

for frame in range(1):
    rseqn = reversed([item for item in translate(seq[frame:])])
    rseqn = ''.join(rseqn)

    print(rseqn)

This produces the correct complementary (reversed) sequence:

CTAAGTAAGCTCTTCCGG

Note that you do not need the for loop (the current one in fact does nothing) to determine DNA or RNA complementary sequences, as this is independent of the translation frame.

Having said that, I must stress that all of your code can be reduced to a few lines if you start using BioPython for your bioinformatics tasks:

>>> from Bio.Seq import Seq
>>> dna = Seq("CCGGAAGAGCTTACTTAG")
>>> dna.reverse_complement()
Seq('CTAAGTAAGCTCTTCCGG')
1
On

I've cleaned up the code a bit:

seq = "CCGGAAGAGCTTACTTAG"
basecomplement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def translate(seq):
    aaseq = []
    for character in seq:
        aaseq.append(basecomplement[character])
    return aaseq

for frame in range(1):
    rseqn= (''.join(item.split('|')[0] for item in translate(seq[frame:])))
    rseqn = rseqn[::-1]
    print( rseqn)

See if this works for you.

What you were doing was converting rseqn to a list, reversing the list, and printing the list. The code I've written never converts rseqn to a list: rseqn starts out as a string, and the line rseqn = rseqn[::-1] reverses that string. So what you finally print is a string, not a list, and hence there are no splits.
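
To make the difference concrete, here is a small sketch. The variable names are mine, and the starting string is simply the complement of the question's sequence before it is reversed.

```python
# Complement of the question's sequence, not yet reversed.
complement = "GGCCTTCTCGAATGAATC"

# The question's approach: turn the string into a list of single
# characters and reverse that list in place.
as_list = list(complement)
as_list.reverse()

# Slicing with a step of -1 reverses the string and keeps it a string.
as_string = complement[::-1]

print(as_list)    # a list of one-character strings
print(as_string)  # CTAAGTAAGCTCTTCCGG
```

If you do end up with a list, ''.join(as_list) turns it back into a single string.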

3
On

It seems like you have taken some code and tried to use it without at all understanding what it does. If you read the linked question, you'll notice that the poster in that question had a dictionary of amino acid code strings separated by |. The call to split was to extract the second part of each code string, e.g. from "F|Phe" you want to get "Phe", and that's why that poster needed the split. You don't have those sorts of strings so you shouldn't be using that part of the code.

I will second joaquin's recommendation to use BioPython, as it's clearly the right tool for the job, but for learning purposes: the first thing you need to know is that you have four tasks to accomplish:

  1. Compute the reverse complement of the DNA base sequence
  2. Break the reverse complementary sequence into groups of 3 bases
  3. Convert each group into an amino acid code
  4. Put the amino acid codes together into a string

The code in the linked answer doesn't handle the first step. For that you can use the translate method of Python string objects. First you use maketrans to produce a translation dictionary that will map key => value,

basecomplement = str.maketrans({'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'})

and then you can write a function to produce the reverse complement,

def reverse_complement(seq):
    return seq.translate(basecomplement)[::-1]
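
As a quick check, here is a self-contained sketch of that function applied to the sequence from the question. The two-string form of maketrans is just an alternative spelling I have assumed for brevity, and the 'N' example is my own illustration of the fact that characters missing from the table pass through translate unchanged.

```python
# Two-string form of str.maketrans: the i-th character of the first
# string maps to the i-th character of the second.
basecomplement = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    # complement every base, then reverse the resulting string
    return seq.translate(basecomplement)[::-1]

print(reverse_complement("CCGGAAGAGCTTACTTAG"))  # CTAAGTAAGCTCTTCCGG

# Characters that are not in the table (here the ambiguity code N)
# are left as they are.
print(reverse_complement("ACGTN"))  # NACGT
```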

The translate method of joaquin's answer on the other question implements steps 2 and 3. It can actually be done more efficiently using the grouper recipe from itertools. First you will need a dictionary mapping base triplets to amino acids,

amino_acids = {'TAT': 'Tyr', ...}

and you can then use this to convert any sequence of bases,

amino_acids[''.join(a)] for a in zip(*([iter(rseq)]*3))

By way of explanation, zip(*([iter(rseq)]*3)) groups the characters three at a time. But it does so as tuples, not strings, e.g. for 'TATATA' you'd get ('T', 'A', 'T'), ('A', 'T', 'A'), so you need to join each tuple to make a string. That's what ''.join(a) does. Then you look up the string in the amino acid table, which is done by amino_acids[...].
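
Here is a small sketch of just the grouping step, so you can see what it yields. The helper name codons is my own, and the trailing 'GG' in the last call is there only to show that zip silently drops an incomplete final group.

```python
def codons(seq):
    # One shared iterator repeated three times: zip pulls three
    # consecutive characters from it for every tuple it builds.
    triplets = zip(*[iter(seq)] * 3)
    # Join each tuple of characters back into a three-letter string.
    return [''.join(t) for t in triplets]

print(list(zip(*[iter('TATATA')] * 3)))  # [('T', 'A', 'T'), ('A', 'T', 'A')]
print(codons('TATATA'))                  # ['TAT', 'ATA']
print(codons('TATATAGG'))                # ['TAT', 'ATA']
```

Dropping the leftover bases is what you want here, since one or two trailing bases do not form a codon.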

Finally you need to join all the resulting amino acid codes together, which can be done by an outer ''.join(...). So you could define a function like this:

def to_amino_acids(seq):
    # group the bases three at a time, look up each triplet, and join the codes
    return ''.join(amino_acids[''.join(a)] for a in zip(*([iter(seq)]*3)))

Note that you don't need .split('|') unless your amino_acids dictionary contains multiple representations separated by |.

Finally, to do this for the three different possible ways of converting the bases to amino acids, i.e. the three frames, you would use something akin to the final loop in joaquin's answer,

rseq = reverse_complement(seq)
for frame in range(3):
    # print the frame number, e.g. "+1"
    print('+' + str(frame + 1), end=' ')
    # translate the base sequence to amino acids and print it
    print(to_amino_acids(rseq[frame:]))

Note that this loop runs three times, to print the three different frames. There's no point in having a loop if you were just going to have it run once.
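
Putting the pieces together, a complete run might look like the sketch below. Two things in it are my assumptions rather than part of the answer above: the codon table, which I build from the standard genetic code because the dictionary above is elided, and the use of 'Stp' for every stop codon (the expected output in the question writes one of them as 'Oc*').

```python
from itertools import product

# Assumption: standard genetic code, one letter per codon, with the
# 64 codons listed in T, C, A, G order (third base varying fastest).
BASES = "TCAG"
ONE_LETTER = ("FFLLSSSSYY**CC*W"
              "LLLLPPPPHHQQRRRR"
              "IIIMTTTTNNKKSSRR"
              "VVVVAAAADDEEGGGG")

# Assumption: every stop codon ('*') is written as 'Stp'.
THREE_LETTER = {
    'A': 'Ala', 'R': 'Arg', 'N': 'Asn', 'D': 'Asp', 'C': 'Cys',
    'Q': 'Gln', 'E': 'Glu', 'G': 'Gly', 'H': 'His', 'I': 'Ile',
    'L': 'Leu', 'K': 'Lys', 'M': 'Met', 'F': 'Phe', 'P': 'Pro',
    'S': 'Ser', 'T': 'Thr', 'W': 'Trp', 'Y': 'Tyr', 'V': 'Val',
    '*': 'Stp',
}

# Map each base triplet, e.g. 'TAT', to its three-letter code.
amino_acids = {
    ''.join(codon): THREE_LETTER[letter]
    for codon, letter in zip(product(BASES, repeat=3), ONE_LETTER)
}

basecomplement = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    # complement every base, then reverse the resulting string
    return seq.translate(basecomplement)[::-1]

def to_amino_acids(seq):
    # group bases three at a time; an incomplete last group is dropped
    return ''.join(amino_acids[''.join(a)] for a in zip(*[iter(seq)] * 3))

# The first (normal) sequence from the question.
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
rseq = reverse_complement(seq)

for frame in range(3):
    # print the frame label followed by the translated sequence
    print('+' + str(frame + 1), to_amino_acids(rseq[frame:]))
```

With the question's first sequence as input, frames +1 and +3 should come out the same as the lines in the question, and frame +2 differs only in how the stop codon is spelled.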