Algorithm Options for Gene DNA Sequence Optimization? (related to TSP, dynamic programming)

422 Views Asked by At

The below question is applied specifically for a biotech application but may illustrate general principles for similar problems in other fields. It is an NP-hard problem that can be related to the Traveling Salesman Problem and I am curious what algorithms can be used to arrive at solutions.

Brief Bio background: Proteins are composed of 20 amino acids. DNA is composed of 4 bases - A, C, G, T. The DNA sequence for the protein determines the sequence of amino acids - each consecutive sequence of 3 DNA bases (the unit is called a codon) codes for one amino acid. A single amino acid can be coded by multiple codons, for example Valine has 4 ways of being coded.

Not all codons are made equal - some of them are processed faster than others. Also, not all codon PAIRS are made equal - some pairs are slower than others.

This means that for a particular gene of 100 amino acids (300 DNA bases), there are many ways of coding the same amino acid sequence, but with very different properties, such as processing speed.

Given a table of codon pairs with corresponding speed values, we'd like to write an algorithm that can output a sequence of desired speed, e.g. the fastest and slowest possible sequences, and a gradient in between. The input is a DNA sequence coding for a gene, and a dictionary of codon pairs and their respective speed score (-1 to 1). The output is the optimized DNA sequence and its whole speed score (which can be represented as a sum of all codon pair scores). The amino acid sequence must remain the same.

Example: if we have sequence AAATTTGGG coding for 3 amino acids, and we have the codon pairs with scores:

AAATTT = -0.5

TTTGGG = -0.5

then this sequence might have a score of -1.

Now if we also have the pair alternatives we can assess different possibilities:

AAATTG = -0.7 AAATTC = -0.3

TTGGGC = +0.2 TTCGGA = -1.0

One would find that the optimum sequence based on this info is AAATTCGGA since it gives an overall value of -1.3.

The complexity of this problem of course lies in the influence of a codon pair on all surrounding codon pairs.

The full codon pair chart would have 61*61 entries (because 3 codons stop the reading of the gene).

====

Questions

I believe this is an NP-hard problem and has relations to the TSP. I've seen one approach use a simulated annealing algorithm. I'm curious whether there are other insightful ways of considering this problem and corresponding algorithms and heuristics to produce desired outputs.

If dynamic programming, what approaches would be appropriate?

Also, how can we use the algorithm to create a gradient of speed scores, instead of just maximum and minimum values?

2

There are 2 best solutions below

0
On

Getting the fastest (or slowest) encoding for a particular sequence should be a straightforward dynamic programming problem. Think of each 3 amino acid group as a character in an 64-character alphabet. Then at each point in the resulting string you will have a choice of a few letters which produce the desired amino acid, and you want to maximise (minimise) the sum associated with each adjacent pair of letters.

Working from left to right, at each point, for each possible character, work out the maximum (minimum) sequence that terminates in that character. There are at most 64 possibilities for each character. Given the solutions for characters up to n, you can use that answers here to work out the solutions for characters up to n+1 - just take the score for characters n and n+1 and add this on to the best answer already calculated for characters up to n - keep the best answers you find for each possible character at n+1.

To get answers with intermediate scores, one way would be to just generate random answers which produce the right amino acid sequence. Another way would be to mix portions of the answer which produces the maximum score with the answer that produces the minimum score.

0
On

Using a genetic algorithm you should be able to get sequences which achieve the desired objective. Say your goal is a speed of x, you can create a population of genes - each coding for the same gene but encoded by different codons. Then select, mate, and mutate for several generations until x is achieved (or close enough). The element of mutation/recombination would have to be at the codon level (as opposed to nucleotide level). To achieve a series of sequences with different speeds, run the algorithm multiple times with the different objective x.