The below question is applied specifically for a biotech application but may illustrate general principles for similar problems in other fields. It is an NP-hard problem that can be related to the Traveling Salesman Problem and I am curious what algorithms can be used to arrive at solutions.
Brief Bio background: Proteins are composed of 20 amino acids. DNA is composed of 4 bases - A, C, G, T. The DNA sequence for the protein determines the sequence of amino acids - each consecutive sequence of 3 DNA bases (the unit is called a codon) codes for one amino acid. A single amino acid can be coded by multiple codons, for example Valine has 4 ways of being coded.
Not all codons are made equal - some of them are processed faster than others. Also, not all codon PAIRS are made equal - some pairs are slower than others.
This means that for a particular gene of 100 amino acids (300 DNA bases), there are many ways of coding the same amino acid sequence, but with very different properties, such as processing speed.
Given a table of codon pairs with corresponding speed values, we'd like to write an algorithm that can output a sequence of desired speed, e.g. the fastest and slowest possible sequences, and a gradient in between. The input is a DNA sequence coding for a gene, and a dictionary of codon pairs and their respective speed score (-1 to 1). The output is the optimized DNA sequence and its whole speed score (which can be represented as a sum of all codon pair scores). The amino acid sequence must remain the same.
Example: if we have sequence AAATTTGGG coding for 3 amino acids, and we have the codon pairs with scores:
AAATTT = -0.5
TTTGGG = -0.5
then this sequence might have a score of -1.
Now if we also have the pair alternatives we can assess different possibilities:
AAATTG = -0.7 AAATTC = -0.3
TTGGGC = +0.2 TTCGGA = -1.0
One would find that the optimum sequence based on this info is AAATTCGGA since it gives an overall value of -1.3.
The complexity of this problem of course lies in the influence of a codon pair on all surrounding codon pairs.
The full codon pair chart would have 61*61 entries (because 3 codons stop the reading of the gene).
====
Questions
I believe this is an NP-hard problem and has relations to the TSP. I've seen one approach use a simulated annealing algorithm. I'm curious whether there are other insightful ways of considering this problem and corresponding algorithms and heuristics to produce desired outputs.
If dynamic programming, what approaches would be appropriate?
Also, how can we use the algorithm to create a gradient of speed scores, instead of just maximum and minimum values?
Using a genetic algorithm you should be able to get sequences which achieve the desired objective. Say your goal is a speed of x, you can create a population of genes - each coding for the same gene but encoded by different codons. Then select, mate, and mutate for several generations until x is achieved (or close enough). The element of mutation/recombination would have to be at the codon level (as opposed to nucleotide level). To achieve a series of sequences with different speeds, run the algorithm multiple times with the different objective x.