I am calculating log-odds scores of sequences and returning the motif (small section of sequence) that gives the maximum score. I have code that calculates the maximum score for each sequence in my file, and I am having trouble storing the motif that gives that score. See my other post(s) for file format, general calculation of log-odds scores, etc Perl: Creating and manipulating hash of arrays for log-odds scores of DNA sequences. My code is as follows:
use strict;
use warnings;
use List::Util 'max';
use Data::Dumper;
#USER SPECIFICATIONS
#User specifies motif width
my $width = 3;
#User enters the filename that contains the sequence data
print "Please enter the filename of the fasta sequence data: ";
my $filename1 = <STDIN>;
#Remove newline from file
chomp $filename1;
#Open the file and store each dna seq in hash
my %id2seq = ();
my %HoA = ();
my %loscore = ();
my %maxscore = ();
my %maxmot = ();
my $id = '';
open (FILE, '<', $filename1) or die "Cannot open $filename1.",$!;
my $dna;
while (<FILE>)
{
if($_ =~ /^>(.+)/)
{
$id = $1; #Stores 'Sequence 1' as the first $id, for example
}
else
{
$HoA{$id} = [ split(//) ]; #Splits the contents to allow for position reference later
$id2seq{$id} .= $_; #Creates a hash with each seq associated to an id number
$maxmot{$id} = (); #Creates empty hash to push motifs to
foreach $id (keys %HoA)
{
for my $len (0..(length($HoA{$id})-$width-1))
{
push @{ $loscore{$id} }, 0;
}
}
push @{ $maxscore{$id} }, -30; #Creates a HoA with each id number to have a maxscore (initial score -30)
}
}
close FILE;
foreach $id (keys %id2seq)
{
print "$id2seq{$id}\n\n";
}
print "\n\n";
#EXPECTATION
#Create log-odds table of motifs
my %logodds;
$logodds{'A'}[0] = 0.1;
$logodds{'A'}[1] = 0.2;
$logodds{'A'}[2] = 0.3;
$logodds{'C'}[0] = 0.2;
$logodds{'C'}[1] = 0.5;
$logodds{'C'}[2] = 0.2;
$logodds{'G'}[0] = 0.3;
$logodds{'G'}[1] = 0.2;
$logodds{'G'}[2] = 0.4;
$logodds{'T'}[0] = 0.4;
$logodds{'T'}[1] = 0.1;
$logodds{'T'}[2] = 0.1;
#MAXIMIZATION
#Determine location for each sequence that maximally
#aligns to the motif pattern
foreach $id (keys %HoA)
{
for my $pos1 (0..length($HoA{$id})-$width-1) #Look through all positions the motif can start at
{
for my $pos2 ($pos1..$pos1+($width-1)) #Define the positions within the motif (0 to width-1)
{
for my $base (qw( A C G T))
{
if ($HoA{$id}[$pos2] eq $base) #If the character matches a base:
{
for my $pos3 (0..$width-1) #Used for position reference in %logodds
{
#Calculate the log-odds score at each location
$loscore{$id}[$pos2] += $logodds{$base}[$pos3];
#Calculate the maximum log-odds score for each sequence
#Find the motif that gives the maximum score for each sequence
$maxscore{$id} = max( @{ $loscore{$id} });
if ($loscore{$id}[$pos2] == $maxscore{$id})
{
push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; #NOT SURE THAT THIS IS THE CORRECT WAY TO PUSH WHAT I WANT
}
}
}
}
}
}
}
print "\n\n";
print Dumper(\%maxmot);
The expected output for the %maxmot should be something like this:
'Sequence 11' => [ 'C', 'A', 'T'],
'Sequence 14' => ['C', 'T', 'G'], etc.
There should be three bases in the expected output because the $width = 3. The output I get gives me multiples of each base, not in any noticeable order (or at least, I cannot notice a pattern):
'Sequence 11' => [ 'C', 'C', 'C', 'A', 'A', 'A', 'A', 'T', 'A', 'A', 'T', 'T', 'T'],
'Sequence 14' => ['C', 'C', 'T', 'T', 'C', 'C', 'T', 'T', 'T', 'T', 'T'], etc.
I believe the issue is rooted in the push @{ $maxmot{$id} }, $HoA{$id}[$pos3]; step, but I'm not quite sure how to fix it. I have tried pushing $HoA{$id}[$pos2] instead, but that does not seem to fix my problem either. As always, any and all help is appreciated! I can clarify if needed, I know my question is a little convoluted. Thank you in advance.
The
push()is not the problem. From running your code it becomes obvious that the conditional$loscore{$id}[$pos2] == $maxscore{$id}istruemore often than you expect it.Here are some questions I would ask in a code review:
length()on an array? It does not return the length of the array.for my $base (qw( A C G T)) { if ($HoA{$id}[$pos2] eq $base) {...just an inefficient way of the equivalentmy $base = $HoA{$id}[$pos2];?$pos2is executed$pos2 + 1times, i.e. once for0, twice for1, ... i.e. later positions get a higher score.$loscore{$id}[$pos2]is the sum of@{ $logodds{$base} }, i.e. the base at position$pos2 + $pos3is ignored for the calculation$maxscore{$id}while running over the sequences and then use the changing value in the conditional$widthbases long, but your code only stores single bases into%maxmotI'm making an educated guess and propose that the following is the correct algorithm. I'm using the 3 sequences you have given in your previous question. I'm also dumping the other 2 hashes, so that the calculation results become visible.
I took the liberty to rewrite your code to be more concise and clear. But you should be able to trace back the lines in the new code to your original code.
Test run:
This looks much closer to your expected output. But of course I could be completely off with my guesses...
If I understand the logscore calculation correctly, then the score per motif is a constant and hence can be pre-calculated. Which would lead to the following more straightforward approach:
Test run:
As the logscore per motif is a constant, the motif list sorted by logscore order will also be a constant. Given that list we will only have to find the first motif that matches on a given sequence. Hence we can apply the highly optimized regular expression engine on the problem. Depending on your actual problem size this will probably be the more efficient solution: