Finding longest common substring between DNA sequences (Python)

74 Views Asked by At

I am working on the "Finding a shared motif" problem from rosalind.info (https://rosalind.info/problems/lcsm/). This problem asks you to find the longest common substring in a collection of <100 DNA strings with length ~1000 nucleotides. The code I have written seems to return the correct result on any test data that I can generate on my own, but seems to consistently fail the actual check by rosalind. I am not sure what the actual problem is, and I think my code has ended up somewhat messy after repeated attempts to fix it by editing single lines. I used three functions in total to solve the problem, and I suspect that the problem is with the "check" function. Furthermore, the final function returns "None" rather than the expected value of newkeys[-1]

def readfile(filepath):
    with open(filepath,'r') as f:
        return [l.strip() for l in f.readlines()]

def check(key,lst):
    while True:
        for line in lst:
            if key not in line:
                return False
        else:
            return True


def Sharedmotif(filepath):
    Fastafile = readfile(filepath)
    Fastadict = {}
    Fastalabel = ''
    for line in Fastafile:
        if '>' in line:
            Fastalabel = line
            Fastadict[Fastalabel] = ''
        else:
            Fastadict[Fastalabel] += line
    Fastalist = list(Fastadict.values())
    #Actual solution
    alphabet=['A','C','T','G']
    tuples=list(itertools.combinations_with_replacement(alphabet,3))
    keys=[''.join(item) for item in tuples]
    newkeys=[]
    merged_list = []
    i=1
    for i in range(100):
        for key in keys:
            if check(key,Fastalist):
                newkeys.append(key)
                newkeys=newkeys[-4:]
        for key in newkeys:
            Tadd=[key+'T']
            Cadd=[key+'C']
            Aadd=[key+'A']
            Gadd=[key+'G']
            allkeys=list(Tadd+Cadd+Aadd+Gadd)
            for minis in allkeys:
                merged_list.append(minis)
        keys=merged_list
        length=len(newkeys[-1])
        i+=1
        print(newkeys)
        if i>=length:
            return newkeys[-1]`

1

There are 1 best solutions below

0
Younes Abuelayyan On

I tried to this problem before and here is ho I solved it, I think the best way of parsing fasta files is to use BioPython. I usually use argparse to get the input but you can use any way you like.

#!/usr/bin/env python3
import argparse
from typing import NamedTuple, TextIO
from Bio import SeqIO


class Args(NamedTuple):
    file: TextIO


def get_args() -> Args:
    parser = argparse.ArgumentParser(description='longest substring', formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('file', metavar='FASTA FILE', type=argparse.FileType('rt'), help='Path to the FASTA file')

    args = parser.parse_args()

    return Args(args.file)


def main():
    args = get_args()
    seqs = SeqIO.parse(args.file, 'fasta')
    seqs_list = [str(rec.seq) for rec in seqs]
    min_len = len(min(seqs_list))

    shared_kmers = set()
    for k in range(min_len, 0, -1):
        list_of_kmers = []
        with open(args.file.name, 'rt') as file:
            seqs = SeqIO.parse(file, 'fasta')
            for rec in seqs:
                list_of_kmers.append(find_kmers(str(rec.seq), k))
        shared = set.intersection(*map(set, list_of_kmers))
        if shared:
            shared_kmers.update(shared)
            break

    print(shared_kmers)


def find_kmers(seqs, k: int):
    n = len(seqs) - k + 1
    return [] if n < 1 else [seqs[i:i + k] for i in range(n)]


if __name__ == '__main__':
    main()