I am working on the "Finding a shared motif" problem from rosalind.info (https://rosalind.info/problems/lcsm/). This problem asks you to find the longest common substring in a collection of <100 DNA strings with length ~1000 nucleotides. The code I have written seems to return the correct result on any test data that I can generate on my own, but seems to consistently fail the actual check by rosalind. I am not sure what the actual problem is, and I think my code has ended up somewhat messy after repeated attempts to fix it by editing single lines. I used three functions in total to solve the problem, and I suspect that the problem is with the "check" function. Furthermore, the final function returns "None" rather than the expected value of newkeys[-1]
def readfile(filepath):
with open(filepath,'r') as f:
return [l.strip() for l in f.readlines()]
def check(key,lst):
while True:
for line in lst:
if key not in line:
return False
else:
return True
def Sharedmotif(filepath):
Fastafile = readfile(filepath)
Fastadict = {}
Fastalabel = ''
for line in Fastafile:
if '>' in line:
Fastalabel = line
Fastadict[Fastalabel] = ''
else:
Fastadict[Fastalabel] += line
Fastalist = list(Fastadict.values())
#Actual solution
alphabet=['A','C','T','G']
tuples=list(itertools.combinations_with_replacement(alphabet,3))
keys=[''.join(item) for item in tuples]
newkeys=[]
merged_list = []
i=1
for i in range(100):
for key in keys:
if check(key,Fastalist):
newkeys.append(key)
newkeys=newkeys[-4:]
for key in newkeys:
Tadd=[key+'T']
Cadd=[key+'C']
Aadd=[key+'A']
Gadd=[key+'G']
allkeys=list(Tadd+Cadd+Aadd+Gadd)
for minis in allkeys:
merged_list.append(minis)
keys=merged_list
length=len(newkeys[-1])
i+=1
print(newkeys)
if i>=length:
return newkeys[-1]`
I tried to this problem before and here is ho I solved it, I think the best way of parsing fasta files is to use BioPython. I usually use argparse to get the input but you can use any way you like.