CS50 Week 6 DNA program incorrectly identifies DNA sequence

My code kind of works, except it's selective about what it works with. It gives the correct name for one specific sequence, but it gets others wrong.

For example, it will correctly identify that a strand belongs to Bob, but will match a supposed "No Match" strand with "Charlie", who doesn't even exist in the list cs50 gave us.

It's really weird. I've checked my code against other people's and it seems mostly similar. I don't know why this is happening, so any help would be appreciated.

import csv
import sys

def main():

    # TODO: Check for command-line usage
    if len(sys.argv) != 3:
        sys.exit("Usage: python dna.py data.csv sequence.txt")

    # TODO: Read database file into a variable
    database = []

    with open(sys.argv[1], 'r') as file:
        reader = csv.DictReader(file)

        for row in reader:
            database.append(row)
 
    # TODO: Read DNA sequence file into a variable
    with open(sys.argv[2], 'r') as file:
        dna_sequence = file.read()

    # TODO: Find longest match of each STR in DNA sequence
    subsequences = list(database[0].keys())[1:]

    results = {}
    for subsequence in subsequences:
        match = 0
        results[subsequence] = longest_match(dna_sequence, subsequence)
        match += 1

    # TODO: Check database for matching profiles
    for person in database:
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1
        
            if match == len(subsequence):
                print(person["name"])
                return 

    print("No match")
    return


def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within
        #sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1
        
            # If there is no match in the substring
            else:
                break
    
        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run

main()

Answered by kcw78

Are you still working on this? If so, there are 2 databases and 20 sequences to test. (They are listed with correct answers at the end of the DNA PSET.) Which one gives you the error above? I suspect it is the 3rd test. It says: "Run your program as python dna.py databases/small.csv sequences/3.txt. Your program should output No match."

When I do this, your program outputs Charlie instead of No match.
The subsequences you need to check are: ['AGATC', 'AATG', 'TATC']
Your subsequence count is: {'AGATC': 3, 'AATG': 3, 'TATC': 5}
That doesn't match anyone in the small.csv file.
Charlie is close, but his DNA subsequence count is: ('AGATC', '3'), ('AATG', '2'), ('TATC', '5')

The error occurs when you compare each person to the subsequence counts. There are 3 things to fix:

  1. The value of match is set in the previous loop (for subsequence in subsequences:). It needs to be set in the for person in database: loop.
  2. The indentation of the test on match needs to be modified. (It is currently inside the 2nd for subsequence in subsequences: loop.)
  3. You are testing match against len(subsequence). Think about it....
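
To see how these three issues combine into the "Charlie" output, the comparison logic from the question can be replayed on hard-coded values, with no input files needed. This is only a sketch: the function name buggy_lookup is made up for illustration, the counts and Charlie's row come from the output quoted above, and Alice's and Bob's rows are assumed from the problem set's small.csv, so check them against your copy.

```python
# Rows as strings, the way csv.DictReader would return them.
# Charlie's row is quoted in the answer; Alice's and Bob's rows are an
# assumption about the contents of small.csv.
database = [
    {"name": "Alice", "AGATC": "2", "AATG": "8", "TATC": "3"},
    {"name": "Bob", "AGATC": "4", "AATG": "1", "TATC": "5"},
    {"name": "Charlie", "AGATC": "3", "AATG": "2", "TATC": "5"},
]

# Longest-run counts reported for sequences/3.txt in the answer.
results = {"AGATC": 3, "AATG": 3, "TATC": 5}


def buggy_lookup(database, results):
    """Replays the comparison logic from the question (hypothetical helper)."""
    subsequences = list(database[0].keys())[1:]

    # Mirrors the question's first loop: match is 1 when this loop ends.
    for subsequence in subsequences:
        match = 0
        match += 1

    for person in database:
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1  # never reset, so hits accumulate across people

            # Compares against the length of the STR string (4 for "TATC"),
            # not against the number of STRs.
            if match == len(subsequence):
                return person["name"]

    return "No match"


print(buggy_lookup(database, results))  # prints Charlie
```

With these values, match starts at 1, picks up one hit from Bob (TATC) and two from Charlie (AGATC and TATC), and reaches 4 on "TATC", whose string length also happens to be 4.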

I made those changes and it works for all 4 small.csv tests and the 3 large.csv tests I tried.
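
For reference, here is one way the three changes might look when applied to the comparison step. It is a minimal sketch rather than the exact code I ran: find_match is a hypothetical helper name, and the sample data is limited to Charlie's row and the counts quoted above.

```python
def find_match(database, results):
    """Returns the name of the person whose STR counts all match, else "No match"."""
    subsequences = list(database[0].keys())[1:]

    for person in database:
        # Change 1: reset the counter for every person.
        match = 0

        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1

        # Change 2: test only after all STRs have been compared.
        # Change 3: compare against the number of STRs, not the length of one.
        if match == len(subsequences):
            return person["name"]

    return "No match"


# Charlie's row as quoted above (values are strings, as csv.DictReader gives them).
database = [{"name": "Charlie", "AGATC": "3", "AATG": "2", "TATC": "5"}]

# Counts reported for sequences/3.txt: AATG differs from Charlie's, so no match.
print(find_match(database, {"AGATC": 3, "AATG": 3, "TATC": 5}))  # prints No match

# Counts equal to Charlie's row on every STR.
print(find_match(database, {"AGATC": 3, "AATG": 2, "TATC": 5}))  # prints Charlie
```

In dna.py itself, main() would print the returned name instead of returning it. The counter could also be dropped entirely by using all() over the STRs, but keeping match stays closer to the original code.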