Simple DNA Sequence Matching Always Returning: no match found


I'm currently working through CS50. For the current assignment we are given a CSV with a column of people followed by columns such as aatg and ccag, each holding the number of back-to-back repetitions of that DNA subsequence. We are also given a DNA sequence of letters. I feel pretty confident in my code (though I may have needlessly complicated things). At the end, it compares each person's maximum repetitions of a subsequence from the database with the maximum repetitions found in the DNA sequence, and the comparison always comes out as not matching.

I've debugged it and all my variables hold what I expect, and the comparison is looking at the correct two dictionary values. But the != check always evaluates to true, even though it shouldn't when both values are equal. Where's my logic error?

Sorry if this is over-explained, but I figured that's better than under-explaining. I've stepped through it with a debugger as well and it shed no light. The two values should compare as equal, but they don't.

Also, assume my longest_match function is correct. It correctly returns the longest run of a subsequence as an int, which is stored like agtc: '4' (meaning four agtc's in a row). So runs is a list of dicts ([{agtc: 4}, {ttac: 8}, ...]). The code then compares runs[i - 1][key] with bob[key] (which should also be an int representing the longest run of that subsequence). The != comparison is near the end of main().
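One way to check the assumption that both sides of the comparison are ints is to print their types. The sketch below is not the code from this post: the one-row CSV text and the `runs` values are made up for illustration, and `io.StringIO` stands in for the real file. It relies on the documented behaviour that `csv.DictReader` returns every field as a string.

```python
import csv
import io

# Hypothetical one-row database, in the same layout as the sample CSV.
sample = "name,AGATC,AATG,TATC\nBob,4,1,5\n"
row = next(csv.DictReader(io.StringIO(sample)))

# Hypothetical result of the longest_match step: ints, as described above.
runs = [{"AGATC": 4}, {"AATG": 1}, {"TATC": 5}]

db_value = row["AGATC"]        # comes from the CSV
run_value = runs[0]["AGATC"]   # comes from longest_match

print(type(db_value).__name__, type(run_value).__name__)  # str int
print(db_value != run_value)        # True: '4' and 4 are never equal
print(int(db_value) != run_value)   # False once both sides are ints
```

In a debugger the two values can look identical apart from the quotes, so `'4'` versus `4` is easy to miss.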

import csv
import sys


def main():

    # Check for command-line usage
    if len(sys.argv) != 3:
        print("Usage: python python.py ____.csv ___.csv")
        sys.exit(1)
    # Read database file into a variable
    databaseName = sys.argv[1]
    DNASequence = sys.argv[2]

    with open(databaseName, 'r') as file:
        reader = csv.DictReader(file)
        people = []
        subSeq = reader.fieldnames


        for row in reader:
            people.append(row)

    # Read DNA sequence file into a variable
    with open(DNASequence, 'r') as file:
        sequence = file.read()
    # Find longest match of each STR in DNA sequence
    runs = []

    for i in range(1, len(subSeq)):
        sub = {subSeq[i]: longest_match(sequence, subSeq[i])}
        runs.append(sub)

    # Really proud of this one ^^^
    # Check database for matching profiles
    for dict1 in people:
        match = True

        for i in range(1, len(subSeq)):
            current_key = subSeq[i]

            if dict1[current_key] != runs[i - 1][current_key]:
                match = False

        if match == True:
            print(f"Match found: {dict1[subSeq[0]]}")

    if match == False:
        print("No match found.")

    return

def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1

            # If there is no match in the substring
            else:
                break

        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run


main()

CSV sample file:

name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5

DNA sequence:

AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG

With these two files, Bob should be the resulting match.
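For reference, here is a minimal, self-contained sketch of the matching step run against the sample data above. It is a rewrite under stated assumptions, not the original program: the files are replaced by in-memory strings, `find_match` is a hypothetical helper, `longest_match` is a condensed version of the function in the post, and the CSV counts are converted with `int()` before comparing. It also returns as soon as one person matches, so "No match found." is reported only when nobody matched, rather than depending on the last row checked.

```python
import csv
import io

CSV_TEXT = """name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
"""

SEQUENCE = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG"


def longest_match(sequence, subsequence):
    """Return the length of the longest consecutive run of subsequence."""
    longest = 0
    n = len(subsequence)
    for i in range(len(sequence)):
        count = 0
        # Keep stepping forward one subsequence at a time while it matches.
        while sequence[i + count * n : i + (count + 1) * n] == subsequence:
            count += 1
        longest = max(longest, count)
    return longest


def find_match(csv_text, sequence):
    """Return the first matching name, or None (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    strs = reader.fieldnames[1:]  # every column except the name
    runs = {s: longest_match(sequence, s) for s in strs}
    for person in reader:
        # DictReader yields strings, so convert before comparing to ints.
        if all(int(person[s]) == runs[s] for s in strs):
            return person["name"]
    return None


result = find_match(CSV_TEXT, SEQUENCE)
print(f"Match found: {result}" if result else "No match found.")
```

With the sample data this reports Bob, which agrees with the expected result stated above.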
