Frequent words in Python

How can I write code to find the most frequent 2-mer of "GATCCAGATCCCCATAC"? I have written the code below, but it does not do what I want. Please help me correct it.

def PatternCount(Pattern, Text):
    count = 0
    for i in range(len(Text)-len(Pattern)+1):
        if Text[i:i+len(Pattern)] == Pattern:
            count = count+1
    return count

This code counts the occurrences of a given pattern in a string, but it doesn't give me the most frequent 2-mer of the given string.

There are 3 best solutions below

4
On

If you want a simple approach, consider a sliding window technique. An implementation is available in more_itertools, so you don't have to write one yourself. It is easy to use once you pip install more_itertools.

Simple Example

>>> from collections import Counter
>>> import more_itertools

>>> s = "GATCCAGATCCCCATAC"
>>> Counter(more_itertools.windowed(s, 2))
Counter({('A', 'C'): 1,
         ('A', 'G'): 1,
         ('A', 'T'): 3,
         ('C', 'A'): 2,
         ('C', 'C'): 4,
         ('G', 'A'): 2,
         ('T', 'A'): 1,
         ('T', 'C'): 2})

The example above shows how little is required to get most of the information you want using windowed and Counter.

Description

A "window" of length k=2 slides across the sequence one position at a time (step=1). Each new group becomes a key in the Counter dictionary, and every further occurrence increments its tally. The final Counter object reports all tallies and offers other helpful features, such as most_common.
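To make the sliding window concrete, here is a minimal sketch of the same idea without more_itertools. The helper name sliding_windows is made up for illustration and is not part of any library; it slices the string directly instead of building tuples.

```python
from collections import Counter

def sliding_windows(seq, k):
    """Yield every length-k slice of seq, advancing one position at a time."""
    for start in range(len(seq) - k + 1):
        yield seq[start:start + k]

s = "GATCCAGATCCCCATAC"
counts = Counter(sliding_windows(s, 2))
print(counts["CC"])          # tally for the 2-mer "CC"
print(sum(counts.values()))  # total number of windows, len(s) - k + 1
```

A string of length 17 yields 16 overlapping windows of length 2, which matches the sum of the tallies shown above.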

Final Solution

If actual string pairs are important, that is simple too. We will make a general function that joins the characters of each window into a string and works for any k-mer:

>>> from collections import Counter
>>> import more_itertools

>>> def count_mers(seq, k=1):
...     """Return a counter of adjacent mers."""
...     return Counter(("".join(mers) for mers in more_itertools.windowed(seq, k)))

>>> s = "GATCCAGATCCCCATAC"
>>> count_mers(s, k=2)
Counter({'AC': 1,
         'AG': 1,
         'AT': 3,
         'CA': 2,
         'CC': 4,
         'GA': 2,
         'TA': 1,
         'TC': 2})
7
On

In general, when I want to count things in Python, I use a Counter:

from itertools import tee
from collections import Counter

dna = "GATCCAGATCCCCATAC"
a, b = tee(iter(dna), 2)
_ = next(b)
c = Counter(''.join(l) for l in zip(a,b))
print(c.most_common(1))

This prints [('CC', 4)]: a list holding the single most common 2-mer as a tuple together with its count in the string.
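For the 2-mer case specifically, the tee/next step can also be written by zipping the string with a copy of itself shifted by one character. This is only an equivalent sketch for strings (it relies on slicing, so it would not work on an arbitrary iterator); the variable name pairs is chosen here for illustration.

```python
from collections import Counter

dna = "GATCCAGATCCCCATAC"
# zip pairs each character with its right-hand neighbour;
# it stops when the shorter (shifted) string is exhausted.
pairs = Counter(a + b for a, b in zip(dna, dna[1:]))
print(pairs.most_common(1))  # [('CC', 4)]
```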

In fact, we can generalize this to find the most common n-mer for a given n.

from itertools import tee, islice
from collections import Counter

def nmer(dna, n):
    iters = tee(iter(dna), n)
    iters = [islice(it, i, None) for i, it in enumerate(iters)]
    c = Counter(''.join(l) for l in zip(*iters))
    return c.most_common(1)
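Note that most_common(1) returns a single entry even when several n-mers share the top count. If ties matter, one possible variant is sketched below; the function name most_frequent_kmers and its return shape are assumptions made for this example, and it uses plain slicing rather than tee/islice.

```python
from collections import Counter

def most_frequent_kmers(dna, k):
    """Return (sorted list of k-mers tied for the top count, that count)."""
    counts = Counter(dna[i:i + k] for i in range(len(dna) - k + 1))
    if not counts:  # k is larger than the sequence, so there are no windows
        return [], 0
    top = max(counts.values())
    return sorted(kmer for kmer, n in counts.items() if n == top), top

print(most_frequent_kmers("GATCCAGATCCCCATAC", 2))  # (['CC'], 4)
print(most_frequent_kmers("ACGTACGT", 2))           # (['AC', 'CG', 'GT'], 2)
```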
3
On

You can first define a function to get all the k-mers in your string:

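def get_all_k_mer(string, k=1):
    # Collect every overlapping substring of length k.
    length = len(string)
    # range works in Python 3; the original used Python 2's xrange.
    return [string[i:i + k] for i in range(length - k + 1)]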

Then you can use collections.Counter to count the occurrences of each k-mer:

>>> from collections import Counter
>>> s = 'GATCCAGATCCCCATAC'
>>> Counter(get_all_k_mer(s, k=2))

Output:

Counter({'AC': 1,
         'AG': 1,
         'AT': 3,
         'CA': 2,
         'CC': 4,
         'GA': 2,
         'TA': 1,
         'TC': 2})

Another example:

>>> s = "AAAAAA"
>>> Counter(get_all_k_mer(s, k=3))

Output:

Counter({'AAA': 4})
# Indeed : AAAAAA
           ^^^     -> 1st time
            ^^^    -> 2nd time
             ^^^   -> 3rd time
              ^^^  -> 4th time
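The diagram shows why overlapping matches matter. As a small sketch for comparison, the built-in str.count only counts non-overlapping occurrences, so it gives a different number for the same input; the variable names below are chosen for illustration.

```python
s = "AAAAAA"
k = 3

# Overlapping count: test every start position, as the sliding window does.
overlapping = sum(1 for i in range(len(s) - k + 1) if s[i:i + k] == "AAA")

# str.count skips past each match, so occurrences cannot overlap.
non_overlapping = s.count("AAA")

print(overlapping, non_overlapping)  # 4 2
```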