Fast algorithm for finding fewest k-mers that cover regions of string

239 Views Asked by At

I have a string representing the human genome, which is a quaternary code (alphabet = {A, C, T, G}):

str = 'ACTGGTGCGAGACTGCGGGAACACTGAGCGGGCGAGACT'

The problem is, by dividing the string into substring of size bin_size, I wish to find the fewest unique, non-overlapping k-mers that span some threshold T length for each substring.

For instance, for k=3, bin_size = 10, and T = 0.6, it takes only 2 unique, non-overlapping 3-mers ACT & GCG to span 6 characters of each substring of length 10:

ACTGGTGCGA GACTGCGGGA ACACTGAGCG GGCGAGACT

One approach I have in mind is to count the k-mers with a Bloom filter first, and check each substring for the most popular k-mers. But the non-overlapping criteria, and that I want to find the fewest unique k-mers, suggest this could be written as an optimization problem (somehow). Is there already an existing algorithm that solves something like this?

** EDIT: ** Practically speaking, k=16, bin_size = 1 000 000, and total length of string len(str) = 3 000 000 000.

The k-mers should, in practice, not overlap more than some overlap ie. overlap = 3 letters.

0

There are 0 best solutions below