I have a string representing the human genome, which is a quaternary code (alphabet = {A, C, T, G}
):
str = 'ACTGGTGCGAGACTGCGGGAACACTGAGCGGGCGAGACT'
The problem is, by dividing the string into substring of size bin_size
, I wish to find the fewest unique, non-overlapping k-mers that span some threshold T
length for each substring.
For instance, for k=3
, bin_size = 10
, and T = 0.6
, it takes only 2 unique, non-overlapping 3-mers ACT
& GCG
to span 6 characters of each substring of length 10:
ACTGGTGCGA GACTGCGGGA ACACTGAGCG GGCGAGACT
One approach I have in mind is to count the k-mers with a Bloom filter first, and check each substring for the most popular k-mers. But the non-overlapping criteria, and that I want to find the fewest unique k-mers, suggest this could be written as an optimization problem (somehow). Is there already an existing algorithm that solves something like this?
** EDIT: **
Practically speaking, k=16
, bin_size = 1 000 000
, and total length of string len(str) = 3 000 000 000
.
The k-mers should, in practice, not overlap more than some overlap ie. overlap = 3 letters
.