Create high nr of random sequences with min Edit Distance time efficient

132 Views Asked by At

I need to create a program/script for the creation of a high numbers of random sequences (20 letter long sequence based on 4 different letters) with a minimum edit distance between all sequences. "High" would here be a minimum of 100k sequences, but if possible up to 1 million.

I started with a naive approach of just generating random 20 letter sequences, and for each sequence, calculate the edit distance between the sequence and all other sequences already created and stored. If the new sequence pass my threshold value, store it, otherwise discard.

As you understand, this scales very badly for higher number of sequences. Up to 10k is reasonably fine, but trying to get 100k this starts to get troublesome.

I really only need to create the sequences once and store the output, so I'm really not that fussy about speed, but making 1 million at this rate today is just no possible.

Been trying to think of alternatives to speed up the process, like building the sequences is "blocks" of minimal ED and then combining, but haven't come up with any solution.

Wondering, do anyone have any smart idea/method that could be implemented to create such high number of sequences with minimal ED more time efficient?

Cheers, JB

1

There are 1 best solutions below

1
On

It seems, from wikipedia, that edit distance is one of three operations insertion, deletion, substitution; performed on a starting string. Why not systematically generate all strings up to N edits from a starting string then stop when you reach your limit?

There would be no need to check for the actual edit distance as they would be correct by generation. For randomness could you generate a number then shuffle them.