I am currently implementing the Mash algorithm for genome comparison. For this I need to create a sketch S(A) for each genome, which is a set of certain size s containing the lowest hash values corresponding to k-mers in the genome. To compare two genomes, I am computing the Jaccard index, for which I need an additional sketch of the union of the two genomes, i.e. S(A u B) for two genomes A and B. This should contain the s lowest hash values found in the union of A and B.
I am computing the sketches as a TreeSet because in the original algorithm to compute a sketch I need to remove the biggest value from the set whenever I add a new value that is lower and the sketch has already reached the maximum size s. This is very easily accomplished using TreeSet because the largest value will be in the last position of the set.
I have computed a union of two sketches and now want to remove the larger elements to reduce the sketch size to size s. I first implemented this using a while loop and removing the last element until reaching the desired size.
The following is my code using an example TreeSet and size s = 10:
SortedSet<Integer> example = new TreeSet<>();
for (int i = 0; i < 15; i++) {
example.add(i);
}
while (example.size() > 10) example.remove(example.last());
However, in the real application sketch sizes will be much larger and the size of the union can be up to two times the size of a single sketch. Trying to find a more efficient way to reduce the sketch size I found that you could convert the TreeSet to an array. Thus, my second approach would be the following:
Object[] temp = example.toArray();
int value = (int) temp[10];
example = example.headSet(value);
So, here I am getting the value at index s from the array, which I can then use to create a headSet from the TreeSet.
Now, I am wondering if there is a more efficient way to reduce the size of the TreeSet, where I don't need to iterate over the size of the TreeSet over and over again or generate an extra array.