I have an unsorted vector of doubles (Actually objects with a double member which is used in this case). From this vector I need to remove the smallest non-unique value. However, it is not guaranteed that a non-unique value exists. It is allowed to sort the range.
As always, I started by looking for an `std::algorithm` and found `std::unique`. My first idea was to use it in combination with `std::sort` to move all non-unique values to the end of the vector, and then run `min_element` over those non-unique values. However, `std::unique` leaves the elements past the returned iterator in an unspecified state, and indeed I lost all non-POD members.
Does anyone have a suggestion on how to do this efficiently? Efficiency matters, since the code sits in the bottleneck of the program (which is already a bit too slow).
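For reference, this is roughly the attempt described above (a minimal sketch; `Item` is a placeholder for my actual struct). The last step is where it breaks down, because the tail that `std::unique` leaves behind is in an unspecified state:

```cpp
#include <algorithm>
#include <vector>

struct Item {
    double value;   // the member used for comparison; the real struct has more
};

// The broken idea: sort, push duplicates "to the end" with unique, then take
// the minimum of the tail. Does NOT work; shown only to illustrate the problem.
double smallest_duplicate_attempt(std::vector<Item> v) {
    auto less  = [](const Item& a, const Item& b) { return a.value <  b.value; };
    auto equal = [](const Item& a, const Item& b) { return a.value == b.value; };

    std::sort(v.begin(), v.end(), less);

    // unique() compacts the distinct elements to the front; the elements in
    // [tail, v.end()) are left valid but unspecified (moved-from)...
    auto tail = std::unique(v.begin(), v.end(), equal);

    // ...so this min_element is meaningless -- this is where the non-POD
    // members were "lost".
    auto it = std::min_element(tail, v.end(), less);
    return it != v.end() ? it->value : 0.0;   // 0.0 standing in for "no duplicate"
}
```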
(I am adding an additional answer, since 1) the focus of the first answer was on using ready-made STL components, and 2) Howard Hinnant has raised some interesting points since.)
Thanks to Howard Hinnant for the principle of benchmarking the different approaches (as well as for a very original solution)! It has led to some stuff I personally find interesting (and don't entirely understand).
The execution of the benchmark, however, was somewhat problematic, IMHO.
The question stated that the problem was

> an unsorted vector of doubles (actually objects with a double member which is used in this case).

The test, however:

1. Performed the operation on `int`s, which is an advantage for the sorting-based mechanisms: while both `double` comparisons and hashes are more expensive than their `int` counterparts, the number of comparisons is Theta(n log(n)), while the number of hashes is O(n).
2. Took the body of my `main` function and wrapped it up in a function (rather than a class object), and did not use a pool allocator. Frankly, I consider this a flaw that renders the results meaningless, as it basically established the well-known fact that dynamic allocations + needless re-initializations of large containers are expensive.
3. Relied on the fact that the sorting algorithms could just return the `vector` on which they operated (which could not be done for the original problem). In the following I let this point slide, as the problem for a `vector` of `double`s is interesting in itself, but the OP should note that this might change things even more.

So, in order to deal with the second issue, I initially used the probing-based hash table from my own gcc libstdc++ `pb_ds` extension. This by itself reduced the running time of solution #1 below that of solution #2 (`sort` + `adjacent_find`), but it still remained more expensive than #3 (`make_heap`).

In order to reduce this further, I used the most degenerate form of "hash table" that seemed relevant. The data structure contains three `vector`s:

- a "status" `vector`, containing codes for 0, 1, and "many"
- a "values" `vector`, containing "hashed values"
- a "spillover" `vector`, for collisions

Objects with a "many" status are compared on the fly for the minimum. Collision objects (that is, those that have collided with other objects) are pushed to the "spillover" `vector`. The spillover `vector` is scrutinized for the lowest duplicate using the method from #2, and the result is compared to the lowest value found among the "many" entries.

Here is the code for the benchmark re-running this test, and here is the code producing the following graphs.
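To make that description concrete, here is a minimal sketch of the idea, operating on plain `double`s (this is not the exact benchmarked code: the table size, the hash, and the use of `std::optional` for "no duplicate" are simplifying assumptions):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Sketch of the "degenerate hash table": a status vector, a values vector,
// and a spillover vector for collisions.
std::optional<double> smallest_duplicate(const std::vector<double>& vals) {
    enum class Status : char { empty, single, many };

    const std::size_t table_size = 2 * vals.size() + 1;
    std::vector<Status> status(table_size, Status::empty);  // 0 / 1 / "many"
    std::vector<double> value(table_size);                   // value stored per slot
    std::vector<double> spillover;                            // collisions go here

    std::optional<double> best;  // running minimum over the "many" slots
    const auto update_best = [&](double v) {
        if (!best || v < *best) best = v;
    };

    for (double v : vals) {
        const std::size_t slot = std::hash<double>{}(v) % table_size;
        switch (status[slot]) {
            case Status::empty:                  // first value hashing to this slot
                status[slot] = Status::single;
                value[slot] = v;
                break;
            case Status::single:
            case Status::many:
                if (value[slot] == v) {          // a genuine duplicate
                    status[slot] = Status::many;
                    update_best(v);
                } else {                         // collision with a different value
                    spillover.push_back(v);
                }
                break;
        }
    }

    // Find the lowest duplicate among the collisions with the sort-based method (#2).
    std::sort(spillover.begin(), spillover.end());
    const auto it = std::adjacent_find(spillover.begin(), spillover.end());
    if (it != spillover.end()) update_best(*it);

    return best;  // empty if no duplicate exists
}
```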
(Reminder: #1 is hash-based, #2 is quicksort-based, and #3 is heap-based.)
Starting with the test performed by Howard Hinnant before (values are generated randomly from a range of size 1.5 times the number of values), here are the results:
So indeed his excellent heap-based algorithm performs best for this case, but the picture looks very different than before. In particular, the hash-based algorithm is not as abysmal once the cost of its memory allocations is taken out of the picture.
However, suppose we change to a completely random range of values. Then the results change to this:
In this case, the hash-based solution works best, the sort-based one next, and the heap-based solution works worst.
In order to verify the reason, here are two more tests.
Here is a test with completely random values + two zero values (that is, the lowest duplicate is zero):
and finally, here is the case where all the values are generated from 100 possible values (irrespective of the length):
What's happening is as follows. The heap-based solution is the most dependent of the three on the distribution. The `make_heap` step is linear time, and if a duplicate is encountered almost immediately after it, the whole thing turns out to be a linear algorithm (but without any hashing). Conversely, at the other extreme, when there are no duplicates at all, the algorithm essentially becomes heapsort. The inferiority of heapsort to quicksort is both understood theoretically and well verified in practice.
So, the heap-based algorithm is actually a surprising and nice algorithm. It does have high variance, though, and that might be a reason to avoid it in practice.
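For reference, here is a minimal sketch of the heap-based idea as I understand it (not Howard Hinnant's exact code): build a min-heap in linear time, then pop minima one at a time until two consecutively popped values compare equal.

```cpp
#include <algorithm>
#include <functional>
#include <optional>
#include <vector>

// Sketch of the heap-based idea (#3): O(n) make_heap, then pop minima until
// two consecutive ones are equal. Early duplicates => near-linear behavior;
// no duplicates => effectively heapsort.
std::optional<double> smallest_duplicate_heap(std::vector<double> v) {
    if (v.size() < 2) return std::nullopt;
    std::make_heap(v.begin(), v.end(), std::greater<>{});     // min-heap, linear time

    auto end = v.end();
    std::pop_heap(v.begin(), end--, std::greater<>{});         // smallest value now at *end
    while (end != v.begin()) {
        const double prev = *end;                              // previously popped minimum
        std::pop_heap(v.begin(), end--, std::greater<>{});     // next smallest now at *end
        if (*end == prev) return prev;                         // first equal pair = smallest duplicate
    }
    return std::nullopt;                                       // no duplicate found
}
```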
Some observations:
The graphs don't seem to make sense: where is the n log(n) behavior, at least for solution #2?
Why does the Hinnant test behave so similarly to the random and repeated-lowest-value tests? With a 1.5× range, and given that this setup is very similar to bootstrap resampling, with its well-known ~37% repetition result, I just don't see it.
As Howard Hinnant noted, it really depends on the distribution. However, the situation here is very different from that in the previous benchmark.
Some practical points:
OP, you might want to re-time this for your original problem, using the true distributions and including the overhead of copying your original vector of structs to and from the sorting-based solutions.
I've thought a lot about how to parallelize this problem, without coming up with anything interesting. One way to do so (possibly raising more questions than answers) is to run Howard Hinnant's solution on one thread and a different one on another, and use the first result found (roughly as sketched below). Given that it is so much faster for some distributions and so much slower for others, it might cover your bases.
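If anyone wants to play with that idea, here is a rough sketch (purely illustrative; the two strategy callables stand for any two of the solutions above, and the losing task is not cancelled, merely ignored):

```cpp
#include <future>
#include <optional>
#include <thread>
#include <vector>

// Race two strategies over the same input and take whichever result arrives
// first. The slower task is not cancelled; it simply finds the promise already
// satisfied and gives up.
template <class StrategyA, class StrategyB>
std::optional<double> race(const std::vector<double>& vals, StrategyA a, StrategyB b) {
    std::promise<std::optional<double>> result;
    std::future<std::optional<double>> fut = result.get_future();

    auto runner = [&result, &vals](auto strategy) {
        auto r = strategy(vals);
        try {
            result.set_value(r);             // first finisher wins
        } catch (const std::future_error&) {
            // the other strategy already delivered a result -- ignore
        }
    };

    std::thread t1(runner, a);
    std::thread t2(runner, b);

    // The answer is available here as soon as one strategy finishes; real code
    // could act on it immediately. This sketch just joins both and returns.
    std::optional<double> winner = fut.get();
    t1.join();
    t2.join();
    return winner;
}
```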