Remove smallest non-unique value from vector


I have an unsorted vector of doubles (actually, objects with a double member, which is what is used in this case). From this vector I need to remove the smallest non-unique value. However, it is not guaranteed that a non-unique value exists. Sorting the range is allowed.

As always, I started by looking for a standard algorithm and found std::unique. My first idea was to use it in combination with std::sort to move all non-unique values to the end of the vector and then use std::min_element over them. However, std::unique leaves the elements past the returned iterator in a valid but unspecified state, and indeed I lost my non-POD members.

Does anyone have a suggestion for how to do this efficiently? Efficiency matters here, since the code sits in the bottleneck of the program (which is already a bit too slow).

There are 5 answers below.

Accepted answer (score 2)

(I am adding a second answer, since 1) the focus of my first answer was using ready-made STL components, and 2) Howard Hinnant has raised some interesting points since then.)

Thanks to Howard Hinnant for the idea of benchmarking the different approaches (as well as a very unique solution)! It has led to some results I personally find interesting (and don't entirely understand).

The execution of the benchmark, however, was somewhat problematic, IMHO.

The question stated that the problem was

... objects with a double member which is used in this case ... It is important to do it efficiently since the code is used in the bottleneck of the program

The test, however:

  • Performed the operation on ints, which is an advantage for the sorting-based approaches: while both comparisons and hashes are more expensive for doubles than for ints, the number of comparisons is Theta(n log(n)), while the number of hashes is only O(n).

  • Took the body of my main function and wrapped it in a free function (rather than a class object), and did not use a pool allocator. Frankly, I consider this a flaw that renders the results meaningless, as it basically re-established the well-known fact that dynamic allocations and needless re-initializations of large containers are expensive.

  • Relied on the fact that the sorting algorithms could just return the vector on which they operated (which could not be done for the original problem). In the following I let this point slide, as the problem for a vector of doubles is interesting in itself, but the OP should note that this might change things even more.


So, in order to deal with the second issue, I initially used the probing-based hash-table from my own gcc libstdc++ pb_ds extension. This by itself reduced the running time of solution #1 below that of solution #2 (sort + adjacent_find), but it still remained more expensive than #3 (make_heap).
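
For readers unfamiliar with pb_ds, the swap looked roughly like the sketch below. This is written from memory rather than lifted from the benchmark code, and it assumes GCC with the pb_ds headers available; the names count_table and count_values are mine.

#include <ext/pb_ds/assoc_container.hpp> // GCC-specific policy-based data structures
#include <functional>
#include <vector>

// Counting with a probing (open-addressing) hash table instead of std::unordered_map.
using count_table = __gnu_pbds::gp_hash_table<double, int, std::hash<double>>;

void count_values(const std::vector<double> &vals, count_table &c)
{
    for (double v : vals)
        ++c[v]; // operator[] value-initializes missing entries to 0
}

The scan for the smallest duplicated key stays the same as in my first answer.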

In order to reduce this further, I used the most degenerate form of "hash table" that seemed relevant.

#include <algorithm>   // std::sort, std::adjacent_find, std::fill, std::min, std::find
#include <cstddef>     // std::size_t
#include <functional>  // std::hash
#include <iterator>    // std::begin, std::end
#include <vector>

template<typename T, class Hash=std::hash<T>>
class smallest_dup_remover
{
public:
    explicit smallest_dup_remover(std::size_t max_size) 
    {
        while(m_mask < max_size)
            m_mask *= 2;

        m_status.resize(m_mask);
        m_vals.resize(m_mask);

        --m_mask;
    }

    void operator()(std::vector<T> &vals)
    {
        std::fill(std::begin(m_status), std::end(m_status), 0);
        bool has = false;
        T min_;
        std::vector<T> spillover;
        spillover.reserve(vals.size());
        for(auto v: vals)
        {
            const std::size_t pos = m_hash(v) & m_mask;
            char &status = m_status[pos];
            // status codes: 0 = empty slot, 1 = one value seen, 2 = duplicated
            switch(status)
            {
            case 0:
                status = 1;
                m_vals[pos] = v;
                break;
            case 1:
                if(m_vals[pos] == v)
                {
                    status = 2;
                    min_ = has? std::min(min_, v): v;
                    has = true;
                }
                else
                    spillover.push_back(v);
                break;
            case 2:
               if(m_vals[pos] != v)
                    spillover.push_back(v);
            }
        }
        std::sort(std::begin(spillover), std::end(spillover));
        auto it = std::adjacent_find(std::begin(spillover), std::end(spillover));
        if(has && it == std::end(spillover))
            remove_min(vals, min_);
        else if(has && it != std::end(spillover))
            remove_min(vals, std::min(min_, *it));
        else if(!has && it != std::end(spillover))
            remove_min(vals, *it);
    }

private:
    void remove_min(std::vector<T> &vals, T t)
    {
        vals.erase(std::find(vals.begin(), vals.end(), t)); 
    }

private:
    size_t m_mask = 1;
    std::vector<char> m_status;
    std::vector<T> m_vals;
    Hash m_hash;
};

The algorithm uses three vectors:

  • a "status" vector, containing codes for 0, 1, and "many"

  • a "values" vector, holding the value currently stored at each occupied slot

  • a "spillover" vector, for collisions

Values whose slot reaches the "many" status are compared on the fly to track the minimum. Colliding values (that is, values hashing to a slot that already holds a different value) are pushed to the "spillover" vector, which is then scanned for its lowest duplicate using the method from #2 (sort + adjacent_find). That result is compared against the lowest value found among the "many" slots.
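
For completeness, here is a minimal usage sketch of the class above (the benchmark plumbing is omitted; the sample values are mine):

// Assumes the includes and the smallest_dup_remover definition above.
int main()
{
    std::vector<double> vals{1.0, 3.2, 3.2, 2.0, 5.5, 1.0};

    // Size the table once, then reuse the remover across calls
    // so the allocations stay out of the hot path.
    smallest_dup_remover<double> remover(vals.size());
    remover(vals); // erases one occurrence of 1.0, the smallest duplicate

    // vals is now {3.2, 3.2, 2.0, 5.5, 1.0}.
}

Note that the remover erases only one occurrence of the smallest duplicate, matching the behaviour of the other methods.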


Here is the code for a benchmark that revisits these tests, and here is the code producing the following graphs.

(Reminder: #1 is hash-based, #2 is quicksort-based, and #3 is heap-based.)

Starting with the test performed by Howard Hinnant before (values are generated randomly from a range whose size is 1.5× the length of the vector), here are the results:

[Graph: running times for the three methods, values drawn from a range 1.5× the vector length]

So his excellent heap-based algorithm does indeed perform best for this case, but the picture looks very different than before. In particular, the hash-based algorithm is not nearly as abysmal once its memory allocations are hoisted out of the timed region.

However, suppose we change range to a completely random range. Then the results change to this:

[Graph: running times for the three methods, values drawn from a completely random range]

In this case, the hash-based solution works best, the sort-based one next, and the heap-based solution works worst.

In order to verify the reason, here are two more tests.

Here is a test with completely random values + two zero values (that is, the lowest duplicate is zero):

[Graph: running times for the three methods, completely random values plus two zero values]

and finally, here is the case where all the values are generated from only 100 possible values (irrespective of the length):

[Graph: running times for the three methods, all values drawn from 100 possible values]

What's happening is as follows. The heap-based solution is the most dependent of the three on the distribution. The make_heap step is linear time, and if a duplicate is encountered almost immediately afterwards, the whole thing turns out to be a linear algorithm (without any hashing). At the other extreme, if there are no duplicates at all, the algorithm essentially becomes heapsort. The inferiority of heapsort to quicksort is both well understood theoretically and thoroughly verified in practice.

So, the heap-based algorithm is actually a surprising and nice algorithm. It does have high variance though, and that might be a reason to avoid it in practice.


Some observations:

  • The graphs don't seem to make sense: where is the n log(n) behavior, at least for solution #2?

  • Why does the Hinnant test behave so similarly to the random-values-plus-repeated-low-value tests? With a 1.5× range, and given that this is very similar to bootstrap resampling, with its well-known ~37% repetition result, I just don't see it.

  • As Howard Hinnant noted, it really depends on the distribution. However, the situation is very far from the previous benchmark.

  • Some practical points:

    • OP, you might want to re-time this for your original problem, using the true distributions and including the overhead of copying your original vector of structs to and from the sorting solutions.

    • I've thought a lot about how to parallelize this problem, without coming up with anything interesting. One way to do so (possibly raising more questions than answers) is to run Howard Hinnant's solution on one thread and a different one on another, and use the first result found (see the sketch below). Given that the heap-based solution is so much faster for some distributions and so much slower for others, it might cover your bases.
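
Here is a rough sketch of that last idea (my own illustration, not code from any answer here; the name race_removers is hypothetical). Note the obvious caveat: the slower task is not cancelled, only ignored, so this pays off only when spare cores are available.

#include <chrono>
#include <future>

// Run two removal strategies on separate copies of the input and return
// whichever result is ready first. f1/f2 are callables wrapping, e.g.,
// the heap-based and sort-based removers discussed on this page.
template <typename Container, typename F1, typename F2>
Container race_removers(const Container &input, F1 f1, F2 f2)
{
    auto run = [&input](auto f) { auto c = input; f(c); return c; };
    auto fut1 = std::async(std::launch::async, run, f1);
    auto fut2 = std::async(std::launch::async, run, f2);
    for (;;)
    {
        if (fut1.wait_for(std::chrono::microseconds(50)) == std::future_status::ready)
        {
            fut2.wait(); // join (not cancel) the slower task
            return fut1.get();
        }
        if (fut2.wait_for(std::chrono::microseconds(50)) == std::future_status::ready)
        {
            fut1.wait();
            return fut2.get();
        }
    }
}

Here f1/f2 could, for instance, be small lambdas wrapping the heap-based and sort-based removers from Howard Hinnant's answer.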

Answer (score 5)

Well, here's an algorithm that actually removes the smallest non-unique item (as opposed to just printing it).

template <typename Container>
void
removeSmallestNonunique(Container& c)
{
    using value_type = typename Container::value_type;
    if (c.size() > 1)
    {
        std::make_heap(c.begin(), c.end(), std::greater<value_type>{});
        std::pop_heap(c.begin(), c.end(), std::greater<value_type>{});
        // Each pop_heap places the next-smallest element just before e, so *e
        // and e[-1] are adjacent values of the sorted sequence.
        for (auto e = std::prev(c.end()); e != c.begin(); --e)
        {
            std::pop_heap(c.begin(), e, std::greater<value_type>{});
            if (*e == e[-1])  // equal neighbours: this is the smallest duplicate
            {
                c.erase(e);
                break;
            }
        }
    }
}

I chose this algorithm mainly because Lightness Races in Orbit didn't. I don't know whether this will be faster than sort/adjacent_find or not. And the answer almost certainly depends on the input.

For example, if there are no duplicates, then this algorithm is definitely slower than sort/adjacent_find. If the input is very, very large and the smallest non-unique value is likely to be early in the sorted range, this algorithm is likely to be faster than sort/adjacent_find.

And everything I've said above is just guessing. I'm quitting prior to performing the needed measurements on statistically likely inputs to the actual problem.

Perhaps Omid can include this in his test and provide a summary answer. :-)

7 hours later ... Timings

I took Omid's code, corrected a minor mistake in it, corrected the other two algorithms to actually erase an element, and changed the test harness to vary the size and the number of duplicates more widely.

Here is the code I tested using clang / libc++ at -O3:

#include <unordered_map>
#include <iostream>
#include <vector>
#include <algorithm>
#include <random>
#include <chrono>
#include <cassert>

template <typename Container>
void
erase_using_hashTable(Container& vec)
{
    using T = typename Container::value_type;
    std::unordered_map<T, int> c;
    for (const auto& elem : vec){
        ++c[elem];
    }
    bool has = false;
    T min_;
    for (const auto& e : c)
    {
        if (e.second > 1)
        {
            min_ = has ? std::min(e.first, min_) : e.first;
            has = true;
        }
    }
    if (has)
        vec.erase(std::find(vec.begin(), vec.end(), min_));
}

template <typename Container>
void 
eraseSmallestNonunique(Container& c)
{
   std::sort(std::begin(c), std::end(c));
   auto it = std::adjacent_find(std::begin(c), std::end(c));

   if (it != std::end(c))
       c.erase(it);
}

template <typename Container>
void
removeSmallestNonunique(Container& c)
{
    using value_type = typename Container::value_type;
    if (c.size() > 1)
    {
        std::make_heap(c.begin(), c.end(), std::greater<value_type>{});
        std::pop_heap(c.begin(), c.end(), std::greater<value_type>{});
        for (auto e = std::prev(c.end()); e != c.begin(); --e)
        {
            std::pop_heap(c.begin(), e, std::greater<value_type>{});
            if (*e == e[-1])
            {
                c.erase(e);
                break;
            }
        }
    }
}

template<typename iterator>
iterator partition_and_find_smallest_duplicate(iterator begin, iterator end)
{
    using std::swap;
    if (begin == end)
        return end; // empty sequence

    // The range begin,end is split in four partitions:
    // 1. equal to the pivot
    // 2. smaller than the pivot
    // 3. unclassified
    // 4. greater than the pivot

    // pick pivot (TODO: randomize pivot?)
    iterator pivot = begin;
    iterator first = next(begin);
    iterator last = end;

    while (first != last) {
        if (*first > *pivot) {
            --last;
            swap(*first, *last);
        } else if (*first < *pivot) {
            ++first;
        } else {
            ++pivot;
            swap(*pivot, *first);
            ++first;
        }
    }

    // look for duplicates in the elements smaller than the pivot
    auto res = partition_and_find_smallest_duplicate(next(pivot), first);
    if (res != first)
        return res;

    // if we have more than just one equal to the pivot, it is the smallest duplicate
    if (pivot != begin)
        return pivot;

    // neither, look for duplicates in the elements greater than the pivot
    return partition_and_find_smallest_duplicate(last, end);
}

template<typename container>
void remove_smallest_duplicate(container& c)
{
    using std::swap;
    auto it = partition_and_find_smallest_duplicate(c.begin(), c.end());
    if (it != c.end())
    {
        swap(*it, c.back());
        c.pop_back();
    }
}

int  main()
{
    const int MaxArraySize = 5000000;
    const int minArraySize = 5;
    const int numberOfTests = 3;

    //std::ofstream file;
    //file.open("test.txt");
    std::mt19937 generator;

    for (int t = minArraySize; t <= MaxArraySize; t *= 10)
    {
        const int range = 3*t/2;
        std::uniform_int_distribution<int> distribution(0,range);

        std::cout << "Array size = " << t << "  range = " << range << '\n';

        std::chrono::duration<double> avg{},avg2{}, avg3{}, avg4{};
        for (int n = 0; n < numberOfTests; n++)
        {
            std::vector<int> save_vec;
            save_vec.reserve(t);
            for (int i = 0; i < t; i++){ // fill the array with random elements
                save_vec.push_back(distribution(generator));
            }
            //method1
            auto vec = save_vec;
            auto start = std::chrono::steady_clock::now();
            erase_using_hashTable(vec);
            auto end = std::chrono::steady_clock::now();
            avg += end - start;
            auto answer1 = vec;
            std::sort(answer1.begin(), answer1.end());

            //method2
            vec = save_vec;
            start = std::chrono::steady_clock::now();
            eraseSmallestNonunique(vec);
            end = std::chrono::steady_clock::now();
            avg2 += end - start;
            auto answer2 = vec;
            std::sort(answer2.begin(), answer2.end());
            assert(answer2 == answer1);

            //method3
            vec = save_vec;
            start = std::chrono::steady_clock::now();
            removeSmallestNonunique(vec);
            end = std::chrono::steady_clock::now();
            avg3 += end - start;
            auto answer3 = vec;
            std::sort(answer3.begin(), answer3.end());
            assert(answer3 == answer2);

            //method4
            vec = save_vec;
            start = std::chrono::steady_clock::now();
            remove_smallest_duplicate(vec);
            end = std::chrono::steady_clock::now();
            avg4 += end - start;
            auto answer4 = vec;
            std::sort(answer4.begin(), answer4.end());
            assert(answer4 == answer3);
        }
        //file << avg/numberOfTests <<" "<<avg2/numberOfTests<<'\n';
        //file << "__\n";
        std::cout <<   "Method1 : " << (avg  / numberOfTests).count() << 's'
                  << "\nMethod2 : " << (avg2 / numberOfTests).count() << 's'
                  << "\nMethod3 : " << (avg3 / numberOfTests).count() << 's'
                  << "\nMethod4 : " << (avg4 / numberOfTests).count() << 's'
                  << "\n\n";
    }

}

And here are my results:

Array size = 5  range = 7
Method1 : 8.61967e-06s
Method2 : 1.49667e-07s
Method3 : 2.69e-07s
Method4 : 2.47667e-07s

Array size = 50  range = 75
Method1 : 2.0749e-05s
Method2 : 1.404e-06s
Method3 : 9.23e-07s
Method4 : 8.37e-07s

Array size = 500  range = 750
Method1 : 0.000163868s
Method2 : 1.6899e-05s
Method3 : 4.39767e-06s
Method4 : 3.78733e-06s

Array size = 5000  range = 7500
Method1 : 0.00124788s
Method2 : 0.000258637s
Method3 : 3.32683e-05s
Method4 : 4.70797e-05s

Array size = 50000  range = 75000
Method1 : 0.0131954s
Method2 : 0.00344415s
Method3 : 0.000346838s
Method4 : 0.000183092s

Array size = 500000  range = 750000
Method1 : 0.25375s
Method2 : 0.0400779s
Method3 : 0.00331022s
Method4 : 0.00343761s

Array size = 5000000  range = 7500000
Method1 : 3.82532s
Method2 : 0.466848s
Method3 : 0.0426554s
Method4 : 0.0278986s

Update

I've updated results above with Ulrich Eckhardt's algorithm. His algorithm is quite competitive. Nice job Ulrich!

I should warn the readers of this answer that Ulrich's algorithm is vulnerable to the "quick-sort O(N^2) problem" wherein for specific inputs the algorithm can degenerate badly. The general algorithm is fixable, and Ulrich is obviously aware of the vulnerability as evidenced by this comment:

// pick pivot (TODO: randomize pivot?)

That is one defense against the O(N^2) problem, and there are others, such as detecting unreasonable recursion/iteration and switching to another algorithm mid-stream (such as Method 3 or Method 2). As written, Method 4 is badly impacted when given an in-order sequence and is disastrous when given a reverse-order sequence. On my platform Method 3 is also slower than Method 2 for these cases, though not nearly as bad as Method 4.

Finding the ideal technique for combating the O(N^2) problem for quick-sort-like algorithms is something of a black art, but well worth the time. I would definitely consider Method 4 a valuable tool in the toolbox.
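
For concreteness, here is a sketch of the randomized-pivot defense mentioned in Ulrich's TODO comment. It is my illustration rather than his code; everything other than the pivot-selection lines is Method 4 unchanged.

#include <iterator>
#include <random>
#include <utility>

template <typename iterator, typename URBG>
iterator partition_and_find_smallest_duplicate(iterator begin, iterator end, URBG &gen)
{
    using std::swap;
    if (begin == end)
        return end; // empty sequence

    // Defense against the O(N^2) problem: swap a randomly chosen
    // element to the front and use it as the pivot.
    auto len = std::distance(begin, end);
    std::uniform_int_distribution<decltype(len)> pick(0, len - 1);
    swap(*begin, *std::next(begin, pick(gen)));

    iterator pivot = begin;
    iterator first = std::next(begin);
    iterator last = end;

    while (first != last) {
        if (*first > *pivot) {
            --last;
            swap(*first, *last);
        } else if (*first < *pivot) {
            ++first;
        } else {
            ++pivot;
            swap(*pivot, *first);
            ++first;
        }
    }

    // duplicates among the elements smaller than the pivot win
    auto res = partition_and_find_smallest_duplicate(std::next(pivot), first, gen);
    if (res != first)
        return res;

    // more than one element equal to the pivot: the pivot is the smallest duplicate
    if (pivot != begin)
        return pivot;

    // otherwise, look among the elements greater than the pivot
    return partition_and_find_smallest_duplicate(last, end, gen);
}

A std::mt19937 seeded once and passed down from remove_smallest_duplicate would do as gen.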

Answer (score 0)

Firstly, concerning the task of removing the element: the hardest part is finding it; actually removing it is easy (swap it with the last element and then pop_back()). I'll therefore only address the finding. Also, you mention that sorting the sequence is acceptable, and I take that to mean that not just sorting but any kind of reordering is acceptable.

Take a look at the quicksort algorithm. It picks a random element and then partitions the sequence to the left and right. If you write the partitioning so that it distinguishes not just "less" and "not less" but also "equal", you can partially sort the sequence and locate the smallest duplicate as you go.

The following steps should do the job:

  • First, pick a random pivot and partition the sequence. At the same time, you can detect whether the pivot is a duplicate or not. Note that if you find a duplicate here, you can discard(!) anything greater, so you don't even have to invest in storage capacity and bandwidth for both partitions.
  • Then, recurse to the sequence of smaller elements.
  • If there was a set of duplicates in the smaller partition, those are your solution.
  • If the first pivot had duplicates, that is your solution.
  • Else, recurse to the larger elements looking for duplicates.

The advantage over regular sorting is that you don't sort the whole sequence if you find duplicates in the lower numbers. On a sequence of unique numbers, you will fully sort them though. Compared with the suggested counting of elements using a hashmap, this indeed has a higher asymptotic complexity. Whether it performs better depends on your implementation and input data though.

Note that this requires that the elements can be sorted and compared. You mention that you use double values, which are notoriously bad to sort when you have NaNs in there. I could imagine that the hashing algorithms in the standard containers work with NaNs though, so one more point for counting with a hash map.
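
As a hedged aside on the NaN point (my addition, not part of this answer's algorithm): one way to keep the comparison-based approaches safe is to partition NaNs out first and run the duplicate search only on the NaN-free prefix.

#include <algorithm>
#include <cmath>
#include <vector>

// Move NaNs to the back; returns the end of the NaN-free prefix,
// which is a range with a valid strict weak ordering under operator<.
inline std::vector<double>::iterator exclude_nans(std::vector<double> &v)
{
    return std::partition(v.begin(), v.end(),
                          [](double d) { return !std::isnan(d); });
}

The duplicate search would then run only on [v.begin(), exclude_nans(v)).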

The following code implements the algorithm described above. It uses one recursive function to partition the input and find duplicates, which is called from a second function that then finally removes the duplicate:

#include <vector>
#include <algorithm>
#include <iostream>

template<typename iterator>
iterator partition_and_find_smallest_duplicate(iterator begin, iterator end)
{
    using std::swap;

    std::cout << "find_duplicate(";
    for (iterator it=begin; it!=end; ++it)
        std::cout << *it << ", ";
    std::cout << ")\n";
    if (begin == end)
        return end; // empty sequence

    // The range begin,end is split in four partitions:
    // 1. equal to the pivot
    // 2. smaller than the pivot
    // 3. unclassified
    // 4. greater than the pivot

    // pick pivot (TODO: randomize pivot?)
    iterator pivot = begin;
    std::cout << "picking pivot: " << *pivot << '\n';

    iterator first = next(begin);
    iterator last = end;

    while (first != last) {
        if (*first > *pivot) {
            --last;
            swap(*first, *last);
        } else if (*first < *pivot) {
            ++first;
        } else {
            ++pivot;
            swap(*pivot, *first);
            ++first;
            std::cout << "found duplicate of pivot\n";
        }
    }

    // look for duplicates in the elements smaller than the pivot
    auto res = partition_and_find_smallest_duplicate(next(pivot), first);
    if (res != first)
        return res;

    // if we have more than just one equal to the pivot, it is the smallest duplicate
    if (pivot != begin)
        return pivot;

    // neither, look for duplicates in the elements greater than the pivot
    return partition_and_find_smallest_duplicate(last, end);
}

template<typename container>
void remove_smallest_duplicate(container& c)
{
    using std::swap;
    auto it = partition_and_find_smallest_duplicate(c.begin(), c.end());
    if (it != c.end())
    {
        std::cout << "removing duplicate: " << *it << std::endl;

        // swap with the last element before popping
        // to avoid copying the elements in between
        swap(*it, c.back());
        c.pop_back();
    }
}

int main()
{
    std::vector<int> data = {66, 3, 11, 7, 75, 62, 62, 52, 9, 24, 58, 72, 37, 2, 9, 28, 15, 58, 3, 60, 2, 14};

    remove_smallest_duplicate(data);
}

Answer (score 2)

Well, if you can sort the range then this is easy. Sort it in ascending order, then just iterate through until you encounter two equivalent, adjacent elements. Done.

Something like this:

template <typename T>
T findSmallestNonunique(std::vector<T> v)
{
   std::sort(std::begin(v), std::end(v));
   auto it = std::adjacent_find(std::begin(v), std::end(v));
   if (it == std::end(v))
      throw std::runtime_error("No such element found");
   return *it;
}

Here's a demonstration:

#include <vector>
#include <algorithm>
#include <stdexcept>
#include <string>
#include <iostream>

template <typename Container>
typename Container::value_type findSmallestNonunique(Container c)
{
   std::sort(std::begin(c), std::end(c));
   auto it = std::adjacent_find(std::begin(c), std::end(c));

   if (it == std::end(c))
      throw std::runtime_error("No such element found");

   return *it;
}

int main(int argc, char** argv)
{
    std::vector<int> v;
    for (int i = 1; i < argc; i++)
        v.push_back(std::stoi(argv[i]));

    std::cout << findSmallestNonunique(v) << std::endl;
}

// g++ -std=c++14 -O2 -Wall -pedantic -pthread main.cpp \
// && ./a.out 1 2 2 3 4 5 5 6 7 \
// && ./a.out 5 2 8 3 9 3 0 1 4 \
// && ./a.out 5 8 9 2 0 1 3 4 7
// 
// 2
// 3
// terminate called after throwing an instance of 'std::runtime_error'
//   what():  No such element found

Note that here I'm not performing the search in-place, but I could do so by taking the container by reference. (This relies on you being allowed to sort the original input.)

Due to the sort operation, this could be as "bad" as O(N×log(N)), but it's simple and easy to maintain, and does not require any allocations/copies (except for a single copy of the entire dataset which, as stated above, you may trivially completely avoid). You may want to use something else if your input is large or you are expecting to have a failed match in a majority of cases. As always: profile!

Answer (score 8)

You can do this in (expected) linear time.

  • Use an unordered_map to count the elements. This is (expected) linear in the number of values.

  • Find the smallest item among non-uniques using a naive loop.

Here is a possible implementation:

#include <unordered_map>
#include <iostream>
#include <vector>

using namespace std;

int main()
{
    const vector<double> elems{1, 3.2, 3.2, 2};
    unordered_map<double, size_t> c;
    for(const double &d: elems)
        ++c[d];
    bool has = false;
    double min_ = 0;  // only meaningful when "has" is true
    for(const auto &e: c)
        if(e.second > 1)
        {
            min_ = has? min(e.first, min_): e.first;
            has = true;
        }
    cout << boolalpha << has << " " << min_ << endl;
    return 0;
}

Edit: As Howard Hinnant & Lightness Races in Orbit have pointed out, this involves both allocations and hashes. It will therefore be linear, but with a relatively large constant factor. Other, sorting-based solutions might work better for small sizes. When/if profiling, it's important to use a good allocator, e.g., Google's tcmalloc.
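
If this approach is kept, one small tweak (my addition, not from the measurements above; the name count_with_reserve is hypothetical) is to pre-size the map so that counting does not trigger repeated rehashes:

#include <cstddef>
#include <unordered_map>
#include <vector>

// Pre-sizing trades memory for fewer rehashes and allocations while counting.
void count_with_reserve(const std::vector<double> &elems,
                        std::unordered_map<double, std::size_t> &c)
{
    c.clear();
    c.reserve(elems.size()); // one bucket-array allocation up front
    for (const double &d : elems)
        ++c[d];
}

This keeps the bucket array to a single allocation; the per-node allocations remain, which is where a good allocator helps.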