Why does multiset keep separate instances of repeated elements instead of their count?


I recently found out that the multiset<T> implementation in the STL actually keeps separate copies of repeated elements in the tree. I had expected it to internally use a map<T, int> and just keep a count of the repeated elements.

In what scenario is this implementation beneficial compared to just keeping a count? Is there any use case for multiset where code would break if the internal implementation changed? Or is there any operation whose complexity would increase?

What is the thought process behind that choice?

2 Answers

BEST ANSWER

By default, std::multiset uses operator< to decide whether two elements are equivalent (note: equivalent, not equal!). Now consider this element type:

struct foo {
    int x;
    int y;
    bool operator<(const foo& other) const { return x < other.x; }
    bool operator==(const foo& other) const { return (x == other.x) && (y == other.y); }
};

Two elements a and b are considered equivalent when !(a < b) && !(b < a); in general, that does not imply a == b. Hence, storing only a count is not sufficient.
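
To make this concrete, here is a minimal, self-contained sketch (repeating the foo above, with const-qualified operators): two objects that are equivalent under operator< but not equal, and the multiset keeps both.

#include <cassert>
#include <set>

struct foo {
    int x;
    int y;
    bool operator<(const foo& other) const { return x < other.x; }
    bool operator==(const foo& other) const { return (x == other.x) && (y == other.y); }
};

int main() {
    foo f{42, 0};
    foo g{42, 100};
    assert(!(f < g) && !(g < f)); // equivalent under operator<
    assert(!(f == g));            // but not equal
    std::multiset<foo> s{f, g};
    assert(s.size() == 2);        // the multiset keeps both instances
}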

Moreover, even equality does not imply identity (a == b does not imply &a == &b). Since the container owns its elements, two elements in the container cannot possibly be identical; they are always two distinct objects. Also, in the above foo we could add a member z that is considered by neither operator< nor operator== but can be accessed via a getter.


The reason sorted containers typically check equivalence rather than equality is that they already need a way to order elements (they are sorted). Being able to compare two elements via < is sufficient to check equivalence, while additionally requiring == would make the container less generic with no apparent gain. If you like, you can always use a comparator for which equivalence does imply equality (in the above example: compare y in operator< as well).
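
For instance, a lexicographic comparison via std::tie (a small sketch; it replaces foo's operator< above and needs the <tuple> header) makes equivalence coincide with equality:

#include <tuple>

// compare y as well, so that equivalence implies equality
bool operator<(const foo& other) const {
    return std::tie(x, y) < std::tie(other.x, other.y);
}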


As already mentioned in a comment, if you want the behavior you describe, you could use a std::map<foo, size_t> (which also uses operator< as the default comparator for its keys). Instead of storing only the foos, you store each foo together with its count. Now, because two foos with the same x are considered equivalent, they map to the same key:

foo f, g;
f.x = 42; f.y = 0;
g.x = 42; g.y = 100;

std::map<foo, size_t> counter;
++counter[f];
++counter[g];

As a result, counter will hold one element: the key is a copy of f, and the count stored for that key is 2.
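
A complete version of that snippet, with the required headers, might look like this (the assertions just verify the claim above):

#include <cassert>
#include <cstddef>
#include <map>

struct foo {
    int x;
    int y;
    bool operator<(const foo& other) const { return x < other.x; }
};

int main() {
    foo f{42, 0};
    foo g{42, 100};

    std::map<foo, std::size_t> counter;
    ++counter[f]; // inserts {f, 0}, then increments the count to 1
    ++counter[g]; // g is equivalent to f, so the same entry goes to 2

    assert(counter.size() == 1);           // one key only
    assert(counter.begin()->second == 2);  // counted twice
    assert(counter.begin()->first.y == 0); // the stored key is a copy of f
}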

SECOND ANSWER

A handful of situations where this could be an issue:

Your comparator doesn't distinguish between objects that are not equal.

Let's say I have a bunch of requests that need to be filled in order of their priority. I might write a comparator that compares only the priority of the requests. I would be unhappy if the result were just a structure that counted how many requests of each priority there were, with no way to access the actual requests.
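
A sketch of that scenario (the Request type and its fields are made up for illustration):

#include <cassert>
#include <set>
#include <string>

// hypothetical request type, ordered by priority only
struct Request {
    int priority;
    std::string payload;
    bool operator<(const Request& other) const { return priority < other.priority; }
};

int main() {
    std::multiset<Request> queue;
    queue.insert({1, "first job"});
    queue.insert({1, "second job"}); // same priority, different payload

    // A count-based implementation would collapse these into
    // "two requests with priority 1" and lose the payloads.
    // The real multiset keeps both requests intact.
    assert(queue.size() == 2);
    std::string processed;
    for (const Request& r : queue)
        processed += r.payload + ";";
    // since C++11, insertion order is preserved among equivalent elements
    assert(processed == "first job;second job;");
}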

Your multiset needs to own the objects

If the multiset holds actual objects, rather than numbers or pointers, what happens if you insert a new object that compares equivalent to an existing one? Should we just destruct the object that's already in there? The one we're adding? Should it be an expected consequence of inserting something into a collection that the inserted object might be destructed immediately, or that an existing element might be destructed at some point in the future as a consequence of something else being inserted?

(To be fair, this can actually happen when containers reallocate, but it won't happen to objects that can be moved. And of course, there are objects that can be moved but not copied, e.g. objects that own a buffer.)
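
To illustrate, a small sketch with an instance-counting type (tracked is made up for this example) shows that inserting an equivalent element never destroys the one already stored:

#include <cassert>
#include <set>

// counts live instances so we can observe that nothing is destroyed
struct tracked {
    static inline int live = 0; // C++17 inline variable
    int key;
    explicit tracked(int k) : key(k) { ++live; }
    tracked(const tracked& other) : key(other.key) { ++live; }
    ~tracked() { --live; }
    bool operator<(const tracked& other) const { return key < other.key; }
};

int main() {
    std::multiset<tracked> s;
    s.insert(tracked{7});
    s.insert(tracked{7}); // equivalent to the first; both survive
    assert(s.size() == 2);
    assert(tracked::live == 2); // neither element was silently destroyed
}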

You want your multiset to be efficient

Comparators tell us whether one object is less than another. To determine whether two objects are equivalent, we'd have to do a second comparison: check whether a < b and then whether b < a. The existing implementation avoids this second check. This might not make a big performance difference in many cases, but if your data involves a lot of duplicates from the same few equivalence classes, it could essentially slow your code down by a factor of 2.
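
The extra comparison, as a sketch (a count-based design would need an equivalence test like this on every insert):

#include <cassert>

// ordering: one call to the comparator
bool less(int a, int b) { return a < b; }

// equivalence: two calls to the comparator, !(a < b) && !(b < a)
bool equivalent(int a, int b) {
    return !less(a, b) && !less(b, a);
}

int main() {
    assert(equivalent(3, 3));  // needs both comparisons to confirm
    assert(!equivalent(3, 4)); // may short-circuit after the first
}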