So the problem I am solving is finding the majority element in a sequence of integers, where the size of the sequence is 0 < n < 10^5 and each element satisfies 0 < a < 10^9. I just need to find out which element appears strictly more than n/2 times.

I have two solutions that I believe are both correct, but they're not performing the way I would expect given my understanding of their complexity. Could someone explain what I've done wrong or misunderstood?

int getMajElement(vector<int> a)
{
    std::unordered_map<int, int> countMap;
    std::unordered_map<int, int>::iterator it;

    for (int i : a)
    {
        it = countMap.find(i);

        if (it == countMap.end())
        {
            countMap.insert(std::pair<int, int>(i, 1));
        }
        else
        {
            it->second++;
        }
    }

    int mostOfNum = 0;
    int mostOfNumCount = 0;

    for (std::pair<int, int> pair : countMap)
    {
        if (pair.second > mostOfNumCount)
        {
            mostOfNumCount = pair.second;
            mostOfNum = pair.first;
        }
    }

    if (mostOfNumCount > floor(a.size() / 2))
    {
        return mostOfNum;
    }
    else
    {
        return -1;
    }
}

From my understanding, the first loop, for (int i : a), should run in O(n) time, and finding/incrementing the values should run in O(1) average time for a hash map. The second loop, for (std::pair<int, int> pair : countMap), should also run in O(n) time, as I'm iterating through at most n pairs. That would be O(n) time in total.

This function takes 2.4 seconds to run for n = 10^5, with each a = rand() % 10^9. I made sure to time only the functions themselves, not the setup of the initial values.

The next one takes 0.70 seconds under the same conditions, but I would have expected the first one to be faster.

The second function uses a recursive divide-and-conquer approach and should take O(n log(n)) time. It recursively splits the list in half until it reaches single elements, then checks whether the majority element of the left half is the same as the majority element of the right half. If they differ, it scans that section to count both candidates and returns whichever occurs more than floor((right - left) / 2) times for that section, or -1 if neither does.

Could someone explain what's causing the time difference? Is it just an implementation mistake I've made?

int get_majority_element(vector<int> &a, int left, int right) {

    if (left == right) return -1;
    if (left + 1 == right) return a[left];

    int mid = left + floor((right - left) / 2);

    int leftMajority = get_majority_element(a, left, mid);
    int rightMajority = get_majority_element(a, mid, right);

    if(leftMajority == rightMajority)
    {
        return leftMajority;
    }
    else if (rightMajority == -1)
    {
        return leftMajority;
    }
    else if (leftMajority == -1)
    {
        return rightMajority;
    }
    else
    {
        int leftCount = 0, rightCount = 0;
        for (int i = left; i < right; i++)
        {
            if (a[i] == leftMajority)
            {
                leftCount++;
            }
            else if (a[i] == rightMajority)
            {
                rightCount++;
            }
        }

        if (leftCount > floor((right - left) / 2))
        {
            return leftMajority;
        }
        else if (rightCount > floor((right - left) / 2))
        {
            return rightMajority;
        }
        else
        {
            return -1;
        }           
    }

    return -1;
}
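
For reference, the whole-vector call is not shown in the question; presumably it would look something like this:

int majority = get_majority_element(a, 0, static_cast<int>(a.size()));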

There are two answers below.


In the first solution, try initializing countMap with an explicit number of buckets, set to at least 3/4 of the expected number of elements (given that you expect a majority element to be present).
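
A minimal sketch of that suggestion, assuming the vector a from the question (the bucket count below just follows the 3/4 guideline above; it is a tuning choice, not a hard rule):

// Construct with an explicit bucket count to avoid rehashing while filling.
std::unordered_map<int, int> countMap(3 * a.size() / 4);
// Or, after default construction: countMap.reserve(3 * a.size() / 4);
for (int i : a)
{
    ++countMap[i]; // operator[] value-initializes a missing count to 0
}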

It is likely that you have quite a number of rehashes while filling that map. unordered_map::insert warns about a worst-case time complexity of O(N): while that won't happen every time, it only has to happen a couple of times towards the end (with a pretty sizeable map) to wreck the execution time. The linked documentation says:

Rehashing occurs only if the new number of elements is greater than max_load_factor()*bucket_count().

But wait, there's more!!!
What happens while max_load_factor()*bucket_count() is greater than the element count? Well, most likely collisions. Given that every collision goes into a bucket implemented as a... wait for it... linked list, your CPU cache locality is blown to kingdom come, and there goes any chance that "finding/incrementing the values should run in O(1) time for a hashmap."
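
If you want to see this happening, here is a small diagnostic sketch, assuming the countMap from the question (needs <algorithm> and <iostream>); it reports how full the table is and how long the worst bucket chain gets:

std::size_t longestChain = 0;
for (std::size_t b = 0; b < countMap.bucket_count(); ++b)
{
    longestChain = std::max(longestChain, countMap.bucket_size(b));
}
std::cout << "load_factor="   << countMap.load_factor()
          << " bucket_count=" << countMap.bucket_count()
          << " longest_chain=" << longestChain << '\n';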

If you have time, watch this part of CppCon 2014 for more horror stories (or watch it in full, it's the weekend anyway).


Note: I'm not saying this will make the first method faster than the second; it may or may not. What I am saying is that applying the suggestion is very likely to improve the first method's speed.
(And I'd appreciate a comment saying "I tried it; this is what happened on the same data, constructing with and without an explicit number of buckets.")


This is too long for a comment.

Complexity theory is about what happens to a single algorithm as the size of the data grows. It is a limit as n --> infinity.

It is much less useful for comparing two different algorithms on the same size of data. Why? Because overhead can dominate the calculation. For instance, bubble sort is O(n^2), but on (very) small data sets it can outperform reasonable implementations of "faster" algorithms.

The proper comparison would be the speed of each algorithm on 10^5 elements, then 10^6, then 10^7. That is, look at how the runtime of each algorithm grows as n increases.
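
A minimal harness sketch for that kind of comparison, assuming the question's getMajElement (or get_majority_element) is in scope and the data is generated the same way (rand() % 10^9):

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

int main()
{
    for (std::size_t n : {100000u, 1000000u, 10000000u})
    {
        std::vector<int> a(n);
        for (int &x : a)
        {
            x = std::rand() % 1000000000; // same distribution as in the question
        }

        auto t0 = std::chrono::steady_clock::now();
        getMajElement(a); // swap in get_majority_element(a, 0, a.size()) to compare
        auto t1 = std::chrono::steady_clock::now();

        std::cout << "n=" << n << ": "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
                  << " ms\n";
    }
}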