Comparison sorting algorithms evaluating more than 2 elements at each step


Comparison sorting algorithms normally evaluate two elements at each step (and branch depending on whether the first element is smaller than, bigger than, or equal to the second).

What are some examples of sorting algorithms, preferably stable ones, where more than two elements are compared at each step?

How would their performance be evaluated?

When comparing 2 elements at a time, a comparison sort cannot do better than O(n log n) comparisons on average. What is the corresponding limit when comparing 3, 4, or 5+ elements at a time?

There are 5 answers below.

Answer (score 1):

Any k-element-at-a-time comparison (for a fixed k) can be emulated by some fixed number of 2-element-at-a-time comparisons. So if you could do better than O(n log n) with k-at-a-time comparisons, that would automatically give you a better-than-O(n log n) algorithm using 2-at-a-time comparisons — a contradiction.

Answer (score 2):

A k-way merge sort needs to determine which of k elements is the smallest. Using standard two-element compares, finding the smallest of k elements takes k-1 compares, so 3 compares for a 4-way merge. For an in-memory sort, a 4-way merge sort takes about the same number of operations as a 2-way merge sort, but gains a slight improvement in running time from being more cache friendly. There is an example of a 4-way merge sort at the end of this answer:

Optimized merge sort faster than quicksort

On a system with 16 registers, 4-way is about as high as you can go. Also, implementing this without a heap (a heap results in a slower algorithm) requires a large nested tree of if/else statements; it is already complex enough in the 4-way merge sort in the answer linked above.

Answer (score 0):

The search for efficient ways to sort using as many simultaneous comparisons as possible leads to "sorting networks"; Wikipedia has a nice article on them (Sorting network).

From that article: their sequence of comparisons is set in advance, regardless of the outcome of previous comparisons. To sort larger inputs, new sorting networks must be constructed. This independence of the comparison sequence is what makes them useful for parallel execution.

Here is pseudocode for sorting networks for arrays of size n = 3 to n = 7.

Compare(v, i, j) {
    if (v[j] < v[i]) Swap(v, i, j);
}

SortingNetwork3(v) {
    Compare(v, 0, 2);
    Compare(v, 0, 1);
    Compare(v, 1, 2);
}

SortingNetwork4(v) {
    Compare(v, 0, 2); Compare(v, 1, 3);
    Compare(v, 0, 1); Compare(v, 2, 3);
    Compare(v, 1, 2);
}

SortingNetwork5(v) {
    Compare(v, 0, 3); Compare(v, 1, 4);
    Compare(v, 0, 2); Compare(v, 1, 3);
    Compare(v, 0, 1); Compare(v, 2, 4);
    Compare(v, 1, 2); Compare(v, 3, 4);
    Compare(v, 2, 3);
}

SortingNetwork6(v) {
    Compare(v, 0, 5); Compare(v, 1, 3); Compare(v, 2, 4);
    Compare(v, 1, 2); Compare(v, 3, 4);
    Compare(v, 0, 3); Compare(v, 2, 5);
    Compare(v, 0, 1); Compare(v, 2, 3); Compare(v, 4, 5);
    Compare(v, 1, 2); Compare(v, 3, 4);
}

SortingNetwork7(v) {
    Compare(v, 0, 6); Compare(v, 2, 3); Compare(v, 4, 5);
    Compare(v, 0, 2); Compare(v, 1, 4); Compare(v, 3, 6);
    Compare(v, 0, 1); Compare(v, 2, 5); Compare(v, 3, 4);
    Compare(v, 1, 2); Compare(v, 4, 6);
    Compare(v, 2, 3); Compare(v, 4, 5);
    Compare(v, 1, 2); Compare(v, 3, 4); Compare(v, 5, 6);
}

Comparisons on the same row (step) can be done simultaneously, using multithreading, vectorization or specialized hardware.

The theoretical limit for sorting n elements is O(log n) steps (network depth) for sufficiently large n, but constructing an optimal sorting network is a computationally hard problem; there are practical algorithms (e.g. Batcher's odd-even merge) that build networks with O(log^2 n) steps. Optimal sorting networks are currently known only for n up to 17. Note that the total number of comparisons is still O(n log n).

Answer (score 0):

For sorting four byte-sized elements at once, you can use lookup tables.

var = four bytes packed into one 32-bit integer
sorted = lut[var];   // needs a 2^32-entry table -- "buy RAM"
// unpack the four sorted bytes from `sorted`
// O(1) per lookup

But RAM latency will make this slow unless the accessed table entries tend to stay in L1 cache.

If element values are very limited, say the integers 0, 1, 2, 3, then each value needs only 2 bits. Four values pack into one byte, so you can sort 4 values per lookup using only a 256-entry table, which fits in L1.

If the values are just 0 and 1, you don't even need a lookup table. Just count the 1 bits in the variable; there are CPU instructions (popcnt) that do this in O(1).

// sort 32 one-bit values at once
count = popcnt(var);                              // number of 1 bits
sorted = count ? 0xFFFFFFFF << (32 - count) : 0;  // guard: shifting by 32 is undefined
// O(1) and fast

The CPU pipelines instructions, so as long as you compare-and-swap independent elements, you get parallelism proportional to the depth of the pipeline.

CPUs also have SIMD units, so you can compare 2, 4, 8, or 16 32-bit elements at once and selectively choose the greater/lesser value in each lane. These are typically pipelined as well, so you get far higher parallelism.

Answer (score 0):

The limit is the same. Any comparison sort corresponds to a decision tree: the tree must distinguish all n! input permutations, and a k-way comparison merely gives each node more branches. That changes the base of the logarithm but not the asymptotic bound, so it doesn't really matter how many elements you compare at once; the Ω(n log n) limit comes from the comparison model itself.
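The decision-tree argument can be made quantitative. A single k-way comparison has at most k! distinguishable outcomes, so a tree that separates all n! input permutations still needs depth:

```latex
% Lower bound on k-way comparison sorts (decision-tree argument).
% A node with k inputs has at most k! outcomes, so a tree with n! leaves
% needs depth at least
\log_{k!}(n!) \;=\; \frac{\log(n!)}{\log(k!)}
           \;=\; \Theta\!\left(\frac{n \log n}{k \log k}\right)
% For any fixed k this is still Theta(n log n) comparisons.
```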
