When we externally merge sort a large file, we split it into small ones, sort those, and then merge them back into a large sorted file.
When merging, we can either do many 2-way merge passes, or one multi-way merge.
I am wondering which approach is better? and why?
One multi-way merge is generally better. Consider three small files:
and
and finally
If you do a merge with
a
andb
, we're left with (say)and
A final merge would create the sorted list, but notice how in this final merge we have to visit the
a
andb
items again. It's this re-merging that is wasteful in cascading two-way merges.What you can do instead is a single multi-way merge. However, be careful how you do it. Specifically, avoid the naive double-loop that scans each cursor to see which has the minimum value. Use a min-heap instead. This will bring the complexity back down to
O(n log n)
.