External sorting of lines. Сount of files to merge?

80 Views Asked by At

I need to sort the rows in the file in the minimum time (tens of GB) on a PC. I should use N-way merge sorting, right? How do choose the number N (the number of files to merge at a time)? Should I measure delays when reading or writing and tune N? Or get disk information from the system? If i have SSD, could I merge all sorted part at once (The program need to somehow determine is it an SSD)? What other optimizations can be?

1

There are 1 best solutions below

3
rcgldr On

After the initial pass to create a set of sorted sub-files, for hard drives, typically a 16-way merge using a min-heap is used, which is still fast enough to keep the process I/O bound. To reduce random access overhead, large read / writes are needed, like 100MB if you have enough ram (16 input blocks, 1 output block, 1.7GB of buffer space).

For SSDs faster transfer rate, a smaller than 16 k-way merge may be best. For the very fast SAS or NVMe SSDs with read rates at 2GB/S and write rates over 1GB/S, a 2 or 4 way merge without heap may be all that is possible while keeping the drives close to I/O bound. For SATA SSDs with read write rates a bit over 500MB/S, something from a 6 to 16 way merge may be best.