Sort data file based on a separate key file

  1. We have 2 files: data.txt and keys.txt.
  2. data.txt is valid Unicode text with N lines.
  3. keys.txt is a list of newline-separated integers, N lines.
  4. Output a file sorted.txt in which the lines of data.txt are sorted according to the corresponding keys in keys.txt, without first writing an intermediate file such as the output of paste -d',' keys.txt data.txt.

I need to use this for large files (hundreds of GB) on machines with 16-32 GB of memory.

My first attempt was in Python, which is a bit slow. The task is simple enough that we discussed doing it in C++, but I'd prefer readily available tools so that nothing needs to be installed. This may well be impossible to do efficiently with GNU or Unix tools, but I don't know them well enough to say.


There is 1 best solution below


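You should be able to do this with a pipeline, so the pasted key/data file is never written to disk yourself. (sort will still spill sorted runs to its own temporary files once the input no longer fits in its buffer.) For performance, I guess tuning sort --buffer-size would be the first move, and perhaps using parallel to sort chunks separately and then merging them with sort -m the second.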

paste keys.txt data.txt | sort -n -k1,1 | cut -f2- > sorted.txt
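
As a rough sketch of the tuning suggested above, the pipeline could be wrapped in a small function. This assumes GNU coreutils sort (for --buffer-size and --parallel). The function name sort_by_keys and the SORT_MEM / SORT_THREADS variables are made up for illustration, and the defaults are guesses that need calibrating on the actual machine:

```shell
#!/bin/sh
# Hypothetical helper (not a standard tool): sort_by_keys KEYS DATA OUT [TMPDIR]
# Assumes GNU sort; SORT_MEM and SORT_THREADS are illustrative variable names.
sort_by_keys() {
    keys=$1
    data=$2
    out=$3
    tmp=${4:-${TMPDIR:-/tmp}}
    # paste joins each key to its line with a tab; nothing is written to disk here.
    # LC_ALL=C makes sort compare bytes instead of using locale collation.
    # -n -k1,1 limits the numeric key to the first field only.
    # -s keeps lines with equal keys in their original order.
    # -T chooses where sort spills its temporary runs for oversized inputs.
    paste "$keys" "$data" |
        LC_ALL=C sort -s -n -k1,1 \
            --buffer-size="${SORT_MEM:-8G}" \
            --parallel="${SORT_THREADS:-4}" \
            -T "$tmp" |
        cut -f2- > "$out"
}
```

For example, SORT_MEM=12G SORT_THREADS=8 sort_by_keys keys.txt data.txt sorted.txt /path/to/scratch, where the scratch directory (a placeholder path) should have roughly as much free space as the pasted input, since that is where sort writes its temporary runs. Whether a larger buffer or more threads actually helps depends on the disks, so it is worth timing on a sample first.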