- We have 2 files: `data.txt` and `keys.txt`.
  - `data.txt` is some proper Unicode text with N lines.
  - `keys.txt` is a list of newline-separated integers, N lines.
- Output a file `sorted.txt` where the lines in `data.txt` are sorted according to `keys.txt`, without writing an intermediate file such as the output of `paste -d',' keys.txt data.txt`.

I need to use this for large files (hundreds of GB) on machines with 16-32 GB of memory.
My first attempt was to do it in Python, which is a bit slow. It's simple enough, so we discussed doing it in C++. But I'd prefer if it uses readily available tools so there's no installation needed. This could well be impossible to do efficiently with GNU or Unix tools, but I don't know enough there to make a claim.
You should be able to do this without buffering to a file. For performance, I guess calibrating `sort --buffer-size` would be the first move, and perhaps using `parallel` to sort in chunks the second.
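
To make the idea concrete, here is a minimal sketch of such a pipeline, not a tuned solution. The function name `sort_by_keys`, the `SORT_BUFFER` variable and its `8G` default are assumptions made for illustration. It also assumes GNU `sort`, and that a tab is an acceptable delimiter between key and line (tabs inside the data are preserved, because only the first field is removed at the end).

```shell
# Sketch: sort the lines of a data file by the integers in a parallel key file.
# sort_by_keys and SORT_BUFFER are hypothetical names, not from any existing tool.
# Usage: sort_by_keys KEYS_FILE DATA_FILE OUTPUT_FILE
sort_by_keys() {
    tab=$(printf '\t')
    # paste: join line i of the keys with line i of the data, tab-separated.
    # sort:  compare numerically on the first field only; -s keeps lines with
    #        equal keys in their input order; --buffer-size caps the in-memory buffer.
    # cut:   drop the key again, keeping everything after the first tab.
    paste "$1" "$2" \
        | LC_ALL=C sort -t "$tab" -k1,1n -s --buffer-size="${SORT_BUFFER:-8G}" \
        | cut -f2- > "$3"
}
```

It would be called as `sort_by_keys keys.txt data.txt sorted.txt`. The pasted stream is never written out by the script, but once the input exceeds the buffer, GNU `sort` spills to its own temporary files (under `$TMPDIR`, or the directory given with `-T`), so that directory needs enough free space.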