I need to sort the contents of a file by its second field, for example:
Input file:
Jervie,12,M
Jaimy,11,F
Tony,23,M
Janey,11,F
Output file:
Jaimy,11,F
Janey,11,F
Jervie,12,M
Tony,23,M
I need to use an external sort: the input file can be as large as 4 GB, but only 1 GB of RAM is available.
I tried the code below, but it does not work for this input because it treats the whole content of each line as an int. I also have a doubt about the buffer size for each pass of the external sort: how do I decide on that?
This version sorts a file that contains integers only:
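import heapq
import tempfile
from itertools import islice, imap  # Python 2; on Python 3 use the built-in map

file = open("i2.txt", "r")
temp_files = []
while True:
    # Read the next chunk (2 lines here) and stop at end of file.
    e = list(islice(file, 2))
    if not e:
        break
    e.sort(key=lambda line: int(line.split()[0]))
    # Write the sorted chunk to its own temporary file and rewind it.
    temp_file = tempfile.TemporaryFile()
    temp_file.writelines(e)
    temp_file.flush()
    temp_file.seek(0)
    temp_files.append(temp_file)
file.close()
# Merge the sorted chunks, parsing every line as an int.
with open('o.txt', 'w') as out:
    out.writelines(imap('{}\n'.format, heapq.merge(*(imap(int, f) for f in temp_files))))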
I am able to create temporary files sorted on the second field, but how do I merge them based on that?
Try out-of-core processing with Blaze (http://blaze.readthedocs.io/en/latest/ooc.html).
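If you would rather stay with the standard library, the merge step can reuse the same key as the chunk sort. Below is a minimal sketch, not a tuned implementation. It assumes Python 3.5 or later (where heapq.merge accepts a key argument), comma-separated lines, and an integer second field on every line. The names external_sort, second_field and lines_per_chunk are made up for illustration.

```python
import heapq
import tempfile
from itertools import islice


def second_field(line):
    # Shared key for the chunk sort and the merge: the second
    # comma-separated field, compared as an integer (assumed format).
    return int(line.split(",")[1])


def external_sort(in_path, out_path, lines_per_chunk=100000):
    # lines_per_chunk is a hypothetical tuning knob: one chunk has to
    # fit in memory as a list of Python strings.
    chunks = []
    try:
        with open(in_path) as src:
            while True:
                lines = list(islice(src, lines_per_chunk))
                if not lines:
                    break
                # Make sure the last line of the file ends with a newline,
                # otherwise it would be glued to the next line when merged.
                if not lines[-1].endswith("\n"):
                    lines[-1] += "\n"
                lines.sort(key=second_field)
                chunk = tempfile.TemporaryFile(mode="w+")
                chunk.writelines(lines)
                chunk.seek(0)
                chunks.append(chunk)
        with open(out_path, "w") as out:
            # heapq.merge reads the chunk files lazily, so only about
            # one line per chunk is held in memory at a time.
            out.writelines(heapq.merge(*chunks, key=second_field))
    finally:
        for chunk in chunks:
            chunk.close()
```

Calling external_sort("i2.txt", "o.txt", lines_per_chunk=2) on the sample input from the question should produce the expected output shown there. On the buffer size: a list of Python strings takes noticeably more memory than the same text on disk, so it is safer to pick a chunk well below the 1 GB limit. Every chunk also stays open as a file during the merge, so very small chunks on a 4 GB input can run into the operating system's open-file limit.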