Split sorted file using Python at change of value


I am new to Python. My requirement, which would be simple to do using awk, is as follows.

The file below (test.txt) is tab-separated:

1 a b c
1 a d e
1 b d e
2 a b c
2 a d e
3 x y z

The output I want is like this:

File 1.txt should have the values below:

a b c
a d e
b d e

File 2.txt should have the values below:

a b c
a d e

File 3.txt should have the values below:

x y z

The original file is sorted on the first column. I do not know the row number at which I have to split; it has to be at each change of value. Using awk, I would write it like this:

awk -F"\t" 'BEGIN {OFS="\t";} {print $2,$3,$4 > ($1".txt")}' test.txt

(Performance-wise, will Python be better?)

4 Answers

Answer 1

Awk is perfect for this and should be a lot faster. Is speed really an issue, though? How big is your input?

$ awk '{print $2,$3,$4 > ("file"$1)}' OFS='\t' file

Demo:

$ ls
file

$ cat file
1 a b c
1 a d e
1 b d e
2 a b c
2 a d e
3 x y z

$ awk '{print $2,$3,$4 > ("file"$1)}' OFS='\t' file

$ ls
file  file1  file2  file3

$ cat file1
a b c
a d e
b d e

$ cat file2 
a b c
a d e

$ cat file3
x y z

Answer 2

Something like this should do what you want.

import itertools as it

with open('test.txt') as in_file:
    splitted_lines = (line.split(None, 1) for line in in_file)
    for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
        with open(num + '.txt', 'w') as out_file:
            out_file.writelines(line for _, line in group)

  • The with statement allows resources to be used safely; in this case it automatically closes the files.
  • The splitted_lines = (...) line creates a generator that takes each line and yields the pair (first element, rest of the line).
  • The itertools.groupby function does most of the work: it iterates over the lines of the file and groups consecutive lines by their first element. (A small sketch of its behaviour follows this list.)
  • The (line for _, line in group) generator iterates over the "splitted lines": it drops the first element and writes only the rest of each line. (The _ is just an identifier like any other; I could have used x or first, but _ is conventionally used for something you have to assign but don't use.)
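
For illustration, here is a minimal sketch (with made-up data, not taken from the question) of how itertools.groupby batches consecutive items that share a key:

import itertools as it

rows = [('1', 'a b c\n'), ('1', 'a d e\n'), ('2', 'a b c\n')]
for key, group in it.groupby(rows, key=lambda x: x[0]):
    # group yields the consecutive pairs that share this key
    print(key, [rest for _, rest in group])
# prints: 1 ['a b c\n', 'a d e\n']
#         2 ['a b c\n']

Note that groupby only groups consecutive equal keys, which is exactly why it matters that the input file is sorted on the first column.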

We could probably simplify the code. For example, the outermost with is unlikely to be useful, since we are only opening the file in read mode, not modifying it. Removing it lets us drop one level of indentation:

import itertools as it

splitted_lines = (line.split(None, 1) for line in open('test.txt'))
for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
    with open(num + '.txt', 'w') as out_file:
        out_file.writelines(line for _, line in group)

I have done a very simple benchmark to test the Python solution against the awk solution. The performance is about the same, with Python slightly faster, using a file where each line has 10 fields and which contains 100 "line groups", each of a random size between 2 and 30 lines.

Timing of the Python code:

In [22]: from random import randint
    ...: 
    ...: with open('test.txt', 'w') as f:
    ...:     for count in range(1, 101):
    ...:         num_nums = randint(2, 30)
    ...:         for time in range(num_nums):
    ...:             numbers = (str(randint(-1000, 1000)) for _ in range(10))
    ...:             f.write('{}\t{}\n'.format(count, '\t'.join(numbers)))
    ...:             

In [23]: %%timeit
    ...: splitted_lines = (line.split(None, 1) for line in open('test.txt'))
    ...: for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
    ...:     with open(num + '.txt', 'w') as out_file:
    ...:         out_file.writelines(line for _, line in group)
    ...: 
10 loops, best of 3: 11.3 ms per loop

Awk timings:

$ time awk '{print $2,$3,$4 > ("test"$1)}' OFS='\t' test.txt

real    0m0.014s
user    0m0.004s
sys     0m0.008s

Note that 0.014s is about 14 ms.

Anyway, the timings can vary depending on OS load, and effectively they are equally fast. In practice almost all the time is spent reading from and writing to files, and both Python and awk do this efficiently. I don't believe you would see huge speed gains even using C.

Answer 3

My version:

for line in open('test.txt', 'r'):
    fields = line.split('\t')          # the file is tab-separated
    doc_name = fields[0]
    content = '\t'.join(fields[1:])    # keeps the trailing newline

    # append, because lines with the same key arrive one at a time;
    # the with statement closes the file each time around the loop
    with open('file' + doc_name, 'a') as f:
        f.write(content)

Answer 4

If you have a very large file in mind, awk will open and close a file on each line to do that append, won't it? If that's a problem, C++ has the speed and the container classes to nicely handle an arbitrary number of open output files, so that each file gets opened and closed exactly once. This is tagged Python, though, which will be nearly as fast, assuming that I/O time dominates.

A version to avoid the extra open/close overhead in Python:

# iosplit.py

def iosplit(ifile, ifname="", prefix=""):
    ofiles = {}  # output file name -> open handle, so each file is opened once
    try:
        for iline in ifile:
            tokens = [s.strip() for s in iline.split('\t')]
            if tokens and tokens[0]:
                ofname = prefix + str(tokens[0]) + ".txt"
                if ofname in ofiles:
                    ofile = ofiles[ofname]
                else:
                    ofile = open(ofname, "w+")
                    ofiles[ofname] = ofile
                ofile.write('\t'.join(tokens[1:]) + '\n')
    finally:
        for ofname in ofiles:
            ofiles[ofname].close()

if __name__ == "__main__":
    import sys
    # pad sys.argv so that missing arguments fall back to defaults
    ifname = (sys.argv + ["test.txt"])[1]
    prefix = (sys.argv + ["", ""])[2]
    iosplit(open(ifname), ifname, prefix)

Command-line usage is python iosplit.py [input-file [prefix]]; the input file defaults to test.txt.

The prefix defaults to empty and will be prepended to each output file name. The calling program provides a file (or file-like object), so you can drive this with a StringIO object or even a list/tuple of strings.
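
For example, a quick hypothetical test (the demo_ prefix and sample data are made up) driving iosplit with an in-memory StringIO instead of a real file, assuming the module above is saved as iosplit.py:

import io
from iosplit import iosplit

# three tab-separated lines, sorted on the first column
data = io.StringIO("1\ta\tb\tc\n1\ta\td\te\n2\tx\ty\tz\n")
iosplit(data, prefix="demo_")  # writes demo_1.txt and demo_2.txt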

Caveat: this will remove any spaces that precede or follow tab characters in a line; internal spaces won't be touched. So "1\ta b \t c \t d" will be converted to "a b\tc\td" when written to 1.txt.