How do I use groupby on the output of a mapper?


This is a continuation of my previous question:

How to print only if a character is an alphabet?

I now have a mapper that works, and it gives me this output when I run it on a text file containing the string "It's a beautiful life".

i 1 1 0
t 1 0 0
s 1 0 0
a 1 1 1
b 1 0 0
e 1 0 0
a 1 0 0
u 1 0 0
t 1 0 0
i 1 0 0
f 1 0 0
u 1 0 0
l 1 0 0
l 1 0 0
i 1 0 0
f 1 0 0
e 1 0 1

Now I am trying to send this output into a script to get an output like this:

a [(1, 0, 0), (1, 1, 1)]
b [(1, 0, 0)]
e [(1, 0, 0), (1, 0, 1)]
f [(1, 0, 0), (1, 0, 0)]
i [(1, 0, 0), (1, 0, 0), (1, 1, 0)]  
l [(1, 0, 0), (1, 0, 0)]
s [(1, 0, 0)]
t [(1, 0, 0), (1, 0, 0)]
u [(1, 0, 0), (1, 0, 0)]

so that every tuple the mapper emits for a given letter is collected into that letter's list.

I have some code that was from a different but similar problem that I am trying to change around so it works with my mapper:

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file):
    for line in file:
        yield line.strip().split(' ')

#Call the function to read the input which is (<WORD>, 1)
data = read_mapper_output(sys.stdin)

#Each word becomes key and is used to group the rest of the values by it.
#The first argument is the data to be grouped
#The second argument is what it should be grouped by. In this case it is the <WORD>
for key, keygroup in groupby(data, itemgetter(0)):
    values = ' '.join(sorted(v for k, v in keygroup))
    print("%s %s" % (key, values))

I am having trouble changing the last block of code to work with my mapper. I know that I will have to print a list of tuples for each letter, with one tuple per occurrence of that letter in the mapper output.


I was able to answer my own question by doing this:

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file):
    for line in file:
        yield line.strip().split(' ')

# Read the mapper output: each line is "<LETTER> <v> <x> <y>"
data = read_mapper_output(sys.stdin)

# groupby only merges adjacent rows that share a key, so the input must
# already be sorted by letter before it reaches this script.
# The first argument is the data to be grouped.
# The second argument is the key to group by: here, the letter in field 0.
for key, keygroup in groupby(data, itemgetter(0)):
    # Build one sorted list of (v, x, y) integer tuples per letter.
    values = sorted((int(v), int(x), int(y)) for _, v, x, y in keygroup)
    print("%s %s" % (key, values))

I only had to change the last block, and I am sure this is spaghetti code that could be optimized, but I'm not very good at Python.
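
For anyone who wants to try the grouping without a separate sort step, here is a minimal sketch of the same idea wrapped in a function. The name `group_mapper_lines` and the `sample` list are made up for illustration and are not part of my original script. It assumes every line has exactly four space-separated fields (a letter and three integers), and it sorts the rows itself because `groupby` only merges adjacent items with the same key.

```python
from itertools import groupby
from operator import itemgetter


def group_mapper_lines(lines):
    """Group "<letter> <v> <x> <y>" lines into {letter: sorted list of int tuples}.

    Hypothetical helper for illustration; assumes four fields per line.
    """
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        letter, *counts = fields
        rows.append((letter, tuple(int(c) for c in counts)))

    # groupby only merges adjacent rows, so sort by letter (then by tuple) first.
    rows.sort()
    return {
        letter: [counts for _, counts in group]
        for letter, group in groupby(rows, key=itemgetter(0))
    }


# A few lines in the mapper's format, deliberately out of order.
sample = ["i 1 1 0", "t 1 0 0", "a 1 1 1", "i 1 0 0", "a 1 0 0"]
grouped = group_mapper_lines(sample)
for letter, tuples in grouped.items():
    print("%s %s" % (letter, tuples))
```

To read real mapper output, `sys.stdin` could be passed in place of `sample`. If the input already arrives sorted by letter, the in-memory sort is redundant and the streaming version above avoids holding every row in memory.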