Basic Python, but a weird problem in Hadoop Streaming: text value changes in MapReduce


I am processing a 121983-line txt file with Hadoop, but I ran into a weird problem in the MapReduce phase.

This is my mapper function:

#!/usr/bin/env python
import sys
import re

pattern = r'\b[a-zA-Z0-9]+\b'
numline = 0
for line in sys.stdin:
    numline += 1
    line = line.strip()
    words = re.findall(pattern, line)
    for word in words:
        print('%s\t%s' % (word, 1))
# emit the total number of input lines as a trailer record
print("%s\t%s" % ("num line", numline))
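For reference, the tokenization step can be checked in isolation; a minimal sketch, where the sample lines are made up (they are not from shakespere.txt):

```python
import re

pattern = r'\b[a-zA-Z0-9]+\b'

# hypothetical stand-in for sys.stdin
sample = ["To be, or not to be:", "that is the question."]

numline = 0
pairs = []
for line in sample:
    numline += 1
    for word in re.findall(pattern, line.strip()):
        pairs.append((word, 1))
# the trailer record the mapper emits last
pairs.append(("num line", numline))
print(pairs)
```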

This is my reduce function:

#!/usr/bin/env python
import sys

worddict = {}
nrow = 0
totalwords = 0
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')
    if word == "num line":
        nrow = int(count)
        continue

    totalwords += 1
    if word not in worddict:
        worddict[word] = 1
    else:
        worddict[word] += 1

wd_sorted = sorted(worddict.items(), key=lambda item: item[1], reverse=True)
print("There are %s lines in the text." % nrow)
print("The 100 most frequently used words are:")
for wd, cnt in wd_sorted[:100]:
    print("%s\t%s" % (wd, cnt))
print("There are %s words in the text." % totalwords)
print("There are %s unique words in the text." % len(wd_sorted))

The question is: I've verified that my mapper works and reports the right line count (121983) by running `cat shakespere.txt | python WDmapper.py` (WDmapper.py is my mapper and shakespere.txt is the file I need to process). But when I run the full job and look at the reducer's output, the line count becomes 60845.

I am pretty sure this isn't some limit of a numeric data type. I am also pretty sure the reducer's loop ran over all the records, because at the end the total word count is correct (910915).
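One property of the reducer that may be relevant (an assumption about the cause, not a confirmed diagnosis): Hadoop Streaming generally runs one mapper per input split, so the reducer can receive several "num line" records, and `nrow = int(count)` keeps only the last one seen rather than their sum. A quick local sketch, with made-up split sizes chosen so they sum to 121983:

```python
# hypothetical reducer input where two mappers each emitted a trailer record
sample = ["num line\t61138", "num line\t60845"]

nrow = 0
for line in sample:
    word, count = line.split('\t')
    if word == "num line":
        nrow = int(count)  # plain assignment overwrites the earlier value

print(nrow)  # → 60845, only the last record survives
```

Summing with `nrow += int(count)` instead would combine the per-mapper counts.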

It might be some special behavior of the MapReduce process, but I am a complete novice, so can someone help me with it?
