I have an input file with one column as an ID and another column as a counter value. Based on the counter value, I filter data from the input file into an output file. I built a task in DMExpress that checks the counter and the ID. The input file has 10 rows for each ID. If the counter value for an ID is 3, I extract the top 3 rows for that ID and then move on to the next ID. When I run this task in Hadoop, Hadoop takes the first 3 records of several IDs, then creates a new output part file (once the desired size is reached) for the remaining IDs.
The problem: when Hadoop writes file 0, it correctly extracts 3 records for ID X. But when it starts the next part of the output (file 1), it also writes one more record for ID X, the one right after the boundary of file 0, i.e. the 4th record for ID X. This inflates the record count in my output.
Example: these are the records in the input file.
..more records..
1|XXXX|3|NNNNNNN
2|XXXX|3|MMMMMMM
3|XXXX|3|AAAAAAA
4|XXXX|3|BBBBBBB
5|XXXX|3|NNNDDDD
6|YYYY|3|QQQQQQQ
7|YYYY|3|4444444
8|YYYY|3|1111111
..more records..
The output files that Hadoop creates look like this:
file 0 :
..more records..
1|XXXX|3|NNNNNNN
2|XXXX|3|MMMMMMM
3|XXXX|3|AAAAAAA
file 1:
4|XXXX|3|BBBBBBB
6|YYYY|3|QQQQQQQ
7|YYYY|3|4444444
8|YYYY|3|1111111
..more records..
Line 4 for ID XXXX should not be there! Why is Hadoop not applying the counter filter correctly?
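To make the intended behavior concrete, here is a minimal sketch (in Python, not DMExpress) of the filter I expect: keep at most `counter` rows per ID, where the counter is the third pipe-delimited field. The function name and field layout are my own illustration, not anything from DMExpress.

```python
from collections import defaultdict

def filter_top_n(records):
    """Keep at most `counter` rows per ID (fields: seq|id|counter|payload)."""
    seen = defaultdict(int)  # rows already emitted per ID
    out = []
    for line in records:
        seq, rec_id, counter, payload = line.split("|")
        if seen[rec_id] < int(counter):
            seen[rec_id] += 1
            out.append(line)
    return out

sample = [
    "1|XXXX|3|NNNNNNN",
    "2|XXXX|3|MMMMMMM",
    "3|XXXX|3|AAAAAAA",
    "4|XXXX|3|BBBBBBB",
    "5|XXXX|3|NNNDDDD",
    "6|YYYY|3|QQQQQQQ",
    "7|YYYY|3|4444444",
    "8|YYYY|3|1111111",
]
# Expected: rows 1-3 for XXXX and rows 6-8 for YYYY; row 4 is dropped.
print(filter_top_n(sample))
```

Note that `seen` is a single running count per ID across the whole input; the filter only works if one process sees all rows for a given ID. If independent workers each restart the count from zero on their own split of the data, a record for a boundary ID can slip through, which matches the symptom above.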