Processing big data on a distributed system

I was asked to solve this problem in an interview:

Suppose there are 4 million comments, each with its own id and timestamp. Design an efficient algorithm that finds the most recent 1000 comments. You have 40 servers, and each server can handle 10 thousand comments at a time.

I was thinking about using MapReduce. How do I implement the map and reduce functions to solve this problem?
As the question specifically asks for an efficient algorithm, I suspect the interviewer cares less about techniques like MapReduce than about the underlying algorithm you use. This looks like an application of merge sort: divide the workload into 10K-comment chunks, sort each chunk by timestamp on one of the nodes, and then merge the sorted runs. Once complete you have all 4 million entries sorted by timestamp, and you can then take the most recent 1000. This algorithm runs in O(n log n).
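Here is a minimal single-process sketch of that divide-sort-merge idea in Python. It is an illustration, not the answer's exact implementation: the comment layout (dicts with "id" and "timestamp" keys) and the chunk_size and k parameters are assumptions, and each per-chunk sort stands in for the sorting a real node would perform.

```python
import heapq

def most_recent_comments(comments, k=1000, chunk_size=10_000):
    """Divide-sort-merge: sort each chunk (one 'node' each), then merge."""
    # Phase 1: split the workload into 10K-comment chunks and sort each
    # chunk by timestamp. In a real deployment each chunk would be sorted
    # on a separate server.
    runs = [
        sorted(comments[i:i + chunk_size], key=lambda c: c["timestamp"])
        for i in range(0, len(comments), chunk_size)
    ]

    # Phase 2: k-way merge of the sorted runs (the merge step of merge sort).
    merged = list(heapq.merge(*runs, key=lambda c: c["timestamp"]))

    # The most recent k comments are the last k entries of the merged order.
    return merged[-k:]
```

A tiny synthetic usage example:

```python
# Hypothetical data: five comments with made-up timestamps.
comments = [{"id": i, "timestamp": ts} for i, ts in enumerate([5, 3, 9, 1, 7])]
print(most_recent_comments(comments, k=2, chunk_size=2))
# -> [{'id': 4, 'timestamp': 7}, {'id': 2, 'timestamp': 9}]
```

One design note: since only the top 1000 are needed, each of the 40 servers could keep just its own 1000 most recent comments (for example with heapq.nlargest), so the final merge would touch only 40 × 1000 = 40,000 entries instead of all 4 million. The full-sort version above stays closer to the merge-sort framing of the answer.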