The question is like this: Assume we have N machines, and each machine store and can manipulate its N elements, then, how can we find the median of all the N^2 elements in the lowest cost?
It really bothers me much, hope to get answer from you guys, thanks!
Sorry I just write it down too simple. The elements stored in each machine is random, and have no order. And the cost contains I/O cost, as well as communication between machines, RAM, time everything should be considered too. I just want to find the most efficient way to get the median.
These are some solutions I have come up with:
- use external sort like merge sort or something else, and find the median.
- use bucket sort, divide all the elements into X consecutive buckets according to its value, and so we can decide which bucket the median is in. Scan the bucket and we will get the median.
- I think the finding kth number in O(N) algorithm in "Introduction to Algorithms" should work here?
But still, all these solutions need an extra machine to do the job. I'm wondering whether there is a way that we can only use these N machines to get the median?
Thanks!
You'll need to have a process that counts all the values (total across all the stores). Pick the middle index. Adjust the index to be an offset from the start of items on the appropriate machine. Ask that machine to sort the items and return the value for that index.