There is something I don't understand about the algorithm of median of medians. One key step about this algorithm is to find an approximate median, and according to Wikipedia, we have the guarantee that this approximate median is greater than 30% of elements of the initial set.
To find this approximate median, we compute the median of each group of 5 elements, we gather these medians in a new set, and we recompute the medians until the obtained set have least than 5 elements. In this case, we get the median of the set. (see the wikipedia page if my explanations are not clear)
But, consider the following set of 125 elements :
1 2 3 1001 1002
4 5 6 1003 1004
7 8 9 1005 1006
1020 1021 1022 1023 1034
1025 1026 1027 1028 1035
10 11 12 1007 1008
13 14 15 1009 1010
16 17 18 1011 1013
1029 1030 1031 1032 1033
1036 1037 1038 1039 1040
19 20 21 1014 1015
22 23 24 1016 1017
25 26 27 1018 1019
1041 1042 1043 1044 1045
1046 1047 1048 1049 1050
1051 1052 1053 1054 1055
1056 1057 1058 1059 1060
1061 1062 1063 1064 1065
1066 1067 1068 1069 1070
1071 1072 1073 1074 1075
1076 1077 1078 1079 1080
1081 1082 1083 1084 1085
1086 1087 1088 1089 1090
1091 1092 1093 1094 1095
1096 1097 1098 1099 1100
So we divide the set in group of 5 elements, we compute and gather the medians, and so, we obtain the following set :
3 6 9 1022 1207
12 15 18 1031 1038
21 24 27 1043 1048
1053 1058 1063 1068 1073
1078 1083 1088 1093 1098
We redo the same algorithm, and we obtain the following set :
9 18 27 1063 1068
So we obtain that the approximate median is 27. But this number is greater or equals than only 27 elements. And 27/125 = 21.6% < 30%!!
So my question is : where am I wrong?? Why is the approximate median is in my case not greater than 30% of elements????
Thank you for your replies!!
The cause of your confusion about the median-of-medians algorithm is that, while median-of-medians returns an approximate result within 20% of the actual median, at some stages in the algorithm we also need to calculate exact medians. If you mix up the two, you will not get the expected result, as demonstrated in your example.
Median-of-medians uses three functions as its building blocks:
This function returns the exact median of five (or fewer) elements from (part of) an array. There are several ways to code this, based on e.g. a sorting network or insertion sort. The details are not important for this question, but it is important to note that this function returns the exact median, not an approximation.
This function returns an approximation of the median from (part of) an array, which is guaranteed to be larger than the 30% smallest elements, and smaller than the 30% largest elements. We'll go into more detail below.
This function returns the n-th smallest element from (part of) an array. This function too returns an exact result, not an approximation.
At its most basic, the overall algorithm works like this:
So this is where your calculation went wrong. After creating an array with the median-of-fives, you then used the median-of-medians function again on this array, which gives you an approximation of the median (27), but here you need the actual median (1038).
This all sounds fairly straightforward, but where it becomes complicated is that the function select() calls medianOfMedians() to get a first estimate of the median, which it then uses to calculate the exact median, so you get a two-way recursion where two functions call each other. This recursion stops when medianOfMedians() is called for 25 elements or fewer, because then there are only 5 medians, and instead of using select() to find their median, it can use medianOfFive().
The reason why select() calls medianOfMedians() is that it uses partitioning to split (part of) the array into two parts of close to equal size, and it needs a good pivot value to do that. After it has partitioned the array into two parts with the elements which are smaller and larger than the pivot, it then checks which part the n-th smallest element is in, and recurses with this part. If the size of the part with the smaller values is n-1, the pivot is the n-th value, and no further recursion is needed.
As you see, the select() function recurses (unless the pivot happens to be the n-th element), but on ever smaller ranges of the array, so at some point (e.g. two elements) finding the n-th element will become trivial, and recursing further is no longer needed.
So finally we get, in some more detail:
EXAMPLE
Input array (125 values, 25 groups of five):
Medians of groups of five (25 values):
Groups of five for approximate median:
Medians of five for approximate median:
Approximate median as pivot:
Medians of five partitioned with pivot 27 (depends on method):
The smaller group has 8 elements, the larger group 16 elements. We were looking for the middle 13th element out of 25, so now we look for the 13 - 8 - 1 = 4th element out of 16:
Groups of five:
Medians of groups of five:
Approximate median as pivot:
Range of medians of five partitioned with pivot 1058 (depends on method):
The smaller group has 7 elements. We were looking for the 4th element of 16, so now we look for the 4th element out of 7:
Groups of five:
Medians of groups of five:
Approximate median as pivot:
Range of medians of five partitioned with pivot 1031 (depends on method):
The smaller part has 2 elements, and the larger has 4, so now we look for the 4 - 2 - 1 = 1st element out of 4:
Median of five as pivot:
Range of medians of five partitioned with pivot 1043 (depends on method):
The smaller part has only one element, and we were looking for the first element, so we can return the small element 1038.
As you will see, 1038 is the exact median of the original 25 median-of-fives, and there are 62 smaller values in the original array of 125:
which not only puts it in the 30~70% range, but means it is actually the exact median (note that this is a coincidence of this particular example).