I'm trying to solve the following problem:
Suppose you are given two sorted arrays A[1 .. m] and B[1 .. n] and an integer k. Describe a fast algorithm to find the kth smallest element in the union A ∪ B. For example, given the input A[1 .. 8] = [0,1,6,9,12,13,18,20], B[1 .. 5] = [2,5,7,17,19], and k = 6, your algorithm should return the integer 7.
I've found many solutions to this online, but none of them made much sense to me. I found an answer that seems very interesting, but I'm having trouble understanding why the algorithm works.
First, the author provides this algorithm to find the median of the union of two arrays:
Median(A[1 .. n],B[1...n]) :
if n < 100:
use brute force
else if A[n/2] > B[n/2]:
return Median(A[1 . . n/2], B[n/2 + 1 . . n])
else
return Median(A[n/2 + 1 . . n], B[1 . . n/2])
The proof for this is:
Suppose A[n/2] > B[n/2]. Then A[n/2+1] is larger than all n elements in A[1 .. n/2]∪B[1 .. n/2], and therefore larger than the median of A ∪ B, so we can discard the upper half of A. Similarly, B[n/2 − 1] is smaller than all n + 1 elements of A[n/2 . . n] ∪ B[n/2 + 1 . . n], and therefore smaller than the median of A ∪ B, so we can discard the lower half of B. Because we discard the same number of elements from each array, the median of the remaining subarrays is the median of the original A ∪ B. Time complexity is O(log(n))
The author then proceeds to offer an algorithm for the problem I am trying to solve that uses the median algorithm above:
Select(A[1 .. m],B[1 .. n],k):
if k < (m + n)/2
return Median(A[1 .. k], B[1 .. k])
else
return Median(A[k−n .. m], B[k-m .. n])
Here, Median is the algorithm with one minor tweak. If Median wants an entry in either A or B that is outside the bounds of the original arrays, it uses the value −∞ if the index is too low, or ∞ if the index is too high.
The author does not provide a proof of correctness for this algorithm, and simply states the time complexity as: O(log min {k, m + n − k}) = O(log(m + n))
I have a few questions about this.
First, why are these bounds used: Median(A[k−n .. m], B[k-m .. n]). Specifically, why [k-n .. m] and [k-m .. n]? I would have appreciated a proof of correctness for the algorithm
Second, how is O(log(min{k, m + n − k})) equal to O(log(m + n))?
Here's a link to the answer I found
Considering this algorithm:
Observe that to apply the given
Medianalgorithm, we need component arrays of equal length. Now considerA[1 .. k]andB[1 .. k], as defined in the algorithm. These both have lengthk. Between the two of them, they must contain thekth largest overall and all smaller ones. Any elements added via an expansion of the original arrays are greater than thekth. Therefore, of the2kelements, exactlyk - 1are smaller than thekth overall, hence thekth is the median of these -- supposing that we define the median of2xitems as thexth, which is a bit unusual.Note well that the above does not depend on the magnitude of
k(as long as it does not exceedm + n, which would be invalid). But ifkis close tom + nthen the inputs to the Median algorithm are large, too, so we consider an alternative approach.If the
kth element overall is inAthen its index there can be no less thank - n(occurring when allnelements ofBare less), and likewise, if it is inBthen its index there can be no less thank - m. Therefore, consider alternative arraysA[k−n .. m]andB[k-m .. n]. These both have lengthm + n + 1 - k, and any elements added by the expansions relative to the original arrays (at indices less than 1) are smaller than thekth. The total number of elements is2(m + n + 1 - k), and exactlyn + m - kof them are greater than thekth overall. That leavesn + m - k + 1less -- oops. With the median definition required for the other case, we need it the other way around:n + m - kless andn + m - k + 1greater. This approach is a bust.I think you will see the idea now, but it doesn't quite work out after all.
Taking
k,m, andnall to be size parameters,min{k, m + n − k}is maximized whenk = (m + n) / 2, in which case the result is also(m + n) / 2. In this particular view, then,O(log(min{k, m + n − k})) = O(log((m + n) / 2)) = O(log(m + n)).