We are looking for an efficient algorithm to solve the following problem:
Given two increasingly sorted arrays. Find the closest corresponding elements in each array that difference is below a user given threshold. But only the closest of possible candidates (in the range of
array1[i] +/- threshold) should be returned. The second closest could be matched to another element but matches to more than one element are not allowed. If two elements inarray1have the same distance toarray2[j]the first (leftmost) match should be reported.The arrays can contain duplicated values. There the first (leftmost) match should be reported (and all the others ignored/not matched).
Examples:
x: 1, 3, 5, 6, 8
y: 3, 4, 5, 7
threshold: 1
output: NA, 1, 3, 4, NA
(index of y that matches best to x)
x: 1, 1.5, 2, 2.1, 5, 6.1, 7.2
y: 4.6, 4.7, 4.8, 4.9, 5, 6, 7, 8
threshold: 3
output: NA, NA, NA, 1, 5, 6, 7
(index of y that matches best to x)
x: 1, 1, 1, 2, 2, 2
y: 1, 2
threshold: 0
output: 1, NA, NA, 2, NA, NA
(index of y that matches best to x, for duplicates choose to first one)
x: 1, 2
y: 1, 1, 1, 2, 2, 2
threshold: 0
output: 1, 4
(index of y that matches best to x, for duplicates choose to first one)
We use this to find the closest matching values between two m/z-values (mass-to-charge ratios) while comparing mass spectra.
Currently we iterate through both arrays and lookahead the differences for the next two elements and correct the previous element if a closer one was found. But this fails for more than two duplicate elements in a row (second example):
Our current implementation (C code as part of an R package): https://github.com/rformassspectrometry/MsCoreUtils/blob/master/src/closest.c#L73-L129
A commented version below:
SEXP C_closest_dup_closest(SEXP x, SEXP table, SEXP tolerance, SEXP nomatch) {
/* x is the first array of doubles */
double *px = REAL(x);
const unsigned int nx = LENGTH(x);
/* table is the second array of doubles where x should be matched against */
double *ptable = REAL(table);
const unsigned int ntable = LENGTH(table);
/* user given tolerance threshold */
double *ptolerance = REAL(tolerance);
/* integer array to store the results */
SEXP out = PROTECT(allocVector(INTSXP, nx));
int* pout = INTEGER(out);
/* integer that should returned if no valid match or a closer one was found */
const unsigned int inomatch = asInteger(nomatch);
/* indices */
unsigned int ix = 0, ixlastused = 1;
unsigned int itbl = 0, itbllastused = 1;
/* differences: current, difference to next element of x and table, respectively */
double diff = R_PosInf, diffnxtx = R_PosInf, diffnxttbl = R_PosInf;
while (ix < nx) {
if (itbl < ntable) {
/* difference for current pair */
diff = fabs(px[ix] - ptable[itbl]);
/* difference for next pairs */
diffnxtx =
ix + 1 < nx ? fabs(px[ix + 1] - ptable[itbl]) : R_PosInf;
diffnxttbl =
itbl + 1 < ntable ? fabs(px[ix] - ptable[itbl + 1]) : R_PosInf;
if (diff <= ptolerance[ix]) {
/* valid match, add + 1 to convert between R/C index */
pout[ix] = itbl + 1;
if (itbl == itbllastused &&
(diffnxtx < diffnxttbl || diff < diffnxttbl))
pout[ixlastused] = inomatch;
ixlastused = ix;
itbllastused = itbl;
} else
pout[ix] = inomatch;
if (diffnxtx < diff || diffnxttbl < diff) {
/* increment the index with the smaller distance */
if (diffnxtx < diffnxttbl)
++ix;
else
++itbl;
} else {
/* neither next x nor next table item offer a better match */
++ix;
++itbl;
}
} else
pout[ix++] = inomatch;
}
/* R provided MACRO to free allocated memory */
UNPROTECT(1);
return out;
}
Could anybody give us a hint for a better algorithm?