Use if/then for loop and amatch or match to find similar values and match two dataframe columns?

101 Views Asked by At

I have two dataframes, one with raw data labels and one with the correct adjusted values the data needs to be matched to. The labels are numeric but can differ up to +/- 2. I am trying to figure out how to write a coded if/then loop since amatch does not work well for numerics. The goal is to have a loop where for every value in the raw data, it will check against the values in the corrected data and match to the closest match if one is present where corrected - raw is between -2 and 2. I have pasted my attempted but very much nonfunctional attempt below.

My thought is that it may be possible to use amatch to select the best/closest match when one is found, since in some cases there are several data label values close together while in others there is a shift in the number up to 2.

Is there a way to write such a code, or another better way to accomplish this? The goal is to have a corrected column matched to the raw data labels that I can then use to merge with the raw data and the additional metadata attached to the corrected labels, but for my full list of labels only about 60% match without needing this adjustment (you can see in the sample data, for example, 1910 should match to 1911 and 2056 needs to match to 2057). Because of the nature of the data, the differences are not consistent and I want this to be a function that I can apply to more than just a single instance of data so that I do not have to go through and match every data label by hand.

raw <- c(1419, 1444, 1460, 1485, 1501, 1542, 1581, 1590, 
         1606, 1622, 1647, 1663, 1688, 1704, 1743, 1791, 
         1793, 1809, 1850, 1866, 1891, 1905, 1910, 1954, 
         1956, 1976, 1996, 2012, 2028, 2041, 2053, 2056, 
         2067, 2100, 2102, 2122)

corrected <- c(1419, 1444, 1460, 1485, 1501, 1542, 1562, 
               1581, 1590, 1606, 1622, 1630, 1647, 1663, 
               1688, 1704, 1743, 1792, 1793, 1809, 1825, 
               1834, 1850, 1866, 1891, 1905, 1911, 1914, 
               1938, 1954, 1955, 1971, 1976, 1996, 2012, 
               2019, 2028, 2053, 2057, 2100, 2101, 2122)


labelmatch <- function(x, y) {data.frame(glycan=x, glycan_name=
                                            (for(i in 1:length(x)) {
                                              for(n in 1:length(y)) {
                                                if (n-i <= 2 & n-i >=-2) {
                                                  match(x, y)} else{
                                                    if (n-i >= 2 | n-i <=-2){
                                                  next}}}}))
}

labelmatch(raw, corrected)
3

There are 3 best solutions below

3
Onyambu On BEST ANSWER

Since your corrected data is sorted, we can use that fact to quickly search through the vector. Inspired by np.searchsorted

searchsorted <- function(findIn, vec, isSorted = TRUE){
  if(!isSorted) findIn <- sort(findIn)
  idx <- rank(c(vec, findIn, -Inf),, 'first')[seq_along(vec)] - rank(vec)
  right_vals <- findIn[idx]
  left_vals <- findIn[(idx2<-idx - 1) + !idx2]
  right_vals[na_idx] <- left_vals[na_idx<- is.na(right_vals)]
  right_vals[idx2] <- left_vals[idx2<- abs(right_vals - vec) > abs(left_vals - vec)]
  is.na(right_vals) <- abs(right_vals - vec) > 2
  right_vals
}
searchsorted(corrected, raw)
[1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704 1743
[16] 1792 1793 1809 1850 1866 1891 1905 1911 1954 1954 1976 1996 2012 2028   NA
[31] 2053 2057   NA 2100 2100 2122

--

Edit:

R does provide the function findInterval which could be used to simplify the task:

searchsorted <- function(x, vec){
  idx <- findInterval(x, vec, all.inside = TRUE)
  vals <- vec[idx]
  idx2 <- abs(vals - x) > 2
  vals2 <- vec[idx[idx2] + 1]
  is.na(vals2) <- vals2 - x[idx2] > 2
  replace(vals, idx2, vals2)
}

searchsorted(raw, corrected)
 [1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704
[15] 1743 1792 1793 1809 1850 1866 1891 1905 1911 1954 1955 1976 1996 2012
[29] 2028   NA 2053 2057   NA 2100 2101 2122
2
Mike On

One way to do this would be to expand all options of raw and corrected using expand.grid() and keeping the matches where the difference is between -2 and 2. I also create a vector where the raw values do not have a match so you can look at those values manually.

library(dplyr)
labelmatch <- function(x,y){
expand.grid(x,y) %>% 
        mutate(diff = Var2-Var1) %>% 
        filter(between(diff,-2,2))
}

labels2 <- labelmatch(raw,corrected)

With expand grid being to computationally expensive maybe this solution is a bit more efficient. first I make a matrix subtracting the raw and corrected values, then I find the indices where they are between -2, and 2, from there I make a data frame of the pairs where the values are between -2, and 2, I used dplyr to keep unique pairs where the match is closest to the raw value.

x1 <- sapply(corrected, function(x){
  x - raw
})
row.names(x1) <- raw
colnames(x1) <- corrected

s1 <- which(x1 <= 2 & x1 >= -2, arr.ind = TRUE)
x2 <- data.frame(rows =  rownames(x1)[s1[,1]], cols =  colnames(x1)[s1[,2]], 
           values = x1[s1])

rawunlabeld <- setdiff(raw, x2$rows)

x3 <- x2 %>% 
      dplyr::group_by(rows) %>% 
      dplyr::filter(abs(values) == min(abs(values))) %>% 
      dplyr::distinct()
1
ThomasIsCoding On

Probably you can try

> raw * NA^(!colSums(abs(outer(corrected, raw, `-`)) <= 2))
 [1] 1419 1444 1460 1485 1501 1542 1581 1590 1606 1622 1647 1663 1688 1704 1743
[16] 1791 1793 1809 1850 1866 1891 1905 1910 1954 1956 1976 1996 2012 2028   NA
[31] 2053 2056   NA 2100 2102 2122