I know how to apply a function to a column in data.table, receive a single value and compare within the same data.table. But, since I am a novice, I do not know how to compare the the result from the function with another data.table. What is special in this case is that the result from the function is a vector. I am not good with words so here is an example instead.

Example data. I have values in a data.table, dt1:

## dt1
id <- c("p1","p2","p3","p3","p3",
        "p4","p4","p4","p4","p4",
        "p4","MJF2_157","MJF2_157","MJF3_30","MJF3_30",
        "MJF3_30","MJF3_30","MJF3_30","MJF3_30") 

mass_Da <- c(45659.18,45589.87,45589.19,45610.22,45802.85,
             45660.37,45634.28,45611.12,45589.83,45569.70,
             20838.62,20952.07,20951.89,23539.60,46860.72,
             46839.05,47851.63,46909.48,47555.12)

dt1 <- data.table(id,mass_Da)

This is the function I'd like to apply on each row in column "mass" results in a vector for each mass:

ld <- function(m,n=5:60){
  ldx <- (m+1.0078*n)/n
  return(ldx[ldx>1000 & ldx<2000])
} 

The resulting vector should be compare to values by id in another data.table, dt2. If the value in the column masstal also is found in the vector, the corresponding mass from the vector should be displayed in dt2. If no match is found a "no" should be the out put. If several, each vector should be compared to each masstal within the same id. Notice that the length is of the columns in dt1 is never the same as in dt2:

### dt2 
id <- c("p1","p2","p3","p3","p3",
        "p4","p4","p4","MJF2_157","MJF2_157",
        "MJF3_157","MJF3_30","MJF3_30","MJF3_30","MJF3_30",
        "MJF3_61","MJF3_61","MJF3_61","MJF3_61","MJF3_61")
length(id)

masstal <- c(1689.770,1341.931,1086.698,1304.001,1433.335,
             1425.521,1989.758,1173.885,1384.406,1065.691,
             1073.079,1742.961,1384.798,1090.666,1447.482,
             1057.163,1173.785,1304.204,1557.331,1149.469)
length(masstal)

dt2 <- data.table(id,masstal)

Lastly the match should be done with a tolerance of +-0.5. If it works out the ideal dt should look like this:

dt2$confirmed_mass <- c("no",45489.87,45489.19,45489.19,"no",
                        "X","no","no","no","no",
                        "no","no","no",46860.72,"no",
                        "NA","NA","NA","NA","NA")

But there is still a problem if the tolerance accepts masstal from two or more mass, here labelled with "X" .

I have tried merging the dt:s but it seems like a detour... and I am still stuck and can not figure out how to compare with a vector.

dt_merge <- merge.data.table(dt1,dt2, by=.EACHI, allow.cartesian=TRUE)

Any help is much appreciated.

1

There are 1 best solutions below

0
r2evans On

Up front, I'll assume some of your numbers are typos in the confirmed_mass column, some do not exist in any data you provided.

Try this:

dt1long <- dt1[, list(lapply(mass_Da, ld)), by = dt1
               ][, c(lapply(.SD, function(z) rep(z, times = lengths(V1))),
                     list(V1 = unlist(V1))),
                 .SDcols = c("id", "mass_Da")]
dt1long
#           id  mass_Da       V1
#       <char>    <num>    <num>
#   1:      p1 45659.18 1986.190
#   2:      p1 45659.18 1903.474
#   3:      p1 45659.18 1827.375
#   4:      p1 45659.18 1757.130
#   5:      p1 45659.18 1692.089
#   6:      p1 45659.18 1631.693
#   7:      p1 45659.18 1575.462
#   8:      p1 45659.18 1522.980
#   9:      p1 45659.18 1473.885
#  10:      p1 45659.18 1427.857
#  ---                          
# 380: MJF3_30 47555.12 1252.458
# 381: MJF3_30 47555.12 1220.370
# 382: MJF3_30 47555.12 1189.886
# 383: MJF3_30 47555.12 1160.889
# 384: MJF3_30 47555.12 1133.273
# 385: MJF3_30 47555.12 1106.941
# 386: MJF3_30 47555.12 1081.806
# 387: MJF3_30 47555.12 1057.788
# 388: MJF3_30 47555.12 1034.815
# 389: MJF3_30 47555.12 1012.819

That is each row of dt1 exploded into as many new values as your ld function produces. We then join that onto dt2 and find matches with tolerance. Since I'm inferring that dt2 does not have guaranteed-unique keys (using floating-point is not safe for that), I'll add one with rn, used to re-aggregate back to the original data.

tol <- 0.5
dt2[, c("rn", "masstal1", "masstal2") := .(.I, masstal - tol, masstal + tol)]

dt1long[dt2, on = .(id, V1 >= masstal1, V1 <= masstal2)
        ][, .(mass = na.omit(mass_Da)[1]), by = .(rn, id, masstal, confirmed_mass)
          ][, massfound := fifelse(id %in% dt1$id, !is.na(mass), NA)]
#        rn       id  masstal confirmed_mass     mass massfound
#     <int>   <char>    <num>         <char>    <num>    <lgcl>
#  1:     1       p1 1689.770             no       NA     FALSE
#  2:     2       p2 1341.931       45489.87 45589.87      TRUE
#  3:     3       p3 1086.698       45489.19 45589.19      TRUE
#  4:     4       p3 1304.001       45489.19 45589.19      TRUE
#  5:     5       p3 1433.335             no       NA     FALSE
#  6:     6       p4 1425.521              X 45589.83      TRUE
#  7:     7       p4 1989.758             no       NA     FALSE
#  8:     8       p4 1173.885             no       NA     FALSE
#  9:     9 MJF2_157 1384.406             no       NA     FALSE
# 10:    10 MJF2_157 1065.691             no       NA     FALSE
# 11:    11 MJF3_157 1073.079             no       NA        NA
# 12:    12  MJF3_30 1742.961             no       NA     FALSE
# 13:    13  MJF3_30 1384.798             no       NA     FALSE
# 14:    14  MJF3_30 1090.666       46860.72 46860.72      TRUE
# 15:    15  MJF3_30 1447.482             no       NA     FALSE
# 16:    16  MJF3_61 1057.163             NA       NA        NA
# 17:    17  MJF3_61 1173.785             NA       NA        NA
# 18:    18  MJF3_61 1304.204             NA       NA        NA
# 19:    19  MJF3_61 1557.331             NA       NA        NA
# 20:    20  MJF3_61 1149.469             NA       NA        NA
#        rn       id  masstal confirmed_mass     mass massfound

Notes:

  • I kept your confirmed_mass column (and used it as a "key" for aggregation) solely to do a side-by-side comparison with your data; I assume it is not really part of the original dt2, so it should be removed from the by= vector.

  • I also kept rn just to show what it's doing ... it can safely be removed after the join with dt1long.

  • Per the comments, I broke the results into two columns so that we don't combine numbers and non-numbers. Feel free to re-combine them if you really need them to be. I chose a logical for the second column, since it seemed your intent was ternary: found, not found, or id-not-present.

I don't know the likelihood of this with your real data, but ... if your ld(.) function can produce numbers such that two or more are within the 0.5 tolerance, then it is possible that the join would find multiples. In its current state, the code above will silently mask that, with no indication that it occurred nor which of the mass_Da matches were returned. If you think this is possible, then we can change na.omit(.)[1] (which will always return length-1 something) with a list.

dt1long[dt2, on = .(id, V1 >= masstal1, V1 <= masstal2)
        ][, .(mass = list(na.omit(mass_Da))), by = .(rn, id, masstal, confirmed_mass)
          ][, massfound := fifelse(id %in% dt1$id, lengths(mass) > 0, NA)]
#        rn       id  masstal confirmed_mass              mass massfound
#     <int>   <char>    <num>         <char>            <list>    <lgcl>
#  1:     1       p1 1689.770             no                       FALSE
#  2:     2       p2 1341.931       45489.87          45589.87      TRUE
#  3:     3       p3 1086.698       45489.19 45589.19,45610.22      TRUE
#  4:     4       p3 1304.001       45489.19 45589.19,45610.22      TRUE
#  5:     5       p3 1433.335             no                       FALSE
#  6:     6       p4 1425.521              X 45589.83,45569.70      TRUE
#  7:     7       p4 1989.758             no                       FALSE
#  8:     8       p4 1173.885             no                       FALSE
#  9:     9 MJF2_157 1384.406             no                       FALSE
# 10:    10 MJF2_157 1065.691             no                       FALSE
# 11:    11 MJF3_157 1073.079             no                          NA
# 12:    12  MJF3_30 1742.961             no                       FALSE
# 13:    13  MJF3_30 1384.798             no                       FALSE
# 14:    14  MJF3_30 1090.666       46860.72 46860.72,46839.05      TRUE
# 15:    15  MJF3_30 1447.482             no                       FALSE
# 16:    16  MJF3_61 1057.163             NA                          NA
# 17:    17  MJF3_61 1173.785             NA                          NA
# 18:    18  MJF3_61 1304.204             NA                          NA
# 19:    19  MJF3_61 1557.331             NA                          NA
# 20:    20  MJF3_61 1149.469             NA                          NA
#        rn       id  masstal confirmed_mass              mass massfound

Note the change to both na.omit and to how massfound was calculated. Here, mass is a list-column, which means you cannot (currently) just do "math" on it without some minor tweaks. Over to you how you want to resolve when multiple matches are found. (For instance, would it be better to return multiple rows for each rn-row?)