I know how to apply a function to a column in data.table, receive a single value and compare within the same data.table. But, since I am a novice, I do not know how to compare the the result from the function with another data.table. What is special in this case is that the result from the function is a vector. I am not good with words so here is an example instead.
Example data. I have values in a data.table, dt1:
## dt1
id <- c("p1","p2","p3","p3","p3",
"p4","p4","p4","p4","p4",
"p4","MJF2_157","MJF2_157","MJF3_30","MJF3_30",
"MJF3_30","MJF3_30","MJF3_30","MJF3_30")
mass_Da <- c(45659.18,45589.87,45589.19,45610.22,45802.85,
45660.37,45634.28,45611.12,45589.83,45569.70,
20838.62,20952.07,20951.89,23539.60,46860.72,
46839.05,47851.63,46909.48,47555.12)
dt1 <- data.table(id,mass_Da)
This is the function I'd like to apply on each row in column "mass" results in a vector for each mass:
ld <- function(m,n=5:60){
ldx <- (m+1.0078*n)/n
return(ldx[ldx>1000 & ldx<2000])
}
The resulting vector should be compare to values by id in another data.table, dt2. If the value in the column masstal also is found in the vector, the corresponding mass from the vector should be displayed in dt2. If no match is found a "no" should be the out put. If several, each vector should be compared to each masstal within the same id. Notice that the length is of the columns in dt1 is never the same as in dt2:
### dt2
id <- c("p1","p2","p3","p3","p3",
"p4","p4","p4","MJF2_157","MJF2_157",
"MJF3_157","MJF3_30","MJF3_30","MJF3_30","MJF3_30",
"MJF3_61","MJF3_61","MJF3_61","MJF3_61","MJF3_61")
length(id)
masstal <- c(1689.770,1341.931,1086.698,1304.001,1433.335,
1425.521,1989.758,1173.885,1384.406,1065.691,
1073.079,1742.961,1384.798,1090.666,1447.482,
1057.163,1173.785,1304.204,1557.331,1149.469)
length(masstal)
dt2 <- data.table(id,masstal)
Lastly the match should be done with a tolerance of +-0.5. If it works out the ideal dt should look like this:
dt2$confirmed_mass <- c("no",45489.87,45489.19,45489.19,"no",
"X","no","no","no","no",
"no","no","no",46860.72,"no",
"NA","NA","NA","NA","NA")
But there is still a problem if the tolerance accepts masstal from two or more mass, here labelled with "X" .
I have tried merging the dt:s but it seems like a detour... and I am still stuck and can not figure out how to compare with a vector.
dt_merge <- merge.data.table(dt1,dt2, by=.EACHI, allow.cartesian=TRUE)
Any help is much appreciated.
Up front, I'll assume some of your numbers are typos in the
confirmed_masscolumn, some do not exist in any data you provided.Try this:
That is each row of
dt1exploded into as many new values as yourldfunction produces. We then join that ontodt2and find matches with tolerance. Since I'm inferring thatdt2does not have guaranteed-unique keys (using floating-point is not safe for that), I'll add one withrn, used to re-aggregate back to the original data.Notes:
I kept your
confirmed_masscolumn (and used it as a "key" for aggregation) solely to do a side-by-side comparison with your data; I assume it is not really part of the originaldt2, so it should be removed from theby=vector.I also kept
rnjust to show what it's doing ... it can safely be removed after the join withdt1long.Per the comments, I broke the results into two columns so that we don't combine numbers and non-numbers. Feel free to re-combine them if you really need them to be. I chose a logical for the second column, since it seemed your intent was ternary: found, not found, or id-not-present.
I don't know the likelihood of this with your real data, but ... if your
ld(.)function can produce numbers such that two or more are within the0.5tolerance, then it is possible that the join would find multiples. In its current state, the code above will silently mask that, with no indication that it occurred nor which of themass_Damatches were returned. If you think this is possible, then we can changena.omit(.)[1](which will always return length-1 something) with a list.Note the change to both
na.omitand to howmassfoundwas calculated. Here,massis a list-column, which means you cannot (currently) just do "math" on it without some minor tweaks. Over to you how you want to resolve when multiple matches are found. (For instance, would it be better to return multiple rows for eachrn-row?)