I'm trying to follow this answer that join two tables with range: https://stackoverflow.com/a/46341899/6636572
I want to join two tables where one has some ranges and another are numbers and I want to annotate the numbers by matching which range those numbers are in from another dataframe.
However the length of the table surprisingly increased. I expect it to stay the same as 2832. What's happening? How to trouble shoot this? The result doesn't change if I use fuzzy_left_join
> head(gene_list_selected)
chr start_pos end_pos gene_name
1 1 55013806 55100417 ACOT11
2 1 55074849 55089200 FAM151A
3 1 55107412 55175940 MROH7
4 1 55107412 55208328 MROH7-TTC4
5 1 55181494 55208328 TTC4
6 1 55222570 55230226 PARS2
> head(df)
rsid pos
1 1:55013860:C:T 55013860
2 1:55013957:G:A 55013957
3 1:55014013:C:T 55014013
4 1:55014095:C:T 55014095
5 1:55014099:C:T 55014099
6 1:55014100:G:A 55014100
> nrow(gene_list_selected)
[1] 21
> nrow(df)
[1] 2832
> df_with_gene_name<-df %>%
+ fuzzy_inner_join(gene_list_selected, by = c("pos"="start_pos","pos"="end_pos"), match_fun = list(`>=`, `<=`))
>
> nrow(df_with_gene_name)
[1] 3298
I want the end result to keep rsid, pos, from, start while having additional columns of one-hot coded binary table of gene_name.
> result
rsid, chr, pos, from, start, ACOT11, FAM151A, MROH7, MROH7-TTC4, TTC4, PARS2
1:55013860:C:T 1 55013860 55013806 55100417 1 0 0 0 0 0
The reason df_with_gene_name has more rows than df is because the start_pos and end_pos of your gene_name values are not mutually exclusive. Therefore, a record in df has the potential to qualify for more than one gene_name. If so, one record will be returned for each match. See the example below as an illustration. You haven't indicated how you want to handle the multiple matches, but
pivot_wider()is useful in these cases and will give you single rows per rsid. Note I have modified your example data to highlight an instance where multiple matches are found.Update: new, much faster answer:
It is possible to achieve the desired result using only
dplyrandtidyr, asfuzzyjoinis very slow and can cause issues with RAM usage for lower-spec computers. This new solution is 9 times faster than usingfuzzyjoin(according tomicrobenchmak):Original answer: