How to do ranges with string data type in R data.table?

151 Views Asked by At

Below is a sample dataset (column A and column B), and I would like to create a new flag column C to group row 3 to row 8 into a range "91345-912350" (In the real dataset, I have multiple ranges to group to). I cannot convert column A into integer, because it would cause the codes with leading 0's disappear (e.g. 0450 would become 450). Is there any way to identify a range with string data type?

enter image description here

Please feel free to use the following code to generate the sample dataset:

DT = data.table(
  Code = c("0450", "912345", "912346","912347","912348","912349","912350","7860","X7960"),
  Description  = c("Desc A", "Desc B", "Desc C","Desc D","Desc E","Desc F","Desc G","Desc H","Desc I")
)    

Additional Information:

(1) The codes do not need to be consecutive as the ranges are predefined. As long as a code is in between a certain range, it would go into that range group, regardless if there are any missing code(s). (e.g. If 912347 is missing , 912345, 912346, 91238, 912349, 912350 would still go to the range 912345-912350)

(2) The predefined ranges are all number ranges without letters, if the code contains letters, the Code Group column would be the same as the original Code column just like what's shown in cell C10 and cell A10.

(3) Not all codes need to be in a range, I have a list of predefined ranges I need to group the codes to. In this sample dataset, my predefined range is 912345-912350. There could be codes like 7860, 7861, 7862, but I do not need to group them into 7860-7862 as they are not in my predefined ranges. If a code does not have a range group, the Code Group would return the same value as the Code column, just like what's shown in cell C9 and cell A9.

3

There are 3 best solutions below

3
James_D On

The requirements here aren't entirely clear to me, but you can create a function to map the code to a range. E.g.

mapRange <- function(code, min=912345, max=912350) { 
  return (sapply(code, function(c) {
    ic <- as.integer(c)
    if (is.na(ic)) return(c)
    if (ic >= min & ic <= max) return(paste(min, max, sep="-"))
    return(c)
  }))
}

And then

DT$`Code Range` <- mapRange(DT$Code)

or with tidyverse

DT <- DT %>% mutate(`Code Range` = mapRange(Code))
0
TarJae On

I can only provide a dplyr solution: I have used row 2 to 7. We could adpat as needed.

library(dplyr)

DT %>% 
  filter(row_number() >= 2 & row_number() <= 7) %>%
  type.convert(as.is = TRUE) %>% 
  mutate(Cod_Group = paste(Code, last(Code)-1 +row_number(), sep = " - ")) %>% 
  mutate(across(everything(), ~as.character(.))) %>% 
  bind_rows(DT) %>% 
  distinct(Code, .keep_all = TRUE) %>% 
  mutate(Cod_Group = coalesce(Code, Cod_Group)) %>% 
  arrange(Description)
     Code Description Cod_Group
1:   0450      Desc A      0450
2: 912345      Desc B    912345
3: 912346      Desc C    912346
4: 912347      Desc D    912347
5: 912348      Desc E    912348
6: 912349      Desc F    912349
7: 912350      Desc G    912350
8:   7860      Desc H      7860
9:  X7960      Desc I     X7960
0
Merijn van Tilborg On

I made a bit more extensive solution, covering a few more scenarios. See my extended example with some edge cases to test as well. I do not allow gaps in the ranges though, but it supports ID's that have a prefix (including the padding zero's). To support that we use some helper columns and consider all ending digits the number (excluding the padding zeros) and all before that I consider a prefix (including the padding zeros). This way we prevent different ID prefix falling in the same groups. With some tricks we assign the group id and then take the first value and the last value of each group and paste them together (if more than 1 id in the group).

solution

library(data.table)
library(stringr)

DT[, num := as.numeric(gsub("(.*)[^\\d](\\d*$)", "\\2", Code, perl = T))]
DT[, prefix := str_remove(Code, as.character(num))][prefix == "", prefix := NA_character_]
DT[, s := abs(num - shift(num, 1, "lag")), prefix][is.na(s) | s > 1, grp := .I][, s := NULL]
setnafill(DT, "locf", cols = "grp")
DT[, `Code Group` := fifelse(.SD[1] != .SD[.N], paste(.SD[1], .SD[.N], sep = " - "), unlist(.SD[1])), 
   by = grp, .SDcols = "Code"
]

results with helper columns

DT

      Code Description    num prefix grp      Code Group
 1:      1      Desc A      1   <NA>   1           1 - 3
 2:      2      Desc B      2   <NA>   1           1 - 3
 3:      3      Desc C      3   <NA>   1           1 - 3
 4:    448      Desc D    448   <NA>   2             448
 5:   0449      Desc E    449      0   3     0449 - 0450
 6:   0450      Desc F    450      0   3     0449 - 0450
 7: 912345      Desc G 912345   <NA>   4 912345 - 912350
 8: 912346      Desc H 912346   <NA>   4 912345 - 912350
 9: 912347      Desc I 912347   <NA>   4 912345 - 912350
10: 912348      Desc J 912348   <NA>   4 912345 - 912350
11: 912349      Desc K 912349   <NA>   4 912345 - 912350
12: 912350      Desc L 912350   <NA>   4 912345 - 912350
13:   7860      Desc M   7860   <NA>   5            7860
14:  X7960      Desc N   7960      X   6   X7960 - X7961
15:  X7961      Desc O   7961      X   6   X7960 - X7961
16:  Y7962      Desc P   7962      Y   7           Y7962
17:  Z7963      Desc Q   7963      Z   8           Z7963

results cleaned up

DT[, .SD, .SDcols = c(1, 2, 6)]

      Code Description      Code Group
 1:      1      Desc A           1 - 3
 2:      2      Desc B           1 - 3
 3:      3      Desc C           1 - 3
 4:    448      Desc D             448
 5:   0449      Desc E     0449 - 0450
 6:   0450      Desc F     0449 - 0450
 7: 912345      Desc G 912345 - 912350
 8: 912346      Desc H 912345 - 912350
 9: 912347      Desc I 912345 - 912350
10: 912348      Desc J 912345 - 912350
11: 912349      Desc K 912345 - 912350
12: 912350      Desc L 912345 - 912350
13:   7860      Desc M            7860
14:  X7960      Desc N   X7960 - X7961
15:  X7961      Desc O   X7960 - X7961
16:  Y7962      Desc P           Y7962
17:  Z7963      Desc Q           Z7963

sample data

codes <- c(1,2,3, "448", "0449", "0450", "912345", "912346", "912347", "912348", "912349", "912350", "7860", "X7960", "X7961", "Y7962", "Z7963")
DT = data.table(Code = codes, Description  = paste("Desc", LETTERS[1:length(codes)]))