Removing duplicates inside the nested lists of a tibble column in R


I have a tibble, in which 1 character column contains a string I want to parse. I want to store results of the parsing in a new list column, with no duplicates in each row.

The tibble is created by the following code:

my_tibble <- input_data_tibble |>
  group_by(tissue) |>
  summarize(id = str_flatten(id, ","))

The output I get looks like this - notice id type is chr:

my_tibble_bad <- tibble(
  tissue = c("Duodenum", "Ileum"),
  id = c("1, 1, 5, 5", "17, 17, 10, 10, 20, 20")
)
my_tibble_bad

The output I want looks like this (notice id is a list column, each list contains numbers, and there are no duplicates):

my_tibble_good <- tibble(
  tissue = c("Duodenum", "Ileum"),
  id = list(c(1, 5), c(17, 10, 20))
)
my_tibble_good

Does anyone know how I can get the result I want, either by editing the original code or by editing the output of the original code?

I've tried a few options, and the best I can arrive at looks like this:

test_string = "1, 1, 5, 5"
unique(as.numeric(gsub("\\D", "", unlist(strsplit(test_string, ",")))))

However, when I try to build this into the pipeline, I get as far as:

my_tibble_bad |>
  mutate(x = strsplit(id, ",")) |>
  select(!id)

Once I add unlist, I get the error "x must be size 2 or 1, not 10.":

my_tibble_bad |> mutate(x = unlist(strsplit(id, ","))) |> select(!id)


There are 2 best solutions below

bioinfguru On

Thank you @MrFlick.

So simple, I don't know how I didn't see it:

my_tibble <- input_data_tibble |>
  group_by(tissue) |>
  # keep the unique ids of each tissue as a list instead of flattening them to a string
  summarize(id = list(unique(id)))

Solves the problem by not creating the problem.

Mark On

Assuming your data looks something like this:

# A tibble: 5 × 2
  tissue      id
  <chr>    <dbl>
1 Duodenum     1
2 Duodenum     5
3 Ileum       17
4 Ileum       10
5 Ileum       20

Then the way to get my_tibble_good is to use list():

summarize(input_data_tibble, id = list(unique(id)), .by = tissue)

Output:

# A tibble: 2 × 2
  tissue   id       
  <chr>    <list>   
1 Duodenum <dbl [2]>
2 Ileum    <dbl [3]>
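
For a self-contained check, here is a minimal sketch of the same call. The contents of input_data_tibble are an assumption (made-up rows with repeated ids, since the question does not show the real input), and the .by argument needs dplyr 1.1.0 or later.

```r
library(dplyr)
library(tibble)

# Hypothetical input: one row per observation, with repeated ids per tissue
input_data_tibble <- tibble(
  tissue = c(rep("Duodenum", 4), rep("Ileum", 6)),
  id = c(1, 1, 5, 5, 17, 17, 10, 10, 20, 20)
)

# list() wraps each group's vector so it fits in a single cell;
# unique() drops repeats and keeps first-appearance order
result <- summarize(input_data_tibble, id = list(unique(id)), .by = tissue)

result$id[[1]]  # ids for Duodenum
result$id[[2]]  # ids for Ileum
```

On older dplyr versions, group_by(tissue) followed by summarize(id = list(unique(id))) should give the same list column.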

If the '1 character column' you mention is the id column, you can convert it to integers with as.integer() before summarizing.
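
If only the already-flattened string is available, the parsing can be done row by row. The unlist() attempt in the question fails because it merges both rows into a single vector of 10 values, while mutate() needs one element per row; lapply() keeps one vector per row instead. A minimal sketch, assuming the sample strings below (illustrative data modeled on the question):

```r
library(dplyr)
library(tibble)

# Illustrative data: ids already flattened into one comma-separated string per row
my_tibble_bad <- tibble(
  tissue = c("Duodenum", "Ileum"),
  id = c("1, 1, 5, 5", "17, 17, 10, 10, 20, 20")
)

my_tibble_parsed <- my_tibble_bad |>
  mutate(
    # strsplit() returns one character vector per row; lapply() then
    # trims, converts, and de-duplicates each vector separately
    id = lapply(
      strsplit(id, ","),
      function(parts) unique(as.numeric(trimws(parts)))
    )
  )

my_tibble_parsed$id[[1]]  # parsed ids for Duodenum
my_tibble_parsed$id[[2]]  # parsed ids for Ileum
```

Swapping as.numeric() for as.integer() gives integer vectors instead of doubles, if that is the type you need.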