Add recalculated columns in the data.frame (attribution)

57 Views Asked by At

there! I have data.frame looks like:

        date id s int
1 2021-01-01  1 A   1
2 2021-05-03  1 B   1
3 2021-05-05  1 C   1
4 2021-05-02  2 A   1
5 2021-06-01  2 A   1
6 2021-02-01  2 B   1
7 2021-03-02  3 C   1
8 2021-06-01  3 C   1

I need to add two columns with recalculated values:

  • int1 - for each ID find last, but not "C", set it 1, other 0. IF all C set 1 for last C;
  • int2 - for each ID count t$s and set 1 devided by number of rows

So, the result is:

        date id s int int1    int2
1 2021-01-01  1 A   1    0    0.33
2 2021-05-03  1 B   1    1    0.33
3 2021-05-05  1 C   1    0    0.33
4 2021-05-02  2 A   1    0    0.33
5 2021-06-01  2 A   1    1    0.33
6 2021-02-01  2 B   1    0    0.33
7 2021-03-02  3 C   1    0    0.50
8 2021-06-01  3 C   1    1    0.50

I know how solve it with "for" cycle, but this works to slow with large data. I know about "mutate", but still don`t understand it, please help with my example.

My data:

t <- data.frame(date=c('2021-01-01','2021-05-03','2021-05-05','2021-05-02','2021-06-01','2021-02-01','2021-03-02','2021-06-01')
                ,id=c('1','1','1','2','2','2','3','3')
                ,s=c('A','B','C','A','A','B','C','C')
                ,int=c('1','1','1','1','1','1','1','1')
                )
2

There are 2 best solutions below

0
On

which database are you on? so you want the sql update command ?

1
On

I hope I understood correctly what you're trying to do:

Since there are different possible cases I think the easiest is to split the initiable table according to the different possibilites of the "s" variable.

  • 1: Only non C
  • 2: Only C;
  • 3: A mix of non C and C;

Then create the two columns you want using mutate, and put them back together.

library(tidyverse)
t1 <- t %>%
  group_by(id)%>%
  mutate(id_row = row_number())%>%
  filter(!("C" %in% s)) %>%
  mutate(id1 = ifelse(id_row==max(id_row),1,0),
         id2 = round(1/n(),2))

t2 <- t %>%
  group_by(id)%>%
  mutate(id_row = row_number())%>%
  filter(length(unique(s))==1 & s=="C")%>%
  mutate(id1 = ifelse(id_row==max(id_row),1,0),
         id2 = round(1/n(),2))

t3 <-  t %>%
  group_by(id)%>%
  filter(length(unique(s))>1 & "C" %in% s)%>%
  mutate(id_row = ifelse(s!="C",row_number(),NA),
         id1 = case_when(
           id_row==max(id_row,na.rm = TRUE)~1,
           TRUE~0),
         id2 = round(1/n(),2))


t <- bind_rows(t1,t2,t3) %>%
  select(-id_row) %>%
  arrange(id,s)
t

# A tibble: 8 x 6
# Groups:   id [3]
  date       id    s     int     id1   id2
  <chr>      <chr> <chr> <chr> <dbl> <dbl>
1 2021-01-01 1     A     1         0  0.33
2 2021-05-03 1     B     1         1  0.33
3 2021-05-05 1     C     1         0  0.33
4 2021-05-02 2     A     1         0  0.33
5 2021-06-01 2     A     1         0  0.33
6 2021-02-01 2     B     1         1  0.33
7 2021-03-02 3     C     1         0  0.5 
8 2021-06-01 3     C     1         1  0.5