R - adding a new column based on binary data across many columns


I cannot get my data frame to accept an additional column. I have reviewed many Stack Overflow questions; here is a subset (Adding a new column in a matrix in R; adding new column to data frame in R; new column not added to dataframe in R; R: complete a dataset with a new column added; R: add a new column to dataframes from a function).

I need a single column that tells us, for each row, whether there is a positive ("1") in any of the viral PCR columns I have.

I am trying to determine probability and from what I see, I will need this column to do further calculations, so please help if able!

Sample data

Filovirus (MOD) PCR        Phlebo (Sanchez-Seco) PCR
0                          0
0                          1
0                          0
0                          0
0                          0
0                          0
0                          0
0                          0
0                          0
0                          0


species code  forestsite
<fctr>  <dbl> <fctr>
SM      1     UMNP-mangabey
SM      1     UMNP-mangabey
RC      9     UMNP-hondohondoc
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod
BWC     9     UMNP-hondohondod

The closest I have come is getting base R to report which rows contain a positive value.

I followed the solution here but have yet to get it to work for me.

tmp=which(data==1,arr.ind=T)    
tmp=tmp[order(tmp[,"row"]),]
c("positive","negative")[tmp[,"col"]] -> data$new

Any advice is greatly appreciated.

Dput

structure(list(`Filovirus (MOD) PCR` = c("0", "0", "0", "0", 
"0", "0", "0", "0", "0", "0"), `Filovirus (A) PCR` = c("0", "0", 
"0", "0", "0", "0", "0", "0", "0", "0"), `Filovirus (B) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Filo C PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Filovirus (D) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Coronavirus   (Quan) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Coronavirus (Watanabe) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Paramyxo  (Tong)  PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Flavivirus Moureau PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Flavivirus  Sanchez-seco PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Arena Lozano 1 PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Retrovirus Courgnard PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Simian Foamy Goldberg (Pol) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Simian Foamy Goldberg (LTR Region) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Influenza (Anthony) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Influenza (Liang) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Rhabdo (CII) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Enterovirus CII I PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Enterovirus CII-II PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Alphav   (Sanchez-Seco) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Lyssavirus (Vasquez-Moron) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Seadornavirus (CII) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Hantavirus (Raboni) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Hantavirus (Klempa) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Nipah (Wacharapleusadee) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Henipa (Feldman) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Bunya S (Briese) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Bunya L (Briese) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), `Phlebo (Sanchez-Seco) PCR` = c("0", 
"0", "0", "0", "0", "0", "0", "0", "0", "0"), species = structure(c(3L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("SM", "SY", "BWC", 
"YB", "RC"), class = "factor"), code = c(2, 5, 5, 5, 5, 5, 5, 
5, 5, 5), forestsite = structure(c(3L, 14L, 14L, 14L, 14L, 14L, 
14L, 14L, 14L, 14L), .Label = c("Magombera1", "Magombera2", "NDUFR", 
"Ndundulu1", "Ndundulu2", "Ndundulu3", "Nyumbanitu", "UMNP-campsite3", 
"UMNP-hondohondoa", "UMNP-hondohondob", "UMNP-hondohondoc", "UMNP-hondohondod", 
"UMNP-hondohondoe", "UMNP-HQ", "MamaGoti", "UMNP-mangabey", "UMNP-njokamoni", 
"UMNP-Sanje1", "UMNP-Sanje2", "UMNP-Sanje3", "Sonjo", "SonjoRoad"
), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))

Answer by TarJae (best answer)

Update: Your 0 and 1 values are stored as character. Converting them to numbers with type.convert(as.is = TRUE) makes the code work:

library(dplyr)

df %>%
  type.convert(as.is=TRUE) %>% 
  mutate(new_column = if_else(rowSums(select(., contains("PCR"))) > 0, "positive", "negative"))
   Filovirus (…¹ Filov…² Filov…³ Filo …⁴ Filov…⁵ Coron…⁶ Coron…⁷ Param…⁸ Flavi…⁹ Flavi…˟ Arena…˟ Retro…˟ Simia…˟ Simia…˟ Influ…˟
           <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>   <int>
 1             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 2             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 3             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 4             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 5             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 6             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 7             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 8             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
 9             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
10             0       0       0       0       0       0       0       0       0       0       0       0       0       0       0
# … with 18 more variables: `Influenza (Liang) PCR` <int>, `Rhabdo (CII) PCR` <int>, `Enterovirus CII I PCR` <int>,
#   `Enterovirus CII-II PCR` <int>, `Alphav   (Sanchez-Seco) PCR` <int>, `Lyssavirus (Vasquez-Moron) PCR` <int>,
#   `Seadornavirus (CII) PCR` <int>, `Hantavirus (Raboni) PCR` <int>, `Hantavirus (Klempa) PCR` <int>,
#   `Nipah (Wacharapleusadee) PCR` <int>, `Henipa (Feldman) PCR` <int>, `Bunya S (Briese) PCR` <int>,
#   `Bunya L (Briese) PCR` <int>, `Phlebo (Sanchez-Seco) PCR` <int>, species <chr>, code <int>, forestsite <chr>,
#   new_column <chr>, and abbreviated variable names ¹​`Filovirus (MOD) PCR`, ²​`Filovirus (A) PCR`, ³​`Filovirus (B) PCR`,
#   ⁴​`Filo C PCR`, ⁵​`Filovirus (D) PCR`, ⁶​`Coronavirus   (Quan) PCR`, ⁷​`Coronavirus (Watanabe) PCR`, …
# ℹ Use `colnames()` to see all variable names
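
To see why the conversion matters, here is a minimal base-R sketch on a made-up data frame (the name toy and its values are illustrative only, not the asker's data). It assumes the results are stored as the strings "0" and "1", as in the question's dput. rowSums() refuses character input, and works once type.convert() has turned the strings into integers.

```r
# Toy data: PCR results stored as character strings (illustrative values).
toy <- data.frame(
  `Virus A PCR` = c("0", "1", "0"),
  `Virus B PCR` = c("0", "0", "0"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# rowSums() needs numeric input, so this call raises an error,
# which tryCatch() captures instead of stopping the script.
res <- tryCatch(rowSums(toy), error = function(e) e)
failed_on_character <- inherits(res, "error")

# type.convert() re-guesses each column's type; "0"/"1" become integers.
toy_num <- type.convert(toy, as.is = TRUE)

# The right-hand side is evaluated before new_column exists,
# so only the two PCR columns are summed.
toy_num$new_column <- ifelse(rowSums(toy_num) > 0, "positive", "negative")

print(failed_on_character)
print(toy_num)
```

The same idea carries over to the dplyr pipeline above: the conversion step has to come before any arithmetic on the PCR columns.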

First answer: The dplyr equivalent would be the following (data taken from @langtang, many thanks):

library(dplyr)

df %>%
  mutate(new_column = if_else(rowSums(select(., contains("PCR"))) > 0, "positive", "negative"))

   species code      forest_site Filovirus (MOD) PCR Phlebo (Sanchez-Seco) PCR new_column
1       SM    1    UMNP-mangabey                   0                         0   negative
2       SM    1    UMNP-mangabey                   0                         1   positive
3       RC    9 UMNP-hondohondoc                   0                         0   negative
4      BWC    9 UMNP-hondohondod                   0                         0   negative
5      BWC    9 UMNP-hondohondod                   0                         0   negative
6      BWC    9 UMNP-hondohondod                   0                         0   negative
7      BWC    9 UMNP-hondohondod                   0                         0   negative
8      BWC    9 UMNP-hondohondod                   0                         0   negative
9      BWC    9 UMNP-hondohondod                   0                         0   negative
10     BWC    9 UMNP-hondohondod                   0                         0   negative
Answer by langtang

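Update, given the character columns and the new 32-column example:

# Columns 30:32 (species, code, forestsite) are not PCR results, so drop them before summing
df["new"] = apply(df[, -(30:32)], 1, \(x) ifelse(sum(as.numeric(x)) > 0, "positive", "negative"))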
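
Positional indices depend on the exact column order, so they break silently if columns are added or moved. A hedged alternative is to pick the assay columns by name, assuming every one of them contains the text "PCR" (which holds for the dput in the question). The data frame toy below is made up for illustration.

```r
# Toy data with character results and one non-assay column (illustrative).
toy <- data.frame(
  `Filovirus (MOD) PCR` = c("0", "0", "0"),
  `Phlebo (Sanchez-Seco) PCR` = c("0", "1", "0"),
  species = c("SM", "SM", "RC"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Logical vector marking the assay columns by name rather than by position.
is_pcr <- grepl("PCR", names(toy), fixed = TRUE)

# Convert the character results to numbers, one column at a time.
pcr_num <- as.data.frame(lapply(toy[is_pcr], as.numeric))

# A row is positive if any of its assay columns is greater than zero.
toy$new <- ifelse(rowSums(pcr_num) > 0, "positive", "negative")

print(toy)
```

If some assay columns do not follow the naming pattern, the name match would miss them, so it is worth checking names(toy)[is_pcr] before relying on the result.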

Original answer (assuming numeric columns):

You can simply do this:

df["new"] = ifelse(rowSums(df[, -(1:3)]) > 0, "positive", "negative")

Output:

   species code      forest_site Filovirus (MOD) PCR Phlebo (Sanchez-Seco) PCR      new
1       SM    1    UMNP-mangabey                   0                         0 negative
2       SM    1    UMNP-mangabey                   0                         1 positive
3       RC    9 UMNP-hondohondoc                   0                         0 negative
4      BWC    9 UMNP-hondohondod                   0                         0 negative
5      BWC    9 UMNP-hondohondod                   0                         0 negative
6      BWC    9 UMNP-hondohondod                   0                         0 negative
7      BWC    9 UMNP-hondohondod                   0                         0 negative
8      BWC    9 UMNP-hondohondod                   0                         0 negative
9      BWC    9 UMNP-hondohondod                   0                         0 negative
10     BWC    9 UMNP-hondohondod                   0                         0 negative

Input:

structure(list(species = c("SM", "SM", "RC", "BWC", "BWC", "BWC", 
"BWC", "BWC", "BWC", "BWC"), code = c(1L, 1L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L), forest_site = c("UMNP-mangabey", "UMNP-mangabey", 
"UMNP-hondohondoc", "UMNP-hondohondod", "UMNP-hondohondod", "UMNP-hondohondod", 
"UMNP-hondohondod", "UMNP-hondohondod", "UMNP-hondohondod", "UMNP-hondohondod"
), `Filovirus (MOD) PCR` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `Phlebo (Sanchez-Seco) PCR` = c(0, 
1, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 
-10L))
Answer by akrun

Another option is if_any():

library(dplyr)
df1 %>%
 type.convert(as.is = TRUE) %>%
 mutate(new_column = c("negative", "positive")[if_any(contains("PCR")) + 1])

Output:

   species code      forest_site Filovirus (MOD) PCR Phlebo (Sanchez-Seco) PCR new_column
1       SM    1    UMNP-mangabey                   0                         0   negative
2       SM    1    UMNP-mangabey                   0                         1   positive
3       RC    9 UMNP-hondohondoc                   0                         0   negative
4      BWC    9 UMNP-hondohondod                   0                         0   negative
5      BWC    9 UMNP-hondohondod                   0                         0   negative
6      BWC    9 UMNP-hondohondod                   0                         0   negative
7      BWC    9 UMNP-hondohondod                   0                         0   negative
8      BWC    9 UMNP-hondohondod                   0                         0   negative
9      BWC    9 UMNP-hondohondod                   0                         0   negative
10     BWC    9 UMNP-hondohondod                   0                         0   negative
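
The indexing idiom used in this answer, c("negative", "positive")[flag + 1], relies on R coercing FALSE to 0 and TRUE to 1 in arithmetic. A small base-R sketch with a made-up logical vector (any_positive is a hypothetical name standing in for the per-row result) shows the idiom in isolation:

```r
# Hypothetical per-row flags: TRUE where at least one assay was positive.
any_positive <- c(FALSE, TRUE, FALSE)

# FALSE + 1 is 1 and TRUE + 1 is 2, giving a valid index into the labels.
idx <- any_positive + 1

# Position 1 holds the FALSE label, position 2 the TRUE label.
status <- c("negative", "positive")[idx]

print(status)
```

One consequence of this design is that a missing flag (NA) yields an NA label rather than "negative", which may or may not be what you want for samples with missing results.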