How convert contingency tables (counts) to individuals for GLM in R

192 Views Asked by At

I have my info as in this photograph:

enter image description here

You can download it here: https://drive.google.com/file/d/1pgO51NXtjpVSz-VxQEDNFFuQXVc4jVkt/view?usp=sharing

What i want is to transform this data to individuals,

For example

enter image description here

Will transform into this

enter image description here

Another example

enter image description here

will turn into this

enter image description here

So, if we say that n="sum of all numbers in the original data.frame", i.e., the number of all individuals, the final output will be a data.frame with 6 columns and n rows.

I want to do this in R but i don't have any idea how. Once I have this, what i want to do is apply a generalized linear model with family binomial and link = probit.

Now, this page can explain some of what I tried to do:

https://www.datanalytics.com/libro_r/la-funcion-melt-y-datos-en-formato-largo.html

2

There are 2 best solutions below

0
On BEST ANSWER

Okay... I have an answer, but... i was wondering if there exists any generalization. Here it goes:

library(readxl)
library(dplyr)

# Información original ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
byssinosis <- read_xls(path = "byssinosis.xls",range = "B4:K27",col_names = F)
names(byssinosis) <- c("Employment","Smoking","Sex","Race",
                       "W1y","W1n","W2y","W2n","W3y","W3n")
# View(byssinosis)

# Procesando la información a individuos ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Primero pasamos las columnas a una sola.
datos <- reshape2::melt(byssinosis)
# Separamos estas columnas en las dos características deseadas.
datos <- datos %>%
  mutate(Workplace = ifelse(variable %in% c("W1y", "W1n"),1,
                            ifelse(variable %in% c("W2y", "W2n"),2,3)),
         Byssinosis = ifelse(variable %in% c("W1y", "W2y", "W3y"),"yes","no"))
# Repetimos con base en value.
individuos=rep(seq_len(nrow(datos)),datos$value)
datos <- datos[individuos,]
# Nos quedamos solo las columnas deseadas
datos <- datos %>% select(-c(variable,value))
# View(datos)

# Comprobación ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tabla <-
  table(datos) %>%
  as.data.frame() %>%  
  arrange(Employment, desc(Smoking), desc(Sex), desc(Race), Workplace, desc(Byssinosis))
# View(tabla)
1
On

Try this

library(readxl)
library(dplyr)
library(tidyr)

df <- read_xls("byssinosis.xls", range = cell_rows(c(4L, NA_integer_)), col_names = FALSE)
raw_nms <- read_xls("byssinosis.xls", range = cell_rows(c(1L, 3L)), col_names = FALSE)

names(df) <- with(
  fill(as.data.frame(t(raw_nms)[, -2L]), V1, V2), # replace any missing value in V1 and V2 (i.e. row 1 and 3 in your excel) with the last observation carrired forward
  trimws(paste(V1, if_else(is.na(V2), "", V2))) # collapse these names into a single vector
)

df %>% 
  pivot_longer(contains(" "), names_to = c("Workplace", "byssinosis"), names_pattern = "(\\d+) (.+)") %>% 
  slice(inverse.rle(list(lengths = value, values = seq_along(value)))) %>% 
  select(-value)

Output

# A tibble: 5,419 x 6
   Employment Smoking Sex   Race  Workplace byssinosis
   <chr>      <chr>   <chr> <chr> <chr>     <chr>     
 1 <10        yes     M     W     1         yes       
 2 <10        yes     M     W     1         yes       
 3 <10        yes     M     W     1         yes       
 4 <10        yes     M     W     1         no        
 5 <10        yes     M     W     1         no        
 6 <10        yes     M     W     1         no        
 7 <10        yes     M     W     1         no        
 8 <10        yes     M     W     1         no        
 9 <10        yes     M     W     1         no        
10 <10        yes     M     W     1         no        
# ... with 5,409 more rows