Point-Biserial Correlation across Multiple Variables in R

I am trying to calculate correlations across multiple variables. Most of my variables are continuous, but one of them is binary. I would like to produce a single matrix containing all of my variables, but my code is not working. Here is my code:

cor.test(df[, c('sat','pba','cte_certs', 'course_credits', "college.going")], use="complete.obs")

Here is what my data frame looks like:

[screenshot of the data frame]

How should I set up my code to calculate the point biserial correlation across these vars?

There is 1 best solution below

One way to do it:

1. Check for normality:

library(ggplot2)
library(dplyr)
library(tidyr)
library(broom)
library(see)

# Create a fake sample data frame: 

set.seed(123)
df <- data.frame(
  sat = rnorm(100, 100, 50),            
  pba = rnorm(100, 50, 75),            
  cte_certs = rnorm(100, 2, 55),       
  course_credits = rnorm(100, 20, 45),   
  college_going = rbinom(100, 1, 0.5)   
)

# Test for normality with the Shapiro-Wilk test and QQ plots
# Reshape to long format (one row per variable/value pair)
df_long <- df %>% 
  pivot_longer(-college_going, names_to = "variable", values_to = "value")

# Perform Shapiro-Wilk test
test_results <- df_long %>% 
  group_by(variable) %>% 
  summarize(
    shapiro_p = shapiro.test(value)$p.value,
    .groups = 'drop'
  ) %>%
  mutate(across(starts_with("shapiro_p"), ~paste("SW:", round(., 3))))

# Merge test results with the long data
df_long <- left_join(df_long, test_results, by = "variable")

# Create QQ plot with test results
p <- ggplot(df_long, aes(sample = value)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~variable) +
  geom_text(aes(label = shapiro_p, y = Inf, x = Inf),
            hjust = 1.1, vjust = 2, size = 6, check_overlap = TRUE) +
  theme_minimal()

# Display the plot
p

We assume all of our continuous (interval) variables are normally distributed.
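
As a side note, the significance test for a point-biserial correlation is equivalent to an independent-samples t-test, which is usually stated in terms of normality within each group of the binary variable rather than in the pooled data. A minimal base-R sketch of that per-group check, assuming the same simulated `df` as above (the name `sw_by_group` is only illustrative):

```r
# Recreate the simulated data so the block runs on its own
set.seed(123)
df <- data.frame(
  sat = rnorm(100, 100, 50),
  pba = rnorm(100, 50, 75),
  cte_certs = rnorm(100, 2, 55),
  course_credits = rnorm(100, 20, 45),
  college_going = rbinom(100, 1, 0.5)
)
continuous_vars <- setdiff(names(df), "college_going")

# Shapiro-Wilk p-value for each continuous variable (rows),
# computed separately within college_going == 0 and == 1 (columns)
sw_by_group <- t(sapply(continuous_vars, function(v) {
  tapply(df[[v]], df$college_going, function(x) shapiro.test(x)$p.value)
}))

print(round(sw_by_group, 3))
```

Small p-values in either column would be a hint to look at that variable more closely; with only about 50 observations per group, the QQ plots are probably the more informative check.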

2. Do the point biserial correlation:

# Helper: the point-biserial correlation is Pearson's r with the binary variable coded 0/1
point_biserial_cor <- function(binary_var, continuous_var) {
  cor.test(binary_var, continuous_var, method = "pearson")$estimate
}

# Identify binary and continuous variables
binary_var <- "college_going"
continuous_vars <- setdiff(names(df), binary_var)

# Pre-allocate the result matrix: continuous variables in rows, binary variable as the column
cor_matrix <- matrix(NA, nrow = length(continuous_vars), ncol = 1,
                     dimnames = list(continuous_vars, binary_var))

# Calculate correlations
for (var in continuous_vars) {
  cor_matrix[var, binary_var] <- point_biserial_cor(df[[binary_var]], df[[var]])
}

# View the correlation matrix
print(cor_matrix)

The resulting correlation matrix:

               college_going
sat              -0.13065092
pba               0.09670408
cte_certs        -0.18555204
course_credits    0.11620321
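
If you want a single matrix with all variables, as asked in the question, one option is to use `cor()` instead of `cor.test()`: `cor.test()` compares exactly two vectors, whereas `cor()` accepts a whole data frame. Because the point-biserial correlation equals Pearson's r when the binary variable is coded 0/1, the `college_going` row and column of that matrix are the point-biserial correlations. The sketch below assumes the same simulated `df` as above; the names `full_cor` and `r_pb` are only illustrative, and the last part checks one entry against the textbook point-biserial formula.

```r
# Recreate the simulated data so the block runs on its own
set.seed(123)
df <- data.frame(
  sat = rnorm(100, 100, 50),
  pba = rnorm(100, 50, 75),
  cte_certs = rnorm(100, 2, 55),
  course_credits = rnorm(100, 20, 45),
  college_going = rbinom(100, 1, 0.5)
)

# Full Pearson matrix; entries involving college_going are point-biserial
full_cor <- cor(df, use = "complete.obs", method = "pearson")
print(round(full_cor, 3))

# Sanity check for one variable with the point-biserial formula
# r_pb = (M1 - M0) / s_n * sqrt(p * q), where s_n divides by n (not n - 1)
x <- df$sat
g <- df$college_going
n <- length(x)
s_n <- sqrt(sum((x - mean(x))^2) / n)
p <- mean(g == 1)
r_pb <- (mean(x[g == 1]) - mean(x[g == 0])) / s_n * sqrt(p * (1 - p))

# TRUE if the formula and cor() agree
print(isTRUE(all.equal(r_pb, full_cor["sat", "college_going"])))
```

Note that `cor()` returns only the estimates. If you also need p-values, keep the `cor.test()` loop from above and store `$p.value` next to `$estimate`.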

3. As an add-on, visualize it:

df_long %>% 
  mutate(college_going = factor(college_going)) %>% 
  mutate(id = row_number()) %>% 
  ggplot(aes(x=id, y=value, color=college_going, label = college_going)) +
  geom_point(size = 4, alpha=0.5)+
  geom_text(color = "black", hjust=0.5, vjust=0.5)+
  scale_color_manual(values = c("steelblue", "purple"), labels = c("No", "Yes"))+
  scale_x_continuous(breaks = 1:200, labels = 1:200)+
  scale_y_continuous(breaks= scales::pretty_breaks())+
  facet_wrap(. ~ variable, 
             nrow = 2, strip.position = "bottom", scales = "free") +
  labs(y = "value", 
       color="College going vs. not")+
  theme_modern()+
  theme(
    aspect.ratio = 2,
    strip.background = element_blank(),
    strip.placement = "outside",
    legend.position = "bottom",
    axis.title.x=element_blank(),
    axis.text.x=element_blank(),
    axis.ticks.x=element_blank(),
    text=element_text(size=16)
  )
# Note: see https://stackoverflow.com/questions/76098463/assistence-in-creating-point-biserial-correlation-plot
# Whether one likes this plot is a matter of taste.