How can I automate computing correlations per country within a tibble and store the results effectively?

I am somewhat of a beginner in R and am working on a relatively large dataset (for me at least) of around 500,000 rows.

I am trying to find the correlation between variables for various countries (specifically, measuring the effects of bullying) in the PISA dataset (an education-based survey).

I am able to compute the correlation matrix for individual countries on a case-by-case basis.

I want to record the correlation between two variables (not necessarily the entire matrix) across all these countries, automating this and storing the results in a tibble so that I don't have to do it manually.

correl_countries = tibble()

for (each in list_countries){
  countries_bullying %>% #tibble subset of the original data 
    filter(CNTRYID == each)%>%
    select(reading_score, bullied_index)%>%
    correl = cor(use = "pairwise.complete.obs") #something to store the correlation values
    correl_countries %>% add_row(x = each, y = correl) #wanted to add these results to a tibble
}

Currently nothing seems to happen and I receive this error.

Error in is.data.frame(x) : argument "x" is missing, with no default

It may have something to do with the fact that "pairwise.complete.obs" generates a correlation matrix and not a single vector.

Grateful for your recommendations!

There are 2 best solutions below

BEST ANSWER

You don't really need the loop here; the tidyverse has you covered. The following returns a tibble with two columns, CNTRYID and correl:

library(tidyverse)

# get only the correlations
countries_bullying %>%
  group_by(CNTRYID) %>%
  summarise(correl = cor(reading_score, bullied_index, use = "pairwise.complete.obs"))
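
To try the grouped approach without the full PISA data, here is a minimal sketch on made-up values. The tibble `toy` and its numbers are invented for illustration; only the column names come from the question. It loads just dplyr, which should be enough here because dplyr re-exports `tibble()` and `%>%`.

```r
library(dplyr)

# Made-up data: two countries, one missing reading score
toy <- tibble(
  CNTRYID       = c(1, 1, 1, 1, 2, 2, 2, 2),
  reading_score = c(500, 480, 450, NA, 520, 510, 470, 440),
  bullied_index = c(0.1, 0.5, 0.9, 1.2, 0.2, 0.3, 0.8, 1.1)
)

# One row per country, one correlation per row;
# incomplete pairs are dropped inside cor()
correl_countries <- toy %>%
  group_by(CNTRYID) %>%
  summarise(correl = cor(reading_score, bullied_index,
                         use = "pairwise.complete.obs"))

print(correl_countries)
```

Assigning the result to `correl_countries` keeps it available for later joins or plots, which is what the loop in the question was trying to achieve with `add_row()`.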

New user here, and somehow I can't post comments. If I understood correctly, you want to compute the correlation between two variables, per country, and store it in a separate tibble. Replace "df" with the name of your dataset, "countries" with the variable in your dataset containing the countries, and "var1" and "var2" with the two variables of interest. For large datasets, a more elegant solution is likely available (e.g. subsetting fewer variables in each iteration).

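library(tibble)

vec <- unique(df$countries)                # one entry per country
correl_countries <- numeric(length(vec))   # preallocate the result vector
for (i in seq_along(vec)) {
    new <- df[df$countries == vec[i], ]    # rows for the current country
    # use = "pairwise.complete.obs" drops incomplete pairs instead of returning NA
    correl_countries[i] <- cor(new$var1, new$var2, use = "pairwise.complete.obs")
}
tibble(country = vec, correl = correl_countries)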
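
One possible way to avoid re-subsetting the whole data frame on every iteration is to split the two needed columns by country once and then apply `cor()` to each piece. This is only a sketch in base R on made-up data. The names `df`, `countries`, `var1` and `var2` follow the placeholders above, and the values are invented.

```r
# Made-up data; replace with your own data frame and column names
df <- data.frame(
  countries = c("A", "A", "A", "B", "B", "B"),
  var1      = c(1, 2, 3, 1, 2, 3),
  var2      = c(2, 4, 7, 9, 5, 1)
)

# Split only the two columns of interest by country, once
groups <- split(df[c("var1", "var2")], df$countries)

# One correlation per country; vapply checks that each result is a single number
correl <- vapply(
  groups,
  function(g) cor(g$var1, g$var2, use = "pairwise.complete.obs"),
  numeric(1)
)

# Collect country names and correlations in one table
result <- data.frame(country = names(correl), correl = unname(correl))
print(result)
```

The result is a plain data frame; wrapping it in `tibble::as_tibble()` should give a tibble if that is preferred.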