Reproducing a cool chart - Better understanding reference classes in multinomial logistic regression


I am trying to reproduce this excellent data visualisation from the Economist: https://www.economist.com/interactive/uk-general-election/build-a-voter

They have used multinomial logistic regression and created a "feature contribution" waterfall-style chart to build up the voting intentions of defined demographic groups. I have a similar set of data (unrelated to politics!) and am trying to perform a similar analysis. The example below is a made-up, reproducible example in a similar vein to the Economist version. (Big thanks to ChatGPT for writing the example quickly!)

Can you help me interpret the coefficients from the model output so that I can calculate the "waterfall values" of each level of the predictor variables?

(NB. My understanding is that this is done in comparison with a reference class, which is what the model has done here. What I like about the Economist version is that it presents as agnostic to a reference class, at least in the visualisation. I expect there is some dummy-variable shenanigans going on, but I can't figure it out, and would love someone smarter to help!)

###### Load packages ###### 
require(tidyverse)
require(nnet)
require(broom) # tidy() below comes from broom, which library(tidyverse) does not attach

###### Create example data set ###### 

set.seed(123) # Set seed for reproducibility

# Function to generate voting intention with different probabilities for each party
generate_voting_intention <- function(age_bracket, gender) {
  # Define base probabilities for Labour by age bracket
  base_prob_labour <- c("18-29"=0.45, "30-39"=0.40, "40-49"=0.38, "50-64"=0.36, "65+"=0.35)
  # Adjust Labour probability by gender
  gender_mod_labour <- ifelse(gender == "Female", 0.05, 0)
  prob_labour <- base_prob_labour[age_bracket] + gender_mod_labour
  
  # Conservative base probabilities (smaller than Labour's to ensure Labour majority)
  base_prob_conservative <- c("18-29"=0.25, "30-39"=0.30, "40-49"=0.32, "50-64"=0.34, "65+"=0.35)
  # Adjust Conservative probability by gender (no gender modification in this example)
  prob_conservative <- base_prob_conservative[age_bracket]
  
  # The remaining probability is for Lib Dems
  prob_lib_dems <- 1 - prob_labour - prob_conservative
  
  # Sample voting intention based on calculated probabilities
  sample(c("Labour", "Conservative", "Lib Dems"), size = 1, prob = c(prob_labour, prob_conservative, prob_lib_dems))
}

# Create a data frame with the specified variables
create_survey_data <- function(n) {
  age_brackets <- c("18-29", "30-39", "40-49", "50-64", "65+")
  genders <- c("Male", "Female")
  
  # Sample age and gender for each respondent
  ages <- sample(age_brackets, n, replace = TRUE)
  genders <- sample(genders, n, replace = TRUE)
  
  # Generate voting intention for each respondent
  voting_intention <- mapply(generate_voting_intention, age_bracket = ages, gender = genders)
  
  # Combine into a data frame
  data.frame(
    Age = ages,
    Gender = genders,
    VotingIntention = voting_intention,
    stringsAsFactors = FALSE
  )
}

# Generate survey data for 1000 respondents
survey_data <- create_survey_data(1000)

###### Build Voting Intention model ###### 

vote_mod <- multinom(VotingIntention ~ Age + Gender, data = survey_data)

tidy_vote_mod <- tidy(vote_mod, exponentiate = TRUE)

1 answer

Answer by wzbillings:

I suspect this will be closed as it is mainly a programming question, but here is a quick answer.

The plot on the left side of that article shows the actual model predictions, not just the coefficients. Getting those takes two steps. First, you need to create a grid of data on which to evaluate your predictions.

You can do that with expand.grid() (or the slightly more versatile tidyr::expand_grid() if you use the tidyverse). It "crosses" all of the inputs to create a data frame with every combination. For example, expand.grid(a = 1:2, b = 1:2) will give you a data frame with four rows covering all possible combinations of the values of a and b. So for your problem, you would want something like this:

prediction_grid <- expand.grid(
    Age = unique(survey_data$Age),
    Gender = unique(survey_data$Gender)
    # region = ..., and any other predictors in your model
)

and so on for any other predictors in your model. Then you pass this as the newdata argument to predict() like so: predict(vote_mod, newdata = prediction_grid, type = "probs"). The argument type = "probs" gives you outcomes that are already probabilities, so you don't have to back-transform them. This will give you a numeric matrix with one row for each row of prediction_grid (this would be equal to the number of points in the triangle diagram on the right side of that article) and one column for each level of the outcome, where the (i, j) value is the probability that a person with the characteristics of the ith row of prediction_grid chooses outcome j.
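
For this toy data set (only Age and Gender), a minimal sketch of that step might look like the following. It assumes the prediction_grid built above, and the names pred_probs, predictions_long, Party, and Probability are just illustrative choices.

# Predicted probabilities: one row per Age x Gender combination,
# one column per party
pred_probs <- predict(vote_mod, newdata = prediction_grid, type = "probs")

# Bind the probabilities back onto the grid and reshape to long format,
# which is usually the easiest shape for plotting
predictions_long <- prediction_grid %>%
  bind_cols(as.data.frame(pred_probs)) %>%
  pivot_longer(
    cols = -c(Age, Gender),
    names_to = "Party",
    values_to = "Probability"
  )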

You could also use broom::augment() to get the predictions in a tidy format (rather than tidy(), which gives you the coefficients in a tidy format).
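
As for the "waterfall values" themselves: this is not necessarily how the Economist built their chart, but one reference-class-agnostic way to express a contribution for each level is to work from the predictions rather than the coefficients. Assuming the predictions_long data frame from the sketch above, you could average the predicted probabilities over the other predictors and compare each level against the overall average; the names overall, age_contrib, BaseProb, LevelProb, and Contribution are illustrative.

# Overall average predicted probability for each party
overall <- predictions_long %>%
  group_by(Party) %>%
  summarise(BaseProb = mean(Probability), .groups = "drop")

# Average predicted probability for each age bracket, and its difference
# from the overall average -- a rough "contribution" of that age level
age_contrib <- predictions_long %>%
  group_by(Party, Age) %>%
  summarise(LevelProb = mean(Probability), .groups = "drop") %>%
  left_join(overall, by = "Party") %>%
  mutate(Contribution = LevelProb - BaseProb)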