Is it possible to filter data based on its plot/predicted curve?

125 Views Asked by At

I had a question regarding excluding/filtering data points. I currently have coded a logistic regression that generates a decision boundary that is wrapped up into a function in which I am able to run over subsets of my data frame.

I was wondering, if I were to plot all of the predicted curves based on these outcomes, if it is possible to filter these decision boundaries even further based on their generated plot/curve. Or if it is possible to set requirements in order for a curve to “qualify” and track the corresponding data in the data frame...

## glm that generates a midpoint/decision boundary, wrapped into a function

get_midpoint = function(data){
      glm.1 = glm(coderesponse~stimulus, family = binomial(link="logit"), data=data, na.action = na.exclude)
      rtn = -glm.1$coefficients[1]/glm.1$coefficients[2]
rtn
}

## a mini dummy dataframe 

subject <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)
stimulus = c(1, 5, 50, 35, 23, 2, 4, 22, 15, 6, 20, 40, 45, 10, 37, 43, 48, 7, 19, 21, 29, 49, 26, 11, 36, 30, 39, 41, 16, 37, 1, 5, 50, 35, 23, 2, 4, 22, 15, 6, 20, 40, 45, 10, 37, 43, 48, 7, 19, 21, 29, 49, 26, 11, 36, 30, 39, 41, 16, 37)
stim <- c('bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm', 'bd', 'nd', 'nm')
block <- c('mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose', 'mouth', 'mouth', 'mouth', 'nose', 'nose', 'nose')
coderesponse <- c(1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0)

df = data.frame(subject, stimulus, stim, block, coderesponse)

## running the function over defined subgroups of ~80 rows each [for the real data]
## but for the dummy dataframe, only ~5 rows

df = df %>% 
  nest(data=-c(subject, stim, block)) %>%
  mutate(midpoint=map_dbl(data, get_midpoint)) %>%
  unnest()

## basic code that plots and creates a curve based on a single glm result
## QUESTION: want to be able to run this over the same subgroups as above to create curves for every midpoint generated and then possibly filter based on the curve?
plot(df$stimulus,df$coderesponse,xlab="stimulus",ylab="Probability of d responses")
curve(predict(glm.1,data.frame(stimulus=x),type="response"),add=TRUE)

I’m quite new and confused with this part of R, so thanks for any help or insight!

1

There are 1 best solutions below

0
On

I think what you're trying to do is the following:

library(ggplot2)
library(dplyr)

df %>%
  ggplot() +
  aes(x = stimulus, y = coderesponse, colour = subject %>% as.factor()) +
  geom_point() +
  geom_smooth(method = 'glm', method.args = list(family = binomial(link='logit')), se = F) +
  scale_colour_discrete(name = "Subject") +
  theme(legend.position = "bottom")

enter image description here

This takes your original df and simply plots the data, colored by subject, then runs the glm model over both subject groups in your data. You can run each glm outside of the geom_smooth() statement if you need to use them to predict. There could be a way to use the ggplot-produced models without burning additional computations on remodeling.