I have a dataset that contains marks for 3 tests. The first test was taken before the experiment; the second and third were taken after it. I want to show, graphically, that students' marks have improved since the experiment. I chose a boxplot for this, and I plan to use it to show the maximum and minimum values in each test and how they improved after the experiment. Is that a good approach?
Analyzing a data set that contains 30 observations
Asked by Dinuka
There are 2 best solutions below

Your data is longitudinal. Therefore, it is better to show the individual changes over time.
Multiple boxplots ignore the individual changes over time and treat each time point as a separate and unconnected group. Longitudinal line plots can show more information in the data.
Consider the following simulated data.
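# Simulate marks for 30 students: x1 is the test before the experiment,
# x2 and x3 are the two tests after it
set.seed(1)
x1 <- rnorm(30, mean = 50, sd = 20)
x2 <- x1 + rnorm(30, mean = 5, sd = 10)
x3 <- x2 + rnorm(30, mean = 5, sd = 5)
data <- data.frame(x1, x2, x3)

library(tidyverse)

# Longitudinal line plot: one line per student, thick coloured line for the mean
data %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_prefix = "x", names_to = "time") %>%
  ggplot(aes(y = value, x = time, group = id)) +
  geom_point() +
  geom_line() +
  stat_summary(aes(group = 1), fun = mean, geom = "line", lwd = 2, col = 2)

# Boxplots of the same data: one box per test, with no link between
# a student's marks across tests
data %>%
  pivot_longer(everything(), names_prefix = "x", names_to = "time") %>%
  ggplot(aes(y = value, x = time)) +
  geom_boxplot()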
In the line plot, those who scored poorly in the first test continued to do poorly in the second and third tests, something the boxplots cannot show because they do not link a student's marks across tests.
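The same point can be checked numerically. The following is only a minimal sketch on the simulated data above, not part of the original analysis: the column names `d12` and `d13` are made up for illustration, and the Spearman rank correlation is just one possible way to measure how well students keep their relative position between the first and last test.

```r
# Re-create the simulated marks so the snippet runs on its own
set.seed(1)
x1 <- rnorm(30, mean = 50, sd = 20)
x2 <- x1 + rnorm(30, mean = 5, sd = 10)
x3 <- x2 + rnorm(30, mean = 5, sd = 5)
data <- data.frame(x1, x2, x3)

# Per-student change relative to the pre-experiment test
# (d12 and d13 are hypothetical names chosen for this sketch)
change <- data.frame(
  d12 = data$x2 - data$x1,
  d13 = data$x3 - data$x1
)

# Distribution of the individual changes
print(summary(change))

# Share of students whose mark went up in each later test
share_improved <- colMeans(change > 0)
print(share_improved)

# Rank correlation between first and last test: a value near 1 means
# students largely keep their relative position
rank_cor <- cor(data$x1, data$x3, method = "spearman")
print(rank_cor)
```

Working with paired differences keeps each student's marks linked, which is exactly the information that separate boxplots discard.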
You can use a boxplot to see whether the students as a group have improved. But imagine that the good students improve a lot, the moderate students get worse, and the bad students improve. The boxplot will show that the students in general improved, but you will miss the fact that the moderate students actually got worse. To see this, you can use a parallel coordinate plot. There is an implementation in the GGally package. With 30 observations, the plot is still fairly easy to read.
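A minimal sketch of such a plot, assuming the GGally package is installed. The data are simulated the same way as in the other answer, and the `level` column (low / moderate / high, cut at the terciles of the first test) is a hypothetical grouping added only to colour the lines; it is not something the question's dataset is known to contain.

```r
library(GGally)

# Simulated marks for 30 students (assumption: stands in for the real data)
set.seed(1)
x1 <- rnorm(30, mean = 50, sd = 20)
x2 <- x1 + rnorm(30, mean = 5, sd = 10)
x3 <- x2 + rnorm(30, mean = 5, sd = 5)
marks <- data.frame(x1, x2, x3)

# Hypothetical grouping: split students into terciles of the first test
marks$level <- cut(
  marks$x1,
  breaks = quantile(marks$x1, probs = c(0, 1/3, 2/3, 1)),
  labels = c("low", "moderate", "high"),
  include.lowest = TRUE
)

# Parallel coordinate plot: one line per student across the three tests.
# scale = "globalminmax" keeps the raw marks on a common axis instead of
# standardising each test separately.
p <- ggparcoord(
  marks,
  columns = 1:3,
  groupColumn = 4,
  scale = "globalminmax",
  showPoints = TRUE,
  alphaLines = 0.6
)
print(p)
```

Colouring by starting level makes it easier to spot a subgroup, such as the moderate students, moving against the overall trend.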