Let's say I have a data frame that looks like this:
P Q1 Q2 ...
1 1 4 1
2 2 3 4
3 1 1 4
where P identifies the person and the columns Q1, Q2, ... hold that person's answers to each question. The questions are answered on a 4-point Likert scale (e.g. "approve" means 1, "slightly approve" means 2, and so on). How do I plot, e.g., the results of both questions in a stacked bar plot (in %)?
It should look somewhat like this.
All I find online is very complex code I can't handle or fail to understand ... Isn't there just a simple function that does what I want?
Thank you!
I am sure I am not the only one who would take issue with this part of your question:
"Very complex code" is quite subjective. However, I can understand that learning code and trying to figure out how to do what it is you want to do (which may seem simple at first) can be daunting and frustrating. I'll try to show you how to approach this in a very logical and clear manner, so that you can understand that the code shown here is actually not too complex.
The Dataset
OP did not provide a dataset, but I'll demonstrate a random one here. This is also a good opportunity to showcase how you can generate this type of data via code (and have it scalable). Let's assume we have 20 people answering 20 questions. I'll create the data in a data frame structure by providing first only one column of people, then adding 20 columns of questions to that. Each cell for the answers to the questions will randomly select an answer from 1 to 5.
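A minimal sketch of how such a dataset could be generated. The seed, the name `df` for the wide data frame, and the column names `Person` and `Q1` to `Q20` are assumptions made for illustration:

```r
# Assumed seed so the random answers are reproducible.
set.seed(1234)

n_people    <- 20
n_questions <- 20

# Start with a single column identifying each person.
df <- data.frame(Person = seq_len(n_people))

# Add one column per question; each cell is a random answer from 1 to 5.
for (i in seq_len(n_questions)) {
  df[[paste0("Q", i)]] <- sample(1:5, size = n_people, replace = TRUE)
}

dim(df)
```

Changing `n_people` or `n_questions` scales the dataset without touching the rest of the code.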
That gives us a data frame of 20 rows and 21 columns (1 column for Person + 20 columns for questions).
Prepare the Data
When preparing to generate a plot, you will almost always have to prep the data in some way. There are only two things I want to do here before we plot. The first step is to put our data into a format referred to as Tidy Data. The format we have now is okay for plotting in Excel, but if we want a quality way of organizing and summarizing this data, we want a "longer" table format: one row per person and question, with one column for the person, one for the question, and one for the answer.
You can do that a few ways. Here I'm using the `dplyr` and `tidyr` packages and the `gather()` function, but other ways exist (namely using `pivot_longer()`).

The final thing I want to do here is to convert our column `questions$Answer` into a categorical variable, not a continuous number. Why? Well, the participants could only answer 1, 2, 3, 4, or 5. An answer of "3.4" would not make sense, so our data should be discrete, not continuous. We will do that by converting `questions$Answer` into a factor. This also allows us to do two things at the same time that are quite useful here:

`levels` - this indicates which order you want the levels of the factor.
`labels` - this allows you to remap `1` to be `"Approve"` and `2` to be `"Slightly Approve"` and so on.

You can then check the data afterwards and see that the `questions$Answer` column is now composed of our `labels` values, not numbers.

Make the Plot
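Before plotting, here is a minimal sketch of the two preparation steps described above (reshaping with `gather()` and converting `Answer` to a factor). The seed, the wide frame name `df`, the key column name `Question`, and the last three labels are assumptions; only "Approve" and "Slightly Approve" come from the text:

```r
library(tidyr)

# Assumed example data: 20 people, 20 questions, answers 1 to 5.
set.seed(1234)
df <- data.frame(Person = 1:20)
for (i in 1:20) df[[paste0("Q", i)]] <- sample(1:5, 20, replace = TRUE)

# Wide -> long: one row per person/question pair.
# factor_key = TRUE keeps the questions in column order (Q1, Q2, ..., Q20)
# instead of alphabetical order (Q1, Q10, Q11, ...).
questions <- gather(df, key = "Question", value = "Answer", -Person,
                    factor_key = TRUE)

# Convert the numeric answers into an ordered set of labelled categories.
questions$Answer <- factor(
  questions$Answer,
  levels = 1:5,
  labels = c("Approve", "Slightly Approve", "Neutral",
             "Slightly Disapprove", "Disapprove")  # last three are assumed
)

str(questions)
```

The same reshaping could be written with `pivot_longer(df, -Person, names_to = "Question", values_to = "Answer")`.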
We can then make the plot using the `ggplot2` package. ggplot2 draws your data onto the plot area using geoms. In this case, we can use `geom_bar()`, which will draw a barplot (totaling up the number/count of each item) and requires an `x` aesthetic only. If we set the `fill` color of each bar to be equal to the `Answer` column, then it will color-code the bars according to the number of each answer for each question. By default, the bars are stacked on top of one another in the order that we set previously for the `levels` argument of the `questions$Answer` column.

There are a lot of things that are right with this plot and the general layout looks good. All that's left is to change the appearance in a few ways. We can do that by extending our plot code to change those aspects of the plot. Namely, I want to do the following:
The full plot code is shown below. You should be able to identify which parts of the code are doing each thing referenced above.
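A minimal sketch of such plot code, assuming the data preparation described earlier. The `geom_bar(aes(fill = Answer))` layer is the one from the text; the palette, axis titles, and theme are illustrative choices and not necessarily the changes listed above:

```r
library(tidyr)
library(ggplot2)

# Assumed example data, prepared as described earlier.
set.seed(1234)
df <- data.frame(Person = 1:20)
for (i in 1:20) df[[paste0("Q", i)]] <- sample(1:5, 20, replace = TRUE)
questions <- gather(df, key = "Question", value = "Answer", -Person,
                    factor_key = TRUE)
questions$Answer <- factor(
  questions$Answer, levels = 1:5,
  labels = c("Approve", "Slightly Approve", "Neutral",
             "Slightly Disapprove", "Disapprove")  # last three are assumed
)

p <- ggplot(questions, aes(x = Question)) +
  # Stacked counts, one fill color per answer level.
  geom_bar(aes(fill = Answer)) +
  # Illustrative appearance changes (assumptions, not the original list):
  scale_fill_brewer(palette = "RdYlGn", direction = -1) +
  labs(x = "Question", y = "Number of responses", fill = "Answer") +
  theme_classic() +
  theme(legend.position = "top")

print(p)
```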
Pretty cool, eh?
As for "is there a simple function that does what I want?": the answer is "no". You can write one, but that might depend on how your data is initially formatted. If you're going to need to make these plots often, set up an R script to do that automatically for you :).
EDIT: Percentages maybe???
OP asked in a comment about displaying the same info as percentages. This is also fairly straightforward to do and often what one wants to do with a likert plot... so let's do it! We'll convert the counts into percentages in two stages. First, we'll get the axis and the bars set up to do that. Second, we'll overlay text on top of each bar to display the % answering that way for each question.
First, let's set the bars and y axis to be percentages, not counts. Our line to draw the bar geom was `geom_bar(aes(fill=Answer))`. There's a hidden default value, `position = "stack"`, inside that function as well (which we don't have to specify). The `position` argument deals with how `ggplot` should handle the situation when more than one bar needs to be drawn at that particular x value. In this case, it determines what to do with the 5 bars that correspond to each value of `questions$Answer` for each question.

"Stack", as you might assume, just stacks them on top of each other. Since we have 20 people answering each question, all of our bars are the same total height (20) for every question. What if you had only 19 people answering question #3? Well, that total bar height would be shorter than the rest.
Normally, likert plots all show the bars the same height, because they are stacked according to the proportion of the whole they occupy for the total. In this case, we want each stack of bars to total up to 1. That means that 10 people answering one way should be mapped to a bar height of 0.5 (50%).
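To make the arithmetic concrete, here is a small sketch with hypothetical counts for a single question answered by 20 people (the counts and the last three labels are made up for illustration):

```r
# Hypothetical answer counts for one question (20 respondents in total).
counts <- c("Approve" = 10, "Slightly Approve" = 4, "Neutral" = 3,
            "Slightly Disapprove" = 2, "Disapprove" = 1)

# Each segment's height is its share of the total for that question.
proportions <- counts / sum(counts)

proportions[["Approve"]]  # 0.5, i.e. 50%
sum(proportions)          # the stacked segments total 1
```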
This is where the other `position` values come into play. We want to use `position = "fill"` to indicate that the bars drawn at the same x axis position should be stacked, not according to their value, but according to the proportion of the total value for that x axis position.

Finally, we want to fix our scale. If we just use `position="fill"`, our y axis scale would have values of "0, 0.25, 0.50, 0.75, and 1.0" or something like that. We want that to look like "0%, 25%, 50%, 75%, 100%". You can do that within the `scale_y_continuous()` function by specifying the `labels` argument. In this case, the `scales` package has a convenient `percent_format()` function for just this purpose. Putting this together gives a plot in which every bar has the same height and the y axis reads in percent.

Getting text on top
To put the text on top as percentages, that's unfortunately not quite as simple. For this, we need to summarize the data, and in this case the simplest way to do that is to summarize beforehand in a separate dataset, then use that to label the bars using a text geom mapped to our summary data frame.
The summary data frame is created by specifying how we want to group our data together, then assigning `n()`, or the count of each answer, as the `freq` column value.

We then use that to map to a new geom: `geom_text`. The `y` value needs to be represented as a proportion again. Just like for `geom_bar`, and for the reasons above, we have to use the `"fill"` position. I also want to make sure the position is set to the "middle" vertically for each bar, so we have to specify a bit further by using `position_fill(vjust=0.5)` instead of just `"fill"`.

You'll notice a final critical piece is that we're using a `group` aesthetic. This is very important. For the text geom, `ggplot` needs to know how the data is to be grouped. In the case of the bar geom, it was "obvious" (so to speak) that since the bars are colored differently, each color of bar was the separation. For text, this always needs to be specified (how to split the values), and we do this through the `group` aesthetic.

Voila!
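The percentage steps above can be sketched end to end roughly as follows. The example data, the name `df_summary`, the helper column `pct`, and the last three answer labels are assumptions; `position = "fill"`, `percent_format()`, `n()`, `freq`, `position_fill(vjust = 0.5)`, and the `group` aesthetic are the pieces named in the text:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Assumed example data, prepared as described earlier.
set.seed(1234)
df <- data.frame(Person = 1:20)
for (i in 1:20) df[[paste0("Q", i)]] <- sample(1:5, 20, replace = TRUE)
questions <- gather(df, key = "Question", value = "Answer", -Person,
                    factor_key = TRUE)
questions$Answer <- factor(
  questions$Answer, levels = 1:5,
  labels = c("Approve", "Slightly Approve", "Neutral",
             "Slightly Disapprove", "Disapprove")  # last three are assumed
)

# Summary data frame: count of each answer per question (freq),
# plus that count's share of the question total (pct, used for the label).
df_summary <- questions %>%
  group_by(Question, Answer) %>%
  summarize(freq = n(), .groups = "drop") %>%
  group_by(Question) %>%
  mutate(pct = freq / sum(freq)) %>%
  ungroup()

p <- ggplot(questions, aes(x = Question)) +
  # "fill" rescales every stack to a total height of 1.
  geom_bar(aes(fill = Answer), position = "fill") +
  # Text labels come from the summary frame; group tells ggplot how to
  # split the values, and vjust = 0.5 centers each label in its segment.
  geom_text(
    data = df_summary,
    aes(y = freq, label = scales::percent(pct, accuracy = 1), group = Answer),
    position = position_fill(vjust = 0.5),
    size = 3
  ) +
  # Show the y axis as 0%, 25%, ... instead of 0, 0.25, ...
  scale_y_continuous(labels = scales::percent_format())

print(p)
```

With 20 questions on the x axis the labels get crowded, so a smaller `size` or a wider plot may be needed.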