I'm fairly new to R and I'm having trouble creating a heatmap with the geom_raster() function. I'm working on this week's tidytuesday challenge and would like to create a heatmap that shows whether hosting the race gives the host team an advantage. I use team_name and pole as the x and y values respectively, then fill by the host variable to look for trends between each team, its finishing position, and whether it hosted the race.
Below is the code I used to create the heatmap, followed by the heatmap itself. I had already tidied the data by this point, which is the reason for the funky data name.
pole_position <- c("P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10", "P11", "P12", "P13", "P14", "P15", "P16")
ggplot(data = clean_marbles_2, mapping = aes(x = team_name, y = pole, fill = host)) +
  geom_raster() +
  scale_y_discrete(limits = pole_position) +
  coord_flip() +
  labs(x = "Team name", y = "Finish placement", title = "Does hosting the race affect finish placement?")
At first I thought this was a pretty cool graphic, but I soon realized that it was missing some of the 'Yes' hosts. There should be sixteen different teal boxes in this graphic, but there are only eleven.
I then faceted the graph to check whether it recognizes all of the data that was entered. Below is the code and the resulting graphic. The value of pole_position does not change between the two graphs.
ggplot(data = clean_marbles_2, mapping = aes(x = team_name, y = pole, fill = host)) +
  geom_raster() +
  scale_y_discrete(limits = pole_position) +
  coord_flip() +
  labs(x = "Team name", y = "Finish placement", title = "Does hosting the race affect finish placement?") +
  facet_wrap(~host)
As you can see, all sixteen of the blue tiles appear in the 'Yes' panel. I am thoroughly confused as to why the previous graphic only shows 11 of the 16 blue tiles.
My question is: Why don't all of the blue tiles appear in the first graphic?
Any help and/or constructive criticism is appreciated. Thanks!
The data come from the tidytuesday GitHub repository (rfordatascience/tidytuesday).
EDIT:
Here's what I did to tidy the data. Please don't berate me for doing anything wrong; I would love to learn any way to boost my coding efficiency.
# Read in the data from the github repo
marbles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-02/marbles.csv')
# Set the correct point & pole values
marbles$points[marbles$pole == 'P1'] = 25
marbles$pole[marbles$points == 25] = 'P1'
marbles$points[marbles$pole == 'P2'] = 18
marbles$pole[marbles$points == 18] = 'P2'
marbles$points[marbles$pole == 'P3'] = 15
marbles$pole[marbles$points == 15] = 'P3'
marbles$points[marbles$pole == 'P4'] = 12
marbles$pole[marbles$points == 12] = 'P4'
marbles$points[marbles$pole == 'P5'] = 10
marbles$pole[marbles$points == 10] = 'P5'
marbles$points[marbles$pole == 'P6'] = 8
marbles$pole[marbles$points == 8] = 'P6'
marbles$points[marbles$pole == 'P7'] = 6
marbles$pole[marbles$points == 6] = 'P7'
marbles$points[marbles$pole == 'P8'] = 4
marbles$pole[marbles$points == 4] = 'P8'
marbles$points[marbles$pole == 'P9'] = 2
marbles$pole[marbles$points == 2] = 'P9'
marbles$points[marbles$pole == 'P10'] = 1
marbles$pole[marbles$points == 1] = 'P10'
marbles$points[marbles$pole == 'P11'] = 0
marbles$pole[marbles$points == 0] = 'P11'
# replace any excess and incorrect pole/point values to align with my scale.
marbles[186, 8] = 'P10'
marbles[186, 9] = 1
# Replace the pole values for the 0 point scores
# This was done for many more values than what is seen here.
marbles[252,8] = 'P12'
marbles[253,8] = 'P13'
marbles[254,8] = 'P14'
marbles[255,8] = 'P15'
marbles[256,8] = 'P16'
# Remove the notes and source sections of the tidy data
clean_marbles = subset(marbles, select = -c(notes, source))
# Create a clean subset without any NA values
clean_marbles_2 = na.omit(clean_marbles)
I am aware that this is extremely tedious. You can see the corresponding points and pole values in the code I've included above. I was trying to make the data more uniform, thinking that it would be easier to visualize afterwards, but I guess not.
Here's an approach with geom_tile instead of geom_raster, using filter and two calls to geom_tile:
We need to use geom_tile because geom_raster will shift around the rows.
Here's an approach to cleaning up the data with dplyr::recode. The !!! operator expands a list into arguments to be passed to a function. This is needed because recode expects individual arguments.
We can use ifelse to only replace the NA in pole. Since we aren't using score, I didn't bother recoding that one, but you could easily go in reverse.
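The filter-plus-two-geom_tile idea can be sketched roughly as follows. This is not the original answer's code, only a minimal illustration. It assumes the missing 'Yes' tiles are hidden behind 'No' tiles that share the same team and position, so it draws the 'No' rows first and the 'Yes' rows last. The toy_marbles data frame is a made-up stand-in for clean_marbles_2.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for clean_marbles_2: one row per team per race.
# "Team A" finishes P1 twice, once as host and once not, so both rows
# map to the same cell of the heatmap.
toy_marbles <- data.frame(
  team_name = c("Team A", "Team A", "Team B", "Team B"),
  pole      = c("P1", "P1", "P2", "P1"),
  host      = c("Yes", "No", "No", "No"),
  stringsAsFactors = FALSE
)

pole_position <- paste0("P", 1:3)

# Count rows per cell; any n > 1 means several tiles are stacked in one cell.
overlaps <- toy_marbles %>%
  count(team_name, pole) %>%
  filter(n > 1)

# Each layer gets its own filtered data. Layers are drawn in order,
# so the "Yes" tiles end up on top of any "No" tile in the same cell.
p <- ggplot(mapping = aes(x = team_name, y = pole, fill = host)) +
  geom_tile(data = filter(toy_marbles, host == "No")) +
  geom_tile(data = filter(toy_marbles, host == "Yes")) +
  scale_y_discrete(limits = pole_position) +
  coord_flip() +
  labs(x = "Team name", y = "Finish placement")
```

Printing p draws the plot. Running the same count() on the real data would show whether overlapping cells are actually what hides the five missing tiles.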
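The recode step could look something like the sketch below. Again, this is an illustration under assumptions rather than the original answer's code. The lookup values are taken from the assignments in the question, and toy_marbles is a made-up stand-in for the marbles data. Note that recode is marked as superseded in recent dplyr releases, though it is still available.

```r
library(dplyr)

# Hypothetical rows standing in for marbles: pole is NA wherever only
# points were recorded.
toy_marbles <- data.frame(
  points = c(25, NA, 18, 0),
  pole   = c("P1", "P2", NA, NA),
  stringsAsFactors = FALSE
)

# Points-to-pole lookup from the question's assignments.
# Names are the old values (points), elements are the new values (pole).
# 0 maps to P11 as in the question; P12 to P16 also score 0 there and
# would still need separate handling.
points_to_pole <- c(
  "25" = "P1", "18" = "P2", "15" = "P3", "12" = "P4", "10" = "P5",
  "8" = "P6", "6" = "P7", "4" = "P8", "2" = "P9", "1" = "P10", "0" = "P11"
)

# !!! splices the named vector into individual old = new arguments.
# ifelse keeps existing pole values and only fills in the NA ones.
clean <- toy_marbles %>%
  mutate(
    pole = ifelse(is.na(pole),
                  recode(as.character(points), !!!points_to_pole),
                  pole)
  )
```

Going in reverse (filling points from pole) would use the same pattern with the names and values of the lookup swapped.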