Using geom_raster(), why do some tile values (colors) not appear correctly?

I'm fairly new to R and I'm having trouble creating a heatmap with the geom_raster() function. I am working on this week's tidytuesday challenge and would like to create a heatmap that shows whether hosting the race gives the host team an advantage. I use team_name and pole as the x and y values respectively, and fill by the host variable to look for trends between each team, its finishing position, and whether it hosted the race.

Below is the code I used to create the heatmap, followed by the heatmap itself. I had already tidied the data by this point, which is the reason for the funky data name.

pole_position <- c("P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10", "P11", "P12", "P13", "P14", "P15", "P16")

ggplot(data = clean_marbles_2, mapping = aes(x = team_name, y = pole, fill = host)) +
  geom_raster() +
  scale_y_discrete(limits = pole_position) +
  coord_flip() +
  labs(x = "Team name", y = "Finish placement", title = "Does hosting the race affect finish placement?")

The above code provides this graphic.

At first I thought this was a pretty cool graphic, but I soon realized that it was missing some of the 'Yes' hosts. There should be sixteen different teal boxes in this graphic but there's only 11.

I then faceted the graph to figure out if it recognizes the data that was entered. Below is the code and a photo of the produced graphic. pole_position's value does not change between the two graphs.

ggplot(data = clean_marbles_2, mapping = aes(x = team_name, y = pole, fill = host)) +
  geom_raster() +
  scale_y_discrete(limits = pole_position) +
  coord_flip() +
  labs(x = "Team name", y = "Finish placement", title = "Does hosting the race affect finish placement?") +
  facet_wrap(~host)

faceted graphic

As you can see, all sixteen of the blue tiles appear in the 'Yes' area. I am thoroughly confused as to why the previous graphic only recorded 11 of the 16 blue tiles.

My question is: Why don't all of the blue tiles appear in the first graphic?

Any help and/or constructive criticism is appreciated. Thanks!

Here is a link to the tidytuesday Github repository: here.

EDIT:

Here's what I did to tidy the data. Please don't berate me for doing anything wrong; I would love to learn any way to boost my coding efficiency.

# Read in the data from the GitHub repo
marbles <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-06-02/marbles.csv')

# Set the correct point & pole values
marbles$points[marbles$pole == 'P1'] = 25
marbles$pole[marbles$points == 25] = 'P1'
marbles$points[marbles$pole == 'P2'] = 18
marbles$pole[marbles$points == 18] = 'P2'
marbles$points[marbles$pole == 'P3'] = 15
marbles$pole[marbles$points == 15] = 'P3'
marbles$points[marbles$pole == 'P4'] = 12
marbles$pole[marbles$points == 12] = 'P4'
marbles$points[marbles$pole == 'P5'] = 10
marbles$pole[marbles$points == 10] = 'P5'
marbles$points[marbles$pole == 'P6'] = 8
marbles$pole[marbles$points == 8] = 'P6'
marbles$points[marbles$pole == 'P7'] = 6
marbles$pole[marbles$points == 6] = 'P7'
marbles$points[marbles$pole == 'P8'] = 4
marbles$pole[marbles$points == 4] = 'P8'
marbles$points[marbles$pole == 'P9'] = 2
marbles$pole[marbles$points == 2] = 'P9'
marbles$points[marbles$pole == 'P10'] = 1
marbles$pole[marbles$points == 1] = 'P10'
marbles$points[marbles$pole == 'P11'] = 0
marbles$pole[marbles$points == 0] = 'P11'

# Replace any excess and incorrect pole/point values to align with my scale.
marbles[186, 8] = 'P10'
marbles[186, 9] = 1

# Replace the pole values for the 0 point scores
# This was done for many more values than what is seen here.
marbles[252,8] = 'P12'
marbles[253,8] = 'P13'
marbles[254,8] = 'P14'
marbles[255,8] = 'P15'
marbles[256,8] = 'P16'

# Remove the notes and source sections of the tidy data
clean_marbles = subset(marbles, select = -c(notes, source))

# Create a clean subset without any NA values
clean_marbles_2 = na.omit(clean_marbles)

I am aware that this is extremely tedious. The corresponding points and pole values are visible in the code I've included above. I was trying to make the data more uniform, thinking it would be easier to visualize afterwards, but I guess not.

There are 2 best solutions below

Here's an approach with geom_tile instead of geom_raster using filter and two calls to geom_tile:

ggplot(data = clean_marbles_2 %>% filter(host == "No"), mapping = aes(x = team_name, y = pole)) +
  geom_tile(fill = "#F8766D") +
  geom_tile(data = clean_marbles_2 %>% filter(host == "Yes"), fill = "#00BFC4") +
  scale_y_discrete(limits = pole_position) +
  coord_flip() +
  labs(x = "Team name", y = "Finish placement", title = "Does hosting the race affect finish placement?")

We need to use geom_tile because geom_raster will shift around the rows.
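A separate thing worth checking, offered as an assumption rather than something established in this thread, is whether several rows share the same team_name/pole cell. If they do, only one fill color can be visible per cell, which would explain why drawing the "Yes" tiles last brings them back. The sketch below uses a hypothetical toy data frame (`toy`, not the real marbles data) to show the check:

```r
library(dplyr)

# Hypothetical toy data: team "A" lands on P1 twice,
# once as host and once as non-host.
toy <- data.frame(
  team_name = c("A", "A", "B"),
  pole      = c("P1", "P1", "P2"),
  host      = c("Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# Count rows per (team_name, pole) cell and keep cells hit more than once.
overlaps <- toy %>%
  count(team_name, pole, name = "n_rows") %>%
  filter(n_rows > 1)

overlaps
```

Running the same count on clean_marbles_2 would show whether any "Yes" rows share a cell with "No" rows.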

Here's an approach to cleaning up the data with dplyr::recode. The !!! operator expands a list into arguments to be passed to a function. This is needed because recode expects individual arguments.

We can use ifelse to replace only the NA values in pole. Since we aren't using points, I didn't bother recoding that column, but you could easily go in reverse.

library(dplyr)

# Fill in missing pole values from points; existing pole values are kept.
clean_marbles_2 <- marbles %>% 
  mutate(pole = 
           ifelse(is.na(pole),
                  recode(points,
                         !!!c(`26` = "P1", `25` = "P1", `19` = "P2",
                              `18` = "P2", `16` = "P3", `15` = "P3",
                              `13` = "P4", `12` = "P4", `11` = "P5",
                              `10` = "P5", `8` = "P6", `6` = "P7",
                              `4` = "P8", `2` = "P9", `1` = "P10",
                              `0` = "P11")),
                  pole)) %>%
  # Drop the free-text columns that are not needed for the plot.
  dplyr::select(-notes, -source)
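
To make the role of !!! concrete, here is a minimal sketch with a hypothetical three-entry lookup (`lookup` and the toy `points` vector are made up for illustration). It compares splicing a named vector against writing each argument out by hand:

```r
library(dplyr)

# Hypothetical lookup: names are point values, values are pole labels.
lookup <- c(`25` = "P1", `18` = "P2", `15` = "P3")
points <- c(25, 18, 15, 25)

# Splice the named vector into individual arguments ...
spliced <- recode(points, !!!lookup)

# ... which should be equivalent to spelling every argument out.
explicit <- recode(points, `25` = "P1", `18` = "P2", `15` = "P3")

identical(spliced, explicit)
```

Keeping the mapping in one named vector means it only has to be edited in one place if the scoring scheme changes.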

There seems to be a problem with how you have tidied the data. If we plot with the raw data in this reprex, your error doesn't appear:

library(ggplot2)

url <- paste0("https://raw.githubusercontent.com/rfordatascience/",
              "tidytuesday/master/data/2020/2020-06-02/marbles.csv")

raw_marbles  <- read.csv(url)
pole_position <- paste0("P", 1:16)

p <- ggplot(raw_marbles, aes(x = team_name, y = pole, fill = host)) +
  geom_raster() +
  scale_y_discrete(limits = pole_position) +
  coord_flip() +
  labs(x = "Team name", y = "Finish placement", 
       title = "Does hosting the race affect finish placement?")

p

It appears some of the tiles are "missing", but that's because they have no position assigned in the raw data. We can also confirm the correct number of blue squares is displayed here:

p + facet_wrap(.~host)

So I guess the question is "what have you done to the raw data?". Showing how you got to clean_marbles_2 in your question would probably allow us to solve this issue.

Incidentally, there does seem to be an effect of hosting versus not hosting. You can run a Wilcoxon rank-sum test to show it:

NoYes <- lapply(split(raw_marbles$pole, raw_marbles$host), 
                function(x) na.omit(as.numeric(substr(x, 2, 3))))

wilcox.test(NoYes[[1]], NoYes[[2]])

#>  Wilcoxon rank sum test with continuity correction
#> 
#> data:  NoYes[[1]] and NoYes[[2]]
#> W = 280, p-value = 0.04911
#> alternative hypothesis: true location shift is not equal to 0

So it seems the pole numbers were significantly higher (i.e. closer to P16) for the hosts.
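
The conversion step inside the lapply call may be easier to follow on a small example. This is a sketch on a hypothetical toy data frame (`toy`), not the real data: it strips the leading "P", converts the rest to a number, and drops missing positions within each host group.

```r
# Hypothetical toy data with one missing pole value.
toy <- data.frame(
  pole = c("P1", "P12", NA, "P7"),
  host = c("No", "Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Split pole labels by host, then turn "P12" into 12 and drop NAs.
ranks <- lapply(split(toy$pole, toy$host),
                function(x) na.omit(as.numeric(substr(x, 2, 3))))

ranks
```

Because substr(x, 2, 3) takes up to two characters after the "P", both single-digit and double-digit positions are handled.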