I have a great looking geom_tile plot, but I need a way to highlight specific rows or label specific rows based on a binary value.
Here is a small subset of data in wide format and resulting output:
df <- structure(list(bin_level = c(0,1), sequence = c("L19088.1", "chr1_43580199_43586187"), X236 = c("G", "."), X237 = c("G", "."), X238 = c("A", "a"),
X239 = c("T", "C"), X240 = c("A", "c"), X241 = c("G", "G"
)), class = "data.frame", row.names = 1:2)
> df
  bin_level               sequence X236 X237 X238 X239 X240 X241
1         0               L19088.1    G    G    A    T    A    G
2         1 chr1_43580199_43586187    .    .    a    C    c    G
The actual dataset is much larger, with 1045 observations of 3096 variables.
My goal is to plot this massive dataset as a heatmap with colors for each different nucleotide and be able to differentiate between rows with bin_levels of 0 and 1.
The following code makes a great plot, but doesn't include the bin_level differences I need to see. I would like to highlight the entire row if the bin_level is 1, but I haven't been able to find anything on how to do such a thing. I am already using nucleotides for the aes fill variable, so I need something else. The best option I've come up with so far is to color the row labels. I used info from this post to try an ifelse statement to color based on the bin_level variable.
The biggest problems here are
- Row axis titles are much too long and too many to look good
- There are only 53 bin_level rows with a 1 (of 1045 total), so why does it look like a LOT more red than there should be?
- I want the red labels (bin_level = 1) at the top of the plot, and the mix of black and red makes me think my arrange(bin_level) step isn't working right.
Please let me know if you know of a better way to accomplish what I'm trying to accomplish, or can help make my code work better than it is currently. Thank you!
library(tidyr)    ## pivot_longer()
library(dplyr)    ## %>% and arrange()
library(ggplot2)

df %>%
  ## reshape to long table
  ## (one column each for sequence, position and nucleotide):
  pivot_longer(-c("sequence", "bin_level"), ## stack all columns *except* sequence and bin_level
               names_to = 'position',
               values_to = 'nucleotide'
  ) %>%
  arrange(bin_level) %>%
  ## create the plot:
  ggplot() +
  geom_tile(aes(x = position, y = sequence, fill = nucleotide),
            height = 1 ## adjust to visually separate sequences
  ) +
  scale_fill_manual(values = c('a'='#ea0064', 'c'='#008a3f', 'g'='#116eff',
                               't'='#cf00dc', '\U00B7'='#000000', 'X' ='#ffffff'
                    )
  ) +
  labs(x = 'x-axis-title', y = 'Sequence') +
  ## remove x-axis (=position) elements: they'll probably be too dense:
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y = element_text(colour = ifelse(levels(df$bin_level)==1, "red", "black"))
  )
While passing a vector of colors to element_text() is a quick option in some cases, IMHO it is error-prone in more general cases and requires you to keep an eye on how your data is ordered. Instead, I would suggest having a look at the ggtext package, which introduces the theme element element_markdown() and allows you to style text using some HTML, CSS and Markdown.
Moreover, besides the issue already pointed out by @I_O, another issue is that you mix the data manipulation steps with the plotting code in one pipeline. As a consequence, while you arrange your data by bin_level, the color assignment uses the original, unmanipulated, unarranged dataset df, which by the way is still in wide format. That's why I would personally always recommend splitting the data wrangling from the plotting, except for very simple cases.
Finally, while you arranged your data by bin_level, what really matters is the order of sequence, i.e. you have to set the order of sequence after arranging, for which I use forcats::fct_inorder().
Note: To make your example more realistic I duplicated your dataset to add two more rows.
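To illustrate the three points above, here is a minimal sketch that uses only the two sample rows from the question. The names df_long, label and p are introduced here purely for illustration, and the red span is just one possible way of styling the labels; treat it as an outline under those assumptions rather than a finished solution.

```r
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(ggtext)

# Two sample rows from the question
df <- data.frame(
  bin_level = c(0, 1),
  sequence = c("L19088.1", "chr1_43580199_43586187"),
  X236 = c("G", "."), X237 = c("G", "."), X238 = c("A", "a"),
  X239 = c("T", "C"), X240 = c("A", "c"), X241 = c("G", "G")
)

# Step 1: data wrangling, kept separate from the plotting code
df_long <- df %>%
  pivot_longer(-c(sequence, bin_level),
               names_to = "position", values_to = "nucleotide") %>%
  # bin_level 0 first, so bin_level 1 rows get the last factor levels
  # and end up at the top of the discrete y axis
  arrange(bin_level) %>%
  mutate(
    # the colour travels with the label itself (hypothetical styling choice)
    label = if_else(bin_level == 1,
                    paste0("<span style='color:red'>", sequence, "</span>"),
                    sequence),
    # freeze the axis order to the arranged row order
    label = fct_inorder(label)
  )

# Step 2: plotting; element_markdown() renders the HTML in the labels
p <- ggplot(df_long) +
  geom_tile(aes(x = position, y = label, fill = nucleotide)) +
  labs(y = "Sequence") +
  theme(axis.text.y = element_markdown())
```

Because the color is part of each label, it can no longer get out of sync with the row order, which is the weak point of passing a separate color vector to element_text().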
DATA