My goal is to create a Sankey diagram in which the nodes of each column line up parallel to one another, instead of the default layout where nodes are not aligned with those in the next column(s). I posted asking for help but have not received an answer: geom_sankey in R: spacing and aligning nodes.
Here is the output of my attempt using `geom_sankey` and the issues with it:
The post "Sankey diagram in R: How to change the height (Y) of individual sections related to each node?" has me convinced that I am going about this the wrong way and that I should try the `ggforce` package.
The crux of the issue: I cannot figure out how to format my data so that it satisfies the `split` aesthetic in `ggplot()` and the `fill` aesthetic in `geom_parallel_sets()`. Here is a made-up example, but my data is of a similar 'flavor.'
Example
```r
library(ggforce)

# Making df
Years <- data.frame(Earlier = c(rep(2012, 2), paste(2013), paste(2014), rep(2015, 2), rep(2018, 2), rep(2022, 2), rep(NA, 31)),
                    Latest = c(rep(2023, 4), rep(2022, 6), rep(2021, 10), rep(2020, 3), rep(2019, 6), rep(2018, 3), rep(2017, 3), rep(2013, 4), rep(NA, 2)),
                    Current = c(rep(2023, 10), rep(2022, 12), rep(2021, 11), rep(2020, 1), rep(NA, 7)))

# Shuffling the rows (the result has to be assigned back to take effect)
set.seed(123)
Years <- Years[sample(1:nrow(Years)), ]

# Changing all columns of the data.frame to numeric
ix <- 1:3
Years[ix] <- lapply(Years[ix], as.numeric)

# Putting it in ggforce format
Years2 <- gather_set_data(Years, 1:3)
```
This gives the following output (1st 10 rows)
According to posts about making Sankey diagrams with `ggforce` (like the one I linked above), I need to supply the `split` and `fill` aesthetics, but as you can see, splitting by column `x` will not give me the desired output. Additionally, I would like to `fill` by year, with each year having its own unique color, and I would also like the column names to appear on the graph, like in the image above.
Here is the code I am using and I am putting ??? where I am stuck.
```r
library(ggplot2); library(ggforce)
ggplot(Years2,
       aes(x = x, id = id, split = ???, value = ???)) +
  geom_parallel_sets(aes(fill = ???), alpha = 0.3,
                     axis.width = aw, sep = sp) +
  geom_parallel_sets_axes(axis.width = 0.1, sep = 0.1) +
  geom_parallel_sets_labels(colour = "white",
                            angle = 0, size = 3,
                            axis.width = aw, sep = sp) +
  theme_minimal()
```
I have tried many, many things. Some notable efforts include: adding another column called 'split' to the `Years2` df and pasting in a 1, 2, or 3 where the 'Earlier', 'Latest', and 'Current' values start turning into NAs; using the `melt` function from `reshape2`; and using the `Years %>% make_long(Earlier, Latest, Current)` call needed for `geom_sankey`.
Extra info: sessionInfo() R version 4.3.0 (2023-04-21) Platform: aarch64-apple-darwin20 (64-bit) Running under: macOS Ventura 13.6
Any help navigating this quagmire would be greatly appreciated. Thank you.
Hopefully this is what you're looking for. According to the documentation, `geom_parallel_sets` requires `value` to be provided as an aesthetic. I assume that `value` represents the frequency of connections between nodes (that is, the thickness of the links). You can get these counts using `table` and `reshape2::melt()`.
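A minimal sketch of that idea, using the made-up data from the question. Several choices here are my own assumptions rather than anything the question specifies: the names `axes`, `counts`, `long`, and `p` are just illustrative, missing years are relabelled as the string "none" so that they show up as their own node, combinations that never occur are dropped, and the ribbons are filled by the `Current` year (any of the three columns would work, as long as it is constant within each `id`).

```r
library(ggplot2)
library(ggforce)
library(reshape2)

# Same made-up data as in the question
Years <- data.frame(
  Earlier = c(rep(2012, 2), 2013, 2014, rep(2015, 2), rep(2018, 2), rep(2022, 2), rep(NA, 31)),
  Latest  = c(rep(2023, 4), rep(2022, 6), rep(2021, 10), rep(2020, 3), rep(2019, 6),
              rep(2018, 3), rep(2017, 3), rep(2013, 4), rep(NA, 2)),
  Current = c(rep(2023, 10), rep(2022, 12), rep(2021, 11), rep(2020, 1), rep(NA, 7))
)
axes <- c("Earlier", "Latest", "Current")

# Assumption: treat missing years as their own category, labelled "none"
Years[] <- lapply(Years, function(col) ifelse(is.na(col), "none", as.character(col)))

# Count each Earlier/Latest/Current combination; melt() turns the 3-way
# table into one row per combination, with the count in a column named `value`
counts <- reshape2::melt(table(Years))
counts <- counts[counts$value > 0, ]                 # drop combinations that never occur
counts[axes] <- lapply(counts[axes], as.character)   # plain character labels

# Long format for ggforce: adds `id` (row of counts), `x` (axis name), `y` (node label)
long <- gather_set_data(counts, axes)
long$x <- factor(long$x, levels = axes)              # keep the column order on the x axis

p <- ggplot(long, aes(x = x, id = id, split = y, value = value)) +
  geom_parallel_sets(aes(fill = Current), alpha = 0.3, axis.width = 0.1) +
  geom_parallel_sets_axes(axis.width = 0.1) +
  geom_parallel_sets_labels(colour = "white", angle = 0, size = 3) +
  theme_minimal()
p
```

In other words, the `???` placeholders become `split = y` and `value = value`, and `fill` takes one of the original year columns. Because `x` holds the column names, they appear as the axis labels under each column of nodes.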
Created on 2023-11-10 with reprex v2.0.2