Formatting data for a Sankey plot in R


This is my overall data frame:

dput(gene_data)
structure(list(Addiction = c("PMC6202276", "PMC10542560", "PMC8357835", 
"PMC8463497", "PMC10497570", "PMC10256172", "PMC10616319", "PMC4224688", 
"PMC10247417"), Parkinson = c("PMC6868244", NA, NA, NA, NA, NA, 
NA, NA, NA), Autism = c("PMC7102905", "PMC9668748", NA, NA, NA, 
NA, NA, NA, NA), Aggression = c("PMC4896746", NA, NA, NA, NA, 
NA, NA, NA, NA), depression = c("PMC5734943", NA, NA, NA, NA, 
NA, NA, NA, NA), schizophrenia = c("PMC8761210", "PMC8938529", 
NA, NA, NA, NA, NA, NA, NA), reward = c("32059760", "37657442", 
NA, NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA, 
-9L))

Each column is a disorder, and the values under it are the IDs of publications on that disorder (PMCIDs, or PMIDs in the reward column).
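Since a Sankey plot is built from source/target pairs rather than from wide columns, a first step could be to reshape this into one row per disorder/publication pair. The following is a minimal base R sketch, assuming only two of the columns (shortened here for illustration); the column names `disorder` and `pmcid` are my own choice, not something the tutorial requires.

```r
# Shortened stand-in for gene_data (two columns, NA-padded like the original)
gene_data <- data.frame(
  Autism        = c("PMC7102905", "PMC9668748", NA),
  schizophrenia = c("PMC8761210", "PMC8938529", NA),
  stringsAsFactors = FALSE
)

# stack() turns the wide columns into two columns: values and ind
long <- stack(gene_data)
names(long) <- c("pmcid", "disorder")
long$disorder <- as.character(long$disorder)

# Drop the NA padding and put the disorder first
long <- long[!is.na(long$pmcid), c("disorder", "pmcid")]
rownames(long) <- NULL

print(long)
```

Each row of `long` is then one disorder-to-publication link.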

As a small example, I will take schizophrenia = ("PMC8761210", "PMC8938529") and Autism = ("PMC7102905", "PMC9668748").

One of my objectives is to find out how many of the genes that were significant in our analysis overlap with the genes reported in similar studies.

The list of significant genes from our study is fixed at 910 genes.

For schizophrenia

  • PMC8761210 = Total 1800 genes, overlapped genes 50 (genes matching our list)
  • PMC8938529 = Total 800 genes, overlapped genes 75

For Autism

  • PMC7102905 = Total 600 genes, overlapped genes 100
  • PMC9668748 = Total 200 genes, overlapped genes 150
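The numbers above can be put into a small table and turned into percentages. This is only a sketch: the column names are hypothetical, and because "overlap percentage" could mean either the share of the publication's genes or the share of our 910 genes, both are computed and whichever is intended can be kept.

```r
our_genes <- 910  # size of our fixed list of significant genes

overlaps <- data.frame(
  disorder      = c("schizophrenia", "schizophrenia", "Autism", "Autism"),
  pmcid         = c("PMC8761210", "PMC8938529", "PMC7102905", "PMC9668748"),
  total_genes   = c(1800, 800, 600, 200),
  overlap_genes = c(50, 75, 100, 150),
  stringsAsFactors = FALSE
)

# Assumption 1: overlap as a share of the publication's gene list
overlaps$pct_of_publication <- 100 * overlaps$overlap_genes / overlaps$total_genes

# Assumption 2: overlap as a share of our 910 genes
overlaps$pct_of_ours <- 100 * overlaps$overlap_genes / our_genes

print(overlaps)
```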

I followed this tutorial for the Sankey plot, but I can't work out how to shape the data frame so that it fits what the sankey function expects.

The x-axis should start with the disorder (such as schizophrenia or Autism), then flow to the publication (PMCID), and finally show the percentage overlap between our genes and that publication's genes.
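One possible shape for that three-stage flow is a links table (source, target, value) plus a nodes table, which is what `networkD3::sankeyNetwork()` takes. This is a sketch under several assumptions: that networkD3 is an acceptable package (the tutorial may use a different one), that the percentage is relative to the publication's gene count, and that the link width is the overlapped gene count. The names `links`, `nodes`, `source_id`, and `target_id` are illustrative.

```r
overlaps <- data.frame(
  disorder      = c("schizophrenia", "schizophrenia", "Autism", "Autism"),
  pmcid         = c("PMC8761210", "PMC8938529", "PMC7102905", "PMC9668748"),
  total_genes   = c(1800, 800, 600, 200),
  overlap_genes = c(50, 75, 100, 150),
  stringsAsFactors = FALSE
)

# Label for the last stage, assuming percentage of the publication's genes
overlaps$pct_label <- sprintf(
  "%.1f%% overlap", 100 * overlaps$overlap_genes / overlaps$total_genes
)

# Stage 1: disorder -> publication; stage 2: publication -> percentage label
links <- rbind(
  data.frame(source = overlaps$disorder, target = overlaps$pmcid,
             value = overlaps$overlap_genes, stringsAsFactors = FALSE),
  data.frame(source = overlaps$pmcid, target = overlaps$pct_label,
             value = overlaps$overlap_genes, stringsAsFactors = FALSE)
)

# One row per distinct node; sankeyNetwork() expects zero-based node indices
nodes <- data.frame(name = unique(c(links$source, links$target)),
                    stringsAsFactors = FALSE)
links$source_id <- match(links$source, nodes$name) - 1
links$target_id <- match(links$target, nodes$name) - 1

# Only build the widget if networkD3 is installed
if (requireNamespace("networkD3", quietly = TRUE)) {
  p <- networkD3::sankeyNetwork(
    Links = links, Nodes = nodes,
    Source = "source_id", Target = "target_id",
    Value = "value", NodeID = "name"
  )
  # print(p) displays the plot in an interactive session
}
```

Other Sankey packages usually want the same long source/target/value layout, so the `links` table should carry over even if the final plotting call differs.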

Any help or suggestions would be really appreciated.
