Sorry if this is a basic question, but I can't wrap my head around it. I want to create a smooth line plot or a histogram for my 8 samples, showing the distribution of gene transcript lengths (x axis) and how frequently each length appears in each sample (y axis).
I've binned the log10 of my gene lengths, but then I'm stuck. I can't plot a plain histogram, because it comes out identical for every sample (all the genes appear in all of the samples), and I'm not sure how to include the expression values from the experiments.
Any suggestions would be appreciated!
Example of the data frame:
          Gene.ID Length ND_R1 ND_R2 NP_R1 NP_R2 dD_R1 dD_R2 dP_R1 dP_R2 log10_length log10_length_bin
1 ENSG00000273901   7999    44    48   122    15    79    61    74   107     3.903036                1
2 ENSG00000165392  23499  1246  1851  1065   106  1755  1787  1291  2169     4.371049                3
3 ENSG00000110172  44999   646   969   945    68  1252  1278  1515  2566     4.653203                4
4 ENSG00000148498   9499    21    33    49     3   135   139   113   202     3.977678                1
5 ENSG00000123473  11499   271   460   381    35   585   560   512   892     4.060660                2
6 ENSG00000081721 229335  4461  6963  6068   467  6211  6198  5674  9733     5.360470                7
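library(dplyr)
library(tidyr)
library(ggplot2)

# Keep only genes inside the binned range, then assign each one to a 0.25-wide log10 bin.
# include.lowest = TRUE keeps a value of exactly 3.75 in bin 1 instead of turning it into NA.
df <- filter(df, log10_length >= 3.75 & log10_length <= 5.5)
df <- mutate(df, log10_length_bin = cut(log10_length, breaks = seq(3.75, 5.5, by = 0.25), labels = FALSE, include.lowest = TRUE))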
df_long <- tidyr::pivot_longer(df,
                               cols = starts_with(c("ND_", "dD_", "NP_", "dP_")),
                               names_to = "Sample",
                               values_to = "Expression")
# n() counts rows (genes) per sample and bin; the Expression column is not used here
counts <- df_long %>%
  group_by(Sample, log10_length_bin) %>%
  summarise(Count = n(), .groups = "drop")
# Plotting attempt so far: uses the wide df, with fill mapped to a single sample column (ND_R1)
ggplot(df, aes(x = log10_length)) +
  geom_histogram(binwidth = 0.1, aes(fill = ND_R1), position = "dodge") +
  labs(x = "Log10 of DoG Length", y = "Frequency", title = "Distribution of Log10 DoG Lengths") +
  scale_fill_discrete(name = "Sample") +
  theme_minimal()
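
A minimal sketch of one way expression could be brought into the plot, assuming that "how frequently a length appears in a sample" can be read as the total expression falling into each length bin. Each gene is weighted by its expression in the long-format data, so the curves differ between samples. The toy data copies three sample columns from the example rows above. The names `weighted`, `Total`, `Fraction` and `p` are illustrative only, and whether raw counts or normalized values (for example CPM) are the right weights is an open assumption.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data: first rows of the example above, three of the eight samples
df <- data.frame(
  Gene.ID = c("ENSG00000273901", "ENSG00000165392", "ENSG00000110172",
              "ENSG00000148498", "ENSG00000123473", "ENSG00000081721"),
  Length  = c(7999, 23499, 44999, 9499, 11499, 229335),
  ND_R1   = c(44, 1246, 646, 21, 271, 4461),
  ND_R2   = c(48, 1851, 969, 33, 460, 6963),
  NP_R1   = c(122, 1065, 945, 49, 381, 6068)
)
df$log10_length <- log10(df$Length)

# One row per gene and sample
df_long <- pivot_longer(df,
                        cols = matches("_R[0-9]+$"),
                        names_to = "Sample",
                        values_to = "Expression")

# Sum expression (instead of counting rows) per sample and length bin,
# then convert to a within-sample fraction so samples are comparable
weighted <- df_long %>%
  mutate(log10_length_bin = cut(log10_length,
                                breaks = seq(3.75, 5.5, by = 0.25),
                                include.lowest = TRUE)) %>%
  group_by(Sample, log10_length_bin) %>%
  summarise(Total = sum(Expression), .groups = "drop") %>%
  group_by(Sample) %>%
  mutate(Fraction = Total / sum(Total)) %>%
  ungroup()

# Same idea done inside ggplot2: stat_bin accepts a weight aesthetic,
# and after_stat(density) rescales each sample's curve to unit area
p <- ggplot(df_long, aes(x = log10_length, weight = Expression, colour = Sample)) +
  geom_freqpoly(aes(y = after_stat(density)), binwidth = 0.25) +
  labs(x = "Log10 of DoG Length", y = "Expression-weighted density") +
  theme_minimal()
print(p)
```

Swapping `geom_freqpoly()` for `geom_histogram()` with `fill = Sample` and `position = "dodge"` would give the bar version of the same weighted plot; `after_stat()` assumes ggplot2 3.3.0 or later.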