So I have a distance matrix of pairwise distances between DNA sequences, where the name of each sequence looks like ID0001|Species name. The code I used to generate it is the following:
library(ape)
myseq <- read.dna("sequences.fasta", format = "fasta")
mydist <- dist.dna(myseq)
The resulting matrix looks like this:

What I am trying to do is to calculate the average distance within each genus. The way I have found to do this is by converting the matrix to a data frame, then separating the sequence ID and the species name, and finally extracting the genus (i.e. the first word of the species name):
library(reshape2)
library(stringr)
library(tidyr)
df_dist <- setNames(reshape2::melt(as.matrix(mydist)), c("record1", "record2", "dist"))
df_dist2<-separate(df_dist,record1,into = c("ID1","species1"),sep = "\\|",remove = FALSE,extra = "merge")
df_dist3<-separate(df_dist2,record2,into = c("ID2","species2"),sep = "\\|",remove = FALSE,extra = "merge")
df_dist3$genus1<- str_extract(df_dist3$species1, '[A-Za-z]+')
df_dist3$genus2<- str_extract(df_dist3$species2, '[A-Za-z]+')
Then I retain only the distances between sequences of the same genus, and after that I remove the pairs where both sequences belong to the same species within a genus (not useful for my analysis):
congeneric<-df_dist3[df_dist3$genus1==df_dist3$genus2,]
congeneric2<-congeneric[congeneric$species1 != congeneric$species2,]
Then I calculate the mean distance of each genus and save them as a data frame:
library(dplyr)
congeneric_distances <- congeneric2 %>%
  group_by(genus1) %>%
  summarise(mean_dist = mean(dist))
This code works for some matrices, but with very large ones it fails with a message like this:
Error: cannot allocate vector of size X Gb
I am wondering if there is a way to streamline the code and make it less memory intensive. The step where the code most often "freezes" is the separate() call, but sometimes it fails elsewhere.
I have also been told that I shouldn't convert it to a data frame, as it is more memory intensive than the matrix itself.
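That advice is easy to check: melting an n x n matrix into a long data frame stores a row label, a column label and a value for every single cell, so it takes several times the memory of the matrix itself. A quick sketch (exact sizes will vary by platform):

```r
library(reshape2)

# A 1000 x 1000 numeric matrix: 1e6 doubles, roughly 8 MB
m <- matrix(runif(1000 * 1000), nrow = 1000)
object.size(m)

# The melted long-format data frame is several times larger,
# since every cell now carries two index columns as well
object.size(reshape2::melt(m))
```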
It is a bit hard to answer your question without a reproducible example or some information about your machine, but below is my suggestion.
Some dummy data resembling your dataset:
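Since the original snippet was not preserved here, a minimal stand-in with made-up IDs and species names in the same `ID|Genus species` format:

```r
# Hypothetical labels mimicking the "ID0001|Genus species" format
labels <- c("ID0001|Homo sapiens", "ID0002|Homo neanderthalensis",
            "ID0003|Pan troglodytes", "ID0004|Pan paniscus",
            "ID0005|Gorilla gorilla")

set.seed(1)
n <- length(labels)
m <- matrix(runif(n * n, 0, 0.5), n, n, dimnames = list(labels, labels))
m[lower.tri(m)] <- t(m)[lower.tri(m)]  # symmetrise
diag(m) <- 0                            # distance to self is 0
mydist <- as.dist(m)
```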
Now we can try to answer your question.
1] First, I would suggest listing all the unique genera in your dataset. I will use the library stringr (but this can also be done in base R) and the fact that the genus names are delimited by "|" and " ".

2] Now we can create a data.frame that will store all your average distances (preallocation is always a good thing when dealing with large datasets).
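A sketch of those two steps, using hypothetical example labels (in the real data you would take them from `attr(mydist, "Labels")`):

```r
library(stringr)

# Labels as they appear in the distance matrix (hypothetical examples);
# in the real data: labs <- attr(mydist, "Labels")
labs <- c("ID0001|Homo sapiens", "ID0002|Homo neanderthalensis",
          "ID0003|Pan troglodytes")

# 1] Unique genera: the genus sits between the "|" and the next space
genera <- unique(str_extract(labs, "(?<=\\|)[A-Za-z]+"))

# 2] Preallocate one row per genus for the average distances
res <- data.frame(genus = genera, mean_dist = NA_real_)
```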
3] Compute the average distance. Within each iteration we only need the upper triangular part of the matrix (the diagonal is always 0, and the lower triangle mirrors the upper one). Instead of recreating new data.frames at each iteration, we subset the existing matrix.
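A sketch of that loop, with a tiny deterministic stand-in for the matrix (in the real data, `M <- as.matrix(mydist)` and `res` comes from step 2]; the within-species pairs are also dropped, as required in the question):

```r
library(stringr)

# Tiny stand-in matrix; in the real data: M <- as.matrix(mydist)
labels <- c("ID0001|Homo sapiens", "ID0002|Homo neanderthalensis",
            "ID0003|Pan troglodytes")
M <- matrix(c(0.0, 0.1, 0.5,
              0.1, 0.0, 0.6,
              0.5, 0.6, 0.0), nrow = 3, dimnames = list(labels, labels))

species <- str_split_fixed(colnames(M), "\\|", 2)[, 2]
genus <- word(species, 1)
res <- data.frame(genus = unique(genus), mean_dist = NA_real_)

# 3] One pass per genus: subset the matrix, keep only the upper triangle
#    (diagonal is 0, lower triangle mirrors it), drop same-species pairs
for (i in seq_along(res$genus)) {
  idx <- which(genus == res$genus[i])
  sub <- M[idx, idx, drop = FALSE]
  keep <- upper.tri(sub) & outer(species[idx], species[idx], "!=")
  res$mean_dist[i] <- if (any(keep)) mean(sub[keep]) else NA_real_
}
```

Genera represented by a single species end up as NA rather than 0, which is easier to filter out afterwards.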
4] Our results
Let me know if this solves your problem.
UPDATE
Using the profiling tool provided by the library profvis, we can see that the loop itself doesn't use up much memory, even with a 10,000 x 10,000 distance matrix.
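For reference, the profiling call is simply a matter of wrapping the loop in `profvis()`, which reports time and memory usage per line (the body below is a stand-in workload; substitute the step-3] loop):

```r
library(profvis)

p <- profvis({
  for (i in 1:50) {
    # stand-in workload; put the per-genus loop from step 3] here
    m2 <- matrix(runif(1e5), nrow = 500)
    mean(m2[upper.tri(m2)])
  }
})
```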