Cosine similarity between rows of two large dataframes in R


I have two data frames, DF1 and DF2, and one of them is very large.

I've created example versions of DF1 and DF2 like this:

library(tidyverse)

A<-rep(c('Mavs', 'Spurs', 'Lakers', 'Cavs', 'Suns'), 1000000)
DF1<-data.frame(A)

B<-rep(c('Rockets', 'Pacers', 'Warriors', 'Suns', 'Celtics'), 1000)
DF2<-data.frame(B)

I want to compute the cosine similarity and the Levenshtein distance between each word in DF1 and each word in DF2 and store the results in a data frame. To do that in a "tidy" way, I used the fuzzyjoin package. I'm trying something like this:

library(fuzzyjoin)
DF1 <- DF1 %>% stringdist_full_join(DF2, by = c('A' = 'B'),
                                    method = "cosine",
                                    distance_col = "distance Cos")

This works fine with small datasets, but DF1 and DF2 are too large for it: R fails with "Error: cannot allocate vector of size N Gb".
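To put the error in perspective, here is a rough back-of-the-envelope estimate in R of how much memory a single distance column would need for the example data above. It assumes the join keeps every pair of rows and stores each distance as an 8-byte double; the variable names are only illustrative.

```r
# Row counts taken from the example data: 5 team names repeated.
n1 <- 5 * 1000000   # rows in DF1
n2 <- 5 * 1000      # rows in DF2

# A full join on a distance has to consider every (DF1 row, DF2 row) pair.
pairs <- n1 * n2

# Assumption: one double (8 bytes) per pair for the distance column alone,
# ignoring the two character columns and any intermediate copies.
bytes_per_double <- 8
gib <- pairs * bytes_per_double / 1024^3

print(pairs)
print(round(gib, 1))
```

Even this lower bound is far more than a typical workstation has, so the allocation error is expected at this scale.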

Is there a simple way to solve this problem? Is it possible to calculate the Levenshtein distance too?
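One possible direction, sketched here under assumptions rather than offered as a tested solution for the real data: the example columns contain only a handful of distinct words, so the distances can be computed once per pair of unique values with the stringdist package (which fuzzyjoin uses for its stringdist joins). The names ua, ub and lookup are made up for this sketch, the inputs are scaled down, and the approach only helps if the real data also has many repeated values.

```r
library(stringdist)

# Scaled-down stand-ins for the columns A and B from the example above.
A <- rep(c('Mavs', 'Spurs', 'Lakers', 'Cavs', 'Suns'), 1000)
B <- rep(c('Rockets', 'Pacers', 'Warriors', 'Suns', 'Celtics'), 10)

# Only distinct words need a distance; repeated rows share the same result.
ua <- unique(A)
ub <- unique(B)

# Matrices with one row per word in ua and one column per word in ub.
cos_mat <- stringdistmatrix(ua, ub, method = "cosine", q = 1)  # cosine distance on character 1-grams
lv_mat  <- stringdistmatrix(ua, ub, method = "lv")             # Levenshtein distance

# Flatten to a long lookup table. as.vector() walks the matrix column by
# column, so A cycles fastest and B changes once per block of length(ua).
lookup <- data.frame(
  A = rep(ua, times = length(ub)),
  B = rep(ub, each = length(ua)),
  distance_cos = as.vector(cos_mat),
  distance_lv = as.vector(lv_mat)
)

print(lookup)
```

The lookup table stays small (here 25 rows) no matter how often each word is repeated. Expanding it back to every row pair of DF1 and DF2 would run into the same memory limit, so it is probably best kept in this compact form.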

I would appreciate any help. Thanks!
