I have two data frames, DF1 and DF2. One of them is very large.
I created example versions of DF1 and DF2 like this:
library(tidyverse)
# DF1: 5 million rows (5 team names repeated 1,000,000 times)
A <- rep(c('Mavs', 'Spurs', 'Lakers', 'Cavs', 'Suns'), 1000000)
DF1 <- data.frame(A)
# DF2: 5,000 rows (5 team names repeated 1,000 times)
B <- rep(c('Rockets', 'Pacers', 'Warriors', 'Suns', 'Celtics'), 1000)
DF2 <- data.frame(B)
I want to compute the cosine distance and the Levenshtein distance between each word in DF1 and each word in DF2 and store the results in a data frame. To do this in a "tidy" way, I'm using the fuzzyjoin package and trying something like this:
library(fuzzyjoin)
# Full fuzzy join: every row of DF1 compared against every row of DF2
result <- DF1 %>%
  stringdist_full_join(DF2, by = c('A' = 'B'),
                       method = "cosine",
                       distance_col = "distance_cos")
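For Levenshtein, I assume the same join works with method = "lv" (just a sketch, not tested at full scale; I've also raised max_dist, since edit distances aren't bounded by 1 the way cosine distances are, and the default of 2 would drop most pairs):

# Sketch: the same join, but with Levenshtein distance (method = "lv")
# max_dist raised so the comparison stays effectively "full"
result_lev <- DF1 %>%
  stringdist_full_join(DF2, by = c('A' = 'B'),
                       method = "lv",
                       max_dist = 20,
                       distance_col = "distance_lev")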
This works fine on small datasets. But at the scale of DF1 and DF2 above, R fails with Error: cannot allocate vector of size N Gb.
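One workaround I've considered, since A and B each contain only a handful of distinct strings, is to run the fuzzy join on the unique values only and then merge the distances back with a regular left_join (a sketch; I haven't verified it helps on the real data):

# Sketch: fuzzy-join only the distinct strings, then re-attach to DF1
# distinct() and left_join() are from dplyr (loaded via tidyverse)
uA <- distinct(DF1, A)   # 5 rows instead of 5 million
uB <- distinct(DF2, B)   # 5 rows instead of 5,000

pairs <- uA %>%
  stringdist_full_join(uB, by = c('A' = 'B'),
                       method = "cosine",
                       distance_col = "distance_cos")

# Ordinary equi-join back onto the full DF1 by the original string
result <- left_join(DF1, pairs, by = "A")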
But I'm not sure that's the right approach. Is there a simple way to solve the memory problem? And is it possible to calculate the Levenshtein distance as well?
Any help would be appreciated. Thanks!