I'm trying to join two data sets using fuzzy matching with the stringdist_left_join function from the fuzzyjoin package, but I keep getting the error message "Error: vector memory exhausted (limit reached?)." Does anybody know why this might be occurring? I would not say that either data set is extremely large.
I expected the two data sets to be joined, but I get this error instead.
Generally, these errors occur because, although each dataset may be small (fewer than 1 million observations each), the stringdist_(.*)_join functions use memory proportional to the product of the number of observations in each dataset, which can be quite large. This is because the functions compute the distance between each pair of rows across the two datasets, which takes O(mn) space to store, where m and n are the number of rows in the two datasets.
One solution would be to split the first dataframe into a set of partitions, join each partition to the second dataframe individually, and then combine the joined dataframes. This uses memory proportional to the product of the number of rows in each partition and the number of rows in the dataframe you are joining to (O(m_i n), where m_i is the number of rows in partition i), and could get you out of a pinch.
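The partitioning approach can be sketched roughly as follows. This is only a minimal sketch, not something fuzzyjoin provides: chunked_left_join, chunk_size and join_fn are hypothetical names made up for illustration. It assumes every chunk comes back with the same columns, so that rbind can stack them; if a chunk can return different columns (for example when nothing in it matches), something more forgiving such as dplyr::bind_rows may be a better fit.

```r
# Hypothetical helper (not part of fuzzyjoin): left-join `x` to `y`
# one chunk of `x` at a time, so only chunk_size * nrow(y) pairs are
# compared at once instead of nrow(x) * nrow(y).
# `join_fn` defaults to fuzzyjoin::stringdist_left_join; any extra
# arguments (by, max_dist, method, ...) are passed through unchanged.
chunked_left_join <- function(x, y, chunk_size = 1000,
                              join_fn = fuzzyjoin::stringdist_left_join,
                              ...) {
  stopifnot(chunk_size >= 1)
  if (nrow(x) == 0) return(join_fn(x, y, ...))

  # Assign each row of x to a chunk of at most chunk_size rows.
  chunk_id <- ceiling(seq_len(nrow(x)) / chunk_size)

  # Join each chunk to the full y separately.
  pieces <- lapply(split(seq_len(nrow(x)), chunk_id), function(rows) {
    join_fn(x[rows, , drop = FALSE], y, ...)
  })

  # Stack the per-chunk results back into one data frame.
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}
```

With fuzzyjoin installed, a call might look like chunked_left_join(df1, df2, chunk_size = 5000, by = "name", max_dist = 2), where df1, df2 and the name column stand in for your own data. A smaller chunk_size lowers peak memory at the cost of more join calls.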
Alternatively, I have written an R package, zoomerjoin, which uses a set of randomized algorithms to try to bring down the memory usage and runtime of large joins. The package supports fewer distance metrics than the fuzzyjoin package, but more will be added in the future.