Fixing fuzzyjoin error message: vector memory exhausted

298 Views Asked by yankees_fan At 13 April 2023 at 17:43

I'm trying to join two data sets using fuzzy matching with the stringdist_left_join function from the fuzzyjoin package, but I keep getting the error message "Error: vector memory exhausted (limit reached?)". Does anybody know why this may be occurring? I would not say that either data set is extremely large.

I expected the two data sets to be joined, but I get the error instead.

There are 1 best solutions below

beniaminogreen On 06 February 2024 at 19:35

Generally, these errors occur because, although each dataset may be small (fewer than 1 million observations each), the stringdist_*_join functions use memory proportional to the product of the numbers of observations in the two datasets, which can be very large. This is because the functions compute the distance between each pair of rows across the two datasets, which takes O(mn) space to store, where m and n are the numbers of rows in the two datasets.
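To get a feel for how quickly O(mn) grows, here is a rough back-of-the-envelope sketch. The helper name estimate_pairwise_gib is made up for illustration, and the 8 bytes per pair is an assumption (one double-precision distance per pair); the real footprint of a join will differ, so treat the result as an order-of-magnitude estimate only.

```r
# Hypothetical helper: approximate memory needed to hold one value for
# every pair of rows, assuming 8 bytes (one double) per pair.
estimate_pairwise_gib <- function(m, n, bytes_per_value = 8) {
  pairs <- as.numeric(m) * as.numeric(n)  # as.numeric avoids integer overflow
  pairs * bytes_per_value / 2^30          # convert bytes to GiB
}

# Two "small" datasets of 100,000 rows each already imply 1e10 pairs.
estimate_pairwise_gib(1e5, 1e5)
```

Under that assumption, two tables of a hundred thousand rows each would need tens of GiB, which is why the error can show up even when neither table looks large.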

One solution would be to split the first dataframe into a set of partitions, join each partition to the second dataframe individually, and then combine the joined dataframes. Each join would then use memory proportional to the product of the number of rows in the partition and the number of rows in the dataframe you are joining to (O(m_i n), where m_i is the number of rows in partition i), which could get you out of a pinch.
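A minimal sketch of the partitioning idea follows. The helper chunked_stringdist_left_join, the toy data, and the chunk_size and max_dist values are all invented for illustration; only stringdist_left_join (fuzzyjoin) and bind_rows (dplyr) are real package functions. Pick a chunk size that fits your own memory budget.

```r
library(fuzzyjoin)  # stringdist_left_join()
library(dplyr)      # bind_rows()

# Hypothetical helper: join x to y one chunk of x at a time, so that only
# chunk_size * nrow(y) pairs are compared in any single join.
chunked_stringdist_left_join <- function(x, y, by, chunk_size = 1000, ...) {
  # Assign each row of x to a chunk of at most chunk_size rows.
  chunk_id <- ceiling(seq_len(nrow(x)) / chunk_size)
  chunks <- split(x, chunk_id)

  # Join each chunk to the full second data frame.
  joined <- lapply(chunks, function(chunk) {
    stringdist_left_join(chunk, y, by = by, ...)
  })

  # Stack the per-chunk results back into one data frame.
  bind_rows(joined)
}

# Toy data, made up for this example.
left  <- data.frame(name = c("apple", "banana", "cherry", "grape"))
right <- data.frame(name = c("appel", "bananna", "grap"), id = 1:3)

result <- chunked_stringdist_left_join(
  left, right,
  by = "name",
  chunk_size = 2,  # deliberately tiny so the toy data is split in two
  max_dist = 2     # passed through to stringdist_left_join()
)
```

Because this is a left join on the first dataframe, each of its rows is handled in exactly one chunk, so stacking the pieces should give the same rows as a single join. The same trick would not carry over directly to a full or right join, where unmatched rows of the second dataframe would be repeated once per chunk.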

Alternatively, I have written an R package, zoomerjoin, which uses a set of randomized algorithms to try to bring down the memory usage and runtime of large joins. The package supports fewer distance metrics than the fuzzyjoin package, but more will be added in the future.
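For reference, a zoomerjoin call might look roughly like the sketch below, which uses its Jaccard-similarity left join. The toy data is invented, and the tuning values are illustrative assumptions rather than recommendations; check the package documentation for the current arguments and defaults. Since the underlying algorithm is randomized, a pair close to the threshold may occasionally be missed.

```r
library(zoomerjoin)  # jaccard_left_join()

# Toy data, made up for this example.
left <- data.frame(
  name = c("International Business Machines", "General Electric Company")
)
right <- data.frame(
  name = c("International Business Machine", "General Electric Co"),
  id = 1:2
)

# Illustrative settings (assumptions, tune for your own data):
#   n_gram_width         length of the character n-grams that are compared
#   n_bands, band_width  locality-sensitive hashing settings
#   threshold            minimum Jaccard similarity for a pair to match
result <- jaccard_left_join(
  left, right,
  by = "name",
  n_gram_width = 2,
  n_bands = 50,
  band_width = 8,
  threshold = 0.7
)
```

Note that this matches on Jaccard similarity of n-grams rather than on an edit distance, so the matches will not be identical to what stringdist_left_join would return.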
