How to do Latent Semantic Analysis on a very large dataset

I am trying to run latent semantic analysis (LSA) or principal component analysis (PCA) on a very large dataset of about 50,000 documents and over 300,000 words/terms, in order to reduce the dimensionality so I can plot the documents in 2-D.

I have tried this in both Python and MATLAB, but in either case my system runs out of memory and crashes because of the size of the dataset. Does anyone know how I can reduce the memory load, or perform some sort of approximate LSA/PCA that runs faster and more efficiently? My overall goal is a large dimensionality reduction over the 300k terms.
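For reference, here is a stripped-down sketch of the kind of pipeline I mean (not my exact code; I'm assuming scikit-learn's TfidfVectorizer and PCA just to show where the memory blow-up happens):

```python
# Simplified stand-in for my pipeline: the vectorizer output is sparse and
# fits in memory, but densifying it for PCA/SVD is what crashes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

docs = ["example document one", "another example document"]  # stand-in for the ~50,000 real documents
X = TfidfVectorizer().fit_transform(docs)  # sparse doc-term matrix, ~50,000 x 300,000 on the real data

# The dense conversion is the killer: 50,000 * 300,000 * 8 bytes is roughly 120 GB.
coords_2d = PCA(n_components=2).fit_transform(X.toarray())
```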

There is 1 answer below.

You could have a look at Oja's rule. It defines an iterative, online procedure for learning the principal components, so you never have to form the full covariance matrix or run a full SVD. You then just have to implement it so that the whole dataset is never loaded from disk at once, for example by streaming the documents in small batches, which keeps memory use bounded.
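A minimal sketch of how that might look in Python, assuming a scipy sparse TF-IDF matrix and a mini-batch variant of Oja's subspace rule; the function, the file name and the learning-rate schedule below are illustrative choices, not something prescribed by the rule itself:

```python
# Mini-batch Oja's subspace rule for the top-k principal directions.
# Documents are processed in small batches, so no dense 50,000 x 300,000
# intermediate is ever formed.
import numpy as np
from scipy import sparse

def oja_pca(row_batches, n_features, k=2, lr=0.01, epochs=5, seed=0):
    """Estimate the top-k principal directions with Oja's subspace rule.

    row_batches: a callable returning an iterator over scipy sparse
                 matrices of shape (batch_size, n_features), e.g. row
                 chunks of a TF-IDF document-term matrix.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_features, k))
    W, _ = np.linalg.qr(W)               # start from an orthonormal basis

    for epoch in range(epochs):
        eta = lr / (1.0 + epoch)         # simple decaying learning rate
        for X in row_batches():          # X: sparse batch of shape (b, n_features)
            Y = np.asarray(X @ W)        # project the batch onto the current basis, (b, k)
            # Oja's subspace update: W += eta * (X^T Y - W (Y^T Y))
            W += eta * (np.asarray(X.T @ Y) - W @ (Y.T @ Y))
            W, _ = np.linalg.qr(W)       # re-orthonormalize to keep the update stable
    return W

# Usage sketch (hypothetical file name): stream row chunks of the sparse matrix.
# X = sparse.load_npz("tfidf.npz")       # e.g. 50,000 x 300,000 sparse matrix
# def batches(X=X, size=512):
#     for i in range(0, X.shape[0], size):
#         yield X[i:i + size]
# W = oja_pca(lambda: batches(), n_features=X.shape[1], k=2)
# coords_2d = X @ W                      # 2-D coordinates for plotting
```

The QR step after each batch keeps the columns of W orthonormal, which is what keeps the subspace rule stable. Note also that classic LSA skips mean-centering, so this behaves like a streaming truncated SVD of the TF-IDF matrix; for textbook PCA you would subtract a (running) mean from each batch first.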