How to do Latent Semantic Analysis on a very large dataset

I am trying to run latent semantic analysis (LSA) or principal component analysis (PCA) on a very large dataset of about 50,000 documents and over 300,000 words/terms, in order to reduce the dimensionality so I can plot the documents in 2-D.

I have tried this in both Python and MATLAB, but in either case my system runs out of memory and crashes because of the size of the dataset. Does anyone know how I can reduce the memory load, or perform some sort of approximate LSA/PCA that runs faster and more efficiently? My overall goal is a large dimensionality reduction over the 300k terms.
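For reference, here is a stripped-down sketch of the kind of pipeline I mean (not my exact code; I'm assuming scikit-learn's TfidfVectorizer and PCA just to show where the memory blow-up happens):

```python
# Simplified stand-in for my pipeline: the vectorizer output is sparse and
# fits in memory, but densifying it for PCA/SVD is what crashes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

docs = ["example document one", "another example document"]  # stand-in for the ~50,000 real documents
X = TfidfVectorizer().fit_transform(docs)  # sparse doc-term matrix, ~50,000 x 300,000 on the real data

# The dense conversion is the killer: 50,000 * 300,000 * 8 bytes is roughly 120 GB.
coords_2d = PCA(n_components=2).fit_transform(X.toarray())
```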

There is 1 answer below.

You could have a look at Oja's rule. It defines an iterative, online procedure for learning the principal components, so you never have to form the full covariance matrix or run a full SVD. You then just have to implement it so that the whole dataset is never loaded from disk at once, for example by streaming the documents in small batches, which keeps memory use bounded.
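A minimal sketch of how that might look in Python, assuming a scipy sparse TF-IDF matrix and a mini-batch variant of Oja's subspace rule; the function, the file name and the learning-rate schedule below are illustrative choices, not something prescribed by the rule itself:

```python
# Mini-batch Oja's subspace rule for the top-k principal directions.
# Documents are processed in small batches, so no dense 50,000 x 300,000
# intermediate is ever formed.
import numpy as np
from scipy import sparse

def oja_pca(row_batches, n_features, k=2, lr=0.01, epochs=5, seed=0):
    """Estimate the top-k principal directions with Oja's subspace rule.

    row_batches: a callable returning an iterator over scipy sparse
                 matrices of shape (batch_size, n_features), e.g. row
                 chunks of a TF-IDF document-term matrix.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_features, k))
    W, _ = np.linalg.qr(W)               # start from an orthonormal basis

    for epoch in range(epochs):
        eta = lr / (1.0 + epoch)         # simple decaying learning rate
        for X in row_batches():          # X: sparse batch of shape (b, n_features)
            Y = np.asarray(X @ W)        # project the batch onto the current basis, (b, k)
            # Oja's subspace update: W += eta * (X^T Y - W (Y^T Y))
            W += eta * (np.asarray(X.T @ Y) - W @ (Y.T @ Y))
            W, _ = np.linalg.qr(W)       # re-orthonormalize to keep the update stable
    return W

# Usage sketch (hypothetical file name): stream row chunks of the sparse matrix.
# X = sparse.load_npz("tfidf.npz")       # e.g. 50,000 x 300,000 sparse matrix
# def batches(X=X, size=512):
#     for i in range(0, X.shape[0], size):
#         yield X[i:i + size]
# W = oja_pca(lambda: batches(), n_features=X.shape[1], k=2)
# coords_2d = X @ W                      # 2-D coordinates for plotting
```

The QR step after each batch keeps the columns of W orthonormal, which is what keeps the subspace rule stable. Note also that classic LSA skips mean-centering, so this behaves like a streaming truncated SVD of the TF-IDF matrix; for textbook PCA you would subtract a (running) mean from each batch first.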