How do I retain numbers while preprocessing data using Gensim in Python?

731 Views Asked by piñatabreaker At 09 May 2021 at 13:21

I have used gensim.utils.simple_preprocess(str(sentence)) to create a dictionary of words that I want to use for topic modelling. However, this also filters out important numbers (house resolutions, bill numbers, etc.) that I really need. How can I overcome this? Possibly by replacing digits with their word form. How do I go about it, though?



gojomo On 10 May 2021 at 08:21 BEST ANSWER

You don't have to use simple_preprocess() - it's not doing much, it's not that configurable or sophisticated, and typically the other Gensim algorithms just need lists-of-tokens.

So, choose your own tokenization - which in some cases, depending on your source data, could be as simple as a .split() on whitespace.
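As a minimal sketch of such a do-it-yourself tokenizer, something like the following keeps digits. The function name `tokenize_keep_numbers`, the regex, and the length limits are illustrative assumptions, not part of Gensim:

```python
import re

# Hypothetical pattern: runs of word characters (letters, digits, underscore).
# Unlike an alphabetic-only pattern, this keeps tokens such as "1234".
TOKEN_PATTERN = re.compile(r"\w+")


def tokenize_keep_numbers(text, min_len=1, max_len=15):
    """Lowercase the text and return word/number tokens within the length limits."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


docs = [
    "House Resolution 1234 amends Bill No. 56",
    "The committee discussed bill 56 again",
]

# One list of tokens per document, which is the shape Gensim models expect.
texts = [tokenize_keep_numbers(doc) for doc in docs]
print(texts[0])
```

The resulting lists of tokens can then be passed on as usual, for example to `gensim.corpora.Dictionary(texts)`.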

If you want to look at what simple_preprocess() does, as a model, you can view its Python source at:

https://github.com/RaRe-Technologies/gensim/blob/351456b4f7d597e5a4522e71acedf785b2128ca1/gensim/utils.py#L288
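To illustrate why the numbers disappear, here is a rough, Gensim-free approximation of that behaviour. It assumes the linked source lowercases the text, matches runs of word characters that exclude digits, and then filters tokens by length (2 to 15 characters by default) and drops tokens that start with an underscore. The function names are hypothetical and this is not Gensim's actual code:

```python
import re

# Assumed to mirror the alphabetic-only matching: word characters except digits.
ALPHABETIC = re.compile(r"(?:(?!\d)\w)+")
# Variant that also accepts digits.
ALPHANUMERIC = re.compile(r"\w+")


def approx_simple_preprocess(doc, pattern=ALPHABETIC, min_len=2, max_len=15):
    """Approximate the tokenizer: lowercase, match tokens, filter by length."""
    return [
        tok for tok in pattern.findall(doc.lower())
        if min_len <= len(tok) <= max_len and not tok.startswith("_")
    ]


sample = "Bill 56 and HR1234"
print(approx_simple_preprocess(sample))                        # digits dropped
print(approx_simple_preprocess(sample, pattern=ALPHANUMERIC))  # digits kept
```

Swapping the pattern is the only change needed in this sketch. Note that the default minimum length of 2 would still drop single-digit tokens unless `min_len` is lowered.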
