I have used gensim.utils.simple_preprocess(str(sentence)) to create a dictionary of words that I want to use for topic modelling. However, this also filters out important numbers (house resolutions, bill numbers, etc.) that I really need. How do I overcome this? Possibly by replacing digits with their word form. How do I go about it, though?
How do I retain numbers while preprocessing data using Gensim in Python?
729 Views Asked by piñatabreaker At
There is 1 solution below.
You don't have to use simple_preprocess(). It doesn't do much, it isn't very configurable or sophisticated, and the other Gensim algorithms typically just need lists of tokens.

So choose your own tokenization, which in some cases, depending on your source data, could be as simple as a .split() on whitespace.

If you want to look at what simple_preprocess() does, as a model, you can view its Python source at: https://github.com/RaRe-Technologies/gensim/blob/351456b4f7d597e5a4522e71acedf785b2128ca1/gensim/utils.py#L288
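As a rough illustration of rolling your own tokenization, here is a minimal sketch of a tokenizer that keeps digits. The function name tokenize_keep_numbers, the regular expression, the length limits, and the sample sentences are all made up for this example and are not part of Gensim; adjust them to your data. It uses only the standard library, so the resulting lists of tokens can then be handed to whatever you were feeding before (for example gensim.corpora.Dictionary, which accepts an iterable of token lists).

```python
import re

# Hypothetical pattern (an assumption, not Gensim's): runs of word characters,
# optionally joined by hyphens, so "56-a" stays one token. Unlike the
# alphabetic-only pattern behind simple_preprocess(), \w also matches digits.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")


def tokenize_keep_numbers(text, min_len=1, max_len=30):
    """Lowercase the text and return a list of tokens, digits included."""
    tokens = TOKEN_RE.findall(str(text).lower())
    # Drop tokens outside the chosen length range (limits are illustrative).
    return [t for t in tokens if min_len <= len(t) <= max_len]


# Made-up sample documents for illustration only.
docs = [
    "House Resolution 1234 was referred to committee.",
    "Bill No. 56-A amends H.R. 1234.",
]

# One list of tokens per document, which is the shape Gensim models expect.
texts = [tokenize_keep_numbers(d) for d in docs]

for tokens in texts:
    print(tokens)
```

If your source text is already clean, replacing the regular expression with a plain sentence.lower().split() may be enough; the regex version mainly helps by stripping surrounding punctuation while leaving the numbers in place.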