I applied Doc2Vec to convert documents into vectors. I then clustered the vectors and found the 5 documents nearest (most similar) to the centroid of each cluster. Now I need to find the most dominant or important terms in these documents so that I can characterize each cluster. Is there any way to find the most dominant or similar terms/words for a document in Doc2Vec? I am using Python's gensim package for the Doc2Vec implementation.
How to find most similar terms/words of a document in doc2vec?
3.4k Views Asked by pankaj jha At
gojomo
@TrnKh's answer is good, but there is an additional option made available when using Doc2Vec.
Some gensim Doc2Vec training modes, either the default PV-DM (dm=1) or PV-DBOW with added word-training (dm=0, dbow_words=1), train both doc-vectors and word-vectors into the same coordinate space. To some extent, that means doc-vectors are near related word-vectors, and vice versa.
So you could take an individual document's vector, or the average/centroid vectors you've synthesized, and feed it to the model to find most_similar() words. (To make clear that this is a raw vector, rather than a list of vector-keys, you should use the form of most_similar() that specifies an explicit list of positive examples.)
For example:
docvec = d2v_model.dv['doc77145']  # assuming such a doc-tag exists; gensim < 4.0 names this .docvecs
similar_words = d2v_model.wv.most_similar(positive=[docvec])  # word-vectors live under .wv
print(similar_words)
To find the most dominant words of your clusters, you can use either of these two classic approaches. I personally found the second one very efficient and effective for this purpose.
Latent Dirichlet Allocation (LDA): A topic-modelling algorithm that gives you a set of topics for a collection of documents. You can treat the set of similar documents in each cluster as one document, apply LDA to generate the topics, and look at the topic distributions across documents.
TF-IDF: TF-IDF calculates the importance of a word to a document given a collection of documents. Therefore, to find the most important keywords/n-grams, you can calculate TF-IDF for every word that appears in the documents. The words with the highest TF-IDF are then your keywords.
Take a look at Section 5.1 here for more details on the use of TF-IDF.
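As a rough illustration of the TF-IDF approach, here is a minimal pure-Python sketch. It assumes that each cluster's nearest-to-centroid documents are merged into one bag of words, and that TF-IDF is then computed across clusters. That merging step, the toy `clusters` data, and the helper name `top_terms_per_cluster` are all hypothetical choices for illustration, not something prescribed by the answer above.

```python
import math
from collections import Counter

# Hypothetical toy clusters: label -> documents nearest that cluster's centroid.
clusters = {
    "sports": ["the team won the match", "the player scored a goal in the match"],
    "finance": ["the bank raised the interest rate", "the market fell as the rate rose"],
}

def top_terms_per_cluster(clusters, topn=3):
    """Rank each cluster's terms by TF-IDF, treating a cluster as one document."""
    # Merge each cluster's documents into a single bag of words.
    bags = {label: Counter(" ".join(docs).split()) for label, docs in clusters.items()}
    n_clusters = len(bags)
    # Document frequency: in how many clusters does each term occur?
    df = Counter(term for bag in bags.values() for term in bag)
    result = {}
    for label, bag in bags.items():
        total = sum(bag.values())
        # tf = relative frequency in the cluster; idf = log(N / df).
        scores = {t: (c / total) * math.log(n_clusters / df[t]) for t, c in bag.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        result[label] = [term for term, _ in ranked[:topn]]
    return result

top_terms = top_terms_per_cluster(clusters)
print(top_terms)
```

Terms that occur in every cluster, such as "the" here, get an IDF of zero and drop out of the ranking, which is why this works reasonably well for describing what distinguishes one cluster from the others. For real data, a library implementation with proper tokenization and stop-word handling is likely a better fit.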