Reverse TF-IDF vector (vec2text)

Given a generated doc2vec vector for some document, is it possible to reverse the vector back to the original document? If so, does there exist any hash algorithm that would make the vector irreversible but still comparable to other vectors of the same type (using cosine/Euclidean distance)?
178 views. Asked by first_question_magnus.

There is 1 answer below.
It's unclear why you've mentioned a "TF-IDF vector" in your question title, but then asked about a Doc2Vec vector – which is very different from a TF-IDF approach. I'll assume your main interest is Doc2Vec vectors.

In general, a Doc2Vec vector has far too little information to actually reconstruct the document for which the vector was calculated. It's essentially a compressed summary, based on evolving (in training or inference) a vector that's good, within the limits of the model, at predicting the document's words.

For example, one commonly used dimensionality for Doc2Vec vectors is 300. Each of those 300 dimensions is represented by a 4-byte floating-point value, so the vector is 1,200 bytes in total – but it could be the summary vector for a document of many hundreds or thousands of words, far larger than 1,200 bytes.

It's theoretically plausible that, with a Doc2Vec vector and the associated model from which it was trained or inferred, you could generate a ranked list of the words most likely to be in the document. There's a pending feature request to offer this in Gensim (#2459), but no implementing code yet. Such a list of words wouldn't be grammatical, though, and the top 10 words in it might not be in the document at all. (It might be made up entirely of other, similar words.)

With a large set of calculated vectors, as you have when training of a model has finished, you could take a vector (from that set, or from inferring a new text) and search the set for whichever known vector is closest to your query vector. That would point you at one of your known documents – but that's more of a lookup (when you already know many example documents) than reversing a vector into a document directly.
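As a concrete illustration of that lookup, here is a minimal sketch using Gensim (the 4.x API is assumed; the tiny corpus and all parameters are made up for the example):

```python
# Minimal sketch of the "closest known vector" lookup described above.
# Assumes Gensim 4.x; corpus, vector_size, and epochs are illustrative only.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock prices fell sharply on tuesday",
]
docs = [TaggedDocument(words=text.split(), tags=[i])
        for i, text in enumerate(corpus)]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

# Infer a vector for a new text, then look up the closest known documents.
query_vec = model.infer_vector("my cat chased a dog".split())
for tag, similarity in model.dv.most_similar([query_vec], topn=2):
    print(corpus[tag], similarity)
```

Note that this only identifies which already-known document is nearest; it never reconstructs text from the vector itself.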
You'd have to say more about your need for an 'irreversible' vector that is still good for document-to-document comparisons before I could make further suggestions to meet that need.
To an extent, Doc2Vec vectors already meet that need, as they can't regenerate an exact document. But given that they could generate a list of likely words (per above), if your needs are more rigorous, you might need extra steps. For example, if you used a model to calculate all the needed vectors, but then threw away the model, even that theoretical capability to list the most-likely words would go away.

But to the extent you still have the vectors, and potentially their mappings to full documents, a vector still implies one, or a few, closest documents from the known set. And even if you somehow had a novel vector without its text, simply looking at which of your known documents are closest would be highly suggestive (but not dispositive) of what words are in the source document.
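If you did want such an extra step, one well-known family of techniques (per this question's lsh tag) is random-hyperplane hashing, a form of locality-sensitive hashing: project each vector onto a set of random hyperplanes and keep only the sign bits. The Hamming distance between two such bit signatures approximates the cosine angle between the original vectors, while the exact float values are discarded. A minimal NumPy sketch, with made-up sizes:

```python
# Sketch of sign-random-projection (random-hyperplane) LSH.
# Sizes are illustrative: 300-dim input vectors, 256-bit signatures.
import numpy as np

rng = np.random.default_rng(42)
n_dims, n_bits = 300, 256
hyperplanes = rng.standard_normal((n_bits, n_dims))

def signature(vec):
    # One bit per hyperplane: which side of the hyperplane the vector falls on.
    return (hyperplanes @ vec) >= 0

def bit_similarity(sig_a, sig_b):
    # Fraction of matching bits: near 1.0 for small cosine angles,
    # near 0.5 for unrelated (roughly orthogonal) vectors.
    return np.mean(sig_a == sig_b)

a = rng.standard_normal(n_dims)
b = a + 0.1 * rng.standard_normal(n_dims)  # a near-duplicate of a
c = rng.standard_normal(n_dims)            # an unrelated vector

print(bit_similarity(signature(a), signature(b)))  # close to 1.0
print(bit_similarity(signature(a), signature(c)))  # close to 0.5
```

The same caveat applies here as to the raw vectors, though: given the hyperplanes and enough bits, a vector's direction can still be approximated, so this raises the bar on reconstruction rather than making it impossible.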
(If your needs are very demanding, there might be something in the genre of 'Fully Homomorphic Encryption' and/or 'Private Information Retrieval' that would help. Those use advanced cryptography to allow queries on encrypted data that reveal only final results, hiding the details of what you're doing even from the system answering your query. But those techniques are far newer and more complicated, with few if any sources of ready-to-use code, and adapting them specifically for vector-similarity-style calculations might require significant custom advanced-cryptography work.)