Essentially, we want to assign a unique ID to every N-gram contained in a large set of documents. So, if I have 10 million documents to process, I would read the text from each document, extract its N-grams (mostly trigrams), and assign a unique ID to each of them. I would also need to store these IDs somehow so that I can fetch them quickly.
Assigning Unique IDs to a Large Set of Documents
Asked by user965692
Based on the comments above, I would suggest that you simply use each N-gram as its own identifier. That way there's no need to maintain a separate mapping from IDs to N-grams.
For example, say you have a document with the text "hello", which contains the trigrams "hel", "ell", and "llo" (assuming you're not including word boundaries). Instead of first setting up an ID mapping like 1="hel", 2="ell", 3="llo" and having the document signature be the set { 1, 2, 3 }, you could use the N-grams directly as the document signature { "hel", "ell", "llo" }. This way you can even combine the scan and processing phases into a single pass over each document.
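To make the comparison concrete, here is a minimal Python sketch of both approaches, assuming character-level trigrams without word-boundary padding. The function names `char_ngrams` and `signature_with_ids` are hypothetical, chosen only for illustration, and the in-memory dictionary stands in for whatever ID store you would otherwise maintain.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in text (no word-boundary padding)."""
    # A text shorter than n yields an empty set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def signature_with_ids(text, id_map, n=3):
    """ID-based variant: give each previously unseen n-gram the next free integer.

    id_map is a shared dict (n-gram -> ID) that must be kept across documents.
    """
    return {id_map.setdefault(gram, len(id_map) + 1) for gram in char_ngrams(text, n)}


# Direct approach: the n-grams themselves are the document signature.
direct_signature = char_ngrams("hello")

# ID-based approach: needs the extra mapping, but yields an equivalent signature.
id_map = {}
id_signature = signature_with_ids("hello", id_map)
```

With the direct approach each document can be processed in one pass and no shared state is needed. Which N-gram receives which integer in the ID-based variant depends on set iteration order, so only the mapping as a whole is meaningful.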