Essentially, we want to be able to uniquely assign IDs to all the N grams contained in a large set of documents. So, if I have 10 million documents to process, I would read the text from each one of the document and get N grams (mostly trigrams) and should be able to assign unique IDs to these N-grams. Somehow, I would need to store these unique IDs so that I can fetch them fast.
Assigning Unique Ids to large set of documents
78 Views Asked by user965692 At
1
There are 1 best solutions below
Related Questions in DICTIONARY
- C++ using std::vector across boundaries
- Linked list without struct
- Connecting Signal QML to C++ (Qt5)
- how to get the reference of struct soap inherited in C++ Proxy/Service class
- Why we can't assign value to pointer
- Conversion of objects in c++
- shared_ptr: "is not a type" error
- C++ template using pointer and non pointer arguments in a QVector
- C++ SFML 2.2 vectors
- Lifetime of temporary objects
Related Questions in UNIQUE-ID
- C++ using std::vector across boundaries
- Linked list without struct
- Connecting Signal QML to C++ (Qt5)
- how to get the reference of struct soap inherited in C++ Proxy/Service class
- Why we can't assign value to pointer
- Conversion of objects in c++
- shared_ptr: "is not a type" error
- C++ template using pointer and non pointer arguments in a QVector
- C++ SFML 2.2 vectors
- Lifetime of temporary objects
Trending Questions
- UIImageView Frame Doesn't Reflect Constraints
- Is it possible to use adb commands to click on a view by finding its ID?
- How to create a new web character symbol recognizable by html/javascript?
- Why isn't my CSS3 animation smooth in Google Chrome (but very smooth on other browsers)?
- Heap Gives Page Fault
- Connect ffmpeg to Visual Studio 2008
- Both Object- and ValueAnimator jumps when Duration is set above API LvL 24
- How to avoid default initialization of objects in std::vector?
- second argument of the command line arguments in a format other than char** argv or char* argv[]
- How to improve efficiency of algorithm which generates next lexicographic permutation?
- Navigating to the another actvity app getting crash in android
- How to read the particular message format in android and store in sqlite database?
- Resetting inventory status after order is cancelled
- Efficiently compute powers of X in SSE/AVX
- Insert into an external database using ajax and php : POST 500 (Internal Server Error)
Popular # Hahtags
Popular Questions
- How do I undo the most recent local commits in Git?
- How can I remove a specific item from an array in JavaScript?
- How do I delete a Git branch locally and remotely?
- Find all files containing a specific text (string) on Linux?
- How do I revert a Git repository to a previous commit?
- How do I create an HTML button that acts like a link?
- How do I check out a remote Git branch?
- How do I force "git pull" to overwrite local files?
- How do I list all files of a directory?
- How to check whether a string contains a substring in JavaScript?
- How do I redirect to another webpage?
- How can I iterate over rows in a Pandas DataFrame?
- How do I convert a String to an int in Java?
- Does Python have a string 'contains' substring method?
- How do I check if a string contains a specific word?
Based on comments above, I would suggest that you simply use the N-gram as it's own identifier. That way there's no need to maintain a separate mapping from IDs to N-grams.
For example, say you have a document containing the text "hello", which contains the trigrams "hel", "ell", and "llo" (assuming you're not including word boundaries). Instead of first setting up an ID mapping like 1="hel", 2="ell", 3="llo" and having the document signature be the set { 1, 2, 3 }, you could use the N-grams directly as the document signature { "hel", "ell", "llo" }. This way you can even combine the scan and processing phases to just a single pass over a document.