Essentially, we want to assign a unique ID to every N-gram contained in a large set of documents. For example, if I have 10 million documents to process, I would read the text of each document, extract its N-grams (mostly trigrams), and assign a unique ID to each of these N-grams. I would also need to store these unique IDs so that I can fetch them quickly.
Assigning Unique IDs to a large set of documents
80 Views Asked by user965692 At
Based on the comments above, I would suggest that you simply use each N-gram as its own identifier. That way, there's no need to maintain a separate mapping from IDs to N-grams.
For example, say you have a document containing the text "hello", which contains the trigrams "hel", "ell", and "llo" (assuming you're not including word boundaries). Instead of first setting up an ID mapping like 1="hel", 2="ell", 3="llo" and having the document signature be the set { 1, 2, 3 }, you could use the N-grams directly as the document signature: { "hel", "ell", "llo" }. This way you can even combine the scan and processing phases into a single pass over each document.
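As a rough illustration of the idea, here is a minimal Python sketch that builds each document's signature directly from its character trigrams in one pass. The function names `char_ngrams` and `signatures`, and the `(doc_id, text)` input shape, are made up for this example; a real pipeline would read the text from files and might tokenize differently.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in text (no word-boundary padding)."""
    # A text shorter than n yields an empty range, and so an empty set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def signatures(documents, n=3):
    """Yield (doc_id, signature) pairs in a single pass over the documents.

    `documents` is assumed to be an iterable of (doc_id, text) pairs.
    The n-grams themselves serve as identifiers, so no ID table is built.
    """
    for doc_id, text in documents:
        yield doc_id, char_ngrams(text, n)


# Hypothetical sample data, standing in for text read from real documents.
docs = [("doc1", "hello"), ("doc2", "help")]
result = dict(signatures(docs))

# Signatures are plain sets, so shared n-grams are a set intersection.
shared = result["doc1"] & result["doc2"]
```

Because the signatures are ordinary sets of strings, comparing two documents needs no lookup step: the trigrams they have in common fall out of a set intersection.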