Assigning unique IDs to a large set of documents


Essentially, we want to assign a unique ID to every N-gram contained in a large set of documents. For example, if I have 10 million documents to process, I would read the text of each document, extract its N-grams (mostly trigrams), and assign a unique ID to each of them. I would also need to store these IDs somewhere so that I can fetch them quickly.


There is 1 best solution below


Based on the comments above, I would suggest that you simply use the N-gram as its own identifier. That way there's no need to maintain a separate mapping from IDs to N-grams.

For example, say you have a document containing the text "hello", which contains the trigrams "hel", "ell", and "llo" (assuming you're not including word boundaries). Instead of first setting up an ID mapping like 1="hel", 2="ell", 3="llo" and having the document signature be the set { 1, 2, 3 }, you could use the N-grams directly as the document signature { "hel", "ell", "llo" }. This way you can even combine the scan and processing phases into a single pass over each document.
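
As a rough illustration of the idea, here is a minimal Python sketch that builds each document's signature directly from its character trigrams in one pass. The names `char_ngrams` and `signatures`, the `(doc_id, text)` input shape, and the second sample document are assumptions made for this example, not something from the question.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in text (no word-boundary padding)."""
    # For text shorter than n the range is empty, so the result is an empty set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def signatures(documents, n=3):
    """Yield (doc_id, signature) pairs, reading each document exactly once."""
    for doc_id, text in documents:
        # The n-grams themselves are the identifiers; no ID table is built.
        yield doc_id, char_ngrams(text, n)


# Hypothetical input: an iterable of (doc_id, text) pairs.
docs = [("doc1", "hello"), ("doc2", "help")]
sigs = dict(signatures(docs))

# Signatures can be compared directly with set operations.
shared = sigs["doc1"] & sigs["doc2"]
print(sorted(sigs["doc1"]))  # ['ell', 'hel', 'llo']
print(sorted(shared))        # ['hel']
```

Because the strings act as their own keys, signatures produced by separate workers or separate runs stay comparable without coordinating a shared ID table. Whether holding the raw strings is acceptable at the scale of 10 million documents depends on memory and storage constraints that the question does not spell out.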