I am looking at a sample Hunspell dictionary like this Sanskrit one (or this 800,000+ line zip file one), which has stuff like this:

युयुक्ष्ये/3,4,33,34,53,63,76,86,88,94,158,178,179,182,184,185
युयुम्/48,68,445,2789,4633
युयून्/9
युयुङ्ग/300,422,450,606,641,1559,2836,2859,2869,2870,2871,2908,2910,3124,3420,3421,3423,3424,3425,3429,3430,3431,3432,3433,3434,3435
युयुङ्गान
युयुङ्गाना/1,2,3,17,18,22,24,25,26,27,28,29,30,31,32,34,35,36,38,45,46,47
युयुङ्गाने/3,4,12
युयुङ्गे/522,561,562
युयुङ्गिम/81
युयुङ्गिव/81
युयुङ्गुषी/1,2,3,4,17,18,49,58,59,60,61,64,66,67,128
युयुप/300,422,450,606,641,2836,2859,2869,2870,2871,2908,2910,3124,3420,3421,3423,3424,3425,3429,3430,3431,3432,3433,3434,3435
युयुपान

The characters before the slash are the Devanagari script spelling the Sanskrit word; the numbers after the slash are the IDs of all the prefix/suffix rules that word can combine with (defined in a separate .aff file). The Hunspell docs are sparse but they cover the main ground for those who are curious. It is basically 2 files: a text file listing the dictionary words like I just showed, and an affix file specifying how affixes/substitutions work. But that's kind of unrelated to my question.
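
For a flavor of the other half, the .aff file declares each numbered rule. The snippet below is invented for illustration (the real Sanskrit rules differ), but it follows Hunspell's actual syntax: FLAG num switches flags to decimal numbers, the SFX header line declares suffix rule 9 with 1 variant, and the entry line says "strip nothing (0), append आन्, under any condition (.)":

FLAG num
SFX 9 Y 1
SFX 9 0 आन् .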

My question stems from the way these files are put together. The affix/substitution rules are kept separate from the base words because storing every combination of base word and affix would take up way more space (there are 800k base words, for example, but probably at least 10 million combinations).

If you had an autocomplete input/widget, and you typed Devanagari, how could you architect the data so it could autocomplete all the possible derived word/affix combinations? MUST you "hydrate" a trie with the derived word/affix combinations at runtime? Or is there some magical algorithmic/data-structure trick by which you could keep the data in these two files (or in some simple JSON structure, keeping base words separated from affixes as they are now) and check it at runtime, thereby not having to build a large in-memory trie?

It seems like you'd have to write some custom, non-generic code to dynamically build the affixed words as you query for autocomplete results (see the sketch below); but there are prefixes in addition to suffixes, so it seems you would at least need to precompile the prefix + base word combinations? I am not really sure what can be done to avoid blowing up memory in this situation. Is there a typical way to handle this at a high level?
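
To make that idea concrete, here is a deliberately naive sketch of the query-time approach (hypothetical names throughout; suffixes only, no prefix handling, and a linear scan over every stem, which is exactly what gets slow):

def completions(query, stems):
    # stems: {base_word: [suffix, ...]}, with the .aff rules already resolved
    results = []
    for stem, suffixes in stems.items():
        for suffix in [""] + suffixes:   # "" = the bare stem itself
            word = stem + suffix
            if word.startswith(query):
                results.append(word)
    return results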

Answer:

The appropriate data structure is a patricia tree/radix tree, except you should compress it by keeping only one copy of each distinct subtree. This turns it into a DAG instead of a tree (the result is commonly called a DAWG, a directed acyclic word graph).

Because this reuses both common prefixes and common suffixes, it can replicate the kind of compression that you have in your source files.

To actually build the structure, you'll have to generate all the words, make a radix tree, and then compress it by redirecting references to duplicate nodes to a single 'canonical' copy.
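
As a concrete (if simplified) starting point, here is one possible node representation and the expansion/insertion step. The expand helper is hypothetical, standing in for whatever applies the .aff rules to one .dic entry, and a plain character trie is used instead of a true radix tree to keep the sketch short:

class Node:
    def __init__(self, text=""):
        self.text = text        # edge label (one character in this simple trie)
        self.children = []      # child Nodes, kept sorted by text
        self.is_word = False    # does a word end at this node?
        self.id = None          # assigned later, during compression

def insert(root, word):
    node = root
    for ch in word:
        existing = next((c for c in node.children if c.text == ch), None)
        if existing is None:
            existing = Node(ch)
            node.children.append(existing)
            # keep children sorted so equivalent subtrees compare equal later
            node.children.sort(key=lambda c: c.text)
        node = existing
    node.is_word = True

def build_trie(entries, expand):
    # entries: (stem, flags) pairs parsed from the .dic file
    # expand:  hypothetical helper yielding every surface form of one stem
    root = Node()
    for stem, flags in entries:
        for word in expand(stem, flags):
            insert(root, word)
    return root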

The compression is easily performed by processing the tree recursively in depth-first, post-order fashion (children before their parent), assigning an integer ID to each distinct node as you go, and recording a canonical mapping.

The IDs make the mapping practical by providing a concise way to represent the content of each node in terms of its children.

In Python (using a Node with text, children, is_word, and id fields, like the sketch above), it looks like this:

def compressNode(mapping, node):
    # fold the end-of-word flag into the key with a "$" sentinel
    # (assumes "$" never occurs in the text itself)
    key = node.text + ("$" if node.is_word else "")
    for i in range(len(node.children)):
        # compress children first (post-order), replacing each with its canonical copy
        child = compressNode(mapping, node.children[i])
        node.children[i] = child
        # extend the key with the child ids; children are kept sorted,
        # so equivalent subtrees always produce identical keys
        key += "," + str(child.id)
    # do we already have an equivalent node?
    canonical = mapping.get(key)
    if canonical is None:
        # nope. Assign a new sequential id: the current mapping size
        node.id = len(mapping)
        mapping[key] = node
        canonical = node
    return canonical
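
Usage on a tiny made-up English word list (just to make the subtree sharing visible; the real input would be the expanded Sanskrit forms):

words = ["walked", "walking", "talked", "talking"]
root = Node()
for word in words:
    insert(root, word)

mapping = {}
root = compressNode(mapping, root)
print(len(mapping))   # 11 distinct nodes, versus 19 in the uncompressed trie

The branches after "walk" and "talk" are identical subtrees, so they collapse into one copy, and likewise the shared "ed"/"ing" tails.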

Once you've created the DAG for a Hunspell file, you'll probably want to serialize it so you can read it back without going through this whole process again. The sequential integer IDs assigned to each node will be very handy for this.
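
One possible serialization (names and format are illustrative): write one JSON line per node, in increasing-id order. Because the ids were assigned post-order, every child's line precedes its parent's, and the root comes last:

import json

def serialize(mapping, path):
    nodes = sorted(mapping.values(), key=lambda n: n.id)
    with open(path, "w", encoding="utf-8") as f:
        for node in nodes:
            f.write(json.dumps({
                "text": node.text,
                "word": node.is_word,
                "children": [c.id for c in node.children],
            }, ensure_ascii=False) + "\n")

def deserialize(path):
    nodes = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            node = Node(rec["text"])
            node.is_word = rec["word"]
            # children already exist: their ids are smaller than ours
            node.children = [nodes[i] for i in rec["children"]]
            nodes.append(node)
    return nodes[-1]   # the root was written last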