practical implementation detail of edge labels/splitting in radix trees

109 Views Asked by At

This is a question about what's usually done in practice.

Say we had a radix tree with one entry (for whatever reason, consider this to be a single entry for demonstration):

"tests are really hard, no one likes taking tests, they're the worst"

And then we want to put in a second entry

"team"

We want to end up with an edge from root with

"te"

and two edge's from that, one with

"sts are really hard, no one likes taking tests, they're the worst"

and one with

"am"

Naively, we could create new strings (character arrays, whatever) for "te" and "sts are really...". This requires many operations, even if our new word "team" is short.

Alternatively, we could have the "te" and "sts are really..." labels could contain a reference to the same original string, as well as start/end values, i.e.:

[0, 2] for "te"

and

[2, <whatever it is>] for "sts are really..."

This way we avoid any copying, and the insertion time of "team" is only dependent on the length of "team", and not on lengths of other strings, namely "tests are really..." in this case.

So it's a matter of whether k in O(k) means the length of the string being inserted, or the longest string so far.

Obviously the latter implementation is harder and may in practice use more memory (because of the storing of endpoints), but it seems like the theoretical worst case time is improved.

I'm wondering if anyone knows what is generally done in practice?

Thanks

EDIT: I guess one problem with the latter implementation would arise with deletions. If you later inserted "telepathy", but then deleted "tests are really hard...", the "te" edge would remain, and still have a reference to a string much longer than what's needed.

1

There are 1 best solutions below

1
On

The algorithm for deleting a key is to find its leaf node, delete it, and, if that node's parent has exactly one other child, then splice the two.

Assuming that edge labels are stored as pairs of indices into a whole key, then the splicing can be accomplished by adjusting the lower index for the remaining child. To "garbage collect" all other instances of the deleted key, one can scan rootward, rewriting the edge labels involving the deleted key to reference a sibling. In the worst case, we have to scan all the way back to the root, with no increase in the asymptotic running time, but for random operations, the expected number of rewrites is constant. It's unclear whether this overhead is worth the saved copies.