Fastest way to match an abbreviation against a dictionary

Question

Fastest way to match an abbreviation against a dictionary

790 Views Asked by Wayne Hong At 17 August 2025 at 15:01

An abbreviation is a string of alphanumeric characters. The numbers stand for number of letters to skip, for example i18n is an abbreviation of internationalization. So is inter15 and 20. Say you have a dictionary of words, what is the fastest way to find all words in the dictionary that match a given abbreviation? You can use any data structure you like for your dictionary but the algorithm to find matching words must be better than O(n) where n is the number of words in the dictionary.

Original Q&A

There are 3 best solutions below

**beny1700** · Answer 1

A solution would be to categorise all the words in multiple arrays. Each arrays contains all the words that have a specific amount of letters. For example an array with the words of 4 letters like: food, wood, like, pike, etc. and an other with the 5 letters words etc. etc.

So with your abreviation i18n, you split chars and numbers: "in" "18" and you add the chars amount to the number : 2+18 =20. Now you know that your word is in the array of the words with 20 letters.

I think we can do better, but this is better than looking for a word in the whole dictionnary.

**M Oehm** · Answer 2

Here's an idea for an a10n tree (abbreviation tree, or should that be a10n t2e?).

Missing letters are replaced by the length of the missing chunk, so the length of the match is known beforehand. It makes sense to subdivide the dictionary into sub-dictionaries that contain words of a constant length:

dict -> dict2 -> {ad, ah, am, an, as, at, ...}
     -> dict3 -> {abe, abo, abu, ace, act, ada, add, ... }
     -> dict4 -> {abba, abbe, abby, ...}
     -> ...

Each of this dictionaries contains an ordered list of words. If the abbreviation is, say "5", return that list for the length-5 sub-dict. If the abbreviation is "green", i.e. no abbreviation at all, check whether that word is in the list using binary search.

With the trivial cases catered for, we must find a way to search this list fast. Let's introduce a position-and-letter tree. The first level of the tree refers to the position of the letter, the second level to the letter itself, much like in a trie, for example:

dict3 ->  i == 0 -> a   -> {ant}
                 -> b   -> {bat, bee}
                 -> c   -> {cat, cod, cow}
                 -> ...
          i == 1 -> a   -> {bat, cat, rat}
                 -> b   -> {}
                 -> c   -> {}
                 -> ...

Now find the intersection of the list for all given letters. If we are looking for "c2", we take the list at (0, c). If we are looking for "b1t", we take the intersection of the lists at (0, b) and (2, t). When the lists are ordered, finding the intersection should be reasonably fast.

There are other approaches. Tries are a data structure that allows us to search dictionaries fast. I've said in a comment to a now deleted post that I don't think tries are a suited data structure in this case, because tries are good for finding words when the first letters are known. But I'm not so sure now: One could descend all branches of a trie when a wildcard, i.e a missing letter, is found. For a word like 'i18n' the number of branches seems to be quite large, but in practice, there will be many null branches. In my (small) test dict of about 45,000 words, there is no word matching 'i18n', not even internationalisation. So for real dictionaries tries might even be an option.

(Apologies to the user who deleted the answer, but I referred to a single trie for the dict in my comment. Sub-tries for each length might work if the dictionary isn't gargantuan.)

**Jim Mischel** · Answer 3

So you have a query that's prefix - count - suffix. There are several ways to go about this.

If the prefix will never be empty, then you can build a prefix tree (it's just a trie), and query it for all words that start with that prefix, filtering for those that have the requested length and suffix.

You could do the same thing by building a generalized suffix tree.

Or, since either the prefix or suffix could be empty, you could build a prefix tree and a suffix tree. Query both, filtering for length, and union the results.

You could conceivably build a single prefix-suffix tree. The data structure would be somewhat more complex than having two separate trees, but it would require less memory.

As I recall (it's been some years), you can do all of this and more (search for words with missing letters, etc.) with a Directed Acyclic Word Graph (DAWG).

Fastest way to match an abbreviation against a dictionary

There are 3 best solutions below

Related Questions in REGEX

Related Questions in ALGORITHM

Related Questions in DATA-STRUCTURES

Related Questions in DICTIONARY

Related Questions in ABBREVIATION

Trending Questions

Popular # Hahtags

Popular Questions