Database schema for storing ngrams with multiple element search

631 Views Asked by At

I want to store a large number of ngrams on disk in such a way that I can perform the following queries on it:

  • Fetch all ngrams
  • Fetch all ngrams of a certain size
  • Fetch all ngrams which contain all these given elements in any position (subset)
  • Fetch all ngrams of a certain size which have these given elements in these positions (template)

An example for the third point would be all ngrams containing 'a', 'b' and 'c' which results in ngrams like (a,b,c), (b,c,a), (x,a,z,b,c), etc.

An example for the fourth point would be all ngrams following the template (a, *, *, b) which results in ngrams like (a,x,y,b), (a,a,a,b), etc.

At the moment I'm storing them in a database table with a separate field for each element of the ngram but this doesn't seem to be the best option for searching ngrams containing given elements in any order and position. In order to search for 3grams containing "a", "b" and "c" I am using the following SQL 'where' clause:

WHERE
     (ele0 = 'a' OR ele1 = 'a' OR ele2 = 'a') AND
     (ele0 = 'b' OR ele1 = 'b' OR ele2 = 'b') AND
     (ele0 = 'c' OR ele1 = 'c' OR ele2 = 'c')

This does not scale up well at all. Is there a better way to structure the data and query it?

1

There are 1 best solutions below

6
On

You don't specify what a "large number" are. I cannot think, off-hand, of a way to support all the operations you want using standard SQL optimization methods. In some databases, full text support might help.

If you want to use SQL (which is perfectly reasonable as a persistent store), I would suggest that you simply use strings. In other words, an ngram is a string.

Your queries would look like:

select *
from ngrams;

select *
from ngrams
where len(ngram) = XXX

select *
from ngrams
where ngram like '%a%' and ngram like '%b%' and ngram like '%c%';

select *
from ngrams
where ngram like 'a__b';

You can then enhance this structure to make it more efficient for certain queries. For instance, if you want to optimize the queries to get length, then add a length column and index it (this will not be very useful unless you have a lot of different lengths). To optimize queries of the third type, add a new column that has the elements in alphabetical order (so, "CBA" would also have a column "ABC"). An index on this would facilitate queries of the third type.

EDIT (in response to comment):

I always thought n-grams referred first to individual characters, but Wikipedia says they are order sets of any items.

You can readily handle "words" with the above schema, just by introducing a separator that is not an allowed character in any word, say the '|' delimiter. So, the n-gram "ABC" would be stored as "|A|B|C|":

select *
from ngrams;

select *
from ngrams
where ngramLen = XXX

select *
from ngrams
where ngram like '%|a|%' and ngram like '%|b|%' and ngram like '%|c|%';

select *
from ngrams
where ngram like |a|%|b|' and ngramLen = 4;

In this case, you would want a separate field that had the number of elements, because you cannot calculate that readily using a length function.

Given that you are thinking of having millions of ngrams, you have a bit of a challenge. With words, this could occupy up to gigabytes of memory. For performance, you will want the table to fit into memory. These operations are very well suited for a parallel database, so the process will scale smoothly. One of the advantages to using a database, in fact, is that you can simply throw more memory/disk/processors at the problem, and you will get better performance.