How to count word frequency in a fulltext column?

76 Views Asked by At

From following schema I'd like to return the count by year of any given word:

CREATE TABLE treatments (
    id INTEGER PRIMARY KEY,
    treatmentId TEXT UNIQUE NOT NULL CHECK(Length(treatmentId) = 32),
    treatmentTitle TEXT COLLATE NOCASE,
    journalYear INTEGER,
    fulltext TEXT COLLATE NOCASE
);

CREATE VIRTUAL TABLE treatmentsFts USING fts5 (
    treatmentTitle,
    fulltext,
    content=''
);

Given a word such as "formica" I would like to return its frequency of occurrence in fulltext column by journalYear so I can create an interactive line chart and link the chart points to the respective treatmentId, like the Google Ngram viewer (but for single words or potentially phrases).

For performance reasons I would have to build a TF/IDF table. The built in aux function bm25() utilizes IDF in its calculation. Could I somehow use that? The fts5vocab table I am in the process of seeing if that could be useful. I am using my database in a Node.js better-sqlite3 application.

1

There are 1 best solutions below

0
On

Seems the FTS5Vocab virtual tables, all three kinds of them, can help. Confused by the documentation, I decided to experiment with all of them. In case this helps someone else after me, I document my approach below

sqlite> CREATE VIRTUAL TABLE treatmentsFtvRow 
   ...> USING fts5vocab('treatmentsFts', 'row');
sqlite> SELECT * FROM treatmentsFtvRow WHERE term = 'formica';
term     doc   cnt 
-------  ----  ----
formica  1377  3087

That's nice as I can get the total count of the term, but no info on the article in which it appeared.

sqlite> CREATE VIRTUAL TABLE treatmentsFtvCol 
   ...> USING fts5vocab('treatmentsFts', 'col');
sqlite> SELECT * FROM treatmentsFtvCol WHERE term = 'formica';
term     col       doc       cnt
-------  --------  --------  ------
formica  fulltext  1377      3087     

This is only just as useful as the previous one.

sqlite> CREATE VIRTUAL TABLE treatmentsFtvIns 
   ...> USING fts5vocab('treatmentsFts', 'instance');
sqlite> SELECT * FROM treatmentsFtvIns WHERE term = 'formica';
term     doc   col       offset
-------  ----  --------  ------
formica  86    fulltext  0     
formica  770   fulltext  1     
formica  770   fulltext  3     
formica  1374  fulltext  4     
formica  1561  fulltext  358
…

Better. Now I am getting somewhere.

sqlite> SELECT journalYear, Count(doc) AS c
   ...> FROM treatmentsFtvins JOIN treatments 
   ...>     ON treatmentsFtvins.doc = treatments.id 
   ...> WHERE term = 'formica'
   ...> GROUP BY journalYear
   ...> ORDER BY journalYear ASC, c DESC;
journalYear  c  
-----------  ---
1841         3  
1846         2  
1850         2  
1851         26 
1853         5  
1855         7  
1857         27 
1859         30 
1860         4  
1861         9  
1862         18 
1863         4  
1866         12 
1877         1  
1884         2  
1886         12 
…

Super, I am back in business.