From following schema I'd like to return the count by year of any given word:
CREATE TABLE treatments (
id INTEGER PRIMARY KEY,
treatmentId TEXT UNIQUE NOT NULL CHECK(Length(treatmentId) = 32),
treatmentTitle TEXT COLLATE NOCASE,
journalYear INTEGER,
fulltext TEXT COLLATE NOCASE
);
CREATE VIRTUAL TABLE treatmentsFts USING fts5 (
treatmentTitle,
fulltext,
content=''
);
Given a word such as "formica" I would like to return its frequency of occurrence in fulltext
column by journalYear
so I can create an interactive line chart and link the chart points to the respective treatmentId
, like the Google Ngram viewer (but for single words or potentially phrases).
For performance reasons I would have to build a TF/IDF table. The built in aux function bm25()
utilizes IDF in its calculation. Could I somehow use that? The fts5vocab
table I am in the process of seeing if that could be useful. I am using my database in a Node.js better-sqlite3 application.
Seems the FTS5Vocab virtual tables, all three kinds of them, can help. Confused by the documentation, I decided to experiment with all of them. In case this helps someone else after me, I document my approach below
That's nice as I can get the total count of the term, but no info on the article in which it appeared.
This is only just as useful as the previous one.
Better. Now I am getting somewhere.
Super, I am back in business.