Introduction
I am building a caching system where each cache node may run any number of SQL queries drawn from a predefined, finite set of SQL queries, each taking 0-n parameters.
Based on the result of those queries, the node does a rather slow computation and returns a value that is cached.
Queries might look like:
Query #1:
SELECT name
FROM users
WHERE id = ?;
Query #2:
SELECT email
FROM emails
WHERE deleted_at IS NULL AND user_id = ?;
Other queries may use joins and have zero or multiple parameters, but the number of queries is finite.
I keep track of the queries and parameter sets that each node calls, and build a dependency list. Then, when a query result changes, I know I need to invalidate all cache nodes that depend on it and recompute their values.
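The bookkeeping described above could be sketched like this (a minimal illustration; all names such as `recordDependency` and `nodesToInvalidate` are mine, not part of the actual system):

```javascript
// Hedged sketch: an inverted index from (query, parameter set) to the cache
// nodes that depend on it, so a changed query result maps back to nodes.
const dependents = new Map(); // "queryId|params JSON" -> Set of node ids

function key(queryId, params) {
  return queryId + '|' + JSON.stringify(params);
}

// Called while a node computes its value, once per query it runs.
function recordDependency(nodeId, queryId, params) {
  const k = key(queryId, params);
  if (!dependents.has(k)) dependents.set(k, new Set());
  dependents.get(k).add(nodeId);
}

// Called when a (query, parameter set) result is known to have changed.
function nodesToInvalidate(queryId, params) {
  return [...(dependents.get(key(queryId, params)) || [])];
}

// Usage: node "n1" reads query #1 with parameters ['foo'].
recordDependency('n1', 'Q1', ['foo']);
nodesToInvalidate('Q1', ['foo']); // -> ['n1']
```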
Heart of the problem
The hard part is knowing which queries and parameter sets are impacted when I run an INSERT, UPDATE, or DELETE.
Examples
INSERT INTO users ("id", "name")
VALUES ('foo', 'John');
This operation will impact query #1 with parameters `['foo']`, and all cache nodes that depend on that query with exactly those parameters should be invalidated.
UPDATE users
SET birth_date = '1990-01-01'
WHERE id = 'foo';
This operation will not impact query #1 because that query does not rely on the `birth_date` column to build its result.
DELETE FROM users
WHERE id = 'bar';
This will impact query #1 with parameters `['bar']`, even though after the operation no rows match query #1.
First solution
The solution I came up with works, but it could surely be improved.
- For each operation on the database, keep track of the set of rows and columns that were impacted:
  - INSERT: consider the inserted row, with all of its columns
  - UPDATE: consider the row before and after it was updated, with only the updated columns. You end up with 2 rows
  - DELETE: consider the deleted row before it was deleted, with all of its columns
- For each row found in step 1, find all queries that might be impacted. This is where I am doing a lot of manual work today: I am currently listing all the dependencies of each query by hand. Example for `Q1`:
const dependencies = [
{
table: 'users',
columns: ['id', 'name'],
getParams: (row) => [[row.id]],
}
]
A few interesting things to notice:
- A query may depend on multiple tables when using joins, so dependencies is an array
- I list the columns the query depends on, so updates on other columns can be skipped
- We know a row impacts a query by looking at the table and the columns
- We need to find the set of parameters based on the row.
The result is an array because a row might impact the same query with multiple parameter sets. In this basic example the array has length 1, because the row impacts the query with a single parameter set.
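The matching logic the points above describe could be sketched like this (an illustration only; `impactedParams` is a name I made up, and `dependencies` is the hand-written list from the example):

```javascript
// Hedged sketch: given a changed row, find the parameter sets of queries
// that might be impacted, using the hand-written dependency list.
const dependencies = [
  {
    table: 'users',
    columns: ['id', 'name'],
    getParams: (row) => [[row.id]],
  },
];

function impactedParams(table, row) {
  const params = [];
  for (const dep of dependencies) {
    if (dep.table !== table) continue; // the table must match
    // the changed columns must overlap the columns the query depends on
    if (!dep.columns.some((c) => c in row)) continue;
    params.push(...dep.getParams(row)); // may yield several parameter sets
  }
  return params;
}

impactedParams('users', { id: 'foo', name: 'John' }); // -> [['foo']]
```

An update touching only `birth_date` would produce a row without `id` or `name`, so the column-overlap check skips the dependency and nothing is invalidated.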
Now consider the following query:
UPDATE users
SET id = 'bar'
WHERE id = 'foo';
Based on step 1, we build two rows:
- `{ id: 'foo' }`: value of the row before the update
- `{ id: 'bar' }`: value of the row after the update

Notice that both rows only have the `id` column, because we only updated this column.
Now, looking at the dependencies array we built above, we know that both rows impact query `Q1`, because the table matches and the columns overlap (they both have the `id` column).
To find the parameter sets, I need to call `getParams` for each row and flatten the results: `[['foo'], ['bar']]`.
And that is it. We now invalidate all cache nodes that depend on `Q1` with parameter set `['foo']` or `['bar']`.
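Putting the two steps together for that UPDATE, the whole pass could be sketched as follows (illustrative only, reusing the dependency record from above):

```javascript
// Hedged end-to-end sketch for: UPDATE users SET id = 'bar' WHERE id = 'foo';
const dependencies = [
  { table: 'users', columns: ['id', 'name'], getParams: (row) => [[row.id]] },
];

// Step 1: an UPDATE yields two partial rows (before/after, updated cols only).
const changedRows = [{ id: 'foo' }, { id: 'bar' }];

// Step 2: match each row against the dependencies, then flatten the results.
const paramSets = changedRows.flatMap((row) =>
  dependencies
    .filter((d) => d.table === 'users' && d.columns.some((c) => c in row))
    .flatMap((d) => d.getParams(row))
);
// paramSets is now [['foo'], ['bar']]: invalidate Q1 for both parameter sets.
```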
Open question
I am looking for any other route I might have overlooked. Most importantly, I am looking for a way to build the dependencies of each query automatically; doing it by hand is slow, difficult, and error-prone.
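For the automation part, one naive route is to derive the table and columns from the SQL text itself. A real SQL parser would be the robust choice; the crude sketch below is only meant to show the idea, handles nothing beyond single-table `SELECT ... FROM ... WHERE col = ?` shapes (it would miss the `deleted_at IS NULL` dependency in Q2, for instance), and every name in it is mine:

```javascript
// Hedged, naive sketch: build a dependency record from a single-table SELECT.
// Only handles `SELECT cols FROM table WHERE col = ? [AND col = ?]` shapes.
function buildDependency(sql) {
  const select = /SELECT\s+(.+?)\s+FROM\s+(\w+)/is.exec(sql);
  if (!select) throw new Error('unsupported query shape');
  const selected = select[1].split(',').map((c) => c.trim());
  // Columns compared to a `?` placeholder drive the parameter set, in order.
  const paramCols = [...sql.matchAll(/(\w+)\s*=\s*\?/g)].map((m) => m[1]);
  return {
    table: select[2],
    columns: [...new Set([...selected, ...paramCols])],
    getParams: (row) => [paramCols.map((c) => row[c])],
  };
}

const dep = buildDependency('SELECT name FROM users WHERE id = ?;');
// dep.table === 'users'; dep.columns === ['name', 'id']
// dep.getParams({ id: 'foo' }) === [['foo']]
```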
As for other possible approaches, I suggest you check whether you can work directly with your RDBMS, if your RDBMS has a result-caching ability. Some RDBMSs can be interrogated about the result-cache status of a SQL query, giving you a direct way to know whether your cache entries are still valid, without having to parse DML statements. Query object dependencies are also provided by at least one RDBMS, which could be useful for your automated dependency build.
Pro:
Cons:
For the first con, a good way to handle it would be to check the last DML execution on your RDBMS (through audit tables, for example) before checking the cache result for a batch of queries. There is still a loss (I/O latency), but at least it will minimize the load on the RDBMS and your cache layer. (This is not fully reliable and can induce race conditions: if you run a query at T, and get an "OK, no DML" answer for T+delta1 from the audit at T+delta2, you will commit your cache entry at T+delta2 while a DML could occur between T+delta1 and T+delta2.)

To illustrate the point, you can submit an `EXPLAIN PLAN` statement on Oracle RDBMS 11+ with the `/*+ RESULT_CACHE */` hint to get a result cache id for your query. You can then later query `v$result_cache_objects` for this cache id, with `TYPE` as `Result`, and check the `STATUS` column. If it is different from `Published` (`Invalidate`, `Sync`, etc.), or if the cache id has changed, you can invalidate your cache entry. It also means that you need to get and store the result cache id of your query in your cache entry when you populate or refresh it. And you should get the result cache id immediately after executing your query with `/*+ RESULT_CACHE */`, reading it from the query's DBMS block, so you have to use a SQL client/interface able to return these data. Documentation here.
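A sketch of that Oracle-side check might look like this (column and status names are from my reading of the `V$RESULT_CACHE_OBJECTS` documentation; verify them against your Oracle version before relying on this):

```sql
-- After running the query with the /*+ RESULT_CACHE */ hint and storing the
-- result cache id alongside the cache entry, periodically check its status.
SELECT status
  FROM v$result_cache_objects
 WHERE type = 'Result'
   AND id = :stored_cache_id; -- any STATUS other than 'Published' => invalidate
```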
With SQL Server, AFAIK, to this day there is no result-cache ability: SQL Server applies caching to its output buffers, so it cannot support that kind of usage.