In a Postgres db (engine version 14.6), I have two tables: school_students and id_cards. They look like the following:
CREATE TABLE school_students (
  id BIGSERIAL PRIMARY KEY,
  school_id character varying NOT NULL,
  student_id character varying NOT NULL
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX school_students_pkey ON school_students(id int8_ops);
CREATE UNIQUE INDEX school_students_school_student_idx ON school_students(school_id text_ops,student_id text_ops);
CREATE TABLE id_cards (
  id BIGSERIAL PRIMARY KEY,
  school_student_id bigint NOT NULL REFERENCES school_students(id)
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX id_cards_pkey ON id_cards(id int8_ops);
CREATE INDEX id_cards_school_student_id_idx ON id_cards(school_student_id int8_ops);
ID cards are periodically reissued to students, so there may be any number of cards per student. Each card is unique. Each student+school entity is unique. A student may belong to multiple schools.
There are currently 11 schools in the system, with the number of students in each ranging anywhere from ~200 to ~20,000. Most students have one id card, but they may be issued new cards at any time and there is no limit to how many id cards they may have.
Here is a fiddle with sample data.
In front of the DB is an API with a resource whose job it is to take a list of school_ids and retrieve the list of the most recently created id cards for each school+student, limited to a select number of students per school_id. The API is to be queried frequently enough that it must be performant.
I have the following query to get me what I need without any limiting:
SELECT id FROM id_cards
WHERE id IN (
  SELECT MAX(c.id) FROM id_cards c
  JOIN school_students ss ON c.school_student_id = ss.id
  WHERE
    school_id IN ('1', '2', '3') -- up to 100 ids per application code; in practice, there are only a dozen or so schools
  GROUP BY
    school_id,
    student_id
);
However, in my sub-query, I'd like to limit the number of school+student records returned to 100 for each school_id listed in the WHERE clause, in order to avoid runaway queries when a given school has a large number of students. For instance: the latest id card issued for up to 100 of the most recently added students for school 1, the same for school 2, and so on for each school listed in the query.
Is there a performant query that will accomplish this?
The mutilated relational model makes more efficient query techniques impossible.
The mentioned view is irrelevant for performance optimization. A view is basically just a stored query. It has no data nor indices of its own. A view is a convenience feature that doesn't help performance. (Not to be confused with a MATERIALIZED VIEW.)

The main table id_cards only features school_student_id. There is no way to immediately deduce a school from this number. So there is also no way to add an index to speed up the query at hand. Indexes apply to one table in Postgres (and most other RDBMS). My initially suggested smart queries are useless without a matching index.

That said, for the moderately few rows in your DB, even a "brute force" query with a (single!) sequential scan on both tables (the only remaining option unless you add a MATERIALIZED VIEW or change the relational model) is not that expensive. This should be as good as it gets: fiddle
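The query itself is in the fiddle. Roughly sketched (not necessarily the exact query from the fiddle), such a brute-force query could look like this, taking "most recently added student" to mean the highest school_students.id and using the limit of 100 from the updated question:

-- sketch only: latest card per school+student,
-- keeping the 100 most recently added students per school
SELECT id
FROM (
  SELECT card_id AS id,
         row_number() OVER (PARTITION BY school_id
                            ORDER BY ss_id DESC) AS student_rank
  FROM (
    SELECT DISTINCT ON (ss.school_id, ss.student_id)
           ss.school_id,
           ss.id AS ss_id,
           c.id  AS card_id
    FROM school_students ss
    JOIN id_cards c ON c.school_student_id = ss.id
    WHERE ss.school_id IN ('1', '2', '3')
    ORDER BY ss.school_id, ss.student_id, c.id DESC  -- latest card per school+student
  ) latest_card
) ranked
WHERE student_rank <= 100;  -- up to 100 students per school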
Related:
About DISTINCT ON:
About the (optional) optimization with ROWS UNBOUNDED PRECEDING:

If we know that all result rows are among the N most recent entries, we could work with that and optimize the above query:
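A sketch of that idea, with an assumed cutoff of the 50,000 newest cards (the number is purely for illustration, and the result is only correct if the assumption actually holds):

SELECT id
FROM (
  SELECT card_id AS id,
         row_number() OVER (PARTITION BY school_id
                            ORDER BY ss_id DESC) AS student_rank
  FROM (
    SELECT DISTINCT ON (ss.school_id, ss.student_id)
           ss.school_id,
           ss.id AS ss_id,
           c.id  AS card_id
    FROM (
      SELECT id, school_student_id
      FROM id_cards
      ORDER BY id DESC
      LIMIT 50000  -- only the N newest cards; assumes all wanted rows are in here
    ) c
    JOIN school_students ss ON ss.id = c.school_student_id
    WHERE ss.school_id IN ('1', '2', '3')
    ORDER BY ss.school_id, ss.student_id, c.id DESC
  ) latest_card
) ranked
WHERE student_rank <= 100;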
Note that the (highly unrealistic!) data distribution in your fiddle adds all the latest rows for a single school, so this optimization would not apply!
Also, the increased limit of 100 (up from 2) in your updated question makes this optimization a lot less juicy (or not applicable at all).
Reading only the top N rows from an index (and sorting) is proportionally faster than reading a couple 100k rows. In this case (but not for the general query above), it would help to make the PK into a covering index and get index-only scans out of it like this:
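For instance (a sketch; the exact definition in the linked fiddle may differ):

-- replace the plain PK with a covering one, so reading the newest cards
-- by id (descending) can be an index-only scan
ALTER TABLE id_cards
  DROP CONSTRAINT id_cards_pkey,
  ADD PRIMARY KEY (id) INCLUDE (school_student_id);

Rebuilding the PK like this takes an ACCESS EXCLUSIVE lock on the table for the duration.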
See:
Then again, selecting the "latest" entries based on a serial ID is unreliable to begin with. Concurrent write load and other internals can mess with the sequence. Such IDs are only guaranteed to be unique, not necessarily sequential. An added timestamp is more reasonable. (Plus rules for how to break ties.)
BTW, Postgres 16 (and to some extent already pg 15) has multiple improvements over Postgres 14 that help the performance of these queries. Compare:
fiddle
Among other things, ROWS UNBOUNDED PRECEDING doesn't help any more, because the optimization is built-in now. (Doesn't hurt, either.)

Outdated answer for initial question
(While still assuming a table school_id_cards.)

If your table is big, you want to avoid expensive whole-table sequential scans. Use a smart query to pick qualifying rows with index(-only) scans from a matching index. Hugely faster.
Typically, there should exist some kind of "school" table in your DB with exactly one row per relevant school. Makes the query simpler and faster:
fiddle
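The original query is only in the fiddle. A rough reconstruction of the rCTE-plus-LATERAL shape described here might look like the following, assuming a table schools(id) and the school_id_cards(id, school_id, student_id) table mentioned above, with the initial question's limit of 2 students per school:

-- hedged reconstruction, not necessarily the query from the fiddle
WITH RECURSIVE cte AS (
  (  -- non-recursive term: the latest card per listed school
    SELECT s.id AS school_id, c.student_id, c.id, 1 AS n,
           ARRAY[c.student_id] AS seen
    FROM schools s
    CROSS JOIN LATERAL (
      SELECT student_id, id
      FROM school_id_cards
      WHERE school_id = s.id
      ORDER BY id DESC
      LIMIT 1
    ) c
    WHERE s.id IN ('1', '2', '3')
  )
  UNION ALL  -- recursive term: latest card of the next, not-yet-seen student
  SELECT cte.school_id, c.student_id, c.id, cte.n + 1,
         cte.seen || c.student_id
  FROM cte
  CROSS JOIN LATERAL (
    SELECT student_id, id
    FROM school_id_cards i
    WHERE i.school_id = cte.school_id
    AND   i.student_id <> ALL (cte.seen)
    ORDER BY id DESC
    LIMIT 1
  ) c
  WHERE cte.n < 2  -- per-school limit from the initial question
)
SELECT id FROM cte;

Since only a handful of rows are read per school, this can run on index(-only) scans instead of scanning whole tables.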
This scales well for a small limit per school like you demonstrated. For a large limit, I would switch to a different query.
About the use of a recursive CTE (rCTE):
Be sure to have a matching index like:
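(The index below is an assumption, modeled on the hypothetical school_id_cards table from the sketch above.)

-- supports: WHERE school_id = ... ORDER BY id DESC LIMIT 1
CREATE INDEX school_id_cards_school_latest_idx
  ON school_id_cards (school_id, id DESC);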
A simpler index with (default) ascending sort order is barely any worse. Postgres can scan B-tree indices backwards. Only opposing sort order would be less ideal.
If there is no school table:
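One way (again only a sketch): derive the school list, together with the latest card per school as the starting row, with DISTINCT ON and use it as the non-recursive term of the rCTE above:

SELECT DISTINCT ON (school_id)
       school_id, student_id, id, 1 AS n,
       ARRAY[student_id] AS seen
FROM school_id_cards
WHERE school_id IN ('1', '2', '3')
ORDER BY school_id, id DESC;  -- one row per school: its latest card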
About DISTINCT ON:

You could replace the non-recursive term with another, nested rCTE to generate the list of schools (possibly with the latest card to kick things off) ...
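A sketch of such a nested rCTE, emulating a loose index scan over the distinct school_ids of the assumed school_id_cards table (it needs an index with leading school_id to be fast):

WITH RECURSIVE school_list AS (
  (SELECT school_id FROM school_id_cards ORDER BY school_id LIMIT 1)
  UNION ALL
  SELECT (SELECT c.school_id
          FROM school_id_cards c
          WHERE c.school_id > s.school_id
          ORDER BY c.school_id
          LIMIT 1)
  FROM school_list s
  WHERE s.school_id IS NOT NULL
)
SELECT school_id FROM school_list
WHERE school_id IS NOT NULL;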
But there really should be a school table. Create it if you don't have it.
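For example (names are assumptions; the id type matches school_students.school_id):

CREATE TABLE schools (
  id character varying PRIMARY KEY
);

-- seed it from existing data
INSERT INTO schools (id)
SELECT DISTINCT school_id FROM school_students;

-- enforce referential integrity going forward
ALTER TABLE school_students
  ADD FOREIGN KEY (school_id) REFERENCES schools (id);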