Select N latest rows per school, but skip duplicates from the same student

In a Postgres db (engine version 14.6), I have two tables: school_students and id_cards. They look like the following:

CREATE TABLE school_students (
    id BIGSERIAL PRIMARY KEY,
    school_id character varying NOT NULL,
    student_id character varying NOT NULL
);

-- Indices -------------------------------------------------------

CREATE UNIQUE INDEX school_students_pkey ON school_students(id int8_ops);
CREATE UNIQUE INDEX school_students_school_student_idx ON school_students(school_id text_ops,student_id text_ops);

CREATE TABLE id_cards (
    id BIGSERIAL PRIMARY KEY,
    school_student_id bigint NOT NULL REFERENCES school_students(id)
);

-- Indices -------------------------------------------------------

CREATE UNIQUE INDEX id_cards_pkey ON id_cards(id int8_ops);
CREATE INDEX id_cards_school_student_id_idx ON id_cards(school_student_id int8_ops);

ID cards are periodically reissued to students, so there may be any number of cards per student. Each card is unique. Each student+school entity is unique. A student may belong to multiple schools.

There are currently 11 schools in the system, with the number of students in each ranging anywhere from ~200 to ~20,000. Most students have one id card, but they may be issued new cards at any time and there is no limit to how many id cards they may have.

Here is a fiddle with sample data.

In front of the DB is an API with a resource whose job it is to take a list of school_ids and retrieve the list of the most recently created id cards for each school+student, limited to a select number of students per school_id. The API is to be queried frequently enough that it must be performant.

I have the following query to get me what I need without any limiting:

SELECT id FROM id_cards
WHERE id IN (
  SELECT MAX(c.id) FROM id_cards c
  JOIN school_students ss ON c.school_student_id = ss.id
  WHERE
    school_id in ('1', '2', '3') -- up to 100 ids per application code; in practice, there are only a dozen or so schools
  GROUP BY
    school_id,
    student_id
);

However, in my sub-query, I'd like to limit the number of school+student records returned to 100 for each school_id listed in the WHERE clause, in order to avoid runaway queries where a given school has a large number of students. For instance: the latest id card issued for up to 50 of the most recently added students for school 1, the same for up to 50 of the most recently added students for school 2, and so on for each school listed in the query.

Is there a performant query that will accomplish this?

There are 2 answers below.

Answer by Erwin Brandstetter (accepted)

The mutilated relational model makes more efficient query techniques impossible.

The mentioned view is irrelevant for performance optimization. A view is basically just a stored query. It has no data nor indices of its own. A view is a convenience feature that doesn't help performance. (Not to be confused with a MATERIALIZED VIEW.)
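If stale results between refreshes are acceptable, a MATERIALIZED VIEW precomputing the latest card per school+student is one such option. A sketch, assuming the tables from the question (the view and index names are made up):

```sql
-- Hypothetical materialized view: latest card per school+student.
CREATE MATERIALIZED VIEW latest_cards AS
SELECT DISTINCT ON (s.school_id, s.student_id)
       s.school_id, s.student_id, c.id AS card_id
FROM   id_cards c
JOIN   school_students s ON s.id = c.school_student_id
ORDER  BY s.school_id, s.student_id, c.id DESC;

-- (school_id, card_id) is unique because card ids are globally unique.
CREATE UNIQUE INDEX latest_cards_uni ON latest_cards (school_id, card_id DESC);

-- Refresh on a schedule or after writes; CONCURRENTLY allows reads
-- during the refresh and requires the unique index above.
REFRESH MATERIALIZED VIEW CONCURRENTLY latest_cards;
```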

The main table id_cards only features school_student_id. There is no way to immediately deduce a school from this number. So there is also no way to add an index to speed up the query at hand. Indexes apply to one table in Postgres (and most other RDBMS). My initially suggested smart queries are useless without a matching index.
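For completeness, a sketch of how the model could be changed: redundantly store school_id on id_cards so that a single multicolumn index can serve the query. This is an assumption-laden example (keeping the copy consistent on future writes, e.g. via trigger, is left out):

```sql
-- Hypothetical denormalization: carry school_id on id_cards directly.
ALTER TABLE id_cards ADD COLUMN school_id character varying;

UPDATE id_cards c
SET    school_id = s.school_id
FROM   school_students s
WHERE  s.id = c.school_student_id;

ALTER TABLE id_cards ALTER COLUMN school_id SET NOT NULL;

-- Now an index can serve "latest cards per school" with index(-only) scans:
CREATE INDEX id_cards_school_card_idx ON id_cards (school_id, id DESC);
```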

That said, for the moderately few rows in your DB, even a "brute force" query with a (single!) sequential scan on both tables (the only remaining option unless you add a MATERIALIZED VIEW or change the relational model) is not that expensive. This should be as good as it gets:

SELECT card_id
FROM  (
   SELECT card_id
        , row_number() OVER (PARTITION BY school_id ORDER BY card_id DESC ROWS UNBOUNDED PRECEDING) AS rn
   FROM  (
      SELECT DISTINCT ON (s.school_id, s.student_id)
             s.school_id, c.id AS card_id
      FROM   id_cards c
      JOIN   school_students s ON s.id = c.school_student_id
      ORDER  BY s.school_id, s.student_id, c.id DESC
      ) sub1
   ) sub2
WHERE  rn <= 2; -- your limit

fiddle

Related:

About DISTINCT ON:

About the (optional) optimization with ROWS UNBOUNDED PRECEDING:

If we know that all result rows are among the N most recent entries, we could work with that, and optimize above query:

SELECT card_id
FROM  (
   SELECT card_id
        , row_number() OVER (PARTITION BY school_id ORDER BY card_id DESC ROWS UNBOUNDED PRECEDING) AS rn
   FROM  (
      SELECT DISTINCT ON (s.school_id, s.student_id)
             s.*, c.id AS card_id
      FROM  (
         SELECT *
         FROM   id_cards
         ORDER  BY id DESC
         LIMIT  1000         -- this covers the latest 2 for all schools (!)
         ) c
      JOIN   school_students s ON s.id = c.school_student_id
      ORDER  BY s.school_id, s.student_id, c.id DESC
      ) sub1
   ) sub2
WHERE  rn <= 2;

Note that the (highly unrealistic!) data distribution in your fiddle adds all the latest rows for a single school, so this optimization would not apply!

Also, the increased limit of 100 (up from 2) in your updated question makes this optimization a lot less juicy (or not applicable at all).

Reading only the top N rows from an index (and sorting) is proportionally faster than reading a couple hundred thousand rows. In this case (but not for the general query above), it would help to make the PK into a covering index and get index-only scans out of it like this:

ALTER TABLE id_cards
  DROP CONSTRAINT id_cards_pkey
, ADD  CONSTRAINT id_cards_pkey PRIMARY KEY (id) INCLUDE (school_student_id);

See:

Then again, selecting the "latest" entries based on a serial ID is unreliable to begin with. Concurrent write load and other internals can mess with the sequence. Such IDs are only guaranteed to be unique, not necessarily sequential. An added timestamp is more reasonable. (Plus rules how to break ties.)
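A minimal sketch of that, assuming a new column named created_at:

```sql
-- Record creation time; new rows get it automatically.
ALTER TABLE id_cards ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();

-- Then order by the timestamp, breaking ties with the unique id:
SELECT DISTINCT ON (s.school_id, s.student_id)
       s.school_id, c.id AS card_id
FROM   id_cards c
JOIN   school_students s ON s.id = c.school_student_id
ORDER  BY s.school_id, s.student_id, c.created_at DESC, c.id DESC;
```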

BTW, Postgres 16 (and to some extent already pg 15) has multiple improvements over Postgres 14 that help the performance of these queries. Compare:

fiddle

Among other things, ROWS UNBOUNDED PRECEDING doesn't help any more, because the optimization is built-in now. (Doesn't hurt, either.)

Outdated answer for initial question

(While still assuming a table school_id_cards.)

If your table is big you want to avoid expensive whole-table sequential scans. Use a smart query to pick qualifying rows with index(-only) scans from a matching index. Hugely faster.

Typically, there should exist some kind of "school" table in your DB with exactly one row per relevant school. Makes the query simpler and faster:

WITH RECURSIVE latest_card AS (
   SELECT c.*
   FROM   school s
   CROSS  JOIN LATERAL (
      SELECT c.school_id, c.card_id, ARRAY[c.student_id] AS leading_ids
      FROM   school_id_cards c
      WHERE  c.school_id = s.school_id
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) c

   UNION ALL
   SELECT c.*
   FROM   latest_card l
   JOIN   LATERAL (
      SELECT l.school_id, c.card_id, l.leading_ids || student_id
      FROM   school_id_cards c
      WHERE  c.school_id = l.school_id
      AND    c.card_id < l.card_id
      AND    c.student_id <> ALL (l.leading_ids)
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) c ON cardinality(l.leading_ids) < 2  -- your limit per school here!
   )
SELECT card_id
FROM   latest_card
ORDER  BY card_id;

fiddle

This scales well for a small limit per school like you demonstrated. For a large limit, I would switch to a different query.

About the use of a recursive CTE (rCTE):

Be sure to have a matching index like

CREATE INDEX ON school_id_cards (school_id DESC, card_id DESC);

A simpler index with (default) ascending sort order is barely any worse. Postgres can scan B-tree indices backwards. Only opposing sort order would be less ideal.

If there is no school table:

WITH RECURSIVE latest_card AS (
   (
   SELECT DISTINCT ON (school_id)
          school_id, card_id, ARRAY[student_id] AS leading_ids
   FROM   school_id_cards c
   ORDER  BY school_id DESC, card_id DESC
   )

   UNION ALL
   SELECT c.*
   FROM   latest_card l
   JOIN   LATERAL (
      SELECT l.school_id, c.card_id, l.leading_ids || student_id
      FROM   school_id_cards c
      WHERE  c.school_id = l.school_id
      AND    c.card_id < l.card_id
      AND    c.student_id <> ALL (l.leading_ids)
      ORDER  BY c.card_id DESC
      LIMIT  1
      ) c ON cardinality(l.leading_ids) < 2  -- your limit per school here!
   )
SELECT card_id
FROM   latest_card
ORDER  BY card_id;

About DISTINCT ON:

You could replace the non-recursive term with another, nested rCTE to generate the list of schools (possibly with the latest card to kick things off) ...
But there really should be a school table. Create it if you don't have it.
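A sketch of how such a table might be derived from the existing data (names are assumptions, and this section still assumes a table school_id_cards):

```sql
-- One row per relevant school.
CREATE TABLE school (
   school_id character varying PRIMARY KEY
);

-- Populate from existing data:
INSERT INTO school (school_id)
SELECT DISTINCT school_id
FROM   school_id_cards;
```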

Answer by Schwern

You can do this by building up a query of subqueries and dense_rank.

First, get the latest cards for each student/school. (Note: IDs are poor surrogates for time ordering. Add a datetime column.)

select 
  *,
  dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
from school_id_cards

Next, we use that as a subquery to get the order of the most recent cards issued to students by school. The most recent cards have student_card_order = 1.

with ordered_student_cards as (
  select 
    *,
    dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
  from school_id_cards
)
select
  *,
  dense_rank() over(partition by school_id order by card_id desc) as school_card_order
from ordered_student_cards
where student_card_order = 1

And, finally, we can fetch only the first two per school by filtering on school_card_order <= 2:

with ordered_student_cards as (
  select 
    *,
    dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
  from school_id_cards
), ordered_school_cards as (
  select
    *,
    dense_rank() over(partition by school_id order by card_id desc) as school_card_order
  from ordered_student_cards
  where student_card_order = 1
)
select card_id
from ordered_school_cards
where school_card_order <= 2;
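Since the question only wants specific school_ids, adding a WHERE clause to the first CTE lets the window functions operate on far fewer rows. A sketch, assuming school_id is text as in the question's schema (quote accordingly if it isn't):

```sql
with ordered_student_cards as (
  select
    *,
    dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
  from school_id_cards
  where school_id in ('1', '2', '3')  -- only the requested schools
), ordered_school_cards as (
  select
    *,
    dense_rank() over(partition by school_id order by card_id desc) as school_card_order
  from ordered_student_cards
  where student_card_order = 1
)
select card_id
from ordered_school_cards
where school_card_order <= 2;
```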

Demonstration.

There may be a more compact or performant way to do it, but window functions and subqueries are a way to break down complex queries.