Return only one element per unique ID with only latest joined record

161 Views Asked by At

Suppose the following:

create schema bv;
create table bv.user(id bigint primary key);
create table bv.user_photo (
  id bigint primary key,
  url varchar(255) not null,
  user_id bigint references bv.user(id)
);

insert into bv.user values (100), (101);
insert into bv.user_photo values
  (1, 'https://1.com', 100),
  (3, 'https://3.com', 100),
  (4, 'https://4.com', 101),
  (2, 'https://2.com', 100);

I'd like to query for and build an object for every user, and include only the latest image in the result.

Here's what I have:

select
  json_build_object(
    'id', u.id,
    'latest_image', up.url
  ) user
from bv.user u
left join bv.user_photo up
  on u.id = up.user_id

However this returns:

[
  {"id" : 100, "url" : "https://2.com"},
  {"id" : 100, "url" : "https://3.com"},
  {"id" : 100, "url" : "https://1.com"},
  {"id" : 101, "url" : "https://4.com"}
]

However, the expected result is:

[
  {"id" : 100, "url" : "https://3.com"},
  {"id" : 101, "url" : "https://4.com"}
]

I've tried using distinct:

select distinct on(u.id)
  json_build_object(
    'id', u.id,
    'url', up.url
  ) user
from bv.user u
left join bv.user_photo up
  on u.id = up.user_id
order by u.id, up.id DESC

But my question is whether or not this is the correct approach? I feel like I shouldn't be using distinct in such a situation.

2

There are 2 best solutions below

3
Erwin Brandstetter On BEST ANSWER

With few photos per user, and while you return all (or most) users, DISTINCT ON is the best approach.
But it's typically faster to get distinct photos before you join:

SELECT json_build_object('id', u.id, 'url', p.url) AS "user"
FROM   users u
LEFT   JOIN (
   SELECT DISTINCT ON (user_id)
          user_id, url
   FROM   user_photo
   ORDER  BY user_id, id DESC
   ) p ON p.user_id = u.id;

For many photos per user, an emulated index-skip scan is (much) faster. See:

Summary array

To produce one summary array, you can skip json_build_object(). json_agg() can aggregate the row directly:

SELECT json_agg(sub) AS "users"
FROM  (
   SELECT u.id, p.url
   FROM   users u
   LEFT   JOIN (
      SELECT DISTINCT ON (user_id)
             user_id, url
      FROM   user_photo
      ORDER  BY user_id, id DESC
      ) p ON p.user_id = u.id
   ) sub;

Notably, all queries so far include the key "url" with a null value where no photo is found. You may want to strip the noise:

SELECT json_strip_nulls(json_agg(sub)) AS "users"
FROM  ...

For a small selection of users

This seems to be your use case.

A LATERAL subquery is typically faster. Also deals with many photos per user efficiently!

SELECT json_build_object('id', u.id, 'url', p.url) AS "user"
FROM   users u
LEFT   JOIN LATERAL (
   SELECT up.url
   FROM   user_photo up
   WHERE  up.user_id = u.id
   ORDER  BY up.id DESC
   LIMIT  1
   ) p ON true
WHERE  u.id IN (100, 101);  -- small selection

Summary array for a small selection, skipping null values

SELECT json_strip_nulls(json_agg(sub)) AS "users"
FROM  (
   SELECT u.id, p.url
   FROM   users u
   LEFT   JOIN LATERAL (
      SELECT up.url
      FROM   user_photo up
      WHERE  up.user_id = u.id
      ORDER  BY up.id DESC
      LIMIT  1
      ) p ON true
   WHERE  u.id IN (100, 101)  -- small selection
   ) sub;

fiddle

Index

DISTINCT ON does not need an index. All other queries absolutely need an index on user_photo(user_id, id). Or even, ideally:

CREATE INDEX ON user_photo (user_id, id DESC) INCLUDE (url);

Aside: Don't use the reserved word "user" as identifier. It works while schema-qualified, but fails without.

0
Zegarek On

This is the correct approach and you by all means should be using distinct on in situations like this. Plain distinct - no, distinct on - yes:

SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal.

One thing I'm missing here is that you're expecting a single array of those objects, but your query returns each object in a separate row. You need a json_agg() to collect them: demo

select json_agg("user" order by id) as "user_array"
from(select distinct on(u.id) u.id,
      json_build_object('id', u.id,
                        'url', up.url) as "user"
     from bv.user u 
     left join bv.user_photo up
            on u.id = up.user_id
     order by u.id, up.id DESC) subquery;
user_array
[ {"id" : 100, "url" : "https://3.com"}, {"id" : 101, "url" : "https://4.com"} ]