Best way to get distinct count from a query joining two tables


I have 2 tables, table A & table B.

Table A (has thousands of rows)

  • id
  • uuid
  • name
  • type
  • created_by
  • org_id

Table B (has a max of hundred rows)

  • org_id
  • org_name

I am trying to find the best join query to get a count with a WHERE clause. I need the count of distinct created_by values from table A for rows whose org_name in table B contains 'myorg'. I currently have the query below (which produces the expected results) and wonder if it can be optimized further.

select count(distinct a.created_by)
from a
left join b on a.org_id = b.org_id
where b.org_name like '%myorg%';

There are 3 answers below.

Accepted answer:

You don't need a left join:

select count(distinct a.created_by)
from a
join b on a.org_id = b.org_id
where b.org_name like '%myorg%';

For this query, you want an index on b.org_id, which I assume you have.
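As a quick sanity check, the inner-join form can be exercised end to end. The snippet below is a minimal sketch using SQLite via Python's sqlite3 with made-up sample rows (table contents and index name are hypothetical; Postgres behaves the same for this query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, created_by TEXT, org_id INTEGER);
CREATE TABLE b (org_id INTEGER, org_name TEXT);
CREATE INDEX b_org_id ON b (org_id);   -- the index the answer refers to

-- hypothetical sample data
INSERT INTO a VALUES (1, 'alice', 10), (2, 'bob', 10),
                     (3, 'alice', 10), (4, 'carol', 20);
INSERT INTO b VALUES (10, 'myorg-east'), (20, 'otherco');
""")

# Inner join: only rows of a whose org matches '%myorg%' are counted.
(n,) = conn.execute("""
    SELECT count(DISTINCT a.created_by)
    FROM a
    JOIN b ON a.org_id = b.org_id
    WHERE b.org_name LIKE '%myorg%'
""").fetchone()
print(n)  # -> 2 (alice and bob, both in org 10)
```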

Answer:

I would use exists for this:

select count(distinct a.created_by)
from a
where exists (select 1 from b
              where b.org_id = a.org_id
                and b.org_name like '%myorg%');

An index on b(org_id) would help. But in terms of performance, the key points are:

  • searching using like with a wildcard on both sides is not good for performance (this cannot take advantage of an index); it would be far better to search for an exact match, or at least to not have a wildcard on the left side of the string.

  • count(distinct ...) is more expensive than a regular count(); if you don't really need distinct, then don't use it.
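The exists variant can be checked the same way; again a minimal sketch with SQLite via Python's sqlite3 and hypothetical sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, created_by TEXT, org_id INTEGER);
CREATE TABLE b (org_id INTEGER, org_name TEXT);

-- hypothetical sample data
INSERT INTO a VALUES (1,'alice',10), (2,'bob',10), (3,'alice',10), (4,'carol',20);
INSERT INTO b VALUES (10,'myorg-east'), (20,'otherco');
""")

# Semi-join via EXISTS: a row of a qualifies as soon as one matching b row is found.
(n,) = conn.execute("""
    SELECT count(DISTINCT a.created_by)
    FROM a
    WHERE EXISTS (SELECT 1 FROM b
                  WHERE b.org_id = a.org_id
                    AND b.org_name LIKE '%myorg%')
""").fetchone()
print(n)  # -> 2, same result as the join form
```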

Answer:

Your query already looks good. Use a plain [INNER] JOIN instead of LEFT [OUTER] JOIN, like Gordon suggested. But that won't change much.

You mention that table B has only ...

a max of hundred rows

while table A has ...

thousands of rows

If there are many rows per created_by (which I'd expect), then there is potential for an emulated index skip scan.
(The need to emulate it might go away in one of the coming Postgres versions.)

Essential ingredient is this multicolumn index:

CREATE INDEX ON a (org_id, created_by);

It can replace a simple index on just (org_id) and works for your simple query as well.

There are two complications for your case:

  1. DISTINCT
  2. 0-n org_id resulting from org_name like '%myorg%'

So the optimization is harder to implement. But still possible with some fancy SQL:

SELECT count(DISTINCT created_by)  -- does not count NULL (as desired)
FROM   b
CROSS  JOIN LATERAL (
   WITH RECURSIVE t AS (
      (  -- parentheses required
      SELECT created_by
      FROM   a
      WHERE  org_id = b.org_id
      ORDER  BY created_by
      LIMIT 1
      )
      UNION ALL
      SELECT (SELECT created_by
              FROM   a
              WHERE  org_id = b.org_id
              AND    created_by > t.created_by
              ORDER  BY created_by
              LIMIT  1)
      FROM   t
      WHERE  t.created_by IS NOT NULL  -- stop recursion
      )
   TABLE t
   ) a
WHERE  b.org_name LIKE '%myorg%';

db<>fiddle demo (Postgres 12, but it works in Postgres 9.6 as well).

That's a recursive CTE in a LATERAL subquery, using a correlated subquery.

It utilizes the multicolumn index from above to retrieve only a single row for every (org_id, created_by), with index-only scans if the table is vacuumed enough.

The main objective of the sophisticated SQL is to completely avoid a sequential scan (or even a bitmap index scan) on the big table and only read very few fast index tuples.

Due to the added overhead it can be a bit slower for an unfavorable data distribution (many org_id values and/or only a few rows per created_by). But it's much faster under favorable conditions and scales excellently, even for millions of rows. You'll have to test to find the sweet spot.
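The per-org skip-scan core of that recursive CTE can be illustrated without LATERAL by fixing a single org_id. Below is a sketch in SQLite via Python's sqlite3 (sample data is made up, and min(...) stands in for the ORDER BY ... LIMIT 1 subqueries of the Postgres version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, created_by TEXT, org_id INTEGER);
CREATE INDEX a_org_created ON a (org_id, created_by);  -- the multicolumn index

-- hypothetical sample data
INSERT INTO a VALUES (1,'alice',10), (2,'bob',10), (3,'alice',10), (4,'carol',20);
""")

# Skip scan for org_id = 10: each recursion step jumps to the next
# distinct created_by, touching one index entry per distinct value.
(n,) = conn.execute("""
WITH RECURSIVE t(created_by) AS (
   SELECT min(created_by) FROM a WHERE org_id = 10
   UNION ALL
   SELECT (SELECT min(created_by) FROM a
           WHERE org_id = 10 AND created_by > t.created_by)
   FROM t
   WHERE t.created_by IS NOT NULL   -- stop once the subquery returns NULL
)
SELECT count(created_by) FROM t    -- count() skips the terminating NULL row
""").fetchone()
print(n)  # -> 2 distinct created_by values for org 10 (alice, bob)
```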
