Best way to get distinct count from a query joining two tables


I have 2 tables, table A & table B.

Table A (has thousands of rows)

  • id
  • uuid
  • name
  • type
  • created_by
  • org_id

Table B (has a max of hundred rows)

  • org_id
  • org_name

I am trying to find the best join query to get a count with a WHERE clause. I need the count of distinct created_by values from table A for rows whose org_name in table B contains 'myorg'. I currently have the query below (which produces the expected results) and wonder if it can be optimized further.

select count(distinct a.created_by)
from a
left join b on a.org_id = b.org_id
where b.org_name like '%myorg%';

There are 3 answers below.

Accepted answer:

You don't need a left join:

select count(distinct a.created_by)
from a
join b on a.org_id = b.org_id
where b.org_name like '%myorg%';

For this query, you want an index on b.org_id, which I assume you have.
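As a quick sanity check, the inner-join form can be exercised end to end. The snippet below is a minimal sketch using SQLite via Python's sqlite3 with made-up sample rows (table contents and index name are hypothetical; Postgres behaves the same for this query):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, created_by TEXT, org_id INTEGER);
CREATE TABLE b (org_id INTEGER, org_name TEXT);
CREATE INDEX b_org_id ON b (org_id);   -- the index the answer refers to

-- hypothetical sample data
INSERT INTO a VALUES (1, 'alice', 10), (2, 'bob', 10),
                     (3, 'alice', 10), (4, 'carol', 20);
INSERT INTO b VALUES (10, 'myorg-east'), (20, 'otherco');
""")

# Inner join: only rows of a whose org matches '%myorg%' are counted.
(n,) = conn.execute("""
    SELECT count(DISTINCT a.created_by)
    FROM a
    JOIN b ON a.org_id = b.org_id
    WHERE b.org_name LIKE '%myorg%'
""").fetchone()
print(n)  # -> 2 (alice and bob, both in org 10)
```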

Answer:

I would use exists for this:

select count(distinct a.created_by)
from a
where exists (select 1 from b
              where b.org_id = a.org_id
                and b.org_name like '%myorg%');

An index on b(org_id) would help. But in terms of performance, the key points are:

  • searching using like with a wildcard on both sides is not good for performance (this cannot take advantage of an index); it would be far better to search for an exact match, or at least to not have a wildcard on the left side of the string.

  • count(distinct ...) is more expensive than a regular count(); if you don't really need distinct, then don't use it.
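The exists variant can be checked the same way; again a minimal sketch with SQLite via Python's sqlite3 and hypothetical sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, created_by TEXT, org_id INTEGER);
CREATE TABLE b (org_id INTEGER, org_name TEXT);

-- hypothetical sample data
INSERT INTO a VALUES (1,'alice',10), (2,'bob',10), (3,'alice',10), (4,'carol',20);
INSERT INTO b VALUES (10,'myorg-east'), (20,'otherco');
""")

# Semi-join via EXISTS: a row of a qualifies as soon as one matching b row is found.
(n,) = conn.execute("""
    SELECT count(DISTINCT a.created_by)
    FROM a
    WHERE EXISTS (SELECT 1 FROM b
                  WHERE b.org_id = a.org_id
                    AND b.org_name LIKE '%myorg%')
""").fetchone()
print(n)  # -> 2, same result as the join form
```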

Answer:

Your query already looks good. Use a plain [INNER] JOIN instead of LEFT [OUTER] JOIN, like Gordon suggested. But that won't change much.

You mention that table B has only ...

a max of hundred rows

while table A has ...

thousands of rows

If there are many rows per created_by (which I'd expect), then there is potential for an emulated index skip scan.
(The need to emulate it might go away in one of the coming Postgres versions.)

Essential ingredient is this multicolumn index:

CREATE INDEX ON a (org_id, created_by);

It can replace a simple index on just (org_id) and works for your simple query as well.

There are two complications for your case:

  1. DISTINCT
  2. 0-n org_id resulting from org_name like '%myorg%'

So the optimization is harder to implement. But still possible with some fancy SQL:

SELECT count(DISTINCT created_by)  -- does not count NULL (as desired)
FROM   b
CROSS  JOIN LATERAL (
   WITH RECURSIVE t AS (
      (  -- parentheses required
      SELECT created_by
      FROM   a
      WHERE  org_id = b.org_id
      ORDER  BY created_by
      LIMIT 1
      )
      UNION ALL
      SELECT (SELECT created_by
              FROM   a
              WHERE  org_id = b.org_id
              AND    created_by > t.created_by
              ORDER  BY created_by
              LIMIT  1)
      FROM   t
      WHERE  t.created_by IS NOT NULL  -- stop recursion
      )
   TABLE t
   ) a
WHERE  b.org_name LIKE '%myorg%';

db<>fiddle demo (Postgres 12, but it works in Postgres 9.6 as well).

That's a recursive CTE in a LATERAL subquery, using a correlated subquery.

It utilizes the multicolumn index from above to retrieve only a single row for every (org_id, created_by), with index-only scans if the table is vacuumed enough.

The main objective of the sophisticated SQL is to completely avoid a sequential scan (or even a bitmap index scan) on the big table and only read very few fast index tuples.

Due to the added overhead it can be a bit slower for an unfavorable data distribution (many org_id values and/or only a few rows per created_by). But it's much faster under favorable conditions and scales excellently, even for millions of rows. You'll have to test to find the sweet spot.
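The per-org skip-scan core of that recursive CTE can be illustrated without LATERAL by fixing a single org_id. Below is a sketch in SQLite via Python's sqlite3 (sample data is made up, and min(...) stands in for the ORDER BY ... LIMIT 1 subqueries of the Postgres version):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, created_by TEXT, org_id INTEGER);
CREATE INDEX a_org_created ON a (org_id, created_by);  -- the multicolumn index

-- hypothetical sample data
INSERT INTO a VALUES (1,'alice',10), (2,'bob',10), (3,'alice',10), (4,'carol',20);
""")

# Skip scan for org_id = 10: each recursion step jumps to the next
# distinct created_by, touching one index entry per distinct value.
(n,) = conn.execute("""
WITH RECURSIVE t(created_by) AS (
   SELECT min(created_by) FROM a WHERE org_id = 10
   UNION ALL
   SELECT (SELECT min(created_by) FROM a
           WHERE org_id = 10 AND created_by > t.created_by)
   FROM t
   WHERE t.created_by IS NOT NULL   -- stop once the subquery returns NULL
)
SELECT count(created_by) FROM t    -- count() skips the terminating NULL row
""").fetchone()
print(n)  # -> 2 distinct created_by values for org 10 (alice, bob)
```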
