How can I tell the SQL planner that a WHERE condition in the main query should be executed in the subqueries?

64 Views Asked by At

I am using Postgres 14.x.

I have a complex query that can be observed, next to it's EXPLAIN ANALYZE, here.

select *
from (select * from public.select_version_of_projects('2024-03-08T08:31:08.280Z')) as "project_version"
where "project_version"."id" = 'fd18211b-a400-49ed-a723-9648ab05ca4f';
CREATE OR REPLACE FUNCTION public.select_version_of_projects(version_p TIMESTAMP WITH TIME ZONE)
    RETURNS TABLE
            (
                id             UUID,
                folder_id      UUID,
                name           TEXT,
                description    TEXT,
                picture        TEXT,
                files          JSON,
                keywords       TSVECTOR,
                scene          JSON,
                view           JSON,
                credits        JSON,
                published      JSON,
                imports        JSON,
                settings       JSON,
                studio_version TEXT,
                readonly       BOOLEAN,
                deleted        BOOLEAN,
                created_at     TIMESTAMP WITH TIME ZONE,
                updated_at     TIMESTAMP WITH TIME ZONE
            )
AS
$$
    SELECT  COALESCE(past.id, present.id)                         AS id,
            COALESCE(past.folder_id, present.folder_id)           AS folder_id,
            COALESCE(past.name, present.name)                     AS name,
            COALESCE(past.description, present.description)       AS description,
            COALESCE(past.picture, present.picture)               AS picture,
            COALESCE(past.files, present.files)                   AS files,
            COALESCE(past.keywords, present.keywords)             AS keywords,
            COALESCE(past.scene, present.scene)                   AS scene,
            COALESCE(past.view, present.view)                     AS view,
            COALESCE(past.credits, present.credits)               AS credits,
            COALESCE(past.published, present.published)           AS published,
            COALESCE(past.imports, present.imports)               AS imports,
            COALESCE(past.settings, present.settings)             AS settings,
            COALESCE(past.studio_version, present.studio_version) AS studio_version,
            COALESCE(past.readonly, present.readonly)             AS readonly,
            COALESCE(past.deleted, present.deleted)               AS deleted,
            COALESCE(past.created_at, present.created_at)         AS created_at,
            COALESCE(past.updated_at, present.updated_at)         AS updated_at
    FROM (SELECT *
          FROM project) AS present
    FULL OUTER JOIN
          (SELECT history.project.id,
                  (ARRAY_AGG(history.project.folder_id ORDER BY history.project.recorded_at) FILTER (WHERE history.project.folder_id IS NOT NULL))[1]           AS folder_id,
                  (ARRAY_AGG(history.project.name ORDER BY history.project.recorded_at) FILTER (WHERE history.project.name IS NOT NULL))[1]                     AS name,
                  (ARRAY_AGG(history.project.description ORDER BY history.project.recorded_at) FILTER (WHERE history.project.description IS NOT NULL))[1]       AS description,
                  (ARRAY_AGG(history.project.picture ORDER BY history.project.recorded_at) FILTER (WHERE history.project.picture IS NOT NULL))[1]               AS picture,
                  (ARRAY_AGG(history.project.files ORDER BY history.project.recorded_at) FILTER (WHERE history.project.files IS NOT NULL))[1]                   AS files,
                  (ARRAY_AGG(history.project.keywords ORDER BY history.project.recorded_at) FILTER (WHERE history.project.keywords IS NOT NULL))[1]             AS keywords,
                  (ARRAY_AGG(history.project.scene ORDER BY history.project.recorded_at) FILTER (WHERE history.project.scene IS NOT NULL))[1]                   AS scene,
                  (ARRAY_AGG(history.project.view ORDER BY history.project.recorded_at) FILTER (WHERE history.project.view IS NOT NULL))[1]                     AS view,
                  (ARRAY_AGG(history.project.credits ORDER BY history.project.recorded_at) FILTER (WHERE history.project.credits IS NOT NULL))[1]               AS credits,
                  (ARRAY_AGG(history.project.published ORDER BY history.project.recorded_at) FILTER (WHERE history.project.published IS NOT NULL))[1]           AS published,
                  (ARRAY_AGG(history.project.imports ORDER BY history.project.recorded_at) FILTER (WHERE history.project.imports IS NOT NULL))[1]               AS imports,
                  (ARRAY_AGG(history.project.settings ORDER BY history.project.recorded_at) FILTER (WHERE history.project.settings IS NOT NULL))[1]             AS settings,
                  (ARRAY_AGG(history.project.studio_version ORDER BY history.project.recorded_at) FILTER (WHERE history.project.studio_version IS NOT NULL))[1] AS studio_version,
                  (ARRAY_AGG(history.project.readonly ORDER BY history.project.recorded_at) FILTER (WHERE history.project.readonly IS NOT NULL))[1]             AS readonly,
                  (ARRAY_AGG(history.project.deleted ORDER BY history.project.recorded_at) FILTER (WHERE history.project.deleted IS NOT NULL))[1]               AS deleted,
                  (ARRAY_AGG(history.project.created_at ORDER BY history.project.recorded_at) FILTER (WHERE history.project.created_at IS NOT NULL))[1]         AS created_at,
                  (ARRAY_AGG(history.project.updated_at ORDER BY history.project.recorded_at) FILTER (WHERE history.project.updated_at IS NOT NULL))[1]         AS updated_at
          FROM history.project
          WHERE history.project.recorded_at > version_p
          GROUP BY history.project.id) AS past
    ON present.id = past.id
    WHERE COALESCE(past.created_at, present.created_at) < version_p;
$$
    LANGUAGE sql
    STABLE;

The purpose of the history table is to record the changes to each row in the public table so that we can always know the past states. And the function basically gives you a table representing the state of the public table at the timestamp provided.

Calculating what the whole table looked like in the past is very expensive, but my intention is that by using the function alongside a filter for the ID of the row I want then just that row will be calculated. But that is not happening.

It is very slow. It should take a few milliseconds. But instead of executing the filter "id" = 'fd18211b-a400-49ed-a723-9648ab05ca4f' at the first scans to vastly reduce the number of rows fetched so that the aggregation node would be much faster, it is doing it at the end.

If I replace COALESCE(past.id, present.id) AS id to either past.id AS id or present.id AS id, it's fix as can be seen here, but it's not correct because past.id or present.id could be NULL.

But if past.id is NULL it means all of past.* is NULL. Same with present. So it is valid to apply the filter to the subqueries still.

So how can I keep the COALESCE while informing the planner that it should apply the filter to the subqueries?

1

There are 1 best solutions below

2
Erwin Brandstetter On

Unless Postgres can inline the function in the calling query (which is often not possible for a set-returning function), it is a black box to the query planner. Everything inside the function is planned and executed separately from the outer query, and the result is materialized before proceeding with the outer query. A worst-case scenario for you.

To apply the all-important filter on ID as early as possible, pass it to the function, and apply it at the root(s).

Plus, don't array_agg() all rows, when you just need the latest not-null value. I show a solution using UNION ALL instead of your FULL OUTER JOIN, order once in the subquery, and aggregate with the function first() from the additional first_last_agg module, which you must have installed first.
The extension is not available on Amazon RDS Postgres. But you can create the aggregate function yourself. It's simple, see:

Details and alternatives in this related answer:

CREATE OR REPLACE FUNCTION public.select_version_of_project(_id uuid, _ts timestamptz)
  RETURNS TABLE (
    id             uuid
  , folder_id      uuid
  , name           text
--, etc.

  )
  LANGUAGE sql STABLE AS
$func$
SELECT p.id
     , first(p.folder_id) FILTER (WHERE p.folder_id IS NOT NULL) AS folder_id
     , first(p.name)      FILTER (WHERE p.name IS NOT NULL)      AS name
--   , etc.
FROM  (
   SELECT *
   FROM   history.project h
   WHERE  h.id = _id
   AND    h.updated_at <= _ts  -- ?

   UNION ALL
   SELECT *
   FROM   public.project p
   WHERE  p.id = _id
   AND    p.updated_at <= _ts  -- ?

   ORDER  BY updated_at DESC  -- !! I assume you want the latest state
   ) p
GROUP  BY 1;
$func$;

Call:

SELECT *
FROM   public.select_version_of_project ('fd18211b-a400-49ed-a723-9648ab05ca4f'
                                       , '2024-03-08 08:31:08.280+0')  -- timestamptz !

Your filters on created_at and updated_at were not all clear to me. You may have to do more there.

Be sure to have a multicolumn index on history.project(id, updated_at)!
Not sure about public.project - there an index on just (id) should suffice, and should already exist as PK index.

Pass a timestamptz value as declared or you may get surprising conversions.