Problem with optimizing an lpath-dependent query

56 Views Asked by At

Application I am working on has a feature called folders. Folders may contain other folders, are owned by users and can be shared to other users.

The feature has been implemented and was working fine up until recently, when a particular API endpoint started to underperform. This API endpoint returns all folders owned by a calling user. Additionally, for each folder, a list of users who has access to the folder is returned (owning user would be one of them).

The schema of a folders table looks like this:

id
title
parent_id
path (of ltree type)

A table that contains sharing bits looks like this, a folder_visibility table:

id
folder_id
profile_id

Access to folder by a given profile_id is determined by 2 rules:

  1. a (folder_id, profile_id) record exists in folder_visibility table,
  2. a folder is a descended of another folder, for which there a (folder_id, profile_id) record exists.

Alternatively, first and second rules can be expressed together like this: if folder was shared with a user, the user can see this folder, as well as all of its descendants.

Problem is that implementing rule 2 efficiently seems hard to me.

Example. Below is my thinking process, together with test data:

create table folders (id uuid, title text, path ltree);
create table folder_visibility (id uuid, profile_id uuid, folder_id uuid);
insert into folders (id, title, path) values ('9ca5ed28-83c3-45c3-98a1-31ebf0466e63', 'Folder 1', '9ca5ed28_83c3_45c3_98a1_31ebf0466e63');
insert into folders (id, title, path) values ('80ba97ae-b21b-49d7-aedb-3004cdee0c21', 'Folder 2', '9ca5ed28_83c3_45c3_98a1_31ebf0466e63.80ba97ae_b21b_49d7_aedb_3004cdee0c21');
insert into folders (id, title, path) values ('a7abb686-2a69-44d1-b2cd-60825478c4c8', 'Folder 3', '9ca5ed28_83c3_45c3_98a1_31ebf0466e63.80ba97ae_b21b_49d7_aedb_3004cdee0c21.a7abb686_2a69_44d1_b2cd_60825478c4c8');
insert into folder_visibility (id, profile_id, folder_id) values (gen_random_uuid(), 'e63533e4-eb10-48f6-ae69-ddf5f9242510', '80ba97ae-b21b-49d7-aedb-3004cdee0c21');
insert into folder_visibility (id, profile_id, folder_id) values (gen_random_uuid(), '7cbb5f24-61d0-430e-ab10-225554b0725b', '9ca5ed28-83c3-45c3-98a1-31ebf0466e63');
insert into profiles values ('7cbb5f24-61d0-430e-ab10-225554b0725b', '[email protected]');
insert into profiles values ('e63533e4-eb10-48f6-ae69-ddf5f9242510', '[email protected]');
create index on folder_visibility (profile_id);
create index on folder_visibility (folder_id);
create index on folders (folder_id);
create index on folders (id);
create index on profiles (id);
  1. I fetch all folders for a given profile, this is relatively easy:

    select f2.id folder_id,
           p.id profile_id,
           p.email
      from folders f1
      join folders f2 on f2.path <@ f1.path
      join folder_visibility fv1 on fv1.folder_id = f1.id
      join profiles p on p.id = fv1.profile_id
     where fv1.profile_id = '7cbb5f24-61d0-430e-ab10-225554b0725b';
                  folder_id               |              profile_id              |      email
    --------------------------------------+--------------------------------------+------------------
     9ca5ed28-83c3-45c3-98a1-31ebf0466e63 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected]
     80ba97ae-b21b-49d7-aedb-3004cdee0c21 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected]
     a7abb686-2a69-44d1-b2cd-60825478c4c8 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected]
    (3 rows)
    
  2. From here, I complete rule 1 from above like this:

    select f2.id folder_id,
           p.id profile_id,
           p.email
      from folders f1
      join folders f2 on f2.path <@ f1.path
      join folder_visibility fv1 on fv1.folder_id = f1.id
      join folder_visibility fv2 on (fv2.folder_id = f1.id or fv2.folder_id = f2.id)
      join profiles p on p.id = fv2.profile_id
     where fv1.profile_id = '7cbb5f24-61d0-430e-ab10-225554b0725b';
                  folder_id               |              profile_id              |      email
    --------------------------------------+--------------------------------------+------------------
     9ca5ed28-83c3-45c3-98a1-31ebf0466e63 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected]
     80ba97ae-b21b-49d7-aedb-3004cdee0c21 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected]
     a7abb686-2a69-44d1-b2cd-60825478c4c8 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected]
     80ba97ae-b21b-49d7-aedb-3004cdee0c21 | e63533e4-eb10-48f6-ae69-ddf5f9242510 | [email protected]
    (4 rows)
    
  3. To implement rule 2, I'm naively adding yet another set of joins and finally get a list of necessary rows:

    select distinct on (f3.id, p.email)
           f3.id folder_id,
           p.email,
           fv3.profile_id
      from folders f1
      join folders f2 on f2.path operator(public.<@) f1.path
      join folder_visibility fv1 on fv1.folder_id = f1.id
      join folder_visibility fv2 on (fv2.folder_id = f1.id or fv2.folder_id = f2.id)
      join folders f3 on f3.path operator(public.<@) f2.path
      join folder_visibility fv3 on (fv3.folder_id = f2.id or fv3.folder_id = f3.id)
      join profiles p on p.id = fv3.profile_id
     where fv1.profile_id = '7cbb5f24-61d0-430e-ab10-225554b0725b';
                  folder_id               |      email       |              profile_id
    --------------------------------------+------------------+--------------------------------------
     80ba97ae-b21b-49d7-aedb-3004cdee0c21 | [email protected] | 7cbb5f24-61d0-430e-ab10-225554b0725b
     80ba97ae-b21b-49d7-aedb-3004cdee0c21 | [email protected]  | e63533e4-eb10-48f6-ae69-ddf5f9242510
     9ca5ed28-83c3-45c3-98a1-31ebf0466e63 | [email protected] | 7cbb5f24-61d0-430e-ab10-225554b0725b
     a7abb686-2a69-44d1-b2cd-60825478c4c8 | [email protected] | 7cbb5f24-61d0-430e-ab10-225554b0725b
     a7abb686-2a69-44d1-b2cd-60825478c4c8 | [email protected]  | e63533e4-eb10-48f6-ae69-ddf5f9242510
    (5 rows)
    

Problem with the last step in my example is that the execution plan gets quite enormous:

                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=7.38..7.39 rows=1 width=64) (actual time=0.103..0.111 rows=5 loops=1)
   ->  Sort  (cost=7.38..7.39 rows=1 width=64) (actual time=0.103..0.106 rows=8 loops=1)
         Sort Key: f3.id, p.email
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=0.00..7.37 rows=1 width=64) (actual time=0.031..0.087 rows=8 loops=1)
               Join Filter: (fv3.profile_id = p.id)
               Rows Removed by Join Filter: 8
               ->  Nested Loop  (cost=0.00..6.33 rows=1 width=32) (actual time=0.028..0.066 rows=8 loops=1)
                     Join Filter: ((fv3.folder_id = f2.id) OR (fv3.folder_id = f3.id))
                     Rows Removed by Join Filter: 8
                     ->  Nested Loop  (cost=0.00..5.28 rows=1 width=32) (actual time=0.025..0.046 rows=8 loops=1)
                           Join Filter: (f3.path OPERATOR(public.<@) f2.path)
                           Rows Removed by Join Filter: 4
                           ->  Nested Loop  (cost=0.00..4.21 rows=1 width=48) (actual time=0.022..0.032 rows=4 loops=1)
                                 Join Filter: ((fv2.folder_id = f1.id) OR (fv2.folder_id = f2.id))
                                 Rows Removed by Join Filter: 2
                                 ->  Nested Loop  (cost=0.00..3.16 rows=1 width=64) (actual time=0.018..0.022 rows=3 loops=1)
                                       Join Filter: (f2.path OPERATOR(public.<@) f1.path)
                                       ->  Nested Loop  (cost=0.00..2.09 rows=1 width=48) (actual time=0.014..0.016 rows=1 loops=1)
                                             Join Filter: (f1.id = fv1.folder_id)
                                             Rows Removed by Join Filter: 2
                                             ->  Seq Scan on folder_visibility fv1  (cost=0.00..1.02 rows=1 width=16) (actual time=0.009..0.010 rows=1 loops=1)
                                                   Filter: (profile_id = '7cbb5f24-61d0-430e-ab10-225554b0725b'::uuid)
                                                   Rows Removed by Filter: 1
                                             ->  Seq Scan on folders f1  (cost=0.00..1.03 rows=3 width=48) (actual time=0.002..0.003 rows=3 loops=1)
                                       ->  Seq Scan on folders f2  (cost=0.00..1.03 rows=3 width=48) (actual time=0.002..0.002 rows=3 loops=1)
                                 ->  Seq Scan on folder_visibility fv2  (cost=0.00..1.02 rows=2 width=16) (actual time=0.001..0.001 rows=2 loops=3)
                           ->  Seq Scan on folders f3  (cost=0.00..1.03 rows=3 width=48) (actual time=0.001..0.001 rows=3 loops=4)
                     ->  Seq Scan on folder_visibility fv3  (cost=0.00..1.02 rows=2 width=32) (actual time=0.001..0.001 rows=2 loops=8)
               ->  Seq Scan on profiles p  (cost=0.00..1.02 rows=2 width=48) (actual time=0.001..0.001 rows=2 loops=8)
 Planning Time: 0.588 ms
 Execution Time: 0.151 ms
(32 rows)

On large data set (thousands of folders for a given profile) this query gets executed several seconds. What would be a more efficient way to grab all folders, together with all "visibilities" for each folder according to the two rules above?

0

There are 0 best solutions below