Application I am working on has a feature called folders
. Folders may contain other folders, are owned by users and can be shared to other users.
The feature has been implemented and was working fine up until recently, when a particular API endpoint started to underperform. This API endpoint returns all folders owned by a calling user. Additionally, for each folder, a list of users who has access to the folder is returned (owning user would be one of them).
The schema of a folders table looks like this:
id
title
parent_id
path (of ltree type)
A table that contains sharing bits looks like this, a folder_visibility table:
id
folder_id
profile_id
Access to folder by a given profile_id is determined by 2 rules:
- a
(folder_id, profile_id)
record exists in folder_visibility table, - a folder is a descended of another folder, for which there a
(folder_id, profile_id)
record exists.
Alternatively, first and second rules can be expressed together like this: if folder was shared with a user, the user can see this folder, as well as all of its descendants.
Problem is that implementing rule 2 efficiently seems hard to me.
Example. Below is my thinking process, together with test data:
create table folders (id uuid, title text, path ltree);
create table folder_visibility (id uuid, profile_id uuid, folder_id uuid);
insert into folders (id, title, path) values ('9ca5ed28-83c3-45c3-98a1-31ebf0466e63', 'Folder 1', '9ca5ed28_83c3_45c3_98a1_31ebf0466e63');
insert into folders (id, title, path) values ('80ba97ae-b21b-49d7-aedb-3004cdee0c21', 'Folder 2', '9ca5ed28_83c3_45c3_98a1_31ebf0466e63.80ba97ae_b21b_49d7_aedb_3004cdee0c21');
insert into folders (id, title, path) values ('a7abb686-2a69-44d1-b2cd-60825478c4c8', 'Folder 3', '9ca5ed28_83c3_45c3_98a1_31ebf0466e63.80ba97ae_b21b_49d7_aedb_3004cdee0c21.a7abb686_2a69_44d1_b2cd_60825478c4c8');
insert into folder_visibility (id, profile_id, folder_id) values (gen_random_uuid(), 'e63533e4-eb10-48f6-ae69-ddf5f9242510', '80ba97ae-b21b-49d7-aedb-3004cdee0c21');
insert into folder_visibility (id, profile_id, folder_id) values (gen_random_uuid(), '7cbb5f24-61d0-430e-ab10-225554b0725b', '9ca5ed28-83c3-45c3-98a1-31ebf0466e63');
insert into profiles values ('7cbb5f24-61d0-430e-ab10-225554b0725b', '[email protected]');
insert into profiles values ('e63533e4-eb10-48f6-ae69-ddf5f9242510', '[email protected]');
create index on folder_visibility (profile_id);
create index on folder_visibility (folder_id);
create index on folders (folder_id);
create index on folders (id);
create index on profiles (id);
I fetch all folders for a given profile, this is relatively easy:
select f2.id folder_id, p.id profile_id, p.email from folders f1 join folders f2 on f2.path <@ f1.path join folder_visibility fv1 on fv1.folder_id = f1.id join profiles p on p.id = fv1.profile_id where fv1.profile_id = '7cbb5f24-61d0-430e-ab10-225554b0725b'; folder_id | profile_id | email --------------------------------------+--------------------------------------+------------------ 9ca5ed28-83c3-45c3-98a1-31ebf0466e63 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected] 80ba97ae-b21b-49d7-aedb-3004cdee0c21 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected] a7abb686-2a69-44d1-b2cd-60825478c4c8 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected] (3 rows)
From here, I complete rule 1 from above like this:
select f2.id folder_id, p.id profile_id, p.email from folders f1 join folders f2 on f2.path <@ f1.path join folder_visibility fv1 on fv1.folder_id = f1.id join folder_visibility fv2 on (fv2.folder_id = f1.id or fv2.folder_id = f2.id) join profiles p on p.id = fv2.profile_id where fv1.profile_id = '7cbb5f24-61d0-430e-ab10-225554b0725b'; folder_id | profile_id | email --------------------------------------+--------------------------------------+------------------ 9ca5ed28-83c3-45c3-98a1-31ebf0466e63 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected] 80ba97ae-b21b-49d7-aedb-3004cdee0c21 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected] a7abb686-2a69-44d1-b2cd-60825478c4c8 | 7cbb5f24-61d0-430e-ab10-225554b0725b | [email protected] 80ba97ae-b21b-49d7-aedb-3004cdee0c21 | e63533e4-eb10-48f6-ae69-ddf5f9242510 | [email protected] (4 rows)
To implement rule 2, I'm naively adding yet another set of joins and finally get a list of necessary rows:
select distinct on (f3.id, p.email) f3.id folder_id, p.email, fv3.profile_id from folders f1 join folders f2 on f2.path operator(public.<@) f1.path join folder_visibility fv1 on fv1.folder_id = f1.id join folder_visibility fv2 on (fv2.folder_id = f1.id or fv2.folder_id = f2.id) join folders f3 on f3.path operator(public.<@) f2.path join folder_visibility fv3 on (fv3.folder_id = f2.id or fv3.folder_id = f3.id) join profiles p on p.id = fv3.profile_id where fv1.profile_id = '7cbb5f24-61d0-430e-ab10-225554b0725b'; folder_id | email | profile_id --------------------------------------+------------------+-------------------------------------- 80ba97ae-b21b-49d7-aedb-3004cdee0c21 | [email protected] | 7cbb5f24-61d0-430e-ab10-225554b0725b 80ba97ae-b21b-49d7-aedb-3004cdee0c21 | [email protected] | e63533e4-eb10-48f6-ae69-ddf5f9242510 9ca5ed28-83c3-45c3-98a1-31ebf0466e63 | [email protected] | 7cbb5f24-61d0-430e-ab10-225554b0725b a7abb686-2a69-44d1-b2cd-60825478c4c8 | [email protected] | 7cbb5f24-61d0-430e-ab10-225554b0725b a7abb686-2a69-44d1-b2cd-60825478c4c8 | [email protected] | e63533e4-eb10-48f6-ae69-ddf5f9242510 (5 rows)
Problem with the last step in my example is that the execution plan gets quite enormous:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=7.38..7.39 rows=1 width=64) (actual time=0.103..0.111 rows=5 loops=1)
-> Sort (cost=7.38..7.39 rows=1 width=64) (actual time=0.103..0.106 rows=8 loops=1)
Sort Key: f3.id, p.email
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=0.00..7.37 rows=1 width=64) (actual time=0.031..0.087 rows=8 loops=1)
Join Filter: (fv3.profile_id = p.id)
Rows Removed by Join Filter: 8
-> Nested Loop (cost=0.00..6.33 rows=1 width=32) (actual time=0.028..0.066 rows=8 loops=1)
Join Filter: ((fv3.folder_id = f2.id) OR (fv3.folder_id = f3.id))
Rows Removed by Join Filter: 8
-> Nested Loop (cost=0.00..5.28 rows=1 width=32) (actual time=0.025..0.046 rows=8 loops=1)
Join Filter: (f3.path OPERATOR(public.<@) f2.path)
Rows Removed by Join Filter: 4
-> Nested Loop (cost=0.00..4.21 rows=1 width=48) (actual time=0.022..0.032 rows=4 loops=1)
Join Filter: ((fv2.folder_id = f1.id) OR (fv2.folder_id = f2.id))
Rows Removed by Join Filter: 2
-> Nested Loop (cost=0.00..3.16 rows=1 width=64) (actual time=0.018..0.022 rows=3 loops=1)
Join Filter: (f2.path OPERATOR(public.<@) f1.path)
-> Nested Loop (cost=0.00..2.09 rows=1 width=48) (actual time=0.014..0.016 rows=1 loops=1)
Join Filter: (f1.id = fv1.folder_id)
Rows Removed by Join Filter: 2
-> Seq Scan on folder_visibility fv1 (cost=0.00..1.02 rows=1 width=16) (actual time=0.009..0.010 rows=1 loops=1)
Filter: (profile_id = '7cbb5f24-61d0-430e-ab10-225554b0725b'::uuid)
Rows Removed by Filter: 1
-> Seq Scan on folders f1 (cost=0.00..1.03 rows=3 width=48) (actual time=0.002..0.003 rows=3 loops=1)
-> Seq Scan on folders f2 (cost=0.00..1.03 rows=3 width=48) (actual time=0.002..0.002 rows=3 loops=1)
-> Seq Scan on folder_visibility fv2 (cost=0.00..1.02 rows=2 width=16) (actual time=0.001..0.001 rows=2 loops=3)
-> Seq Scan on folders f3 (cost=0.00..1.03 rows=3 width=48) (actual time=0.001..0.001 rows=3 loops=4)
-> Seq Scan on folder_visibility fv3 (cost=0.00..1.02 rows=2 width=32) (actual time=0.001..0.001 rows=2 loops=8)
-> Seq Scan on profiles p (cost=0.00..1.02 rows=2 width=48) (actual time=0.001..0.001 rows=2 loops=8)
Planning Time: 0.588 ms
Execution Time: 0.151 ms
(32 rows)
On large data set (thousands of folders for a given profile) this query gets executed several seconds. What would be a more efficient way to grab all folders, together with all "visibilities" for each folder according to the two rules above?