I am trying to fix a performance problem that I am having with a query in PostgreSQL.
General concept being modeled: Software licences being purchased and assigned. I think I have removed enough of the other things being modeled that it is now very similar to a standard hotel room booking system, except that it is normal for a current assignment (hotel booking) to have no known end date.
Query's purpose: This is a view that assembles the information necessary to display information about a licence, and the purchase it came from. When the application queries the view, it provides a tLicence.id
, so that one row is returned.
Non-hotel-like concepts remaining in the query:
- Some licencing agreements limit how quickly software can be re-assigned; this has been hard-coded into the query as 1 day.
- It is theoretically possible for a licence to have concurrent assignments, both past and present; this should not happen, and the application discourages it, but if humans made an error in the real world, the application does allow that error to be entered into the system. This is clearly different from a normal hotel system, where if guests walked into the wrong room, the current occupants would object.
The nested SELECT
with the alias purchase_quantities_assignnments
is a view in the database (inlined here for convenience). Ideally, I would like any fix to my performance issue to not require that a modified version of this view be inlined into the query; ideally, the view can continue to exist mostly as-is, and be used in other ways in other queries.
Problem
If I query this view (the query) with WHERE tLicences.id = 19
, the result takes a long time to generate. In particular, it seems to be generating the entire set for periodsOfAvailability_start
(which is slow), and then joining; this conclusion is based on the EXPLAIN ANALYZE GroupAggregate
returning 10 rows (which is the number of purchases). I feel like query planner should be able to figure out that the tAssignments.purchase_id
could be used to significantly reduce how much of periodsOfAvailability_start
needs to be generated.
However, if I query this view (the query) with WHERE tLicences.id = 19 AND tLicences.purchase_id = ?
[?
being that licence's purchase ID], then the query runs as expected, generating only the set for periodsOfAvailability_start
that has that purchase ID; this conclusion is based on the EXPLAIN ANALYZE GroupAggregate
returning 1 row (which is the number of purchases to which the licence belongs).
Query
SELECT *
FROM test.tPurchases AS tPurchases
INNER JOIN test.tLicences
ON tLicences.purchase_id = tPurchases.id
LEFT JOIN (
SELECT
purchase_id,
SUM(
CASE
assignment_newer_id IS NOT null
WHEN true THEN 1
WHEN false THEN 0
END
) AS prchs_quantity_assigned,
SUM(
CASE
assignment_newer_id IS null AND
current_timestamp
BETWEEN licence_availability_start AND
licence_availability_end
WHEN true THEN 1
WHEN false THEN 0
END
) AS prchs_quantity_notAssignedAndCanBeAssigned,
SUM(
CASE
assignment_newer_id IS null AND
current_timestamp < licence_availability_start
WHEN true THEN 1
WHEN false THEN 0
END
) AS prchs_quantity_notAssignedAndCannotBeAssigned
FROM (
SELECT
tPurchases.id AS purchase_id,
tPurchases.date_ AS purchase_date,
tLicences.id AS licence_id,
GREATEST(
tPurchases.date_,
older.end_,
older.start + '1 day'::interval
) AS licence_availability_start,
CASE
WHEN newer.id IS NULL THEN 'infinity'
ELSE newer.start - '1 day'::interval
END AS licence_availability_end,
COALESCE(newer.start, 'infinity') AS licence_availability_uninstallBy,
older.id AS assignment_older_id,
older.start AS assignment_older_start,
older.end_ AS assignment_older_end,
newer.id AS assignment_newer_id,
newer.start AS assignment_newer_start,
newer.end_ AS assignment_newer_end
FROM test.tLicences
INNER JOIN test.tPurchases
ON tPurchases.id = tLicences.purchase_id
LEFT JOIN test.tAssignments AS older
ON (
NOT older.deleted AND
older.licence_id = tLicences.id
)
LEFT JOIN test.tAssignments AS newer
ON (
NOT newer.deleted AND
newer.id <> older.id AND
newer.licence_id = older.licence_id
)
WHERE
NOT tLicences.deleted
UNION
SELECT
tPurchases.id AS purchase_id,
tPurchases.date_ AS purchase_date,
tLicences.id AS licence_id,
tPurchases.date_ AS licence_availability_start,
oldest.start - '1 day'::interval AS licence_availability_end,
oldest.start AS licence_availability_uninstallBy,
null AS assignment_older_id,
null AS assignment_older_start,
null AS assignment_older_end,
oldest.id AS assignment_newer_id,
oldest.start AS assignment_newer_start,
oldest.end_ AS assignment_newer_end
FROM test.tLicences
INNER JOIN test.tPurchases
ON tPurchases.id = tLicences.purchase_id
INNER JOIN test.tAssignments AS oldest
ON oldest.licence_id = tLicences.id
WHERE
NOT tLicences.deleted AND
NOT oldest.deleted
) AS periodsOfAvailability_start
WHERE
(assignment_newer_id IS null OR assignment_newer_end IS null)
GROUP BY purchase_id
) AS purchase_quantities_assignnments
ON
purchase_quantities_assignnments.purchase_id = tPurchases.id
WHERE
tLicences.id = 19 /* [Unexpected behaviour] The full set for "purchase_quantities_assignnments" is generated */
--tLicences.id = 19 AND tLicences.purchase_id = ? /* [Expected behaviour] Only the single relevant row for "purchase_quantities_assignnments" appears to be generated */
--tLicences.id = 19 AND tPurchases.id = ? /* [Expected behaviour] Only the single relevant row for "purchase_quantities_assignnments" appears to be generated */
--tLicences.purchase_id = ? /* [Expected behaviour] Only the single relevant row for "purchase_quantities_assignnments" appears to be generated. Note: This is a different query *result* than the others */
Question: Is there some way that I can resolve this issue, without having to provide tLicences.purchase_id
?
Database version: PostgreSQL 9.0
SQL to generate a schema, tables, and populate those tables:
This is kind of long-running, since I wanted a quantity similar to our actual data. The number of licences (30000) and the number of assignments (100000) can be reduced if the run-time is a problem.
CREATE SCHEMA test;
CREATE TABLE test.tPurchases (
id serial not null,
date_ date not null,
/* … */
deleted boolean not null DEFAULT false,
PRIMARY KEY (id)
);
CREATE TABLE test.tLicences (
id serial not null,
purchase_id integer not null,
/* … */
deleted boolean not null DEFAULT false,
PRIMARY KEY (id),
FOREIGN KEY (purchase_id)
REFERENCES test.tPurchases (id)
ON UPDATE RESTRICT
ON DELETE RESTRICT
);
CREATE INDEX ON test.tLicences(purchase_id);
CREATE TABLE test.tAssignments (
id serial not null,
licence_id integer not null,
start date not null,
end_ date,
/* … */
deleted boolean not null DEFAULT false,
PRIMARY KEY (id),
FOREIGN KEY (licence_id)
REFERENCES test.tLicences (id)
ON UPDATE RESTRICT
ON DELETE RESTRICT,
CHECK (start <= end_)
);
CREATE INDEX ON test.tAssignments(licence_id);
INSERT INTO test.tPurchases(id, date_)
SELECT
id,
'2000-01-01'::timestamp + random() * '1 year'::interval AS date_
FROM generate_series(1, 10) AS id
;
INSERT INTO test.tLicences(id, purchase_id, deleted)
SELECT
id,
trunc(random() * 10 + 1) AS purchase_id,
(random() > 0.99) AS deleted
FROM generate_series(1, 30000) AS id
;
INSERT INTO test.tAssignments(id, licence_id, start, end_, deleted)
SELECT
assignments.id,
assignments.licence_id,
tPurchases.date_ + ((rank * 20 + random() * 10) || ' days')::interval AS start,
CASE
assignments.rank = max(assignments.rank) OVER (PARTITION BY assignments.licence_id) AND
random() > 0.5
WHEN true THEN null
ELSE tPurchases.date_ + ((rank * 20 + 10 + random() * 10) || ' days')::interval
END AS end_,
(random() > 0.95) AS deleted
FROM (
SELECT
assignments.id,
assignments.licence_id,
rank() OVER (PARTITION BY assignments.licence_id ORDER BY assignments.id) AS rank
FROM (
SELECT
id,
trunc(random() * 30000 + 1) AS licence_id
FROM generate_series(1, 100000) AS id
) AS assignments
) AS assignments
INNER JOIN test.tLicences
ON tLicences.id = assignments.licence_id
INNER JOIN test.tPurchases
ON tPurchases.id = tLicences.purchase_id
;
You might need to run statistics but often you can force the optimization you want with a CTE. Here I also take out all your sub-queries to CTEs just to make it clear: