PostgreSQL - Optimize ordering of columns in table range partitioning with multiple columns


I am experimenting with building a data warehouse for a relatively large dataset. Based on a ~10% sample of the data, I decided to partition some tables that are expected to exceed available memory (currently 16 GB), following the recommendation in the PostgreSQL docs:

These benefits will normally be worthwhile only when a table would otherwise be very large. The exact point at which a table will benefit from partitioning depends on the application, although a rule of thumb is that the size of the table should exceed the physical memory of the database server.

There is one particular table I am not sure how to partition. It is frequently queried in two different ways, with a WHERE clause that may include the primary key OR another indexed column, so I figured I need a range partition on the existing primary key plus that other column (with the other column added to the primary key).

Knowing that the order of columns matters, and given the information below, my question is:
What is the best order for the primary key and range partitioning columns?

Original table:

CREATE TABLE items (
    item_id BIGSERIAL NOT NULL, -- primary key
    src_doc_id bigint NOT NULL, -- every item can exist in one src_doc only and a src_doc can have multiple items 
    item_name character varying(50) NOT NULL, -- used in `WHERE` clause with src_doc_id and guaranteed to be unique from source
    attr_1 bool,
    attr_2 bool, -- +15 other columns all bool or integer types
    PRIMARY KEY (item_id)
);
CREATE INDEX index_items_src_doc_id ON items USING btree (src_doc_id);
CREATE INDEX index_items_item_name ON items USING hash (item_name);

The table size for 10% of the dataset is ~2 GB (result of pg_total_relation_size) with 3M+ rows. Loading and querying performance is excellent, but given that this table is expected to grow to 30M rows and ~20 GB, I do not know what to expect in terms of performance.
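For reference, this is roughly how the size figure above can be obtained (a minimal sketch; the table name items matches the DDL in this question):

-- Total size of the table including its indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('items'));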

Partitioned table being considered:

CREATE TABLE items (
    item_id BIGSERIAL NOT NULL,
    src_doc_id bigint NOT NULL,
    item_name character varying(50) NOT NULL,
    attr_1 bool,
    attr_2 bool,
    PRIMARY KEY (item_id, src_doc_id) -- should the order be reversed?
) PARTITION BY RANGE (item_id, src_doc_id); -- should the order be reversed?
CREATE INDEX index_items_src_doc_id ON items USING btree (src_doc_id);
CREATE INDEX index_items_item_name ON items USING hash (item_name);

-- Ranges are not initially known, so MAXVALUE is used as the upper bound.
-- Once the upper bound is known, the partition is detached and reattached
-- with the known upper bound, and a new partition is added for the next range.
CREATE TABLE items_00 PARTITION OF items FOR VALUES FROM (MINVALUE, MINVALUE) TO (MAXVALUE, MAXVALUE);
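A minimal sketch of that maintenance step, assuming the boundary turns out to be item_id = 1000000 (a hypothetical value) and keeping the (item_id, src_doc_id) order shown above:

BEGIN;
ALTER TABLE items DETACH PARTITION items_00;
-- Re-attach with the now-known upper bound; PostgreSQL scans the partition
-- to verify all existing rows fall within the new bounds.
ALTER TABLE items ATTACH PARTITION items_00
    FOR VALUES FROM (MINVALUE, MINVALUE) TO (1000000, MINVALUE);
-- New partition for the next range, again with an open-ended upper bound
CREATE TABLE items_01 PARTITION OF items
    FOR VALUES FROM (1000000, MINVALUE) TO (MAXVALUE, MAXVALUE);
COMMIT;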

Table usage

On loading data, the load process (a Python script) looks up existing items by src_doc_id and item_name and stores the item_id, so it does not reinsert existing items. item_id gets referenced in a lot of other tables; no foreign keys are used.
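The lookup issued by the load process looks roughly like this (a sketch; the parameter placeholders are illustrative):

-- Existence check during loading; can use the btree index on src_doc_id
-- and the hash index on item_name
SELECT item_id
FROM items
WHERE src_doc_id = $1
  AND item_name = $2;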

On querying for analytics, item information is always looked up by item_id.
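And the analytics-side lookup (again a sketch):

-- Analytics lookup; filters on the first column of the current primary key
SELECT *
FROM items
WHERE item_id = $1;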

So I can't decide on a suitable column order for the table's PRIMARY KEY and PARTITION BY RANGE.
Should it be (item_id, src_doc_id) or (src_doc_id, item_id)?
