PostgreSQL: Speed Up Updating a Column using Aggregate Value from Another Column

40 Views Asked by At

I work with PostGIS and the data is quite large, around 1.5 billion rows, I have a grid table along with its administrative areas, I need to get a value which is the number of residents in a particular administrative area to carry out further calculations. The steps I took were to create a sum_residents column and fill it with the aggregate number of residents grouped by administrative area name (adm_col1, adm_col2, adm_col3)

Query I ran:

UPDATE table_grid AS gd
SET sum_resident_grid = subquery.total_sum
FROM (
     SELECT adm_col1, adm_col2, adm_col3, SUM(resident_grid) AS total_sum
     FROM table_grid
     GROUP BY adm_col1, adm_col2, adm_col3
) AS subquery
WHERE gd.adm_col1 = subquery.adm_col1
AND gd.adm_col2 = subquery.adm_col2
AND gd.adm_col3 = subquery.adm_col3;

Initially for relatively small data, the query was no problem, but when the data was very large, the query took a very long time, it could take more than a day. Is there a faster way to get the sum_resident_grid value?

Stage 2: After getting the sum_resident_grid, I actually use this value to carry out weighting calculations, where the formula is resident_grid/sum_resident_grid*actual_resident_value obtained by joining the administrative table, the query I ran:

UPDATE table_grid AS gd
SET
  sum_income = ROUND(gd.resident_grid::float/gd.sum_resident_grid*de.actual_resident_value)::float/de.actual_resident_value*de.total_inco ,
  sum_expend = ROUND(gd.resident_grid::float/gd.sum_resident_grid*de.actual_resident_value)::float/de.actual_resident_value*de.total_expe ,
  resident_grid = ROUND(gd.resident_grid::float/gd.sum_resident_grid*de.actual_resident_value)
FROM table_administrative de
WHERE gd.sum_resident_grid !=0
AND gd.adm_code1 = de.adm_code1;

so is there any other way that can speed up the query that I am running? Thank in advance for your help!

1

There are 1 best solutions below

4
Tim Biegeleisen On

This answer is mainly aiming to provide a general approach which can alleviate the need to do the costly full table aggregation/summation. Instead of doing your current query, you may consider keeping a running summation. Each time a new record is inserted into the table_grid table, you can update the running total. In Postgres, you would accomplish this via a trigger:

-- assumes a "table_grid_total" table already exists

CREATE OR REPLACE FUNCTION table_grid_insert_trigger_fnc()
RETURNS trigger AS
$$

BEGIN

UPDATE table_grid_total
SET total = total + NEW.resident_grid
WHERE adm_col1 = NEW.adm_col1 AND
      adm_col2 = NEW.adm_col2 AND
      adm_col3 = NEW.adm_col3;

RETURN NEW;
END;

$$

LANGUAGE 'plpgsql';

CREATE TRIGGER table_grid_insert_trigger

AFTER INSERT ON table_grid

FOR EACH ROW

EXECUTE PROCEDURE table_grid_insert_trigger_fnc();

In other words, for every insert to the table_grid table, we keep a running total of the resident_grid for each (adm_col1, adm_col2, adm_col3) tuple. To access these sums, you may simply query the table_grid_total table at any time.