PostgreSQL generate date series (performance)


Using PostgreSQL version > 10, I have run into an issue when generating date series using the built-in generate_series function. In essence, it does not account for the day of the month correctly.

I have many different frequencies (provided by the user) that need to be calculated between a given start and end date. The start date can be any date, and thus fall on any day of the month. This causes problems when a frequency such as monthly is combined with a start date of 2018-01-31 or 2018-01-30, as shown in the output below.

I created a solution and wanted to post it here for others to use, as I could not find any other solution.

However, after some tests I have seen that my solution performs considerably worse than the built-in generate_series when used on (absurdly) large date ranges. Does anyone have an insight into how this can be improved?

TL;DR: avoid loops if possible, as they are a performance hit; scroll to the bottom for the improved implementation.

Built-in Output

select generate_series(date '2018-01-31', 
                       date '2018-05-31', 
                       interval '1 month')::date
as frequency;

generates:

 frequency
------------
 2018-01-31
 2018-02-28
 2018-03-28
 2018-04-28
 2018-05-28

As can be seen from the output, the day of the month is not respected: it gets clamped to the smallest day encountered along the way, in this case 28, due to the month of February. The clamping compounds because each step adds the interval to the previous (already clamped) value rather than scaling from the start date.
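
To illustrate (a minimal sketch; PostgreSQL's interval arithmetic clamps to the last valid day of the target month):

select (date '2018-01-31' + interval '1 month')::date;                      -- 2018-02-28
select (date '2018-01-31' + interval '1 month' + interval '1 month')::date; -- 2018-03-28: the 31st is lost
select (date '2018-01-31' + interval '2 month')::date;                      -- 2018-03-31: scaling from the start preserves it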

Expected Output

As a result of this issue I created a custom function:

create or replace function generate_date_series(
  starts_on date, 
  ends_on date, 
  frequency interval)
returns setof date as $$
declare
  interval_on date := starts_on;
  count int := 1;
begin
  while interval_on <= ends_on loop
    return next interval_on;
    interval_on := starts_on + (count * frequency);
    count := count + 1;
  end loop;
  return;
end;
$$ language plpgsql immutable;

select generate_date_series(date '2018-01-31', 
                            date '2018-05-31', 
                            interval '1 month')
as frequency;

generates:

 frequency
------------
 2018-01-31
 2018-02-28
 2018-03-31
 2018-04-30
 2018-05-31

Performance comparison

No matter what date range is provided, the built-in generate_series takes about 2 ms on average for:

select generate_series(date '1900-01-01', 
                       date '10000-5-31', 
                       interval '1 month')::date 
as frequency;

while the custom function generate_date_series takes about 120 ms on average for:

select generate_date_series(date '1900-01-01', 
                            date '10000-5-31', 
                            interval '1 month')::date 
as frequency;

Question

In reality, such ranges will never occur, and thus it is a non-issue. For most queries the custom generate_date_series will attain the same performance. Still, I do wonder what causes the difference.

Is there a reason why the built-in function is able to attain a constant performance of about 2 ms on average, no matter what range is provided?

Is there a better way to implement generate_date_series that performs as well as the built-in generate_series?

Improved implementation without loops

(derived from the answer by @eurotrash)

create or replace function generate_date_series(
  starts_on date, 
  ends_on date, 
  frequency interval)
returns setof date as $$
select (starts_on + (frequency * count))::date
from (
  select (row_number() over ()) - 1 as count
  from generate_series(starts_on, ends_on, frequency)
) series
$$ language sql immutable;

With the improved implementation, the generate_date_series function takes about 45 ms on average for:

select generate_date_series(date '1900-01-01', 
                            date '10000-5-31', 
                            interval '1 month')::date 
as frequency;

The implementation provided by @eurotrash takes about 80 ms on average, which I assume is due to calling the generate_series function twice.
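
As a further experiment, the inner call to generate_series can be avoided altogether by deriving an upper bound on the number of steps arithmetically and filtering out the overshoot. A rough sketch (the function name generate_date_series_est is my own; it relies on extract(epoch from ...) approximating a month as 30 days, hence the padded bound and the filter):

create or replace function generate_date_series_est(
  starts_on date,
  ends_on date,
  frequency interval)
returns setof date as $$
  select (starts_on + (frequency * g))::date
  from generate_series(
         0,
         -- padded upper bound on the step count; extract(epoch ...)
         -- treats a month as 30 days, so this is only an estimate
         ceil((ends_on - starts_on) * 86400.0
              / extract(epoch from frequency))::int + 1
       ) g
  where starts_on + (frequency * g) <= ends_on;
$$ language sql immutable;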

There are 4 answers below.

Accepted Answer

Why your function is slow: you use variables and (more importantly) a loop. Loops are slow in plpgsql. Variables add further overhead, since every iteration reads from and writes to them.

CREATE OR REPLACE FUNCTION generate_date_series_2(starts_on DATE, ends_on DATE, frequency INTERVAL)
        RETURNS SETOF DATE AS
$BODY$
        SELECT (starts_on + (frequency * g))::DATE
        FROM generate_series(0, (SELECT COUNT(*)::INTEGER - 1
                                 FROM generate_series(starts_on, ends_on, frequency))) g;
$BODY$
        LANGUAGE SQL IMMUTABLE;

The concept is basically the same as in your plpgsql function, but via a single query instead of a loop. The only problem is deciding how many iterations are needed (i.e. the second parameter to generate_series). Sadly, I couldn't think of a better way to get the number of intervals required than calling generate_series for the dates and using the count of that. Of course, if you know your intervals will only ever be certain values, it may be possible to optimise; this version, however, handles any interval value.

On my system it's about 50% slower than a pure generate_series, and about 400% faster than your plpgsql version.
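
For reference, calling it with the range from the question should reproduce the expected output:

select generate_date_series_2(date '2018-01-31',
                              date '2018-05-31',
                              interval '1 month')
as frequency;

 frequency
------------
 2018-01-31
 2018-02-28
 2018-03-31
 2018-04-30
 2018-05-31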

Answer

You may use date_trunc and add an interval of '1 MONTH - 1 day' to the output of generate_series, which pins each row to the last day of its month; the performance should be almost the same as the built-in.

SELECT 
  (date_trunc('month', dt) + INTERVAL '1 MONTH - 1 day') ::DATE AS frequency 
FROM 
  generate_series(
    DATE '2018-01-31', DATE '2018-05-31', 
    interval '1 MONTH'
  ) AS dt 
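
Note that this always lands on the last day of the month, so it matches the question's expected output only because the start date is a month-end; with a mid-month start (a quick hypothetical check), every row is still pulled to month-end:

SELECT
  (date_trunc('month', dt) + INTERVAL '1 MONTH - 1 day')::DATE AS frequency
FROM
  generate_series(
    DATE '2018-01-15', DATE '2018-05-15',
    interval '1 MONTH'
  ) AS dt
-- returns 2018-01-31 .. 2018-05-31 rather than the 15th of each month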


Test

knayak=# select generate_series(date '2018-01-31',
knayak(#                        date '2018-05-31',
knayak(#                        interval '1 month')::date
knayak-# as frequency;
 frequency
------------
 2018-01-31
 2018-02-28
 2018-03-28
 2018-04-28
 2018-05-28
(5 rows)

Time: 0.303 ms
knayak=#
knayak=#
knayak=# SELECT
knayak-#   (date_trunc('month', dt) + INTERVAL '1 MONTH - 1 day' ):: DATE AS frequency
knayak-# FROM
knayak-#   generate_series(
knayak(#     DATE '2018-01-31', DATE '2018-05-31',
knayak(#     interval '1 MONTH'
knayak(#   ) AS dt
knayak-# ;
 frequency
------------
 2018-01-31
 2018-02-28
 2018-03-31
 2018-04-30
 2018-05-31
(5 rows)

Time: 0.425 ms

Answer

REVISED SOLUTION

This gives me 97,212 rows in under 7 seconds (approx 0.07 ms per row) and also supports leap years, where February has 29 days:

SELECT      t.day_of_month
FROM        (
                SELECT  ds.day_of_month
                        , date_part('day', ds.day_of_month) AS day
                        , date_part('day', ((day_of_month - date_part('day', ds.day_of_month)::INT + 1) + INTERVAL '1' MONTH) - INTERVAL '1' DAY) AS eom
                FROM    (
                            SELECT generate_series( date '1900-01-01', 
                                                    date '10000-12-31', 
                                                    INTERVAL '1 day')::DATE as day_of_month
                        ) AS ds
            ) AS t
            --> REMEMBER to change the day at both places below (eg. 31)
WHERE       t.day = 31 OR (t.day = t.eom AND t.day < 31)
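
The eom column finds the last day of each month by stepping back to the first of the month, adding a month, and subtracting a day; for a single date (an illustration of the expression above):

SELECT date_part('day',
         (date '2018-02-15' - date_part('day', date '2018-02-15')::INT + 1)
         + INTERVAL '1' MONTH - INTERVAL '1' DAY);  -- 28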

Answer

Simple solution:

SELECT '2000-01-31'::DATE + ('1 MONTH'::INTERVAL)*x FROM generate_series(0,100) x;

Drawback:

Since the generate_series() parameters are integers, you have to calculate the number of steps yourself.
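
For whole-month frequencies the step count can itself be computed in SQL, for example with age() (a sketch; age() counts completed months and can come out one low when the end date sits on a clamped month-end, so pad the bound and filter if exactness matters):

SELECT ('2018-01-31'::DATE + ('1 MONTH'::INTERVAL) * x)::DATE AS frequency
FROM generate_series(
       0,
       -- completed months between the two dates
       (extract(year  from age(date '2018-05-31', date '2018-01-31')) * 12
      + extract(month from age(date '2018-05-31', date '2018-01-31')))::int
     ) x;
-- 2018-01-31, 2018-02-28, 2018-03-31, 2018-04-30, 2018-05-31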

Massive advantage:

generate_series() gives correct row count estimates to the optimizer when its parameters are integers, but it isn't smart enough to do so when its parameters are dates and intervals:

This is very important, especially if you use it to build a huge series. With date parameters the planner always falls back to the default estimate of 1000 rows, which can cause the optimizer to go for a catastrophic plan.

CREATE UNLOGGED TABLE foo( id SERIAL PRIMARY KEY, dt TIMESTAMP NOT NULL );
INSERT INTO foo (dt) SELECT '2000-01-01'::TIMESTAMP + ('1 SECOND'::INTERVAL)*x FROM generate_series(1,1000000) x;
CREATE INDEX foo_dt ON foo(dt);
VACUUM ANALYZE foo;

EXPLAIN ANALYZE
WITH d AS (SELECT '2000-01-01'::TIMESTAMP + ('10 SECOND'::INTERVAL)*x dt FROM generate_series(1,100000) x)
SELECT * FROM foo JOIN d USING (dt);
 Hash Join  (cost=27906.00..30656.00 rows=100000 width=12) (actual time=191.020..237.268 rows=100000 loops=1)
   Hash Cond: (('2000-01-01 00:00:00'::timestamp without time zone + ('00:00:10'::interval * (x.x)::double precision)) = foo.dt)
   ->  Function Scan on generate_series x  (cost=0.00..1000.00 rows=100000 width=4) (actual time=7.070..11.096 rows=100000 loops=1)
     CORRECT ESTIMATE -------------------------------------------------^
   ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=12) (actual time=181.844..181.845 rows=1000000 loops=1)
         Buckets: 1048576  Batches: 1  Memory Usage: 51161kB
         ->  Seq Scan on foo  (cost=0.00..15406.00 rows=1000000 width=12) (actual time=0.009..64.702 rows=1000000 loops=1)

EXPLAIN ANALYZE
WITH d AS (SELECT generate_series('2000-01-01'::TIMESTAMP, '2000-01-12 13:46:40'::TIMESTAMP, '10 SECOND'::INTERVAL) dt)
SELECT * FROM foo JOIN d USING (dt);
 Nested Loop  (cost=0.42..7515.52 rows=1000 width=12) (actual time=0.050..139.251 rows=100000 loops=1)
   ->  ProjectSet  (cost=0.00..5.02 rows=1000 width=8) (actual time=0.006..5.493 rows=100001 loops=1)
     WRONG ESTIMATE ----------------------^
         ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)
   ->  Index Scan using foo_dt on foo  (cost=0.42..7.49 rows=1 width=12) (actual time=0.001..0.001 rows=1 loops=100001)
         Index Cond: (dt = (generate_series('2000-01-01 00:00:00'::timestamp without time zone, '2000-01-12 13:46:40'::timestamp without time zone, '00:00:10'::interval)))

With the correct estimate, it uses a hash join, which is the right thing to do here. With the wrong, much too low estimate, it uses a nested loop with an index scan instead. If the stars align just right, that's one random IO per page.