Is there a way to delete / update duplicate values in a time sequence?

98 Views Asked by At

I have a table that looks like (never mind the database vendor as I need this in ANSI SQL):

create table edge (
   parent_id int not null,
   child_id int not null,
   value float not null,
   start date not null,
   end date not null
)

I would then have the following CSV input data:

1,2,0,2023-01-01,2023-01-10
1,2,0,2023-01-11,2023-01-20
1,2,0,2023-01-21,NULL
1,3,0,2023-01-01,2023-01-10
1,3,0,2023-01-11,2023-01-20
1,3,1,2023-01-21,NULL

Note the case where the parent is 1 and the child is 2, the same value 0 repeats across multiple date intervals, so it could be collapsed to:

1,2,0,2023-01-01,NULL
1,3,0,2023-01-01,2023-01-20
1,3,1,2023-01-21,NULL

a value of NULL in end means valid until further notice.

Ideally there would be a query that would for table edge (yes graph edges) generate the output of the second excerpt. The goal is to first delete redundant rows and update the remaining last row with start to match the start of the first deleted row.

The closest I can get is the following, here I am finding all the redundant rows to delete:

SELECT *
FROM edge
WHERE (parent, child, value, start) IN (SELECT parent, child, value, end+1
                                        FROM edge)
ORDER BY parent, child, start

I would then need to do the update step to leave the time series consistent. On a separate note, value is being exact matched and it is a float but this works on PostgreSQL at least.

If there is no way using ANSI I would be interested to learn for the PostgreSQL case.

4

There are 4 best solutions below

1
On

If the date ranges are continuous I think the easiest (but a bit dirty) approach would be as follows:

  1. Rename table and set end to a unique date in the future where it is null.

  2. Put values in new edge table:

     SELECT parent_id, child_id, value, MIN(start), MAX(end)
     FROM edge_old
     GROUP BY parent_id, child_id, value
    
  3. Set end to null where it matches the unique date you set before and delete old table.

1
On

I would first suggest a query that selects the desired rows as per the business rules above that may be used (as a view?) without deleting. I am using the non-ANSI distinct on PostgreSQL's clause.

select distinct on (parent_id, child_id, value) 
 parent_id, child_id, value, 
 first_value(start_date) over (partition by parent_id, child_id, value order by start_date),
 end_date
from edge;
order by parent_id, child_id, value, start_date desc;

If this is not acceptable then shape the query as a CTE and then delete these rows that are not in it.

Here is a SQL standard-compliant version w/o distinct on.

select parent_id, child_id, value, sd start_date, end_date
from
(
 select
   row_number() over (partition by parent_id, child_id, value order by start_date desc) rn, 
   parent_id, child_id, value, 
   first_value(start_date) over (partition by parent_id, child_id, value order by start_date) sd,
   end_date
 from edge
) t
where rn = 1;

SQL Fiddle demo

0
On

Hello another way to do this in postgresql if you have continuous date ranges with one select is this:

SELECT 
    parent_id, 
    child_id, 
    "value", 
    min(start) AS start,
    CASE 
        WHEN min(coalesce(end, '1900-01-01'::date)) = '1900-01-01'::date THEN null 
        ELSE max(end) 
    END AS end
FROM 
    edge
GROUP BY  
    parent_id, 
    child_id, 
    "value";

Hope it helps.

0
On

Here's the PostgreSQL version to show off range types, and also handles non-contiguous ranges.

I've made some changes...

  • start/end is now a single daterange. We also avoid the keyword end.
  • Using infinity to represent an unbounded end time; this avoids having to deal with nulls.
  • I've added a non-contiguous edge to the data. The result should be 4 edges.
  • dateranges can be indexed using gist so range operations are efficient.
create table edge (
   parent_id int not null,
   child_id int not null,
   value float not null,
   span daterange not null
);

create index edge_span_idx on edge using gist(span);

-- daterange(start, end, '[]') says the range is to include the end date.
insert into edge(parent_id, child_id, value, span) values
  -- Edge 1
  (1,2,0, daterange('2023-01-01','2023-01-10', '[]')),
  (1,2,0, daterange('2023-01-11','2023-01-20', '[]')),
  (1,2,0, daterange('2023-01-21','infinity', '[]')),

  -- Edge 2
  (1,2,0, daterange('2022-01-01', '2022-02-01', '[]')),

  -- Edge 3
  (1,3,0, daterange('2023-01-01','2023-01-10', '[]')),
  (1,3,0, daterange('2023-01-11','2023-01-20', '[]')),

  -- Edge 4
  (1,3,1, daterange('2023-01-21','infinity', '[]'))

Now we can use the group by technique from other answers, but it's simpler because we can use range_agg to aggregate all the date ranges, there's no fussing about with nulls, and it handles non-overlapping ranges.

select parent_id, child_id, value, range_agg(span)
from edge
group by parent_id, child_id, value;

parent_id   child_id    value   range_agg
1           3           0       {[2023-01-01,2023-01-21)}
1           2           0       {[2022-01-01,2022-02-02),[2023-01-01,infinity]}
1           3           1       {[2023-01-21,infinity]}

We're left with an array of dateranges for each parent/child/value combo. Each daterange is a compacted edge.

To get them as individual rows, unnest the arrays into individual columns.

select parent_id, child_id, value, unnest(range_agg(span)) as span
from edge
group by parent_id, child_id, value

parent_id   child_id    value   unnest
1           3           0       [2023-01-01,2023-01-21)
1           2           0       [2022-01-01,2022-02-02)
1           2           0       [2023-01-01,infinity]
1           3           1       [2023-01-21,infinity]

Demonstration.