SQL query for run-length, or consecutive identical value encoding

1.8k Views Asked by At

My goal is to take a set of data that is ordered by id and return a resultset that indicates the number of consecutive rows where the val column is identical. E.g. given this data:

| id | val |
|  1 |  33 |
|  2 |  33 |
|  3 |  44 |
|  4 |  28 |
|  5 |  44 |
|  6 |  44 |

I would like to see this result:

| id | val | run_length |
| 1  | 33  | 2          |
| 3  | 44  | 1          |
| 4  | 28  | 1          |
| 5  | 44  | 2          |

The id column in the resultset is optional. In fact, if it makes it significantly harder, then just leave that column out of the result. I sort of like having it because it "pins" the resultset to a particular location in the table.

I am primarily interested in the result in free database engines. My order of preference for a solution is:

  1. SQLite
  2. Postgres
  3. MySQL
  4. Oracle
  5. SQL Server
  6. Sybase
2

There are 2 best solutions below

1
On

I'll choose #2 on your list, because this is incredibly painful to do in SQLite with a single query. The following is standard SQL:

select min(id), val, count(*) as runlength
from (select t.*,
             (row_number() over (order by id) -
              row_number() over (partition by val order by id)
             ) as grp
      from data t
     ) t
group by grp, val;

This uses the difference of two row number calculations to identify the seuqnces of identical values. It should work in the recent versions of databases 2, 4, 5, and 6.

0
On

I've been wandering around in RLE space in SQLITE and ran across this post. I believe this code works for #1. The first answer is correct this is a bit painful in SQLite as a single query.

create table example (id integer primary key autoincrement, val integer);

insert into example (val) values (33);
insert into example (val) values (33);
insert into example (val) values (44);
insert into example (val) values (28);
insert into example (val) values (44);
insert into example (val) values (44);


select ren.low_id, e2.val, (ren.high_id - ren.low_id)+1
from example e2
inner join (
select min(hb.low_id) as low_id, hb.high_id as high_id
from 
(
    with nexample(low_id, high_id, val) 
    as 
    (
    select e.id, e.id, e.val from example e
    union all
    select ne.low_id, eu.id, ne.val 
    from nexample ne
    inner join example eu on eu.id = ne.high_id+1 AND eu.val=ne.val
    )
    select ne.low_id, max(ne.high_id) as high_id from nexample ne
    group by ne.low_id
) hb
group by hb.high_id
) ren on ren.low_id = e2.id;

Output:

1|33|2
3|44|1
4|28|1
5|44|2

Note this solution doesn't perform well on very sparse sets... I'm looking for an alternative approach to dealing with sparse sets.

For example on a set of 10000 rows with a val set of [0,1] but all values are 0. This code takes ~2 minutes 30sec to run on my hardware. Not great.