HIVE - increment value on column change

230 Views Asked by At

I'm just basically trying to add a column with a unique identifier for a journey. I have a table that looks similar to this:

Time                id  station     newtrip
2017-11-15 16:45    100 St.George   TRUE
2017-11-15 16:46    100 Bloor       FALSE
2017-11-15 16:47    110 Wellesley   TRUE
2017-11-15 16:48    110 Wellesley   FALSE
2017-11-15 16:49    200 Dundas      TRUE
2017-11-15 16:55    200 College     FALSE
2017-11-15 16:56    200 Union       FALSE
2017-11-15 17:51    200 Union       TRUE
2017-11-15 17:52    200 St.Andrew   FALSE

All I am trying to do is increment a counter every time that last column shows true. So the result should look like:

Time                id  station     newtrip journeyID
2017-11-15 16:45    100 St.George   TRUE    1
2017-11-15 16:46    100 Bloor       FALSE   1
2017-11-15 16:47    110 Wellesley   TRUE    2
2017-11-15 16:48    110 Wellesley   FALSE   2
2017-11-15 16:49    200 Dundas      TRUE    3
2017-11-15 16:55    200 College     FALSE   3
2017-11-15 16:56    200 Union       FALSE   3
2017-11-15 17:51    200 Union       TRUE    4
2017-11-15 17:52    200 St.Andrew   FALSE   4

Couple things to note

  • If using window analytic functions, I am not partitioning on anything. I want this to cover the entire table (about 30mil rows).
  • I was able to get this to work on my Hortonworks Ambari VM by adding a row_counter (rowid) for the whole table and then doing something like:

    SUM(IF(newtrip, 1, 0)) OVER(order by rowid) as journeyID

But when running the EXACT same code in a cluster in AWS EMR, adding a row ID seems to mess up the order of the other columns and I get messed up results.

Surely there is an easy way to do this?

0

There are 0 best solutions below