I'm just basically trying to add a column with a unique identifier for a journey. I have a table that looks similar to this:
Time id station newtrip
2017-11-15 16:45 100 St.George TRUE
2017-11-15 16:46 100 Bloor FALSE
2017-11-15 16:47 110 Wellesley TRUE
2017-11-15 16:48 110 Wellesley FALSE
2017-11-15 16:49 200 Dundas TRUE
2017-11-15 16:55 200 College FALSE
2017-11-15 16:56 200 Union FALSE
2017-11-15 17:51 200 Union TRUE
2017-11-15 17:52 200 St.Andrew FALSE
All I am trying to do is increment a counter every time that last column shows true. So the result should look like:
Time id station newtrip journeyID
2017-11-15 16:45 100 St.George TRUE 1
2017-11-15 16:46 100 Bloor FALSE 1
2017-11-15 16:47 110 Wellesley TRUE 2
2017-11-15 16:48 110 Wellesley FALSE 2
2017-11-15 16:49 200 Dundas TRUE 3
2017-11-15 16:55 200 College FALSE 3
2017-11-15 16:56 200 Union FALSE 3
2017-11-15 17:51 200 Union TRUE 4
2017-11-15 17:52 200 St.Andrew FALSE 4
Couple things to note
- If using window analytic functions, I am not partitioning on anything. I want this to cover the entire table (about 30mil rows).
I was able to get this to work on my Hortonworks Ambari VM by adding a row_counter (rowid) for the whole table and then doing something like:
SUM(IF(newtrip, 1, 0)) OVER(order by rowid) as journeyID
But when running the EXACT same code in a cluster in AWS EMR, adding a row ID seems to mess up the order of the other columns and I get messed up results.
Surely there is an easy way to do this?