I am struggling on how to make a "multi-row" formula in U-SQL. I have ordered the data by Date, and for each for I want to find the first value of "Port" that is not equal to the current row's value. In a similar way I want to find the last row in the date value with the current port value to figure out how many days a vessel has been in the port. Keep in mind here that is has to be the row with the same port name, with no new/other ports in between.
I am loading my data in like this:
@res = SELECT
Port,
Date
FROM @data;
This is how my date is structured:
Port | Date |
Port A | 1/1/2017 |
Port A | 1/1/2017 |
Port A | 1/2/2017 |
Port B | 1/4/2017 |
Port B | 1/4/2017 |
Port B | 1/4/2017 |
Port B | 1/5/2017 |
Port B | 1/6/2017 |
Port C | 1/9/2017 |
Port C | 1/10/2017 |
Port C | 1/11/2017 |
Port A | 1/14/2017 |
Port A | 1/15/2017 |
How I would like the data to be structured:
Port | Date | Time in Port | Previous Port
Port A | 1/1/2017 | 0 | N/A
Port A | 1/1/2017 | 0 | N/A
Port A | 1/2/2017 | 1 | N/A
Port B | 1/4/2017 | 0 | Port A
Port B | 1/4/2017 | 0 | Port A
Port B | 1/4/2017 | 0 | Port A
Port B | 1/5/2017 | 1 | Port A
Port B | 1/6/2017 | 2 | Port A
Port C | 1/9/2017 | 0 | Port B
Port C | 1/10/2017 | 1 | Port B
Port C | 1/11/2017 | 2 | Port B
Port A | 1/14/2017 | 0 | Port C
Port A | 1/15/2017 | 1 | Port C
I am new to U-SQL and so I am having a bit of trouble on how to approach this. My first instinct would be to use some combination of LEAD()/LAG() and ROW_NUMBER() OVER(PARTITION BY xx ORDER BY Date), but I am unsure of how to get the exact effect I am looking for.
Could anyone point me in the right direction?
You can do what you need with the so-called Ranking and Analytic functions like
LAG
,DENSE_RANK
and theOVER
clause, although it's not entirely straightforward. This simple rig worked for your test data, I would suggest testing thoroughly with a larger and more complex dataset.My results: