U-SQL How to select first value in column that is different from current row?

226 Views Asked by At

I am struggling on how to make a "multi-row" formula in U-SQL. I have ordered the data by Date, and for each for I want to find the first value of "Port" that is not equal to the current row's value. In a similar way I want to find the last row in the date value with the current port value to figure out how many days a vessel has been in the port. Keep in mind here that is has to be the row with the same port name, with no new/other ports in between.

I am loading my data in like this:

@res = SELECT
        Port,
        Date
        FROM @data;

This is how my date is structured:

Port      |   Date       |
Port A    |   1/1/2017   |
Port A    |   1/1/2017   |
Port A    |   1/2/2017   |
Port B    |   1/4/2017   |
Port B    |   1/4/2017   |
Port B    |   1/4/2017   |
Port B    |   1/5/2017   |
Port B    |   1/6/2017   |
Port C    |   1/9/2017   |
Port C    |   1/10/2017  |
Port C    |   1/11/2017  |
Port A    |   1/14/2017  |
Port A    |   1/15/2017  |

How I would like the data to be structured:

Port      |   Date       |  Time in Port   | Previous Port
Port A    |   1/1/2017   |      0          |   N/A
Port A    |   1/1/2017   |      0          |   N/A
Port A    |   1/2/2017   |      1          |   N/A
Port B    |   1/4/2017   |      0          |   Port  A
Port B    |   1/4/2017   |      0          |   Port  A
Port B    |   1/4/2017   |      0          |   Port  A
Port B    |   1/5/2017   |      1          |   Port  A
Port B    |   1/6/2017   |      2          |   Port  A
Port C    |   1/9/2017   |      0          |   Port  B
Port C    |   1/10/2017  |      1          |   Port  B
Port C    |   1/11/2017  |      2          |   Port  B
Port A    |   1/14/2017  |      0          |   Port  C
Port A    |   1/15/2017  |      1          |   Port  C

I am new to U-SQL and so I am having a bit of trouble on how to approach this. My first instinct would be to use some combination of LEAD()/LAG() and ROW_NUMBER() OVER(PARTITION BY xx ORDER BY Date), but I am unsure of how to get the exact effect I am looking for.

Could anyone point me in the right direction?

1

There are 1 best solutions below

0
On

You can do what you need with the so-called Ranking and Analytic functions like LAG, DENSE_RANK and the OVER clause, although it's not entirely straightforward. This simple rig worked for your test data, I would suggest testing thoroughly with a larger and more complex dataset.

// Test data
@input = SELECT *
     FROM (
        VALUES
        ( "Port A", DateTime.Parse("1/1/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port A", DateTime.Parse("1/1/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port A", DateTime.Parse("1/2/2017", new CultureInfo("en-US") ), 1 ),
        ( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port B", DateTime.Parse("1/4/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port B", DateTime.Parse("1/5/2017", new CultureInfo("en-US") ), 1 ),
        ( "Port B", DateTime.Parse("1/6/2017", new CultureInfo("en-US") ), 2 ),
        ( "Port C", DateTime.Parse("1/9/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port C", DateTime.Parse("1/10/2017", new CultureInfo("en-US") ), 1 ),
        ( "Port C", DateTime.Parse("1/11/2017", new CultureInfo("en-US") ), 2 ),
        ( "Port A", DateTime.Parse("1/14/2017", new CultureInfo("en-US") ), 0 ),
        ( "Port A", DateTime.Parse("1/15/2017", new CultureInfo("en-US") ), 1 )
     ) AS x ( Port, Date, timeInPort );



// Add a group id to the dataset
@working =
    SELECT Port,
           Date,
           timeInPort,
           DENSE_RANK() OVER(ORDER BY Date) - DENSE_RANK() OVER(PARTITION BY Port ORDER BY Date) AS groupId

    FROM @input;


// Use the group id to work out the datediff with previous row
@working =
    SELECT Port,
           Date,
           timeInPort,
           groupId,
           Date.Date.Subtract((DateTime)(LAG(Date) OVER(PARTITION BY groupId ORDER BY Date) ?? Date)).TotalDays AS diff    // datediff

    FROM @working;


// Work out the previous port, based on group id
@ports =
    SELECT Port, groupId
    FROM @working
    GROUP BY Port, groupId;

@ports =
    SELECT Port, groupId, LAG(Port) OVER( ORDER BY groupId ) AS previousPort
    FROM @ports;


// Prep the final output
@output =
    SELECT w.Port,
           w.Date.ToString("M/d/yyyy") AS Date,
           SUM(w.diff) OVER( PARTITION BY w.groupId ORDER BY w.Date ROWS BETWEEN 1 PRECEDING AND CURRENT ROW ) AS timeInPort,
           p.previousPort
    FROM @working AS w
         INNER JOIN
             @ports AS p
         ON w.Port == p.Port
            AND w.groupId == p.groupId;


OUTPUT @output TO "/output/output.csv"
ORDER BY Date, Port       
USING Outputters.Csv(quoting:false);

My results:

Results