Apache PIG- set date of current row as next records date minus one day for a given id

815 Views Asked by At

I have requirement to set end_dt as next records effective_dt minus 1 day for a given id and default it to 9999-12-31 for last record of a given id in pig.

input data-

id     eff_dt      end_dt
1    2012-02-28   9999-12-31
1    2013-03-15   9999-12-31
1    2014-05-01   9999-12-31

Required result- (order by eff_dt and then get the next record)

id     eff_dt       end_dt
1    2012-02-28    2013-02-14
1    2013-03-15    2014-04-30
1    2014-05-01    9999-12-31

i am new to apache PIG, found that we can use lead/lag , stitch/flatten but not getting how to use it in the script to achieve above result .I am facing few issues.

Issue 1 :- PIG accepts date as chararray. Need to convert eff_dt into date.
Issue 2 :- want to know syntax for 'date minus 1 day'.
Issue 3 :- How to use lead lag to get next record and do a minus one day and default if there is no next record.

Got below sample code from apache pig site but not getting how to transform it to use it in my use case.:-

To find the record 3 ahead of the current record, using a window between the current row and 3 records ahead and a default value of 0.

 A = load 'T';
 B = group A by si;
 C = foreach B {
     C1 = order A by i;
     generate flatten(Stitch(C1, Over(C1.i, 'lead', 0, 3, 3, 0)));
 }
 D = foreach C generate s, $9;

This is equivalent to the SQL statement

select s, lead(i, 3, 0) over (partition by si order by i rows between current row and 3 following) over T;

Any help will be appreciated.

1

There are 1 best solutions below

1
On BEST ANSWER

You have 3 questions, to which I can only answer the first two at the moment:

How to convert yyyy-mm-dd into a date and substract a day:

dataB = FOREACH data { 
    date = ToDate(eff_dt, 'yyyy-MM-dd');
    dayBefore = SubtractDuration(date, 'P1D');
    dayBeforeFormated = ToString(dayBefore, 'yyyy-MM-dd');

    GENERATE eff_dt, dayBeforeFormated;
}

I finally had a chance to try the Over and Stich method from piggybank. Here's a working solution.

-- first load the piggybank and define shorthand to Over and Stitch functions
REGISTER '/data/lib/piggybank-0.12.0.jar';
DEFINE Over org.apache.pig.piggybank.evaluation.Over();
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch();

-- load the input data
data = LOAD '/data' USING PigStorage('\t') AS (id:int, eff_dt:chararray);

-- generate the previous date (that could be done later)
data_before = FOREACH data { 
    date = ToDate(eff_dt, 'yyyy-MM-dd');
    dayBefore = SubtractDuration(date, 'P1D');
    eff_before = ToString(dayBefore, 'yyyy-MM-dd');
    GENERATE id as id, eff_dt as eff_dt, eff_before as eff_before;
}

-- Stitch join two bags based on position
-- Over apply a function on a group. Here we use the lead operator to get the next tuple
data_over = FOREACH (GROUP data_before ALL) {
    out = Stitch(data_before, Over(data_before.eff_before, 'lead', 0, 1, 1, '9999-99-99'));
    GENERATE FLATTEN(out) as (id, eff_dt, eff_before, end_dt);
}

-- finally, we output (we could have transform the date here)
data_final = FOREACH data_over GENERATE id, eff_dt, end_dt;

The output of this script is :

(1,2012-02-28,2013-03-14)
(1,2013-03-15,2014-04-30)
(1,2014-05-01,9999-99-99)