Talend ETL To filter one row from a set of rows based on a condition

1.3k Views Asked by At

I have a set of rows in a table X as

job1   20170506 Sat   Not Run
job1   20170507 Sun   Completed
job1   20170508 Mon   Not Run
job1   20170513 Sat   Completed
job1   20170514 Sun   Failed
job1   20170515 Mon   Not Run
job1   20170520 Sat   Not Run
job1   20170521 Sun   Not run
job1   20170522 Mon   Not Run
job2   20170506 Sat   Failed
job2   20170507 Sun   Not run
job2   20170508 Mon   Completed
job2   20170513 Sat   Completed
job2   20170514 Sun   Completed
job2   20170515 Mon   Completed
job2   20170520 Sat   Not Run
job2   20170521 Sun   Not run
job2   20170522 Mon   Not Run
 and so on.

I need to take weekend (Sat, Sun, Mon) as a single entity and should get a resulting table as follows:

job1   20170507 Sun   Completed
job1   20170513 Sat   Completed
job1   20170522 Mon   Not Run
job2   20170508 Mon   Completed
job2   20170515 Mon   Completed
job2   20170522 Mon   Not Run

If the job has an entry completed in any of these three days, it is considered as completed. Otherwise if all are Not run, then latest entry with Not Run is to be taken. How is it possible in talend.

I can create a table containing Sat to Mon interval dates as follows:

Table Y

20170506 Sat  20170508 Mon
20170513 Sat  20170515 Mon
20170520 Sat  20170522 Mon

Please give me an idea, should I use tJava, or tIntervalMatch components, how is it possible.

3

There are 3 best solutions below

0
On

Ok. If I understood your requirement clearly then I think this would be your answer.

Input(EntireTable) --> Tmap(Left outer Join) --> Complt_FLG = "Y" then all set ^ | | Lkp(Records only the status as Completed, Complt_FLG as "Y")

if Complt_FLG is null (that means not Y) then use taggregate and get max date record by grouping on Job id.

Please let me know if this is what you are looking for or any questions.

0
On

Here is my proposal: enter image description here

1st subjob is here to add sequence to the input based on the weekday and to separate Completed and NotCompleted jobs.
For this purpose I use the following tMap_23 in which both flows have the same shema: enter image description here

The new field jobNumber contains the value of the jobname + a sequence number separated by a dash (job-1 for example).
As the sequence is based on the weekday, every lines associated to the same jobname have the same value for jobNumber.
Result for both flows is memorized into a dedicated tHashOutput (here called Completed and NotCompleted).

After that, a 2nd subjob is started to exclude NotCompleted records associated to a Completed one. This the role of tMap_24: enter image description here This is a classic inner join based on jobNumber. Inner join rejected records are catched, then sorted on jobNumber asc and execDate desc (tSortRow_1) and finaly records with the same jobNumber are eliminated except the 1st one (tUniqRow_6) and the result is appended at the end of the tHashOutput called Completed in the 1st subjob (here RawResult task).

Last subjob read the tHashInput called RawResult, remove the extra field jobNumber, sort all the record based on job before to display the result on the console: enter image description here

Hope this helps.
TRF

0
On

Here is my answer:-

I have divided the task into two subjobs.

  1. First subjob reads the data of Table Y and iterates for one by one weekend at a time. Here I have taken the data of Table Y into file, please find the below snapshot link.

enter image description here

  1. In the main subjob we will fetch the week start and week end date in the variable max_date and min_date and will pass these two the next child subjob.

  2. In the child subjob we will fetch the data from the main file and sort that data based on job_name and date and will pass the records to tjavaflex. Please find the image of tSortRow properties and the child subjob.

Child Subjob Image : enter image description here

Here in the child subjob I have created 8 context variables:-

a.) MIN_DATE : is getting through parent job b.) MAX_DATE : is getting through parent job c.) JOB_NAME : is the name of the job in the processing record d.) DATE : is the execution date of the job for processing record e.) DAY : is the execution day of the job for processing record f.) STATUS : status of the job g.) OUT_JOB_NAME h.) OUT_DATE i.) OUT_DAY j.) OUT_STATUS

OUT prefix variable are those which needs to be printed as an output.

In tJavaflex, I have written the logic, it checks wheter the incoming row is in between the week and checks for the status= completed, once we get the status=completed, we will set the flag='false'. so that for the same job_name, it wont check the condition again.

Once the job_name gets changed, flag value becomes true and repeats the above steps again for that weekend.