I have the following data:
+-----------+-----------+-----------+-----+-----------+
| Env1_date | Env2_date | Env3_date | Pid | orderDate |
+-----------+-----------+-----------+-----+-----------+
| Null | Null | 1/9/2020 | abc | 10/6/2020 |
| Null | 1/9/2020 | 1/8/2020 | pqr | 10/4/2020 |
| 1/9/2020 | Null | Null | xyz | 10/2/2020 |
| 1/8/2020 | 1/7/2020 | Null | uvw | 10/1/2020 |
+-----------+-----------+-----------+-----+-----------+
I am trying to create 3 new columns which basically tells if Pid is valid for env1, env2 and env3.
For that, i first sort the records on orderDate column in descending order(already sorted in the above table).
If for
Env1_date,Env2_date,Env3_date, the top records areNull, they are considered valid. After theNullrecords, if the date is less than a particular date(in this example1/9/2020), its considered valid. Any other records are flagged as invalid.If top records are not
NULL, need to check if dates are equal to1/9/2020. If so, they are also flagged as valid
My output should look like below:
+-----------+-----------+-----------+-----+-----------+-----------+-----------+-----------+
| Env1_date | Env2_date | Env3_date | Pid | orderDate | Env1_Flag | Env2_Flag | Env3_Flag |
+-----------+-----------+-----------+-----+-----------+-----------+-----------+-----------+
| Null | Null | 1/9/2020 | abc | 10/6/2020 | Valid | Valid | Valid |
| Null | 1/9/2020 | 1/8/2020 | pqr | 10/4/2020 | Valid | Valid | Invalid |
| 1/9/2020 | Null | Null | xyz | 10/2/2020 | Valid | Invalid | Invalid |
| 1/8/2020 | 1/7/2020 | Null | uvw | 10/1/2020 | Invalid | Invalid | Invalid |
+-----------+-----------+-----------+-----+-----------+-----------+-----------+-----------+
I am trying to achieve this using Spark 1.5 and scala.
I tried using lag function. But not able to include all the scenarios.
Not sure how to approach this problem.
Can someone please help me here.
Note: Windows function, toDf(), createDataFrame() functions don't work in the spark i am using. Its a custom spark environment and there are few restrictions in place
One way you can do this is by collecting all your data to the driver and processing it as a regular array and then turning it to DF again. Beware though, data should fit into the driver.
I wrote the code that can work with data you provided. If you tune it a bit(especially the data comparison part), you should get what you expected.
Here're the outputs from printlines above