I have a scenario where I receive data in CSV files and need to generate some new columns from the existing ones. Example:
Col_1 Col_2 Col_3 Col_4
abc 1 No 123
xyz 2 Yes 123
def 1 Yes 345
Expected:
Col_1 Col_2 Col_3 Col_4 Col_5 Col_6
abc 1 No 123 1 1
xyz 2 Yes 123 0 0
def 1 Yes 345 0 1
Col_5 Condition : if Col_1 = 'abc' then 1 else 0 end
Col_6 Condition : max(Col_5) over (Col_2)
I know we can perform transformations in Druid while loading the file. I tried a simpler condition and it works fine for me, but I doubt whether aggregates and other transformations such as Col_6 can be done this way.
We also need to aggregate across data from the different files we receive. Suppose we get 2 files today and load them into the Druid table, and tomorrow we get 3 more files containing data for the same ID (Col_2 here). The aggregation then has to be based on all the records we have, for example when generating Col_6.
Is this possible in Druid?
Col_5 Condition : if Col_1 = 'abc' then 1 else 0
You can use the following
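A minimal sketch of the Col_5 condition, assuming the CSV has been loaded into a pandas DataFrame named df (pandas and the name df are assumptions for illustration; the sample rows are taken from the question):

```python
import pandas as pd

# Sample rows from the question; in practice this would come from pd.read_csv(...).
df = pd.DataFrame({
    "Col_1": ["abc", "xyz", "def"],
    "Col_2": [1, 2, 1],
    "Col_3": ["No", "Yes", "Yes"],
    "Col_4": [123, 123, 345],
})

# Col_5 is 1 where Col_1 equals 'abc', otherwise 0.
df["Col_5"] = (df["Col_1"] == "abc").astype(int)
```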
Col_6 Condition : max(Col_5) over (Col_2)
You can apply a window operation.
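One way to sketch the window step, assuming pandas and a DataFrame that already contains Col_5 (the name df_max is taken from the text below; the rest is illustrative). Here groupby(...).transform("max") plays the role of a window function partitioned by Col_2:

```python
import pandas as pd

# Col_2 and Col_5 for the sample rows (abc, xyz, def) from the question.
df = pd.DataFrame({
    "Col_2": [1, 2, 1],
    "Col_5": [1, 0, 0],
})

# transform("max") returns one value per input row, similar to
# MAX(Col_5) OVER (PARTITION BY Col_2) in SQL.
df_max = df[["Col_2"]].copy()
df_max["Col_6"] = df.groupby("Col_2")["Col_5"].transform("max")
```

For the sample rows, abc and def share Col_2 = 1, so both get Col_6 = 1 under this condition. In PySpark the corresponding building blocks would be Window.partitionBy("Col_2") together with F.max("Col_5").over(...).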
Now remove duplicates for each Col_2 and then join df_max with your main df.
The above code snippet is in Python, but the Spark API is much the same, so you can use it with minimal changes.
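The de-duplicate and join step can be sketched as follows, again assuming pandas. The df_max contents here are written out by hand to stand in for the result of the window step:

```python
import pandas as pd

# Main DataFrame with Col_5 already computed (sample rows from the question).
df = pd.DataFrame({
    "Col_1": ["abc", "xyz", "def"],
    "Col_2": [1, 2, 1],
    "Col_5": [1, 0, 0],
})

# Stand-in for the window result: one Col_6 value per original row.
df_max = pd.DataFrame({
    "Col_2": [1, 2, 1],
    "Col_6": [1, 0, 1],
})

# Keep a single row per Col_2 so the join does not multiply rows.
df_max = df_max.drop_duplicates(subset="Col_2")

# A left join keeps every row of the main DataFrame and attaches Col_6.
result = df.merge(df_max, on="Col_2", how="left")
```

Because the maximum is recomputed over whatever rows are in df, rerunning these steps after appending newly received files would reflect all records for each Col_2.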