How do I write this SAS data-step logic (a value retained across rows) in PySpark?
Problem description:
The code iterates through each row of the dataset, processed in groups of coll_id_latest. On the first record of each group total_alloc is reset to zero, and for the remaining records a new column final_pba is derived from the if/else conditions. total_alloc has to be retained across rows and reused in the next iteration. I tried writing a for loop in PySpark to iterate over each row, but it is very inefficient and throws memory errors on big datasets. Is there a better way to write this, for example with window functions, given that a value has to be carried over from one iteration to the next? Sample code to refer to would be much appreciated.
Below is the SAS code:
data x;
  set y;
  by coll_id_latest;
  retain total_alloc 0;

  if first.coll_id_latest then do;
    total_alloc = 0;
  end;

  if lowcase(acct) = "primary" then do;
    if exposure not in (., 0) and total_alloc < pba1 then
      final_pba = min(exposure, (pba1 - total_alloc));
    else final_pba = 0;
  end;

  if lowcase(acct) = "secondary"
     and lowcase(suff_ind) ^= 'y' and total_alloc < pba1 then do;
    final_pba = min(max(pba1 - total_alloc, 0), exposure);
  end;

  total_alloc = sum(total_alloc, final_pba);
run;
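
For concreteness, here is a minimal, untested sketch of the kind of translation I have in mind, using groupBy().applyInPandas so the per-group loop runs inside pandas on the executors rather than row by row on the driver. It assumes y has exactly the columns coll_id_latest, acct, suff_ind, exposure and pba1, plus an explicit row_order column that reproduces the SAS dataset order (Spark has no inherent row order, so such a column is an assumption):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
y = spark.table("y")  # hypothetical source; replace with the real DataFrame

def allocate(pdf: pd.DataFrame) -> pd.DataFrame:
    # One SAS data-step pass over a single coll_id_latest group.
    pdf = pdf.sort_values("row_order")          # assumed ordering column
    total_alloc = 0.0                           # retain total_alloc 0 (reset per group)
    finals = []
    for row in pdf.itertuples(index=False):
        final_pba = None                        # non-retained SAS vars reset to missing
        acct = (row.acct or "").lower()
        if acct == "primary":
            if pd.notna(row.exposure) and row.exposure != 0 and total_alloc < row.pba1:
                final_pba = min(row.exposure, row.pba1 - total_alloc)
            else:
                final_pba = 0.0
        if (acct == "secondary" and (row.suff_ind or "").lower() != "y"
                and total_alloc < row.pba1):
            final_pba = min(max(row.pba1 - total_alloc, 0.0), row.exposure)
        if final_pba is not None:               # SAS sum() ignores missing values
            total_alloc += final_pba
        finals.append(final_pba)
    return pdf.assign(final_pba=finals)

out_schema = ("coll_id_latest string, acct string, suff_ind string, "
              "exposure double, pba1 double, row_order long, final_pba double")

x = y.groupBy("coll_id_latest").applyInPandas(allocate, schema=out_schema)

If pba1 is constant within each coll_id_latest group and exposure is never negative, I believe the retained total can even be rewritten as a pure window expression: the running total_alloc is just the cumulative sum of eligible exposures capped at pba1, so final_pba falls out as the difference between the capped cumulative sum and its previous value. A sketch of that idea (same column assumptions as above; note it assigns 0 rather than missing to secondary rows once the cap is reached, a small divergence from the SAS step):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("coll_id_latest").orderBy("row_order")

# Row-wise eligibility mirroring the two SAS branches.
eligible = (
    ((F.lower("acct") == "primary")
     & F.col("exposure").isNotNull() & (F.col("exposure") != 0))
    | ((F.lower("acct") == "secondary")
       & (F.coalesce(F.lower("suff_ind"), F.lit("")) != "y"))
)

x = (
    y.withColumn("elig_exp", F.when(eligible, F.col("exposure")).otherwise(F.lit(0.0)))
     .withColumn("cum_exp", F.sum("elig_exp").over(w))   # running eligible exposure
     .withColumn(
         "final_pba",
         # allocated so far = min(cum_exp, pba1); this row's share is the increment
         F.when(eligible,
                F.least(F.col("cum_exp"), F.col("pba1"))
                - F.least(F.col("cum_exp") - F.col("elig_exp"), F.col("pba1")))
          .when(F.lower("acct") == "primary", F.lit(0.0)),  # ineligible primary -> 0
     )
     .drop("elig_exp", "cum_exp")
)

Is either of these directions sound, or is there a better pattern for retaining a value across rows?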