In PySpark, how do I iterate through each row of a DataFrame, do calculations, retain a value, and use it in the next step, by partition?


Sample input: (provided as an image; link broken)

Expected output: (provided as an image; link broken)

How do I write this in PySpark?

Problem description:

The code should iterate through each row of the dataset, partitioned by coll_id_latest. For the first record of each partition, total_alloc is reset to zero; for the remaining records, a new column final_pba is derived from the if/else conditions below. The running total_alloc must be retained and used in the next iteration.

I tried writing a for loop in PySpark to iterate through each row, but it is very inefficient and gives memory errors on big datasets. Is there a better way to write this, for example with window functions? The difficulty is that a value has to be retained and reused in the next iteration. Could anyone share sample code I can refer to?

Below is the SAS code:

data x;
    set y;
    by coll_id_latest;
    retain total_alloc 0;

    /* reset the running total at the start of each coll_id_latest group */
    if first.coll_id_latest then total_alloc = 0;

    if lowcase(acct) = "primary" then do;
        if exposure not in (., 0) and total_alloc < pba1 then
            final_pba = min(exposure, pba1 - total_alloc);
        else
            final_pba = 0;
    end;

    if lowcase(acct) = "secondary"
       and lowcase(suff_ind) ^= 'y' and total_alloc < pba1 then do;
        final_pba = min(max(pba1 - total_alloc, 0), exposure);
    end;

    /* carry the allocation forward; sum() ignores a missing final_pba */
    total_alloc = sum(total_alloc, final_pba);
run;
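For reference, here is a minimal sketch of one possible approach. Because final_pba depends on the running total_alloc, a plain window cumulative sum cannot express this; instead, the sequential retain logic can run per group with groupBy(...).applyInPandas, so each coll_id_latest group is processed locally in pandas rather than by looping over Spark rows. It assumes the columns coll_id_latest, acct, suff_ind, exposure, and pba1 from the SAS step, plus a hypothetical row_id column that defines the within-group row order; adjust names to your data.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
df = spark.table("y")  # hypothetical source; replace with your DataFrame

def allocate(group: pd.DataFrame) -> pd.DataFrame:
    """Replay the SAS data-step logic sequentially within one group."""
    group = group.sort_values("row_id")  # row_id: assumed ordering column
    total_alloc = 0.0                    # SAS: retain total_alloc 0, reset per group
    finals = []
    for row in group.itertuples(index=False):
        # Note: SAS leaves final_pba missing when neither branch fires;
        # this sketch uses 0.0 instead.
        final_pba = 0.0
        if str(row.acct).lower() == "primary":
            # SAS: exposure not in (., 0) -> non-missing and non-zero
            if pd.notna(row.exposure) and row.exposure != 0 and total_alloc < row.pba1:
                final_pba = min(row.exposure, row.pba1 - total_alloc)
        elif (str(row.acct).lower() == "secondary"
              and str(row.suff_ind).lower() != "y"
              and total_alloc < row.pba1):
            # assumes exposure is non-missing for secondary rows
            final_pba = min(max(row.pba1 - total_alloc, 0.0), row.exposure)
        total_alloc += final_pba         # retained value used by the next row
        finals.append(final_pba)
    return group.assign(final_pba=finals)

out_schema = StructType(df.schema.fields + [StructField("final_pba", DoubleType())])
result = df.groupBy("coll_id_latest").applyInPandas(allocate, schema=out_schema)

Each group must fit in memory as a pandas DataFrame on one executor, but the row-by-row retain logic then runs locally per group, which is usually far faster than collecting and looping over the whole dataset on the driver.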