Why is the order not respected in a for loop using Dask?

Why, when I run the for loop in the code below, does Dask do 'Four' first, then 'One', and so on, instead of starting from the first element and finishing with the last? Is it possible that I get mixed (wrong) results, where for example the content of one file/folder ends up in another, or where conditions inside the for loop are ignored, etc.?

Thanks in advance!

import os
import pandas as pd
from dask import compute, delayed

def compa(filename):
    # Read one input file and write one output file per flag column.
    filex = pd.read_json('folder/{}'.format(filename))
    for jj in ['Zero', 'One', 'Two', 'Three', 'Four']:
        filexz = filex[filex[jj] == 1].reset_index(drop=True)
        newpath = 'Newfolder/{}'.format(jj)
        if not os.path.exists(newpath):
            os.makedirs(newpath)
        filexz.to_json('{}/{}'.format(newpath, filename))

delayed_results = [delayed(compa)(filename) for filename in filelist]
compute(*delayed_results, scheduler='processes')

Code for replication purposes:

import pandas as pd
sof1=pd.DataFrame({'minus': ['a', 'b', 'c', 'd', 'e'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof2=pd.DataFrame({'minus': ['aa', 'bb', 'cc', 'dd', 'ee'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof3=pd.DataFrame({'minus': ['az', 'bz', 'cz', 'dz', 'ez'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof4=pd.DataFrame({'minus': ['azy', 'bzy', 'czy', 'dzy', 'ezy'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof5=pd.DataFrame({'minus': ['azx', 'bzx', 'czx', 'dzx', 'ezx'],'Zero': [1, 0, 0, 0, 0],'One': [0, 1, 0, 0, 0],'Two': [0, 0, 1, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof6=pd.DataFrame({'minus': ['azw', 'bzw', 'czw', 'dzw', 'ezw'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof7=pd.DataFrame({'minus': ['azyq', 'bzyq', 'czyq', 'dzyq', 'ezyq'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof8=pd.DataFrame({'minus': ['azxq', 'bzxq', 'czxq', 'dzxq', 'ezxq'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})
sof9=pd.DataFrame({'minus': ['azwq', 'bzwq', 'czwq', 'dzwq', 'ezwq'],'Zero': [1, 0, 0, 0, 0],'One': [0, 0, 1, 0, 0],'Two': [0, 1, 0, 0, 0],'Three': [0, 0, 0, 0, 1],'Four': [0, 0, 0, 1, 0]})

filelist=[sof1,
sof2,
sof3,
sof4,
sof5,
sof6,
sof7,
sof8,
sof9]

import pandas as pd
import dask
from dask import compute, delayed
import os

def compa(filename):
    # Note: here each "filename" is actually one of the DataFrames from filelist.
    filex = filename
    for jj in ['Zero', 'One', 'Two', 'Three', 'Four']:
        filexz = filex[filex[jj] == 1].reset_index(drop=True)
        newpath = 'Newfolderstackoverflow/{}'.format(jj)
        if not os.path.exists(newpath):
            os.makedirs(newpath)
        filexz.to_json('{}/{}'.format(newpath, filename.loc[1, 'minus']))

delayed_results = [delayed(compa)(filename) for filename in filelist]
compute(*delayed_results, scheduler='processes')

As the code above runs almost immediately, I did not know how to record the creation order, but the "Four" and "One" folders are created first, then the rest! (The order in which the files are created within each folder does not follow the order of filelist either, which is understandable to me, since those files are supposed to be computed in parallel.)
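
One way to record the order in which the tasks actually start is to print a timestamp at the top of each task. This is only a sketch reusing compa and filelist from the replication code above; the compa_logged wrapper is an addition for illustration, and with scheduler='processes' the prints from the worker processes normally reach the terminal when the script is run from the command line.

import time
from dask import compute, delayed

def compa_logged(df):
    # Print the wall-clock time at which this task starts, so the actual
    # execution order can be compared with the order of filelist.
    print('{:.3f}  starting {}'.format(time.time(), df.loc[1, 'minus']))
    compa(df)

delayed_results = [delayed(compa_logged)(df) for df in filelist]
compute(*delayed_results, scheduler='processes')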

Thanks to the comments and answers, especially those of @MichaelDelgado, here is how it got solved: I added a 60-second sleep and noticed that every 60 seconds two new files are created, folder by folder from Zero up to Four. The reason for my initial problem was that the last couple of files were added to all five folders within the same minute, so sorting the folders by time was meaningless and my OS sorted them alphabetically (hence "Four" then "One").
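
The sleep-based diagnostic would look roughly like the following (a sketch based on the replication code above; the exact placement of the sleep inside the inner loop is an assumption, chosen because it matches the described pattern of files appearing folder by folder):

import os
import time

def compa(df):
    for jj in ['Zero', 'One', 'Two', 'Three', 'Four']:
        time.sleep(60)  # spread the writes out so the creation timestamps differ
        filexz = df[df[jj] == 1].reset_index(drop=True)
        newpath = 'Newfolderstackoverflow/{}'.format(jj)
        os.makedirs(newpath, exist_ok=True)
        filexz.to_json('{}/{}'.format(newpath, df.loc[1, 'minus']))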

There is 1 best solution below

The order in which tasks are executed is determined by several factors:

  • user-specified priorities (see the sketch after this list);
  • FIFO order;
  • graph structure.
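
If, for example, you want the tasks to start roughly in the order of filelist, the most direct lever is a user-specified priority. The sketch below is only an illustration: it assumes the distributed scheduler (dask.distributed), which honours per-task priorities (the plain 'processes' scheduler used in the question does not take a priority argument), and it reuses the compa function and filelist from the question.

from dask.distributed import Client

client = Client(n_workers=2)  # local cluster; its scheduler honours task priorities

# Higher priorities run earlier, so -i gives the first DataFrame the highest priority.
futures = [client.submit(compa, df, priority=-i) for i, df in enumerate(filelist)]
client.gather(futures)

Even then, several tasks still run concurrently, so the priority only influences when tasks start; if you need strictly one-after-the-other execution you would have to chain the tasks so that each depends on the previous one.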

With regard to the possibility of a mix-up: as long as the code inside the task is correct (so no two processes write to the same file at the same time), this should not happen. As noted in the comment by @mdurant, it looks like your loop writes to the same file multiple times.
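
If that is a concern, a defensive pattern is to build every output path from both the loop variable and an identifier of the input, so that no two tasks (and no two iterations within one task) can ever target the same file. This is only a sketch; compa_safe and the file_id argument are illustrative names, not from the question.

import os
from dask import compute, delayed

def compa_safe(df, file_id):
    # Every (file_id, column) pair maps to exactly one output path, so writes
    # from different tasks and different loop iterations never collide.
    for jj in ['Zero', 'One', 'Two', 'Three', 'Four']:
        subset = df[df[jj] == 1].reset_index(drop=True)
        outdir = os.path.join('Newfolderstackoverflow', jj)
        os.makedirs(outdir, exist_ok=True)  # safe even if another process already created it
        subset.to_json(os.path.join(outdir, '{}.json'.format(file_id)))

delayed_results = [delayed(compa_safe)(df, 'sof{}'.format(i + 1)) for i, df in enumerate(filelist)]
compute(*delayed_results, scheduler='processes')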