I am trying to run a report on my Azure Data Lake Gen2 storage. I have written the recursive function below, which descends into every folder and lists files down to the last level.
def recursive_ls(path: str):
    """List all files under path recursively, yielding (path segments, size)."""
    for file in dbutils.fs.ls(path):
        if not file.path.endswith('/'):
            # It's a file: yield the path segments of interest and the file size
            yield (file.path.split('/')[3:11], file.size)
        else:
            # It's a directory: recurse into it
            yield from recursive_ls(file.path)
I have a very large number of files, and as a result this function has not finished even after 2 hours.
This is probably because the whole traversal runs in a single process on the driver. I need a way to run these listing calls in parallel, e.g. in a multiprocessing environment.
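Since `dbutils.fs.ls` is I/O-bound, one common approach is a breadth-first traversal where all directories at the same depth are listed concurrently with a thread pool. Below is a minimal, hedged sketch (not a definitive implementation): `parallel_ls` takes a pluggable `list_dir` callable so it can be tested anywhere; `local_list_dir` is a hypothetical local stand-in, and on Databricks you would replace it with a thin wrapper around `dbutils.fs.ls` that returns `(path, size, is_dir)` tuples.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def parallel_ls(root, list_dir, max_workers=8):
    """Breadth-first parallel listing.

    Each level of the directory tree is listed concurrently instead of
    one directory at a time, which helps when listing is I/O-bound.
    `list_dir(path)` must return an iterable of (path, size, is_dir).
    """
    files, dirs = [], [root]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while dirs:
            # List every directory on the current level in parallel.
            results = pool.map(list_dir, dirs)
            dirs = []
            for entries in results:
                for path, size, is_dir in entries:
                    if is_dir:
                        dirs.append(path)   # queue for the next level
                    else:
                        files.append((path, size))
    return files

def local_list_dir(path):
    """Local-filesystem stand-in for dbutils.fs.ls, so the sketch runs anywhere."""
    return [(e.path, e.stat().st_size if e.is_file() else 0, e.is_dir())
            for e in os.scandir(path)]
```

On Databricks, the `list_dir` wrapper would look roughly like `lambda p: [(f.path, f.size, f.path.endswith('/')) for f in dbutils.fs.ls(p)]` (assuming directory paths end in `/`, as in your snippet). Note that threads, not `multiprocessing`, are usually the right fit here because the work is network I/O, not CPU.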