My task is to train an ML model. I want to yield the files one at a time to avoid memory problems later on. I found a solution and adjusted it a bit, but the modified version does not quite meet my needs. Assume my folder structure looks like this:
../
    A/
        2014-01-01
        2014-01-05
        2014-01-06
    B/
        2014-01-02
        2014-01-06
    ...
So essentially, in the folder test I have subdirectories such as A, B, etc. Within each of those subdirectories I have files named by date: 2014-01-01, etc.
What I need is a generator that yields the files in datetime order, ignoring the directories themselves (the order of the subdirectories does not matter; I can get the files from B first and then from A).
I have the following code at the moment:
import datetime
import pathlib

def sort_func(x):
    x_ = x
    x = str(x)
    # dates - files
    try:
        return datetime.datetime.strptime(x, "%Y-%m-%d")
    # folder. Ignore
    except ValueError:
        return x_
    except Exception:
        raise

p = pathlib.Path('../datasets/train/')
a = sorted(p.glob('**/*'), key=sort_func)
And this would output something like this:
[PosixPath('../datasets/train/A'),
PosixPath('../datasets/train/A/2014-01-01'),
PosixPath('../datasets/train/A/2014-01-02'),
PosixPath('../datasets/train/A/2014-01-03'),
...]
That is, I do not need the first path, nor any of the other directory paths.
How do I omit these?
EDIT: Actually, p.glob('*/*') seems to do the trick for skipping the top-level directories, but sorted(...) gives me a list instead of yielding the files one by one.
You can filter with the is_file method of the Path object:
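A minimal sketch of that filtering, under a few assumptions: every file is named exactly in the YYYY-MM-DD format with no extension, the files sit one level below the root (root/A/2014-01-01), and the dates should be ordered across all subdirectories together. The helper names parse_date and iter_files_by_date are hypothetical, chosen only for illustration.

```python
import datetime
import pathlib
from typing import Iterator


def parse_date(path: pathlib.Path) -> datetime.datetime:
    # Parse only the final path component (e.g. "2014-01-01"), not the full path.
    # Assumption: every file name matches %Y-%m-%d; otherwise ValueError is raised.
    return datetime.datetime.strptime(path.name, "%Y-%m-%d")


def iter_files_by_date(root: pathlib.Path) -> Iterator[pathlib.Path]:
    # Keep regular files only; directories such as A/ and B/ are dropped here.
    files = (p for p in root.glob("*/*") if p.is_file())
    # sorted() must see every path before it can order them, so it builds a
    # list of Path objects; the file contents are still only read by the caller.
    yield from sorted(files, key=parse_date)
```

Usage would look like `for path in iter_files_by_date(pathlib.Path('../datasets/train/')): ...`. Sorting cannot be fully lazy, but the list it builds holds only paths, so memory use should stay small as long as each file is opened inside the loop.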