Yielding files in a directory of directories without inner directories

My task is to train an ML model. I want to yield the files one at a time to avoid memory problems later. I found a solution and adjusted it a bit, but the modified version does not quite meet my needs. Assume the folder structure looks like this:

../
    A/
        2014-01-01
        2014-01-05
        2014-01-06
    B/
        2014-01-02
        2014-01-06
    ...

So essentially, inside the top-level folder (test) I have subdirectories such as A, B, and so on. Within each of those subdirectories I have files named by date: 2014-01-01, etc.

What I need is a generator that yields the files in datetime order while ignoring the directories themselves. The order of the subdirectories does not matter: I can get the files from B first and then from A.

I have the following code at the moment:

import datetime
import pathlib

def sort_func(x):
    x_ = x
    x = str(x)
    # dates - files: sort by the parsed date
    try:
        return datetime.datetime.strptime(x, "%Y-%m-%d")
    # folder: not a date, fall back to the path itself
    except ValueError:
        return x_

p = pathlib.Path('../datasets/train/')

# '**/*' matches every file and directory below p, recursively
a = sorted(p.glob('**/*'), key=sort_func)

And this would output something like this:

[PosixPath('../datasets/train/A'),
 PosixPath('../datasets/train/A/2014-01-01'),
 PosixPath('../datasets/train/A/2014-01-02'),
 PosixPath('../datasets/train/A/2014-01-03'),
...]

That is, I do not need the first path or any of the directory paths.

How do I omit these?

EDIT: Actually, p.glob('*/*') seems to do the trick, but sorted(...) gives me a list instead of yielding the files one by one.

Answer by blhsing:

You can filter with the is_file method of the Path object:

a = sorted([path for path in p.glob('**/*') if path.is_file()], key=sort_func)
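If the goal is a generator rather than a list, one possible approach is to sort only the paths and then yield them one at a time. This is a minimal sketch, not the answerer's code: the names date_key and iter_files_by_date are hypothetical, and it assumes every file name is exactly a date in the form YYYY-MM-DD with no extension (with an extension, path.stem might be used in place of path.name). It parses path.name rather than the full path, because strptime would likely reject a string such as '../datasets/train/A/2014-01-01'.

```python
import datetime
import pathlib
from typing import Iterator


def date_key(path: pathlib.Path) -> datetime.datetime:
    # Parse only the last path component (e.g. "2014-01-01"), not the whole path.
    # Assumption: the file name is exactly a date with no extension.
    return datetime.datetime.strptime(path.name, "%Y-%m-%d")


def iter_files_by_date(root: pathlib.Path) -> Iterator[pathlib.Path]:
    # '*/*' matches entries exactly one level inside each subdirectory of root;
    # is_file() drops any directories that happen to match.
    files = (path for path in root.glob("*/*") if path.is_file())
    # sorted() has to see every path before it can order them, but it only
    # holds small Path objects in memory, never the file contents.
    yield from sorted(files, key=date_key)
```

A caller could then loop with for path in iter_files_by_date(p) and open each file inside the loop, so only one file's contents needs to be in memory at a time. Files that share a date across subdirectories come out in whatever order glob produced them, which matches the requirement that subdirectory order does not matter.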