How to split folders to 3 datasets with ImageDataGenerator?

Question

1.1k Views Asked by zs2020 At 30 October 2025 at 15:20

The validation_split parameter lets ImageDataGenerator split the data it reads from a folder into 2 disjoint sets. Is there any way to use it to create 3 sets: training, validation, and evaluation?

I am thinking about splitting the dataset into 2 datasets, then splitting the 2nd dataset into another 2 datasets:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(validation_split=0.5, rescale=1./255)

train_generator = datagen.flow_from_directory(
    TRAIN_DIR, 
    subset='training'
)

val_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='validation'
)

Here I am thinking about splitting the validation dataset into 2 sets using val_generator: one for validation and the other for evaluation. How should I do it?

Original Q&A

There are 2 best solutions below

**bjornaer** · Answer 1

I have mostly been splitting data 80/10/10 for training, validation and test respectively.

When working with Keras I favor the tf.data API, as it provides a good abstraction for complex input pipelines.

It does not provide a simple tf.data.Dataset.split functionality, though.

I have this function (which I found in someone's code; I have lost the source) that I consistently use:

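import math

import tensorflow as tf


def get_dataset_partitions_tf(ds: tf.data.Dataset, ds_size, train_split=0.8, val_split=0.1, test_split=0.1, shuffle=True, shuffle_size=10000):
    # Compare with a tolerance: exact float equality fails for e.g. 0.7 + 0.2 + 0.1
    assert math.isclose(train_split + val_split + test_split, 1.0)

    if shuffle:
        # Fix the seed so the split is the same between runs, and disable
        # per-iteration reshuffling so elements cannot move between the
        # three partitions from one epoch to the next.
        ds = ds.shuffle(shuffle_size, seed=12, reshuffle_each_iteration=False)

    train_size = int(train_split * ds_size)
    val_size = int(val_split * ds_size)

    train_ds = ds.take(train_size)
    val_ds = ds.skip(train_size).take(val_size)
    # The test set is whatever remains after the train and validation elements
    test_ds = ds.skip(train_size).skip(val_size)

    return train_ds, val_ds, test_ds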

First, read your dataset and get its size (for example with the cardinality() method), then pass both into the function and you're good to go!

The function can be given a flag to shuffle the original dataset before creating the splits; this is useful for getting more realistic validation and test metrics.

The seed for shuffling is fixed so that rerunning the function yields the same splits, which we want for consistent results.
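To see how the sizes work out when the dataset size is not a multiple of 10, here is a minimal plain-Python sketch of the same take/skip arithmetic. It uses itertools.islice on a range as a stand-in for tf.data (an analogy only, not the tf.data API), and partition_indices is a hypothetical name introduced for illustration.

```python
from itertools import islice


def partition_indices(n_items, train_split=0.8, val_split=0.1):
    """Mimic the take/skip logic on range(n_items); test gets the remainder."""
    train_size = int(train_split * n_items)
    val_size = int(val_split * n_items)
    items = range(n_items)
    # like ds.take(train_size)
    train = list(islice(items, train_size))
    # like ds.skip(train_size).take(val_size)
    val = list(islice(items, train_size, train_size + val_size))
    # like ds.skip(train_size).skip(val_size)
    test = list(islice(items, train_size + val_size, None))
    return train, val, test


train, val, test = partition_indices(1005)
print(len(train), len(val), len(test))  # 804 100 101
```

Because the train and validation sizes are truncated with int(), any leftover elements end up in the test partition, so the three parts are disjoint and together cover the whole dataset.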

**Giora Simchoni** · Answer 2

I like working with the flow_from_dataframe() method of ImageDataGenerator, where I interact with a simple Pandas DataFrame (perhaps containing other features), not with the directory. But you can easily change my code if you insist on flow_from_directory().

So this is my go-to function, e.g. for a regression task, where we try to predict a continuous y:

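from tensorflow.keras.preprocessing.image import ImageDataGenerator

# images_df, images_dir, IMG_HEIGHT, IMG_WIDTH and batch_size are expected to exist as globals
def get_generators(train_samp, test_samp, validation_split = 0.1):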
    train_datagen = ImageDataGenerator(validation_split=validation_split, rescale = 1. / 255)
    test_datagen = ImageDataGenerator(rescale = 1. / 255)
    
    train_generator = train_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(train_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = True,
        subset = 'training',
        validate_filenames = False
    )
    valid_generator = train_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(train_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = False,
        subset = 'validation',
        validate_filenames = False
    )

    test_generator = test_datagen.flow_from_dataframe(
        dataframe = images_df[images_df.index.isin(test_samp)],
        directory = images_dir,
        x_col = 'img_file',
        y_col = 'y',
        target_size = (IMG_HEIGHT, IMG_WIDTH),
        class_mode = 'raw',
        batch_size = batch_size,
        shuffle = False,
        validate_filenames = False
    )
    return train_generator, valid_generator, test_generator

Things to notice:

I use two ImageDataGenerator objects: one shared by the training and validation generators, and one for the test generator.
The inputs to the function are the train/test indices (such as those returned by scikit-learn's train_test_split), which are used to filter the DataFrame index.
The function also takes a validation_split parameter for the training generator.
images_df is a DataFrame somewhere in global memory with proper columns like img_file and y.
There is no need to shuffle the validation and test generators.
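If the images live in the usual flow_from_directory layout (one subfolder per class), the file lists for three subsets can be built with the standard library alone. The following is only a sketch under that assumption: split_image_folder, IMAGE_SUFFIXES and the default fractions are hypothetical names and values introduced here, not part of Keras.

```python
import random
from pathlib import Path

# Assumed set of image extensions; adjust to your data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def split_image_folder(root, val_frac=0.1, test_frac=0.1, seed=12):
    """Return (train, val, test) lists of (path, class_name) pairs.

    Assumes the layout root/<class_name>/<image files>. Each class is
    shuffled and split separately, so every class is represented in each
    subset when it has enough images.
    """
    rng = random.Random(seed)  # fixed seed -> same split on every run
    train, val, test = [], [], []
    class_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for class_dir in class_dirs:
        # Sort first so the shuffle starts from a deterministic order
        files = sorted(f for f in class_dir.iterdir()
                       if f.suffix.lower() in IMAGE_SUFFIXES)
        rng.shuffle(files)
        n_val = int(val_frac * len(files))
        n_test = int(test_frac * len(files))
        rows = [(str(f), class_dir.name) for f in files]
        val += rows[:n_val]
        test += rows[n_val:n_val + n_test]
        train += rows[n_val + n_test:]  # remainder goes to training
    return train, val, test
```

Each of the three lists can then be loaded into a DataFrame with one column for the file path and one for the class, and passed to flow_from_dataframe() with x_col and y_col set to those columns. In that setup validation_split is no longer needed, since the validation set is already a separate list.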

This can be further generalized for multiple outputs, classification, what have you.
