How to split a tensorflow dataset

I have a set of data stored in a CSV file. Currently I read it into a NumPy array and then convert it into a Dataset with the code below:

import numpy as np
import tensorflow as tf

def read_data(fname):
    with open(fname, "r") as f:
        lines = f.read().split("\n")
        header = lines[0].replace('"', "").split(",")
        # Drop the header row and any trailing empty line left by split("\n")
        lines = [line for line in lines[1:] if line]

        print(header)
        print(len(lines))

    # The first CSV column (e.g. a timestamp) is skipped; the rest are floats
    float_data = np.zeros((len(lines), len(header) - 1))
    for i, line in enumerate(lines):
        values = [float(x) for x in line.split(",")[1:]]
        float_data[i, :] = values

    return tf.data.Dataset.from_tensor_slices(float_data)

Then I want to define a generator function that pulls training data from this dataset, but it seems a Dataset is not subscriptable: with NumPy I can use [:2] to slice the data, but a Dataset cannot be sliced that way.

How can I do it?
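To illustrate what I mean, here is a minimal sketch (the array and sizes are just illustrative): NumPy supports slicing directly, while a Dataset only offers take/skip and iteration.

import numpy as np
import tensorflow as tf

float_data = np.arange(12.0).reshape(6, 2)  # stand-in for the CSV data
ds = tf.data.Dataset.from_tensor_slices(float_data)

print(float_data[:2])  # NumPy: direct slicing works

# Dataset: ds[:2] raises an error; the closest equivalent is take/skip
first_two = list(ds.take(2).as_numpy_iterator())
rest = list(ds.skip(2).as_numpy_iterator())
print(first_two)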

Below is my generator function for NumPy input (the first parameter is a NumPy array):

def generator(data, lookback, delay, min_index, max_index,
              shuffle = False, batch_size = 128, step = 6):
    # Yields (samples, targets) batches:
    #   samples: lookback // step past timesteps per row
    #   targets: column 1 of the row `delay` steps in the future
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback

    while True:
        if shuffle:
            # Pick random window end points within the allowed range
            rows = np.random.randint(min_index + lookback, max_index, size = batch_size)
        else:
            # Walk through the data sequentially, wrapping around at the end
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows),
                            lookback // step,
                            data.shape[-1]))
        targets = np.zeros((len(rows),))

        for j, row in enumerate(rows):
            # One input window per target row, sampled every `step` timesteps
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]

        yield samples, targets
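For reference, this is how I call the generator with NumPy input; the window sizes are just example values, and float_data is the NumPy array from read_data before it is wrapped in a Dataset.

train_gen = generator(float_data,
                      lookback = 240,   # look 240 timesteps back
                      delay = 24,       # predict 24 timesteps ahead
                      min_index = 0,
                      max_index = 5000,
                      shuffle = True)

samples, targets = next(train_gen)
print(samples.shape, targets.shape)  # (128, 40, num_features) and (128,)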

I'm not sure whether a Dataset can do the same thing I did with NumPy.

I can apply tf.data.Dataset.from_tensor_slices at the end of this generator, but performance was poor even with from_generator(generator).prefetch(). I assume that is because the data is large and throughput is limited by the CPU processing the NumPy arrays (I referenced this question: Tensorflow: How to prefetch data on the GPU from CPU tf.data.Dataset (from_generator)), so I want to load the data as tensors from the beginning to see whether that speeds up my code.
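For what it's worth, below is a tensor-native sketch of the same windowing built only from tf.data ops (window, flat_map, map); the parameter values are illustrative and I have not benchmarked it:

lookback, delay, step, batch_size = 240, 24, 6, 128

ds = tf.data.Dataset.from_tensor_slices(float_data)

# Each window must cover the lookback plus the horizon of the target
window_size = lookback + delay + 1
windows = ds.window(window_size, shift = 1, drop_remainder = True)
windows = windows.flat_map(lambda w: w.batch(window_size))

def split_window(w):
    sample = w[:lookback:step]       # every `step`-th timestep of the lookback
    target = w[lookback + delay][1]  # column 1, `delay` steps past the window
    return sample, target

pipeline = (windows
            .map(split_window, num_parallel_calls = tf.data.AUTOTUNE)
            .batch(batch_size)
            .prefetch(tf.data.AUTOTUNE))

Newer TensorFlow versions also ship tf.keras.utils.timeseries_dataset_from_array, which builds a similar windowed dataset directly.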

Thanks!

1 Answer

You can split the TensorFlow dataset using skip and take, as in the sample code below.

test_ds_size = int(length * 0.2)  # 20 percent; length is the number of examples, e.g. ds.cardinality().numpy()
train_ds = ds.skip(test_ds_size)  # remaining 80 percent for training
test_ds = ds.take(test_ds_size)   # first 20 percent for testing
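One caveat: skip and take split by position, so if the rows are independent but stored in some order, you may want to shuffle once before splitting; a minimal sketch, with an assumed buffer size (for time series, a positional split is usually what you want anyway, to avoid leakage):

# Shuffle once with a fixed seed so skip/take see the same order
shuffled = ds.shuffle(buffer_size = 10000, seed = 42, reshuffle_each_iteration = False)
train_ds = shuffled.skip(test_ds_size)
test_ds = shuffled.take(test_ds_size)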

It's better to use prefetch so that the next batch is queued while the current one is training; this overlaps data preparation with training and increases speed. You can use prefetch in the below way.

ds.batch(batch_size).prefetch(1)  
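Putting the pieces together, a sketch of a full input pipeline; the batch and buffer sizes are illustrative, and tf.data.AUTOTUNE lets TensorFlow pick the prefetch depth:

batch_size = 128

train_ds = (train_ds
            .shuffle(buffer_size = 10000)   # illustrative buffer size
            .batch(batch_size)
            .prefetch(tf.data.AUTOTUNE))    # let TF tune the queue depth

test_ds = test_ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)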

You can also combine the GPU with multiple CPU cores: the CPUs prepare the next batches in parallel while the GPU trains, which improves overall throughput.

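A sketch of that idea: per-element preprocessing runs on multiple CPU threads via num_parallel_calls, while the GPU consumes the prefetched batches. Here preprocess is a hypothetical placeholder for whatever transform you need, and the map step belongs before batch in the pipeline above.

def preprocess(row):
    # Hypothetical per-row transform; runs on CPU threads in parallel
    return tf.cast(row, tf.float32)

train_ds = (ds.skip(test_ds_size)
            .map(preprocess, num_parallel_calls = tf.data.AUTOTUNE)
            .batch(128)
            .prefetch(tf.data.AUTOTUNE))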