How to split a TensorFlow dataset


I have data stored in a CSV file. Currently I read it into a NumPy array and then convert it into a Dataset using the code below:

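import numpy as np

def read_data():
    # fname is the path to the CSV file, defined elsewhere
    with open(fname, "r") as f:
        lines = f.read().split("\n")
        header = lines[0].replace('"', "").split(',')
        lines = lines[1:]

    # Skip the first column of every row and parse the rest as floats
    float_data = np.zeros((len(lines), len(header) - 1))
    for i, line in enumerate(lines):
        values = [float(x) for x in line.split(",")[1:]]
        float_data[i, :] = values
    return float_data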


Then I want to define a generator function that reads from this Dataset for training, but a Dataset is not subscriptable: with NumPy I can use [:2] to slice the data, but I cannot do that with a Dataset.

How can I do it?
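As a minimal sketch of how slicing can be emulated, assuming the array is first wrapped with tf.data.Dataset.from_tensor_slices: take and skip play the role of the start and stop of a slice. The small float_data array here is a hypothetical stand-in for the data read from the CSV file.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for the float_data array built from the CSV file.
float_data = np.arange(20, dtype=np.float32).reshape(10, 2)

# Each row of the array becomes one element of the dataset.
ds = tf.data.Dataset.from_tensor_slices(float_data)

# Equivalent of float_data[:2]: keep only the first two rows.
first_two = ds.take(2)

# Equivalent of float_data[3:7]: skip three rows, then keep four.
middle = ds.skip(3).take(4)

# Materialise the results as NumPy arrays to inspect them.
first_two_rows = np.stack(list(first_two.as_numpy_iterator()))
middle_rows = np.stack(list(middle.as_numpy_iterator()))
```

Unlike a NumPy slice, take and skip return new lazy datasets, so nothing is read until the result is iterated.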

Below is my generator function for NumPy input (the first parameter is a NumPy array):

def generator(data, lookback, delay, min_index, max_index, shuffle = False, batch_size = 128, step = 6):

    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback

    while True:
        if shuffle:
            # Draw a random batch of row indices
            rows = np.random.randint(min_index + lookback, max_index, size = batch_size)
        else:
            # Walk through the data in order, wrapping around at the end
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)

        samples = np.zeros((len(rows),
                    lookback // step,
                    data.shape[-1]))
        targets = np.zeros((len(rows),))

        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]

        yield samples, targets

I'm not sure whether a Dataset can do the same thing I did with NumPy.
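One possible way to get the same windows without a Python generator is tf.keras.utils.timeseries_dataset_from_array (available under that name in recent TensorFlow 2 releases). The sketch below is an assumption about how the generator's arguments map onto it, not a drop-in replacement: it assumes lookback is a multiple of step, uses small hypothetical values, and leaves out min_index and max_index.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for float_data: 50 rows, 3 feature columns.
data = np.arange(150, dtype=np.float32).reshape(50, 3)

lookback, delay, step, batch_size = 12, 2, 3, 4

# The window starting at row i covers rows i, i + step, ... up to (but not
# including) i + lookback, so its target sits at row i + lookback + delay.
targets = data[lookback + delay:, 1]

windows = tf.keras.utils.timeseries_dataset_from_array(
    data,
    targets,
    sequence_length=lookback // step,  # number of time steps per sample
    sampling_rate=step,                # keep every step-th row
    batch_size=batch_size,
    shuffle=False,
)

samples, batch_targets = next(iter(windows))
```

Here samples has shape (batch_size, lookback // step, number of features), the same layout the generator yields.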

I can wrap this generator with the from_generator method, but performance was poor even with from_generator(generator).prefetch(). I assume this is because the data is very large and performance is limited by the CPU processing the NumPy data (I referred to the question "Tensorflow: How to prefetch data on the GPU from CPU (from_generator)"). So I want to load the data as tensors from the beginning to see whether that speeds up my code.



There is 1 best solution below


You can split a TensorFlow dataset using the sample code below.

# length is the number of elements in ds
test_ds_size = int(length * 0.2)  # 20 percent of the dataset
train_ds = ds.skip(test_ds_size)
test_ds = ds.take(test_ds_size)
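For illustration, here is the same split as a small self-contained sketch on a toy dataset. Dataset.range(10) and the use of cardinality() to obtain length are assumptions made for the example; cardinality() only works when the size is known statically.

```python
import tensorflow as tf

# Toy dataset standing in for ds: the integers 0..9.
ds = tf.data.Dataset.range(10)

# cardinality() gives the element count when it is statically known.
length = int(ds.cardinality().numpy())

test_ds_size = int(length * 0.2)  # 20 percent of the dataset
test_ds = ds.take(test_ds_size)   # first 20 percent
train_ds = ds.skip(test_ds_size)  # remaining 80 percent

train_items = [int(x) for x in train_ds.as_numpy_iterator()]
test_items = [int(x) for x in test_ds.as_numpy_iterator()]
```

If the dataset is shuffled before the split, pass reshuffle_each_iteration=False to shuffle so that the two parts stay disjoint from one epoch to the next.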

It's better to use prefetch so that the next batch is already queued while the current one is being trained on, which speeds up training. You can do this by calling prefetch on the dataset.
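A minimal sketch of prefetching, assuming a toy dataset and a batch size of 128. tf.data.AUTOTUNE lets TensorFlow pick the buffer size; in older TensorFlow 2 releases the same constant lives at tf.data.experimental.AUTOTUNE.

```python
import tensorflow as tf

# Toy dataset standing in for the training data.
ds = tf.data.Dataset.range(1000)

# Batch first, then prefetch so that whole batches are prepared in the
# background while the previous batch is being consumed.
batched = ds.batch(128).prefetch(tf.data.AUTOTUNE)

# Collect the size of every batch to check that no element is lost.
batch_sizes = [int(batch.shape[0]) for batch in batched]
```

Placing prefetch as the last step of the pipeline means it overlaps everything before it with the training step.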


You can also use a GPU together with multiple CPUs to improve performance.
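One way to put several CPU cores to work on the input pipeline is to run the preprocessing step in parallel with num_parallel_calls. This is a sketch under assumptions: scale is a hypothetical preprocessing function and the dataset is a toy one.

```python
import tensorflow as tf

def scale(x):
    # Hypothetical per-element preprocessing step.
    return tf.cast(x, tf.float32) / 10.0

ds = tf.data.Dataset.range(100)

pipeline = (
    ds.map(scale, num_parallel_calls=tf.data.AUTOTUNE)  # preprocess on several CPU threads
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # keep batches ready while the model trains
)

first_batch = next(iter(pipeline))
```

With this layout the CPU threads prepare upcoming batches while the accelerator works on the current one.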