I have a set of data stored in a CSV file. Currently I read it into a NumPy array and then convert it into a Dataset with the code below:
import numpy as np
import tensorflow as tf

def read_data(fname):
    with open(fname, "r") as f:
        lines = f.read().split("\n")
    header = lines[0].replace('"', "").split(",")
    lines = [line for line in lines[1:] if line]  # drop the trailing empty line
    print(header)
    print(len(lines))
    float_data = np.zeros((len(lines), len(header) - 1))
    for i, line in enumerate(lines):
        values = [float(x) for x in line.split(",")[1:]]
        float_data[i, :] = values
    return tf.data.Dataset.from_tensor_slices(float_data)
I then want to define a generator function that draws training batches from this dataset, but it turns out a Dataset is not subscriptable: with NumPy I can write data[:2] to slice the data, but that fails on a Dataset.
How can I do it?
Below is my generator function as used with NumPy input (the first parameter is a NumPy array):
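(As an aside, take and skip are the closest Dataset analogues of NumPy slicing; a minimal sketch on a toy range dataset, not the CSV data:)

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# A Dataset is not subscriptable, but take/skip act like slicing:
head = list(ds.take(2).as_numpy_iterator())   # like arr[:2] -> [0, 1]
tail = list(ds.skip(8).as_numpy_iterator())   # like arr[8:] -> [8, 9]
print(head, tail)
```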
def generator(data, lookback, delay, min_index, max_index,
              shuffle=False, batch_size=128, step=6):
    if max_index is None:
        max_index = len(data) - delay - 1
    i = min_index + lookback
    while True:
        if shuffle:
            rows = np.random.randint(min_index + lookback, max_index,
                                     size=batch_size)
        else:
            if i + batch_size >= max_index:
                i = min_index + lookback
            rows = np.arange(i, min(i + batch_size, max_index))
            i += len(rows)
        samples = np.zeros((len(rows),
                            lookback // step,
                            data.shape[-1]))
        targets = np.zeros((len(rows),))
        for j, row in enumerate(rows):
            indices = range(rows[j] - lookback, rows[j], step)
            samples[j] = data[indices]
            targets[j] = data[rows[j] + delay][1]
        yield samples, targets
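(For reference, the non-shuffled behaviour of this generator can be approximated in pure tf.data with window/flat_map/zip; a sketch with toy data and illustrative parameter values, not the real CSV:)

```python
import numpy as np
import tensorflow as tf

# Toy data standing in for the parsed CSV: 100 timesteps, 3 features.
# The lookback/delay/step/batch_size values are illustrative only.
data = np.arange(300, dtype="float32").reshape(100, 3)
lookback, delay, step, batch_size = 20, 4, 2, 8

# Samples: sliding windows of `lookback` rows, subsampled every `step` rows,
# matching data[rows[j] - lookback : rows[j] : step] in the generator above.
samples = (tf.data.Dataset.from_tensor_slices(data)
           .window(lookback, shift=1, drop_remainder=True)
           .flat_map(lambda w: w.batch(lookback))
           .map(lambda w: w[::step]))

# Targets: column 1 of the row `delay` steps past the end of each window,
# matching targets[j] = data[rows[j] + delay][1] above.
targets = tf.data.Dataset.from_tensor_slices(data[lookback + delay:, 1])

# zip truncates to the shorter dataset, so the extra trailing windows are dropped.
ds = tf.data.Dataset.zip((samples, targets)).batch(batch_size).prefetch(1)

x, y = next(iter(ds))
print(x.shape, y.shape)  # (8, 10, 3) (8,)
```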
I'm not sure whether a Dataset can do the same thing I did with NumPy.
I can use the tf.data.Dataset.from_tensor_slices
method at the end of this generator, but performance was poor even when I used the from_generator(generator).prefetch()
method. I assume this is because the data is very large and performance is limited by the CPU processing the NumPy arrays (I referenced this question: Tensorflow: How to prefetch data on the GPU from CPU tf.data.Dataset (from_generator)), so I want to load the data as a Tensor from the beginning to see whether that speeds up my code.
Thanks!
You can split the TensorFlow dataset using the sample code below.
It's better to use
prefetch
to keep the next batch queued during training, which increases training speed. You can use prefetch
in the way shown below. You can also combine a GPU with multiple CPU cores for preprocessing to improve performance; the code below illustrates this.
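For example (a minimal sketch on a toy range dataset; the 800/200 take/skip split and the use of AUTOTUNE are illustrative assumptions):

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Toy dataset standing in for the float data.
ds = tf.data.Dataset.range(1000).map(
    lambda x: tf.cast(x, tf.float32),
    num_parallel_calls=AUTOTUNE)  # preprocess on multiple CPU cores in parallel

# Split with take/skip, then queue batches ahead of the training step.
train_ds = ds.take(800).batch(32).prefetch(AUTOTUNE)
val_ds = ds.skip(800).batch(32).prefetch(AUTOTUNE)

x = next(iter(train_ds))
print(x.shape)  # (32,)
```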