I am using a class called DataGenerator that returns a tuple (data_array, label_array). The code follows:
from tensorflow.keras.utils import Sequence


class DataGenerator(Sequence):
    """
    path_data: the path of the csv files
    """
    ...
This class reads from a list of .csv files, as shown in the following image:
Each file contains a column like this:
0.44
0.45
0.42
0.22
0.05
0.05
0.05
0.05
0.11
0.11
0.05
0.05
0.05
0.05
0.05
0.05
These files are very large, and each one holds the data for a single instance.
The problem is that I don't understand how to ingest the data through tfx.v1.components.CsvExampleGen so that I can use it inside a TFX pipeline.
- Is it possible to ingest the data using tfx, or should I look at another alternative?
- Can I use CsvExampleGen to ingest from a bunch of files in a directory?
Data ingestion consists of reading data in its raw format and converting it into a binary format suitable for ML (e.g. TFRecord). TFX provides a standard component called ExampleGen, which is responsible for generating training examples from different data sources.
The tfx.v1.components.CsvExampleGen component takes an input_base argument, which expects an external directory containing the CSV files. You can also customize the input and output train/eval split ratio for ExampleGen, as described in the ExampleGen documentation.
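As a rough sketch of how this could look, the snippet below first stages the files and then builds the component. It rests on assumptions that are not in the question: that the raw files have no header row (CsvExampleGen expects a header line, and the files in one directory should share it), that a single column named "value" is acceptable, and that a 3:1 train/eval split is wanted. The helper names stage_csvs_with_header and build_example_gen are made up for illustration.

```python
import shutil
from pathlib import Path


def stage_csvs_with_header(src_dir, dst_dir, column_name="value"):
    """Copy every .csv in src_dir into dst_dir, prepending a header line.

    Assumption: the source files have NO header row. Files are streamed,
    so large inputs are never loaded fully into memory.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    staged = []
    for csv_path in sorted(src.glob("*.csv")):
        out_path = dst / csv_path.name
        with csv_path.open("r", newline="") as fin, \
                out_path.open("w", newline="") as fout:
            fout.write(column_name + "\n")   # header line for CsvExampleGen
            shutil.copyfileobj(fin, fout)    # stream the rest unchanged
        staged.append(out_path)
    return staged


def build_example_gen(input_base):
    """Create a CsvExampleGen that reads the CSV files under input_base."""
    # Imported lazily so the staging helper above works without TFX installed.
    from tfx import v1 as tfx

    # Hash buckets 3:1 give roughly a 75% / 25% train/eval split.
    output_config = tfx.proto.Output(
        split_config=tfx.proto.SplitConfig(splits=[
            tfx.proto.SplitConfig.Split(name="train", hash_buckets=3),
            tfx.proto.SplitConfig.Split(name="eval", hash_buckets=1),
        ]))
    return tfx.components.CsvExampleGen(
        input_base=str(input_base), output_config=output_config)
```

A typical call would be stage_csvs_with_header("raw_csvs", "staged_csvs") followed by build_example_gen("staged_csvs"), where both directory names are placeholders. Note that CsvExampleGen turns each CSV row into one example, so if a whole file is meant to be a single instance, the data would likely need reshaping first or a different ingestion route.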