MLOps with TFX: How to ingest data when using Sequence from Keras?


I am using a class called DataGenerator that returns a tuple (data_array, label_array). The code is as follows:

from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    """
    path_data: the path of the csv files
    """
...

This class reads from a list of .csv files:

[image: listing of the .csv files]

Each file contains a column like this:

0.44
0.45
0.42
0.22
0.05
0.05
0.05
0.05
0.11
0.11
0.05
0.05
0.05
0.05
0.05
0.05

These files are very large, and each one holds the data for a single instance.

The problem is that I don't understand how to ingest this data through tfx.v1.components.CsvExampleGen so that I can use it inside the TFX pipeline...

  • Is it possible to ingest the data using tfx or should I look at another alternative?
  • Can I use CsvExampleGen to ingest from a bunch of files in a directory?

There are 2 best solutions below

0
On

Data ingestion consists of reading data in its raw format and converting it into a binary format suitable for ML (e.g. TFRecord). TFX provides a standard component called ExampleGen, which is responsible for generating training examples from different data sources.

The tfx.v1.components.CsvExampleGen component takes an input_base argument, which expects a directory containing the CSV files, so it can ingest a whole folder of files at once. You can also customize the input and output train/eval split for ExampleGen through its input_config and output_config arguments.
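As a rough illustration of the above, here is a minimal sketch of pointing CsvExampleGen at a directory and setting a custom train/eval split. The helper name build_example_gen and the 4:1 bucket ratio are assumptions made for this example, not something taken from the question. As far as I know, CsvExampleGen also treats the first line of each file as a header and expects all files under input_base to share that header, so header-less files like the ones in the question would likely need one added first.

```python
def build_example_gen(input_base: str, train_buckets: int = 4, eval_buckets: int = 1):
    """Return a CsvExampleGen that reads the CSV files found under `input_base`.

    `build_example_gen` and the default 4:1 ratio are hypothetical choices
    made for this sketch.
    """
    # Imported inside the function so the sketch can be loaded and inspected
    # on a machine where TFX is not installed.
    from tfx import v1 as tfx

    # hash_buckets sets the relative size of each split: 4:1 is roughly 80% / 20%.
    output_config = tfx.proto.Output(
        split_config=tfx.proto.SplitConfig(
            splits=[
                tfx.proto.SplitConfig.Split(name="train", hash_buckets=train_buckets),
                tfx.proto.SplitConfig.Split(name="eval", hash_buckets=eval_buckets),
            ]
        )
    )

    # input_base is a directory, not a single file.
    return tfx.components.CsvExampleGen(
        input_base=input_base,
        output_config=output_config,
    )
```

Downstream components would then consume example_gen.outputs['examples'], where example_gen is the object returned by the helper.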

0
On

Are you saying you have five features whose shapes are initially (None, 1), and that you need to combine them into one higher-dimensional feature of shape (None, 1, 5)? I think this is doable with TFX: after reading the files with CsvExampleGen, you would concatenate the features along the right axis in the Transform component. If you could clarify how DataGenerator gets the data, there may be a simpler solution.
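To make the shape change concrete, here is a minimal sketch of a preprocessing_fn for the Transform component that stacks five (None, 1) features into one (None, 1, 5) feature. The feature names feature_1 to feature_5 and the output key stacked_features are hypothetical, and the sketch assumes each input arrives as a dense tensor (features marked as optional in the schema may arrive as sparse tensors and would need densifying first).

```python
# Hypothetical column names; replace them with the headers of your CSV files.
FEATURE_KEYS = ["feature_1", "feature_2", "feature_3", "feature_4", "feature_5"]
STACKED_KEY = "stacked_features"  # hypothetical output feature name


def preprocessing_fn(inputs):
    """Stack five dense (batch, 1) features into one (batch, 1, 5) feature."""
    # Imported inside the function so the sketch can be loaded without
    # TensorFlow installed; a real Transform module file would normally
    # import it at the top.
    import tensorflow as tf

    # Each column is assumed to be a dense tensor of shape (batch, 1).
    columns = [tf.cast(inputs[key], tf.float32) for key in FEATURE_KEYS]

    # Stacking on a new last axis turns five (batch, 1) tensors into (batch, 1, 5).
    stacked = tf.stack(columns, axis=-1)

    return {STACKED_KEY: stacked}
```

Using tf.stack with axis=-1 adds the new dimension at the end. tf.concat with axis=-1 would instead give shape (None, 5), so which one is "the right axis" depends on the input shape the model expects.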