I have implemented a custom TensorFlow Dataset for my raw data. I can download, prepare, and load the data as a tensorflow.data.Dataset as follows:
import tensorflow_datasets
builder = tensorflow_datasets.builder("my_dataset")
builder.download_and_prepare()
ds = builder.as_dataset()
I want to transform this data in a TensorFlow Transform pipeline for model training. However, the only way I have been able to pass the dataset into the transform pipeline is by converting it to instance dicts and passing in raw data metadata.
import tensorflow_transform.beam

instance_dicts = tensorflow_datasets.as_dataframe(ds).to_dict(orient="records")

with tensorflow_transform.beam.Context():
    (transformed_data, _), transform_fn = (
        instance_dicts,
        RAW_DATA_METADATA,
    ) | tensorflow_transform.beam.AnalyzeAndTransformDataset(
        preprocessing_fn, output_record_batches=True
    )
Is there an easier and more memory-efficient way of passing a TensorFlow Dataset to a TensorFlow Transform pipeline?
In this case, an easier and more memory-efficient way to pass a TensorFlow Dataset to a TensorFlow Transform pipeline is to reference the TFRecord files that the TensorFlow Datasets builder writes during its download_and_prepare() job, rather than materializing the dataset as instance dicts. To read that raw data, create a TFXIO (for example, a tfxio.TFExampleRecord) from a feature spec, as sketched below.
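A minimal sketch of that step, assuming the builder writes TFRecord shards (the TFDS default) under builder.data_dir; the feature names and the file pattern are placeholders, so adjust them to whatever your builder actually serializes and to the shard names you find in that directory:

import os

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow_transform.tf_metadata import schema_utils
from tfx_bsl.public import tfxio

builder = tfds.builder("my_dataset")
builder.download_and_prepare()

# TFDS writes TFRecord shards into its versioned data directory, with names
# like "my_dataset-train.tfrecord-00000-of-00001".
train_file_pattern = os.path.join(builder.data_dir, "my_dataset-train.tfrecord*")

# Hypothetical feature spec: it must describe the features your builder
# serializes into each tf.Example.
feature_spec = {
    "feature": tf.io.FixedLenFeature([], tf.int64),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

# TFXIO that reads serialized tf.Examples and parses them with the schema
# derived from the feature spec.
record_tfxio = tfxio.TFExampleRecord(
    file_pattern=train_file_pattern,
    schema=schema_utils.schema_from_feature_spec(feature_spec),
)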
Then, in the pipeline, read the files with the TFXIO's BeamSource() to obtain a PCollection of RecordBatches, and provide the TFXIO's TensorAdapterConfig() alongside it so that TFT knows how to convert each RecordBatch into Tensors.
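Continuing from the sketch above, the pipeline itself could look roughly like this; it reuses your existing preprocessing_fn, and the batch size is only an example:

import tempfile

import apache_beam as beam
import tensorflow_transform.beam as tft_beam

with beam.Pipeline() as pipeline:
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        # Read the TFRecord shards straight into Arrow RecordBatches, so the
        # raw data is never materialized as in-memory instance dicts.
        raw_record_batches = pipeline | record_tfxio.BeamSource(batch_size=1000)

        # TensorAdapterConfig plays the role of RAW_DATA_METADATA: it tells
        # TFT how to turn each RecordBatch into Tensors for preprocessing_fn.
        (transformed_data, _), transform_fn = (
            (raw_record_batches, record_tfxio.TensorAdapterConfig())
            | tft_beam.AnalyzeAndTransformDataset(
                preprocessing_fn, output_record_batches=True
            )
        )

This keeps the analysis and transformation streaming through Beam instead of loading the whole dataset into a DataFrame first, which is where the memory savings come from.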