Context:
My text input pipeline currently consists of two main parts:
I. Complex text preprocessing and export of tf.SequenceExamples
to TFRecords (custom tokenization, vocabulary creation, statistics calculation, normalization and much more, both over the full dataset and per individual example). This is done once for each data configuration.
II. A tf.Dataset (TFRecords) pipeline that also does quite a bit of processing during training (string_split
into characters, table lookups, bucketing, conditional filtering, etc.); a rough sketch of this stage is shown below.
The original dataset is spread across multiple locations (BigQuery, GCS, RDS, ...).
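For reference, part II currently looks roughly like the sketch below. It is a simplified illustration rather than the actual code, written against the TF 1.x contrib APIs; the TFRecord pattern, vocabulary file, the "tokens" feature name and the bucket boundaries/batch sizes are all placeholders:

```python
import tensorflow as tf

# Placeholders for illustration only.
TFRECORD_PATTERN = "gs://my-bucket/data/*.tfrecord"
VOCAB_FILE = "gs://my-bucket/vocab/chars.txt"

# Character -> id lookup table (remember to run tf.tables_initializer()).
char_table = tf.contrib.lookup.index_table_from_file(VOCAB_FILE, default_value=0)

def parse_and_split(serialized):
    # Parse a tf.SequenceExample with a single string feature list, split each
    # token into characters and map them to ids.
    _, sequence = tf.parse_single_sequence_example(
        serialized,
        sequence_features={"tokens": tf.FixedLenSequenceFeature([], tf.string)})
    chars = tf.string_split(sequence["tokens"], delimiter="")  # character split
    return char_table.lookup(chars.values)                     # 1-D tensor of char ids

dataset = (
    tf.data.TFRecordDataset(tf.gfile.Glob(TFRECORD_PATTERN))
    .map(parse_and_split, num_parallel_calls=4)
    .filter(lambda ids: tf.size(ids) > 0)                      # conditional filtering
    .apply(tf.contrib.data.bucket_by_sequence_length(          # bucketing by length
        element_length_func=lambda ids: tf.size(ids),
        bucket_boundaries=[32, 64, 128],
        bucket_batch_sizes=[64, 32, 16, 8]))
    .prefetch(1))
```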
Problem:
The problem is that as the production dataset grows rapidly (several terabytes), it is not feasible to recreate TFRecord files for every possible data configuration (part I has a lot of hyperparameters), as each set would reach an enormous size of hundreds of terabytes. Not to mention that tf.Dataset
reading speed surprisingly slows down as tf.SequenceExamples
or TFRecord files grow in size.
There are quite a few possible solutions:
- Apache Beam + Cloud DataFlow + feed_dict;
- tf.Transform;
- Apache Beam + Cloud DataFlow + tf.Dataset.from_generator;
- tensorflow/ecosystem + Hadoop or Spark
- tf.contrib.cloud.BigQueryReader
, but none of them seems to fully satisfy my requirements:
- Streaming and processing data on the fly from BigQuery, GCS, RDS, ... as in part I.
- Sending data (protos?) directly to tf.Dataset in one way or another to be used in part II (a rough sketch of what I have in mind is given below).
- Fast and reliable for both training and inference.
- (optional) Being able to pre-calculate some full-pass statistics over the selected part of the data.
EDIT: Python 3 support would be just wonderful.
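To make the "Apache Beam + Cloud DataFlow + tf.Dataset.from_generator" option above concrete, here is roughly what I have in mind. The BigQuery client usage, project/dataset/table names and the text column are assumptions for the sake of illustration; any other source (GCS, RDS, Dataflow output) could be plugged into the generator instead:

```python
import tensorflow as tf
from google.cloud import bigquery  # assumption: google-cloud-bigquery is installed

def stream_rows():
    # Hypothetical generator streaming raw text out of BigQuery; a GCS or RDS
    # reader could be dropped in here instead, keeping part I outside of TF.
    client = bigquery.Client()
    query = "SELECT text FROM `my_project.my_dataset.my_table`"  # placeholder
    for row in client.query(query).result():
        yield row["text"].encode("utf-8")

# The resulting dataset of raw strings would then go through the part II
# transformations (splitting, lookups, bucketing) as before.
dataset = (
    tf.data.Dataset.from_generator(stream_rows,
                                   output_types=tf.string,
                                   output_shapes=tf.TensorShape([]))
    .prefetch(1))
```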
What is the most suitable choice for the tf.data.Dataset
pipeline? What are the best practices in this case?
Thanks in advance!
I recommend orchestrating the whole use case with Cloud Composer (the GCP-managed integration of Airflow).
Airflow provides operators that let you orchestrate a pipeline with a script.
In your case you can use the dataflow_operator to spin up the Dataflow job once you have enough data to process.
To get the data from BigQuery you can use the bigquery_operator.
Furthermore, you can use the PythonOperator or the BashOperator to monitor and pre-calculate statistics.
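A minimal DAG sketch along those lines could look like the following; operator import paths differ between Airflow versions (this uses the contrib-era ones), and all names, queries and GCS locations below are placeholders:

```python
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator
from airflow.operators.python_operator import PythonOperator

def compute_statistics(**context):
    # Placeholder: pre-calculate full-pass statistics over the selected data,
    # e.g. by querying BigQuery or reading the Dataflow output from GCS.
    pass

with DAG(dag_id="text_preprocessing",
         start_date=datetime(2018, 1, 1),
         schedule_interval="@daily") as dag:

    # Select/export the slice of data to process.
    export_from_bq = BigQueryOperator(
        task_id="export_from_bq",
        sql="SELECT text FROM `my_project.my_dataset.raw_texts`",
        destination_dataset_table="my_project.my_dataset.selected_texts",
        write_disposition="WRITE_TRUNCATE",
        use_legacy_sql=False)

    # Run the Beam pipeline (your part I) on Cloud Dataflow.
    preprocess = DataFlowPythonOperator(
        task_id="preprocess",
        py_file="gs://my-bucket/pipelines/preprocess.py",
        options={"input": "my_project.my_dataset.selected_texts",
                 "output": "gs://my-bucket/processed/"})

    # Pre-calculate statistics once the preprocessed data is available.
    statistics = PythonOperator(
        task_id="statistics",
        python_callable=compute_statistics,
        provide_context=True)

    export_from_bq >> preprocess >> statistics
```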