I'm reading the TFX tutorials, which all use the DataAccessor
to load data. The code looks something like this:
return data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size, label_key=_LABEL_KEY),
    schema=schema).repeat()
This makes sense at a high level for the tutorials, but I can't find relevant documentation when I dig deeper.
The tf_dataset_factory
function takes a tfxio.TensorFlowDatasetOptions
argument, so from the args description I'm deducing that this class has an effect similar to:
dataset = tfds.load('mnist', split='train')     # load data
dataset = dataset.batch(batch_size)             # batch data
dataset = dataset.shuffle(shuffle_buffer_size)  # shuffle data
dataset = dataset.prefetch(tf.data.AUTOTUNE)    # prefetch data
But it's not clear to me in what order these are applied (if it matters), or how I can manipulate the dataset in detail. For example, I want to apply Dataset.cache(),
but it's not clear to me whether applying cache()
after tf_dataset_factory
makes sense.
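Concretely, what I have in mind is something like the following (just a sketch, assuming the factory's return value can be chained like a regular Dataset, which is exactly what I'm unsure about):

dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size, label_key=_LABEL_KEY),
    schema=schema)

# Is inserting cache() here sensible, or does the factory already handle caching internally?
return dataset.cache().repeat()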
Another example of what I don't understand is whether DataAccessor
has predefined distributed training support, i.e. do I need to multiply batch_size
by strategy.num_replicas_in_sync
? The distributed training tutorials make a clear statement about doing that, but the TFX tutorials don't even mention it, so to me it's a 50/50 whether num_replicas_in_sync
is already accounted for.
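For reference, this is the pattern the distributed training tutorials describe (MirroredStrategy is just an example here); what I can't tell is whether the batch_size I pass to TensorFlowDatasetOptions should be the per-replica or the global one:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# Open question: should TensorFlowDatasetOptions(batch_size=...) get
# per_replica_batch_size or global_batch_size?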
Wondering if anyone else is in the same boat or has better ideas?
Looking through the documentation,
DataAccessor
seems to be a utility wrapper around a tf.data.Dataset
factory. The object that is returned is an instance of tf.data.Dataset,
so any subsequent methods that apply to a normal Dataset
object are valid here too.

I can't comment on the distributed training support; as you rightly pointed out, it is not documented anywhere.
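To illustrate the first point, here's a sketch reusing the names from your snippet: since the factory hands back a plain tf.data.Dataset, the usual transformations (including cache()) should chain straight on.

import tensorflow as tf

dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size, label_key=_LABEL_KEY),
    schema=schema)

# A regular tf.data.Dataset, so standard transformations apply.
assert isinstance(dataset, tf.data.Dataset)
dataset = dataset.cache().repeat().prefetch(tf.data.AUTOTUNE)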