I'm reading the TFX tutorials, which all use the DataAccessor to load data. The code looks something like this:
return data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size, label_key=_LABEL_KEY),
    schema=schema).repeat()
This makes sense at a high level for the tutorials, but I couldn't find relevant documentation when I dug deeper.
The tf_dataset_factory function takes a tfxio.TensorFlowDatasetOptions argument, so from the argument's description I'm deducing that the class has an effect similar to:
dataset = tfds.load()                          # load data
dataset = dataset.batch(batch_size)            # batch data
dataset = dataset.shuffle(buffer_size)         # shuffle data
dataset = dataset.prefetch(tf.data.AUTOTUNE)   # prefetch data
But it's not clear to me in what order these are applied (if it matters) or how I can manipulate the dataset in detail. For example, I want to apply Dataset.cache(), but it's not clear to me whether applying cache() after tf_dataset_factory makes sense.
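Concretely, assuming the factory just returns a plain tf.data.Dataset (which is my guess, not something I found documented), would something like this be sensible?

dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size, label_key=_LABEL_KEY),
    schema=schema).cache().repeat()  # cache before the infinite repeat?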
Another example of what I don't understand is whether DataAccessor has built-in support for distributed training, i.e. do I need to multiply batch_size by strategy.num_replicas_in_sync? The distributed training tutorials make a clear statement about doing that, but the TFX tutorials don't even mention it, so to me it's a 50/50 whether num_replicas_in_sync is already accounted for. See the sketch below for what I mean.
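For reference, this is the pattern the tf.distribute tutorials use; the sketch below is what I assume I'd have to do manually if the DataAccessor doesn't already account for it (per_replica_batch_size is just a placeholder name of mine):

strategy = tf.distribute.MirroredStrategy()
# tf.distribute tutorials: global batch size = per-replica batch size * replica count
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync
dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=global_batch_size, label_key=_LABEL_KEY),
    schema=schema).repeat()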
Wondering if anyone else is in the same boat or has better ideas?
Looking through the documentation, DataAccessor seems to be a utility wrapper around a tf.data.Dataset factory. The object that is returned is an instance of tf.data.Dataset, so any subsequent methods that apply to a normal Dataset object are valid here too. I can't comment on the distributed training support, and it is not documented anywhere, as you rightly pointed out.
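For example (a sketch under that assumption, not verified against every TFX version), you should be able to chain standard tf.data methods directly onto the returned object:

dataset = data_accessor.tf_dataset_factory(
    file_pattern,
    tfxio.TensorFlowDatasetOptions(
        batch_size=batch_size, label_key=_LABEL_KEY),
    schema=schema)
dataset = dataset.cache()                     # cache the already-batched examples
dataset = dataset.repeat()                    # repeat for multi-epoch training
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # overlap the input pipeline with training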