What does DataAccessor do in tfx?


I'm reading the tfx tutorials, which all use the DataAccessor to load data. The code looks something like this:

def _input_fn(file_pattern, data_accessor, schema, batch_size):
  return data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_LABEL_KEY),
      schema=schema).repeat()

This makes sense at a high level for the tutorials, but I couldn't find any relevant documentation when I dug deeper. The tf_dataset_factory function takes a tfxio.TensorFlowDatasetOptions argument, and from the args description I'm deducing that the class has an effect similar to:

dataset = tfds.load(...)                      # load data
dataset = dataset.batch(batch_size)           # batch data
dataset = dataset.shuffle(buffer_size)        # shuffle data
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # prefetch data

But it's not clear to me in what order these are applied (if it matters), or how I can manipulate the dataset in detail. For example, I want to apply Dataset.cache(), but it's not clear to me whether applying cache() after tf_dataset_factory makes sense.

Another example of what I don't understand is whether DataAccessor has built-in distributed training support, i.e. do I need to multiply batch_size by strategy.num_replicas_in_sync? The distributed training tutorials make a clear statement about doing that, but the tfx tutorials don't mention it at all, so to me it's a 50/50 whether num_replicas_in_sync is already compensated for.

Wondering if anyone else is in the same boat or has better ideas?

1 Answer

Looking through the documentation, DataAccessor seems to be a utility wrapper around a tf.data.Dataset factory. The object returned is an instance of tf.data.Dataset, so any subsequent methods that apply to a normal Dataset object are valid here too.
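
For example, applying cache() after tf_dataset_factory should work just like on any other dataset, since you're just chaining tf.data.Dataset methods. A minimal, untested sketch, reusing the _input_fn shape and the tfxio/_LABEL_KEY names from the tutorial snippet in your question:

def _input_fn(file_pattern, data_accessor, schema, batch_size):
  dataset = data_accessor.tf_dataset_factory(
      file_pattern,
      tfxio.TensorFlowDatasetOptions(
          batch_size=batch_size, label_key=_LABEL_KEY),
      schema=schema)
  # cache() keeps the parsed examples after the first pass; placing it
  # before repeat() means later epochs reuse the cached data.
  return dataset.cache().repeat().prefetch(tf.data.AUTOTUNE)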

I can't comment on the distributed training support; as you rightly pointed out, it isn't documented anywhere.
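
That said, absent documentation saying otherwise, the safe assumption is that it behaves like any other tf.data pipeline and you scale the batch size yourself, the way the generic tf.distribute tutorials do. Purely a sketch under that assumption, reusing the hypothetical _input_fn from above:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
# Assumption: DataAccessor does NOT account for replicas, so compute a
# global batch size manually, as in the generic distribution tutorials.
per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

dataset = _input_fn(file_pattern, data_accessor, schema,
                    batch_size=global_batch_size)

If TFX turns out to scale the batch size internally, you would simply drop the multiplication.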