Stream private data to Google Colab TPUs from GCS

So I'm trying to build a photo classifier with 150 classes and run it on Google Colab TPUs. As I understand it, I need a tfds dataset loaded with try_gcs=True for that, which means putting the dataset on Google Cloud Storage (GCS). So I converted a generator to a tf.data.Dataset and stored it locally using:

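import tensorflow as tf

# Wrap the generator; each element is a (64x64 RGB image, 150-way label vector) pair.
my_tf_ds = tf.data.Dataset.from_generator(
    datafeeder.allGenerator,
    output_signature=(
        tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(150,), dtype=tf.float32)))

# Write the dataset to a local directory.
tf.data.experimental.save(my_tf_ds, filename)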

Then I uploaded it to my bucket on GCS. But when I try to load it from my bucket with:

import tensorflow_datasets as tfds
dsFromGcs = tfds.load("pokemons", data_dir="gs://dataset-7000")

it doesn't work. Instead, the error lists available datasets such as:

- abstract_reasoning
- accentdb
- aeslc
- aflw2k3d
- ag_news_subset
- ai2_arc
- ai2_arc_with_ir
- amazon_us_reviews
- anli
- arc

that are not on my GCS bucket.

When loading it myself from local:

tfds_from_file = tf.data.experimental.load(
    filename,
    element_spec=(
        tf.TensorSpec(shape=(64, 64, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(150,), dtype=tf.float32)))

it works, the dataset is fine.

So I don't understand why I can't read it from GCS. Can we read private datasets on GCS, or only the predefined ones? I also granted the Storage Legacy Bucket Reader role on my bucket to the public.

There is 1 best solution below

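I think the problem is how tfds.load finds datasets. It looks the name up among the datasets that TFDS knows about, which is why the error lists abstract_reasoning, accentdb and so on. The data_dir argument is the directory where TFDS reads and writes its own prepared datasets, and try_gcs tells it to check the public TFDS bucket before building a dataset locally. A folder written by tf.data.experimental.save is not in that format, so pointing data_dir at your GCS bucket is not enough for tfds.load to find it.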

Here are some ideas you could try:

  1. You could try these steps to add your dataset to TFDS and then you should be able to load it using tfds.load
  2. You could get a dataset in the right format using tf.data.experimental.save (which I think you've already done) and save it to GCS and then load it using tf.data.experimental.load, which you said is working for you locally. You could follow these steps to install gcsfuse and use that to download your dataset to Colab from GCS.
  3. You could store your dataset as TFRecord files and load those. Here is a codelab with an explanation, and here is a Colab example that's linked in the codelab.
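To make the TFRecord idea a bit more concrete, here is a minimal sketch of writing (image, label) pairs to a TFRecord file and reading them back as a batched tf.data pipeline. The shapes come from the question. The helper names (serialize_example, write_tfrecord, parse_example, read_tfrecord), the feature keys and the pokemons.tfrecord file name are all made up for illustration, and random arrays stand in for the real generator.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

IMAGE_SHAPE = (64, 64, 3)  # image shape from the question
NUM_CLASSES = 150          # label vector length from the question


def serialize_example(image, label):
    """Pack one (image, label) pair of float32 arrays into a serialized tf.train.Example."""
    image = np.asarray(image, dtype=np.float32)
    label = np.asarray(label, dtype=np.float32)
    feature = {
        "image": tf.train.Feature(float_list=tf.train.FloatList(value=image.ravel().tolist())),
        "label": tf.train.Feature(float_list=tf.train.FloatList(value=label.ravel().tolist())),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def write_tfrecord(examples, path):
    """Write an iterable of (image, label) pairs to a single TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for image, label in examples:
            writer.write(serialize_example(image, label))


def parse_example(serialized):
    """Decode one serialized example back into (image, label) tensors."""
    spec = {
        "image": tf.io.FixedLenFeature(IMAGE_SHAPE, tf.float32),
        "label": tf.io.FixedLenFeature((NUM_CLASSES,), tf.float32),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed["image"], parsed["label"]


def read_tfrecord(path, batch_size):
    """Build a batched input pipeline from a TFRecord file."""
    ds = tf.data.TFRecordDataset(path)
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    # drop_remainder=True keeps every batch the same shape, which TPUs need.
    return ds.batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)


# Demo on a temporary local file; random data stands in for the real generator.
rng = np.random.default_rng(0)
fake_examples = [
    (rng.random(IMAGE_SHAPE, dtype=np.float32), np.eye(NUM_CLASSES, dtype=np.float32)[i])
    for i in range(4)
]
demo_path = os.path.join(tempfile.mkdtemp(), "pokemons.tfrecord")
write_tfrecord(fake_examples, demo_path)
images, labels = next(iter(read_tfrecord(demo_path, batch_size=2)))
print(images.shape, labels.shape)
```

TensorFlow's file APIs also accept gs:// paths, so in Colab the same functions should work with something like gs://dataset-7000/pokemons.tfrecord (a guess at the bucket layout) once the runtime has read access to the bucket, for example after calling auth.authenticate_user() from google.colab.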