Training custom data set model using mask_rcnn_inception from tensorflow model zoo on Macbook pro M2

139 Views Asked by At

Running into GPU related error while working with latest tensorflow ( 2.13 ) . Please note the test model training provided on tensorflow-metal page to verify my setup works fine.

Please advise.

Below is the command I used - the script is from [github.com/tensorflow/models][1]

 python3 model_main_tf2.py --model_dir=models/ark_mask_rcnn_inception_resnet_v2 --pipeline_config_path=models/ark_mask_rcnn_inception_resnet_v2/pipeline.config 

tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_18_device_/job:localhost/replica:0/task:0/device:GPU:0}} indices[0] = 0 is not in [0, 0)
     [[{{node GatherV2_7}}]]
     [[MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]] [Op:IteratorGetNext] name: 

The above are the last lines of the error message. below is the full log from the model training script

2023-09-10 20:06:55.580212: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 32.00 GB
2023-09-10 20:06:55.580217: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 10.67 GB
2023-09-10 20:06:55.580248: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-10 20:06:55.580265: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2023-09-10 20:06:55.581703: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:303] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-09-10 20:06:55.581712: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:269] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I0910 20:06:55.581999 8568659456 mirrored_strategy.py:419] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I0910 20:06:55.590664 8568659456 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I0910 20:06:55.590721 8568659456 config_util.py:552] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0910 20:06:55.605112 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py:563: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['annotations/train.record']
I0910 20:06:55.607398 8568659456 dataset_builder.py:162] Reading unweighted datasets: ['annotations/train.record']
INFO:tensorflow:Reading record datasets for input file: ['annotations/train.record']
I0910 20:06:55.607451 8568659456 dataset_builder.py:79] Reading record datasets for input file: ['annotations/train.record']
INFO:tensorflow:Number of filenames to read: 1
I0910 20:06:55.607482 8568659456 dataset_builder.py:80] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W0910 20:06:55.607504 8568659456 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.deterministic`.
W0910 20:06:55.610141 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:100: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.deterministic`.
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
W0910 20:06:55.618376 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/builders/dataset_builder.py:235: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:459: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W0910 20:06:56.389322 8568659456 deprecation.py:569] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py:459: calling map_fn (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W0910 20:06:58.673335 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W0910 20:06:59.748894 8568659456 deprecation.py:364] From /Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/util/dispatch.py:1176: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
2023-09-10 20:07:01.205124: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
2023-09-10 20:07:01.207747: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:114] Plugin optimizer for device_type GPU is enabled.
Traceback (most recent call last):
  File "/Users/_dga/ml-git/tf-ark/Tensorflow/workspace/training_demo/model_main_tf2.py", line 126, in <module>
    tf.compat.v1.app.run()
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/platform/app.py", line 36, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/Users/_dga/ml-git/tf-ark/Tensorflow/workspace/training_demo/model_main_tf2.py", line 117, in main
    model_lib_v2.train_loop(
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 605, in train_loop
    load_fine_tune_checkpoint(
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 401, in load_fine_tune_checkpoint
    _ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/object_detection/model_lib_v2.py", line 161, in _ensure_model_is_built
    features, labels = iter(input_dataset).next()
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 260, in next
    return self.__next__()
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 264, in __next__
    return self.get_next()
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 325, in get_next
    return self._get_next_no_partial_batch_handling(name)
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 361, in _get_next_no_partial_batch_handling
    replicas.extend(self._iterators[i].get_next_as_list(new_name))
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/distribute/input_lib.py", line 1427, in get_next_as_list
    return self._format_data_list_with_options(self._iterator.get_next())
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/data/ops/multi_device_iterator_ops.py", line 553, in get_next
    result.append(self._device_iterators[i].get_next())
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 867, in get_next
    return self._next_internal()
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 777, in _next_internal
    ret = gen_dataset_ops.iterator_get_next(
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 3028, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/Users/_dga/ml-git/tf-venv/lib/python3.9/site-packages/tensorflow/python/framework/ops.py", line 6656, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: {{function_node __wrapped__IteratorGetNext_output_types_18_device_/job:localhost/replica:0/task:0/device:GPU:0}} indices[0] = 0 is not in [0, 0)
     [[{{node GatherV2_7}}]]
     [[MultiDeviceIteratorGetNextFromShard]]
     [[RemoteCall]] [Op:IteratorGetNext] name: ```


  [1]: https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py

running the setup verification script available on apple tensorflow-metal page i.e.

import tensorflow as tf

cifar = tf.keras.datasets.cifar100
(x_train, y_train), (x_test, y_test) = cifar.load_data()
model = tf.keras.applications.ResNet50(
    include_top=True,
    weights=None,
    input_shape=(32, 32, 3),
    classes=100,)

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64) ``` 

works fine i.e. detects the device etc.

1

There are 1 best solutions below

0
On BEST ANSWER

This answer / assumption also seems to be incorrect. Training the same model on UBUNTU machine with GPU / CPU also faile with identical error.

Found this issue listed since 2020 on github issue on github

For future reference for self and others -

On the same machine I could successfully move ahead with my training for other categories of models and couldn't find any specific response to the question of why this error shows up for this specific model type i.e. mask_rcnn_inception_resnet.

Thus I concluded that since this model is not supported on TPU's yet it cannot run on Mac M2 where though its called a GPU, possibly TF sees it as a TPU due to the pluggable device pattern with tensorflow-metal.

Further update -- I managed to catch hold of someone from Tensorflow official team and the update is research models are not supported i.e. Tensorflow/models/research section and we are expected to use official models.

Working Mac M1 gist for TF2 Object detection