ERROR Connecting to TPU in Google Colab during Training

I've been encountering an issue while attempting to connect to a TPU on Google Colab during model training for a categorical image classification task. I followed the instructions in this resource (https://www.tensorflow.org/guide/tpu), which outline the TPU initialization process, but I still run into the error described below.

The error message suggests that there is a failure to connect to the addresses, specifically at ipv4:127.0.0.1:34649, with the last error being a connection refusal. The error is associated with the MultiDeviceIteratorGetNextFromShard node and involves GRPC communication errors between the local host and the CPU device.

I've verified the TPU initialization steps, but the issue persists. Any insights or suggestions on resolving this would be highly appreciated.

ERROR MESSAGE ON TRAINING:

InternalError: 9 root error(s) found.
  (0) INTERNAL: {{function_node __inference_train_function_92879}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:34649: Failed to connect to remote host: Connection refused
Additional GRPC error information from the remote target /job:localhost/replica:0/task:0/device:CPU:0:
:UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:34649: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2023-12-08T14:44:19.980032272+00:00"}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[tpu_compile_succeeded_assert/_10127589026846923034/_6/_379]]
  (1) INTERNAL: {{function_node __inference_train_function_92879}} failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:34649: Failed to connect to remote host: Connection refused
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:34649: Failed to connect to remote host: Connection refused {grpc_status:14, created_time:"2023-12-08T14:44:19.980032272+00:00"}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
Executing non-communication op <MultiDeviceIteratorGetNextFromShard> originally returned UnavailableError, and was replaced by InternalError to avoid invoking TF network error handling logic.

WHAT I DID

TPU INITIALIZATION CODE:

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
strategy = tf.distribute.TPUStrategy(resolver)

LOADING IMAGES AND CREATING DATA GENERATORS

Define paths for Training and Validation Data

train_path = "./processed_datasets/train_dir"
valid_path = "./processed_datasets/val_dir"

This is a summary of how I passed values to my data generator:
train_generator, validation_generator, test_generator = data_generator(
    seed, train_path, valid_path, image_resize, train_batch_size,
    pretrained_model, validation_batch_size
)
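
For context, data_generator is my own wrapper around Keras ImageDataGenerator.flow_from_directory, reading images from the local Colab filesystem. The real implementation is longer, but a simplified sketch (argument names taken from the call above, EfficientNet preprocessing assumed as a stand-in) looks roughly like this:

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.efficientnet import preprocess_input

def data_generator(seed, train_path, valid_path, image_resize,
                   train_batch_size, pretrained_model, validation_batch_size):
    # Keras generators reading images from local disk; in the real code,
    # pretrained_model selects the matching preprocessing function
    datagen = ImageDataGenerator(preprocessing_function=preprocess_input)

    train_generator = datagen.flow_from_directory(
        train_path, target_size=image_resize, class_mode='categorical',
        batch_size=train_batch_size, shuffle=True, seed=seed)
    validation_generator = datagen.flow_from_directory(
        valid_path, target_size=image_resize, class_mode='categorical',
        batch_size=validation_batch_size, shuffle=False)
    # test generator simplified here; it also reads from the validation directory
    test_generator = datagen.flow_from_directory(
        valid_path, target_size=image_resize, class_mode='categorical',
        batch_size=validation_batch_size, shuffle=False)
    return train_generator, validation_generator, test_generator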

HERE I CALLED THE FUNCTION FOR CREATING AND TRAINING THE MODEL

Initialize pre-trained EfficientNet model within strategy scope

with strategy.scope():
    baseline_create_model = baseline_model(
        input_size=image_shape,
        num_classess=hyper_param['num_class'],
        pretrained_model=hyper_param['backbone_model'],
        model_name=hyper_param['save_final_model'],
        lr_rate=0.001,
        dropout_rate=0.2
    )
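
For context, baseline_model is another helper of mine; roughly, it puts a dropout layer and a softmax head on top of a frozen pretrained backbone and compiles the model. A simplified sketch (EfficientNetB0 used as a stand-in for the actual pretrained_model argument):

from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import EfficientNetB0

def baseline_model(input_size, num_classess, pretrained_model,
                   model_name, lr_rate, dropout_rate):
    # frozen EfficientNet backbone with a dropout + softmax classification head
    backbone = EfficientNetB0(include_top=False, weights='imagenet',
                              input_shape=input_size, pooling='avg')
    backbone.trainable = False

    x = layers.Dropout(dropout_rate)(backbone.output)
    outputs = layers.Dense(num_classess, activation='softmax')(x)
    model = models.Model(backbone.input, outputs, name=model_name)

    model.compile(optimizer=optimizers.Adam(learning_rate=lr_rate),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model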


Model training within strategy scope

with strategy.scope():
    history = baseline_train_model(
        model=baseline_create_model,
        train_generator=train_generator,
        epoch=hyper_param['epoch'],
        train_batch_size=hyper_param['train_batch_size'],
        class_weights=class_weights,
        validation_generator=validation_generator,
        validation_batch_size=hyper_param['validation_batch_size'],
        train_step=STEP_SIZE_TRAIN,
        valid_step=STEP_SIZE_VALID,
        callback=call_backs
    )
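
baseline_train_model is essentially a thin wrapper around model.fit; a simplified sketch with the same parameter names (the batch sizes are already baked into the generators, so they are accepted but unused here):

def baseline_train_model(model, train_generator, epoch, train_batch_size,
                         class_weights, validation_generator,
                         validation_batch_size, train_step, valid_step,
                         callback):
    # thin wrapper around model.fit using the Keras generators from above
    return model.fit(
        train_generator,
        epochs=epoch,
        steps_per_epoch=train_step,
        validation_data=validation_generator,
        validation_steps=valid_step,
        class_weight=class_weights,
        callbacks=callback)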

I've double-checked the TPU initialization and followed the recommended steps, but the issue persists. I have a feeling the problem could be with how I'm loading the dataset, but I'm not sure what other way there is to load it.
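
For reference, one alternative I've considered is replacing the Keras generators with a tf.data pipeline, roughly along these lines (reusing the paths, image size, and batch size from above), though I haven't confirmed whether this would resolve the TPU error:

import tensorflow as tf

# build a batched tf.data.Dataset directly from the training directory
train_ds = tf.keras.utils.image_dataset_from_directory(
    train_path,
    label_mode='categorical',
    image_size=image_resize,      # e.g. (224, 224)
    batch_size=train_batch_size,
    shuffle=True,
    seed=seed)
train_ds = train_ds.prefetch(tf.data.AUTOTUNE)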
