Keras Tuner - Chief running trials instead of the workers

Setup
keras-tuner==1.1.0
tensorflow==2.8.0
Python 3.10.2

Chief and Tuner0 running on one machine
Tuner1 running on another machine
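Each process is assigned its role through keras-tuner's environment variables, set before the tuner is created (the address and port below are placeholders, not my real values):

import os

# Chief process: should run the oracle only, no trials
os.environ["KERASTUNER_TUNER_ID"] = "chief"
os.environ["KERASTUNER_ORACLE_IP"] = "10.0.0.1"   # placeholder address
os.environ["KERASTUNER_ORACLE_PORT"] = "8000"     # placeholder port

# Worker processes: unique IDs, pointing at the same oracle address
os.environ["KERASTUNER_TUNER_ID"] = "tuner0"      # "tuner1" on the second machine
os.environ["KERASTUNER_ORACLE_IP"] = "10.0.0.1"
os.environ["KERASTUNER_ORACLE_PORT"] = "8000"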

Hyperband initialization:

from keras_tuner import Hyperband

# em.get_model builds the model from a HyperParameters instance;
# config is a ConfigParser-style object (both defined elsewhere)
hp = Hyperband(
    hypermodel=em.get_model,
    objective='val_accuracy',
    max_epochs=int(config.get(eid, 'epochs')),
    project_name=project_folder,
    hyperband_iterations=int(config.get(eid, 'tuner_iterations'))
)

# search_space_summary() prints the summary itself and returns None,
# which is why a stray "None" appears after the summary in the logs below
hp.search_space_summary()

# TensorBoard logs
# tlogs = 'tboard_logs/' + eid

from tensorflow.keras.callbacks import EarlyStopping, LearningRateScheduler

# exp_scheduler and Combined_Accuracy are project code
# (a sketch of exp_scheduler follows below)
lr_schedule = LearningRateScheduler(exp_scheduler)
early_stop = int(config.get(eid, 'early_stop'))

callbacks = [EarlyStopping(patience=early_stop), lr_schedule]
if len(output_keys) > 1:
    # multi-output models also report a combined accuracy
    callbacks.append(Combined_Accuracy(len(output_keys)))

hp.search(train, steps_per_epoch=train_steps,
          validation_data=test, validation_steps=test_steps, verbose=2,
          callbacks=callbacks)
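
The exp_scheduler passed to LearningRateScheduler is project code that isn't shown. For completeness, a minimal sketch of such a schedule (the warm-up length and decay rate are assumptions, not the original values); LearningRateScheduler calls it with the epoch index and current learning rate and applies the returned value:

import math

def exp_scheduler(epoch, lr):
    # Hypothetical: hold the tuned LR during a short warm-up,
    # then decay it exponentially by ~2% per epoch
    if epoch < 5:
        return lr
    return lr * math.exp(-0.02)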

Issue:
After Tuner0 and Tuner1 complete the search, the chief starts running trials itself. Ideally the chief should only act as the oracle, serving hyperparameter values to the workers. And because I have restricted the chief to CPU only, its trials are very slow. Here are the logs from the chief script:

Oracle server on chief is exiting in 10s.The chief will go on with post-search code.
Search space summary
Default search space size: 18
enc_dropout (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.4, 'step': None, 'sampling': None}
enc_layer_norm (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.6, 'step': None, 'sampling': None}
enc_l2_reg (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.6, 'step': None, 'sampling': None}
pos_dropout (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.4, 'step': None, 'sampling': None}
pos_layer_norm (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.6, 'step': None, 'sampling': None}
pos_l2_reg (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.6, 'step': None, 'sampling': None}
decoder_dropout (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.4, 'step': None, 'sampling': None}
decoder_layer_norm (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.6, 'step': None, 'sampling': None}
decoder_l2_reg (Float)
{'default': 0.0, 'conditions': [], 'min_value': 0.0, 'max_value': 0.6, 'step': None, 'sampling': None}
learning_rate (Float)
{'default': 1e-05, 'conditions': [], 'min_value': 1e-05, 'max_value': 9e-05, 'step': None, 'sampling': None}
enc_dense_stack (Choice)
{'default': 2, 'conditions': [], 'values': [2, 3, 4], 'ordered': True}
bert_url (Choice)
{'default': 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/2', 'conditions': [], 'values': ['https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/2'], 'ordered': False}
pos_enc_blocks (Choice)
{'default': 2, 'conditions': [], 'values': [2, 3, 4], 'ordered': True}
pos_attn_heads (Choice)
{'default': 2, 'conditions': [], 'values': [2, 3, 4], 'ordered': True}
pos_dense_stack (Choice)
{'default': 2, 'conditions': [], 'values': [2, 3, 4], 'ordered': True}
decoder_enc_blocks (Choice)
{'default': 2, 'conditions': [], 'values': [2, 3, 4], 'ordered': True}
decoder_attn_heads (Choice)
{'default': 2, 'conditions': [], 'values': [2, 3, 4], 'ordered': True}
decoder_dense_stack (Choice)
{'default': 2, 'conditions': [], 'values': [2, 3, 4], 'ordered': True}
None

Search: Running Trial #218

Hyperparameter    |Value             |Best Value So Far 
enc_dropout       |0.37332           |0.10642           
enc_layer_norm    |0.15571           |0.12288           
enc_l2_reg        |0.48613           |0.57864           
pos_dropout       |0.17162           |0.14473           
pos_layer_norm    |0.11009           |0.26961           
pos_l2_reg        |0.49191           |0.20803           
decoder_dropout   |0.24864           |0.051037          
decoder_layer_norm|0.46016           |0.57878           
decoder_l2_reg    |0.41414           |0.013985          
learning_rate     |7.8417e-05        |6.716e-05         
enc_dense_stack   |4                 |3                 
bert_url          |https://tfhub.d...|https://tfhub.d...
pos_enc_blocks    |2                 |4                 
pos_attn_heads    |4                 |4                 
pos_dense_stack   |2                 |4                 
decoder_enc_blocks|2                 |3                 
decoder_attn_heads|2                 |3                 
decoder_dense_s...|2                 |2                 
tuner/epochs      |50                |50                
tuner/initial_e...|0                 |17                
tuner/bracket     |0                 |2                 
tuner/round       |0                 |2                 

Epoch 1/50
85/85 - 215s - loss: 149.9310 - accuracy: 0.8909 - val_loss: 103.2796 - val_accuracy: 0.9896 - lr: 6.4203e-05 - 215s/epoch - 3s/step
Epoch 2/50
85/85 - 220s - loss: 94.1549 - accuracy: 0.9897 - val_loss: 83.6212 - val_accuracy: 0.9896 - lr: 6.4203e-05 - 220s/epoch - 3s/step
Epoch 3/50
85/85 - 210s - loss: 75.2738 - accuracy: 0.9897 - val_loss: 67.1717 - val_accuracy: 0.9896 - lr: 6.4203e-05 - 210s/epoch - 2s/step
Epoch 4/50
85/85 - 190s - loss: 60.2264 - accuracy: 0.9898 - val_loss: 53.5418 - val_accuracy: 0.9896 - lr: 6.4203e-05 - 190s/epoch - 2s/step

1 Answer

According to Keras Tuner - Distributed Tuning, you should add the distribution_strategy parameter (note the exact name; there is no distributed_strategy argument) to the Hyperband constructor.
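
A minimal sketch of that change, reusing the constructor from the question (MirroredStrategy is just an example; pick whichever tf.distribute strategy matches your hardware):

import tensorflow as tf
from keras_tuner import Hyperband

hp = Hyperband(
    hypermodel=em.get_model,
    objective='val_accuracy',
    max_epochs=int(config.get(eid, 'epochs')),
    project_name=project_folder,
    hyperband_iterations=int(config.get(eid, 'tuner_iterations')),
    # run each trial under a tf.distribute strategy
    distribution_strategy=tf.distribute.MirroredStrategy(),
)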