TensorFlow/Google AI Platform: HyperTune trials failed to report the hyperparameter tuning metric


I'm using the tf.estimator API with TensorFlow 2.1 on Google AI Platform to build a DNN Regressor. To use AI Platform Training hyperparameter tuning, I followed Google's docs. I used the following configuration parameters:

config.yaml:

trainingInput:
    scaleTier: BASIC
    hyperparameters:
        goal: MINIMIZE
        maxTrials: 2
        maxParallelTrials: 2
        hyperparameterMetricTag: rmse
        enableTrialEarlyStopping: True
        params:
        - parameterName: batch_size
          type: DISCRETE
          discreteValues:
          - 100
          - 200
          - 300
        - parameterName: lr
          type: DOUBLE
          minValue: 0.0001
          maxValue: 0.1
          scaleType: UNIT_LOG_SCALE 

And to add the metric to my summary, I used the following code for my DNNRegressor:

def rmse(labels, predictions):
    # Streaming RMSE over the eval set; the dict key must match
    # hyperparameterMetricTag in config.yaml.
    pred_values = predictions['predictions']
    rmse = tf.keras.metrics.RootMeanSquaredError(name='root_mean_squared_error')
    rmse.update_state(labels, pred_values)
    return {'rmse': rmse}
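For reference, `RootMeanSquaredError` computes the square root of the mean squared error over all examples seen so far. A plain-Python check of the same formula (the function name `rmse_plain` is my own, not part of any API):

```python
import math

def rmse_plain(labels, preds):
    """Plain-Python RMSE: sqrt(mean((label - pred)^2))."""
    squared_errors = [(l - p) ** 2 for l, p in zip(labels, preds)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse_plain([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```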

def train_and_evaluate(hparams):
    ...
    estimator = tf.estimator.DNNRegressor(
        model_dir=output_dir,
        feature_columns=get_cols(),
        hidden_units=[max(2, int(FIRST_LAYER_SIZE * SCALE_FACTOR ** i))
                      for i in range(NUM_LAYERS)],
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        config=run_config)
    estimator = tf.estimator.add_metrics(estimator, rmse)

According to Google's documentation, the add_metrics function creates a new estimator with the specified metric, which is then used as the hyperparameter metric. However, the AI Platform Training service doesn't recognise this metric, as shown on the Job details page on AI Platform.

When I run the code locally, the rmse metric does appear in the logs. So how do I make the metric available to the Training job on AI Platform when using Estimators?

Additionally, there is the option of reporting metrics through the cloudml-hypertune Python package, but it requires the metric's value as one of its input arguments. How do I extract the metric from tf.estimator.train_and_evaluate (since that's the function I use to train/evaluate my estimator) so I can pass it to report_hyperparameter_tuning_metric?

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='rmse',
    metric_value=??,
    global_step=1000
)
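One approach I'm considering (untested; the helper name `report_final_metric` is my own, and I'm assuming `estimator.evaluate()` returns a dict keyed by metric name, including the `rmse` added via `add_metrics` and a `global_step` entry): run a separate evaluation after training and feed its result dict into the hypertune call. The wiring, with the hypertune reporter stubbed out so the call shape is visible, would look like:

```python
# Assumption: eval_result is the dict returned by estimator.evaluate(),
# e.g. {'rmse': 3.2, 'loss': ..., 'global_step': 1000}.
def report_final_metric(eval_result, report_fn):
    """Pull the tuning metric out of an evaluate() result and report it."""
    report_fn(
        hyperparameter_metric_tag='rmse',
        metric_value=eval_result['rmse'],
        global_step=eval_result['global_step'],
    )

# Stand-in for hpt.report_hyperparameter_tuning_metric, to check the wiring:
reported = {}
def fake_report(**kwargs):
    reported.update(kwargs)

report_final_metric({'rmse': 3.2, 'global_step': 1000}, fake_report)
```

In the real job, `fake_report` would be replaced by `hpt.report_hyperparameter_tuning_metric` and `eval_result` by the output of a call to `estimator.evaluate(eval_input_fn)` made after `train_and_evaluate` finishes.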

ETA: The logs show no errors; the job is reported as having completed successfully even though the tuning metric is never picked up.
