How does TensorFlow's DNNRegressor decide how many steps to run?


I'm running DNNRegressor from TensorFlow and I'm only getting roughly 2000-4000 steps, depending on the input parameters. I'm running DNNRegressor with embedding columns that describe the FEN code of a chess position. My labels are chess evaluations as supplied by https://www.kaggle.com/datasets/ronakbadhe/chess-evaluations.
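For what it's worth, my current understanding from the Estimator docs is that DNN.train(input_fn) keeps taking steps until the input_fn's dataset is exhausted (one pass for a non-repeating dataset), unless you cap it with the steps or max_steps argument. Here is a sketch of what I mean, with toy stand-ins for my real features and labels (the values are placeholders, not my real data):

import tensorflow.compat.v1 as tf1

# Toy stand-ins for my real features/labels (placeholders):
features = {'square': ['r', 'n', 'b', 'q']}
labels = [50.0, 10.0, 75.0, 52.0]

def input_fn():
    ds = tf1.data.Dataset.from_tensor_slices((features, labels))
    return ds.repeat().batch(2)  # repeat() makes the dataset unbounded

# With an unbounded dataset, train() needs an explicit cap:
#     DNN.train(input_fn, max_steps=20000)
# Without repeat() and without steps/max_steps, train() stops once the
# dataset runs out, i.e. after ceil(num_examples / batch_size) steps.

Is that the whole story, or is something else capping my runs?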

Here are the input parameters I'm testing that produce the low step counts:

import numpy as np
import tensorflow as tf

embedding_dims = [64, 64 * 2, 64 * 4]
basic_list = [32, 32, 64, 64, 64, 128, 128, 264, 264]
# layer widths must be ints, so halve with integer division
hidden_units = [basic_list, [i * 2 for i in basic_list], [i // 2 for i in basic_list]]
batches = [32, 64, 128, 256, 512, 1024, 2048]

rates = list(np.arange(0.01, 0.2, 0.1))  # -> [0.01, 0.11]
optimizers = []
optimizersworded = []
for lr in rates:
    optimizers.append(tf.compat.v1.train.AdagradOptimizer(learning_rate=lr))
    optimizersworded.append('tf.compat.v1.train.AdagradOptimizer(learning_rate=' + str(lr) + ')')
    optimizers.append(tf.compat.v1.train.AdamOptimizer(learning_rate=lr))
    optimizersworded.append('tf.compat.v1.train.AdamOptimizer(learning_rate=' + str(lr) + ')')
    optimizers.append(tf.compat.v1.train.FtrlOptimizer(learning_rate=lr))
    optimizersworded.append('tf.compat.v1.train.FtrlOptimizer(learning_rate=' + str(lr) + ')')
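As an aside, I've considered building the optimizers and their labels in lockstep from (name, constructor) pairs instead of two parallel lists; a sketch, not the code that produced the output below:

optimizer_factories = [
    ('AdagradOptimizer', tf.compat.v1.train.AdagradOptimizer),
    ('AdamOptimizer', tf.compat.v1.train.AdamOptimizer),
    ('FtrlOptimizer', tf.compat.v1.train.FtrlOptimizer),
]
optimizers, optimizersworded = [], []
for lr in rates:
    for name, factory in optimizer_factories:
        optimizers.append(factory(learning_rate=lr))
        optimizersworded.append('%s(learning_rate=%s)' % (name, lr))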

Here is an example of the type of evaluations I'm getting:

{'average_loss': 350384.3, 'label/mean': 42.602047, 'loss': 89658030.0, 'prediction/mean': 42.507717, 'global_step': 1068}
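If I read these metrics correctly (an assumption on my part), average_loss is the per-example mean squared error and loss is the sum over the last batch, since their ratio comes out to roughly my batch size:

print(89658030.0 / 350384.3)  # ~255.9, i.e. about the batch size of 256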

By the way, I know there are much better ways in TensorFlow to model this kind of data. I'm only using DNNRegressor because I'm getting errors when defining my own model, as described in this thread: No Loss Found In Tensorflow Model Despite Compiling. I'm not expecting results comparable to what you could get by defining your own model and testing different combinations of layers and parameters.

Other information:

Dataframe head (the target column holds the label; the remaining column names are wrong, but each of the next 64 columns describes one square of the chess board, and the last column records whose turn it is, for 1 + 64 + 1 = 66 columns):

   target  r  n  b  q  k b.1 n.1 r.1  p  ... P.7  R X.31  B  Q  K B.1 N.1 R.1   
0      50  r  n  b  q  k   b   X   r  p  ...   P  R    X  B  Q  K   B   N   R  \
1      10  r  n  b  q  k   b   X   r  p  ...   P  R    X  B  Q  K   B   N   R   
2      75  r  n  b  q  k   b   X   r  p  ...   P  R    X  B  Q  K   B   N   R   
3      52  r  n  b  q  k   b   X   r  p  ...   P  R    X  B  Q  K   B   N   R   
4      52  r  n  b  q  k   b   X   r  p  ...   P  R    X  B  Q  K   B   N   R   

  b.2  
0   w  
1   b  
2   w  
3   b  
4   w  
[5 rows x 66 columns]

Number of rows and columns in the dataset:

row number: 1366424
column number: 66

Number of rows after splitting into train and test datasets:

1093139 train examples
273285 test examples
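On the theory that train() stops when a non-repeating dataset runs out, I checked how many steps one epoch would take at each batch size. The numbers bracket the 2000-4000 steps I'm seeing, and ceil(1093139 / 1024) happens to equal the 1068 from the evaluation above (as, coincidentally or not, does ceil(273285 / 256) over the test split):

import math

train_examples = 1093139
for batch in [256, 512, 1024, 2048]:
    print(batch, math.ceil(train_examples / batch))
# 256 -> 4271, 512 -> 2136, 1024 -> 1068, 2048 -> 534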

Here is my code for preprocessing the data and running the model with different parameters and hyperparameters. (I am not using GridSearchCV because (a) I want to test parameters, not just hyperparameters, and (b) I could not get it running without rewriting code, since GridSearchCV requires you to fit a model, not just train it. DNNRegressor doesn't have a fit method, although I saw some people getting one from an older version of TensorFlow.)

import itertools
import pickle

import pandas as pd
import tensorflow as tf
import tensorflow.compat.v1 as tf1  # assumption: tf1 is the TF1 compat module
from sklearn.model_selection import train_test_split

parameters = [embedding_dims, hidden_units, optimizers, batches]
parametersworded = [embedding_dims, hidden_units, optimizersworded, batches]

parameters_perm = list(itertools.product(*parameters))
parametersworded_perm = list(itertools.product(*parametersworded))

try:
    # reload previous results if they exist
    with open("evals", "rb") as fp:
        evals = pickle.load(fp)
except FileNotFoundError:
    evals = {}

X = pd.read_csv('RegChessDataForTensorflow.csv', nrows=1366424)  # use a smaller nrows for quick tests
X["target"] = X["target"].astype(int)
X = X.loc[:, ~X.columns.str.contains('^Unnamed')]

print(X.head())
print("row number:", X.shape[0])
print("column number:", X.shape[1])

train_ds, test_ds = train_test_split(X, test_size=0.2)
print(len(train_ds), 'train examples')
print(len(test_ds), 'test examples')

#tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)

features = {}
labels = []

def embedded_col_maker(mode, embedding_dims):

    X = train_ds if mode == tf.estimator.ModeKeys.TRAIN else test_ds

    # note: this overwrites the module-level features/labels that input_fn reads
    global features
    global labels
    features = {}
    labels = list(X.iloc[:, 0])

    for colnum in range(1, len(X.columns)):
        features[str(X.columns[colnum])] = [str(i) for i in X.iloc[:, colnum]]

    embedding_cols = []
    for colnum in range(1, len(X.columns)):
        vocab_col = tf1.feature_column.categorical_column_with_vocabulary_list(
            X.columns[colnum],
            vocabulary_list=['r', 'n', 'b', 'q', 'k', 'p', 'X',
                             'R', 'P', 'B', 'Q', 'K', 'N'],
            num_oov_buckets=0)
        embedding_cols.append(tf1.feature_column.embedding_column(vocab_col, embedding_dims))

    return embedding_cols

for i in range(len(parameters_perm)):

    print("parameters:", parametersworded_perm[i])
    print("finished", i, "of", len(parameters_perm))
    embedding_dims = int(parameters_perm[i][0])
    hidden_units = parameters_perm[i][1]
    optimizer = parameters_perm[i][2]
    batch = parameters_perm[i][3]

    try:

        if i == 0:
            embedding_cols = embedded_col_maker(tf.estimator.ModeKeys.TRAIN, embedding_dims)
        # skip combinations that already have a recorded result
        if parametersworded_perm[i] in evals.values():
            continue

        model_dir = "/Users/lukastaylor/tfpy/RegModels/modelsearcb2/" + str(i)

        DNN = tf1.estimator.DNNRegressor(feature_columns=embedding_cols,
                                         hidden_units=hidden_units,
                                         # model_dir=model_dir,
                                         optimizer=optimizer)

        # batches the module-level features/labels set by embedded_col_maker;
        # no repeat(), so train() runs until this dataset is exhausted
        input_fn = lambda: tf1.data.Dataset.from_tensor_slices((features, labels)).batch(batch)

        DNN.train(input_fn)

        embedding_cols = embedded_col_maker(tf.estimator.ModeKeys.EVAL, embedding_dims)
        evaluation = DNN.evaluate(input_fn)
        print(parametersworded_perm[i], "evaluation:", evaluation)
        evals[evaluation['average_loss']] = parametersworded_perm[i]

    except Exception as e:

        print("ERROR:", e)
        evals[e] = parametersworded_perm[i]

    # save results after every combination
    with open("evals", "wb") as fp:
        pickle.dump(evals, fp)

print(evals)
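One refactor I've been considering, to avoid the module-level features/labels and to make explicit which split each call trains or evaluates on (a sketch, untested on my full data; make_input_fn is a helper name I made up):

def make_input_fn(df, batch_size, repeat_epochs=1):
    """Build an input_fn from a dataframe slice; assumes column 0 is the
    target and the remaining columns are board squares plus the turn."""
    feature_dict = {col: df[col].astype(str).tolist() for col in df.columns[1:]}
    label_list = df.iloc[:, 0].tolist()

    def input_fn():
        ds = tf1.data.Dataset.from_tensor_slices((feature_dict, label_list))
        return ds.shuffle(10000).repeat(repeat_epochs).batch(batch_size)

    return input_fn

# Usage I have in mind: train for 5 epochs, then evaluate one pass.
# DNN.train(make_input_fn(train_ds, batch, repeat_epochs=5))
# DNN.evaluate(make_input_fn(test_ds, batch))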

Here's the output from one training loop (the first line shows the parameters in the form (embedding_dim, hidden_units, optimizer, batch_size)):

parameters: (64, [32, 32, 64, 64, 64, 128, 128, 264, 264], 'tf.compat.v1.train.AdagradOptimizer(learning_rate=0.11', 256)
finished 24 of 378
INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: /var/folders/4p/3406l7xn5txb6rmj4lqpn5br0000gn/T/tmp35hrhgdo
INFO:tensorflow:Using config: {'_model_dir': '/var/folders/4p/3406l7xn5txb6rmj4lqpn5br0000gn/T/tmp35hrhgdo', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 0...
INFO:tensorflow:Saving checkpoints for 0 into /var/folders/4p/3406l7xn5txb6rmj4lqpn5br0000gn/T/tmp35hrhgdo/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 0...
INFO:tensorflow:loss = 76700340.0, step = 0
INFO:tensorflow:global_step/sec: 46.2673
INFO:tensorflow:loss = 87202760.0, step = 100 (2.160 sec)
INFO:tensorflow:global_step/sec: 100.786
INFO:tensorflow:loss = 87366430.0, step = 200 (0.992 sec)
INFO:tensorflow:global_step/sec: 103.092
INFO:tensorflow:loss = 64167450.0, step = 300 (0.970 sec)
INFO:tensorflow:global_step/sec: 103.334
INFO:tensorflow:loss = 97141176.0, step = 400 (0.968 sec)
INFO:tensorflow:global_step/sec: 102.102
INFO:tensorflow:loss = 98498920.0, step = 500 (0.979 sec)
INFO:tensorflow:global_step/sec: 101.364
INFO:tensorflow:loss = 139143980.0, step = 600 (0.987 sec)
INFO:tensorflow:global_step/sec: 100.408
INFO:tensorflow:loss = 111471360.0, step = 700 (0.996 sec)
INFO:tensorflow:global_step/sec: 92.3166
INFO:tensorflow:loss = 82485256.0, step = 800 (1.083 sec)
INFO:tensorflow:global_step/sec: 91.5144
INFO:tensorflow:loss = 152213540.0, step = 900 (1.093 sec)
INFO:tensorflow:global_step/sec: 90.5959
INFO:tensorflow:loss = 102906220.0, step = 1000 (1.104 sec)
INFO:tensorflow:Calling checkpoint listeners before saving checkpoint 1068...
INFO:tensorflow:Saving checkpoints for 1068 into /var/folders/4p/3406l7xn5txb6rmj4lqpn5br0000gn/T/tmp35hrhgdo/model.ckpt.
INFO:tensorflow:Calling checkpoint listeners after saving checkpoint 1068...
INFO:tensorflow:Loss for final step: 60065990.0.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2023-05-05T05:49:32
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /var/folders/4p/3406l7xn5txb6rmj4lqpn5br0000gn/T/tmp35hrhgdo/model.ckpt-1068
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Inference Time : 36.61686s
INFO:tensorflow:Finished evaluation at 2023-05-05-05:50:09
INFO:tensorflow:Saving dict for global step 1068: average_loss = 403794.44, global_step = 1068, label/mean = 40.95011, loss = 103324870.0, prediction/mean = 48.46155
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1068: /var/folders/4p/3406l7xn5txb6rmj4lqpn5br0000gn/T/tmp35hrhgdo/model.ckpt-1068
(64, [32, 32, 64, 64, 64, 128, 128, 264, 264], 'tf.compat.v1.train.AdagradOptimizer(learning_rate=0.11', 256) evaluation: {'average_loss': 403794.44, 'label/mean': 40.95011, 'loss': 103324870.0, 'prediction/mean': 48.46155, 'global_step': 1068}
parameters: (64, [32, 32, 64, 64, 64, 128, 128, 264, 264], 'tf.compat.v1.train.AdagradOptimizer(learning_rate=0.11', 512)
finished 25 of 378