I am trying to fine-tune inceptionv3 model using slim tensorflow library. I am unable to understand certain things while writing the code for it. I tried to read source code (no proper documentation) and figured out few things and I am able to fine-tune it and save the check point. Here are the steps I followed 1. I created a tf.record for my training data which is fine, now I am reading the data using the below code.
import tensorflow as tf
import tensorflow.contrib.slim.nets as nets
import tensorflow.contrib.slim as slim
import matplotlib.pyplot as plt
import numpy as np
# get the data and labels here
data_path = '/home/sfarkya/nvidia_challenge/datasets/detrac/train1.tfrecords'
# Training setting
num_epochs = 100
initial_learning_rate = 0.0002
learning_rate_decay_factor = 0.7
num_epochs_before_decay = 5
num_classes = 5980
# load the checkpoint
model_path = '/home/sfarkya/nvidia_challenge/datasets/detrac/inception_v3.ckpt'
# log directory
log_dir = '/home/sfarkya/nvidia_challenge/datasets/detrac/fine_tuned_model'
with tf.Session() as sess:
feature = {'train/image': tf.FixedLenFeature([], tf.string),
'train/label': tf.FixedLenFeature([], tf.int64)}
# Create a list of filenames and pass it to a queue
filename_queue = tf.train.string_input_producer([data_path], num_epochs=1)
# Define a reader and read the next record
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
# Decode the record read by the reader
features = tf.parse_single_example(serialized_example, features=feature)
# Convert the image data from string back to the numbers
image = tf.decode_raw(features['train/image'], tf.float32)
# Cast label data into int32
label = tf.cast(features['train/label'], tf.int32)
# Reshape image data into the original shape
image = tf.reshape(image, [128, 128, 3])
# Creates batches by randomly shuffling tensors
images, labels = tf.train.shuffle_batch([image, label], batch_size=64, capacity=128, num_threads=2,
min_after_dequeue=64)
Now I am finetuning the model using slim and this is the code.
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
sess.run(init_op)
# Create a coordinator and run all QueueRunner objects
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
# load model
# load the inception model from the slim library - we are using inception v3
#inputL = tf.placeholder(tf.float32, (64, 128, 128, 3))
img, lbl = sess.run([images, labels])
one_hot_labels = slim.one_hot_encoding(lbl, num_classes)
with slim.arg_scope(slim.nets.inception.inception_v3_arg_scope()):
logits, inceptionv3 = nets.inception.inception_v3(inputs=img, num_classes=5980, is_training=True,
dropout_keep_prob=.6)
# Restore convolutional layers:
variables_to_restore = slim.get_variables_to_restore(exclude=['InceptionV3/Logits', 'InceptionV3/AuxLogits'])
init_fn = slim.assign_from_checkpoint_fn(model_path, variables_to_restore)
# loss function
loss = tf.losses.softmax_cross_entropy(onehot_labels=one_hot_labels, logits = logits)
total_loss = tf.losses.get_total_loss()
# train operation
train_op = slim.learning.create_train_op(total_loss + loss, optimizer= tf.train.AdamOptimizer(learning_rate=1e-4))
print('Im here')
# Start training.
slim.learning.train(train_op, log_dir, init_fn=init_fn, save_interval_secs=20, number_of_steps= 10)
Now I have few questions about the code, which I am quite unable to figure out. Once, the code reaches slim.learning.train I don't see anything printing however, it's training, I can see in the log. Now,
1. How do I give the number of epochs to the code? Right now it's running step by step with each step has batch_size = 64.
2. How do I make sure that in the code tf.train.shuffle_batch I am not repeating my images and I am training over the whole dataset?
3. How can I print the loss values while it's training?
Here are answers to your questions.
You cannot give epochs directly to
slim.learning.train. Instead, you give the number of batches as the argument. It is callednumber_of_steps. It is used to set an operation calledshould_stop_opon line 709. I assume you know how to convert number of epochs to batches.I don't think the
shuffle_batchfunction will repeat images because internally it uses the RandomShuffleQueue. According to this answer, theRandomShuffleQueueenqueues elements using a background thread as:size(queue) < capacity:It dequeues elements as:
number of elements dequeued < batch_size:size(queue) >= min_after_dequeue + 1elements.So in my opinion, there is very little chance that the elements would be repeated, because in the
dequeuingoperation, the chosen element is removed from the queue. So it is sampling without replacement.Will a new queue be created for every epoch?
The tensors being inputted to
tf.train.shuffle_batchareimageandlabelwhich ultimately come from thefilename_queue. If that queue is producing TFRecord filenames indefinitely, then I don't think a new queue will be created byshuffle_batch. You can also create a toy code like this to understand howshuffle_batchworks.Coming to the next point, how to train over the whole dataset? In your code, the following line gets the list of TFRecord filenames.
If
filename_queuecovers all TFRecords that you have, then you are surely training over the entire dataset. Now, how to shuffle the entire dataset is another question. As mentioned here by @mrry, there is no support (yet, AFAIK) to shuffle out-of-memory datasets. So the best way is to prepare many shards of your dataset such that each shard contains about 1024 examples. Shuffle the list of TFRecord filenames as:Note that I removed the
num_epochs = 1argument and setshuffle=True. This way it will produce the shuffled list of TFRecord filenames indefinitely. Now on each file, if you usetf.train.shuffle_batch, you will get a near-to-uniform shuffling. Basically, as the number of examples in each shard tend to 1, your shuffling will get more and more uniform. I like to not setnum_epochsand instead terminate the training using thenumber_of_stepsargument mentioned earlier.training.pyand introducelogging.info('total loss = %f', total_loss). I don't know if there is any simpler way. Another way without changing the code is to view summaries in Tensorboard.There are very helpful articles on how to view summaries in Tensorboard, including the link at the end of this answer. Generally, you need to do the following things.
summaryobject.summary.summaryop.Now steps 5 and 6 are already done automatically for you if you use
slim.learning.train.For first 4 steps, you could check the file
train_image_classifier.py. Line 472 shows you how to create asummariesobject. Lines 490, 512 and 536 write the relevant variables intosummaries. Line 549 merges all summaries and the line 553 creates an op. You can pass this op toslim.learning.trainand you can also specify how frequently you want to write summaries. In my opinion, do not write anything apart from loss, total_loss, accuracy and learning rate into the summaries, unless you want to do specific debugging. If you write histograms, then the tensorboard file could take tens of hours to load for networks like ResNet-50 (my tensorboard file once was 28 GB, which took 12 hours to load the progress of 6 days!). By the way, you could actually usetrain_image_classifier.pyfile to finetune and you will skip most of the steps above. However, I prefer this as you get to learn a lot of things.See the launching tensorboard section on how to view the progress in a browser.
Additional remarks:
Instead of minimizing
total_loss + loss, you could do the following:I found this post to be very useful when I was learning Tensorflow.