I am trying to train the DeepLab v3+ model (https://github.com/tensorflow/models/research/deeplab/) on the WiSe dataset (https://cvhci.anthropomatik.kit.edu/~mhaurile/wise/). I modified the parameters in the provided scripts and started train.py. Although the loss keeps decreasing (from about 2.7 at step 10 to about 1.9 at step 100), the exported checkpoints predict all zeros for every image, including every training image.
Dataset information (I have processed the dataset to suit my needs):
Train images: 1222
Val images: 100
Total images: 1322
Total classes: 9 (including background)
Classes: ['background', 'TitleSlide', 'PresTitle', 'ImageCaption', 'Image', 'Code', 'Enumeration', 'Tables', 'Paragraph']

I added the following code to datasets/data_generator.py:

_WISE_SEG_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 1222,
        'trainval': 1322,
        'val': 100,
    },
    num_classes=10,        # 8 foreground + 1 background + 1 ignore
    ignore_label=255,
)

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'wise_seg': _WISE_SEG_INFORMATION,
}

Note that in my dataset, no image actually has any pixel with label 255. Each label is in the range [0, 8]. I have also tried setting num_classes to 9, without any success.
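For reference, the class count can be derived from the class list itself. The following is a minimal sketch, assuming wise_seg follows the same convention as the pascal_voc_seg entry in that file (21 classes = 20 objects + background, with ignore_label=255 kept outside the count). CLASS_NAMES and IGNORE_LABEL are illustrative names, not DeepLab identifiers.

```python
# Illustrative names; not part of the DeepLab code base.
CLASS_NAMES = ['background', 'TitleSlide', 'PresTitle', 'ImageCaption',
               'Image', 'Code', 'Enumeration', 'Tables', 'Paragraph']
IGNORE_LABEL = 255

# Assumed convention: one output channel per real class. The ignore label
# only masks pixels out of the loss and is not counted as a class.
num_classes = len(CLASS_NAMES)
valid_labels = set(range(num_classes))

print(num_classes)                   # 9
print(IGNORE_LABEL in valid_labels)  # False
```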
My directory structure is as follows:

deeplab
├── datasets
│   ├── wise_seg
│   │   ├── exp
│   │   │   └── train_on_train_set
│   │   │       ├── eval
│   │   │       ├── export
│   │   │       ├── train
│   │   │       └── vis
│   │   ├── init_models
│   │   │   └── xception
│   │   │       ├── model.ckpt.data-00000-of-00001
│   │   │       └── model.ckpt.index
│   │   ├── tfrecord
│   │   └── WiSe
│   │       ├── Annotations
│   │       ├── ImageSets
│   │       │   └── Segmentation
│   │       │       ├── train.txt
│   │       │       ├── trainval.txt
│   │       │       └── val.txt
│   │       ├── JPEGImages
│   │       ├── SegmentationClass
│   │       └── SegmentationClassRaw
│   └── __pycache__
└── Other stuff

The command which I used to run the training:

python ./train.py \
  --logtostderr \
  --train_split="train" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --train_crop_size="513,513" \
  --train_batch_size=16 \
  --training_number_of_steps=30000 \
  --fine_tune_batch_norm=true \
  --tf_initial_checkpoint="./datasets/wise_seg/init_models/xception/model.ckpt" \
  --train_logdir="./datasets/wise_seg/train" \
  --dataset="wise_seg" \
  --initialize_last_layer=false \
  --last_layers_contain_logits_only=false \
  --dataset_dir="./datasets/wise_seg/tfrecord"

Note that I have set initialize_last_layer = False and last_layers_contain_logits_only = False. I have used the ImageNet pretrained Xception-65 model as the backbone network, which I downloaded from the link given here (specifically, I used xception_65_imagenet).
I also made the following change in utils/train_utils.py:

exclude_list = ['global_step', 'logits']
if not initialize_last_layer:
  exclude_list.extend(last_layers)

Training starts successfully and has reached about step 110 so far. I exported an intermediate checkpoint using the following command:

python ./export_model.py \
  --logtostderr \
  --checkpoint_path="./datasets/wise_seg/exp/train_on_train_set/train/model.ckpt-41" \
  --export_path="./datasets/wise_seg/exp/train_on_train_set/export/frozen_inference_graph-41.pb" \
  --model_variant="xception_65" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --num_classes=${3} \
  --crop_size=513 \
  --crop_size=513 \
  --inference_scales=1.0

The checkpoint gets exported successfully. Then I try to run inference using the sample notebook given here. Specifically, when I run the following part, 0 gets printed in the output:

graph_path = './datasets/wise_seg/exp/train_on_train_set/export/frozen_inference_graph-41.pb'
MODEL = DeepLabModel(graph_path)
resized_im, seg_map = MODEL.run(Image.open('./datasets/wise_seg/WiSe/JPEGImages/130110-3MQQHISL3D-540_frame11610.jpg'))
print(sum(sum(seg_map)))

And the same happens for any given image. Why is this happening? Any help would be deeply appreciated.
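A per-class pixel count is more informative than a single sum, for example once later checkpoints start predicting a few classes. The following is a minimal sketch, assuming seg_map is a 2-D map of integer class ids as returned by MODEL.run in the notebook. class_pixel_counts is a hypothetical helper, and CLASS_NAMES simply repeats the class list given above.

```python
from collections import Counter

# Class list from the dataset description; index = label id.
CLASS_NAMES = ['background', 'TitleSlide', 'PresTitle', 'ImageCaption',
               'Image', 'Code', 'Enumeration', 'Tables', 'Paragraph']

def class_pixel_counts(seg_map):
    """Count pixels per predicted class in a 2-D map (rows of integer ids)."""
    counts = Counter(int(v) for row in seg_map for v in row)
    result = {}
    for class_id, n in sorted(counts.items()):
        if 0 <= class_id < len(CLASS_NAMES):
            name = CLASS_NAMES[class_id]
        else:
            name = 'unknown(%d)' % class_id  # id outside the known classes
        result[name] = n
    return result

# Toy stand-in for a real seg_map; a NumPy 2-D array iterates the same way.
toy_map = [[0, 0, 4],
           [0, 8, 4]]
print(class_pixel_counts(toy_map))
# {'background': 3, 'Image': 2, 'Paragraph': 1}
```

An all-zero prediction would show up here as a single 'background' entry covering every pixel.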

You should train for many more than 110 steps (2000+ at a minimum); the loss should end up lower than 1.9. Please also make sure that the label masks contain only the pixel values 0, 1, 2, 3, 4, ... 8. Also, setting num_classes = 9 is correct.
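The mask check can be sketched as follows. This is only an illustration, assuming each mask is a 2-D grid of integer labels; unexpected_labels is a hypothetical helper, and the defaults (9 classes, ignore label 255) are taken from the question.

```python
def unexpected_labels(mask, num_classes=9, ignore_label=255):
    """Return the sorted label values in `mask` that training would not expect."""
    allowed = set(range(num_classes)) | {ignore_label}
    seen = {int(v) for row in mask for v in row}
    return sorted(seen - allowed)

# Toy masks standing in for real annotation files.
good_mask = [[0, 1, 2],
             [8, 255, 0]]
bad_mask = [[0, 38, 75],   # e.g. colour-coded values instead of class ids
            [8, 0, 0]]

print(unexpected_labels(good_mask))  # []
print(unexpected_labels(bad_mask))   # [38, 75]
```

For real data, each PNG in SegmentationClassRaw could be loaded with something like numpy.array(PIL.Image.open(path)) and passed in directly; that loading step is an assumption about how the masks are stored. A non-empty result for any file would mean the masks need to be re-encoded before training.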