I'm attempting to apply Matterport's Mask-RCNN to my own data, but despite all the impressive detection examples I've seen out there, I'm struggling to get results that are at all promising. I suspect there's something fundamental I'm overlooking in my setup.
My dataset consists of aerial RGB shots of a city, with two classes: tree and background.
Image info: aerial RGB photos, all 512x512; 324 training images, 36 validation images; training uses random 128x128 crops.
~46 trees per image on average.
Each training session ends up with something looking pretty similar to this:
Testing on the validation set with no image cropping, using inspect_model.ipynb as a guide, gives the following rough stats (a sketch of the evaluation loop is below the numbers):
Original image shape: [512 512 3]
Processing 1 images
image shape: (512, 512, 3) min: 23.00000 max: 255.00000 uint8
molded_images shape: (1, 512, 512, 3) min: 23.00000 max: 255.00000 uint8
image_metas shape: (1, 14) min: 0.00000 max: 512.00000 int64
anchors shape: (1, 65280, 4) min: -0.17712 max: 1.11450 float32
gt_class_id shape: (12,) min: 1.00000 max: 1.00000 int32
gt_bbox shape: (12, 4) min: 20.00000 max: 512.00000 int32
gt_mask shape: (512, 512, 12) min: 0.00000 max: 1.00000 float64
AP @0.50: 0.000
AP @0.55: 0.000
AP @0.60: 0.000
AP @0.65: 0.000
AP @0.70: 0.000
AP @0.75: 0.000
AP @0.80: 0.000
AP @0.85: 0.000
AP @0.90: 0.000
AP @0.95: 0.000
AP @0.50-0.95: 0.000
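For completeness, the AP sweep above was produced with something close to the following (adapted from inspect_model.ipynb; `model`, `dataset_val`, and `inference_config` are my own objects, and the variable names here are approximate):

```python
import numpy as np
import mrcnn.model as modellib
from mrcnn import utils

# `model` is a MaskRCNN instance in inference mode with my trained weights
# loaded; `dataset_val` / `inference_config` are my validation dataset and config.
thresholds = np.arange(0.5, 1.0, 0.05)
ap_per_threshold = {t: [] for t in thresholds}

for image_id in dataset_val.image_ids:
    # Load the full 512x512 image and its ground truth (no mini-masks).
    image, image_meta, gt_class_id, gt_bbox, gt_mask = modellib.load_image_gt(
        dataset_val, inference_config, image_id, use_mini_mask=False)
    r = model.detect([image], verbose=0)[0]
    for t in thresholds:
        ap, _, _, _ = utils.compute_ap(
            gt_bbox, gt_class_id, gt_mask,
            r["rois"], r["class_ids"], r["scores"], r["masks"],
            iou_threshold=t)
        ap_per_threshold[t].append(ap)

for t in thresholds:
    print("AP @{:.2f}: {:.3f}".format(t, np.mean(ap_per_threshold[t])))
print("AP @0.50-0.95: {:.3f}".format(
    np.mean([np.mean(v) for v in ap_per_threshold.values()])))
```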
I keep getting the same results: seemingly high-confidence detections with zero or near-zero IoU, generally clustered at the tops of the images. This happens even after implementing the advice for small datasets I've found elsewhere in the Mask-RCNN repo, such as training only the heads, initializing with COCO weights (but not training for too long), and adjusting my anchor scales to match the general sizes and aspect ratios of the annotations.
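For context, the training schedule I've been using looks roughly like this (epoch counts are illustrative; `MODEL_DIR`, `COCO_WEIGHTS_PATH`, `dataset_train`, and `dataset_val` are my own objects):

```python
import mrcnn.model as modellib

model = modellib.MaskRCNN(mode="training", config=config, model_dir=MODEL_DIR)

# Start from COCO weights, skipping the layers whose shapes depend on NUM_CLASSES.
model.load_weights(COCO_WEIGHTS_PATH, by_name=True, exclude=[
    "mrcnn_class_logits", "mrcnn_bbox_fc", "mrcnn_bbox", "mrcnn_mask"])

# Stage 1: train the heads only.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=20, layers="heads")

# Stage 2: fine-tune ResNet stage 4 and up.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40, layers="4+")

# Stage 3: fine-tune all layers at a lower learning rate.
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=60, layers="all")
```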
So far I'm questioning:
- Is my dataset simply too small for the complexity of a ResNet-101 backbone?
- Maybe something is up with my annotations? (see the spot-check sketch after this list)
- I'm screwing up a fundamental aspect of my config
- Unknown unknowns
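On the annotations question, this is roughly how I've been spot-checking that the masks line up with the images, using the repo's own helpers (dataset/config names are mine):

```python
import random
import mrcnn.model as modellib
from mrcnn import visualize

# Draw a few random training samples exactly as the model sees them
# (same config, so the random 128x128 crop is applied here too).
for image_id in random.sample(list(dataset_train.image_ids), 4):
    image, image_meta, class_ids, bbox, mask = modellib.load_image_gt(
        dataset_train, config, image_id, use_mini_mask=False)
    visualize.display_instances(image, bbox, mask, class_ids,
                                dataset_train.class_names)
```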
Checking the losses, what obviously stands out is the high overall loss (epoch_loss), which increases with each training stage (heads only -> ResNet 4+ -> all layers):
My config:
Configurations:
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 8
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 100
DETECTION_MIN_CONFIDENCE 0.5
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 8
IMAGE_CHANNEL_COUNT 3
IMAGE_MAX_DIM 128
IMAGE_META_SIZE 14
IMAGE_MIN_DIM 128
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE crop
IMAGE_SHAPE [128 128 3]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 101
MEAN_PIXEL [107. 105.2 101.5]
MINI_MASK_SHAPE (56, 56)
NAME tree
NUM_CLASSES 2
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 1.5]
RPN_ANCHOR_SCALES (16, 32, 64, 128)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.9
RPN_TRAIN_ANCHORS_PER_IMAGE 64
STEPS_PER_EPOCH 500
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 200
USE_MINI_MASK False
USE_RPN_ROIS True
VALIDATION_STEPS 50
WEIGHT_DECAY 0.005
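For reference, the dump above comes from a Config subclass roughly like this (only listing the values I override; IMAGE_SHAPE, BATCH_SIZE, etc. are derived by the base class):

```python
import numpy as np
from mrcnn.config import Config

class TreeConfig(Config):
    NAME = "tree"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 8
    NUM_CLASSES = 1 + 1                      # background + tree
    IMAGE_RESIZE_MODE = "crop"
    IMAGE_MIN_DIM = 128
    IMAGE_MAX_DIM = 128
    MEAN_PIXEL = np.array([107.0, 105.2, 101.5])
    RPN_ANCHOR_SCALES = (16, 32, 64, 128)
    RPN_ANCHOR_RATIOS = [0.5, 1, 1.5]
    RPN_NMS_THRESHOLD = 0.9
    RPN_TRAIN_ANCHORS_PER_IMAGE = 64
    TRAIN_ROIS_PER_IMAGE = 200
    MAX_GT_INSTANCES = 101
    DETECTION_MIN_CONFIDENCE = 0.5
    USE_MINI_MASK = False
    STEPS_PER_EPOCH = 500
    VALIDATION_STEPS = 50
    WEIGHT_DECAY = 0.005

config = TreeConfig()
config.display()
```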