Error when trying to train FasterRCNN with custom backbone on GRAYSCALE images

I am following the https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html#putting-everything-together tutorial in order to create an object detector for one class on GRAYSCALE images.

Here is my code (note that I am using a DenseNet as the BACKBONE, pretrained by me on my own dataset):

import os
import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

num_classes = 2  # 1 class + background

# load the DenseNet-121 I pretrained on my own dataset and keep its feature extractor
model = torch.load(os.path.join(patch_classifier_model_dir, "densenet121.pt"))
backbone = model.features
backbone.out_channels = 1024  # DenseNet-121 feature maps have 1024 channels

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

# the backbone returns a single feature map, exposed under the key '0'
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],
                                                output_size=7,
                                                sampling_ratio=2)

# put the pieces together inside a FasterRCNN model
model = FasterRCNN(backbone,
                   num_classes=num_classes,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)

# move model to the right device
model.to(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=0.0005)

# and a learning rate scheduler which decreases the learning rate by
# 10x every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=3,
                                               gamma=0.1)

This is the error that I am running into:

RuntimeError: Given groups=1, weight of size [64, 1, 7, 7], expected input[2, 3, 1344, 800] to have 1 channels, but got 3 channels instead

Based on the FasterRCNN architecture, I assume the problem is in the transform component, because it tries to normalize images that are initially grayscale rather than RGB:

FasterRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )
  (backbone): Sequential(
    (conv0): Conv2d(1, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (norm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu0): ReLU(inplace=True)
    (pool0): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (denseblock1): _DenseBlock(
      (denselayer1): _DenseLayer(
        (norm1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu1): ReLU(inplace=True)
        (conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
        (norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu2): ReLU(inplace=True)
        (conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      )
      
      ...............
        
    (norm5): BatchNorm2d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (rpn): RegionProposalNetwork(
    (anchor_generator): AnchorGenerator()
    (head): RPNHead(
      (conv): Conv2d(1024, 1024, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (cls_logits): Conv2d(1024, 15, kernel_size=(1, 1), stride=(1, 1))
      (bbox_pred): Conv2d(1024, 60, kernel_size=(1, 1), stride=(1, 1))
    )
  )
  (roi_heads): RoIHeads(
    (box_roi_pool): MultiScaleRoIAlign()
    (box_head): TwoMLPHead(
      (fc6): Linear(in_features=50176, out_features=1024, bias=True)
      (fc7): Linear(in_features=1024, out_features=1024, bias=True)
    )
    (box_predictor): FastRCNNPredictor(
      (cls_score): Linear(in_features=1024, out_features=2, bias=True)
      (bbox_pred): Linear(in_features=1024, out_features=8, bias=True)
    )
  )
)

Am I correct? If so, how do I resolve this issue? Is there a STANDARD PRACTICE for dealing with grayscale images and FasterRCNN?

Thanks in advance! Really appreciate it!

BEST ANSWER

Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) means that normalization is applied per channel of the input image: the mean 0.485 is applied to the R channel, 0.456 to the G channel, and 0.406 to the B channel. The same goes for the standard deviation values.

The 1st conv layer of the backbone expects a 1-channel input, but normalizing with a 3-element mean/std broadcasts your [1, H, W] grayscale image against the [3, 1, 1] channel statistics, producing a 3-channel tensor, and that's the reason you get this error.
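
As a minimal sketch of the broadcasting involved (mirroring the normalize step of GeneralizedRCNNTransform; the image size below is just illustrative):

import torch

# a single GRAYSCALE image, shape [1, H, W]
image = torch.rand(1, 800, 1344)

# 3-element per-channel statistics, as in the default transform
mean = torch.tensor([0.485, 0.456, 0.406])
std = torch.tensor([0.229, 0.224, 0.225])

# broadcasting [1, H, W] against [3, 1, 1] silently produces [3, H, W]
normalized = (image - mean[:, None, None]) / std[:, None, None]
print(normalized.shape)  # torch.Size([3, 800, 1344])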

To solve the issue, re-define the GeneralizedRCNNTransform with single-channel statistics and attach it to your model. You could do something like this:

# put the pieces together inside a FasterRCNN model
model = FasterRCNN(backbone,
                   num_classes=2,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)

# replace the default transform with one that uses
# single-channel (grayscale) mean and std
grcnn = torchvision.models.detection.transform.GeneralizedRCNNTransform(
    min_size=800, max_size=1333, image_mean=[0.485], image_std=[0.229])
model.transform = grcnn
model.to(device)
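
To sanity-check the fix, a quick forward pass with a dummy 1-channel image (the spatial size is arbitrary) should now reach the backbone without the channel mismatch:

# in eval mode, FasterRCNN accepts a list of [C, H, W] tensors
model.eval()
with torch.no_grad():
    predictions = model([torch.rand(1, 512, 512, device=device)])
print(predictions[0]['boxes'].shape)  # (num_detections, 4)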