Our team records video streams of moving trains from various camera locations, with differing backgrounds and distances from the rails. Our task is to collect information about each wagon, which requires detecting the gaps between them. We have trained a detector using the YOLOv5 architecture with default data augmentation on a dataset of over 2,000 labeled images, as well as unannotated background images containing no gaps. However, we are seeing frequent false positives and poor performance in low-light conditions.
Our current post-processing step runs the DBSCAN algorithm to group frames containing couplers (see the image below for an example with a bounding box around the coupler), then filters out low-confidence detections based on the mean and standard deviation of their confidence scores.
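Roughly, the post-processing looks like this (a minimal sketch; the data layout, `eps`, `min_samples`, and the 0.6 / 0.15 thresholds are illustrative placeholders, not our production values):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# One row per coupler detection: (frame_index, confidence).
detections = np.array([
    [101, 0.82], [102, 0.79], [103, 0.85],   # one gap passing the camera
    [340, 0.41], [341, 0.38],                # a sparse low-confidence cluster
])

# Cluster detections along the time axis: frames less than `eps`
# frames apart are treated as the same physical gap.
labels = DBSCAN(eps=5, min_samples=3).fit_predict(detections[:, [0]])

for gap_id in set(labels) - {-1}:            # -1 = DBSCAN noise
    conf = detections[labels == gap_id, 1]
    # Keep a cluster only if its confidences are high and consistent.
    if conf.mean() >= 0.6 and conf.std() <= 0.15:
        frames = detections[labels == gap_id, 0].astype(int)
        print(f"gap {gap_id}: frames {frames}")
```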
Additionally, we have recently collected 50k images from different locations, both with and without couplers. The images were collected dynamically by the current application: a frame was assigned the class GAP if we found a coupler in it with at least 60% confidence, frames with a coupler below 60% confidence were rejected, and frames with no coupler were assigned the class NO_GAP. Using these images, we trained a binary classifier with the labels [GAP, NO_GAP] using the YOLOv8 architecture. However, we are unsure whether a binary classifier can generalize well enough for our task, since many different concepts are lumped together as NO_GAP.
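For reference, the classifier training step looks roughly like this (a minimal sketch using the ultralytics package; the paths, model size, and hyperparameters are illustrative, and the dataset folder is assumed to follow the usual train/val classification layout with one subfolder per class):

```python
from ultralytics import YOLO

# Start from a pretrained YOLOv8 classification backbone.
model = YOLO("yolov8n-cls.pt")

# datasets/gap_cls/{train,val}/{GAP,NO_GAP}/  -- hypothetical path.
model.train(data="datasets/gap_cls", epochs=30, imgsz=224)

# Inference on a new frame returns per-class probabilities.
result = model("frame_000123.jpg")[0]        # hypothetical file name
print(result.names, result.probs.data)       # e.g. {0: 'GAP', 1: 'NO_GAP'}
```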
We are considering other deep learning approaches, such as semi-supervised learning and contrastive learning, as potential solutions to these problems. We are also interested in trying different architectures, such as a ViT with a patching approach, although we have limited experience with them.
Our main questions are:
What deep learning architectures or techniques would you recommend we explore to improve the accuracy of our model at detecting train wagon gaps under variable lighting and environmental conditions?
Is it worth staying with classification but using a different architecture, such as a ViT with a patching approach? Are there any specific implementations or examples of these architectures that we can refer to?
We have a lot of unlabeled data. Is it worth using self-supervised learning as a pretraining step? Is there a rule of thumb for things like the unlabeled-to-labeled data ratio, the required computing power, how to select an algorithm, and how to decide when to stop pretraining?
Example of video frames (with detected couplers)
Maybe some of this will be of use:
I assume a small CNN with fully connected layers at the end and fewer than a million parameters should be enough for a binary classifier with so little data. If you write such a network yourself, it will also be easier to implement an encoder-decoder network for pretraining the encoder (and later the classifier) on your unlabeled data (which I have never done myself).
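For example, a rough PyTorch sketch of what I mean (untested; the layer sizes and 224x224 input size are arbitrary):

```python
import torch
import torch.nn as nn

# Encoder: a small conv stack, well under a million parameters.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 224 -> 112
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 112 -> 56
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 56 -> 28
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),                 # -> 64-dim code
)

# Decoder, used only for reconstruction pretraining on unlabeled frames.
decoder = nn.Sequential(
    nn.Linear(64, 64 * 28 * 28), nn.Unflatten(1, (64, 28, 28)),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),     # back to 224x224
)

# Pretraining step: minimize reconstruction error on unlabeled images.
x = torch.randn(8, 3, 224, 224)          # stand-in for a batch of frames
loss = nn.functional.mse_loss(decoder(encoder(x)), x)

# Afterwards, reuse the pretrained encoder with a small head
# for the GAP / NO_GAP decision and fine-tune on the labeled data.
classifier = nn.Sequential(encoder, nn.Linear(64, 2))
logits = classifier(x)                   # shape: (8, 2)
```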
Well, all in all, this looks like a fun problem :-)