Improving Deep Learning Model to Detect Train Wagon Gaps in Variable Conditions


Our team records video streams of moving trains from various camera locations with differing backgrounds and distances from the rails. Our task is to collect information about each wagon, which requires detecting gaps between them. We have trained a deep neural network using the Yolov5 architecture with default data augmentation on a dataset of over 2000 labeled images, as well as unlabeled images without gaps. However, we are experiencing several issues with false positives and poor performance in low-light conditions.

Our current post-processing step runs the DBSCAN algorithm to group frames with "couplers" (see the image below for an example with a bounding box around a coupler), and filters out low-confidence examples based on mean confidence and standard deviation.
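For reference, that post-processing step could look roughly like the following minimal sketch. The function name, the idea of clustering detections along the frame-index axis, and all thresholds are assumptions for illustration; the actual pipeline may differ:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def group_coupler_detections(frame_indices, confidences, eps=5, min_samples=3,
                             min_mean_conf=0.5, max_conf_std=0.2):
    """Cluster per-frame coupler detections along the time axis and keep
    only clusters whose confidence statistics look reliable."""
    X = np.asarray(frame_indices, dtype=float).reshape(-1, 1)
    conf = np.asarray(confidences, dtype=float)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)

    kept = []
    for label in set(labels) - {-1}:          # -1 marks DBSCAN noise points
        mask = labels == label
        if conf[mask].mean() >= min_mean_conf and conf[mask].std() <= max_conf_std:
            kept.append(int(X[mask].mean()))  # representative frame for the gap
    return kept
```

A cluster of consistently high-confidence detections across consecutive frames is kept as one gap; isolated or low-confidence detections fall out as noise.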

Additionally, we have recently collected 50k images from different locations, both with and without couplers. Images were collected dynamically using the current application: an image was assigned the class "GAP" if we found a coupler in it with at least 60% confidence, images with couplers below 60% confidence were rejected, and images with no couplers were assigned to the "NO_GAP" class. Using these images, we trained a binary classifier with labels [GAP, NO_GAP] using the Yolov8 architecture. However, we are unsure if a binary classifier can generalize well enough for our task, as we treat many different concepts as "NO_GAP."

We are considering other deep learning approaches, such as semi-supervised learning and contrastive learning, as potential solutions to our problems. We are also interested in trying different architectures, such as a ViT with a patching approach, although we have limited experience with them.

Our main questions are:

  1. What deep learning architectures or techniques would you recommend we explore to improve the accuracy of our model in detecting train wagon gaps in variable lighting and environmental conditions?

  2. Is it worth staying with classification but using a different architecture, such as a ViT with a patching approach? Are there any specific implementations or examples of these architectures that we can refer to?

  3. We have a lot of unlabeled data. Is it worth using self-supervised learning as a pre-training step? Is there a rule of thumb for things like the unlabeled/labeled data ratio, required computing power, algorithm selection, and how to determine when to stop pre-training?

Example of video frames (with detected couplers)

Example of the gap between wagons during the night

Example of the gap between wagons during the day




Maybe some of this will be of use:

  1. Why are you using a localization network, such as Yolo, to perform what is essentially a classification task? Yolo produces a huge output vector, potentially detecting many objects of many classes anywhere in the image. This seems like overkill.
  2. Why label couplers in the background? This makes the network less confident and doesn't help you much (I guess). Removing those labels should make the problem easier to learn. Alternatively, you can give the two couplers different labels; that way the network is not confused by tiny, partially hidden couplers in the background and big ones in the front. You can probably re-label the data automatically (if there is one label, it's probably the front coupler; if there are two labels, the bigger one is the front coupler?).
  3. Something you may already be considering: with video data, some frames allow easy detection and some are more difficult (e.g. due to light reflections). Using multiple frames of the same wagon might give better results, e.g. via an average confidence or the like.
  4. Yolo has hyper-parameters to control the relative importance of localization error, class error, and "objectness" error. The latter is the most important for you, and putting more focus on it might help if you don't want to switch to a real classifier in the first place.
  5. Seeing that you detect more than just couplers (namely text), I want to point out that your link says left-right flipping is performed as augmentation. This is sensible in general, but probably harmful for text detection (letters on the wagon).
  6. Your problem becomes easier if you use the external knowledge that couplers are always tied to the position of the tracks. By aligning the images in relation to the tracks, you could crop the input to the relevant area, decreasing the number of false positives in unlikely places. Smaller images also speed up inference and training.
  7. In general, a classifier will always predict the class that causes the least "trouble". If your dataset is skewed towards NO_GAP images, it will learn to predict NO_GAP, as this is true in most cases and therefore less risky. You should therefore provide the same number of images for all classes; if that is not possible, draw more images from the "GAP" folder than from the "NO_GAP" folder to make up for it.
  8. As there is a bounty on this question, I assume "money doesn't matter" ;-) and you have resources to provide more manual labels. Classifying images into two classes is very quick, and I even suggest the developers do some of it themselves: knowing your own data teaches you a lot about how to tackle the problem and gives you further ideas. If there is a lot of lighting variation, it might help to standardize each instance, e.g. rescale each image's HSV "value" channel so its mean matches the dataset's average "value" channel. A word of warning: I don't know the exact architecture you are referring to, but pre-processing the data in a new way will kill the performance of a pre-trained / out-of-the-box network without further fine-tuning. Also, normalization layers might already provide some sort of channel adjustment.
  9. Given that you do not have excessive training data, a default network might be overkill. For example, you are feeding grayscale images into an RGB network, which is inefficient; with a network pre-trained on colour images it might even be harmful. If you don't use a pre-trained network, self-supervised pre-training is probably a good idea.
  10. Last but not least, here is another way of increasing accuracy for your use case: since cargo trains have predictable motion, you can post-process the gaps. All gaps should be separated by roughly the same time interval (if wagons have a standard length), which helps to handle both false positives and false negatives.
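To make point 3 concrete, aggregating confidences over multiple frames of the same wagon could be as simple as the sketch below. The wagon IDs and the plain mean are illustrative assumptions; a median or trimmed mean would work just as well:

```python
from collections import defaultdict

def aggregate_wagon_confidence(detections):
    """detections: iterable of (wagon_id, confidence) pairs, one per frame.
    Returns the mean confidence per wagon across all frames it appears in."""
    sums = defaultdict(lambda: [0.0, 0])
    for wagon_id, conf in detections:
        sums[wagon_id][0] += conf
        sums[wagon_id][1] += 1
    return {wid: total / count for wid, (total, count) in sums.items()}
```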
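Point 7's class balancing can be done by oversampling the minority class each epoch. Here is a minimal, framework-agnostic sketch (the function name and fixed seeding are illustrative; most frameworks also offer class weights or weighted samplers as an alternative):

```python
import random

def balanced_epoch(gap_paths, no_gap_paths, seed=0):
    """Oversample the minority class so one epoch contains an equal
    number of GAP and NO_GAP images."""
    rng = random.Random(seed)
    n = max(len(gap_paths), len(no_gap_paths))

    def resample(paths):
        # Pad the smaller class with random repeats up to size n.
        return paths if len(paths) == n else paths + rng.choices(paths, k=n - len(paths))

    epoch = [(p, "GAP") for p in resample(gap_paths)] + \
            [(p, "NO_GAP") for p in resample(no_gap_paths)]
    rng.shuffle(epoch)
    return epoch
```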
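The brightness standardization from point 8 could look like this. The sketch assumes the image has already been converted to HSV (e.g. with OpenCV) and that the dataset-average "value" channel mean has been computed beforehand:

```python
import numpy as np

def match_value_channel(hsv_image, target_mean_v):
    """Rescale the HSV 'value' channel so its mean matches the dataset
    average -- a simple per-image brightness standardization."""
    out = hsv_image.astype(np.float32).copy()
    v = out[..., 2]
    current = v.mean()
    if current > 0:
        out[..., 2] = np.clip(v * (target_mean_v / current), 0, 255)
    return out.astype(np.uint8)
```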
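Point 10's spacing check can be sketched as follows, assuming a roughly constant train speed so the expected frame spacing between gaps is known. The tolerance and the integer-multiple handling (which lets the filter survive a missed gap in between) are illustrative choices:

```python
def filter_gaps_by_spacing(gap_frames, expected_spacing, tolerance=0.25):
    """Keep a detected gap only if its distance to the previously kept gap
    is close to an integer multiple of the expected per-wagon frame
    spacing; multiples > 1 allow for missed gaps in between."""
    if not gap_frames:
        return []
    kept = [gap_frames[0]]
    for f in gap_frames[1:]:
        delta = f - kept[-1]
        multiple = max(1, round(delta / expected_spacing))
        if abs(delta - multiple * expected_spacing) <= tolerance * expected_spacing:
            kept.append(f)
    return kept
```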

I assume a small CNN with fully connected layers at the end, with fewer than a million parameters, should be enough for a binary classifier with so little data. If you write such a network yourself, it will also be easier to implement an encoder-decoder network for pre-training the encoder (and later the classifier) on your unlabelled data (which I have never done myself).
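For scale, a classifier in that spirit might look like the PyTorch sketch below. The layer sizes, input resolution, and class head are all illustrative assumptions; the whole network has roughly 24k parameters, far under the million mentioned above:

```python
import torch
import torch.nn as nn

class SmallGapClassifier(nn.Module):
    """Tiny CNN for GAP / NO_GAP classification on single-channel
    (grayscale) inputs, e.g. 1 x 128 x 128 images."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # -> 16 x 64 x 64
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # -> 32 x 32 x 32
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # -> 64 x 16 x 16
            nn.AdaptiveAvgPool2d(1),                               # -> 64 x 1 x 1
        )
        self.classifier = nn.Linear(64, 2)  # logits for [GAP, NO_GAP]

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```

For the encoder-decoder pre-training idea, the `features` stack would serve as the encoder, with a small transposed-convolution decoder reconstructing the input from its output.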

Well, all in all, this looks like a fun problem :-)