Unsure how to train an ML model to recognise static imagery


I'm trying to build an ML model for a specific use case. I've read up on various libraries and attempted to train my own classifiers, but I feel like what I'm doing isn't quite right: the setups for object detection all seem based on the idea that the object you're detecting can take a vast number of forms, and the training methods are designed to account for that. My use case is different.

I have static, flat imagery that I want to identify - for example, a book cover. It therefore seems I shouldn't need to provide many images of it, just a single image of what it looks like from the front. I want to train an ML model so that, after training, I could show it an image of that book cover and it would recognise it.

The image of the book cover presented after training may include environmental factors, such as different lighting or an alternate angle, but the idea is that if the book cover itself is in full view, it should be recognisable.

It's proven quite difficult to figure out what to do here. Every guide I've come across is designed for training on objects that can potentially take many forms, and adapting those guides for my purpose hasn't been successful.

I've tried Turi Create's very simple setup, training on the single data point I have for each book and then using that same data for validation, since I obviously don't have separate training and validation sets. Turi Create takes care of all the training details and is clearly designed for many examples per class, so I feel like I'm badly misusing it here. Upon testing, it also doesn't work for object detection.

I've had some limited success using OpenCV's keypoint detection and nearest-neighbour matching features, but the goal is a much broader catalogue - perhaps 10,000 books - so it's not practical to do a pairwise image comparison against each of them.

I've been learning more about ML and computer vision over the past month, but it's certainly not my area of expertise - I'm primarily a software developer. I'd appreciate any advice I could get here.


There are 2 best solutions below


Your question has no neat out-of-the-box answer (sorry to say), but there are a few key areas of computer vision / machine learning that you'll want to know about to get this solved.

First: if you really want to stay within OpenCV and existing libraries (i.e. you don't want this to turn into an algorithm-research project), I suggest the following:

  1. Make a small training set. Note that by "training set" here I mean images of the book cover in its "testing" environment: different angles, different lighting, different background clutter, etc. This can really be something like 50 images, which shouldn't take too long to collect manually.
  2. Depending on how much of an object-detection problem this is (i.e. is the picture just the book cover, or is it a picture of a desk that has the book on it, but maybe also a stapler or something), you should include a bounding box for each image.
  3. Then use a classic CV algorithm implemented in OpenCV, like SIFT, SURF, or a Hough transform. Rather than go through those details, I refer you to this related post about detecting Coke cans. There's a good discussion there that will probably lead you to the right implementation. From your problem description I suspect the problems are quite similar (e.g. your comment about objects taking many forms; that's also not an issue with Coke cans).

Second: if the above is not adequate, you're in for a more advanced research project. I'd still recommend something like a Hough transform or SIFT, because the key insight is that you should be able to find a filter (or filter-like object) that is very good at recognising this specific book cover. That means typical deep-learning approaches are less useful out of the box. If you really want to go down that path, start by reading about data augmentation, then one-shot or few-shot learning, and then transfer learning. That's a long road, so I'd strongly favour the first approach I suggest.
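To give a feel for why the few-shot route scales to a 10k-book catalogue: it usually reduces to embedding each cover once with a pretrained network, after which recognition is a single nearest-neighbour lookup instead of 10k pairwise image comparisons. A sketch, with random unit vectors standing in for real CNN embeddings:

```python
# Few-shot retrieval as embedding lookup. Random vectors stand in for
# embeddings produced by a pretrained network; the lookup logic is the point.
import numpy as np

rng = np.random.default_rng(2)
n_books, dim = 10_000, 128
library = rng.normal(size=(n_books, dim))
library /= np.linalg.norm(library, axis=1, keepdims=True)  # unit-normalise

def identify(query, library):
    """Return the index of the cover whose embedding is most similar."""
    query = query / np.linalg.norm(query)
    return int(np.argmax(library @ query))    # cosine similarity, one matmul

# A query embedding near book 1234's (e.g. the same cover under new lighting)
# should retrieve book 1234.
query = library[1234] + 0.05 * rng.normal(size=dim)
```

The whole catalogue is one matrix, so a lookup is a single matrix-vector product; approximate-nearest-neighbour indexes make this fast even at much larger scales.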


The answer below might help you approach the problem using convolutional neural networks (CNNs). Please go through these videos to learn more about the subject.

Objective: to identify flat imagery (example: a book cover)

  1. Training-set creation: your training set must have positive and negative images, where positive images are the ones containing the book cover and negative images are the ones that don't.

    • Positive samples should include the book cover blurred, at different positions relative to the camera, tilted at different angles, against various backgrounds, under different lighting, etc. (whatever images you want recognised as positive).

    • Negative samples should be images without the book cover, e.g. just the background.

You can also create these datasets manually.

  2. Labelling:

    • It isn't clear whether the exact coordinates of the book cover are to be found. If they are, the output should be x_start, y_start, width, and height of the bounding box covering the book cover in each image. For images with no book cover (negative samples), the values are (0, 0, 0, 0).

    • Otherwise, simply label the images as 1s and 0s for positive and negative samples respectively.

  3. Model fine-tuning:
    • There are several pre-trained models available. You can simply fine-tune one on your images.
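The fine-tuning step above might look like this with tf.keras: attach a small classification head to a pretrained backbone and train only the head. This is a sketch, not a tuned setup; `weights=None` is used only to keep it self-contained (in practice you would pass `weights="imagenet"`), and the random batch stands in for your real positive/negative samples:

```python
# Fine-tune a pretrained backbone for binary cover / not-cover classification.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False,
    weights=None)                       # use weights="imagenet" in practice
base.trainable = False                  # freeze the backbone; train the head only

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # cover / not-cover
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Tiny synthetic batch standing in for real positive/negative samples.
x = np.random.rand(4, 96, 96, 3).astype("float32")
y = np.array([1, 0, 1, 0], dtype="float32")
model.fit(x, y, epochs=1, verbose=0)
pred = model.predict(x, verbose=0)
```

For bounding-box output (the first labelling scheme above), the sigmoid head would be replaced with a 4-unit regression head predicting (x, y, w, h).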

Take a look at these pages for more information:

  1. Object localization and detection
  2. Binary Image Classification