I have created a point cloud of an irregular (non-planar) complex object using SfM. Each one of those 3D points was viewed in more than one image, so it has multiple (SIFT) features associated with it.
Now I want to solve for the pose of this object in a new, different set of images using a PnP algorithm, by matching the features detected in the new images against the descriptors associated with the 3D points in the point cloud.
So my question is: which descriptor do I associate with the 3D point to get the best results?
So far I've come up with a number of possible solutions...
- Average all of the descriptors associated with the 3D point (taken from the SfM pipeline) and use that "mean descriptor" for the matching in PnP (see the sketch after this list). This approach seems a bit far-fetched to me; I don't know enough about feature descriptors (specifically SIFT) to comment on its merits and downfalls.
- "Pin" all of the descriptors calculated during the SfM pipeline to their associated 3D point. During PnP, you would essentially have duplicate points to match against (one duplicate per descriptor). This is obviously computationally expensive.
- Find the "central" viewpoint that the feature appears in (from the SfM pipeline) and use the descriptor from that view for PnP matching. So if the feature appears in images taken at -30, 10, and 40 degrees (from the surface normal), use the descriptor from the 10-degree image. This, to me, seems like the most promising solution.
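To make option 1 concrete, this is roughly what I have in mind, assuming an OpenCV-style setup; descs_per_point, points3d, K and dist are placeholders for my own data structures, not part of any library:

```python
import numpy as np
import cv2

def mean_descriptors(descs_per_point):
    # descs_per_point: one (n_i x 128) float32 array of SIFT descriptors per 3D point,
    # aligned with points3d (points3d[i] <-> descs_per_point[i]).
    # Note: the mean is no longer a "real" SIFT descriptor, which is exactly
    # what makes me unsure about this option.
    return np.vstack([d.mean(axis=0) for d in descs_per_point]).astype(np.float32)

def pose_from_query(query_gray, points3d, mean_descs, K, dist):
    sift = cv2.SIFT_create()
    kps, descs = sift.detectAndCompute(query_gray, None)

    # Match query descriptors (2D) directly against the per-point mean descriptors (3D).
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(descs, mean_descs, k=2)
            if m.distance < 0.8 * n.distance]            # Lowe ratio test

    if len(good) < 4:
        return None
    obj = np.float32([points3d[m.trainIdx] for m in good])   # matched 3D points
    img = np.float32([kps[m.queryIdx].pt for m in good])     # matched 2D detections
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj, img, K, dist)
    return (rvec, tvec, inliers) if ok else None
```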
Is there a standard way of doing this? I haven't been able to find any research or advice online regarding this question, so I'm really just curious if there is a best solution, or if it is dependent on the object/situation.
The descriptors that are used for matching in most SLAM or SfM systems are rotation and scale invariant (and, to some extent, robust to intensity changes). That is why we are able to match them from different viewpoints in the first place. So, in general, it doesn't make much sense to try to use them all, average them, or use the ones from a particular image. If the matching in your SfM was done correctly, the descriptors at the reprojections of a 3D point from your point cloud in any of its observations should be very close, so you can use any of them.[1]

Also, it seems to me that you are trying to directly match the 2D points to the 3D points. From a computational point of view, I think this is not a very good idea, because by matching 2D points with 3D ones, you lose the spatial information of the images and have to search for matches in a brute-force manner. This in turn can introduce noise. But if you do your matching from image to image and then propagate the results to the 3D points, you will be able to enforce priors (if you roughly know where you are, e.g. from an IMU, or if you know that your images are close), you can restrict the neighborhood where you look for matches in your images, and so on. Additionally, once you have computed your pose and refined it, you will need to add more points, no? How will you do that if you haven't done any 2D/2D matching, but only 2D/3D matching?
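To make the propagation idea concrete, here is a minimal sketch assuming an OpenCV pipeline and that your SfM output gives you, for each registered image, its keypoints plus a map from keypoint index to 3D point id; kp_to_point3d, points3d, K and dist are placeholder names:

```python
import numpy as np
import cv2

def pose_via_sfm_image(query_gray, ref_gray, kp_to_point3d, points3d, K, dist):
    # kp_to_point3d: dict {keypoint index in the reference SfM image -> 3D point id},
    # containing only the keypoints that were actually triangulated.
    sift = cv2.SIFT_create()
    q_kps, q_desc = sift.detectAndCompute(query_gray, None)
    _, r_desc = sift.detectAndCompute(ref_gray, None)

    matcher = cv2.BFMatcher(cv2.NORM_L2)
    good = [m for m, n in matcher.knnMatch(q_desc, r_desc, k=2)
            if m.distance < 0.8 * n.distance]            # 2D/2D matching + ratio test

    obj, img = [], []
    for m in good:
        pid = kp_to_point3d.get(m.trainIdx)              # propagate to the 3D point...
        if pid is not None:                              # ...if this keypoint has one
            obj.append(points3d[pid])
            img.append(q_kps[m.queryIdx].pt)

    if len(obj) < 4:
        return None
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(np.float32(obj), np.float32(img), K, dist)
    return (rvec, tvec, inliers) if ok else None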
Now, the way to implement that usually depends on your application (how much covisibility or baseline you have between the poses from your SfM, etc.). As an example, let's call your candidate image I_0, and the images from your SfM I_1, ..., I_n. First, match between I_0 and I_1. Now, assume q_0 is a 2D point from I_0 that has successfully been matched to q_1 from I_1, which corresponds to some 3D point Q. To ensure consistency, consider the reprojection of Q in I_2, and call it q_2. Match I_0 and I_2. Does the point to which q_0 is matched in I_2 fall close to q_2? If yes, keep the 2D/3D match between q_0 and Q, and so on.

I don't have enough information about your data and your application, but I think that, depending on your constraints (real-time or not, etc.), you could come up with some variation of the above. The key idea anyway is, as I said previously, to try to match from frame to frame and then propagate to the 3D case.
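In case it is useful, here is a small sketch of that check for a single candidate match, assuming OpenCV, that the SfM pose of I_2 (rvec2, tvec2) and its keypoints are at hand, and that you already have a ratio-tested I_0/I_2 match table; matches_0_to_2 and px_tol are made-up names for illustration:

```python
import numpy as np
import cv2

def keep_2d3d_match(q0_idx, Q, matches_0_to_2, kps2, rvec2, tvec2, K, dist,
                    px_tol=4.0):
    """Keep the 2D/3D match (q_0, Q) only if it is consistent with image I_2.

    matches_0_to_2: dict {keypoint index in I_0 -> keypoint index in I_2},
                    from a ratio-tested SIFT matching between I_0 and I_2.
    kps2:           keypoints detected in I_2.
    rvec2, tvec2:   pose of I_2 from the SfM reconstruction.
    """
    # Reproject the 3D point Q into I_2; this is q_2 in the text.
    q2_proj, _ = cv2.projectPoints(np.float32([Q]), rvec2, tvec2, K, dist)
    q2_proj = q2_proj.reshape(2)

    # Where does q_0 match in I_2, if it matches at all?
    j = matches_0_to_2.get(q0_idx)
    if j is None:
        return False

    # Consistent if the matched keypoint lands close to the reprojection of Q.
    return float(np.linalg.norm(np.float32(kps2[j].pt) - q2_proj)) < px_tol
```

The same check can of course be repeated against I_3, I_4, ... wherever the point is visible.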
Edit: Thank you for your clarifications in the comments. Here are a few thoughts (feel free to correct me):
Let's consider a SIFT descriptor s_0 from I_0, and let's write F(s_1, ..., s_n) for your aggregated descriptor (which can be an average or a concatenation of the SIFT descriptors s_i from their corresponding images I_i, etc.). Then, when matching s_0 with F, you will only want to use the subset of the s_i that belong to images with viewpoints close to I_0 (because of the 30-degree problem that you mention, although I think it should be 50 degrees). That means you have to attribute a weight to each s_i that depends on the pose of your query I_0. You obviously can't do that when constructing F, so you have to do it when matching. However, you don't have a strong prior on the pose (otherwise, I assume you wouldn't be needing PnP), so you can't really determine these weights. Therefore I think there are two conclusions/options here:

- SIFT descriptors are not adapted to the task. You can try coming up with a perspective-invariant descriptor; there is some literature on the subject.
- Try to keep some visual information in the form of "key-frames", as in many SLAM systems. It wouldn't make sense to keep all of your images anyway; just keep a few that are well distributed (pose-wise) in each area, and use those to propagate 2D matches to the 3D case (see the selection sketch below).
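For the "well distributed (pose-wise)" part, a greedy farthest-point selection over the SfM camera centres is one simple option; this is only a sketch, it ignores viewing directions, and cam_centers / n_keyframes are placeholder names:

```python
import numpy as np

def select_keyframes(cam_centers, n_keyframes):
    """Greedily pick key-frames whose camera centres are spread far apart."""
    cam_centers = np.asarray(cam_centers, dtype=float)    # shape (n, 3)
    n_keyframes = min(n_keyframes, len(cam_centers))
    chosen = [0]                                          # start from an arbitrary frame
    dist = np.linalg.norm(cam_centers - cam_centers[0], axis=1)
    while len(chosen) < n_keyframes:
        nxt = int(np.argmax(dist))                        # frame farthest from those already chosen
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(cam_centers - cam_centers[nxt], axis=1))
    return chosen
```

In practice you would probably also want the selection to take viewing direction into account, so that each area of the object is covered by at least one key-frame.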
If you only match between the 2D points of your query and 3D descriptors without any form of consistency check (such as the one I proposed earlier), you will introduce a lot of noise...

tl;dr I would keep some images.
[1] Since you say that you obtain your 3D reconstruction from an SfM pipeline, some of the observations of each point are probably considered inliers and some outliers (indicated by a boolean flag). If they are outliers, just ignore them; if they are inliers, then they are the result of matching and triangulation, and the point's position has been refined multiple times, so you can trust any of their descriptors.