I am familiar with two-view stereo but fuzzy on how SFM (Structure from Motion) and MVS (Multi-View Stereo) actually work.
Let's say I have two stereo pairs of cameras, (A, B) and (C, D). I can calculate the depth map for camera A using two-view stereo with cameras A and B. Similarly, I can calculate the depth map for camera C using two-view stereo with cameras C and D. Based on the calibration, I can turn depth map A into a point cloud and color it with the color values from camera A, and likewise turn depth map C into a point cloud colored from camera C.

In a perfect world, when I overlay point cloud A and point cloud C, the result should look seamless, without any obvious color problems. Unfortunately, in the real world there will be some color difference between what camera A and camera C capture for the same point in space. I tried various ways of averaging colors for points that are visible in both camera A and camera C, but no matter what, there is an obvious color "seam" between the points that are only visible in camera A and the points that are visible in both camera A and camera C.
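To make the setup concrete, here is a minimal sketch (with hypothetical function and parameter names) of the back-projection step described above: turning a depth map plus calibration into a colored world-space point cloud. It assumes metric depth, a standard pinhole intrinsic matrix `K`, and a camera-to-world pose:

```python
import numpy as np

def depth_to_colored_points(depth, image, K, cam_to_world):
    """Back-project a depth map into a colored world-space point cloud.

    depth:        (H, W) metric depth per pixel (0 = no depth)
    image:        (H, W, 3) color image from the same camera
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera pose (camera -> world transform)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Unproject pixel (u, v) at depth z: p_cam = z * K^-1 @ [u, v, 1]^T
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    p_cam = (np.linalg.inv(K) @ pix.T).T * depth.reshape(-1, 1)
    # Transform camera-frame points into world coordinates
    p_world = (cam_to_world[:3, :3] @ p_cam.T).T + cam_to_world[:3, 3]
    colors = image.reshape(-1, 3)
    valid = depth.reshape(-1) > 0  # drop pixels with no depth estimate
    return p_world[valid], colors[valid]
```

Running this once per stereo pair and concatenating the two clouds reproduces the overlay (and the color seam) described above, since each point keeps the color of whichever camera produced it.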
However, this kind of color problem doesn't seem to exist in SFM/MVS results, as shown by COLMAP, AliceVision, and RealityCapture. I've read multiple tutorials on how SFM/MVS works, but none of them specifically explained how it overcomes the color problem; most of them focused on how to generate depth and, in the case of SFM, how to estimate the intrinsics and poses. Can someone explain what method conventional SFM/MVS pipelines use to resolve these color differences? I would also appreciate a link to a tutorial or paper that explains this.
This problem is best explained by considering two different scenarios.
Going back to the problem you want to solve, and assuming all of the above worked fine for you: to obtain a nice-looking color for your point cloud, there are two popular solutions: 1) average the color from all the views that see the point, or 2) select only the "best" view per point. The obvious problem with 1) is that the resulting color will be blurred; for 2), the way you select the view per point is very important to minimize the transitions between different views (there are many ways to do this, but a global approach would generally be best).
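The two options above can be sketched as follows. This is a simplified illustration, not a production recipe: the "best view" score here is only one common heuristic (how head-on the view is to the surface normal), whereas real pipelines also account for resolution, blur, occlusion, and spatial smoothness of the view assignment:

```python
import numpy as np

def average_colors(point_colors):
    """Option 1: average the color a point receives from every view
    that sees it. point_colors: (n_views, 3). Simple, but blurs."""
    return point_colors.mean(axis=0)

def best_view_color(point, normal, cam_centers, point_colors):
    """Option 2: keep only the color from the 'best' view.

    Heuristic (an assumption for illustration): prefer the camera whose
    viewing ray is most aligned with the surface normal at the point.
    point:        (3,) world position
    normal:       (3,) unit surface normal at the point
    cam_centers:  (n_views, 3) camera centers in world coordinates
    point_colors: (n_views, 3) the color each view observes for the point
    """
    rays = cam_centers - point                        # point -> camera rays
    rays /= np.linalg.norm(rays, axis=1, keepdims=True)
    scores = rays @ normal                            # cosine with the normal
    return point_colors[np.argmax(scores)]            # most head-on view wins
```

Applying option 2 with a purely per-point score is exactly where the visible seams come from: neighboring points can flip between views. That is why a global approach (choosing views so that neighboring points agree, e.g. by smoothing the view assignment) tends to give the cleanest result.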