RoI cropping in Spatial Transformer Network


TL;DR: How does the RoI cropping method from the Spatial Transformer Network work?

Reading the PyTorch Spatial Transformer Network tutorial, I saw the network uses a special RoI pooling method I hadn't seen before, called RoI cropping.
The docs for F.affine_grid and F.grid_sample did not explain much of what is happening there, so I tried reading the network's paper in the hope of understanding, as well as a blog post on Faster RCNN that elaborates on the method with pictures, but it still didn't help.
Every source gives different details, and I can't form a clear picture of what exactly is happening, even though I understand the standard RoI pooling and RoI align methods well.
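For reference, here is my current understanding of what the two ops do in isolation, as a minimal sanity check (my own sketch, not from the tutorial): with an identity affine matrix, affine_grid produces a sampling grid at the pixel centers, so grid_sample should reproduce the input exactly.

```python
import torch
import torch.nn.functional as F

# Identity 2x3 affine matrix (batch of 1): no scaling, no translation.
theta = torch.tensor([[[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]]])

x = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)

# affine_grid builds a (1, 4, 4, 2) grid of normalized [-1, 1] coordinates;
# with align_corners=True the grid lands exactly on pixel centers.
grid = F.affine_grid(theta, list(x.shape), align_corners=True)

# Bilinear sampling at exact pixel centers returns the input unchanged.
y = F.grid_sample(x, grid, align_corners=True)
print(torch.allclose(x, y))  # True
```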

Right now, this is the big picture in my head:
1. As usual, map the suggested RoI coordinates to the feature map space.
2. Normalize the coordinates to the range of [-1, 1] (I guess that's for the following affine transformation).
3. Calculate (using the method in the picture below) the transformation values.
4. Apply the transformation to the RoI pixels (I assume this produces the sampling grid over the RoI).
5. Finally, I assume we interpolate (i.e. bilinear interpolation) at the resulting coordinates to get the output.

Could someone briefly explain the whole process of the RoI cropping method? I feel I might be missing something.

[image: calculation of the affine transformation values]
