Here is the block diagram of the H.264 encoder pipeline (from Fundamentals of Multimedia by Ze-Nian Li, Mark S. Drew and Jiangchuan Liu).

The feedback loop is used for motion estimation, but why is it taken after the transform and quantization steps? Why introduce the overhead of the inverse functions?
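
To make the question concrete, the loop in the diagram can be sketched roughly as below. This is only a toy 1-D stand-in, not real H.264: plain scalar quantization replaces the transform and quantization stages, the prediction is simply the previous reconstructed frame (no motion search), and the names `QSTEP`, `quantize`, `dequantize`, `encode` and `decode` are made up for illustration.

```python
# Toy sketch of the encoder feedback loop; every name here is hypothetical.
QSTEP = 8  # assumed quantizer step size


def quantize(residual):
    # Stands in for "transform + quantization" (the lossy step).
    return [round(r / QSTEP) for r in residual]


def dequantize(levels):
    # Stands in for "inverse quantization + inverse transform".
    return [level * QSTEP for level in levels]


def encode(frames):
    reference = [0] * len(frames[0])  # initial reference frame
    stream = []
    for frame in frames:
        # Predict from the reference and code only the difference.
        residual = [x - p for x, p in zip(frame, reference)]
        levels = quantize(residual)
        stream.append(levels)
        # Feedback loop: the next reference is rebuilt from the quantized
        # residual, not taken from the original input frame.
        reference = [p + d for p, d in zip(reference, dequantize(levels))]
    return stream


def decode(stream, width):
    reference = [0] * width
    frames = []
    for levels in stream:
        # The decoder only ever sees the quantized levels.
        reference = [p + d for p, d in zip(reference, dequantize(levels))]
        frames.append(reference)
    return frames


frames = [[10, 20, 30], [12, 25, 33], [15, 28, 40]]
stream = encode(frames)
decoded = decode(stream, width=3)
```

In this sketch the `reference` inside `encode` goes through exactly the same update as the one inside `decode`, which is the part of the diagram that costs the extra inverse steps.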

I know there's likely a very good reason for it, and this might sound like a stupid question, but I'd appreciate any answer that nudges me in the right direction.
Thanks!
EDIT: MOVED HERE