My understanding is that hardware prefetching will never cross page boundaries. Does a software prefetch have the same restriction? In other words, can I use a software prefetch to avoid a future TLB miss? From searching around, it appears to be possible, but I couldn't find anything definitive in the documentation, so a reference would be good.
I'm specifically interested in Nehalem, Sandy Bridge and Westmere.
According to Intel's Optimization Reference Manual, it depends on the processor. From section 7.4.3:
Software prefetching may or may not avoid TLB misses, depending on the processor. It will not fetch the data if it would cause a page fault.
If you want to ensure you avoid TLB misses, you could do a dummy read to load the data instead of using a prefetch instruction. Unlike a prefetch, a real load must complete its address translation, so it can trigger a page fault that swaps the page in, which could be either good or bad depending on your use case.
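To make the difference concrete, here is a rough sketch of both approaches, touching one byte per page of a buffer. It assumes GCC or Clang (for `__builtin_prefetch`, which typically compiles to `PREFETCHT0` on x86-64 at locality 3); the function names `prefetch_pages` and `touch_pages` and the `page_size` parameter are made up for illustration and are not from the manual.

```c
#include <assert.h>
#include <stddef.h>

/* Hint only: never faults. Whether it also fills the TLB depends on
 * the processor, so treat it as best effort. */
static void prefetch_pages(const char *buf, size_t len, size_t page_size)
{
    assert(page_size != 0);
    for (size_t off = 0; off < len; off += page_size)
        __builtin_prefetch(buf + off, 0 /* read */, 3 /* keep in all cache levels */);
}

/* Dummy read: a real load, so the translation has to be resolved.
 * It can page-fault if the page is not resident. The volatile access
 * keeps the compiler from dropping the load; the sum is returned only
 * so the result is observable. */
static unsigned touch_pages(const char *buf, size_t len, size_t page_size)
{
    assert(page_size != 0);
    unsigned sum = 0;
    for (size_t off = 0; off < len; off += page_size)
        sum += *(const volatile unsigned char *)(buf + off);
    return sum;
}
```

You would call one of these some time before the real access, for example `touch_pages(buf, len, 4096)` if you assume 4 KiB pages. Note that the dummy read stalls the thread if it misses, whereas the prefetch does not, so the dummy read trades latency now for certainty later.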