I have a question about the CUDA Fermi architecture: I've read somewhere that in Fermi, global memory access is as fast as shared memory access, simply because the architecture now uses uniform addressing.
So is it true that I can access data in global memory with no (significant) latency, unlike on pre-Fermi GPUs?
It's very important for me to know this, because I'm writing code for an Nvidia Tesla GPU without having access to it (it's in the university's lab, and I can't get to it during the summer...).
This is not true. Global memory access on Fermi still has much higher latency than shared memory access. However, thanks to the caches, an access may hit in cache, which reduces the latency. This is particularly useful for less-than-ideal memory access patterns (e.g. slightly misaligned accesses).
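To make the latency difference concrete, here is a minimal sketch (my own illustration, not from the question) of the common pattern of staging global data into shared memory: each value is fetched from slow global memory once, then reused through fast shared memory. The kernel name and the assumption that `n` is a multiple of `blockDim.x` are mine.

```cuda
// Sketch: reverse each block-sized tile of an array.
// Assumes n is a multiple of blockDim.x (here 256) for simplicity.
__global__ void reverse_tiles(const float *in, float *out, int n)
{
    __shared__ float tile[256];                  // fast on-chip shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];               // one (slow) global read per element
    __syncthreads();
    if (i < n)                                   // reuse happens in shared memory,
        out[i] = tile[blockDim.x - 1 - threadIdx.x]; // avoiding extra global traffic
}
```

On Fermi the L1/L2 caches soften the cost of repeated or slightly irregular global accesses, but explicitly staging reused data into shared memory remains the reliable way to avoid paying global-memory latency more than once.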
Uniform memory addressing is a completely different thing, unrelated to the above. It allows the GPU to determine at run time whether a given pointer refers to global or shared memory (or even mapped pinned host memory, or another GPU's memory). On pre-Fermi cards, the memory space had to be deducible at compile time.
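What uniform addressing actually buys you is that a single `__device__` function can take a generic pointer and be called with either a global or a shared pointer, with the hardware resolving the memory space at run time. A small sketch (my own example; the function names are hypothetical):

```cuda
// With uniform (generic) addressing on Fermi (sm_20+), sum4 works on a
// pointer into any memory space; pre-Fermi, the compiler had to know the
// space statically, so such a function could not mix the two.
__device__ float sum4(const float *p)
{
    return p[0] + p[1] + p[2] + p[3];
}

__global__ void demo(const float *gdata, float *out)
{
    __shared__ float sdata[4];
    if (threadIdx.x < 4)
        sdata[threadIdx.x] = gdata[threadIdx.x] * 2.0f;
    __syncthreads();
    if (threadIdx.x == 0) {
        out[0] = sum4(gdata);   // generic pointer resolving to global memory
        out[1] = sum4(sdata);   // same function, shared-memory pointer
    }
}
```

This is a convenience for the compiler and for writing generic device code; it says nothing about how fast the underlying memory is.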