Is a single double-word load faster than two loads?


I'm working with a SPARC V8 processor that is connected to memory over a 32-bit data bus. From the SPARC Architecture Manual V8 I have learned that there are instructions to load/store a single 32-bit register (a word), but also instructions that load/store a double word into/from two registers atomically. Are the double-word instructions somehow faster than the single-word instructions on my machine? Besides the data-bus width, what does this depend on?
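The instruction-count side of the question can be sketched in C (a minimal illustration; `copy_words` and `copy_dwords` are invented names, and whether a SPARC V8 compiler actually emits ldd/std for the 64-bit loop depends on the target and flags):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nbytes using 32-bit accesses: one ld/st pair per word. */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
    for (size_t i = 0; i < nbytes / 4; i++)
        dst[i] = src[i];
}

/* Copy nbytes using 64-bit accesses. A SPARC V8 compiler may lower
 * each access to a single ldd/std pair, halving the number of memory
 * instructions -- but only if both pointers are 8-byte aligned,
 * because ldd/std trap on misaligned addresses. */
static void copy_dwords(uint64_t *dst, const uint64_t *src, size_t nbytes)
{
    for (size_t i = 0; i < nbytes / 8; i++)
        dst[i] = src[i];
}
```

Even when each ldd still needs two bus cycles on a 32-bit bus, the doubled access width halves the loop overhead (instruction fetches, address updates, branches) per byte copied.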

Furthermore, I discovered an optimized memcpy implementation in the Linux kernel sources, which copies an aligned chunk as follows:

#define MOVE_BIGALIGNCHUNK(...) \
ldd     [%src + (offset) + 0x00], %t0; \
ldd     [%src + (offset) + 0x08], %t2; \
ldd     [%src + (offset) + 0x10], %t4; \
ldd     [%src + (offset) + 0x18], %t6; \
std     %t0, [%dst + (offset) + 0x00]; \
std     %t2, [%dst + (offset) + 0x08]; \
std     %t4, [%dst + (offset) + 0x10]; \
std     %t6, [%dst + (offset) + 0x18]; 

Is there any benefit to grouping the loads and stores together like this? Just curious. Thanks!
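The macro's pattern can be mirrored in C to make the structure explicit (a sketch, not the kernel code; `copy_chunk32` is an invented name). The idea is that issuing all four double-word loads before any store keeps consecutive fetches back to back on the bus and, on a simple in-order pipeline, avoids stalling each store on the load immediately before it:

```c
#include <stdint.h>

/* Copy one 32-byte block the way MOVE_BIGALIGNCHUNK does: issue all
 * four double-word loads into temporaries first, then all four
 * stores. Interleaving them (load, store, load, store, ...) would
 * make every store wait on the load directly in front of it. */
static void copy_chunk32(uint64_t *dst, const uint64_t *src)
{
    uint64_t t0 = src[0];   /* ldd [%src + 0x00] */
    uint64_t t1 = src[1];   /* ldd [%src + 0x08] */
    uint64_t t2 = src[2];   /* ldd [%src + 0x10] */
    uint64_t t3 = src[3];   /* ldd [%src + 0x18] */
    dst[0] = t0;            /* std [%dst + 0x00] */
    dst[1] = t1;            /* std [%dst + 0x08] */
    dst[2] = t2;            /* std [%dst + 0x10] */
    dst[3] = t3;            /* std [%dst + 0x18] */
}
```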

Update: I'm using Gaisler's LEON3 implementation and I'm running on bare metal. ldd and std are implemented and do not trap. I measured that copying a big chunk of data with ldd and std is faster by a factor of ~1.5. There are indeed data and instruction caches present, and it makes sense to me that they can speed up double-word operations. I also agree that the overhead must somehow be reduced when fetching two consecutive words from memory. Thanks all for your comments.
