I have heard that counting down is faster than counting up in loops. However, I have also heard that accessing memory forwards (in ascending address order) is faster than accessing it in reverse. So if I have a loop that repeats a line of code N times, counting down might be slightly faster. But if I index an array with the current value of N while counting down, I would be accessing the memory backwards. Would that be slower, possibly negating any performance benefit of counting down in the first place?
Would this:
short array[1024];
for (int i = 0; i < 1024; ++i) {
    do_something_with(array[i]);
}
Be faster than this:
short array[1024];
for (int i = 1024; i--;) {
    do_something_with(array[i]);
}
I am trying to write the fastest possible code on modern machines.
This is going to take some explaining, so I will motivate it with a conclusion first: Whether forward or backward memory accesses are likely to differ in time depends primarily on whether they match processor lookahead features and whether those features are symmetric in direction.
In regard to access direction, the primary driver of memory access time is processor preloading features. Memory hardware generally has no difference in time to access an address that is less than or greater than some other recently accessed address. But processors (generally, not all) do look at address access patterns and attempt to preload memory that is likely to be used soon.
For example, if a process accesses cache lines 13, 14, and 15, the processor may request cache line 16 from memory before the process actually executes any load instruction for it. This gets complicated in several ways.
First, if the process is racing through memory, working on data as fast as it can be read, there may be no advantage to processor preloading, because the process is requesting the next memory as soon as the processor would try it anyway. This is especially so on superscalar processors that have multiple instructions “in flight” at one time, since that allows “future” load instructions to request memory while earlier data-processing instructions are still working on the data from prior loads. So processor preloading works best on algorithms that do at least a bit more data processing relative to their memory use.
Second, most algorithms do not simply use data in a stream, either forward or backward. The process may be using data from cache lines 13, 14, 15, and so on but also referencing stack locations, and it may be using data from cache lines 13, 14, and 15 but also using data from cache lines 79, 80, 81, and so on. So processor designers may try to recognize consecutive accesses even when they are interrupted by other accesses. And then the processor behavior gets complicated to predict. How many “other” accesses will your process have in between its consecutive accesses? Will that be too many for the processor to deal with?
Third, a regular stream of accesses might not look regular from some points of view. Suppose you have an array with four elements per cache line, and you access a regular sequence of elements: 0, 5, 10, 15, 20, 25, 30, 35, 40,… These fall in cache lines 0 (elements 0-3, containing 0), 1 (4-7, containing 5), 2 (8-11), 3 (12-15), 5 (20-23), 6 (24-27), 7 (28-31), 8 (32-35), and 10 (40-43). If the processor is monitoring which cache lines are accessed, not which addresses are accessed, it sees an irregular sequence: 0, 1, 2, 3, 5, 6, 7, 8, 10, which is missing 4 and 9. When the processor saw 0, 1, 2, and 3, it may have preloaded cache line 4, wasting time and cache space because 4 is never used. Or, since line 4 is never touched, the processor might not recognize the pattern and therefore not preload line 5.
What this means is that your memory access time is likely to be affected by your data access pattern (what memory it actually accesses, not particularly how the loop is written) and the processor’s lookahead features. And, in answer to your question, whether forward or backward memory accesses are likely to differ in time depends primarily on whether they match processor lookahead features and whether those features are symmetric in direction.
If the processor design recognizes the same (reflected) backward sequences as it does forward sequences, then accessing memory forward or backward should not matter. If the processor design recognizes some forward sequences but not corresponding backward sequences, then code that matches such a forward sequence may perform better than code that matches such a backward sequence.
That said, I am unable to comment on how common it is for processors to have asymmetric lookahead features. There are too many processor models in existence. I have some recollection of working with an Intel processor in which reverse lookahead was a new feature, but I expect that is common now. But whether all the recognized access patterns are symmetric or what portion of processors on the market have symmetric lookahead features, I cannot say.