What prevents the compiler do a peephole optimization on expression templates?

330 Views Asked by At

I have the code listed below:

for(auto i =0;i<k;++i)                                                         
    printf("%d\n",(va1+va2-(va1-va2))[i]);    

where va1 and va2 are two valarray<int> objects, k is the size of va1 and va2. What I am expecting is the compiler optimizes the printf line like:

printf("%d\n", 2*va2[i]);

But instead, both Intel compiler (13.1) and CLang (3.4) did not do such optimization. For example, the Intel compiler outputed the assembly code:

..B1.18:                        # Preds ..B1.19 ..B1.17                            
        movl      (%r14,%r15,4), %r9d                           #18.42             
        movl      $.L_2__STRING.1, %edi                         #18.9              
        movl      (%r12,%r15,4), %r8d                           #18.42             
        xorl      %eax, %eax                                    #18.9              
        lea       (%r9,%r8), %esi                               #18.9              
        subl      %r8d, %r9d                                    #18.9              
        subl      %r9d, %esi                                    #18.9              
..___tag_value_main.18:                                         #18.9              
        call      printf                                        #18.9              
..___tag_value_main.19:                                         #                  
                                # LOE rbx r12 r13 r14 r15                          
..B1.19:                        # Preds ..B1.18                                    
        incq      %r15                                          #17.25             
        cmpq      %r13, %r15                                    #17.21             
        jl        ..B1.18       # Prob 82%                      #17.21    

where r13 stores the value of k, r14 and r12 are the beginning of memory of va1 and va2, respectively. r15 is the iterator variable i. From the code, what it does is:

load va1[i]
load va2[i]
add va1[i], va2[i] ==> %esi
sub va1[i], va2[i] ==> %r9d
sub %esi, %r9d ==> esi
print %esi

Why it is not optimized (even with -O3) to

load va2[i]
add va2[i], va2[i] => %esi
print %esi

with a peephole optimization? Gcc 4.8.2 does the optimization in this case, but cannot handle -(va1[i]+va2[i])+(va1[i]-va2[i]).

It looks like one possible reason is expression templates are used in the code shown previously. Now the question is, why the compiler stopped optimization just one step before perfection? How expression template prevented the step forward?

NOTE Well, the answer can always be "because the compiler is not designed to do that optimization". But as far as I learned from the dragon book, the compiler should do the optimization iteratively until it cannot do anything better.

0

There are 0 best solutions below