Sorry, I thought I was running selection dag isel, but I screwed up when trying out the really big array.  You're right, it does clean it up except for the multiplication.

So LoopStrengthReduce is not ready for prime time and doesn't actually get used?  I might consider whipping it into shape.  Does it still have to handle getelementptr in its full generality?

Chris Lattner wrote:

> On Mon, 21 Feb 2005, Jeff Cohen wrote:
>
>> I noticed that fourinarow is one of the programs in which LLVM is
>> much slower than GCC, so I decided to take a look and see why that is
>> so.  The program has many loops that look like this:
>>
>> #define ROWS 6
>> #define COLS 7
>>
>> void init_board(char b[COLS][ROWS+1])
>> {
>>     int i, j;
>>
>>     for (i = 0; i < COLS; i++)
>>         for (j = 0; j < ROWS; j++)
>>             b[i][j] = '.';
>>     for (i = 0; i < COLS; i++)
>>         b[i][ROWS] = 0;
>> }
>>
>> This generates the following X86 code:
>>
>>     imull $7, %ecx, %edx
>>     movl %eax, %esi
>>     addl %edx, %esi
>>     movb $46, (%esi)
>>     imull $7, %ecx, %edx
>>     movl %eax, %esi
>>     addl %edx, %esi
>>     leal 1(%esi), %edx
>
> ... (many many copies of this, see the end of the email for full
> output) ...
>
>> The code generated by GCC is much faster.  LLVM does unroll the loop,
>> which GCC does not do, but the addressing code is a constant
>> repetition of these three instructions:
>>
>>     imull $7, %ecx, %edx
>>     movl %eax, %esi
>>     addl %edx, %esi
>
> Yup, this is an issue of simple "peephole expansion" of the
> getelementptr instruction.
>
>> All but the first occurrence do nothing but put the same address into
>> %esi that was already there to begin with, and do it with an
>> expensive multiply at that.  GCC correctly creates a suitable
>> induction variable and strength-reduces the multiplication down to
>> addition.
>
> This is called loop strength reduction.
>
>> It turns out that using the new selection dag code generator, the
>> outer loop is unrolled also and nothing is left but a bunch of movb
>> instructions.  Much, much faster.
>
> For the record, instead of the ugly code above, the new isel generates:
>
> init_board:
>     mov %EAX, DWORD PTR [%ESP + 4]
>     mov %ECX, 0
> .LBBinit_board_1:    # loopexit.1
>     imul %EDX, %ECX, 7
>     mov BYTE PTR [%EAX + %EDX], 46
>     mov BYTE PTR [%EAX + %EDX + 1], 46
>     mov BYTE PTR [%EAX + %EDX + 2], 46
>     mov BYTE PTR [%EAX + %EDX + 3], 46
>     mov BYTE PTR [%EAX + %EDX + 4], 46
>     mov BYTE PTR [%EAX + %EDX + 5], 46
>     inc %ECX
>     cmp %ECX, 7
>     jne .LBBinit_board_1    # loopexit.1
> .LBBinit_board_2:    # return
>     mov BYTE PTR [%EAX + 6], 0
>     mov BYTE PTR [%EAX + 13], 0
>     mov BYTE PTR [%EAX + 20], 0
>     mov BYTE PTR [%EAX + 27], 0
>     mov BYTE PTR [%EAX + 34], 0
>     mov BYTE PTR [%EAX + 41], 0
>     mov BYTE PTR [%EAX + 48], 0
>     ret
>
> This code is almost "perfect", with the exception of the multiply by 7
> inside the loop.  Loop strength reduction would transform this into an
> add, like GCC produces.
>
>> But still, the optimization passes shouldn't be leaving so much
>> garbage around for instruction selection to clean up.  At first, I
>> thought all that was needed was another round of basic block
>> optimizations after the loop was unrolled, until I saw the actual
>> LLVM bytecodes:
>
> There are two issues here.  First, LLVM code is NOT designed for
> trivial peephole code generation.  This is *possible*, but not going
> to give you good quality code (see above ;-) ).  The new isel uses an
> extremely light-weight DAG-based optimizer that both simplifies
> instruction selectors and optimizes the code before selection.
> LLVM code is meant to be a host to mid-level and interprocedural
> optimizations, and I think it does that job well.
>
>> Now the problem is obvious.  A two-dimensional array access is being
>> performed by a single instruction.  The arithmetic needed to address
>> the element is implicit, and therefore inaccessible to
>> optimizations.  The redundant calculations cannot be eliminated, nor
>> can strength reduction be performed.  getelementptr needs to be
>> broken down into its constituent operations at some point.
>
> Actually, strength reduction can be trivially done with getelementptr
> instructions; that was part of their design.  Part of the problem is
> that we have an induction variable canonicalization pass which UNDOES
> strength reduction to make loop analysis simpler, but we don't have a
> strength reduction pass to redo it.  Thus if you compile a loop like
> this (complex double chosen to show a size that X86 can't scale "for
> free"):
>
> #include <complex.h>
> void test(unsigned N, complex double *P) {
>     for (; N != 0; --N)
>         *P++ = 0;
> }
>
> We compile it to this LLVM code:
>
> void %test(uint %N, "complex long double"* %P) {
> entry:
>     %tmp.15 = seteq uint %N, 0        ; <bool> [#uses=1]
>     br bool %tmp.15, label %return, label %no_exit
>
> no_exit:        ; preds = %no_exit, %entry
>     %indvar = phi uint [ %indvar.next, %no_exit ], [ 0, %entry ]    ; <uint> [#uses=3]
>     %tmp.4 = getelementptr "complex long double"* %P, uint %indvar, uint 0    ; <double*> [#uses=1]
>     store double 0.000000e+00, double* %tmp.4
>     %tmp.5 = getelementptr "complex long double"* %P, uint %indvar, uint 1    ; <double*> [#uses=1]
>     store double 0.000000e+00, double* %tmp.5
>     %indvar.next = add uint %indvar, 1        ; <uint> [#uses=2]
>     %exitcond = seteq uint %indvar.next, %N        ; <bool> [#uses=1]
>     br bool %exitcond, label %return, label %no_exit
>
> return:        ; preds = %no_exit, %entry
>     ret void
> }
>
> Note the accesses to P are using P[indvar].{real|imag}, not
> incrementing P.
> Compile this to X86, and you get this loop (using the simple
> instruction selector):
>
> .LBBtest_1:    # no_exit
>     mov %ESI, %EDX
>     shl %ESI, 4        --> scale by sizeof(complex double)
>     mov %EDI, %ECX
>     add %EDI, %ESI
>     mov DWORD PTR [%EDI], 0
>     mov DWORD PTR [%EDI + 4], 0
>     mov %ESI, %EDX
>     shl %ESI, 4        --> scale by sizeof(complex double)
>     mov %EDI, %ECX
>     add %EDI, %ESI
>     lea %ESI, DWORD PTR [%EDI + 8]
>     mov DWORD PTR [%ESI], 0
>     mov DWORD PTR [%ESI + 4], 0
>     inc %EDX
>     cmp %EDX, %EAX
>     jne .LBBtest_1    # no_exit
>
> Now, if you disable the -indvars pass, you'll get this loop instead
> (note the pointer increment of P is retained):
>
> void %test(uint %N, "complex long double"* %P) {
> entry:
>     %tmp.15 = seteq uint %N, 0        ; <bool> [#uses=1]
>     br bool %tmp.15, label %return, label %no_exit
> no_exit:        ; preds = %no_exit, %entry
>     %P_addr.0.0 = phi "complex long double"* [ %inc, %no_exit ], [ %P, %entry ]    ; <"complex long double"*> [#uses=3]
>     %N_addr.0.0 = phi uint [ %dec, %no_exit ], [ %N, %entry ]    ; <uint> [#uses=1]
>     %inc = getelementptr "complex long double"* %P_addr.0.0, int 1    ; <"complex long double"*> [#uses=1]
>     %tmp.4 = getelementptr "complex long double"* %P_addr.0.0, int 0, uint 0    ; <double*> [#uses=1]
>     store double 0.000000e+00, double* %tmp.4
>     %tmp.5 = getelementptr "complex long double"* %P_addr.0.0, int 0, uint 1    ; <double*> [#uses=1]
>     store double 0.000000e+00, double* %tmp.5
>     %dec = add uint %N_addr.0.0, 4294967295        ; <uint> [#uses=2]
>     %tmp.1 = seteq uint %dec, 0        ; <bool> [#uses=1]
>     br bool %tmp.1, label %return, label %no_exit
> return:        ; preds = %no_exit, %entry
>     ret void
> }
>
> which codegens to this loop (simple isel again):
>
> .LBBtest_1:    # no_exit
>     lea %EDX, DWORD PTR [%ECX + 16]
>     mov DWORD PTR [%ECX], 0
>     mov DWORD PTR [%ECX + 4], 0
>     mov DWORD PTR [%ECX + 8], 0
>     mov DWORD PTR [%ECX + 12], 0
>     dec %EAX
>     test %EAX, %EAX
>     mov %ECX, %EDX
>     jne .LBBtest_1    # no_exit
>
> I think you'll agree that this loop is much nicer, though there is
> some weird register shuffling going on between ECX and EDX.  This is
> the code we would generate if we did loop strength reduction at the
> LLVM level.  Note that on X86 this isn't a HUGE problem in all cases:
> it can scale by 2, 4 and 8 for "free".  The problem is much worse on
> RISC targets.
>
>> Or am I missing something?  Is this breaking down and reapplying
>> optimizations part of the selection dag process?  I am somewhat
>> surprised that using it affected what loops got unrolled.
>
> I'm not sure what you mean.  The LLVM code input to the selection dag
> isel is exactly the same as that input to the simple isel, and it does
> no loop unrolling.  The full code produced by the simple isel (for
> Jeff's example) is attached below for reference.  Contrast this to the
> pattern isel output at the top.
>
> init_board:
>     sub %ESP, 4
>     mov DWORD PTR [%ESP], %ESI
>     mov %EAX, DWORD PTR [%ESP + 8]
>     mov %ECX, 0
> .LBBinit_board_1:    # loopexit.1
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     mov BYTE PTR [%ESI], 46
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     lea %EDX, DWORD PTR [%ESI + 1]
>     mov BYTE PTR [%EDX], 46
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     lea %EDX, DWORD PTR [%ESI + 2]
>     mov BYTE PTR [%EDX], 46
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     lea %EDX, DWORD PTR [%ESI + 3]
>     mov BYTE PTR [%EDX], 46
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     lea %EDX, DWORD PTR [%ESI + 4]
>     mov BYTE PTR [%EDX], 46
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     lea %EDX, DWORD PTR [%ESI + 5]
>     mov BYTE PTR [%EDX], 46
>     inc %ECX
>     cmp %ECX, 7
>     jne .LBBinit_board_1    # loopexit.1
> .LBBinit_board_2:    # return
>     mov BYTE PTR [%EAX + 6], 0
>     mov BYTE PTR [%EAX + 13], 0
>     mov BYTE PTR [%EAX + 20], 0
>     mov BYTE PTR [%EAX + 27], 0
>     mov BYTE PTR [%EAX + 34], 0
>     mov BYTE PTR [%EAX + 41], 0
>     mov BYTE PTR [%EAX + 48], 0
>     mov %ESI, DWORD PTR [%ESP]
>     add %ESP, 4
>     ret
>
> -Chris
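To make the missing transformation concrete, here is a source-level sketch (hypothetical code, not from the thread) of what loop strength reduction would do to init_board: the i*7 multiply that addresses b[i][j] is replaced by a pointer that advances by the row size on each outer iteration.

    #define ROWS 6
    #define COLS 7

    /* Hypothetical source-level equivalent of the strength-reduced
     * init_board: the i*7 multiply that addresses b[i][j] becomes a
     * pointer stepped by ROWS+1 per outer iteration. */
    void init_board_reduced(char b[COLS][ROWS + 1])
    {
        char *row = (char *)b;         /* induction pointer: current row */
        int i, j;

        for (i = 0; i < COLS; i++) {
            for (j = 0; j < ROWS; j++)
                row[j] = '.';
            row += ROWS + 1;           /* an add replaces the imul by 7 */
        }
        for (i = 0; i < COLS; i++)
            b[i][ROWS] = 0;            /* trailing NULs, as in the original */
    }

With the row address carried in a register, the unrolled loop body reduces to six stores plus one add, which matches the add-based addressing Jeff reports GCC generating.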
On Mon, 21 Feb 2005, Jeff Cohen wrote:

> Sorry, I thought I was running selection dag isel but I screwed up when
> trying out the really big array.  You're right, it does clean it up
> except for the multiplication.
>
> So LoopStrengthReduce is not ready for prime time and doesn't actually
> get used?

I don't know what the status of it is.  You could try it out and see
what it does. :)  I would suggest adding it to gccas immediately after
the SCCP pass.

> I might consider whipping it into shape.  Does it still have to handle
> getelementptr in its full generality?

What do you mean?

-Chris
--
http://nondot.org/sabre/
http://llvm.cs.uiuc.edu/
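As background for Jeff's question about "full generality": a getelementptr can fold several levels of indexing into a single instruction, so the address arithmetic is implicit.  Written out explicitly for the b[i][j] access in this thread, the computation an LSR pass must recover looks roughly like the following (hypothetical helper, names invented for illustration):

    /* Explicit form of the address arithmetic that the getelementptr
     * for b[i][j] performs implicitly; each row of b is ROWS+1 == 7
     * bytes.  LSR must uncover this before it can turn the i*7 term
     * into a running pointer. */
    char *element_addr(char (*b)[7], int i, int j)
    {
        char *base = (char *)b;
        return base + (long)i * 7   /* row scaling: the imul in the asm */
                    + j;            /* column offset: the lea/+k in the asm */
    }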
Chris Lattner wrote:

> On Mon, 21 Feb 2005, Jeff Cohen wrote:
>
>> Sorry, I thought I was running selection dag isel but I screwed up
>> when trying out the really big array.  You're right, it does clean it
>> up except for the multiplication.
>>
>> So LoopStrengthReduce is not ready for prime time and doesn't
>> actually get used?
>
> I don't know what the status of it is.  You could try it out and see
> what it does. :)  I would suggest adding it to gccas immediately after
> the SCCP pass.

I know it has a lot of FIXMEs in it.

>> I might consider whipping it into shape.  Does it still have to
>> handle getelementptr in its full generality?
>
> What do you mean?
>
> -Chris

Consider the bytecode I originally posted for the nested for loop.  If
the getelementptrs haven't been replaced with something simpler, LSR
will have to take them apart itself.  The multiplication cannot be
replaced with addition without the two-dimensional access being
replaced with a one-dimensional access plus some pointer arithmetic.  I
would guess that they will not have been decomposed before LSR is run,
correct?

It also appears the current LSR reduces each GEP in isolation from all
the others.  That is clearly bad for this example, as all the GEPs in
this loop do the same thing with the induction variable.  Creating a
separate pointer for each one only makes things worse.  Ordinarily the
addressing is explicit, so the redundancy would already have been
removed and the LSR algorithm wouldn't have to worry about it; here it
does.  The bottom line is that LSR has to strength-reduce operations
that are implicit, so it first has to determine what those operations
are and detect common subexpressions among them itself (possibly across
multiple basic blocks).

It doesn't appear to handle explicit multiplications at all (though
handling them probably wouldn't add much value).  It's a challenge :)
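To illustrate the common-subexpression point above: after unrolling, each of the six stores carries its own GEP that implicitly recomputes b + i*7 + k.  Reducing each GEP in isolation would materialize six separate induction pointers; recognizing the shared b + i*7 term yields a single one.  A hypothetical sketch (code invented for illustration):

    /* Sketch of the result LSR should aim for on the unrolled inner
     * loop: six GEPs sharing the subexpression b + i*7 collapse onto
     * one induction pointer, eliminating the redundant imul/mov/add
     * triples seen in the assembly. */
    void init_rows(char b[7][7])
    {
        char *row = (char *)b;
        int i;

        for (i = 0; i < 7; i++) {
            row[0] = '.';   /* all six stores reuse the same pointer */
            row[1] = '.';
            row[2] = '.';
            row[3] = '.';
            row[4] = '.';
            row[5] = '.';
            row += 7;       /* the only induction update needed */
        }
    }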