Sorry, I thought I was running selection dag isel, but I screwed up when trying out the really big array.  You're right, it does clean it up except for the multiplication.

So LoopStrengthReduce is not ready for prime time and doesn't actually get used?  I might consider whipping it into shape.  Does it still have to handle getelementptr in its full generality?

Chris Lattner wrote:

> On Mon, 21 Feb 2005, Jeff Cohen wrote:
>
>> I noticed that fourinarow is one of the programs in which LLVM is
>> much slower than GCC, so I decided to take a look and see why that is
>> so.  The program has many loops that look like this:
>>
>> #define ROWS 6
>> #define COLS 7
>>
>> void init_board(char b[COLS][ROWS+1])
>> {
>>     int i, j;
>>
>>     for (i = 0; i < COLS; i++)
>>         for (j = 0; j < ROWS; j++)
>>             b[i][j] = '.';
>>     for (i = 0; i < COLS; i++)
>>         b[i][ROWS] = 0;
>> }
>>
>> This generates the following X86 code:
>>
>>     imull $7, %ecx, %edx
>>     movl %eax, %esi
>>     addl %edx, %esi
>>     movb $46, (%esi)
>>     imull $7, %ecx, %edx
>>     movl %eax, %esi
>>     addl %edx, %esi
>>     leal 1(%esi), %edx
>
> ... (many many copies of this, see the end of the email for full
> output) ...
>
>> The code generated by GCC is much faster.  LLVM does unroll the loop,
>> which GCC does not do, but the addressing code is a constant
>> repetition of these three instructions:
>>
>>     imull $7, %ecx, %edx
>>     movl %eax, %esi
>>     addl %edx, %esi
>
> Yup, this is an issue of simple "peephole expansion" of the
> getelementptr instruction.
>
>> All but the first occurrence do nothing but put the same address into
>> %esi that was already there to begin with, and do it with an
>> expensive multiply at that.  GCC correctly creates a suitable
>> induction variable and strength-reduces the multiplication down to
>> addition.
>
> This is called loop strength reduction.
>
>> It turns out that using the new selection dag code generator, the
>> outer loop is unrolled also and nothing is left but a bunch of movb
>> instructions.  Much, much faster.
>
> For the record, instead of the ugly code above, the new isel generates:
>
> init_board:
>     mov %EAX, DWORD PTR [%ESP + 4]
>     mov %ECX, 0
> .LBBinit_board_1:    # loopexit.1
>     imul %EDX, %ECX, 7
>     mov BYTE PTR [%EAX + %EDX], 46
>     mov BYTE PTR [%EAX + %EDX + 1], 46
>     mov BYTE PTR [%EAX + %EDX + 2], 46
>     mov BYTE PTR [%EAX + %EDX + 3], 46
>     mov BYTE PTR [%EAX + %EDX + 4], 46
>     mov BYTE PTR [%EAX + %EDX + 5], 46
>     inc %ECX
>     cmp %ECX, 7
>     jne .LBBinit_board_1    # loopexit.1
> .LBBinit_board_2:    # return
>     mov BYTE PTR [%EAX + 6], 0
>     mov BYTE PTR [%EAX + 13], 0
>     mov BYTE PTR [%EAX + 20], 0
>     mov BYTE PTR [%EAX + 27], 0
>     mov BYTE PTR [%EAX + 34], 0
>     mov BYTE PTR [%EAX + 41], 0
>     mov BYTE PTR [%EAX + 48], 0
>     ret
>
> This code is almost "perfect", with the exception of the multiply by 7
> inside the loop.  Loop strength reduction would transform this into an
> add, like GCC produces.
>
>> But still, the optimization passes shouldn't be leaving so much
>> garbage around for instruction selection to clean up.  At first, I
>> thought all that was needed was another round of basic block
>> optimizations after the loop was unrolled, until I saw the actual
>> LLVM bytecodes:
>
> There are two issues here.  First, LLVM code is NOT designed for
> trivial peephole code generation.  This is *possible*, but not going
> to give you good quality code (see above ;-) ).  The new isel uses an
> extremely light-weight DAG-based optimizer that both simplifies
> instruction selectors and optimizes the code before selection.
> LLVM code is meant to be a host to mid-level and interprocedural
> optimizations, and I think it does that job well.
>
>> Now the problem is obvious.  A two-dimensional array access is being
>> performed by a single instruction.  The arithmetic needed to address
>> the element is implicit, and therefore inaccessible to
>> optimizations.  The redundant calculations cannot be eliminated, nor
>> can strength reduction be performed.  getelementptr needs to be
>> broken down into its constituent operations at some point.
>
> Actually, strength reduction can be trivially done with getelementptr
> instructions; that was part of their design.  Part of the problem is
> that we have an induction variable canonicalization pass which UNDOES
> strength reduction to make loop analysis simpler, but we don't have a
> strength reduction pass to redo it.  Thus if you compile a loop like
> this (complex double chosen to show a size that X86 can't scale "for
> free"):
>
> #include <complex.h>
> void test(unsigned N, complex double *P) {
>     for (; N != 0; --N)
>         *P++ = 0;
> }
>
> We compile it to this LLVM code:
>
> void %test(uint %N, "complex long double"* %P) {
> entry:
>     %tmp.15 = seteq uint %N, 0        ; <bool> [#uses=1]
>     br bool %tmp.15, label %return, label %no_exit
>
> no_exit:        ; preds = %no_exit, %entry
>     %indvar = phi uint [ %indvar.next, %no_exit ], [ 0, %entry ]    ; <uint> [#uses=3]
>     %tmp.4 = getelementptr "complex long double"* %P, uint %indvar, uint 0    ; <double*> [#uses=1]
>     store double 0.000000e+00, double* %tmp.4
>     %tmp.5 = getelementptr "complex long double"* %P, uint %indvar, uint 1    ; <double*> [#uses=1]
>     store double 0.000000e+00, double* %tmp.5
>     %indvar.next = add uint %indvar, 1        ; <uint> [#uses=2]
>     %exitcond = seteq uint %indvar.next, %N        ; <bool> [#uses=1]
>     br bool %exitcond, label %return, label %no_exit
>
> return:        ; preds = %no_exit, %entry
>     ret void
> }
>
> Note the accesses to P are using P[indvar].{real|imag}, not
> incrementing P.
> Compile this to X86, and you get this loop (using the simple
> instruction selector):
>
> .LBBtest_1:    # no_exit
>     mov %ESI, %EDX
>     shl %ESI, 4        --> scale by sizeof(complex double)
>     mov %EDI, %ECX
>     add %EDI, %ESI
>     mov DWORD PTR [%EDI], 0
>     mov DWORD PTR [%EDI + 4], 0
>     mov %ESI, %EDX
>     shl %ESI, 4        --> scale by sizeof(complex double)
>     mov %EDI, %ECX
>     add %EDI, %ESI
>     lea %ESI, DWORD PTR [%EDI + 8]
>     mov DWORD PTR [%ESI], 0
>     mov DWORD PTR [%ESI + 4], 0
>     inc %EDX
>     cmp %EDX, %EAX
>     jne .LBBtest_1    # no_exit
>
> Now, if you disable the -indvars pass, you'll get this loop instead
> (note the pointer increment of P is retained):
>
> void %test(uint %N, "complex long double"* %P) {
> entry:
>     %tmp.15 = seteq uint %N, 0        ; <bool> [#uses=1]
>     br bool %tmp.15, label %return, label %no_exit
> no_exit:        ; preds = %no_exit, %entry
>     %P_addr.0.0 = phi "complex long double"* [ %inc, %no_exit ], [ %P, %entry ]    ; <"complex long double"*> [#uses=3]
>     %N_addr.0.0 = phi uint [ %dec, %no_exit ], [ %N, %entry ]    ; <uint> [#uses=1]
>     %inc = getelementptr "complex long double"* %P_addr.0.0, int 1    ; <"complex long double"*> [#uses=1]
>     %tmp.4 = getelementptr "complex long double"* %P_addr.0.0, int 0, uint 0    ; <double*> [#uses=1]
>     store double 0.000000e+00, double* %tmp.4
>     %tmp.5 = getelementptr "complex long double"* %P_addr.0.0, int 0, uint 1    ; <double*> [#uses=1]
>     store double 0.000000e+00, double* %tmp.5
>     %dec = add uint %N_addr.0.0, 4294967295        ; <uint> [#uses=2]
>     %tmp.1 = seteq uint %dec, 0        ; <bool> [#uses=1]
>     br bool %tmp.1, label %return, label %no_exit
> return:        ; preds = %no_exit, %entry
>     ret void
> }
>
> which codegens to this loop (simple isel again):
>
> .LBBtest_1:    # no_exit
>     lea %EDX, DWORD PTR [%ECX + 16]
>     mov DWORD PTR [%ECX], 0
>     mov DWORD PTR [%ECX + 4], 0
>     mov DWORD PTR [%ECX + 8], 0
>     mov DWORD PTR [%ECX + 12], 0
>     dec %EAX
>     test %EAX, %EAX
>     mov %ECX, %EDX
>     jne .LBBtest_1    # no_exit
>
> I think you'll agree that this loop is much nicer, though there is
> some weird register shuffling going on between ECX and EDX.  This is
> the code we would generate if we did loop strength reduction at the
> LLVM level.  Note that on X86 this isn't a HUGE problem in all cases:
> it can scale by 2, 4 and 8 for "free".  The problem is much worse on
> RISC targets.
>
>> Or am I missing something?  Is this breaking down and reapplying
>> optimizations part of the selection dag process?  I am somewhat
>> surprised that using it affected what loops got unrolled.
>
> I'm not sure what you mean.  The LLVM code input to the selection dag
> isel is exactly the same as that input to the simple isel, and it does
> no loop unrolling.  The full code produced by the simple isel (for
> Jeff's example) is attached below for reference.  Contrast this to the
> pattern isel output at the top.
>
> init_board:
>     sub %ESP, 4
>     mov DWORD PTR [%ESP], %ESI
>     mov %EAX, DWORD PTR [%ESP + 8]
>     mov %ECX, 0
> .LBBinit_board_1:    # loopexit.1
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     mov BYTE PTR [%ESI], 46
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     lea %EDX, DWORD PTR [%ESI + 1]
>     mov BYTE PTR [%EDX], 46
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     lea %EDX, DWORD PTR [%ESI + 2]
>     mov BYTE PTR [%EDX], 46
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     lea %EDX, DWORD PTR [%ESI + 3]
>     mov BYTE PTR [%EDX], 46
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     lea %EDX, DWORD PTR [%ESI + 4]
>     mov BYTE PTR [%EDX], 46
>     imul %EDX, %ECX, 7
>     mov %ESI, %EAX
>     add %ESI, %EDX
>     lea %EDX, DWORD PTR [%ESI + 5]
>     mov BYTE PTR [%EDX], 46
>     inc %ECX
>     cmp %ECX, 7
>     jne .LBBinit_board_1    # loopexit.1
> .LBBinit_board_2:    # return
>     mov BYTE PTR [%EAX + 6], 0
>     mov BYTE PTR [%EAX + 13], 0
>     mov BYTE PTR [%EAX + 20], 0
>     mov BYTE PTR [%EAX + 27], 0
>     mov BYTE PTR [%EAX + 34], 0
>     mov BYTE PTR [%EAX + 41], 0
>     mov BYTE PTR [%EAX + 48], 0
>     mov %ESI, DWORD PTR [%ESP]
>     add %ESP, 4
>     ret
>
> -Chris
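To make the missing transformation concrete, here is a source-level sketch (hypothetical code, not from the thread) of what loop strength reduction would do to init_board: the i*7 multiply that addresses b[i][j] is replaced by a pointer that advances by the row size on each outer iteration.

    #define ROWS 6
    #define COLS 7

    /* Hypothetical source-level equivalent of the strength-reduced
     * init_board: the i*7 multiply that addresses b[i][j] becomes a
     * pointer stepped by ROWS+1 per outer iteration. */
    void init_board_reduced(char b[COLS][ROWS + 1])
    {
        char *row = (char *)b;         /* induction pointer: current row */
        int i, j;

        for (i = 0; i < COLS; i++) {
            for (j = 0; j < ROWS; j++)
                row[j] = '.';
            row += ROWS + 1;           /* an add replaces the imul by 7 */
        }
        for (i = 0; i < COLS; i++)
            b[i][ROWS] = 0;            /* trailing NULs, as in the original */
    }

With the row address carried in a register, the unrolled loop body reduces to six stores plus one add, which matches the add-based addressing Jeff reports GCC generating.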
On Mon, 21 Feb 2005, Jeff Cohen wrote:

> Sorry, I thought I was running selection dag isel but I screwed up when
> trying out the really big array.  You're right, it does clean it up
> except for the multiplication.
>
> So LoopStrengthReduce is not ready for prime time and doesn't actually
> get used?

I don't know what the status of it is.  You could try it out and see
what it does. :)  I would suggest adding it to gccas immediately after
the SCCP pass.

> I might consider whipping it into shape.  Does it still have to handle
> getelementptr in its full generality?

What do you mean?

-Chris
--
http://nondot.org/sabre/
http://llvm.cs.uiuc.edu/
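As background for Jeff's question about "full generality": a getelementptr can fold several levels of indexing into a single instruction, so the address arithmetic is implicit.  Written out explicitly for the b[i][j] access in this thread, the computation an LSR pass must recover looks roughly like the following (hypothetical helper, names invented for illustration):

    /* Explicit form of the address arithmetic that the getelementptr
     * for b[i][j] performs implicitly; each row of b is ROWS+1 == 7
     * bytes.  LSR must uncover this before it can turn the i*7 term
     * into a running pointer. */
    char *element_addr(char (*b)[7], int i, int j)
    {
        char *base = (char *)b;
        return base + (long)i * 7   /* row scaling: the imul in the asm */
                    + j;            /* column offset: the lea/+k in the asm */
    }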
Chris Lattner wrote:

> On Mon, 21 Feb 2005, Jeff Cohen wrote:
>
>> Sorry, I thought I was running selection dag isel but I screwed up
>> when trying out the really big array.  You're right, it does clean it
>> up except for the multiplication.
>>
>> So LoopStrengthReduce is not ready for prime time and doesn't
>> actually get used?
>
> I don't know what the status of it is.  You could try it out and see
> what it does. :)  I would suggest adding it to gccas immediately after
> the SCCP pass.

I know it has a lot of FIXMEs in it.

>> I might consider whipping it into shape.  Does it still have to
>> handle getelementptr in its full generality?
>
> What do you mean?
>
> -Chris

Consider the bytecode I originally posted for the nested for loop.  If
the getelementptrs haven't been replaced with something simpler, LSR
will have to take them apart itself.  The multiplication cannot be
replaced with addition without the two-dimensional access being
replaced with a one-dimensional access plus some pointer arithmetic.  I
would guess that they will not have been decomposed before LSR is run,
correct?

It also appears the current LSR reduces each GEP in isolation from all
the others.  That is clearly bad for this example, as all the GEPs in
this loop do the same thing with the induction variable.  Creating a
separate pointer for each one only makes things worse.  Ordinarily the
addressing is explicit, so the redundancy would already have been
removed and the LSR algorithm wouldn't have to worry about it; here it
does.  The bottom line is that LSR has to strength-reduce operations
that are implicit, so it first has to determine what those operations
are and detect common subexpressions among them itself (possibly across
multiple basic blocks).

It doesn't appear to handle explicit multiplications at all (though
handling them probably wouldn't add much value).  It's a challenge :)
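To illustrate the common-subexpression point above: after unrolling, each of the six stores carries its own GEP that implicitly recomputes b + i*7 + k.  Reducing each GEP in isolation would materialize six separate induction pointers; recognizing the shared b + i*7 term yields a single one.  A hypothetical sketch (code invented for illustration):

    /* Sketch of the result LSR should aim for on the unrolled inner
     * loop: six GEPs sharing the subexpression b + i*7 collapse onto
     * one induction pointer, eliminating the redundant imul/mov/add
     * triples seen in the assembly. */
    void init_rows(char b[7][7])
    {
        char *row = (char *)b;
        int i;

        for (i = 0; i < 7; i++) {
            row[0] = '.';   /* all six stores reuse the same pointer */
            row[1] = '.';
            row[2] = '.';
            row[3] = '.';
            row[4] = '.';
            row[5] = '.';
            row += 7;       /* the only induction update needed */
        }
    }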