Juliusz Sompolski
2010-Nov-23 07:21 UTC
[LLVMdev] Unrolling an arithmetic expression inside a loop
Hello, I've been redirected from cfe-dev, as code optimizations in clang are done in llvm layer. I'm investigating how optimized code clang generates, and have come across such an example: I have two procedures: void exec0(const int *X, const int *Y, int *res, const int N) { int t1[N],t2[N],t3[N],t4[N],t5[N],t6[N]; for(int i = 0; i < N; i++) { t1[i] = X[i]+Y[i]; // X+Y t2[i] = t1[i]*Y[i]; // XY + Y2 t3[i] = X[i]*Y[i]; // XY t4[i] = Y[i]*Y[i]; // Y2 t5[i] = t3[i]+t4[i]; // XY + Y2 t6[i] = t2[i]-t5[i]; // 0 res[i] = t6[i] + t3[i]; // XY } } vs void exec1(const int *X, const int *Y, int *res, const int N) { for(int i = 0; i < N; i++) { int t1,t2,t3,t4,t5,t6; t1 = X[i]+Y[i]; // X+Y t2 = t1*Y[i]; // XY + Y2 t3 = X[i]*Y[i]; // XY t4 = Y[i]*Y[i]; // Y2 t5 = t3+t4; // XY + Y2 t6 = t2-t5; // 0 res[i] = t6 + t3; // XY } } where I perform some arithmetic operations on a vector. The expression in written in an obfuscated way, but it actually boils down to just res = X*Y. In the first case I create tables to store temporaries, in the second case I have just temporary variables to store intermediate results. I compile with clang -S -emit-llvm -std=gnu99 -O3 clang version 2.9 (trunk 118238) Target: x86_64-unknown-linux-gnu Thread model: posix What I get from clang is: 1) In exec0 it recognizes that it doesn't need the tables and doesn't create them and also, it simplifies the expression, so that the generated code inside the loop is just a multiplication: %scevgep = getelementptr i32* %X, i64 %indvar %scevgep2 = getelementptr i32* %Y, i64 %indvar %scevgep9 = getelementptr i32* %res, i64 %indvar %3 = load i32* %scevgep, align 4, !tbaa !0 %4 = load i32* %scevgep2, align 4, !tbaa !0 %5 = mul nsw i32 %4, %3 store i32 %5, i32* %scevgep9, align 4, !tbaa !0 2) In exec1 however it fails to recognize that the temporary variables are not reused anywhere and fails to simplify the arithmetic expression, producing: %scevgep = getelementptr i32* %X, i64 %indvar %scevgep4 = getelementptr i32* %Y, i64 %indvar %scevgep5 = getelementptr i32* %res, i64 %indvar %3 = load i32* %scevgep, align 4, !tbaa !0 %4 = load i32* %scevgep4, align 4, !tbaa !0 %5 = add nsw i32 %4, %3 %6 = mul nsw i32 %5, %4 %7 = mul nsw i32 %4, %4 %8 = sub i32 %6, %7 store i32 %8, i32* %scevgep5, align 4, !tbaa !0 It is interesting, that I get better compilation output when inputting clang with worse input source (creating redundant temporary tables on stack). Cheers, Juliusz Sompolski
Duncan Sands
2010-Nov-23 08:50 UTC
[LLVMdev] Unrolling an arithmetic expression inside a loop
Hi Juliusz,> %5 = add nsw i32 %4, %3 > %6 = mul nsw i32 %5, %4 > %7 = mul nsw i32 %4, %4 > %8 = sub i32 %6, %7this looks like an instcombine deficiency (not realizing that %8 equals %3 * %4). Ciao, duncan.
Duncan Sands
2010-Nov-23 14:27 UTC
[LLVMdev] Unrolling an arithmetic expression inside a loop
Hi Juliusz,> 2) In exec1 however it fails to recognize that the temporary variables > are not reused anywhere and fails to simplify the arithmetic expression, > producing: > %scevgep = getelementptr i32* %X, i64 %indvar > %scevgep4 = getelementptr i32* %Y, i64 %indvar > %scevgep5 = getelementptr i32* %res, i64 %indvar > %3 = load i32* %scevgep, align 4, !tbaa !0 > %4 = load i32* %scevgep4, align 4, !tbaa !0 > %5 = add nsw i32 %4, %3 > %6 = mul nsw i32 %5, %4 > %7 = mul nsw i32 %4, %4 > %8 = sub i32 %6, %7 > store i32 %8, i32* %scevgep5, align 4, !tbaa !0 > > It is interesting, that I get better compilation output when inputting > clang with worse input source (creating redundant temporary tables on > stack).this is now fixed in the development branch of LLVM. Ciao, Duncan.