Hi,
I have been studying LLVM infrastructure from past 1 month and trying out
some hands on to get familiar with it. I am keen to propose loop reversal
pass in LLVM as a part of GSoC 2015. Following are the details of my
proposal.
A for loop can run in two ways – it can count up or it can count down.
1.
for(i=0; i<10 ;i++) {// Do something}
2.
for(i=9; i--;) {// Do something}
Counting down to zero with the decrement operator (i--) is faster than
counting up to a number of iterations with the increment operator (i++).
I have run a program with sufficient iteration for both types of loops (The
program file Loop.cpp is attached with the mail). The average run time
observed for both the loops over 100 runs
Loop counting up 0.15 ms
Loop counting down 0.08 ms
As observed, there is an execution time gain of almost 50%. The test case
is rather simple though; need to check in wide variety of test cases.
It is a legal transformation if the resulting dependence vector remains
lexicographically positive. Although trivial, it is a useful optimization
since it may enable others such as loop interchange and can reduce the loop
exit condition to a single branch-not-equal-to-zero instruction.
For example, the following code cannot be interchanged or have its inner
loop parallelized because of (1,−1) dependencies.
*do i = 1, n*
*do j = 1, n*
*a[i,j] = a[i-1,j+1] + 1*
*end do*
*end do*
Reversing the inner loop yields (1, 1) dependencies. The loops can now be
interchanged and/or the inner loop made parallel.
*do i = 1, n*
*do j = n, 1, -1*
*a[i,j] = a[i-1,j+1] + 1*
*end do*
*end do*
To see the difference in LLVM IR, I have taken a simpler test case:
*int a[5], b[5], c[5], d[5];*
*void foo() {*
*int i;*
*for(i=0; i<5; i++)*
*a[i] = b[i] + c[i];*
*}*
*void foo2() {*
*int i;*
*for(i=4; i--;)*
*d[i] = a[i] + b[i];*
*}*
*IR for foo()*
*define void @foo() {*
*br label %1*
*; <label>:1 ; preds = %13, %0*
*%i.0 = phi i32 [ 0, %0 ], [ %14, %13 ]*
*%2 = icmp slt i32 %i.0, 5*
*br i1 %2, label %3, label %15*
*; <label>:3 ; preds = %1*
*%4 = sext i32 %i.0 to i64*
*%5 = getelementptr inbounds [5 x i32], [5 x i32]* @b, i32 0, i64 %4*
*%6 = load i32, i32* %5, align 4*
*%7 = sext i32 %i.0 to i64*
*%8 = getelementptr inbounds [5 x i32], [5 x i32]* @c, i32 0, i64 %7*
*%9 = load i32, i32* %8, align 4*
*%10 = add nsw i32 %6, %9*
*%11 = sext i32 %i.0 to i64*
*%12 = getelementptr inbounds [5 x i32], [5 x i32]* @a, i32 0, i64 %11*
*store i32 %10, i32* %12, align 4*
*br label %13*
*; <label>:13 ; preds = %3*
*%14 = add nsw i32 %i.0, 1*
*br label %1*
*; <label>:15 ; preds = %1*
*ret void*
*}*
*IR for foo2()*
*define void @foo2() #0 {*
*br label %1*
*; <label>:1 ; preds = %4, %0*
*%i.0 = phi i32 [ 4, %0 ], [ %2, %4 ]*
*%2 = add nsw i32 %i.0, -1*
*%3 = icmp ne i32 %i.0, 0*
*br i1 %3, label %4, label %14*
*; <label>:4 ; preds = %1*
*%5 = sext i32 %2 to i64*
*%6 = getelementptr inbounds [5 x i32], [5 x i32]* @b, i32 0, i64 %5*
*%7 = load i32, i32* %6, align 4*
*%8 = sext i32 %2 to i64*
*%9 = getelementptr inbounds [5 x i32], [5 x i32]* @c, i32 0, i64 %8*
*%10 = load i32, i32* %9, align 4*
*%11 = add nsw i32 %7, %10*
*%12 = sext i32 %2 to i64*
*%13 = getelementptr inbounds [5 x i32], [5 x i32]* @d, i32 0, i64 %12*
*store i32 %11, i32* %13, align 4*
*br label %1*
*; <label>:14 ; preds = %1*
*ret void*
*}*
As visible, one of the blocks is eliminated in case of loop reverse
traversal, and induction variable updating instruction (i--) has been
hoisted to the start block.
(This has constant inbounds of the loop, which in further optimizations
eliminates all the blocks. The above IR is for demo purpose)
As of now I can think of 3 stages to achieve this :
1.
Check legality if the loop can be reversed – check dependence analysis
in the loop , various scenarios will come – exit condition being constant,
variable, nested loops, etc.
2.
Check profitability of the loop – check if somehow the loop exit
condition can be trimmed down to comparison with zero ( More Inputs on this
are required )
3.
If steps 1 and 2 are true, then reverse the loop.
Inputs on 1st and 2nd steps are the critical for implementation.
I am pasting some of the links of the papers, where I found mention of loop
traversal as one of the prominent loop optimization techniques.
http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf
http://perso.citi.insa-lyon.fr/trisset/papers/sympa08.pdf
https://www.complang.tuwien.ac.at/andi/papers/ijcsee_13.pdf
ftp://gcc.gnu.org/pub/gcc/summit/2004/High%20Level%20Loop%20Optimizations.pdf
Can anyone review my idea and suggest improvements? I will be glad if
someone mentors me on this, especially on the first 2 steps, if idea is
found to be useful.
Regards,
Vishal Sarda,
3rd Year Undergraduate,
Department of Computer Engineering,
College of Engineering, Pune
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150318/21525a71/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Loop.cpp
Type: text/x-c++src
Size: 978 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150318/21525a71/attachment.cpp>