thr3ads.net - llvm dev - [LLVMdev] GSoC:Loop Reversal Transformation [Mar 2015]

If this information is useful, please help other people find it:
Share via:

vishal sarda

2015-Mar-17 19:00 UTC

[LLVMdev] GSoC:Loop Reversal Transformation

Hi,

I have been studying LLVM infrastructure from past 1 month and trying out
some hands on to get familiar with it. I am keen to propose loop reversal
pass in LLVM as a part of GSoC 2015. Following are the details of my
proposal.

A for loop can run in two ways – it can count up or it can count down.

   1.

   for(i=0; i<10 ;i++) {// Do something}
   2.

   for(i=9; i--;) {// Do something}

Counting down to zero with the decrement operator (i--) is faster than
counting up to a number of iterations with the increment operator (i++).

I have run a program with sufficient iteration for both types of loops (The
program file Loop.cpp is attached with the mail). The average run time
observed for both the loops over 100 runs

Loop counting up 0.15 ms

Loop counting down 0.08 ms

As observed, there is an execution time gain of almost 50%. The test case
is rather simple though; need to check in wide variety of test cases.

It is a legal transformation if the resulting dependence vector remains
lexicographically positive. Although trivial, it is a useful optimization
since it may enable others such as loop interchange and can reduce the loop
exit condition to a single branch-not-equal-to-zero instruction.

For example, the following code cannot be interchanged or have its inner
loop parallelized because of (1,−1) dependencies.

*do i = 1, n*

*do j = 1, n*

*a[i,j] = a[i-1,j+1] + 1*

*end do*

*end do*

Reversing the inner loop yields (1, 1) dependencies. The loops can now be
interchanged and/or the inner loop made parallel.

*do i = 1, n*

*do j = n, 1, -1*

*a[i,j] = a[i-1,j+1] + 1*

*end do*

*end do*

 To see the difference in LLVM IR, I have taken a simpler test case:

*int a[5], b[5], c[5], d[5];*

*void foo() {*

*int i;*

*for(i=0; i<5; i++)*

*a[i] = b[i] + c[i];*

*}*

*void foo2() {*

*int i;*

*for(i=4; i--;)*

*d[i] = a[i] + b[i];*

*}*



 *IR for foo()*

*define void @foo() {*

*br label %1*

*; <label>:1 ; preds = %13, %0*

*%i.0 = phi i32 [ 0, %0 ], [ %14, %13 ]*

*%2 = icmp slt i32 %i.0, 5*

*br i1 %2, label %3, label %15*

*; <label>:3 ; preds = %1*

*%4 = sext i32 %i.0 to i64*

*%5 = getelementptr inbounds [5 x i32], [5 x i32]* @b, i32 0, i64 %4*

*%6 = load i32, i32* %5, align 4*

*%7 = sext i32 %i.0 to i64*

*%8 = getelementptr inbounds [5 x i32], [5 x i32]* @c, i32 0, i64 %7*

*%9 = load i32, i32* %8, align 4*

*%10 = add nsw i32 %6, %9*

*%11 = sext i32 %i.0 to i64*

*%12 = getelementptr inbounds [5 x i32], [5 x i32]* @a, i32 0, i64 %11*

*store i32 %10, i32* %12, align 4*

*br label %13*

*; <label>:13 ; preds = %3*

*%14 = add nsw i32 %i.0, 1*

*br label %1*

*; <label>:15 ; preds = %1*

*ret void*

*}*



 *IR for foo2()*

*define void @foo2() #0 {*

*br label %1*

*; <label>:1 ; preds = %4, %0*

*%i.0 = phi i32 [ 4, %0 ], [ %2, %4 ]*

*%2 = add nsw i32 %i.0, -1*

*%3 = icmp ne i32 %i.0, 0*

*br i1 %3, label %4, label %14*

*; <label>:4 ; preds = %1*

*%5 = sext i32 %2 to i64*

*%6 = getelementptr inbounds [5 x i32], [5 x i32]* @b, i32 0, i64 %5*

*%7 = load i32, i32* %6, align 4*

*%8 = sext i32 %2 to i64*

*%9 = getelementptr inbounds [5 x i32], [5 x i32]* @c, i32 0, i64 %8*

*%10 = load i32, i32* %9, align 4*

*%11 = add nsw i32 %7, %10*

*%12 = sext i32 %2 to i64*

*%13 = getelementptr inbounds [5 x i32], [5 x i32]* @d, i32 0, i64 %12*

*store i32 %11, i32* %13, align 4*

*br label %1*



 *; <label>:14 ; preds = %1*

*ret void*

*}*

As visible, one of the blocks is eliminated in case of loop reverse
traversal, and induction variable updating instruction (i--) has been
hoisted to the start block.

(This has constant inbounds of the loop, which in further optimizations
eliminates all the blocks. The above IR is for demo purpose)

 As of now I can think of 3 stages to achieve this :

   1.

   Check legality if the loop can be reversed – check dependence analysis
   in the loop , various scenarios will come – exit condition being constant,
   variable, nested loops, etc.
   2.

   Check profitability of the loop – check if somehow the loop exit
   condition can be trimmed down to comparison with zero ( More Inputs on this
   are required )
   3.

   If steps 1 and 2 are true, then reverse the loop.

Inputs on 1st and 2nd steps are the critical for implementation.

I am pasting some of the links of the papers, where I found mention of loop
traversal as one of the prominent loop optimization techniques.

http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf

http://perso.citi.insa-lyon.fr/trisset/papers/sympa08.pdf

https://www.complang.tuwien.ac.at/andi/papers/ijcsee_13.pdf

ftp://gcc.gnu.org/pub/gcc/summit/2004/High%20Level%20Loop%20Optimizations.pdf

Can anyone review my idea and suggest improvements? I will be glad if
someone mentors me on this, especially on the first 2 steps, if idea is
found to be useful.

Regards,

 Vishal Sarda,

 3rd Year Undergraduate,

 Department of Computer Engineering,

 College of Engineering, Pune
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150318/21525a71/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Loop.cpp
Type: text/x-c++src
Size: 978 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150318/21525a71/attachment.cpp>

Matt Godbolt

2015-Mar-17 19:50 UTC

head link

[LLVMdev] GSoC:Loop Reversal Transformation

Hi,

On Tue, Mar 17, 2015 at 2:00 PM, vishal sarda <vishalksarda at gmail.com>
wrote:
> [snip]
>
> Loop counting up 0.15 ms
>
> Loop counting down 0.08 ms
>I'm no llvm expert, but as an interest bystander: I suspect you compiled
the source without any optimizations applied - I tried to replicate this
behaviour and found the optimzer happily replaces both the inner loops you
had with a constant, and thus I got the same time on both loops. (e.g. see
http://goo.gl/aXFkVb )

Benchmarks of this nature where the run time is so small are notoriously
prone to measurement errors, so I'd be a little careful drawing conclusions
from the sample you listed. Also; what architecture did you measure on, and
what spec machine?

Not at all to dissuade you from investigating this optimization! I'm just a
little sceptical of the benchmark you posted!

Best regards,

Matt
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150317/d511fb80/attachment.html>

vishal sarda

2015-Mar-18 15:33 UTC

head link

[LLVMdev] GSoC:Loop Reversal Transformation

Hi Matt, All,

Thanks for looking into this and your suggestions.

I compiled the program with -O3 optimization level (clang -O3) on X86_64
target.
After your suggestion, i ran the program with iteration of 10000 runs and
found that average runtimes are (attaching data collected as well)

Forward loop traverse : 2.006 milli seconds
Reverse loop traverse : 1.531 milli seconds

Yes, i agree that this sample program may not be sufficient to say that
loop reversal traversing will always be faster, however, difference in
runtime is visible though. And that's where the profitability calculation
comes into picture.

I found mention of loop traverse in various papers (links mentioned in
previous mail), and hence thought of implementing it.

I am not sure though if its really helpful. Above papers mentioned that it
might not be beneficial in itself, but opens up the opportunity for other
optimizations.

Suggestions on this are most welcomed. Waiting for others to pitch in too.

Regards,
Vishal Sarda,
3rd Year Undergraduate,
Department of Computer Engineering,
College of Engineering, Pune

On Wed, Mar 18, 2015 at 1:20 AM, Matt Godbolt <matt at godbolt.org> wrote:
> Hi,
>
> On Tue, Mar 17, 2015 at 2:00 PM, vishal sarda <vishalksarda at
gmail.com>
> wrote:
>
>> [snip]
>>
>> Loop counting up 0.15 ms
>>
>> Loop counting down 0.08 ms
>>
> I'm no llvm expert, but as an interest bystander: I suspect you
compiled
> the source without any optimizations applied - I tried to replicate this
> behaviour and found the optimzer happily replaces both the inner loops you
> had with a constant, and thus I got the same time on both loops. (e.g. see
> http://goo.gl/aXFkVb )
>
> Benchmarks of this nature where the run time is so small are notoriously
> prone to measurement errors, so I'd be a little careful drawing
conclusions
> from the sample you listed. Also; what architecture did you measure on, and
> what spec machine?
>
> Not at all to dissuade you from investigating this optimization! I'm
just
> a little sceptical of the benchmark you posted!
>
> Best regards,
>
> Matt
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150318/ecd31c8c/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Compare.xlsx
Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Size: 276128 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20150318/ecd31c8c/attachment.xlsx>

llvm dev - Mar 2015 - [LLVMdev] GSoC:Loop Reversal Transformation

[LLVMdev] GSoC:Loop Reversal Transformation

[LLVMdev] GSoC:Loop Reversal Transformation

[LLVMdev] GSoC:Loop Reversal Transformation