thr3ads.net - llvm dev - [LLVMdev] Trip count and Loop Vectorizer [Sep 2013]


Murali, Sriram

2013-Sep-27 17:17 UTC

[LLVMdev] Trip count and Loop Vectorizer

Hi Nadav,
Thanks for the response. I forgot to mention that the trip count check also has
an upper limit of 16:
    TinyTripCountVectorThreshold = 16;
    if (TC > 0u && TC < TinyTripCountVectorThreshold)
So right now the LV unrolls any loop whose trip count is 0 or >= 16. With the
change to the lower bound, the check would also cover loops with a trip count
of 0. SCEV returns a trip count of 0 in this case because it determines that
the backedge is never taken.

ScalarEvolution::ComputeExitLimitFromCond () {
  ...
  if (L->contains(FBB) == !CI->getZExtValue())
    { }
  else
    // The backedge is never taken.
    return getConstant(CI->getType(), 0);
}
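
The effect of the lower bound in that check can be sketched in isolation. This is a minimal standalone illustration, not LoopVectorizer code: the function names isTinyTripCount and isTinyTripCountProposed are hypothetical, and only the threshold value and the comparison come from the message above. A trip count of 0 stands for "unknown", as described for getSmallConstantTripCount.

```cpp
// Threshold value quoted in the thread.
static const unsigned TinyTripCountVectorThreshold = 16;

// The check as quoted: a trip count of 0 ("unknown") is NOT classified as
// tiny, so such loops stay eligible for unrolling.
bool isTinyTripCount(unsigned TC) {
  return TC > 0u && TC < TinyTripCountVectorThreshold;
}

// The proposed variant (hypothetical name). For an unsigned value the test
// "TC >= 0u" is always true, so it is omitted here and only the upper bound
// remains. Every loop with an unknown trip count (0) is then classified as
// tiny as well.
bool isTinyTripCountProposed(unsigned TC) {
  return TC < TinyTripCountVectorThreshold;
}
```

With these definitions, a trip count of 8 is tiny under both variants and 16 under neither; the two differ only for 0, which is exactly the "unknown" case.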
From: Nadav Rotem [mailto:nrotem at apple.com]
Sent: Friday, September 27, 2013 1:03 PM
To: Murali, Sriram
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Trip count and Loop Vectorizer

Hi Sriram,

Thanks for performing this analysis. The problem here, both for memcpy and the
vectorizer, is that we can't predict the size of "n", even though
the only use of 'n' is for the loop bound for the alloca [4 x [8 x
i32]]. If you change the unroll condition to TC >= 0 then you will disable
loop unrolling for all loops because getSmallConstantTripCount returns an
unsigned number. You can control the unroll factor using metadata or using the
command line tools.

Thanks,
Nadav
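
As one illustration of per-loop control, Clang releases newer than this thread accept `#pragma clang loop`, which attaches loop metadata that the vectorizer reads. The sketch below applies it to the inner loop of foo from the original message. It assumes such a newer Clang; the pragma is guarded so that other compilers simply skip it. The array b is zero-initialized here (the original leaves it uninitialized) only so that the example has defined behavior.

```cpp
// foo from the thread, with the inner loop annotated.
void foo(int a[4][8], int n) {
  int b[4][8] = {};  // assumption: zero-initialized, unlike the original
  for (int i = 0; i < 4; i++) {
#if defined(__clang__)
    // Ask Clang not to vectorize or interleave (unroll) this loop.
#pragma clang loop vectorize(disable) interleave(disable)
#endif
    for (int j = 0; j < n; j++) {
      a[i][j] = b[i][j];
    }
  }
}
```

The annotation changes only how the loop is compiled; the copy itself behaves the same with or without it.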


On Sep 27, 2013, at 9:41 AM, Murali, Sriram <sriram.murali at intel.com> wrote:


Hi,
I am trying to get a small loop to *not vectorize* for cases where it
doesn't make sense. For instance, this loop:
void foo(int a[4][8], int n)
{
    int b[4][8];
    for(int i = 0; i < 4; i++) {
        for(int j = 0; j < n; j++) {
            a[i][j] = b[i][j];
        }
    }
}
* The inner loop copies at most 8 ints. LLVM tries to use memcpy for the inner
loop. Calling memcpy is not helpful for such small moves, especially when the
outer loop is unrolled because its trip count is constant (4); the resulting 4
calls to memcpy are not efficient.
* Therefore, I disabled the memcpy optimization for such cases, and found that
the LLVM LoopVectorizer successfully vectorizes and unrolls the inner loop.
However, in order to take the fast path (vmovups) it must copy at least 32 ints,
whereas in this case we only copy 8 ints.
** Upon closer look, the LoopVectorizer obtains the trip count for the inner
loop using getSmallConstantTripCount(Loop,...). This value is 0 for a loop with
an unknown trip count. Loop unrolling is disabled when TC > 0. Should this be
changed to TC >= 0 (which does the job for this testcase)? Or is there a
better way to disable loop unrolling for such trivial loops, at least the ones
with known array size?

Thanks for your feedback

Sriram

--
Sriram Murali
SSG/DPD/ECDL/DMP
+1 (519) 772 - 2579

_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu
http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev


Arnold Schwaighofer

2013-Sep-27 17:47 UTC


[LLVMdev] Trip count and Loop Vectorizer

Sriram,

The problem is that you want to unroll/vectorize many loops with non-constant
loop count - it is a trade-off of which case you estimate as more likely.


int foo(int *ptr, int n) {
  for ( .. i <n)
    ptr[i] = ...
}

The question is: is it more likely to have “n” such that unrolling is
beneficial, or not?

Now, you could probably write an analysis that bounds the loop count (for the
purpose of this heuristic) based on the only possible legal access in the loop.
In your example you have an access to an alloca of which the size is known, so
you could infer that n must be smaller than 8 (because you know the range of the
other dimension). The question is how often does such an example occur, where
this is possible, to make such an effort justifiable?


Best,
Arnold
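
The bounding idea described above can be sketched as a standalone heuristic. This is only an illustration under stated assumptions, not an existing LLVM analysis: boundedTripCount and treatAsTiny are hypothetical names, and the extent of the accessed array dimension is passed in directly rather than derived from the IR. A value of 0 means "unknown" for both the trip count and the extent.

```cpp
// Threshold value quoted in the thread.
static const unsigned TinyTripCountVectorThreshold = 16;

// Hypothetical helper. ConstantTC is the constant trip count (0 = unknown).
// InnerExtent is the number of elements in the array dimension indexed by the
// loop's induction variable (8 for "int b[4][8]"), or 0 if it is not known.
// Since every access must stay in bounds, the loop runs at most InnerExtent
// iterations, so the extent can stand in for an unknown trip count.
unsigned boundedTripCount(unsigned ConstantTC, unsigned InnerExtent) {
  if (ConstantTC != 0u)
    return ConstantTC;  // exact count known, nothing to infer
  return InnerExtent;   // upper bound from the access, or 0 if unknown
}

// Apply the tiny-trip-count test to the bounded value instead of the raw one.
bool treatAsTiny(unsigned ConstantTC, unsigned InnerExtent) {
  unsigned Bound = boundedTripCount(ConstantTC, InnerExtent);
  return Bound > 0u && Bound < TinyTripCountVectorThreshold;
}
```

For the loop in this thread, the trip count is unknown (0) and the inner extent is 8, so the bounded value is 8 and the loop would be treated as tiny. A loop with unknown trip count and unknown extent keeps today's behavior.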


Arnold Schwaighofer

2013-Sep-27 17:54 UTC


[LLVMdev] Trip count and Loop Vectorizer

On Sep 27, 2013, at 12:47 PM, Arnold Schwaighofer <aschwaighofer at apple.com>
wrote:
> so you could infer that n must be smaller than 8 (because you know the
> range of the other dimension). The question is how often does such an example
> occur, where this is possible, to make such an effort justifiable?

smaller equal, of course ;)
