We have a framework for simple heuristics in place:
1.) We vectorize loops with unknown trip count.
2.) You can find our partial unroll heuristic in selectUnrollFactor. It probably
needs tuning.
// We unroll the loop in order to expose ILP and reduce the loop overhead.
// There are many micro-architectural considerations that we can't predict
// at this level. For example frontend pressure (on decode or fetch) due to
// code size, or the number and capabilities of the execution ports.
//
// We use the following heuristics to select the unroll factor:
// 1. If the code has reductions then we unroll in order to break the cross
// iteration dependency.
// 2. If the loop is really small then we unroll in order to reduce the loop
// overhead.
<<< This is the heuristic that works against your example.
// 3. We don't unroll if we think that we will spill registers to memory due
// to the increased register pressure.
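
To make the three rules above concrete, here is a minimal sketch of how they
could fit together. This is not the code in selectUnrollFactor: LoopSummary,
TargetSummary, sketchSelectUnrollFactor and every threshold below are
hypothetical names and values that I made up for illustration, and the real
cost model may weigh these inputs quite differently.

```cpp
#include <algorithm>

// Hypothetical description of a loop (not an LLVM type).
struct LoopSummary {
  bool HasReductions;     // loop carries a reduction, e.g. a running sum
  unsigned LoopCost;      // estimated cost of one loop iteration
  unsigned MaxLiveValues; // peak number of simultaneously live values
};

// Hypothetical description of the target (not an LLVM type).
struct TargetSummary {
  unsigned NumRegisters;    // registers available to the loop body
  unsigned MaxUnrollFactor; // upper bound on the unroll factor
  unsigned SmallLoopCost;   // loops cheaper than this count as "really small"
};

// Largest power of two that is <= N (expects N >= 1).
static unsigned powerOfTwoFloor(unsigned N) {
  unsigned P = 1;
  while (P <= N / 2)
    P *= 2;
  return P;
}

unsigned sketchSelectUnrollFactor(const LoopSummary &L,
                                  const TargetSummary &T) {
  // Rule 3: estimate how many copies of the body fit in registers. If not
  // even two copies fit, unrolling would likely spill, so don't unroll.
  unsigned Budget = T.NumRegisters / std::max(1u, L.MaxLiveValues);
  if (Budget <= 1)
    return 1;

  unsigned UF = std::min(powerOfTwoFloor(Budget), T.MaxUnrollFactor);
  if (UF <= 1)
    return 1;

  // Rule 1: reductions benefit from unrolling because it breaks the
  // cross-iteration dependency.
  if (L.HasReductions)
    return UF;

  // Rule 2: really small loops are unrolled to reduce loop overhead.
  if (L.LoopCost < T.SmallLoopCost)
    return UF;

  return 1;
}
```

Note that nothing in this sketch looks at the trip count. A cheap loop body
with no reductions still gets unrolled by rule 2, which is the behavior that
hurts a loop that only runs a handful of iterations.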
On Sep 27, 2013, at 2:21 PM, Murali, Sriram <sriram.murali at intel.com> wrote:
> Hey Arnold,
> I have run into this situation many times while benchmarking.
> I think it is best if this is addressed using a simple heuristic. For that,
> we need to identify the loop cost and decide if it makes sense to completely
> unroll the loop, or partially unroll. I am unsure of the optimal way to
> implement this though.
>
> I want to run it by the list to get any ideas floating around :)
> Thanks
> Sriram
>
> -----Original Message-----
> From: Arnold Schwaighofer [mailto:aschwaighofer at apple.com]
> Sent: Friday, September 27, 2013 1:54 PM
> To: Murali, Sriram
> Cc: llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] Trip count and Loop Vectorizer
>
>
> On Sep 27, 2013, at 12:47 PM, Arnold Schwaighofer <aschwaighofer at apple.com> wrote:
>
>> so you could infer that n must be smaller than 8 (because you know the
>> range of the other dimension). The question is how often does such an example
>> occur, where this is possible, to make such an effort justifiable?
> smaller than or equal, of course ;)