Renato Golin
2013-Oct-21 18:00 UTC
[LLVMdev] First attempt at recognizing pointer reduction
Hi Arnold, To sum up my intentions, I want to understand how the reduction/induction variable detection works in LLVM, so that I can know better how to detect different patterns in memory, not just the stride vectorization. For instance, even if the relationship between each loop would be complicated, I know that in each loop, all three reads are sequential. So, at least, I could use a Load<3> rather than three loads. I guess this is why Nadav was hinting for the SLP vectorizer, not the loop vec. Since the operation on all three loaded values are exactly the same AND the writes are also sequential, I can reduce that to: load<3> -> op<3> -> store<3>. That should work even on machines that don't have de-interleaved access (assuming they're pointers to simple types, etc). On 21 October 2013 17:29, Arnold Schwaighofer <aschwaighofer at apple.com>wrote:> can you post a hand-created vectorized IR of how a reduction would work on > your example? >I'm not there yet, just trying to understand the induction/reduction code first and what comes out of it. I think the right approach to get examples like this is that we need to> recognize strided pointer inductions (we only support strides by one > currently). >I see. I'll have a look at IK_PtrInduction and see what patterns I can spot. Do you envisage a new IK type for strided induction, or just work with the PtrInduction to extend the concept of a non-unit stride (ie. separate Step from ElementSize)? cheers, --renato -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131021/4320c3b6/attachment.html>
Arnold Schwaighofer
2013-Oct-21 19:58 UTC
[LLVMdev] First attempt at recognizing pointer reduction
On Oct 21, 2013, at 1:00 PM, Renato Golin <renato.golin at linaro.org> wrote:> Hi Arnold, > > To sum up my intentions, I want to understand how the reduction/induction variable detection works in LLVM, so that I can know better how to detect different patterns in memory, not just the stride vectorization.To detect memory access patterns you will want to look at the SCEV of a pointer (or at a set of SCEVs/pointers). This way you get a canonical form. For example these should be the SCEVs of “int a[2*i] = ; a[2*i+1] =”: {ptr, +, 8}_loop {ptr+4, +, 8}_loop Each access on its own requires a gather/scather (2 loads/stores when vectorized (VF=2) + inserts/extracts). But when we look at both at once we see that we only need two load/store in total (plus some interleaving operations). What other patterns (than strided accesses) do you want to vectorize?> > For instance, even if the relationship between each loop would be complicated, I know that in each loop, all three reads are sequential. So, at least, I could use a Load<3> rather than three loads.Yes. Detecting this is what is described in the paper I referred in a post before (Auto-vectorization of interleaved data for SIMD). And what gcc is doing (AFAICT). You want to take a set of accesses in the loop and recognize that they are consecutive in memory (currently we look at every access on it is own, if an access is not sequential we assume we have to gather/scather). Once you know that you have consecutive accesses spread across different instructions you can generate more efficient code instead of scather/gathers. You would want to do the evaluation of which accesses are consecutive in SCEV I think. For your example, you want to recognize strided accesses (this is separate from the induction/reduction mechanism), first: for (i = 0 .. n, +1) a[2*i] = … a[2*1+1] = … You want this part first because without it the loop vectorizer is not going to vectorize because of the cost of gather/scather. But if it can recognize that in cases like this the “gather/scather” is not as costly and we emit the right code we will start to vectorize such loops. Once we can do that. We need to support strided pointer inductions to get your example. for (i = 0..n, +1) *a++ *a++> I guess this is why Nadav was hinting for the SLP vectorizer, not the loop vec. Since the operation on all three loaded values are exactly the same AND the writes are also sequential, I can reduce that to: load<3> -> op<3> -> > store<3>. That should work even on machines that don't have de-interleaved access (assuming they're pointers to simple types, etc).Getting this example in the slp vectorizer is easier but we won’t get the most efficient code (i.e. the one that gcc emits) because we would have <3 x i8> stores/loads. With vectorization of interleaved data you can load/store more elements (from several iterations) with a single load.> > > On 21 October 2013 17:29, Arnold Schwaighofer <aschwaighofer at apple.com> wrote: > can you post a hand-created vectorized IR of how a reduction would work on your example? > > I'm not there yet, just trying to understand the induction/reduction code first and what comes out of it. > > > I think the right approach to get examples like this is that we need to recognize strided pointer inductions (we only support strides by one currently). > > I see. I'll have a look at IK_PtrInduction and see what patterns I can spot. > > Do you envisage a new IK type for strided induction, or just work with the PtrInduction to extend the concept of a non-unit stride (ie. separate Step from ElementSize)?Either representation should be fine. I think the bigger task is not recognizing the induction but recognizing consecutive strided memory accesses, though. First, I think we want to be able to do: for (i = 0 … n, +1) a[3*i] a[3*i+1] a[3*i+2] And next, for (i = 0 … n, +1) *a++ *a++ *a++ Because to get the latter, you need the former. Have you compared the performance of the kernel (gcc vectorized) you showed vs a scalar version? I would be curious about the speed-up. Thanks, Arnold
Renato Golin
2013-Oct-21 20:40 UTC
[LLVMdev] First attempt at recognizing pointer reduction
On 21 October 2013 20:58, Arnold Schwaighofer <aschwaighofer at apple.com>wrote:> For example these should be the SCEVs of “int a[2*i] = ; a[2*i+1] =”: > > {ptr, +, 8}_loop > {ptr+4, +, 8}_loop > > Each access on its own requires a gather/scather (2 loads/stores when > vectorized (VF=2) + inserts/extracts). But when we look at both at once we > see that we only need two load/store in total (plus some interleaving > operations). >Yes, I've been studying SCEV when trying to understand some other patterns where the vectorizer was unable to detect the exit count (basically this case, with a nested loop). It does make things easier to spot patterns in the code. The patch I attached here was not to help vectorize anything, but to let me jump over the validation step, so that I could start working with the patterns themselves during the actual vectorization. The review request was only to understand if the checks I was making made sense, but it turned out a lot more valuable than that. Getting this example in the slp vectorizer is easier but we won’t get the> most efficient code (i.e. the one that gcc emits) because we would have <3 > x i8> stores/loads. With vectorization of interleaved data you can > load/store more elements (from several iterations) with a single load. >So, this was the other patterns I was looking for, as a stepping stone into the full vectorizer. But I'm not sure this will help in any way the strided access. Either representation should be fine. I think the bigger task is not> recognizing the induction but recognizing consecutive strided memory > accesses, though. First, I think we want to be able to do: > > for (i = 0 … n, +1) > a[3*i] > a[3*i+1] > a[3*i+2] > > And next, > > for (i = 0 … n, +1) > *a++ > *a++ > *a++ > > Because to get the latter, you need the former. >Makes total sense. I'll change my approach. Have you compared the performance of the kernel (gcc vectorized) you showed> vs a scalar version? I would be curious about the speed-up. >4x faster, on both Cortex A9 and A15. :) Thanks for the tips, I hope I can find more time to work on it this week, since Linaro Connect is in the coming week and the US dev meeting is on the next. cheers, --renato -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20131021/c0d72575/attachment.html>
Possibly Parallel Threads
- [LLVMdev] First attempt at recognizing pointer reduction
- [LLVMdev] First attempt at recognizing pointer reduction
- [LLVMdev] LLVMdev Digest, Vol 112, Issue 56
- [LLVMdev] First attempt at recognizing pointer reduction
- [LLVMdev] First attempt at recognizing pointer reduction