Dmitry Mikushin
2013-Feb-03 17:55 UTC
[LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop
Oops, sorry for the message title, making it more descriptive now...

2013/2/3 Dmitry Mikushin <dmitry at kernelgen.org>:
> Dear all,
>
> Yesterday, in a customer's code, I came across a relatively simple case
> where Polly is unable to detect a parallelizable outer loop. Consider two
> versions of a Fortran routine:
>
> 1)
>
> subroutine filter(H, szh, X, szx, Y, szy)
>
>   implicit none
>   integer(kind=IKIND), intent(in) :: szh, szx, szy
>   real(kind=RKIND), intent(in) :: H(szh), X(szx)
>   real(kind=RKIND), intent(out) :: Y(szy)
>   integer(kind=IKIND) :: i, j, i1, i2
>   integer, parameter :: rkind = RKIND
>
>   do i = 1, szy
>     Y(i) = 0.0_rkind
>     do j = 1, szh
>       Y(i) = Y(i) + X(i + j - 1) * H(j)
>     enddo
>   enddo
>
> end subroutine filter
>
> 2)
>
> subroutine filter(H, szh, X, szx, Y, szy)
>
>   implicit none
>   integer(kind=IKIND), intent(in) :: szh, szx, szy
>   real(kind=RKIND), intent(in) :: H(szh), X(szx)
>   real(kind=RKIND), intent(out) :: Y(szy)
>   integer(kind=IKIND) :: i, j, i1, i2
>   integer, parameter :: rkind = RKIND
>
>   do i = 1, szx - szh
>     Y(i) = 0.0_rkind
>     i1 = i
>     i2 = i + szh - 1
>     Y(i) = Y(i) + sum(X(i1:i2) * H)
>   enddo
>
> end subroutine filter
>
> The only difference between the two versions is how the inner reduction
> loop is implemented: in the first case it is explicit, while the second
> case uses the sum intrinsic, which is lowered to plain code by DragonEgg.
> In the first case Polly successfully detects the parallel outer loop; in
> the second case it cannot.
>
> Now let's see how the IR looks when it arrives at the beginning of the
> Polly pass manager in KernelGen: 1.ll and 2.ll. As far as I can see, the
> only difference between the two IRs is that in 1.ll the memory location of
> the accumulating element is updated after each fma in the inner loop,
> while in 2.ll the accumulator is updated only after the end of the inner
> loop, producing a PHI node (or reg2mem allocas, with Polly's preopt
> passes). Interestingly, if you optimize 1.ll further, it turns into 2.ll
> as well.
>
> After some attempts to modify Polly's behavior, I concluded that it cannot
> cope with the presence of a non-IV PHI node. Polly attempts to produce
> independent basic blocks, trying to gather dependent instructions into a
> single block and allocating scratch variables. The allocas make the
> situation even worse, because they are always placed in the first block of
> the function, completely killing the potential parallelism.
>
> Do you see any possible workarounds in this case? Maybe the allocas could
> be placed inside the inner loop to give it a chance to be parallel?
>
> Thanks,
> - D.
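For reference, a rough C++ sketch of the two inner-loop shapes described above. The function names and signatures are invented purely for illustration; the actual inputs are the LLVM IR files 1.ll and 2.ll produced by DragonEgg.

// 1.ll: the accumulator lives in memory; Y[i] is loaded and stored on
// every inner iteration, so there is no loop-carried register value.
void inner_v1(double *Y, const double *X, const double *H,
              long i, long szh) {
  for (long j = 0; j < szh; ++j)
    Y[i] = Y[i] + X[i + j] * H[j];
}

// 2.ll: the accumulator is carried in a register; in SSA form that
// register becomes a PHI node in the inner-loop header, and Y[i] is
// written only once after the loop. Optimizing 1.ll with -O3 yields
// essentially the same shape.
void inner_v2(double *Y, const double *X, const double *H,
              long i, long szh) {
  double acc = 0.0;
  for (long j = 0; j < szh; ++j)
    acc += X[i + j] * H[j];
  Y[i] = acc;
}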
Tobias Grosser
2013-Feb-04 15:24 UTC
[LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop
Hi Dmitry,

[FORTRAN code]
> The only difference between the two versions is how the inner reduction
> loop is implemented: in the first case it is explicit, while the second
> case uses the sum intrinsic, which is lowered to plain code by DragonEgg.
> In the first case Polly successfully detects the parallel outer loop; in
> the second case it cannot.

Thanks for the test case.

> After some attempts to modify Polly's behavior, I concluded that it
> cannot cope with the presence of a non-IV PHI node. Polly attempts to
> produce independent basic blocks, trying to gather dependent instructions
> into a single block and allocating scratch variables. The allocas make
> the situation even worse, because they are always placed in the first
> block of the function, completely killing the potential parallelism.

As a first step, I tried to reproduce your observations. When optimizing the kernels with Polly, possible aliasing prevents Polly from detecting the scop:

$ polly-opt -O3 -polly 1.ll -debug-only=polly-detect
Checking region: Loop Function Root => <Function Return>
    Top level region is invalid
Checking region: 3.cloned.header => return.loopexit.exitStub
    Possible aliasing: "[0 x double]* inttoptr (i64 47255179264 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)"
Checking region: 5.cloned => 6.cloned
    Possible aliasing: "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246749696 to [0 x double]*)"

$ polly-opt -O3 -polly 2.ll -debug-only=polly-detect
Checking region: Loop Function Root => <Function Return>
    Top level region is invalid
Checking region: 3.cloned.header => return.loopexit.exitStub
    Possible aliasing: "[0 x double]* inttoptr (i64 47255179264 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)"
Checking region: 4.cloned => 5.cloned
    Possible aliasing: "[0 x double]* inttoptr (i64 47255179264 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)"

The problematic aliasing is due to these inttoptr expressions to fixed base addresses. The basic alias analysis cannot analyze them at all. The SCEV alias analysis understands that the base addresses cannot alias, but it is impossible to prove complete independence of any derived address. Hence, to reason about the addresses we would need to teach Polly to understand that there is no aliasing for the subset of the memory space that is accessed within the scop.

In any case, you seem to have convinced Polly to accept this code somehow. Would you mind sharing what you did?

Regarding your problem: as far as I understand, the following code

for (i)
    A[i] = 0
    for (j)
        A[i] += ...
    ... = A[i]

is changed by gcc (and other optimizations) to:

for (i)
    A[i] = 0
    tmp = A[i]
    for (j)
        tmp += ...
    A[i] = tmp
    ... = A[i]

This is a common optimization that unfortunately introduces a lot of dependences on tmp which block any direct parallelization. To parallelize the loop anyway, we would need to expand the memory of tmp, such that each parallel thread has a private copy of 'tmp'. Deciding where and how to expand memory to enable further transformations is in general difficult, so I would normally run Polly before such optimizations are performed. Though, in your case it may still be possible to parallelize the loop. To do this, you would need to ignore all dependences that can be removed by creating thread-private copies of 'tmp'. If you are interested in having this implemented, either open a bug report or give it a try yourself. I am happy to assist.

Cheers,
Tobi
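To make the suggested direction concrete, here is a minimal hand-written sketch (not Polly output) of what expanding 'tmp' into thread-private copies buys: once each thread owns its accumulator, the only remaining memory accesses are to distinct Y elements plus the read-only X and H, and the outer loop parallelizes cleanly. The C++/OpenMP form of the kernel is illustrative only and assumes compilation with -fopenmp.

// Illustrative only: an OpenMP version of the kernel in which the scalar
// accumulator is private to each outer iteration. Polly would have to
// achieve the equivalent effect by expanding 'tmp' (or ignoring the
// dependences it carries) before parallelizing the outer loop.
void filter(const double *H, long szh, const double *X,
            double *Y, long szy) {
#pragma omp parallel for
  for (long i = 0; i < szy; ++i) {
    double tmp = 0.0;               // thread-private accumulator
    for (long j = 0; j < szh; ++j)
      tmp += X[i + j] * H[j];       // X(i + j - 1) * H(j) in 1-based Fortran
    Y[i] = tmp;                     // single store per outer iteration
  }
}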
Dmitry Mikushin
2013-Feb-04 15:55 UTC
[LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop
Hi Tobi,

Thanks for looking into this!

2013/2/4 Tobias Grosser <tobias at grosser.es>:
>
> In any case, you seem to have convinced Polly to accept this code
> somehow. Would you mind sharing what you did?

Sure. Aliasing is simply ignored. Instead, we substitute the pointers and sizes of the arrays, and a special pass converts the memory accesses of every scop statement into ISL general form. Sorry, we are quite far from the standard Polly invocation process; maybe I should prepare a simplified plugin for testing purposes...

> Regarding your problem: as far as I understand, the following code
>
> for (i)
>     A[i] = 0
>     for (j)
>         A[i] += ...
>     ... = A[i]
>
> is changed by gcc (and other optimizations) to:
>
> for (i)
>     A[i] = 0
>     tmp = A[i]
>     for (j)
>         tmp += ...
>     A[i] = tmp
>     ... = A[i]

Yes, exactly!

> This is a common optimization that unfortunately introduces a lot of
> dependences on tmp which block any direct parallelization. To parallelize
> the loop anyway, we would need to expand the memory of tmp, such that
> each parallel thread has a private copy of 'tmp'. Deciding where and how
> to expand memory to enable further transformations is in general
> difficult, so I would normally run Polly before such optimizations are
> performed. Though, in your case it may still be possible to parallelize
> the loop. To do this, you would need to ignore all dependences that can
> be removed by creating thread-private copies of 'tmp'. If you are
> interested in having this implemented, either open a bug report or give
> it a try yourself. I am happy to assist.

Hm, how would we create thread-private copies of tmp at that point, and how useful would it be? The problem is that the platform-dependent view of threads only enters the picture once the loop has been proved parallel. Before that, as far as I know, Polly/CLooG/ISL can't be aware of such things. I thought instead about pushing the AllocaInst-s down, closer to the nested loop header - would that work?

Thanks,
- D.
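For what it's worth, the alloca-sinking idea could be prototyped along the following lines. This is only a sketch against a recent LLVM C++ API (method names may differ in the 2013-era tree discussed here), the helper name is invented, and it does not answer whether Polly's dependence analysis would actually benefit.

// Hypothetical helper, not an existing Polly/KernelGen pass: sink
// entry-block allocas whose every use lies inside a given loop down into
// that loop's preheader, so the scratch variable is at least created next
// to the loop it serves.
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

static void sinkAllocasIntoLoopPreheader(Function &F, Loop *L) {
  BasicBlock *Preheader = L->getLoopPreheader();
  if (!Preheader)
    return; // No safe single insertion point without a preheader.

  SmallVector<AllocaInst *, 8> Candidates;
  for (Instruction &I : F.getEntryBlock()) {
    auto *AI = dyn_cast<AllocaInst>(&I);
    if (!AI)
      continue;
    // Only consider allocas that are used exclusively inside the loop.
    bool AllUsesInLoop = true;
    for (User *U : AI->users()) {
      auto *UI = dyn_cast<Instruction>(U);
      if (!UI || !L->contains(UI)) {
        AllUsesInLoop = false;
        break;
      }
    }
    if (AllUsesInLoop)
      Candidates.push_back(AI);
  }

  // Note: a preheader runs once per entry into the loop, so for an inner
  // loop this re-allocates stack space on every outer iteration; whether
  // that is acceptable (or helpful to Polly) is an open question.
  for (AllocaInst *AI : Candidates)
    AI->moveBefore(Preheader->getTerminator());
}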