Dmitry Mikushin
2013-Feb-03 17:55 UTC
[LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop
Oops, sorry for the message title, making it more descriptive now...

2013/2/3 Dmitry Mikushin <dmitry at kernelgen.org>:
> Dear all,
>
> Yesterday, in a customer's code, I came across a relatively simple case
> where Polly is unable to detect a parallelizable outer loop. Consider two
> versions of a Fortran routine:
>
> 1)
>
> subroutine filter(H, szh, X, szx, Y, szy)
>
>   implicit none
>   integer(kind=IKIND), intent(in) :: szh, szx, szy
>   real(kind=RKIND), intent(in) :: H(szh), X(szx)
>   real(kind=RKIND), intent(out) :: Y(szy)
>   integer(kind=IKIND) :: i, j, i1, i2
>   integer, parameter :: rkind = RKIND
>
>   do i = 1, szy
>     Y(i) = 0.0_rkind
>     do j = 1, szh
>       Y(i) = Y(i) + X(i + j - 1) * H(j)
>     enddo
>   enddo
>
> end subroutine filter
>
> 2)
>
> subroutine filter(H, szh, X, szx, Y, szy)
>
>   implicit none
>   integer(kind=IKIND), intent(in) :: szh, szx, szy
>   real(kind=RKIND), intent(in) :: H(szh), X(szx)
>   real(kind=RKIND), intent(out) :: Y(szy)
>   integer(kind=IKIND) :: i, j, i1, i2
>   integer, parameter :: rkind = RKIND
>
>   do i = 1, szx - szh
>     Y(i) = 0.0_rkind
>     i1 = i
>     i2 = i + szh - 1
>     Y(i) = Y(i) + sum(X(i1:i2) * H)
>   enddo
>
> end subroutine filter
>
> The only difference between the two versions is how the inner reduction
> loop is implemented: in the first case it is explicit, while the second
> case uses the sum intrinsic, which is lowered to plain code by DragonEgg.
> In the first case Polly successfully detects the parallel outer loop; in
> the second case it cannot.
>
> Now let's see how the IR looks when it arrives at the beginning of the
> Polly pass manager in KernelGen: 1.ll and 2.ll. As far as I can see, the
> only difference between the two IRs is that in 1.ll the memory location of
> the accumulating element is updated after each fma in the inner loop,
> while in 2.ll the accumulator is updated only after the end of the inner
> loop, producing a PHI node (or reg2mem allocas, with Polly's preopt
> passes). Interestingly, if you optimize 1.ll further, it turns into 2.ll
> as well.
>
> After some attempts to modify Polly's behavior, I concluded that it cannot
> cope with the presence of a non-IV PHI node. Polly attempts to produce
> independent basic blocks, trying to gather dependent instructions into a
> single block and allocating scratch variables. The allocas make the
> situation even worse, because they are always placed in the first block of
> the function, completely killing the potential parallelism.
>
> Do you see any possible workarounds in this case? Maybe the allocas could
> be placed inside the inner loop to give it a chance to be parallel?
>
> Thanks,
> - D.
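For reference, a rough C++ sketch of the two inner-loop shapes described above. The function names and signatures are invented purely for illustration; the actual inputs are the LLVM IR files 1.ll and 2.ll produced by DragonEgg.

// 1.ll: the accumulator lives in memory; Y[i] is loaded and stored on
// every inner iteration, so there is no loop-carried register value.
void inner_v1(double *Y, const double *X, const double *H,
              long i, long szh) {
  for (long j = 0; j < szh; ++j)
    Y[i] = Y[i] + X[i + j] * H[j];
}

// 2.ll: the accumulator is carried in a register; in SSA form that
// register becomes a PHI node in the inner-loop header, and Y[i] is
// written only once after the loop. Optimizing 1.ll with -O3 yields
// essentially the same shape.
void inner_v2(double *Y, const double *X, const double *H,
              long i, long szh) {
  double acc = 0.0;
  for (long j = 0; j < szh; ++j)
    acc += X[i + j] * H[j];
  Y[i] = acc;
}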
Tobias Grosser
2013-Feb-04 15:24 UTC
[LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop
Hi Dmitry,

[FORTRAN code]
> The only difference between the two versions is how the inner reduction
> loop is implemented: in the first case it is explicit, while the second
> case uses the sum intrinsic, which is lowered to plain code by DragonEgg.
> In the first case Polly successfully detects the parallel outer loop; in
> the second case it cannot.

Thanks for the test case.

> After some attempts to modify Polly's behavior, I concluded that it
> cannot cope with the presence of a non-IV PHI node. Polly attempts to
> produce independent basic blocks, trying to gather dependent instructions
> into a single block and allocating scratch variables. The allocas make
> the situation even worse, because they are always placed in the first
> block of the function, completely killing the potential parallelism.

As a first step, I tried to reproduce your observations. When optimizing the kernels with Polly, possible aliasing prevents Polly from detecting the scop:

$ polly-opt -O3 -polly 1.ll -debug-only=polly-detect
Checking region: Loop Function Root => <Function Return>
    Top level region is invalid
Checking region: 3.cloned.header => return.loopexit.exitStub
    Possible aliasing: "[0 x double]* inttoptr (i64 47255179264 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)"
Checking region: 5.cloned => 6.cloned
    Possible aliasing: "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246749696 to [0 x double]*)"

$ polly-opt -O3 -polly 2.ll -debug-only=polly-detect
Checking region: Loop Function Root => <Function Return>
    Top level region is invalid
Checking region: 3.cloned.header => return.loopexit.exitStub
    Possible aliasing: "[0 x double]* inttoptr (i64 47255179264 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)"
Checking region: 4.cloned => 5.cloned
    Possible aliasing: "[0 x double]* inttoptr (i64 47255179264 to [0 x double]*)", "[0 x double]* inttoptr (i64 47246786560 to [0 x double]*)"

The problematic aliasing is due to these inttoptr expressions to fixed base addresses. The basic alias analysis cannot analyze them at all. The SCEV alias analysis understands that the base addresses cannot alias, but it is impossible to prove complete independence of any derived address. Hence, to reason about the addresses we would need to teach Polly to understand that there is no aliasing for the subset of the memory space that is accessed within the scop.

In any case, you seem to have convinced Polly to accept this code somehow. Would you mind sharing what you did?

Regarding your problem: as far as I understand, the following code

for (i)
    A[i] = 0
    for (j)
        A[i] += ...
    ... = A[i]

is changed by gcc (and other optimizations) to:

for (i)
    A[i] = 0
    tmp = A[i]
    for (j)
        tmp += ...
    A[i] = tmp
    ... = A[i]

This is a common optimization that unfortunately introduces a lot of dependences on tmp which block any direct parallelization. To parallelize the loop anyway, we would need to expand the memory of tmp, such that each parallel thread has a private copy of 'tmp'. Deciding where and how to expand memory to enable further transformations is in general difficult, so I would normally run Polly before such optimizations are performed. Though, in your case it may still be possible to parallelize the loop. To do this, you would need to ignore all dependences that can be removed by creating thread-private copies of 'tmp'. If you are interested in having this implemented, either open a bug report or give it a try yourself. I am happy to assist.

Cheers,
Tobi
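To make the suggested direction concrete, here is a minimal hand-written sketch (not Polly output) of what expanding 'tmp' into thread-private copies buys: once each thread owns its accumulator, the only remaining memory accesses are to distinct Y elements plus the read-only X and H, and the outer loop parallelizes cleanly. The C++/OpenMP form of the kernel is illustrative only and assumes compilation with -fopenmp.

// Illustrative only: an OpenMP version of the kernel in which the scalar
// accumulator is private to each outer iteration. Polly would have to
// achieve the equivalent effect by expanding 'tmp' (or ignoring the
// dependences it carries) before parallelizing the outer loop.
void filter(const double *H, long szh, const double *X,
            double *Y, long szy) {
#pragma omp parallel for
  for (long i = 0; i < szy; ++i) {
    double tmp = 0.0;               // thread-private accumulator
    for (long j = 0; j < szh; ++j)
      tmp += X[i + j] * H[j];       // X(i + j - 1) * H(j) in 1-based Fortran
    Y[i] = tmp;                     // single store per outer iteration
  }
}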
Dmitry Mikushin
2013-Feb-04 15:55 UTC
[LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop
Hi Tobi,

Thanks for looking into this!

2013/2/4 Tobias Grosser <tobias at grosser.es>:
>
> In any case, you seem to have convinced Polly to accept this code
> somehow. Would you mind sharing what you did?

Sure. Aliasing is simply ignored. Instead, we substitute the pointers and sizes of the arrays, and a special pass converts the memory accesses of every scop statement into ISL general form. Sorry, we are quite far from the standard Polly invocation process; maybe I should prepare a simplified plugin for testing purposes...

> Regarding your problem: as far as I understand, the following code
>
> for (i)
>     A[i] = 0
>     for (j)
>         A[i] += ...
>     ... = A[i]
>
> is changed by gcc (and other optimizations) to:
>
> for (i)
>     A[i] = 0
>     tmp = A[i]
>     for (j)
>         tmp += ...
>     A[i] = tmp
>     ... = A[i]

Yes, exactly!

> This is a common optimization that unfortunately introduces a lot of
> dependences on tmp which block any direct parallelization. To parallelize
> the loop anyway, we would need to expand the memory of tmp, such that
> each parallel thread has a private copy of 'tmp'. Deciding where and how
> to expand memory to enable further transformations is in general
> difficult, so I would normally run Polly before such optimizations are
> performed. Though, in your case it may still be possible to parallelize
> the loop. To do this, you would need to ignore all dependences that can
> be removed by creating thread-private copies of 'tmp'. If you are
> interested in having this implemented, either open a bug report or give
> it a try yourself. I am happy to assist.

Hm, how would we create thread-private copies of tmp at that point, and how useful would it be? The problem is that the platform-dependent view of threads only enters the picture once the loop has been proved parallel. Before that, as far as I know, Polly/CLooG/ISL can't be aware of such things. I thought instead about pushing the AllocaInst-s down, closer to the nested loop header - would that work?

Thanks,
- D.
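For what it's worth, the alloca-sinking idea could be prototyped along the following lines. This is only a sketch against a recent LLVM C++ API (method names may differ in the 2013-era tree discussed here), the helper name is invented, and it does not answer whether Polly's dependence analysis would actually benefit.

// Hypothetical helper, not an existing Polly/KernelGen pass: sink
// entry-block allocas whose every use lies inside a given loop down into
// that loop's preheader, so the scratch variable is at least created next
// to the loop it serves.
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

static void sinkAllocasIntoLoopPreheader(Function &F, Loop *L) {
  BasicBlock *Preheader = L->getLoopPreheader();
  if (!Preheader)
    return; // No safe single insertion point without a preheader.

  SmallVector<AllocaInst *, 8> Candidates;
  for (Instruction &I : F.getEntryBlock()) {
    auto *AI = dyn_cast<AllocaInst>(&I);
    if (!AI)
      continue;
    // Only consider allocas that are used exclusively inside the loop.
    bool AllUsesInLoop = true;
    for (User *U : AI->users()) {
      auto *UI = dyn_cast<Instruction>(U);
      if (!UI || !L->contains(UI)) {
        AllUsesInLoop = false;
        break;
      }
    }
    if (AllUsesInLoop)
      Candidates.push_back(AI);
  }

  // Note: a preheader runs once per entry into the loop, so for an inner
  // loop this re-allocates stack space on every outer iteration; whether
  // that is acceptable (or helpful to Polly) is an open question.
  for (AllocaInst *AI : Candidates)
    AI->moveBefore(Preheader->getTerminator());
}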