Preston Briggs
2013-Feb-04 18:42 UTC
[LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop
---------- Forwarded message ----------
From: Tobias Grosser <tobias at grosser.es>
To: Dmitry Mikushin <dmitry at kernelgen.org>
Cc: polly-dev <polly-dev at googlegroups.com>, LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
Date: Mon, 04 Feb 2013 17:53:11 +0100
Subject: Re: [LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop

> On 02/04/2013 04:55 PM, Dmitry Mikushin wrote:
>> Hi Tobi,
>>
>> Thanks for looking into this!
>>
>> 2013/2/4 Tobias Grosser <tobias at grosser.es>
>>
>>> In any case, you seem to have in some way convinced Polly to
>>> accept this code. Would you mind sharing what you did?
>>
>> Sure. Aliasing is simply ignored. Instead we have substituted pointers
>> and sizes for arrays and a special pass that converts memory accesses
>> from every scop statement into ISL general form.
>
> Interesting, why did you do this?
>
>> Sorry, we are quite far from the standard Polly invocation process;
>> maybe I should prepare a simplified plugin for testing purposes...
>
> Or at least state that this is not standard Polly. ;-) Then I would not
> keep trying to reproduce the problem.
>
> Regarding your problem. As far as I understand, the problem is that
> the following code:
>
>   for (i)
>     A[i] = 0
>     for (j)
>       A[i] += ...
>     ... = A[i]
>
> is changed by gcc (and other optimizations) to:
>
>   for (i)
>     A[i] = 0
>     tmp = A[i]
>     for (j)
>       tmp += ...
>     A[i] = tmp
>     ... = A[i]
>
>> Yes, exactly!
>
> This is a common optimization that unfortunately introduces a lot of
> dependences on tmp that block any direct parallelization. To
> parallelize the loop anyway, we would need to expand the memory of
> tmp such that each parallel thread has a private copy of 'tmp'.
> Deciding where and how to expand memory to enable further
> transformations is in general difficult, so I would normally
> run Polly before such optimizations are performed.
> Though, in your case it may still be possible to parallelize the
> loop. To do this, you would need to ignore all dependences that can
> be removed by creating thread-private copies of 'tmp'. If you are
> interested in having this implemented, either open a bug report or
> give it a try yourself. I am happy to assist.
>
>> Hm, how to create thread-private copies of tmp at that point, and how
>> useful would it be?

Many auto-vectorizing compilers do such a thing (sometimes called
"scalar expansion") as a way to attack otherwise parallel loops.

> There are different approaches for how to create thread-private copies.
> You could aim to get back to the code in '1.ll' by creating an array
> with as many elements as there are iterations in the loop. That would
> be very general and independent of the platform-specific parallel
> implementation. However, such a general implementation may introduce a
> lot of additional memory, which requires us to be careful with such
> transformations.

A better solution is to create an array with as many elements as there
are threads, or, in the case of vectorization, as many elements as the
length of the vector. Much less expensive!

>> The problem is that the platform-dependent view of threads only steps
>> into the process once the loop is proved to be parallel. Before that,
>> as far as I know, Polly/CLooG/ISL can't be aware of such things.
>
> True. In general, Polly does not know how parallelism is implemented
> exactly, though we can perform some preparing transformations.
>
>> I thought more about pushing AllocaInst-s down closer to the nested
>> array header - would that work?
>
> I don't think pushing AllocaInst-s into the loop is a better solution.
> In terms of memory behavior,
>
>   int A[n];   // stack allocated
>   for i
>     A[i]
>
> and
>
>   for i
>     int *A = alloca(sizeof(int))
>     *A
>
> are the same. We really do not want to allocate a full temporary array
> on the stack.
> I see three options:
>
> 0) We avoid optimizations that yield '2.ll'
>
>    If we can (in some way) retain the version '1.ll', this is the
>    best case and we should try hard to stay here.
>
> 1) We malloc a temporary array
>
>    In some cases it may be OK to do full memory expansion, e.g.
>    by allocating the temporary array on the heap.
>
> 2) Mark loops conditionally parallel
>
>    We could mark loops parallel under the condition that certain
>    memory references are privatized. For your use case, you could
>    use the new loop.parallel metadata and mark loops partially
>    parallel. Meaning, you only mark the accesses parallel that
>    are not involved in loop-carried dependences. The remaining
>    accesses then either need to be privatized or they need to
>    be removed by mem2reg. For your example, mem2reg should remove
>    the arrays, and the loop is then detected as parallel.
>
> Cheers
> Tobi
Tobias Grosser
2013-Feb-04 20:57 UTC
[LLVMdev] [Polly] Parallelizing outer loop containing inner reduction loop
On 02/04/2013 07:42 PM, Preston Briggs wrote:
> Many auto-vectorizing compilers do such a thing (sometimes called
> "scalar expansion") as a way to attack otherwise parallel loops.
>
> [...]
>
> A better solution is to create an array with as many elements as there
> are threads, or, in the case of vectorization, as many elements as the
> length of the vector. Much less expensive!

Very true. If possible we should go this way.

Tobi