On 02/06/2012 10:02 PM, Preston Briggs wrote:
> On Mon, Feb 6, 2012 at 12:13 PM, Sebastian Pop <spop at codeaurora.org> wrote:
>> [many things, but I'm only going to focus on one of them] Would
>> you consider using Polly http://polly.grosser.es to avoid writing
>> this code?
>
> My impression is that Polly (and polyhedral analysis generally)
> doesn't do what I want. But I'm happy to talk about it 'cause I
> might be wrong.

That depends what you want. ;-) Polly may offer good possibilities
(and maybe even code) or it may not help at all. The best is to
discuss this stuff.

> If I look at all the ways I know how to sort in parallel, they all
> look roughly like this simple counting sort:
>
>   for (unsigned i = 0; i < buckets; i++)
>     count[i] = 0;
>
>   for (unsigned i = 0; i < n; i++)
>     count[src[i]]++;
>
>   start[0] = 0;
>   for (unsigned i = 1; i < buckets; i++)
>     start[i] = start[i - 1] + count[i - 1];
>
>   #pragma assert parallel
>   for (unsigned i = 0; i < n; i++) {
>     unsigned loc = int_fetch_add(start + src[i], 1);

Should this be:

    unsigned loc = int_fetch_add(start[src[i]], 1);

>     dst[loc] = src[i];
>   }
>
> The 1st loop is trivially parallel. I think Polly would recognize
> this and do good things.

This case is trivial. But keep in mind that unsigned loop ivs and
integers can cause modulo wrapping, which is not trivial to handle.
Both Polly and any other tool need to take into account additional
(possibly non-uniform) dependences that may appear in addition to the
dependences for non-wrapping integers. (In LLVM even signed integers
can have modulo wrapping, so we actually always need to be careful
here.) isl can actually represent modulo constraints pretty well.

> The 2nd loop has a race condition that can be handled by using an
> atomic increment provided by the architecture, if the compiler knows
> about such things. I don't think Polly would help here.
>
>   for (unsigned i = 0; i < n; i++)
>     count[src[i]]++;

I do not see why it should be difficult to teach polyhedral tools to
model an increment and to understand that it can be done atomically.
Though, if you just want to do this specific transformation, Polly is
probably overkill. The big selling point of polyhedral transformations
is the uniform handling of several transformations. In case you want
additional loop transformations to expose such patterns, Polly might
be helpful.

Another reason may be GPU code generation. Here you often use
techniques like tiling to create the thread groups. Within the
polyhedral model this is a very simple transformation; on LLVM-IR it
may become complex.
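For illustration, a minimal sketch of such a tiling applied to the
first loop might look as follows (my sketch, not code from your mail;
it assumes buckets is a multiple of the tile size):

    /* Sketch only: each tile could become one GPU thread group.
     * Assumes buckets % TILE == 0. */
    #define TILE 256
    for (unsigned t = 0; t < buckets; t += TILE)  /* one thread group per tile */
      for (unsigned i = t; i < t + TILE; i++)     /* one thread per element */
        count[i] = 0;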
> The 3rd loop is a linear recurrence that can be rewritten and solved
> in parallel, if the compiler knows about such things. I don't think
> Polly would help here.

Polly does not perform this transformation yet. But it may still be
helpful, as it abstracts away the low-level details of LLVM-IR and
allows you to focus on the higher-level problem.

>   start[0] = 0;
>   for (unsigned i = 1; i < buckets; i++)
>     start[i] = start[i - 1] + count[i - 1];

The previous code will be translated to:

  Statements: Init[]; Body[i] : 1 <= i < buckets

  Writes: Init[] -> start[0]; Body[i] -> start[i];
  Reads:  Body[i] -> start[i-1]; Body[i] -> count[i-1];

In addition, we extract the information about the operation performed
by Body. The idea is to move away from low-level pattern matching and
to use a high-level approach to implement the transformations proposed
by you. For code that does not fit into the polyhedral model, this is
obviously not possible, but the examples you have given so far can be
represented easily.

> The 4th loop relies on the programmer to use both an explicit
> assertion and an explicit fetch-and-add intrinsic. I think both of
> these are outside of Polly's scope.

We do not yet use this within Polly, but I think asking the programmer
to provide additional knowledge (when available) is more than
reasonable.

> #pragma assert parallel
> for (unsigned i = 0; i < n; i++) {
>   unsigned loc = int_fetch_add(start[src[i]], 1);
>   dst[loc] = src[i];
> }

As int_fetch_add is side-effect free, it is fully polyhedral. It can
be implemented with the relevant LLVM atomicrmw instruction [1]. Polly
does not yet allow atomicrmw instructions, but adding support for them
should not be too difficult.
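For illustration, a minimal sketch of loop 4 written with the
GCC/Clang __atomic_fetch_add builtin, which Clang can lower to exactly
such an atomicrmw add (assuming your int_fetch_add returns the value
held before the add):

    /* Sketch only: the builtin returns the old value of start[src[i]],
     * so each element still gets a unique slot in dst. */
    for (unsigned i = 0; i < n; i++) {
      unsigned loc = __atomic_fetch_add(&start[src[i]], 1, __ATOMIC_RELAXED);
      dst[loc] = src[i];
    }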
If Polly gets the additional #pragma information, it can remove all
dependences and execute the code in parallel. This is also not
implemented, but seems doable.

> This is the sort of code I want to handle -- parallel code written
> by knowledgeable programmers who're counting on the compiler to
> recognize parallel loops, recognize and rewrite recurrences and
> reductions, and insert synchronization where helpful, using a
> variety of directives to clarify their intent.

Sounds very much like what we want to handle with Polly. This does not
mean we handle all of this already, but we already have some high-level
optimizations based on the Pluto algorithm [2] and, more importantly,
we are already able to abstract away a lot of low-level details. So I
definitely expect significant overlap. Let's see where we can reuse
each other's work. (I am, e.g., very interested in array
delinearization or in how you pass loop pragma-annotations to
LLVM-IR.) :-)

Cheers
Tobi

[1] http://llvm.org/docs/LangRef.html#i_atomicrmw
[2] http://pluto-compiler.sourceforge.net/


>> for (unsigned i = 0; i < buckets; i++)
>>   count[i] = 0;
>>
>> for (unsigned i = 0; i < n; i++)
>>   count[src[i]]++;
>>
>> start[0] = 0;
>> for (unsigned i = 1; i < buckets; i++)
>>   start[i] = start[i - 1] + count[i - 1];
>>
>> #pragma assert parallel
>> for (unsigned i = 0; i < n; i++) {
>>   unsigned loc = int_fetch_add(start + src[i], 1);
>>   dst[loc] = src[i];
>> }
>
> Should this be:
>
>     unsigned loc = int_fetch_add(start[src[i]], 1);

Our intrinsic wants a pointer, so either int_fetch_add(start + src[i], 1)
or int_fetch_add(&start[src[i]], 1) will work.

>> The 1st loop is trivially parallel. I think Polly would recognize
>> this and do good things.
>
> This case is trivial.
>
> But keep in mind that unsigned loop ivs and integers can cause modulo
> wrapping, which is not trivial to handle. Both Polly and any other
> tool need to take into account additional (possibly non-uniform)
> dependences that may appear in addition to the dependences for
> non-wrapping integers. (In LLVM even signed integers can have modulo
> wrapping, so we actually always need to be careful here.)

Sure. In this case, it would affect the code that determines the
number of iterations. If the loop terminates, there's no dependence.

>> The 2nd loop has a race condition that can be handled by using an
>> atomic increment provided by the architecture, if the compiler knows
>> about such things. I don't think Polly would help here.
>
>> for (unsigned i = 0; i < n; i++)
>>   count[src[i]]++;
>
> I do not see why it should be difficult to teach polyhedral tools to
> model an increment and to understand that it can be done atomically.

When I read about polyhedral frameworks, the first thing I see is the
restriction to affine subscripts. For loops 2 and 4, this seems to be
the stumbling block.

> The big selling point of polyhedral transformations is the uniform
> handling of several transformations.

Right. In olden times, people used a bunch of independent loop xforms,
composed in different ways to achieve different effects. Then came the
unimodular xforms, which could achieve the effect of combining several
xforms (but not all). Then came the polyhedral approach, which could do
quite a bit more, including the actual dependence analysis, as long as
your problem fit in the framework.

I think the sorting examples don't fit within the framework. Indeed, I
think it'd be tough to express a sort (something easy, like a vector
of unsigned values) in a way that Polly would parallelize all the
loops.

> Another reason may be GPU code generation. Here you often use
> techniques like tiling to create the thread groups. Within the
> polyhedral model this is a very simple transformation; on LLVM-IR it
> may become complex.

Yes, of course Polly is good for many things, but I use reductions and
recurrences every day; I must have a good way to handle them.
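For illustration, a minimal sketch of the kind of rewrite I have in
mind for the 3rd loop: split the exclusive prefix sum into parallel
per-block passes around one short sequential scan (my sketch, assuming
BLOCKS evenly divides buckets):

    /* Sketch only: passes 1 and 3 are parallel across blocks; only
     * the scan of the BLOCKS partial sums remains sequential. */
    #define BLOCKS 8
    unsigned chunk = buckets / BLOCKS, sum[BLOCKS], base[BLOCKS];

    for (unsigned b = 0; b < BLOCKS; b++) {   /* pass 1: per-block totals */
      unsigned s = 0;
      for (unsigned i = b * chunk; i < (b + 1) * chunk; i++)
        s += count[i];
      sum[b] = s;
    }

    base[0] = 0;                              /* pass 2: scan block totals */
    for (unsigned b = 1; b < BLOCKS; b++)
      base[b] = base[b - 1] + sum[b - 1];

    for (unsigned b = 0; b < BLOCKS; b++) {   /* pass 3: local exclusive scans */
      unsigned s = base[b];
      for (unsigned i = b * chunk; i < (b + 1) * chunk; i++) {
        start[i] = s;
        s += count[i];
      }
    }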
>> The 3rd loop is a linear recurrence that can be rewritten and solved
>> in parallel, if the compiler knows about such things. I don't think
>> Polly would help here.
>
> Polly does not perform this transformation yet. But it may still be
> helpful, as it abstracts away the low-level details of LLVM-IR and
> allows you to focus on the higher-level problem.

That's the way I think of the dependence graph. It gets me above the
IR and lets me study the flow of dependences.

>> start[0] = 0;
>> for (unsigned i = 1; i < buckets; i++)
>>   start[i] = start[i - 1] + count[i - 1];
>
> The previous code will be translated to:
>
>   Statements: Init[]; Body[i] : 1 <= i < buckets
>
>   Writes: Init[] -> start[0]; Body[i] -> start[i];
>   Reads:  Body[i] -> start[i-1]; Body[i] -> count[i-1];
>
> In addition, we extract the information about the operation performed
> by Body. The idea is to move away from low-level pattern matching and
> to use a high-level approach to implement the transformations proposed
> by you. For code that does not fit into the polyhedral model, this is
> obviously not possible, but the examples you have given so far can be
> represented easily.
>
>> The 4th loop relies on the programmer to use both an explicit
>> assertion and an explicit fetch-and-add intrinsic. I think both of
>> these are outside of Polly's scope.
>
> We do not yet use this within Polly, but I think asking the programmer
> to provide additional knowledge (when available) is more than
> reasonable.

Yep. My doubts arose because nobody ever mentions any way to take
directives into account when analyzing loops with the polyhedral
approach.

>> #pragma assert parallel
>> for (unsigned i = 0; i < n; i++) {
>>   unsigned loc = int_fetch_add(&start[src[i]], 1);
>>   dst[loc] = src[i];
>> }
>
> As int_fetch_add is side-effect free, it is fully polyhedral. It can
> be implemented with the relevant LLVM atomicrmw instruction [1]. Polly
> does not yet allow atomicrmw instructions, but adding support for them
> should not be too difficult.

The non-affine subscript is a problem, isn't it? And do we really
consider the int_fetch_add side-effect free? After all, memory is
modified.

> If Polly gets the additional #pragma information, it can remove all
> dependences and execute the code in parallel. This is also not
> implemented, but seems doable.

But is it? Nobody seems to write about it. Makes me wonder why not.

>> This is the sort of code I want to handle -- parallel code written
>> by knowledgeable programmers who're counting on the compiler to
>> recognize parallel loops, recognize and rewrite recurrences and
>> reductions, and insert synchronization where helpful, using a
>> variety of directives to clarify their intent.
>
> Sounds very much like what we want to handle with Polly. This does
> not mean we handle all of this already, but we already have some
> high-level optimizations based on the Pluto algorithm [2] and, more
> importantly, we are already able to abstract away a lot of low-level
> details. So I definitely expect significant overlap. Let's see where
> we can reuse each other's work.

Sounds good.

> (I am, e.g., very interested in array delinearization or in how you
> pass loop pragma-annotations to LLVM-IR.)

I'm a long ways from doing these things, so I don't have much to say
yet. It's just the direction I'm headed.

Preston
On 02/07/2012 07:04 PM, Preston Briggs wrote:
>>> The 1st loop is trivially parallel. I think Polly would recognize
>>> this and do good things.
>>
>> This case is trivial.
>>
>> But keep in mind that unsigned loop ivs and integers can cause modulo
>> wrapping, which is not trivial to handle. Both Polly and any other
>> tool need to take into account additional (possibly non-uniform)
>> dependences that may appear in addition to the dependences for
>> non-wrapping integers. (In LLVM even signed integers can have modulo
>> wrapping, so we actually always need to be careful here.)
>
> Sure. In this case, it would affect the code that determines the
> number of iterations.
> If the loop terminates, there's no dependence.

This becomes interesting if you have an upper bound such as a + b.
Keep in mind that you need to calculate this modulo some integer value
that is derived from the type size. This becomes tricky when fusing
loops.
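For example (a sketch of mine, using 8-bit arithmetic to keep the
numbers small), the bound a + b is evaluated modulo 256, so the trip
count is far from the mathematical sum:

    /* a + b is 260 mathematically, but truncated to unsigned char it
     * is 4, so this loop runs only 4 iterations. */
    unsigned char a = 250, b = 10;
    for (unsigned char i = 0; i < (unsigned char)(a + b); i++)
      count[i] = 0;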
>>> The 2nd loop has a race condition that can be handled by using an
>>> atomic increment provided by the architecture, if the compiler knows
>>> about such things. I don't think Polly would help here.
>>
>>> for (unsigned i = 0; i < n; i++)
>>>   count[src[i]]++;
>>
>> I do not see why it should be difficult to teach polyhedral tools to
>> model an increment and to understand that it can be done atomically.
>
> When I read about polyhedral frameworks, the first thing I see is
> the restriction to affine subscripts. For loops 2 and 4, this seems
> to be the stumbling block.

You need affine subscripts to get exact dependence information
(meaning to know statically which statement instance has written the
value read by another statement instance). As this information is not
available at compile time, neither a polyhedral framework nor any
other framework can derive it statically. In such a case you can
over-approximate. For the example above this would be a read of
count[p], where p is an unknown value. In case you have additional
information you may get away with less over-approximation. E.g., you
could take into account that p is even, or that you know src[i] is a
bijective function. Just because affine subscripts can be represented
in great detail does not mean non-affine subscripts cannot be worked
with at all. They just provide less information.

>> The big selling point of polyhedral transformations is the uniform
>> handling of several transformations.
>
> Right. In olden times, people used a bunch of independent loop xforms,
> composed in different ways to achieve different effects. Then came the
> unimodular xforms, which could achieve the effect of combining several
> xforms (but not all). Then came the polyhedral approach, which could do
> quite a bit more, including the actual dependence analysis, as long as
> your problem fit in the framework.
>
> I think the sorting examples don't fit within the framework.

Actually they are pretty close. At least the ones that I have seen.

> Indeed, I think it'd be tough to express a sort (something easy, like a
> vector of unsigned values) in a way that Polly would parallelize all
> the loops.

Knowing if Polly would parallelize some code is hard to judge without
seeing the code, but as Polly has only recently grown its first
complex transformations, I would doubt it. The more interesting
question is how much of the stuff in Polly can be reused, even though
Polly does not do this transformation (and may never do it). Maybe you
can use the RegionInfo infrastructure from LLVM, which was developed
for Polly, some of the canonicalization passes, or the dependency
analysis, or you can steal ideas from the OpenMP code generation. As
there is a lot of overlap, we should make sure to share code. I just
realized that I would love to use an independent, generic OpenMP code
generation in Polly. Let me know if you start work in this area.

>> Another reason may be GPU code generation. Here you often use
>> techniques like tiling to create the thread groups. Within the
>> polyhedral model this is a very simple transformation; on LLVM-IR it
>> may become complex.
>
> Yes, of course Polly is good for many things, but I use reductions and
> recurrences every day; I must have a good way to handle them.

So what does such a representation look like? From my point of view,
reductions are not very special: basically a sequence of statement
iterations that work on the same data element. If the operation
applied is associative, some nice optimizations can be performed and
some dependences can be removed.
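For illustration, a minimal sketch of that associativity argument on a
plain sum reduction (my sketch, not code from your mail):

    /* Sketch only: each p accumulates an independent partial sum, so
     * the outer loop is parallel; only the final combine of the four
     * partials is sequential. */
    unsigned total = 0, part[4] = {0, 0, 0, 0};
    for (unsigned p = 0; p < 4; p++)
      for (unsigned i = p; i < n; i += 4)
        part[p] += src[i];
    for (unsigned p = 0; p < 4; p++)
      total += part[p];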
I need to read the paper more thoroughly to get the exact definition
and constraints of (bounded) recurrences, but from looking at your
code I believe it can be detected and matched with Polly. The
transformation you apply does not seem to be a pure scheduling
transformation, but rather introduces new statements and may even
create non-affine loops. This is more complex to handle and maybe not
even possible or clever to do in a polyhedral framework. I will try to
have a close look at the paper to understand what transformations you
are doing.

>>> The 3rd loop is a linear recurrence that can be rewritten and solved
>>> in parallel, if the compiler knows about such things. I don't think
>>> Polly would help here.
>>
>> Polly does not perform this transformation yet. But it may still be
>> helpful, as it abstracts away the low-level details of LLVM-IR and
>> allows you to focus on the higher-level problem.
>
> That's the way I think of the dependence graph. It gets me above the
> IR and lets me study the flow of dependences.

OK. So tell me a little bit more about your dependence graph. What is
a statement in your graph? An LLVM-IR instruction? Or do you group
them somehow?

>>> start[0] = 0;
>>> for (unsigned i = 1; i < buckets; i++)
>>>   start[i] = start[i - 1] + count[i - 1];
>>
>> The previous code will be translated to:
>>
>>   Statements: Init[]; Body[i] : 1 <= i < buckets
>>
>>   Writes: Init[] -> start[0]; Body[i] -> start[i];
>>   Reads:  Body[i] -> start[i-1]; Body[i] -> count[i-1];
>>
>> In addition, we extract the information about the operation performed
>> by Body. The idea is to move away from low-level pattern matching and
>> to use a high-level approach to implement the transformations proposed
>> by you. For code that does not fit into the polyhedral model, this is
>> obviously not possible, but the examples you have given so far can be
>> represented easily.
>>
>>> The 4th loop relies on the programmer to use both an explicit
>>> assertion and an explicit fetch-and-add intrinsic. I think both of
>>> these are outside of Polly's scope.
>>
>> We do not yet use this within Polly, but I think asking the programmer
>> to provide additional knowledge (when available) is more than
>> reasonable.
>
> Yep. My doubts arose because nobody ever mentions any way to take
> directives into account when analyzing loops with the polyhedral
> approach.

It is not directly obvious how you would pass this information to
LLVM-IR, but for source-to-source tools people commonly use
annotations to provide information about live-in and live-out values,
the sizes of arrays, and so on. It makes perfect sense to state the
absence of dependences.

>>> #pragma assert parallel
>>> for (unsigned i = 0; i < n; i++) {
>>>   unsigned loc = int_fetch_add(&start[src[i]], 1);
>>>   dst[loc] = src[i];
>>> }
>>
>> As int_fetch_add is side-effect free, it is fully polyhedral. It can
>> be implemented with the relevant LLVM atomicrmw instruction [1].
>> Polly does not yet allow atomicrmw instructions, but adding support
>> for them should not be too difficult.
>
> The non-affine subscript is a problem, isn't it?

See above. It just limits how detailed we can analyze the problem. As
the user stated that there are no dependences in the loop, there is no
need for exact dependences. The loop can be parallelized even with a
conservative over-approximation of the dependences.

> And do we really consider the int_fetch_add side-effect free?
> After all, memory is modified.

If you implement it with the atomicrmw instruction, only the memory
that is pointed to is modified. Side-effect free does not mean that
nothing is touched at all, but that we know about the side effects.

>> If Polly gets the additional #pragma information, it can remove all
>> dependences and execute the code in parallel. This is also not
>> implemented, but seems doable.
>
> But is it? Nobody seems to write about it. Makes me wonder why not.

The problem is not removing the dependences. This is simple; in a
source-to-source tool, we could do it easily. Passing such information
from a C loop to an LLVM-IR loop is more difficult (with or without
Polly). There is no explicit loop construct in LLVM-IR, so you may
need to add metadata to the loop header or the loop induction
variable. That can work, but is a little difficult to maintain through
transformations.

>>> This is the sort of code I want to handle -- parallel code written
>>> by knowledgeable programmers who're counting on the compiler to
>>> recognize parallel loops, recognize and rewrite recurrences and
>>> reductions, and insert synchronization where helpful, using a
>>> variety of directives to clarify their intent.
>>
>> Sounds very much like what we want to handle with Polly. This does
>> not mean we handle all of this already, but we already have some
>> high-level optimizations based on the Pluto algorithm [2] and, more
>> importantly, we are already able to abstract away a lot of low-level
>> details. So I definitely expect significant overlap. Let's see where
>> we can reuse each other's work.
>
> Sounds good.
>
>> (I am, e.g., very interested in array delinearization or in how you
>> pass loop pragma-annotations to LLVM-IR.)
>
> I'm a long ways from doing these things, so I don't have much to say
> yet. It's just the direction I'm headed.

Just keep me up to date.

Tobi