On Sat, Feb 4, 2012 at 2:27 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> On Fri, 2012-02-03 at 20:59 -0800, Preston Briggs wrote:
>> ... so I am building a dependence graph for a complete function. Of
>> course, such a thing is useful for vectorization and all sorts of
>> other dependence-based loop transforms.
>>
>> I'm looking at the problem in two parts:
>>  1. a dependence test that, given two memory references, decides
>>     if a dependence exists between them, and

Would you consider using Polly http://polly.grosser.es to avoid
writing this code?

If you do not want to use Polly, you could use ISL
http://freecode.com/projects/isl to set up the dependence problem and
use ISL's ILP solver to solve it.

These would be my preferred choices for implementing the dependence
test: on the GCC side I implemented a stand-alone dependence tester
that does not use ILP (i.e., it sets up the dependence test as a
Diophantine equation and then solves it), and I can say that it was
quite painful to write and to clean of bugs.

>>  2. the dependence-graph builder that looks over the complete
>>     function and finds pairs of memory references to pass to the
>>     dependence test, using the results to build a dependence
>>     graph, with edges labeled as to the kind of dependence and
>>     direction vectors.
>>
>> Currently I'm focused on (2), the simpler task I think, as a way to
>> learn my way around the infrastructure. The idea, roughly, is to find
>> all the alias sets associated with each load and store and compute
>> what Wolfe calls factored redef-use chains for each alias set (similar
>> to SSA, but supporting computation of all 4 kinds of dependences). By
>> exploring these chains, we can find the right pairs to test,
>> accounting for control flow and the confusion caused by procedure
>> calls, etc.
>
> This sounds very useful.
>
>> I'm not yet sure how I'll proceed for (1). It seems natural and wise
>> to take advantage of the scalar evolutions, but they're new to me and
>> I'll need to study them more. I'd like to provide an interface that's
>> independent of (2), so that it can be used independently, e.g., by a
>> software pipeliner.
>
> This makes sense.
>
>> I'd also like to provide a caching interface, so we can avoid
>> re-testing pairs of references with identical subscript forms. It
>> also seems desirable to start with relatively simple tests, and expand
>> the coverage over time. We'll see how it goes.
>
> For both (1) and (2), if you'd like to sketch out the interface you
> have in mind, I'd be happy to provide some feedback.
>
> On a practical note, I think that it is important that the mechanisms
> developed for this support both direct multidimensional indexing
> (a[i][j][k]) and also manually-expanded indexing (a[n*(j+n*i) + k]).

It would be good to have multidimensional arrays in LLVM: note that
"recognizing" a multidimensional access in a linearized access is not
trivial. This was the original reason why I started working on the
GIMPLE representation in GCC: the RTL has all memory accesses
linearized, as in GEP http://llvm.org/docs/GetElementPtr.html, and so
the type information about the array shape is lost for memory accesses.

> Recognizing the latter may require some specific idiom-recognition
> code (and scalar evolution analysis should be very useful for this).

Unfortunately, SCEV does not help much in recognizing a multi-dim
access.

Sebastian
--
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum
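[An illustration of Sebastian's point that delinearization is
non-trivial; the example code is not from the thread. After lowering,
both functions below compute the same i*M + j subscript, but only the
first carries the array shape in its type; recovering the shape from
the second requires proving 0 <= j < M, since otherwise an access could
cross a row boundary:]

    enum { M = 100 };

    // Shape is explicit in the type: a[i][j] provably stays in row i.
    void store2d(int n, int a[][M]) {
      for (int i = 0; i < n; i++)
        for (int j = 0; j < M; j++)
          a[i][j] = 0;
    }

    // Identical addresses after linearization, but the shape is gone:
    // proving that a[i*M + j] never crosses a row boundary requires
    // establishing 0 <= j < M, which is exactly what a delinearization
    // analysis must do.
    void store1d(int n, int *a) {
      for (int i = 0; i < n; i++)
        for (int j = 0; j < M; j++)
          a[i * M + j] = 0;
    }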
On Mon, Feb 6, 2012 at 12:13 PM, Sebastian Pop <spop at codeaurora.org> wrote:
> [many things, but I'm only going to focus on one of them]
> Would you consider using Polly http://polly.grosser.es to avoid
> writing this code?

My impression is that Polly (and polyhedral analysis generally) doesn't
do what I want. But I'm happy to talk about it 'cause I might be wrong.

If I look at all the ways I know how to sort in parallel, they all look
roughly like this simple counting sort:

    for (unsigned i = 0; i < buckets; i++)
      count[i] = 0;

    for (unsigned i = 0; i < n; i++)
      count[src[i]]++;

    start[0] = 0;
    for (unsigned i = 1; i < buckets; i++)
      start[i] = start[i - 1] + count[i - 1];

    #pragma assert parallel
    for (unsigned i = 0; i < n; i++) {
      unsigned loc = int_fetch_add(start + src[i], 1);
      dst[loc] = src[i];
    }

The 1st loop is trivially parallel. I think Polly would recognize this
and do good things.

The 2nd loop has a race condition that can be handled by using an
atomic increment provided by the architecture, if the compiler knows
about such things. I don't think Polly would help here.

The 3rd loop is a linear recurrence that can be rewritten and solved in
parallel, if the compiler knows about such things. I don't think Polly
would help here.

The 4th loop relies on the programmer to use both an explicit assertion
and an explicit fetch-and-add intrinsic. I think both of these are
outside of Polly's scope.

This is the sort of code I want to handle -- parallel code written by
knowledgeable programmers who're counting on the compiler to recognize
parallel loops, recognize and rewrite recurrences and reductions, and
insert synchronization where helpful, using a variety of directives to
clarify their intent.

Preston
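[For context on the 3rd loop: the standard parallel rewrite of such a
recurrence is a two-pass blocked prefix sum. A minimal sketch follows;
the function name, blocking scheme, and OpenMP pragmas are illustrative,
not something proposed in the thread:]

    #include <algorithm>
    #include <vector>

    // Two-pass parallel exclusive scan:
    //   start[i] = count[0] + ... + count[i-1], with start[0] = 0.
    // Pass 1: each block sums its slice independently (parallel).
    // Middle: a short serial scan turns per-block sums into offsets.
    // Pass 2: each block scans locally from its offset (parallel).
    void parallel_exclusive_scan(const unsigned *count, unsigned *start,
                                 int buckets, int nblocks) {
      std::vector<unsigned> block_sum(nblocks, 0);
      int chunk = (buckets + nblocks - 1) / nblocks;

      #pragma omp parallel for
      for (int b = 0; b < nblocks; b++)
        for (int i = b * chunk; i < std::min(buckets, (b + 1) * chunk); i++)
          block_sum[b] += count[i];

      unsigned running = 0;
      for (int b = 0; b < nblocks; b++) {
        unsigned s = block_sum[b];
        block_sum[b] = running;   // block_sum[b] is now block b's offset
        running += s;
      }

      #pragma omp parallel for
      for (int b = 0; b < nblocks; b++) {
        unsigned acc = block_sum[b];
        for (int i = b * chunk; i < std::min(buckets, (b + 1) * chunk); i++) {
          start[i] = acc;
          acc += count[i];
        }
      }
    }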
On Mon, 2012-02-06 at 14:13 -0600, Sebastian Pop wrote:
> Would you consider using Polly http://polly.grosser.es to avoid
> writing this code?

I agree, we should use the (self-contained) passes developed in Polly
where possible.

> If you do not want to use Polly, you could use ISL
> http://freecode.com/projects/isl to set up the dependence problem and
> use ISL's ILP to solve it.

isl is an LGPL project. It is not clear to me what the general
consensus would be on having a core analysis pass carry an LGPL
dependency.

> It would be good to have multidimensional arrays in LLVM: note that
> "recognizing" the multidimensional access from a linearized access is
> not trivial. This was the original reason why I started working on
> the GIMPLE representation in GCC: the RTL has all memory accesses
> linearized, as in GEP http://llvm.org/docs/GetElementPtr.html, and so
> the type information about the array shape is lost for memory
> accesses.

I agree, to some extent, but an analysis pass would also be quite
useful. There is a lot of real C/C++ code that uses linearized
accesses, and it will be very important to handle those cases.

> > Recognizing the latter may require some specific idiom-recognition
> > code (and scalar evolution analysis should be very useful for this).
>
> Unfortunately, SCEV does not help much in recognizing a multi-dim
> access.

I've not thought this through in detail, but it seems like it would be
helpful. Let's say that I have a loop with two induction variables, i
and j, and I want to see if an expression looks like i*n + j for some
constant n (and I also need to determine n). Then I would take the
expression, subtract j (SE->getMinusSCEV) and divide by i. If the
result is a SCEVConstant, then I have n and I'm done. Otherwise, I can
try something else (or fail). Maybe, for example, n is not a constant,
but is a loop-invariant scalar, etc.

 -Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
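[A rough sketch of how Hal's n-recovery check might look against LLVM's
ScalarEvolution; the helper is hypothetical, not code from the thread.
Note that SCEV canonicalizes i*n + j over a nested loop pair into nested
add recurrences, so rather than literally dividing, this sketch reads n
off the step of the outer recurrence:]

    #include "llvm/Analysis/ScalarEvolution.h"
    #include "llvm/Analysis/ScalarEvolutionExpressions.h"
    using namespace llvm;

    // Hypothetical helper: if Subscript has the canonical SCEV form
    // {{0,+,n}<outer>,+,1}<inner> -- i.e., i*n + j for the induction
    // variables of a two-deep loop nest -- return the constant n,
    // otherwise return null.
    static const SCEVConstant *getRowLength(ScalarEvolution &SE,
                                            Value *Subscript) {
      const SCEV *S = SE.getSCEV(Subscript);
      // The inner recurrence must advance by 1 per j-loop iteration...
      const SCEVAddRecExpr *Inner = dyn_cast<SCEVAddRecExpr>(S);
      if (!Inner || !Inner->getStepRecurrence(SE)->isOne())
        return 0;
      // ...and its start must itself be a recurrence over the i-loop,
      // whose step is the candidate row length n.
      const SCEVAddRecExpr *Outer =
          dyn_cast<SCEVAddRecExpr>(Inner->getStart());
      if (!Outer)
        return 0;
      return dyn_cast<SCEVConstant>(Outer->getStepRecurrence(SE));
    }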
On Mon, Feb 6, 2012 at 1:56 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> I agree, to some extent, but an analysis pass would also be quite
> useful. There is a lot of real C/C++ code that uses linearized
> accesses, and it will be very important to handle those cases.

Of course. There's some literature that attacks the problem, e.g.,
http://dl.acm.org/citation.cfm?id=143130
and I expect there's more.

Preston
On Feb 6, 2012, at 1:56 PM, Hal Finkel wrote:
>> If you do not want to use Polly, you could use ISL
>> http://freecode.com/projects/isl to set up the dependence problem
>> and use ISL's ILP to solve it.
>
> isl is an LGPL project. It is not clear to me what the general
> consensus would be on having a core analysis pass carry an LGPL
> dependency.

This is fine for something that wants to live out of tree or be a
secondary subproject, but isn't acceptable for something that wants to
go into the mainline llvm repository and be a default part of the
compiler.

-Chris
Preston Briggs <preston.briggs at gmail.com> writes:
> On Mon, Feb 6, 2012 at 12:13 PM, Sebastian Pop <spop at codeaurora.org> wrote:
>> [many things, but I'm only going to focus on one of them]
>> Would you consider using Polly http://polly.grosser.es to avoid
>> writing this code?
>
> My impression is that Polly (and polyhedral analysis generally)
> doesn't do what I want. But I'm happy to talk about it 'cause I might
> be wrong.

I think you are right. However, Polly might be useful as one layer in a
dependence tester. I haven't thought too deeply about it, but
polyhedral analysis can do some pretty cool things. I'm really not
clear on what something like CLooG would do with the linear recurrence
example. But the analysis side of Polly might be quite useful.

-Dave
On 02/06/2012 10:02 PM, Preston Briggs wrote:
> On Mon, Feb 6, 2012 at 12:13 PM, Sebastian Pop <spop at codeaurora.org> wrote:
>> [many things, but I'm only going to focus on one of them]
>> Would you consider using Polly http://polly.grosser.es to avoid
>> writing this code?
>
> My impression is that Polly (and polyhedral analysis generally)
> doesn't do what I want. But I'm happy to talk about it 'cause I might
> be wrong.

That depends on what you want. ;-) Polly may offer good possibilities
(and maybe even code) or it may not help at all. The best is to discuss
this stuff.

> If I look at all the ways I know how to sort in parallel, they all
> look roughly like this simple counting sort:
>
>     for (unsigned i = 0; i < buckets; i++)
>       count[i] = 0;
>
>     for (unsigned i = 0; i < n; i++)
>       count[src[i]]++;
>
>     start[0] = 0;
>     for (unsigned i = 1; i < buckets; i++)
>       start[i] = start[i - 1] + count[i - 1];
>
>     #pragma assert parallel
>     for (unsigned i = 0; i < n; i++) {
>       unsigned loc = int_fetch_add(start + src[i], 1);

Should this be:

      unsigned loc = int_fetch_add(start[src[i]], 1);

>       dst[loc] = src[i];
>     }
>
> The 1st loop is trivially parallel. I think Polly would recognize
> this and do good things.

This case is trivial. But keep in mind that unsigned loop induction
variables and integers can cause modulo wrapping, which is not trivial
to handle. Polly, but also any other tool, needs to take into account
additional (possibly non-uniform) dependences that may appear on top of
the dependences for non-wrapping integers. (In LLVM even signed
integers can have modulo wrapping, so actually we always need to be
careful here.) isl can actually represent modulo constraints pretty
well.

> The 2nd loop has a race condition that can be handled by using an
> atomic increment provided by the architecture, if the compiler knows
> about such things. I don't think Polly would help here.
>
>     for (unsigned i = 0; i < n; i++)
>       count[src[i]]++;

I do not see why it should be difficult to teach polyhedral tools to
model an increment and to understand that it can be done atomically.
Though, if you just want to do this specific transformation, Polly is
probably overhead. The big selling point of polyhedral transformations
is the uniform handling of several transformations. In case you want
additional loop transformations to expose such patterns, Polly might be
helpful. Another reason may be GPU code generation. Here you often use
techniques like tiling to create the thread groups. Within the
polyhedral model this is a very simple transformation; on LLVM-IR it
may become complex.

> The 3rd loop is a linear recurrence that can be rewritten and solved
> in parallel, if the compiler knows about such things. I don't think
> Polly would help here.

Polly does not perform this transformation yet. But it may still be
helpful, as it abstracts away the low-level details of LLVM-IR and
allows you to focus on the higher-level problem.

>     start[0] = 0;
>     for (unsigned i = 1; i < buckets; i++)
>       start[i] = start[i - 1] + count[i - 1];

The previous code will be translated to:

    Statements: Init[]; Body[i] : 1 <= i < buckets
    Writes:     Init[] -> start[0]; Body[i] -> start[i];
    Reads:      Body[i] -> start[i-1]; Body[i] -> count[i-1];

When we additionally extract the information about the operation
performed by Body, the idea is to move away from low-level pattern
matching and to use a high-level approach to implement the
transformations you propose. For code that does not fit into the
polyhedral model this is obviously not possible, but the examples you
have given so far can be represented easily.

> The 4th loop relies on the programmer to use both an explicit
> assertion and an explicit fetch-and-add intrinsic. I think both of
> these are outside of Polly's scope.

We do not yet use this within Polly, but I think asking the programmer
to provide additional knowledge (when available) is more than
reasonable.

> #pragma assert parallel
> for (unsigned i = 0; i < n; i++) {
>   unsigned loc = int_fetch_add(start[src[i]], 1);
>   dst[loc] = src[i];
> }

As int_fetch_add is side-effect free, it is fully polyhedral. It can be
implemented with the relevant LLVM atomicrmw instruction [1]. Polly
does not yet allow atomicrmw instructions, but adding support for them
should not be too difficult. If Polly gets the additional #pragma
information, it can remove all dependences and execute the code in
parallel. This is also not implemented, but seems doable.

> This is the sort of code I want to handle -- parallel code written by
> knowledgeable programmers who're counting on the compiler to
> recognize parallel loops, recognize and rewrite recurrences and
> reductions, and insert synchronization where helpful, using a variety
> of directives to clarify their intent.

Sounds very much like what we want to handle with Polly. This does not
mean we handle all of this already, but we already have some high-level
optimizations based on the Pluto algorithm [2] and, more importantly,
we are already able to abstract away a lot of low-level details. So I
definitely expect significant overlap. Let's see where we can reuse
each other's work. (I am, e.g., very interested in array
delinearization and in how you pass loop pragma-annotations to
LLVM-IR.) :-)

Cheers
Tobi

[1] http://llvm.org/docs/LangRef.html#i_atomicrmw
[2] http://pluto-compiler.sourceforge.net/
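[A concrete footnote to the atomicrmw point, not from the thread:
written with the GCC/Clang __sync builtin, which stands in for
int_fetch_add (not a standard intrinsic), the 4th loop compiles down to
the atomicrmw add instruction referenced in [1]:]

    // Clang lowers __sync_fetch_and_add to "atomicrmw add ... seq_cst";
    // int_fetch_add is assumed to have the same fetch-then-add semantics.
    void scatter(const unsigned *src, unsigned *dst, unsigned *start,
                 unsigned n) {
      // #pragma assert parallel -- programmer-asserted independence
      for (unsigned i = 0; i < n; i++) {
        unsigned loc = __sync_fetch_and_add(&start[src[i]], 1);
        dst[loc] = src[i];
      }
    }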