thr3ads.net - llvm dev - [LLVMdev] [cfe-dev] OpenCL support [Dec 2010]

If this information is useful, please help other people find it:
Share via:

Peter Collingbourne

2010-Dec-06 22:55 UTC

[LLVMdev] [cfe-dev] OpenCL support

Hi David,

On Mon, Dec 06, 2010 at 11:14:42AM -0500, David Neto
wrote:> >> It
> >> seems it would be a good idea to transform the code so that uses
of x
> >> become loads and stores from memory, and the address for that
memory
> >> is returned by a builtin function that itself is dependent on work
> >> group ids.
> >>
> >> I'm just learning Clang now, so I'm not prepared to say
how that would
> >> be done.  Is it okay to transform the AST before semantic
analysis?
> >> Where should I start looking?  (I would guess lib/Sema...)
> >
> > This transformation may be useful for a CPU based OpenCL
> > implementation, but would not be appropriate in Sema for a few
> > reasons.  The first is that the AST should at all times be an accurate
> > representation of the input source code.
> >
> > The second is that such a transformation would be specific to the
> > OpenCL implementation -- not only would it be inappropriate for
> > GPUs but there are a number of feasible CPU based implementation
> > techniques which we shouldn't have to teach Sema or in fact any
part
> > of Clang about.
> >
> > The best place to do this transformation would be at the LLVM level
> > with an implementation specific transformation pass.
> 
> Ok.  Now I'm even more convinced that your patch [1] is incorrect
because:
> (a) it's specific to GPU-style implementations of OpenCL, not the
> generic semantics of OpenCL.
> (b) it pushes target-specific assumptions into Sema.  But you've just
> argued that the AST should reflect the original source code as much as
> possible.
Yes, that's why I don't like the patch so much :)  It was really
designed to work with the current infrastructure, which isn't
very well suited to more exotic languages like OpenCL.
> On (a):   I understand that ARM is preparing to contribute a more
> complete OpenCL front-end to Clang.   It would be great to nail down a
> common front end with generic OpenCL semantics, and let later stages
> (Clang's CodeGen? LLVM IR pass?) handle more target-specific
> assumptions.  E.g. it would be nice to standardize on how Clang
> handles OpenCL's local, global, etc. etc. etc.  E.g. just agreeing on
> address space numbering would be a step forward.  (e.g. global is 1,
> local is 2...)
+llvmdev, as this is also a LLVM-relevant issue.

I agree.  We should set a standard for address spaces in LLVM - a low
range for 'standard' address spaces (with a defined semantics for each
value in that range) and a high range for target-specific spaces.
It looks like address spaces are already being used this way to a
certain extent in the targets (X86 uses 256 -> GS, 257 -> FS).  And
I think 256 'standard' address spaces should be enough, but I'm
happy
to be proven wrong :)
> What do I think your patch should look like?  It's true that the
> diag::err_as_qualified_auto_decl is inappropriate for OpenCL when it's
> the __local addres space.
> 
> But we need to implement the semantics somehow.  Conceptually I think
> of it as a CL source-to-source transformation that lowers
> function-scope-local-address-space variables into a more primitive
> form.
> 
> I think I disagree that the Clang is an inappropriate spot for
> implementing this type of transform: Clang "knows" the source
language
> semantics, and has a lot of machinery required for the transform.
> Also, Clang also knows a lot about the target machine (e.g. type
> sizes, builtins, more?).
> 
> So I believe the "auto var in different address space" case
should be
> allowed in the AST in the OpenCL case, and the local-lowering
> transform should be applied in CodeGen.  Perhaps the lowering is
> target-specific, e.g. GPU-style, or more generic style as I proposed.
> 
> Thoughts?
I've been rethinking this and perhaps coming around to this way
of thinking.  Allocating variables in the __local address space
is really something that can't be represented at the LLVM level,
at least in a standard form.

But to a certain extent both auto and static storage-classes are wrong
here.  Auto implies that each invocation of the function gets its own
variable, while static implies that all invocations share a variable.

Perhaps the right thing to do here is to introduce a new storage-class
for __local variables (let's call it the 'wg-local' storage-class).
A variable cannot be made wg-local with a storage-class specifier but
function-scope-local-address-space variables would be made so in a
similar way to my original patch.  The target would then be required
to define at CodeGen the semantics of declaring wg-local variables and
loading and storing from local address space in the way you propose.

A side effect of this is that we will require a mapping of
target-unsupported address spaces to supported address spaces in
CodeGen.  For example, a CPU based implementation should map the
local address space to 0.

Thanks,
-- 
Peter

Villmow, Micah

2010-Dec-06 23:16 UTC

head link

[LLVMdev] [cfe-dev] OpenCL support

> -----Original Message-----
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at
cs.uiuc.edu]
> On Behalf Of Peter Collingbourne
> Sent: Monday, December 06, 2010 2:56 PM
> To: David Neto
> Cc: cfe-dev at cs.uiuc.edu; llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] [cfe-dev] OpenCL support
> 
> Hi David,
> 
> On Mon, Dec 06, 2010 at 11:14:42AM -0500, David Neto wrote:
> > >> It
> > >> seems it would be a good idea to transform the code so that
uses
> of x
> > >> become loads and stores from memory, and the address for that
> memory
> > >> is returned by a builtin function that itself is dependent on
work
> > >> group ids.
> > >>
> > >> I'm just learning Clang now, so I'm not prepared to
say how that
> would
> > >> be done.  Is it okay to transform the AST before semantic
> analysis?
> > >> Where should I start looking?  (I would guess lib/Sema...)
> > >
> > > This transformation may be useful for a CPU based OpenCL
> > > implementation, but would not be appropriate in Sema for a few
> > > reasons.  The first is that the AST should at all times be an
> accurate
> > > representation of the input source code.
> > >
> > > The second is that such a transformation would be specific to the
> > > OpenCL implementation -- not only would it be inappropriate for
> > > GPUs but there are a number of feasible CPU based implementation
> > > techniques which we shouldn't have to teach Sema or in fact
any
> part
> > > of Clang about.
> > >
> > > The best place to do this transformation would be at the LLVM
level
> > > with an implementation specific transformation pass.
> >
> > Ok.  Now I'm even more convinced that your patch [1] is incorrect
> because:
> > (a) it's specific to GPU-style implementations of OpenCL, not the
> > generic semantics of OpenCL.
> > (b) it pushes target-specific assumptions into Sema.  But you've
just
> > argued that the AST should reflect the original source code as much
> as
> > possible.
> 
> Yes, that's why I don't like the patch so much :)  It was really
> designed to work with the current infrastructure, which isn't
> very well suited to more exotic languages like OpenCL.
> 
> > On (a):   I understand that ARM is preparing to contribute a more
> > complete OpenCL front-end to Clang.   It would be great to nail down
> a
> > common front end with generic OpenCL semantics, and let later stages
> > (Clang's CodeGen? LLVM IR pass?) handle more target-specific
> > assumptions.  E.g. it would be nice to standardize on how Clang
> > handles OpenCL's local, global, etc. etc. etc.  E.g. just agreeing
on
> > address space numbering would be a step forward.  (e.g. global is 1,
> > local is 2...)
> 
> +llvmdev, as this is also a LLVM-relevant issue.
> 
> I agree.  We should set a standard for address spaces in LLVM - a low
> range for 'standard' address spaces (with a defined semantics for
each
> value in that range) and a high range for target-specific spaces.
> It looks like address spaces are already being used this way to a
> certain extent in the targets (X86 uses 256 -> GS, 257 -> FS).  And
> I think 256 'standard' address spaces should be enough, but I'm
happy
> to be proven wrong :)
> [Villmow, Micah] This would be very beneficial to define these. The only main
issue is the
default address space. In OpenCL it is private, in LLVM, it is closer to global
than private.

> > What do I think your patch should look like?  It's true that the
> > diag::err_as_qualified_auto_decl is inappropriate for OpenCL when
> it's
> > the __local addres space.
> >
> > But we need to implement the semantics somehow.  Conceptually I think
> > of it as a CL source-to-source transformation that lowers
> > function-scope-local-address-space variables into a more primitive
> > form.
> >
> > I think I disagree that the Clang is an inappropriate spot for
> > implementing this type of transform: Clang "knows" the
source
> language
> > semantics, and has a lot of machinery required for the transform.
> > Also, Clang also knows a lot about the target machine (e.g. type
> > sizes, builtins, more?).
> >
> > So I believe the "auto var in different address space" case
should be
> > allowed in the AST in the OpenCL case, and the local-lowering
> > transform should be applied in CodeGen.  Perhaps the lowering is
> > target-specific, e.g. GPU-style, or more generic style as I proposed.
> >
> > Thoughts?
> 
> I've been rethinking this and perhaps coming around to this way
> of thinking.  Allocating variables in the __local address space
> is really something that can't be represented at the LLVM level,
> at least in a standard form.[Villmow, Micah] We ran across this problem in our OpenCL implementation.
However, you can create a global variable with an '__local' address
space and it works fine. There is an issue with collision between auto-arrays in
different kernels, but that can be solved with a little name mangling. There are
other ways to do this, for example, by converting local auto-arrays into kernel
local pointer arguments with a known size.

> 
> But to a certain extent both auto and static storage-classes are wrong
> here.  Auto implies that each invocation of the function gets its own
> variable, while static implies that all invocations share a variable.
> 
> Perhaps the right thing to do here is to introduce a new storage-class
> for __local variables (let's call it the 'wg-local'
storage-class).
> A variable cannot be made wg-local with a storage-class specifier but
> function-scope-local-address-space variables would be made so in a
> similar way to my original patch.  The target would then be required
> to define at CodeGen the semantics of declaring wg-local variables and
> loading and storing from local address space in the way you propose.[Villmow, Micah] I'd move away from adding a new storage class as using the
address space alone is sufficient to handle OpenCL's __local address space.

> 
> A side effect of this is that we will require a mapping of
> target-unsupported address spaces to supported address spaces in
> CodeGen.  For example, a CPU based implementation should map the
> local address space to 0.
> 
> Thanks,
> --
> Peter
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

David Neto

2010-Dec-07 21:02 UTC

head link

[LLVMdev] [cfe-dev] OpenCL support

On Mon, Dec 6, 2010 at 6:16 PM, Villmow, Micah <Micah.Villmow at amd.com>
wrote:>> -----Original Message-----
>> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at
cs.uiuc.edu]
>> On Behalf Of Peter Collingbourne
>> Sent: Monday, December 06, 2010 2:56 PM
>> To: David Neto
>> Cc: cfe-dev at cs.uiuc.edu; llvmdev at cs.uiuc.edu
>> Subject: Re: [LLVMdev] [cfe-dev] OpenCL support
>>
>> Hi David,
>>
>> On Mon, Dec 06, 2010 at 11:14:42AM -0500, David Neto wrote:
>> > What do I think your patch should look like?  It's true that
the
>> > diag::err_as_qualified_auto_decl is inappropriate for OpenCL when
>> it's
>> > the __local addres space.
>> >
>> > But we need to implement the semantics somehow.  Conceptually I
think
>> > of it as a CL source-to-source transformation that lowers
>> > function-scope-local-address-space variables into a more primitive
>> > form.
>> >
>> > I think I disagree that the Clang is an inappropriate spot for
>> > implementing this type of transform: Clang "knows" the
source
>> language
>> > semantics, and has a lot of machinery required for the transform.
>> > Also, Clang also knows a lot about the target machine (e.g. type
>> > sizes, builtins, more?).
>> >
>> > So I believe the "auto var in different address space"
case should be
>> > allowed in the AST in the OpenCL case, and the local-lowering
>> > transform should be applied in CodeGen.  Perhaps the lowering is
>> > target-specific, e.g. GPU-style, or more generic style as I
proposed.
>> >
>> > Thoughts?
>>
>> I've been rethinking this and perhaps coming around to this way
>> of thinking.  Allocating variables in the __local address space
>> is really something that can't be represented at the LLVM level,
>> at least in a standard form.
> [Villmow, Micah] We ran across this problem in our OpenCL implementation.
However, you can create a global variable with an '__local' address
space and it works fine. There is an issue with collision between auto-arrays in
different kernels, but that can be solved with a little name mangling. There are
other ways to do this, for example, by converting local auto-arrays into kernel
local pointer arguments with a known size.
Here's a little example to show the direction I was heading, with an
illustration as a CL-to-C translation.  I believe there are no
namespace issues, but otherwise is essentially the same as the global
variable solution.

The idea is that the func scope local addr variables are like a stack
frame that is shared between the different work items in a group.  So
collect all those variables in an anonymous struct, and then create a
function scope private variable to point to the one copy of that
struct.  The pointer is returned by a system-defined intrinsic
function dependent on the current work item.  (The system knows what
work groups are in flight, which is why you need a system-defined
intrinsic.)

So a kernel function like this:

void foo(__global int*A) {
   __local int vint;
   __local int *vpint;
   __local int const *vcpint;
   __local int volatile vvint;
   int a = A[0];
   vint = a;
   vvint = a;
   int a2 = vint;
   int va2 = vvint;
   barrier(CLK_LOCAL_MEM_FENCE);
   A[0] = a2 + va2;
}

is translated to this, which does pass through Clang, with __local
meaning attrib addrspace(2):

extern __local void * __get_work_group_local_base_addr(void); // intrinsic
void foo(__global int*A) {
   __local struct __local_vars_s {
      int vint;
      int *vpint;
      int const *vcpint;
      int volatile vvint;
   } * const __local_vars
            // this is a *private* variable, pointing to *local* addresses.
            // it's a const pointer because it shouldn't change; and
being const may expose optimizations
       = __get_work_group_local_base_addr();  // the new intrinsic
   int a = A[0];
   __local_vars->vint = a;   // l-values are translated as memory stores.
   __local_vars->vvint = a;
   int a2 = __local_vars->vint;   // r-values are translated as memory loads
   int va2 = __local_vars->vvint;
   barrier(CLK_LOCAL_MEM_FENCE);
   A[0] = a2 + va2;
}

As an extension, the backend ought to be able to use some smarts to
simplify this down in simple cases.  For example if the system only
ever allows one work group at a time, then the intrinsic could boil
down to returning a constant, and then link time optimization can
scrub away unneeded work.  Similarly if you have a GPU style
environment where (as Peter described) the "local" addresses are the
same integer value but in different groups point to different storage,
then again the intrinsic returns a constant and again LTO optimizes
the result.

I haven't thought through the implications of a kernel having such
vars calling another kernel having such variables.  At least the
OpenCL spec says that the behaviour is implementation-defined for such
a case.  It would be nice to be able to represent any of the sane
possibilities.

@Anton:  Regarding ARM's open-sourcing:  I'm glad to see the
reaffirmation, and I look forward to the contribution.  Yes, I
understand the virtues of patience.  :-)
I assume you plan to commit a document describing how OpenCL is
supported.  (e.g. how details like the above are handled.)

thanks,
david

Maybe Matching Threads

Search for more reasonably related threads

llvm dev - Dec 2010 - [LLVMdev] [cfe-dev] OpenCL support

[LLVMdev] [cfe-dev] OpenCL support

[LLVMdev] [cfe-dev] OpenCL support

[LLVMdev] [cfe-dev] OpenCL support

Maybe Matching Threads