On Mon, Dec 6, 2010 at 6:16 PM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
>> -----Original Message-----
>> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
>> On Behalf Of Peter Collingbourne
>> Sent: Monday, December 06, 2010 2:56 PM
>> To: David Neto
>> Cc: cfe-dev at cs.uiuc.edu; llvmdev at cs.uiuc.edu
>> Subject: Re: [LLVMdev] [cfe-dev] OpenCL support
>>
>> Hi David,
>>
>> On Mon, Dec 06, 2010 at 11:14:42AM -0500, David Neto wrote:
>> > What do I think your patch should look like? It's true that the
>> > diag::err_as_qualified_auto_decl is inappropriate for OpenCL when it's
>> > the __local address space.
>> >
>> > But we need to implement the semantics somehow. Conceptually I think
>> > of it as a CL source-to-source transformation that lowers
>> > function-scope local-address-space variables into a more primitive
>> > form.
>> >
>> > I think I disagree that Clang is an inappropriate spot for
>> > implementing this type of transform: Clang "knows" the source-language
>> > semantics, and has a lot of the machinery required for the transform.
>> > Also, Clang knows a lot about the target machine (e.g. type
>> > sizes, builtins, more?).
>> >
>> > So I believe the "auto var in a different address space" case should be
>> > allowed in the AST in the OpenCL case, and the local-lowering
>> > transform should be applied in CodeGen. Perhaps the lowering is
>> > target-specific, e.g. GPU-style, or more generic, as I proposed.
>> >
>> > Thoughts?
>>
>> I've been rethinking this and perhaps coming around to this way
>> of thinking. Allocating variables in the __local address space
>> is really something that can't be represented at the LLVM level,
>> at least in a standard form.
>
> [Villmow, Micah] We ran across this problem in our OpenCL implementation.
> However, you can create a global variable with a '__local' address space
> and it works fine. There is an issue with collisions between auto-arrays
> in different kernels, but that can be solved with a little name mangling.
> There are other ways to do this, for example, by converting local
> auto-arrays into kernel local pointer arguments with a known size.

Here's a little example to show the direction I was heading, with an illustration as a CL-to-C translation. I believe there are no namespace issues, but otherwise it is essentially the same as the global-variable solution.

The idea is that the function-scope __local variables are like a stack frame that is shared between the different work items in a group. So collect all those variables in an anonymous struct, and then create a function-scope private variable that points to the one copy of that struct. The pointer is returned by a system-defined intrinsic function that depends on the current work item. (The system knows which work groups are in flight, which is why you need a system-defined intrinsic.)

So a kernel function like this:

void foo(__global int *A) {
  __local int vint;
  __local int *vpint;
  __local int const *vcpint;
  __local int volatile vvint;
  int a = A[0];
  vint = a;
  vvint = a;
  int a2 = vint;
  int va2 = vvint;
  barrier(CLK_LOCAL_MEM_FENCE);
  A[0] = a2 + va2;
}

is translated to this, which does pass through Clang, with __local meaning __attribute__((address_space(2))):

extern __local void *__get_work_group_local_base_addr(void); // intrinsic

void foo(__global int *A) {
  __local struct __local_vars_s {
    int vint;
    int *vpint;
    int const *vcpint;
    int volatile vvint;
  } * const __local_vars
      // This is a *private* variable pointing to *local* addresses.
      // It is a const pointer because it should never change; being const
      // may also expose optimizations.
      = __get_work_group_local_base_addr(); // the new intrinsic
  int a = A[0];
  __local_vars->vint = a;       // l-values are translated as memory stores
  __local_vars->vvint = a;
  int a2 = __local_vars->vint;  // r-values are translated as memory loads
  int va2 = __local_vars->vvint;
  barrier(CLK_LOCAL_MEM_FENCE);
  A[0] = a2 + va2;
}

As an extension, the backend ought to be able to use some smarts to simplify this down in simple cases. For example, if the system only ever allows one work group at a time, the intrinsic could boil down to returning a constant, and link-time optimization can then scrub away the unneeded work. Similarly, if you have a GPU-style environment where (as Peter described) the "local" addresses have the same integer value in every group but point to different storage, then again the intrinsic returns a constant and again LTO optimizes the result.

I haven't thought through the implications of a kernel with such variables calling another kernel with such variables. At least the OpenCL spec says that the behaviour is implementation-defined in that case. It would be nice to be able to represent any of the sane possibilities.

@Anton: Regarding ARM's open-sourcing: I'm glad to see the reaffirmation, and I look forward to the contribution. Yes, I understand the virtues of patience. :-)
I assume you plan to commit a document describing how OpenCL is supported (e.g. how details like the above are handled).

thanks,
david
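To make the simplification described above concrete, here is a rough sketch (not from the thread, and not how any particular implementation works) of what foo() could reduce to once the intrinsic has been folded to a constant on a hypothetical target that keeps at most one work group resident. The 0x1000 base address and the struct name are made up for illustration.

// Sketch only: assumes a hypothetical target where the shared work-group
// frame sits at a fixed, known offset in local memory, so the intrinsic
// call folds to a constant and LTO can clean up the rest.
struct __local_vars_s {
  int vint;
  int *vpint;
  int const *vcpint;
  int volatile vvint;
};

void foo(__global int *A) {
  __local struct __local_vars_s * const __local_vars =
      (__local struct __local_vars_s *)0x1000;  // folded intrinsic result (illustrative)
  int a = A[0];
  __local_vars->vint = a;
  __local_vars->vvint = a;
  int a2 = __local_vars->vint;
  int va2 = __local_vars->vvint;
  barrier(CLK_LOCAL_MEM_FENCE);
  A[0] = a2 + va2;
}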
I would reconsider Micah's suggestion. The simple solution is to tag the variable with an address space and turn it into a global. You can do that with a simple change in CodeGenFunction::CreateStaticBlockVarDecl. It would give all the benefits you describe, in that the target decides how to lower the code, but it does so using concepts that LLVM and some targets may already support. Kernels that call kernels with locals will also work.

Perhaps an example is useful? Our OpenCL implementation, given the code above, generates this bitcode (after optimizations that have eliminated the dead variables):

target datalayout = "e-p:32:32:32-f64:64:64-i64:64:64"
target triple = "zms-ziilabs-opencl10"

@foo.auto.vint = internal addrspace(2) global i32 0, align 4
@foo.auto.vvint = internal addrspace(2) global i32 0, align 4

define void @foo(i32 addrspace(1)* %A) nounwind {
entry:
  %tmp1 = load i32 addrspace(1)* %A, align 4
  store i32 %tmp1, i32 addrspace(2)* @foo.auto.vint, align 4
  volatile store i32 %tmp1, i32 addrspace(2)* @foo.auto.vvint, align 4
  %tmp7 = volatile load i32 addrspace(2)* @foo.auto.vvint, align 4
  tail call void @llvm.memory.barrier(i1 true, i1 true, i1 true, i1 true, i1 false)
  %add = add nsw i32 %tmp7, %tmp1
  store i32 %add, i32 addrspace(1)* %A, align 4
  ret void
}

Krister
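At the source level, the promotion that this bitcode illustrates amounts to the rewrite sketched below. Note that this is only an illustration: program-scope __local variables are not legal OpenCL C, which is exactly why the rewrite happens at the IR level rather than in source; the hoisted names simply mirror the @foo.auto.* globals above.

// Illustrative only; not valid OpenCL C source. Each function-scope __local
// variable is hoisted to module scope in the __local (addrspace(2)) address
// space, with the kernel name folded into the variable name to avoid
// collisions across kernels.
__local int foo_auto_vint;            // was: __local int vint;          (function scope)
__local int volatile foo_auto_vvint;  // was: __local int volatile vvint;

void foo(__global int *A) {
  int a = A[0];
  foo_auto_vint = a;                  // stores and loads now target the hoisted globals
  foo_auto_vvint = a;
  int a2 = foo_auto_vint;
  int va2 = foo_auto_vvint;
  barrier(CLK_LOCAL_MEM_FENCE);
  A[0] = a2 + va2;
}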
> -----Original Message-----
> From: David Neto [mailto:dneto.llvm at gmail.com]
> Sent: Tuesday, December 07, 2010 1:03 PM
> To: Villmow, Micah
> Cc: cfe-dev at cs.uiuc.edu; llvmdev at cs.uiuc.edu
> Subject: Re: [cfe-dev] [LLVMdev] OpenCL support
>
> So a kernel function like this:
>
> void foo(__global int *A) {
>   __local int vint;
>   __local int *vpint;
>   __local int const *vcpint;
>   __local int volatile vvint;
>   int a = A[0];
>   vint = a;
>   vvint = a;
>   int a2 = vint;
>   int va2 = vvint;
>   barrier(CLK_LOCAL_MEM_FENCE);
>   A[0] = a2 + va2;
> }

[Villmow, Micah] This example is incorrect. There is a race condition between the writes to vint and vvint and the reads from vint/vvint. The reason is that all work items in a work-group share the memory that vint is allocated in. So if you have two work items in a work-group, both are writing to the same memory location and you don't know which one wrote to it. Also, the barrier is in the wrong location: you need to verify that all work items have finished writing to the local variable before you can safely start reading from it.

A quick example using two work items running in parallel:

Cycle   WI 1          WI 2
  1     read A[0]
  2                   read A[0]
  3     write vint
  4                   write vint
  5     write vvint
  6     read vint
  7                   write vvint
  8                   read vint
 10     read vvint
 11                   read vvint
 12     barrier
 13                   barrier

It is pretty easy to see why this is a race condition: first, there is no synchronization between the stores and the loads, and second, both work items are writing to the same memory. This is even more troublesome on AMD GPUs, where we run up to 256 work items in parallel. I would keep this in mind when attempting to define some standard way to do this.

> is translated to this, which does pass through Clang, with __local
> meaning __attribute__((address_space(2))):
>
> extern __local void *__get_work_group_local_base_addr(void); // intrinsic
>
> void foo(__global int *A) {
>   __local struct __local_vars_s {
>     int vint;
>     int *vpint;
>     int const *vcpint;
>     int volatile vvint;
>   } * const __local_vars
>       // This is a *private* variable pointing to *local* addresses.
>       // It is a const pointer because it should never change; being const
>       // may also expose optimizations.
>       = __get_work_group_local_base_addr(); // the new intrinsic
>   int a = A[0];
>   __local_vars->vint = a;       // l-values are translated as memory stores
>   __local_vars->vvint = a;
>   int a2 = __local_vars->vint;  // r-values are translated as memory loads
>   int va2 = __local_vars->vvint;
>   barrier(CLK_LOCAL_MEM_FENCE);
>   A[0] = a2 + va2;
> }

[Villmow, Micah] I'd prefer not to embed them in a structure, for the simple reason that it isn't the way we do it. What we do for OpenCL at AMD on the GPU is turn it into something like the following:

@cllocal_foo_vint int* addrspace(2) [1 x i32];            <-- forgive my pseudo LLVM IR
@cllocal_foo_vvint int* addrspace(2) volatile [1 x i32];

void foo(__global int *A) {
  __local int *vint = &cllocal_foo_vint;
  __local int *vpint;                  <-- this is just a pointer and requires no modification
  __local int const *vcpint;           <-- same here; only arrays and variables require modification
  __local int volatile *vvint = &cllocal_foo_vvint;
  int a = A[0];
  *vint = a;
  *vvint = a;
  int a2 = *vint;
  int va2 = *vvint;
  A[0] = a2 + va2;
}

Also, don't forget there is a difference between "local int a" and "local int *a": the former actually requires memory to be allocated, while the latter is just a pointer to some pre-allocated memory. Any implementation should ignore local pointer variables and only deal with local scalar, vector and array variables.
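To spell out that last distinction, here is a small sketch (illustrative names and sizes, not from the thread): only __local objects need storage carved out per work-group, while __local pointers are ordinary private variables.

void bar(__global int *A, __local int *scratch) {
  // 'scratch' is a kernel argument: the host/runtime sizes and allocates it,
  // so the compiler sees only a pointer and allocates nothing here.
  __local int count;     // needs per-work-group storage (one int)
  __local int buf[64];   // needs per-work-group storage (64 ints)
  __local int *p = buf;  // just a private pointer into local memory;
                         // no local storage is allocated for 'p' itself
  // Synchronization is omitted; this only illustrates storage allocation.
  count = get_local_id(0);
  p[get_local_id(0) % 64] = count;
  A[get_global_id(0)] = scratch[0] + p[0];
}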
> I haven't thought through the implications of a kernel with such
> variables calling another kernel with such variables. At least the
> OpenCL spec says that the behaviour is implementation-defined in that
> case. It would be nice to be able to represent any of the sane
> possibilities.

[Villmow, Micah] A kernel with locally defined variables cannot call another kernel with local variables. This is illegal in OpenCL because when a kernel calls another kernel, the called kernel is treated as a normal function, and a normal function is not allowed to have local variables declared inside its body. Basically, this restricts locally defined variables to the top-level kernel function only.

Micah
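A sketch of the restriction being described (illustrative kernels; all names are made up):

// When kernel_b is called from kernel_a it behaves as an ordinary function,
// and ordinary functions may not declare __local variables in their bodies,
// so its __local declaration is implementation-defined at best.
__kernel void kernel_b(__global int *A) {
  __local int tmp[64];               // fine when kernel_b is the entry point
  tmp[get_local_id(0) % 64] = A[get_global_id(0)];
  barrier(CLK_LOCAL_MEM_FENCE);
  A[get_global_id(0)] = tmp[0];
}

__kernel void kernel_a(__global int *A) {
  kernel_b(A);                       // kernel_b is now just a normal function call
}

// The portable pattern: declare the __local storage in the top-level kernel
// and pass it down to helpers as a __local pointer argument.
void helper(__global int *A, __local int *tmp) {
  tmp[get_local_id(0) % 64] = A[get_global_id(0)];
  barrier(CLK_LOCAL_MEM_FENCE);
  A[get_global_id(0)] = tmp[0];
}

__kernel void kernel_c(__global int *A) {
  __local int tmp[64];
  helper(A, tmp);
}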
On Wed, Dec 8, 2010 at 9:00 PM, Villmow, Micah <Micah.Villmow at amd.com> wrote:
> [Villmow, Micah] This example is incorrect. There is a race condition
> between the writes to vint and vvint and the reads from vint/vvint.

This is a bit of a side issue for this thread. Yes, I understand the race conditions, but thank you for being clear. My point was to show a translation scheme for the code: in particular, how storage is allocated and how reads and writes are handled. The original code was stupid. The job of the compiler is not to fix the semantics of the code it is translating, but to translate it faithfully. :-)

cheers,
david
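As an aside, here is one way (a sketch only; nothing in the thread proposes this exact code) to restructure the original example so the __local traffic is race-free: let one designated work item perform the writes and place the barrier between the writes and the reads.

void foo_racefree(__global int *A) {
  __local int vint;
  __local int volatile vvint;
  int a = A[0];
  if (get_local_id(0) == 0) {    // single designated writer per work-group
    vint = a;
    vvint = a;
  }
  barrier(CLK_LOCAL_MEM_FENCE);  // all writes complete before any work item reads
  int a2 = vint;
  int va2 = vvint;
  if (get_local_id(0) == 0) {    // avoid every work item storing to A[0];
    A[0] = a2 + va2;             // still assumes a single work-group touches A[0]
  }
}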