Peter Collingbourne
2011-Oct-14 17:40 UTC
[LLVMdev] [cfe-dev] RFC: Representation of OpenCL Memory Spaces
On Thu, Oct 13, 2011 at 04:14:09PM -0400, Justin Holewinski wrote:
> On Thu, Oct 13, 2011 at 11:57 AM, Peter Collingbourne <peter at pcc.me.uk> wrote:
> >
> > Hi Justin,
> >
> > Thanks for bringing this up, I think it's important to discuss
> > these issues here.
> >
> > On Thu, Oct 13, 2011 at 09:46:28AM -0400, Justin Holewinski wrote:
> > > It is becoming increasingly clear to me that LLVM address spaces
> > > are not the general solution to OpenCL/CUDA memory spaces. They
> > > are a convenient hack to get things working in the short term,
> > > but I think a more long-term approach should be discussed and
> > > decided upon now, before the OpenCL and CUDA implementations in
> > > Clang/LLVM get too mature. To be clear, I am not advocating that
> > > *targets* change to a different method for representing device
> > > memory spaces. The current use of address spaces to represent
> > > different types of device memory is perfectly valid, IMHO.
> > > However, this knowledge should not be encoded in front-ends and
> > > pre-SelectionDAG optimization passes.
> >
> > I disagree. The targets should expose all the address spaces they
> > provide, and the frontend should know about the various address
> > spaces it needs to know about. It is incumbent on the frontend to
> > deliver valid IR for a particular language implementation, and
> > part of that involves knowing the ABI requirements of the language
> > implementation (which may involve using specific address spaces)
> > and the capabilities of each target (including the capabilities of
> > the target's address spaces), together with the language
> > semantics. It is not the job of the optimisers or backend to know
> > the semantics of a specific language, a specific implementation of
> > that language, or a specific ABI.
>
> But this assumes that there is a valid one-to-one mapping between
> OpenCL memory spaces and a target's back-end address spaces. What
> happens for a target such as x86? Do we introduce pseudo address
> spaces into the back-end just to satisfy the front-end OpenCL
> requirements?

I don't see how anything I wrote implies that. For x86, there would
presumably be a many-to-one mapping.

> > This presupposes that we need a way of representing OpenCL address
> > spaces in IR targeting x86 (and targets which lack GPU-like
> > address spaces). As far as I can tell, the only real
> > representations of OpenCL address spaces that we need on such
> > targets are a way of distinguishing the different address spaces
> > for alias analysis and a representation for __local variables
> > allocated on the stack. TBAA metadata would solve the first
> > problem, and we already have mechanisms in the frontend that could
> > be used to solve the second.
>
> Which mechanisms could be used to differentiate between
> thread-private and __local data?

In OpenCL C, it is illegal to declare a variable with static storage
duration in the __private address space (section 6.5: "All program
scope variables must be declared in the __constant address space.";
section 6.8g: "The extern, static, auto and register storage-class
specifiers are not supported."). This implies that there is no way
for pointers into the __private address space to be usefully shared
between work-items without invoking undefined behaviour, so the
question is moot (i.e. __private does not need to be implemented
using thread-local storage).
It is possible to write OpenCL C code which shares pointers to
__private memory using barrier synchronisation, but since there is no
way to queue a memory fence across __private memory (only __local and
__global), any access to that memory would invoke undefined
behaviour. For example, consider the following (2 work-items in a
work-group):

  __kernel void foo() {
    int x = 0;
    int *__local p;
    if (get_local_id(0) == 0) p = &x;
    barrier(CLK_LOCAL_MEM_FENCE);
    if (get_local_id(0) == 1) *p = 1;
    barrier(CLK_LOCAL_MEM_FENCE);
    // what is the value of x in work-item 0 here?
  }

The value of x at the comment is undefined, because no fence across
__private memory was queued.

Perhaps more straightforwardly, referring to the following passage in
section 3.3 ("memory model") of the OpenCL specification:

  "Private Memory: A region of memory private to a work-item.
  Variables defined in one work-item's private memory are not visible
  to another work-item."

We can interpret the term "not visible" here as meaning that accesses
across work-items invoke undefined behaviour, so in the example
above, the write to x via p would itself be undefined.

Thanks,
--
Peter
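To make the TBAA suggestion above concrete: a minimal sketch of what
such metadata might look like when targeting x86, where all OpenCL
address spaces would presumably lower to the default LLVM address
space 0. The root and tag names here are hypothetical, and modern
struct-path TBAA syntax is used; distinct sibling tags under a common
root let alias analysis conclude that a __global access can never
alias a __local one:

  define void @kernel_body(float* %g, float* %l) {
  entry:
    ; load from __global memory, store to __local memory; the TBAA
    ; tags record the OpenCL address space even though both pointers
    ; live in LLVM address space 0 on x86
    %v = load float, float* %g, !tbaa !3
    store float %v, float* %l, !tbaa !4
    ret void
  }

  !0 = !{!"OpenCL address spaces"}   ; hypothetical TBAA root
  !1 = !{!"__global", !0, i64 0}     ; one scalar type node per space
  !2 = !{!"__local", !0, i64 0}
  !3 = !{!1, !1, i64 0}              ; access tags used on the
  !4 = !{!2, !2, i64 0}              ; instructions above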
Justin Holewinski
2011-Oct-15 01:13 UTC
[LLVMdev] [cfe-dev] RFC: Representation of OpenCL Memory Spaces
On Fri, Oct 14, 2011 at 10:40 AM, Peter Collingbourne <peter at pcc.me.uk> wrote:
> [...]
>
> Perhaps more straightforwardly, referring to the following passage
> in section 3.3 ("memory model") of the OpenCL specification:
>
>   "Private Memory: A region of memory private to a work-item.
>   Variables defined in one work-item's private memory are not
>   visible to another work-item."
>
> We can interpret the term "not visible" here as meaning that
> accesses across work-items invoke undefined behaviour, so in the
> example above, the write to x via p would itself be undefined.

I was referring more to the front-end aspects here. Let's say we
have:

  __kernel void foo() {
    float privateBuffer[8];
    __local float localBuffer[8];
  }

What mechanisms, other than address spaces, can we use to tell the
x86 back-end that privateBuffer is private to the thread, and
localBuffer is shared among all threads in a work-group?

> Thanks,
> --
> Peter

--
Thanks,

Justin Holewinski
Peter Collingbourne
2011-Oct-15 01:34 UTC
[LLVMdev] [cfe-dev] RFC: Representation of OpenCL Memory Spaces
On Fri, Oct 14, 2011 at 06:13:54PM -0700, Justin Holewinski wrote:
> [...]
>
> I was referring more to the front-end aspects here. Let's say we
> have:
>
>   __kernel void foo() {
>     float privateBuffer[8];
>     __local float localBuffer[8];
>   }
>
> What mechanisms, other than address spaces, can we use to tell the
> x86 back-end that privateBuffer is private to the thread, and
> localBuffer is shared among all threads in a work-group?

There is no need to tell the x86 backend that privateBuffer is
private to the thread. For the reasons I explained, there is no way
for work-items to usefully get pointers to other work-items'
privateBuffer objects, so as long as privateBuffer is allocated as an
automatic variable (i.e. on the stack), no other special treatment is
required.

As for localBuffer, the IR generator would emit accesses to __local
variables in an implementation-specific way, and it already contains
a mechanism for doing so. I explained the CGOpenCLRuntime class that
is used for this in more detail in this mailing list post:

http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20110815/045187.html

The "hidden pointer argument" technique is the one most suited to
x86, but it has not actually been implemented.

Thanks,
--
Peter
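To make the "hidden pointer argument" technique concrete: a
hypothetical sketch (no actual implementation exists, as noted above)
of how Justin's kernel might be lowered for x86. The runtime would
allocate one block of __local memory per work-group and pass its
address to the kernel as an extra parameter, while the __private
array remains an ordinary stack allocation:

  ; __local float localBuffer[8] becomes a hidden parameter pointing
  ; into a per-work-group block allocated by the OpenCL runtime
  define void @foo(float* %localBuffer) {
  entry:
    ; float privateBuffer[8] stays an automatic variable, private to
    ; each work-item's stack
    %privateBuffer = alloca [8 x float]
    ; ... kernel body ...
    ret void
  }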