thr3ads.net - llvm dev - [llvm-dev] RFC: alloca -- specify address space for allocation [Sep 2015]

Swaroop Sridhar via llvm-dev

2015-Aug-29 04:30 UTC

[llvm-dev] RFC: alloca -- specify address space for allocation

> -----Original Message-----
> From: Philip Reames [mailto:listmail at philipreames.com]
> Sent: Friday, August 28, 2015 9:38 AM
> To: Swaroop Sridhar <Swaroop.Sridhar at microsoft.com>; llvm-dev
> <llvm-dev at lists.llvm.org>; Sanjoy Das <sanjoy at playingwithpointers.com>
> Cc: Joseph Tremoulet <jotrem at microsoft.com>; Andy Ayers
> <andya at microsoft.com>; Russell Hadley <rhadley at microsoft.com>
> Subject: Re: RFC: alloca -- specify address space for allocation
>
>> I think for the use case you are outlining, an addrspacecast is the
>> correct IR model -- you're specifically saying that it is OK in this
>> case to turn a pointer from addrspace 0 into one for addrspace N
>> because N is your "managed pointer" set that can be *either* a
>> GC-pointer or a non-GC-pointer.
>> What the FE is saying is that this is an *acceptable* transition of
>> addrspace, because your language and runtime semantics have provided
>> for it.
>> I think the proper way to say that is with a cast.
>
> The key bit here is that I think Chandler is right.  You are effectively
> casting a stack allocation *into* a managed pointer. Having something to
> mark that transition seems reasonable.

I think there are two views here:

(1) MSIL-level view:
In the CLR, the stack is part of "managed memory" (which is not the same
as the GC heap, which is managed and garbage-collected memory).
Therefore, all *references* to stack locations are "managed addresses,"
in the sense that the compiler/runtime exercises certain control over
values that are managed addresses.
For example, it enforces certain restrictions to guarantee safety (e.g.
lifetime restrictions, non-null requirements in certain contexts, etc.).

This is different from the notion of "unmanaged memory," which is for
interoperability with native code.
*Pointers* to unmanaged memory are not controlled by the runtime (e.g. they do
not provide any safety guarantees).

So, from the language-semantics point of view, stack addresses are created as
managed pointers, which is why the proposal to have alloca allocate directly
in the managed address space seemed natural.

Joseph has written more details in the document that Philip shared out in this
thread.

(2) A lower-level IR view:
LLVM creates all stack locations in addrspace(0) for all code, whether it comes
from managed code or native code.
Of these, stack locations corresponding to the managed stack are promoted to
managed addresses via addrspacecast.
As an optimization, the front end inserts the addrspacecasts only for those
stack locations that are actually address-taken.

If I understand correctly, the recommendation (by Philip, Chandler and David) is
approach (2) because:
(a) No change to the instruction set is necessary when the semantics are
achievable via existing instructions.
(b) It saves changing the optimizer to allocate in the correct address space.
The problem here seems to be that the optimizer is expected to perform
type-preserving transformations by allocating in the correct address space,
but it blindly allocates in the default address space today.
I don't know the LLVM optimizer well enough to have a good estimate of the
magnitude of the changes necessary here. But I agree that avoiding substantial
changes to the optimizer is a strong consideration.

>> You might need N to be a distinct address space from the one used for
>> GC-pointers and to have similar casts emitted by the frontend.

Yes, eventually we'll need to differentiate between:
(i) Pointers to unmanaged memory -- which will never be reported to the runtime
(ii) Pointers to GC-heap objects -- which will always be reported to the runtime
(iii) Generic managed pointers -- which may need to be reported if we cannot
establish that they point outside the GC heap.

Currently we report all pointers to the runtime as managed pointers.
This is inefficient because the GC then needs to do extra work to figure out
what kind of pointer each one is: a pointer to a heap object, a pointer within
a heap object, or a pointer outside the heap.

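To make the three-way distinction above concrete, here is a minimal Python sketch of the reporting rule. It is only an illustration of the classification described in this message, not LLILC or LLVM code; the names `PtrKind` and `must_report`, and the `known_outside_heap` flag, are hypothetical.

```python
from enum import Enum


class PtrKind(Enum):
    """Hypothetical labels for the three pointer kinds (i)-(iii)."""
    UNMANAGED = "unmanaged"   # (i) pointer to unmanaged memory
    GC_OBJECT = "gc-object"   # (ii) pointer to a GC-heap object
    MANAGED = "managed"       # (iii) generic managed pointer


def must_report(kind, known_outside_heap=False):
    """Decide whether a pointer has to be reported to the runtime.

    known_outside_heap is an assumed analysis result: True only when the
    compiler has established that a managed pointer points outside the GC heap.
    """
    if kind is PtrKind.UNMANAGED:
        return False              # never reported
    if kind is PtrKind.GC_OBJECT:
        return True               # always reported
    return not known_outside_heap  # reported unless proven outside the heap


# A managed pointer to a stack slot can be skipped once that fact is known.
stack_ref_reported = must_report(PtrKind.MANAGED, known_outside_heap=True)
unknown_ref_reported = must_report(PtrKind.MANAGED)
```

Reporting everything as kind (iii) with `known_outside_heap=False` corresponds to the current conservative behavior: it is correct, but leaves the classification work to the GC.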
> Of course, having said that all, I'm back to thinking that having a
> marker on the alloca would be somewhat reasonable too.  However, I think
> we need a much stronger justification to change the IR than has been
> provided.  If you can show that the cast based model doesn't work for
> some reason, we can re-evaluate.

I don't think we can say that the cast-based model will not work.
The question is whether alloca addrspace(1)* is a better fit for MSIL semantics,
analysis phases, and managed-code specific optimizations.

I'm OK if we conclude that we'll keep using the cast model until we hit
a concrete case where it does not work, or seems an architectural misfit.

> Worth noting is that we might be better off introducing an orthogonal
> notion for tracking gc references entirely.  The addrspace mechanism has
> worked, but it is a little bit of a hack. We've talked about the need for
> an opaque pointer type.  Maybe when we actually get around to defining
> that, the alloca case is one we should consider.

Yes, I'm mainly concerned about getting the right types on the different
kinds of pointers. If an address space annotation implies more constraints
(e.g. on layout) than what's already necessitated by the type distinction, we
should use a separate mechanism.

Again, I'm OK if we want to keep using addrspacecast until we hit a concrete
case where it breaks down.

Swaroop.

Joseph Tremoulet via llvm-dev

2015-Aug-31 02:45 UTC

MM> It is not clear to me if these GC pointers are actually living in a
different address space when allocated on the stack (aka. you have two different
stacks)
HF> Is the mental model here that a function using this feature actually has
multiple stacks, in different address spaces, and this allows creating local
stack allocations in those non-addrspace-0 stacks?
PR> the reason we originally went with addrspaces to mark GC pointers was
that the managed and non-managed heaps for us are semantically non-overlapping
PR> we might be better off introducing an orthogonal notion for tracking gc
references entirely.  The addrspace mechanism has worked, but it is a little bit
of a hack
CC> these use cases seem really straightforward for address spaces... But
this is very different from the idea of using an alloca with an address space
for GC pointers to stack objects


I think it's an important question whether address spaces are a good fit for
what we're trying to model here.  So I'll explain my mental model of
LLILC's address spaces, and I'd be very interested in feedback on
whether this seems like a good fit for LLVM's address space concept, or a
bastardization thereof:

[preface: it's been my understanding that dereferences of pointers in
different address spaces can alias in semantically meaningful ways; i.e. that
it's appropriate to use different address spaces to model different means of
indexing the same memory.  Some of the comments/questions on this thread seemed
to imply instead an expectation that distinct address spaces always reference
semantically disjoint storage; if that's a hard assumption, then nothing
I'm about to say will make sense and we'll almost certainly need a
different mechanism to model GC pointers]

1. The value of an unmanaged/addrspace(0) pointer is an address (in the virtual
memory available to the process)
2. The value of a managed/addrspace(1) pointer is conceptually an (ID, offset)
pair.  The first component is the "identity" of the object that the
offset component is relative to.  Identity is distinct from address; you could
imagine the GC heap allocator having a counter that it increments with each
allocation, and that an object's identity is the value the counter had when
it was allocated.
3. There are two special reserved IDs for pseudo-objects:
  3a. one reserved ID for the null pseudo-object (i.e. what nullptr points to)
  3b. one reserved ID for the "everything-outside-the-GC-heap"
pseudo-object.  Conceptually this is an infinitely-large object and the data at
offset N in this object is the data at address N (in the virtual memory
available to the process, and with a requirement that N does not correspond to
an address in the GC heap)
4. The benefit of this conceptual model is that garbage collections are
value-preserving, which is why we don't need to model safepoints as
rewriting GC pointers (until the RewriteStatepointsForGC pass where we rewrite
the IR in terms of addresses rather than object identity)
5. We know that the representation of an addrspace(1) pointer into the
outside-the-GC-heap pseudo-object is bit-identical to the representation of an
addrspace(0) pointer to the corresponding address  (question: Doesn't that
mean we can/should be using bitcast instead of addrspacecast when we know the
value in question is an outside-the-GC-heap pointer?)
6. Our source language includes constructs that expose a managed pointer's
address (which have well-defined semantics only if the object in question is
"pinned").  These naturally correspond to addrspace casts (or maybe
need to be additionally constrained?) and are a function of not only the input
pointer but also of the layout of the GC heap and the runtime's pinning
state.  This is where we can get a mix of addrspace(0) and addrspace(1) pointers
whose dereferences alias (in semantically well-defined ways).
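
The conceptual model in items 1-6 can be sketched in a few lines of Python. This is only a toy illustration of the (ID, offset) idea, not how LLILC or LLVM represent pointers; `ManagedPtr`, `ToyHeap`, `from_unmanaged`, the concrete reserved ID values, and the choice of address 0 for null are all assumptions made for the example.

```python
from dataclasses import dataclass

NULL_ID = 0          # assumed reserved ID for the null pseudo-object (item 3a)
OUTSIDE_HEAP_ID = 1  # assumed reserved ID for everything outside the GC heap (item 3b)


@dataclass(frozen=True)
class ManagedPtr:
    """Conceptual addrspace(1) value: an (ID, offset) pair (item 2)."""
    obj_id: int
    offset: int


class ToyHeap:
    """Hypothetical GC heap that hands out identities from a counter."""

    def __init__(self, base):
        self.next_id = OUTSIDE_HEAP_ID + 1
        self.next_addr = base
        self.addr_of = {}  # object ID -> current address of the object

    def allocate(self, size):
        obj_id = self.next_id
        self.next_id += 1
        self.addr_of[obj_id] = self.next_addr
        self.next_addr += size
        return ManagedPtr(obj_id, 0)

    def relocate(self, obj_id, new_addr):
        # A moving collection changes addresses, never identities (item 4).
        self.addr_of[obj_id] = new_addr

    def address(self, ptr):
        """Expose a managed pointer's address (item 6); depends on heap layout."""
        if ptr.obj_id == NULL_ID:
            return 0  # assumption: null is represented as address 0
        if ptr.obj_id == OUTSIDE_HEAP_ID:
            return ptr.offset  # offset N in the pseudo-object is address N (item 5)
        return self.addr_of[ptr.obj_id] + ptr.offset


def from_unmanaged(addr):
    """Model converting an addrspace(0) address that lies outside the GC heap."""
    return ManagedPtr(OUTSIDE_HEAP_ID, addr)


heap = ToyHeap(base=0x1000)
field = ManagedPtr(heap.allocate(32).obj_id, 8)  # interior pointer, offset 8
before = heap.address(field)
heap.relocate(field.obj_id, 0x8000)              # the GC moves the object
after = heap.address(field)                      # address changed, `field` did not
stack_slot = from_unmanaged(0x7FFF0010)          # e.g. the address of an alloca
```

In this sketch the value `field` is unchanged by `relocate`, which is the sense in which collections are value-preserving, while `address(field)` changes; a stack slot wrapped by `from_unmanaged` keeps the same bits as its unmanaged address.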

Again, I'd love feedback on how sane that sounds to those of you familiar
with LLVM's address space notion.

I think that 3b is the part that seems to generate the most
surprise/consternation.  The reason our source language includes the
"outside-the-GC-heap" pseudo-object is that it allows more powerful
code (you could think of it as allowing polymorphism over GC-heap-ness of
inputs) with no adverse effects to the system's typesafety guarantees.

Relating this back to the question of what address space an alloca belongs in
(which I'm doing to contextualize, not in an attempt to continue the
debate), as the stack (of which there is only one) is outside-the-GC-heap, the
question is one of whether you want to conceptualize an alloca as producing an
unmanaged address or as producing a pointer into the outside-the-GC-heap
pseudo-object.  I'd argue that our source language conceptualizes it as the
latter[1], which (as Swaroop points out in his message) is why it would feel natural to
model it that way in LLVM, but of course (as Swaroop also points out there) we
could also decompose the source construct into two steps in LLVM IR.  As far as
compiler-introduced allocas, they of course wouldn't be referenced in the
source, and so we wouldn't have a hard requirement either way.

Thanks
-Joseph

[1] - To be more pedantic/precise:
 - Static allocas are implied by local variable declarations, and those
declarations are not annotated with a pointer type
 - Our source language includes a compact form that can be used to describe a
load or store of a local variable; these compact forms are not annotated with a
pointer type
 - Everywhere else that our source language refers to the address of a local
variable, it uses the managed pointer type





Philip Reames via llvm-dev

2015-Aug-31 17:14 UTC

A couple of minor responses inline, but I think we're mostly in 
agreement here.

On 08/28/2015 09:30 PM, Swaroop Sridhar wrote:
> If I understand correctly, the recommendation (by Philip, Chandler and
> David) is approach (2) because:
> (a) No change to Instruction-Set is necessary when the semantics is
> achievable via existing instructions.
> (b) It saves changing the optimizer to allocate in the correct
> address-space.
> Looks like the problem here is that: the optimizer is expected to create
> type-preserving transformation by allocating in the correct
> address-space, but blindly allocates in the default address space today.
> I don't know the LLVM optimizer well enough to have a good estimate of
> the magnitude of changes necessary here. But, I agree that (avoiding)
> substantial changes to the optimizer is a strong consideration.

Slight restatement: The only legal address space for an alloca is zero
today.  As a result, blindly creating addrspace zero allocas is correct
by construction.  If we introduced additional address spaces for
allocas, then it would become a problem.

>>> You might need N to be a distinct address space from
>>> the one used for GC-pointers and to have similar casts emitted by
>>> the frontend.
> Yes, eventually we'll need to differentiate between:
> (i) Pointers to unmanaged memory -- which will never be reported to the
> runtime
> (ii) Pointers to GC-heap objects -- which will always be reported to the
> runtime
> (iii) Generic managed pointer -- which may need to be reported if we cannot
> establish that it points outside the GC heap.
>
> Currently we report all pointers to the runtime as managed pointers.
> This is inefficient because the GC then needs to do extra work to figure
> out what kind of pointer it is:
> Pointer to a heap object, pointer within a heap object, or outside the
> heap.

Unless I misunderstand you, it really sounds like the distinction
between a 'generic managed pointer' and a 'managed pointer which happens
to point outside the gc heap' is purely an optimization, right?  I'm
generally hesitant to introduce new concepts for optimization benefit
without evidence that the optimization is needed.  I'll note that I have
no direct experience with a language with GC and stack based allocation,
so it's possible I'm underestimating how important this is.

Side note: One of the things that's really bugging me is that you seem
to be optimizing for work performed by the collector at safepoints.  My
mental model is that safepoints are infrequent and that a minor amount
of additional work at the safepoint doesn't really matter.  What am I
missing here?  Polling for a safepoint has to happen pretty frequently,
but we don't need to parse the stack unless we actually call into the
collector, right?

>> Of course, having said that all, I'm back to thinking that having a
>> marker on the alloca would be somewhat reasonable too.  However, I think
>> we need a much stronger justification to change the IR than has been
>> provided.  If you can show that the cast based model doesn't work for
>> some reason, we can re-evaluate.
> I don't think we can say that the cast-based model will not work.
> The question is whether alloca addrspace(1)* is a better fit for MSIL
> semantics, analysis phases, and managed-code specific optimizations.
>
> I'm OK if we conclude that we'll keep using the cast model until we
> hit a concrete case where it does not work, or seems architecturally
> misfit.

Great.  We can revisit if needed.

>> Worth noting is that we might be better off introducing an orthogonal
>> notion for tracking gc references entirely.  The addrspace mechanism has
>> worked, but it is a little bit of a hack. We've talked about the need for
>> an opaque pointer type.  Maybe when we actually get around to defining
>> that, the alloca case is one we should consider.
> Yes, I'm mainly concerned about getting the right types on the
> different kinds of pointers. If address space annotation implies more
> constraints (ex: on layout) than what's already necessitated by the
> type distinction, we should use a separate mechanism.

Me too, honestly.  I think we need to address this relatively soon. Not
immediately, but probably not 5 years from now either.

> Again, I'm OK if we want to keep using addrspacecast until we hit a
> concrete case where it breaks down.
>
> Swaroop.

Andy Ayers via llvm-dev

2015-Aug-31 18:27 UTC

Re "object" vs "managed" pointer....

Given the ability to have GC references in structs, we likely see a higher volume
of live references at a safepoint than what you're used to seeing in Java
(hence our concerns about the IR costs of safepoints and the way on-stack
references will be described). But admittedly GCs are rare.

From an operational standpoint the distinction is indeed an optimization --
managed pointers require extra work during a GC: first, because they may not
point into the GC heap at all; second, because if they do point into the GC
heap the relevant object header must be located; and third, because there is
bookkeeping for the fixup work needed when they do point into the heap and the
objects are relocated during GC.

During our bring-up I believe we're reporting everything as a managed
pointer and we don't expect that to cause any serious problems. Any perf
impact is likely to be swamped right now by other dumb things we're doing
elsewhere.
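
As a rough illustration of the three pieces of extra work described above, the following Python sketch processes one managed (possibly interior) pointer during a relocating collection. It is a toy model under stated assumptions, not the CLR's actual algorithm: `ToyGcHeap`, `find_object`, `fixup_managed_pointer`, and the `forwarding` map from old to new object start addresses are all hypothetical.

```python
import bisect


class ToyGcHeap:
    """Hypothetical heap described as non-overlapping (start, size) objects."""

    def __init__(self, objects):
        self.objects = sorted(objects)
        self.starts = [start for start, _ in self.objects]

    def find_object(self, addr):
        """Return (start, size) of the object containing addr, or None.

        None covers the first case: the pointer is not in the GC heap at all.
        A hit covers the second case: locating the enclosing object.
        """
        i = bisect.bisect_right(self.starts, addr) - 1
        if i < 0:
            return None
        start, size = self.objects[i]
        return (start, size) if addr < start + size else None


def fixup_managed_pointer(heap, addr, forwarding):
    """Return the pointer's value after relocation (the third case)."""
    obj = heap.find_object(addr)
    if obj is None:
        return addr                       # outside the heap: leave untouched
    start, _ = obj
    new_start = forwarding.get(start, start)
    return new_start + (addr - start)     # keep the interior offset


heap = ToyGcHeap([(0x1000, 0x20), (0x2000, 0x40)])
forwarding = {0x1000: 0x5000}             # only the first object moves
moved = fixup_managed_pointer(heap, 0x1008, forwarding)
unmoved = fixup_managed_pointer(heap, 0x2010, forwarding)
on_stack = fixup_managed_pointer(heap, 0x7FFF0010, forwarding)
```

A pointer known to refer to the start of a heap object could skip the `find_object` search entirely, which is the saving the object-versus-managed distinction would buy.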

However there's another angle -- managed pointers are relatively rare and GC
reporting is tricky to get right. We'd like the compiler to make the
strongest assertions possible and the runtime to verify them where it can. So
if we think a pointer refers to the root of a heap object we'd like to describe
it that way for the GC.

At any rate I'm not sure we'd ask for a third class of pointer. We can
likely sort out object vs managed with some late approximate data flow.

-----Original Message-----
From: Philip Reames [mailto:listmail at philipreames.com] 
Sent: Monday, August 31, 2015 10:14 AM
To: Swaroop Sridhar <Swaroop.Sridhar at microsoft.com>; llvm-dev
<llvm-dev at lists.llvm.org>; Sanjoy Das <sanjoy at
playingwithpointers.com>
Cc: Joseph Tremoulet <jotrem at microsoft.com>; Andy Ayers <andya at
microsoft.com>; Russell Hadley <rhadley at microsoft.com>
Subject: Re: RFC: alloca -- specify address space for allocation

A couple of minor responses inline, but I think we're mostly in agreement
here.

On 08/28/2015 09:30 PM, Swaroop Sridhar wrote:>> -----Original Message-----
>> From: Philip Reames [mailto:listmail at philipreames.com]
>> Sent: Friday, August 28, 2015 9:38 AM
>> To: Swaroop Sridhar <Swaroop.Sridhar at microsoft.com>; llvm-dev
<llvm-
>> dev at lists.llvm.org>; Sanjoy Das <sanjoy at
playingwithpointers.com>
>> Cc: Joseph Tremoulet <jotrem at microsoft.com>; Andy Ayers 
>> <andya at microsoft.com>; Russell Hadley <rhadley at
microsoft.com>
>> Subject: Re: RFC: alloca -- specify address space for allocation
>>
>>> I think for the use case you are outlining, an addrspacecast is the
>>> correct IR model -- you're specifically saying that it is OK in
this
>>> case to turn a pointer from addrspace 0 into one for addrspace N 
>>> because N is your "managed pointer" set that can be
*either* a GC-pointer or a non-GC-pointer.
>>> What the FE is saying is that this is an *acceptable* transition of
>>> addrspace, because your language and runtime semantics have
provided for it.
>>> I think the proper way to say that is with a cast.
>> The key bit here is that I think Chandler is right.  You are 
>> effectively casting a stack allocation *into* a managed pointer. 
>> Having something to mark that transition seems reasonable.
> I think there are two views here:
>
> (1) MSIL level view:
> In CLR, the stack, is a part of "managed memory" (which is not
the same as gc-heap, which is managed and garbage-collected memory).
> Therefore, all *references* to stack locations are "managed
addresses,"  in the sense that the compiler/runtime exercises certain
control over (values that are) managed-address:
> For example: it enforces certain restrictions to guarantee safety -- ex:
lifetime restrictions, non-null requirement in certain contexts, etc.
>
> This is different from a notion of "unmanaged memory" which is
for interoperability with native code.
> *Pointers* to unmanaged memory are not controlled by the runtime (ex: do
not provide any safety guarantees).
>
> So, from the language semantics point of view, stack addresses are created
as managed pointers.
> Which is why the proposal is to have alloca directly in the managed
address-space seemed natural.
>
> Joseph has written more details in the document that Philip shared out in
this thread.
>
>   (2) A more Lower level IR view:
> LLVM creates all stack locations in addrespace(0) for all code, whether it
comes from managed-code or native code.
> Of these, Stack locations corresponding to the managed-stack are promoted
to managed-addresses via addrspacecast.
> As an optimization, the FrontEnd inserts the addrspace casts only for those
stack locations that are actually address-taken.
>
> If I understand correctly, the recommendation (by Philip, Chandler and
David) is approach (2) because:
> (a) No change to Instruction-Set is necessary when the semantics is
achievable via existing instructions.
> (b) It saves changing the optimizer to allocate in the correct
address-space.
> Looks like the problem here is that: the optimizer is expected to 
> create type-preserving transformation by allocating in the correct
address-space, but blindly allocates in the default address space today.
> I don't know the LLVM optimizer well enough to have a good estimate of 
> the magnitude of changes necessary here. But, I agree that (avoiding)
substantial changes to the optimizer is a strong consideration.Slight restatement: The only legal address space for an alloca is zero today. 
As a result, blindly creating addrspace zero allocas is correct by construction.
If we introduced additional address spaces for allocas, then it would become a
problem.>
>>> You might need N to be a distinct address space from the one used 
>>> for GC-pointers and to have similar casts emitted by the frontend.
> Yes, eventually we'll need to differentiate between:
> (i) Pointers to unmanaged memory -- which will never be reported to 
> the runtime
> (ii) Pointers to GC-heap objects -- which will always be reported to 
> the runtime
> (iii) Generic managed pointer -- which may need to be reported if we cannot
establish that it points outside the GC heap.
>
> Currently we report all pointers to the runtime as managed pointers.
> This is inefficient because the GC then needs to do extra work to figure
out what kind of pointer it is:
> Pointer to a heap object, pointer within a heap object, or outside the
heap.Unless I misunderstand you, it really sounds like the distinction between a
'generic managed pointer' and a 'managed pointer which happens to
point outside the gc heap' is purely an optimization right?  I'm
generally hesitant to introduce new concepts for optimization benefit without
evidence that the optimization is needed.  I'll note that I have no direct
experience with a language with GC and stack based allocation, so it's
possible I'm underestimating how important this is.

Side note: One of the things that's really bugging me is that you seem to be
optimizing for work performed by the collector at safepoints.  My mental model
is that safepoints are infrequent and that a minor amount of additional work at
the safepoint doesn't really matter.  What am I missing here?  Polling for a
safepoint has to happen pretty frequently, but we don't need to parse the
stack unless we actually call into the collector right?>
>> Of course, having said that all, I'm back to thinking that having a
>> marker on the alloca would be somewhat reasonable too.  However, I 
>> think we need a much stronger justification to change the IR than has 
>> been provided.  If you can show that the cast based model doesn't 
>> work for some reason, we can re-evaluate.
> I don't think we can say that the cast-based model will not work.
> The question is whether alloca addrspace(1)* is a better fit for MSIL 
> semantics, analysis phases, and managed-code specific optimizations.
>
> I'm OK if we conclude that we'll keep using the cast model until we
> hit a concrete case where it does not work, or seems architecturally
> misfit.

Great.  We can revisit if needed.
>
>> Worth noting is that we might be better off introducing an orthogonal 
>> notion for tracking gc references entirely.  The addrspace mechanism 
>> has worked, but it is a little bit of a hack. We've talked about the
>> need for an opaque pointer type.  Maybe when we actually get around 
>> to defining that, the alloca case is one we should consider.
> Yes, I'm mainly concerned about getting the right types on the 
> different kinds of pointers. If address space annotation implies more
> constraints (ex: on layout) than what's already necessitated by the
> type distinction, we should use a separate mechanism.

Me too honestly.  I think we need to address this relatively soon. Not
immediately, but probably not 5 years from now either.
>
> Again, I'm OK if we want to keep using addrspacecast until we hit a 
> concrete case where it breaks down.
>
> Swaroop.
>

Marcello Maggioni via llvm-dev

2015-Sep-01 02:49 UTC

[llvm-dev] RFC: alloca -- specify address space for allocation

Thanks,

this makes the use case much more clear.
That said, much as I would like LLVM to support not assigning any special
meaning to address space 0, your proposal runs somewhat counter to how I have
always thought of address spaces in LLVM.
I also have to say that I don’t know in depth how address spaces are meant to
be used in LLVM, so my view of them may not match the LLVM way.

This is how I see them though:
If in OpenCL, for example, we mark private memory with addrspace(0) and global
memory with addrspace(1), then addrspace 0 or 1 gives me information about
the memory the pointer points to (whether it is private or global memory). So
it tells me something about the “pointee” and not the pointer.
What you are proposing here, if I understood correctly, is the opposite:
addrspace(1) gives us information about the pointer (the fact that it is a
managed pointer) and nothing about the pointee (which is just some address on
the stack, I assume).

Still my idea of address spaces and my understanding of what you are trying to
do could still be completely wrong … :P

Marcello
> On 30 Aug 2015, at 19:45, Joseph Tremoulet via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> MM> It is not clear to me if these GC pointers are actually living in a
> different address space when allocated on the stack (aka. you have two different
> stacks)
> HF> Is the mental model here that a function using this feature actually
> has multiple stacks, in different address spaces, and this allows creating local
> stack allocations in those non-addrspace-0 stacks?
> PR> the reason we originally went with addrspaces to mark GC pointers
> was that the managed and non-managed heaps for us are semantically
> non-overlapping
> PR> we might be better off introducing an orthogonal notion for tracking
> gc references entirely.  The addrspace mechanism has worked, but it is a little
> bit of a hack
> CC> these use cases seem really straight forward for address spaces...
> But this is very different from the idea of using an alloca with an address
> space for GC pointers to stack objects
> 
> 
> I think it's an important question whether address spaces are a good
> fit for what we're trying to model here.  So I'll explain my mental
> model of LLILC's address spaces, and I'd be very interested in feedback
> on whether this seems like a good fit for LLVM's address space concept, or a
> bastardization thereof:
> 
> [preface: it's been my understanding that dereferences of pointers in
> different address spaces can alias in semantically meaningful ways; i.e. that
> it's appropriate to use different address spaces to model different means of
> indexing the same memory.  Some of the comments/questions on this thread seemed
> to imply instead an expectation that distinct address spaces always reference
> semantically disjoint storage; if that's a hard assumption, then nothing
> I'm about to say will make sense and we'll almost certainly need a
> different mechanism to model GC pointers]
> 
> 1. The value of an unmanaged/addrspace(0) pointer is an address (in the
>    virtual memory available to the process)
> 2. The value of a managed/addrspace(1) pointer is conceptually an (ID,
>    offset) pair.  The first component is the "identity" of the object
>    that the offset component is relative to.  Identity is distinct from address;
>    you could imagine the GC heap allocator having a counter that it increments with
>    each allocation, and that an object's identity is the value the counter had
>    when it was allocated.
> 3. There are two special reserved IDs for pseudo-objects:
>  3a. one reserved ID for the null pseudo-object (i.e. what nullptr points
>      to)
>  3b. one reserved ID for the "everything-outside-the-GC-heap"
>      pseudo-object.  Conceptually this is an infinitely-large object and the data at
>      offset N in this object is the data at address N (in the virtual memory
>      available to the process, and with a requirement that N does not correspond to
>      an address in the GC heap)
> 4. The benefit of this conceptual model is that garbage collections are
>    value-preserving, which is why we don't need to model safepoints as
>    rewriting GC pointers (until the RewriteStatepointsForGC pass where we rewrite
>    the IR in terms of addresses rather than object identity)
> 5. We know that the representation of an addrspace(1) pointer into the
>    outside-the-GC-heap pseudo-object is bit-identical to the representation of an
>    addrspace(0) pointer to the corresponding address  (question: Doesn't that
>    mean we can/should be using bitcast instead of addrspacecast when we know the
>    value in question is an outside-the-GC-heap pointer?)
> 6. Our source language includes constructs that expose a managed
>    pointer's address (which have well-defined semantics only if the object in
>    question is "pinned").  These naturally correspond to addrspace casts
>    (or maybe need to be additionally constrained?) and are a function of not only
>    the input pointer but also of the layout of the GC heap and the runtime's
>    pinning state.  This is where we can get a mix of addrspace(0) and addrspace(1)
>    pointers whose dereferences alias (in semantically well-defined ways).
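
[Editorial illustration, not part of the original message.] The (ID, offset) model in points 1-6 can be sketched in a few lines of Python. This is only a toy under stated assumptions: the names `ManagedPointer`, `ToyHeap`, `from_unmanaged`, the concrete reserved ID values, and the choice to map null to address 0 are all hypothetical and do not come from LLILC or LLVM. It shows why a moving collection is value-preserving for managed pointers while the addresses behind them change.

```python
from dataclasses import dataclass

# Hypothetical reserved identities for the two pseudo-objects (point 3).
NULL_ID = 0          # 3a: the null pseudo-object
OUTSIDE_HEAP_ID = 1  # 3b: the "everything-outside-the-GC-heap" pseudo-object


@dataclass(frozen=True)
class ManagedPointer:
    """Conceptual addrspace(1) value: an (ID, offset) pair (point 2)."""
    object_id: int
    offset: int


class ToyHeap:
    """Toy GC heap that hands out object identities from a counter."""

    def __init__(self, base):
        self._next_id = 2      # 0 and 1 are reserved for pseudo-objects
        self._next_addr = base
        self._base_of = {}     # object_id -> current base address

    def allocate(self, size):
        obj_id = self._next_id
        self._next_id += 1
        self._base_of[obj_id] = self._next_addr
        self._next_addr += size
        return ManagedPointer(obj_id, 0)

    def relocate(self, delta):
        """Simulate a moving collection: every heap object changes address."""
        for obj_id in self._base_of:
            self._base_of[obj_id] += delta

    def address_of(self, ptr):
        """Expose the address behind a managed pointer (point 6).

        For heap objects this is only meaningful while the object is pinned.
        """
        if ptr.object_id == NULL_ID:
            return 0           # assumption: null maps to address 0
        if ptr.object_id == OUTSIDE_HEAP_ID:
            return ptr.offset  # offset N names address N (3b)
        return self._base_of[ptr.object_id] + ptr.offset


def from_unmanaged(address):
    """Model casting an addrspace(0) address (e.g. a stack slot) to addrspace(1)."""
    return ManagedPointer(OUTSIDE_HEAP_ID, address)
```

In this toy, calling `relocate` changes what `address_of` returns for a heap object but leaves every `ManagedPointer` value untouched, which is the property point 4 relies on. A pointer built with `from_unmanaged` keeps its address as its offset, mirroring the bit-identical representation described in point 5.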
> 
> Again, I'd love feedback on how sane that sounds to those of you
> familiar with LLVM's address space notion.
> 
> I think that 3b is the part that seems to generate the most
> surprise/consternation.  The reason our source language includes the
> "outside-the-GC-heap" pseudo-object is that it allows more powerful
> code (you could think of it as allowing polymorphism over GC-heap-ness of
> inputs) with no adverse effects to the system's typesafety guarantees.
> 
> Relating this back to the question of what address space an alloca belongs
> in (which I'm doing to contextualize, not in an attempt to continue the
> debate), as the stack (of which there is only one) is outside-the-GC-heap, the
> question is one of whether you want to conceptualize an alloca as producing an
> unmanaged address or as producing a pointer into the outside-the-GC-heap
> pseudo-object.  I'd argue that our source language conceptualizes it as the
> latter[1], which (as Swaroop points out below) is why it would feel natural to
> model it that way in LLVM, but of course (as Swaroop also points out below) we
> could also decompose the source construct into two steps in LLVM IR.  As far as
> compiler-introduced allocas, they of course wouldn't be referenced in the
> source, and so we wouldn't have a hard requirement either way.
> 
> Thanks
> -Joseph
> 
> [1] - To be more pedantic/precise:
> - Static allocas are implied by local variable declarations, and those
>   declarations are not annotated with a pointer type
> - Our source language includes a compact form that can be used to describe
>   a load or store of a local variable; these compact forms are not annotated with
>   a pointer type
> - Everywhere else that our source language refers to the address of a local
>   variable, it uses the managed pointer type
> 

Swaroop Sridhar via llvm-dev

2015-Sep-02 00:47 UTC

[llvm-dev] RFC: alloca -- specify address space for allocation

> On 08/28/2015 09:30 PM, Swaroop Sridhar wrote:
> > Yes, eventually we'll need to differentiate between:
> > (i) Pointers to unmanaged memory -- which will never be reported to
> > the runtime
> > (ii) Pointers to GC-heap objects -- which will always be reported to
> > the runtime
> > (iii) Generic managed pointer -- which may need to be reported if we
> > cannot establish that it points outside the GC heap.
> >
> > Currently we report all pointers to the runtime as managed pointers.
> > This is inefficient because the GC then needs to do extra work to figure out
> > what kind of pointer it is:
> > Pointer to a heap object, pointer within a heap object, or outside the heap.
>
> Side note: One of the things that's really bugging me is that you seem to be
> optimizing for work performed by the collector at safepoints.  My mental
> model is that safepoints are infrequent and that a minor amount of
> additional work at the safepoint doesn't really matter.  What am I missing
> here?  Polling for a safepoint has to happen pretty frequently, but we don't
> need to parse the stack unless we actually call into the collector right?
Regarding the impact of GC time on performance, I just wanted to add that
this really depends on the workload. We've seen some workloads
(ex: compiling large projects using a compiler written in managed code) where
the GC time was a significant portion of the overall execution time.
In some benchmarks, the particular issue of precisely reporting
(iii) managed pointer vs (ii) object pointer in the GCTables did impact
performance.

Anyway, we'll use the conservatively correct method of reporting all
GC pointers as (iii) generic managed pointers for bring-up. We can come back
to the distinction between (ii) and (iii) while tuning the performance.
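
[Editorial illustration, not part of the original message.] The reporting policy described above can be sketched as a small Python function. This is a toy under assumptions: `PointerKind`, `report_kind`, and the `precise` and `proven_outside_heap` parameters are hypothetical names, not LLILC or CLR APIs. With `precise=False` it models the conservative bring-up mode, where every managed pointer is recorded as kind (iii). With `precise=True` it models the later tuning step that distinguishes (ii) from (iii).

```python
from enum import Enum


class PointerKind(Enum):
    UNMANAGED = "i"          # never reported to the runtime
    GC_OBJECT = "ii"         # always reported; known to point at a heap object
    GENERIC_MANAGED = "iii"  # reported unless proven to point outside the GC heap


def report_kind(kind, proven_outside_heap=False, precise=False):
    """Return the kind to record in the GC tables, or None to skip reporting."""
    if kind is PointerKind.UNMANAGED:
        return None
    if not precise:
        # Conservative bring-up: report every managed pointer as generic,
        # leaving the GC to work out what it actually points at.
        return PointerKind.GENERIC_MANAGED
    if kind is PointerKind.GENERIC_MANAGED and proven_outside_heap:
        # Precise mode: a pointer known to be outside the heap needs no report.
        return None
    return kind
```

The conservative path is always safe because a generic managed pointer may legally point anywhere. The precise path only reduces the work the collector does to classify each reported pointer.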

Thanks,
Swaroop.