Peter Collingbourne
2014-Apr-07 03:52 UTC
[LLVMdev] Proposal: Loads/stores with deterministic trap/unwind behavior
On Sat, Apr 05, 2014 at 12:21:17AM -0700, Andrew Trick wrote:
> On Mar 31, 2014, at 6:58 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
> > Hi,
> >
> > I wanted to propose an IR extension that would allow us to support
> > zero-cost exception handling for non-call operations that may trap. I
> > wanted to start with loads and stores through a null pointer, and later
> > we might extend this to div/rem/mod zero. This feature is obviously
> > useful for implementing languages such as Java and Go which
> > deterministically translate such operations into exceptions which may
> > be caught by the user.
> >
> > There are a couple of somewhat orthogonal features that this would entail:
> >
> > 1) Deterministic handling for loads and stores through a null pointer.
> > 2) Ability to unwind a load/store to a specific basic block, like invoke.
> >
> > At the moment, we do not exactly have 1), as the optimizer considers
> > non-volatile loads/stores through a null pointer to have undefined
> > behavior. Volatile loads/stores are closer, but they come with their own
> > set of baggage that can inhibit optimization. (For example, if we can
> > prove that a load would always succeed, 'volatile' prevents us from
> > reordering the load or deleting it if it is dead.) So I propose to add
> > an attribute to 'load' and 'store', which we can call, say, 'nullcheck',
> > with the following additional semantics:
> >
> > - If the memory address is between zero and a target-defined value (i.e.
> >   the size of the zero page) the instruction is guaranteed to trap in a
> >   target-defined manner.
> >
> > - The optimizer may only delete or reorder nullcheck instructions if the
> >   program cannot observe such a transformation by installing a signal
> >   handler for the trap. Therefore, the optimizer would be able to drop
> >   the attribute if it can prove that the address will always be non-null.
> >
> > To support 2), I propose a couple of new instructions. I haven't come up
> > with great names for these instructions, but:
> >
> > - 'iload' is to 'load' as 'invoke' is to 'call'. That is, the
> >   instruction is a terminator and has normal and unwind destinations. e.g.
> >
> >   %v = iload i8* %ptr to label %try.cont unwind label %lpad
> >
> > - Similarly, 'istore' is to 'store' as 'invoke' is to 'call'.
> >
> >   istore i8 %v, i8* %ptr to label %try.cont unwind label %lpad
> >
> > These instructions always have 'nullcheck' semantics, plus:
> >
> > - If the instruction traps and the program has installed a signal
> >   handler for the trap which unwinds, the unwind is guaranteed to land
> >   at the landing pad.
> >
> > I've been working on an implementation of 'iload' and 'istore' which are
> > in the attached patches, if you are interested. (They aren't ready to go
> > in yet.) I have asm parsing/printing for both, and code generation for
> > 'iload'. Would be interested in getting feedback on code generation as
> > this is my first serious foray into the backend -- I haven't tried
> > running the generated code yet and the DAG builder is a mashup of the
> > DAG builders for 'invoke' and 'load', but I've eyeballed the asm it
> > generates (e.g. llc produces iload-exception.s for the attached
> > iload-exception.ll) and it looks reasonable.
>
> Hi Peter. All due respect, I don't think it's right to introduce new
> load/store instructions with equivalent semantics to and+icmp+br+load/store.

I don't think the new instructions have equivalent semantics. If the null check fails with the iload/istore instructions, we need to throw the appropriate language-specific exception and evaluate it against the landing pad. There may also need to be an active exception that can be resumed.

As far as I can tell, the frontend would need to emit IR that calls the language runtime to manually throw the exception. This IR would need to be recognized by the IR optimization that converts the icmp+br+load/store to a checked load/store. It seems to me that it would be simpler to just start with the checked load/store.

> 'invoke' is different. It is needed because there is no way for the caller
> to explicitly check for an exception.
>
> We do introduce intrinsics that encapsulate things like overflow checks.
> This is done to eliminate control flow edges that tend to inhibit LLVM's
> optimizer and instruction selection. But you're not removing the control
> flow, so this technique does not apply. Null checks should actually be
> exposed in IR so general optimizations can remove redundant checks.

My idea for removing redundant checks is to teach the IR optimizer to treat iloads/istores as if they were null checks. Is there any reason why this wouldn't work?

> Ideally this would just be a machine code pass that can hoist a load/store
> above a branch and nuke the compare. However, I can see how it's easier to
> relate the compare operand to the address arithmetic at IR level.
>
> To do this at IR level, you could introduce a pre-CodeGen pass that
> converts cmp+br+load/store into a checked load intrinsic. Since this
> intrinsic only exists for instruction selection, the optimizer doesn't
> need to know about it.

I did initially consider implementing the checked load/store as an intrinsic. But there are relatively boring reasons why this won't work at present. For example, there is no current way to represent a struct load using an intrinsic, as there is no mangling for struct types. Also, named structs would need a mangling that is resistant to renames. Rather than solve these problems, I decided to avoid intrinsics entirely.

> The intrinsic would need to be lowered to an MI pseudo-instruction that
> feels like a load/store to the backend, but is a terminator. During code
> emission you could grab the address of that pseudo load/store and its
> resulting branch target to inform the runtime.

As far as I know, a load can lower to multiple machine instructions. This will definitely be the case for the Go frontend that I'm working with, as its IR tends to use struct loads/stores quite frequently. So I'm not sure if this will work. I think it needs to look a lot like how the lowering for invokes currently looks, with a pair of EH_LABELs around a set of ordinary load/store MIs -- which is how I implemented it.

Thanks,
--
Peter
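The explicit-check pattern discussed here -- icmp+br+load/store plus a runtime call to raise the language exception so it unwinds to the landing pad -- might look roughly as follows in 2014-era LLVM IR. This is a sketch, not code from the thread: the `@runtime_throw_npe` helper is an invented name for a language runtime's null-pointer-exception thrower, and the personality function is borrowed from the C++ ABI for illustration.

```llvm
; hypothetical runtime helper that throws the language's
; null-pointer exception and unwinds
declare void @runtime_throw_npe()
declare i32 @__gxx_personality_v0(...)

define i8 @checked_load(i8* %ptr) {
entry:
  %isnull = icmp eq i8* %ptr, null
  br i1 %isnull, label %throw, label %ok

ok:                                    ; check passed: a plain load
  %v = load i8* %ptr
  br label %try.cont

throw:                                 ; raise the exception so that it
  invoke void @runtime_throw_npe()     ; unwinds to the landing pad
          to label %unreach unwind label %lpad

lpad:
  %e = landingpad { i8*, i32 }
          personality i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*)
          catch i8* null
  ; language-specific handler dispatch would go here
  br label %try.cont

try.cont:
  %r = phi i8 [ %v, %ok ], [ 0, %lpad ]
  ret i8 %r

unreach:
  unreachable
}
```

This is the IR shape the proposed 'iload' would collapse into a single terminator instruction.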
Filip Pizlo
2014-Apr-07 04:06 UTC
[LLVMdev] Proposal: Loads/stores with deterministic trap/unwind behavior
Hey Peter,

I've now had multiple opportunities to evaluate the benefit of virtual-memory-based null checks versus explicit branches. Here are some of the papers where we did this. In all cases, we found that explicit null check branches are just as fast as using a virtual memory trap.

http://www.filpizlo.com/papers/baker-ccpe09-accurate.pdf (Tomas stumbled on this observation while comparing OpenVM's C backend to its C++ backend.)

http://www.filpizlo.com/papers/pizlo-eurosys2010-fijivm.pdf (I designed Fiji VM around never using virtual-memory-based optimizations and then I evaluated the cost of null check branches. They were free.)

I've also done this experiment in WebKit. In JavaScript the equivalent thing is a cell check, and we turn them into explicit branches. We of course emit a minimal set of such branches through GVN, LICM, etc. After that, these branches are essentially free.

In all cases, I've found that on modern hardware, the explicit null check branches have essentially zero cost. Redundant null checks are trivial to eliminate using any serious compiler infrastructure, LLVM included. The remaining null checks are trivially predicted by the hardware.

To me it feels like you're proposing to add new instructions to the IR to support something that isn't a worthwhile optimization.

(Also, this is specific to languages where the only non-pointer value that we might attempt to load from is null. It wouldn't work for languages like JavaScript/Ruby/Python/etc., where we may attempt to load from an integer or double that uses a high-bit type tag, for instance. So this is an optimization that really only applies to Java and Go, and in Java it's not clear that your IR gives you what you want -- Java is basically always JITed and you'd want to deoptimize on null.)
-Phil

On April 6, 2014 at 8:54:39 PM, Peter Collingbourne (peter at pcc.me.uk) wrote:
> [quoted message snipped; see Peter's reply above]
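As a concrete illustration of the redundancy-elimination Filip describes (my sketch, not from the thread): once a dominating explicit check establishes that a pointer is non-null, a later check on the same pointer folds away under GVN and branch simplification.

```llvm
define i32 @two_loads(i32* %p) {
entry:
  %isnull = icmp eq i32* %p, null
  br i1 %isnull, label %throw, label %ok1

ok1:
  %a = load i32* %p
  ; redundant: %p is already known non-null on this path, so this
  ; compare folds to false and the branch is deleted
  %isnull2 = icmp eq i32* %p, null
  br i1 %isnull2, label %throw, label %ok2

ok2:
  %b = load i32* %p
  %sum = add i32 %a, %b
  ret i32 %sum

throw:
  ; the runtime exception path would go here; elided in this sketch
  unreachable
}
```

After optimization, only the first compare and branch remain, which is why the surviving checks are cheap on branch-predicting hardware.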
Peter Collingbourne
2014-Apr-07 04:53 UTC
[LLVMdev] Proposal: Loads/stores with deterministic trap/unwind behavior
Hi Filip,

Thanks for sharing these insights.

Since I've already started on an implementation, what I think I'll do is experiment locally and see if these new instructions give a performance improvement in the context of the compiler I'm working on -- I'll probably need to implement explicit null checks anyway for the sake of supporting targets that don't support unwinding through signal handlers. Once I have some numbers I'll re-evaluate.

Thanks,
Peter

On Sun, Apr 06, 2014 at 09:06:49PM -0700, Filip Pizlo wrote:
> [quoted message snipped; see Filip's message above]
Andrew Trick
2014-Apr-07 08:07 UTC
[LLVMdev] Proposal: Loads/stores with deterministic trap/unwind behavior
On Apr 6, 2014, at 8:52 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:> On Sat, Apr 05, 2014 at 12:21:17AM -0700, Andrew Trick wrote: >> >> On Mar 31, 2014, at 6:58 PM, Peter Collingbourne <peter at pcc.me.uk> wrote: >> >>> Hi, >>> >>> I wanted to propose an IR extension that would allow us to support zero-cost >>> exception handling for non-call operations that may trap. I wanted to start >>> with loads and stores through a null pointer, and later we might extend this to >>> div/rem/mod zero. This feature is obviously useful for implementing languages >>> such as Java and Go which deterministically translate such operations into >>> exceptions which may be caught by the user. >>> >>> There are a couple of somewhat orthogonal features that this would entail: >>> >>> 1) Deterministic handling for loads and stores through a null pointer. >>> 2) Ability to unwind a load/store to a specific basic block, like invoke. >>> >>> At the moment, we do not exactly have 1), as the optimizer considers >>> non-volatile loads/stores through a null pointer to have undefined >>> behavior. Volatile loads/stores are closer, but they come with their own >>> set of baggage that can inhibit optimization. (For example, if we can prove >>> that a load would always succeed, 'volatile' prevents us from reordering >>> the load or deleting it if it is dead.) So I propose to add an attribute to >>> 'load' and 'store', which we can call, say, 'nullcheck', with the following >>> additional semantics: >>> >>> - If the memory address is between zero and a target-defined value (i.e. the >>> size of the zero page) the instruction is guaranteed to trap in a >>> target-defined manner. >>> >>> - The optimizer may only delete or reorder nullcheck instructions if the >>> program cannot observe such a transformation by installing a signal handler >>> for the trap. Therefore, the optimizer would be able to drop the attribute >>> if it can prove that the address will always be non-null. 
>>> >>> To support 2), I propose a couple of new instructions. I haven't come up with >>> great names for these instructions, but: >>> >>> - 'iload' is to 'load' as 'invoke' is to 'call'. That is, the instruction is >>> a terminator and has normal and unwind destinations. e.g. >>> >>> %v = iload i8* %ptr to label %try.cont unwind label %lpad >>> >>> - Similarly, 'istore' is to 'store' as 'invoke' is to 'call'. >>> >>> istore i8 %v, i8* %ptr to label %try.cont unwind label %lpad >>> >>> These instructions always have 'nullcheck' semantics, plus: >>> >>> - If the instruction traps and the program has installed a signal handler >>> for the trap which unwinds, the unwind is guaranteed to land at the >>> landing pad. >>> >>> I've been working on an implementation of 'iload' and 'istore' which are >>> in the attached patches, if you are interested. (They aren't ready to go >>> in yet.) I have asm parsing/printing for both, and code generation for >>> 'iload'. Would be interested in getting feedback on code generation as this >>> is my first serious foray into the backend -- I haven't tried running the >>> generated code yet and the DAG builder is a mashup of the DAG builders for >>> 'invoke' and 'load', but I've eyeballed the asm it generates (e.g. llc produces >>> iload-exception.s for the attached iload-exception.ll) and it looks reasonable. >> >> Hi Peter. All due respect, I don’t think it’s right to introduce new load/store instructions with equivalent semantics to and+icmp+br+load/store. > > I don't think the new instructions have equivalent semantics. If the null check > fails with the iload/istore instructions, we need to throw the appropriate > language-specific exception and evaluate it against the landing pad. There > may also need to be an active exception that can be resumed. > > As far as I can tell, the frontend would need to emit IR that calls the > language runtime to manually throw the exception. 
This IR would need to be > recognized by the IR optimization that converts the icmp+br+load/store to > a checked load/store. It seems to me that it would be simpler to just start > with the checked load/store.I agree with Filip that implementing null checks as trapping loads is essentially a minor code-size optimization that is typically not worthwhile. However, my point is that this decision should not be made at IR level. It could make sense to have a null check intrinsic that encapsulates and+cmp+br+invoke. I just don't like the idea of it subsuming the load/store. I think that will complicate both null check optimization and load/store optimization. Whether the null check can be profitably implemented as a trapping load/store is specific to the target platform. This decision should be independent of optimization in the presence of null checks and independent of optimization of the null checks themselves. It is purely a codegen issue. Imagine someone else porting your frontend to a new platform. They should not be required to implement trapping loads/stores to achieve a working system. And either way, they should benefit from the same platform independent optimization.>> ‘invoke’ is different. It is needed because there is no way for the caller to explicitly check for an exception. >> >> We do introduce intrinsics that encapsulate things like overflow checks. This is done to eliminate control flow edges that tend to inhibit LLVM’s optimizer and instruction selection. But you’re not removing the control flow, so this technique does not apply. Null checks should actually be exposed in IR so general optimizations can remove redundant checks. > > My idea for removing redundant checks is to teach the IR optimizer to > treat iloads/istores as if they were null checks. Is there any reason > why this wouldn't work?I think it would be poor IR design. The optimizer should be able to assume loads are loads and stores are stores. 
People writing optimization passes will not remember that there is now another special load/store operation they should consider. An intrinsic makes it fairly clear that optimization passes will typically ignore this operation. Ideally, language-specific intrinsics are lowered early into canonical IR and platform-specific intrinsics are generated late from canonical IR to avoid complicating optimization.>> Ideally this would just be a machine code pass that can hoist a load/store above a branch and nuke the compare. However, I can see how it’s easier to relate the compare operand to the address arithmetic at IR level. >> >> To do this at IR level, you could introduce a pre-CodeGen pass that converts cmp+br+load/store into a checked load intrinsic. Since this intrinsic only exists for instruction selection, the optimizer doesn’t need to know about it. > > I did initially consider implementing the checked load/store as an > intrinsic. But there are relatively boring reasons why this won't work at > present. For example, there is no current way to represent a struct load > using an intrinsic, as there is no mangling for struct types. Also, named > structs would need a mangling that is resistant to renames. Rather than > solve these problems, I decided to avoid intrinsics entirely.Ok. This seems worth solving. Maybe it should be discussed in another thread.> >> The intrinsic would need to be lowered to an MI pseudo-instruction that feels like a load/store to the backend, but is a terminator. During code emission you could grab the address of that pseudo load/store and its resulting branch target to inform the runtime. > > As far as I know, a load can lower to multiple machine instructions. This > will definitely be the case for the Go frontend that I'm working with, as > its IR tends to use struct loads/stores quite frequently. So I'm not sure > if this will work. 
> I think it needs to look a lot like how the lowering for invokes currently looks, with a pair of EH_LABELs around a set of ordinary load/store MIs -- which is how I implemented it.

That sounds nice. But I'm not sure that's sufficient to prevent the MI-level loads being optimized away or moved/combined when you still need the null check in the original position.

-Andy

> Thanks,
> --
> Peter
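To make the suggestion concrete, the pre-CodeGen form Andy describes might look roughly like the following. Note that everything here is hypothetical: no `@llvm.checked.load.*` intrinsic exists in-tree, and intrinsics cannot currently be invoked, which is exactly the gap under discussion.

```llvm
; Hypothetical: a pre-CodeGen pass collapses the explicit pattern
;   %c = icmp eq i32* %p, null
;   br i1 %c, label %npe, label %ok  ...  %v = load i32* %p
; into one checked load that only instruction selection understands,
; keeping the unwind edge for the exception path.
declare i32 @llvm.checked.load.i32(i32*)

define i32 @f(i32* %p) {
entry:
  %v = invoke i32 @llvm.checked.load.i32(i32* %p)
          to label %cont unwind label %lpad

cont:
  ret i32 %v

lpad:
  %eh = landingpad { i8*, i32 }
          personality i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*)
          cleanup
  resume { i8*, i32 } %eh
}

declare i32 @__gxx_personality_v0(...)
```

Since the intrinsic would be generated late, ordinary mid-level passes would never see it and would continue to reason about plain loads and stores.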
Jeremy Lakeman
2014-Apr-09 09:06 UTC
[LLVMdev] Proposal: Loads/stores with deterministic trap/unwind behavior
On Mon, Apr 7, 2014 at 5:37 PM, Andrew Trick <atrick at apple.com> wrote:
> I agree with Filip that implementing null checks as trapping loads is essentially a minor code-size optimization that is typically not worthwhile. However, my point is that this decision should not be made at IR level.
>
> It could make sense to have a null check intrinsic that encapsulates and+cmp+br+invoke. I just don't like the idea of it subsuming the load/store. I think that will complicate both null check optimization and load/store optimization.
>
> Whether the null check can be profitably implemented as a trapping load/store is specific to the target platform. This decision should be independent of optimization in the presence of null checks and independent of optimization of the null checks themselves. It is purely a codegen issue.
>
> Imagine someone else porting your frontend to a new platform. They should not be required to implement trapping loads/stores to achieve a working system. And either way, they should benefit from the same platform-independent optimization.

Integer type legalisation, vector math... there are plenty of existing language features a backend doesn't have to implement; you just run passes to simplify the IR first. I'm also thinking particularly of how Emscripten and PNaCl are implemented. If the IR changes (and I'm not taking a position on that issue), we just need a pass to transform the IR for back ends that don't wish to implement it.
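For reference, the fully explicit, platform-independent form that such a lowering pass would produce (and that Andy's hypothetical intrinsic would encapsulate) looks roughly like this, in 2014-era IR syntax. The `@throw_npe` call is a placeholder for whatever entry point a particular language runtime uses to raise its null-pointer exception:

```llvm
define i32 @checked_load(i32* %p) {
entry:
  %isnull = icmp eq i32* %p, null
  br i1 %isnull, label %npe, label %ok

ok:                                    ; %p proven non-null on this path
  %v = load i32* %p
  ret i32 %v

npe:                                   ; hand off to the language runtime,
  invoke void @throw_npe()             ; which throws the catchable exception
          to label %unreach unwind label %lpad

lpad:
  %eh = landingpad { i8*, i32 }
          personality i8* bitcast (i32 (...)* @__gxx_personality_v0 to i8*)
          cleanup
  resume { i8*, i32 } %eh

unreach:
  unreachable
}

declare void @throw_npe()
declare i32 @__gxx_personality_v0(...)
```

A backend that implements nothing special still gets a working system from this form; a target-specific codegen pass is then free to fold the compare-and-branch into a trapping access where that is profitable.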
Philip Reames
2014-Apr-10 16:40 UTC
[LLVMdev] Proposal: Loads/stores with deterministic trap/unwind behavior
I'll respond to the rest of this thread separately, but I want to respond to one bit in particular.

On 04/06/2014 08:52 PM, Peter Collingbourne wrote:
> I did initially consider implementing the checked load/store as an intrinsic. But there are relatively boring reasons why this won't work at present. For example, there is no current way to represent a struct load using an intrinsic, as there is no mangling for struct types. Also, named structs would need a mangling that is resistant to renames. Rather than solve these problems, I decided to avoid intrinsics entirely.

I have out-of-tree code which adds support for intrinsics with struct types. I'm happy to share this. Moreover, if you identify a weakness in the intrinsic mechanism, we should fix it. :) (Says the guy with a couple of such fixes out of tree... really should get around to upstreaming those...)

Philip
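To illustrate the gap Peter ran into: intrinsic name mangling covers overloads on integer, floating-point, vector, and pointer types (e.g. `@llvm.sadd.with.overflow.i32`), but there is no defined mangling for a first-class struct, so a declaration like the following has no canonical name. The `.sl_i32i32s` suffix here is an invented mangling for `{ i32, i32 }`, shown only to make the problem concrete:

```llvm
; Hypothetical -- struct-overloaded intrinsics are not supported in-tree,
; and this suffix is an invented mangling, not real syntax.
declare { i32, i32 } @llvm.checked.load.sl_i32i32s({ i32, i32 }*)
```

Named structs make this worse: a stable mangling would have to survive the struct being renamed, which is why Peter sidestepped intrinsics entirely.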
Philip Reames
2014-Apr-10 16:54 UTC
[LLVMdev] Proposal: Loads/stores with deterministic trap/unwind behavior
Speaking as someone working on another language where the performance of null checks is a major concern, I find myself agreeing with Andy and Filip here.

I'm somewhat okay with the addition of the "nullcheck" attribute. I'm not convinced it's really necessary, but the burden seems low.

I find myself in strong opposition to the new load/store variants. As others have pointed out, there's not a clear win here, and there's a *serious* cost to modifying the IR - both current and ongoing. Worth noting, both of your load and store instructions could be implemented as intrinsics with minimal (if any) loss in expressiveness or performance. There are function attributes which express nearly all the aliasing properties you'd need. The only real complication is that we don't really support invoking intrinsics today. I've got some hacky changes out of tree that address that issue, but they need to be discussed on the mailing list before sharing.

Just to note: I'm not taking a position w.r.t. the profitability of implicit null checks as a whole. We plan on running this experiment at some point, but it's fairly low priority for us. There are definitely more profitable things to get to first.

IMHO, a better area to invest in would be null check elimination. We've found a couple of cases where "obviously" redundant null checks get missed by the optimizer. We haven't started digging into this in great detail, but that seems like it would be a much more profitable optimization opportunity.

Philip

On 04/07/2014 01:07 AM, Andrew Trick wrote:
> On Apr 6, 2014, at 8:52 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>
>> On Sat, Apr 05, 2014 at 12:21:17AM -0700, Andrew Trick wrote:
>>> On Mar 31, 2014, at 6:58 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>>>
>>>> Hi,
>>>>
>>>> I wanted to propose an IR extension that would allow us to support zero-cost exception handling for non-call operations that may trap.
>>>> I wanted to start with loads and stores through a null pointer, and later we might extend this to div/rem/mod zero. This feature is obviously useful for implementing languages such as Java and Go which deterministically translate such operations into exceptions which may be caught by the user.
>>>>
>>>> There are a couple of somewhat orthogonal features that this would entail:
>>>>
>>>> 1) Deterministic handling for loads and stores through a null pointer.
>>>> 2) Ability to unwind a load/store to a specific basic block, like invoke.
>>>>
>>>> At the moment, we do not exactly have 1), as the optimizer considers non-volatile loads/stores through a null pointer to have undefined behavior. Volatile loads/stores are closer, but they come with their own set of baggage that can inhibit optimization. (For example, if we can prove that a load would always succeed, 'volatile' prevents us from reordering the load or deleting it if it is dead.) So I propose to add an attribute to 'load' and 'store', which we can call, say, 'nullcheck', with the following additional semantics:
>>>>
>>>> - If the memory address is between zero and a target-defined value (i.e. the size of the zero page) the instruction is guaranteed to trap in a target-defined manner.
>>>>
>>>> - The optimizer may only delete or reorder nullcheck instructions if the program cannot observe such a transformation by installing a signal handler for the trap. Therefore, the optimizer would be able to drop the attribute if it can prove that the address will always be non-null.
>>>>
>>>> To support 2), I propose a couple of new instructions. I haven't come up with great names for these instructions, but:
>>>>
>>>> - 'iload' is to 'load' as 'invoke' is to 'call'. That is, the instruction is a terminator and has normal and unwind destinations. e.g.
>>>>
>>>>   %v = iload i8* %ptr to label %try.cont unwind label %lpad
>>>>
>>>> - Similarly, 'istore' is to 'store' as 'invoke' is to 'call'.
>>>>
>>>>   istore i8 %v, i8* %ptr to label %try.cont unwind label %lpad
>>>>
>>>> These instructions always have 'nullcheck' semantics, plus:
>>>>
>>>> - If the instruction traps and the program has installed a signal handler for the trap which unwinds, the unwind is guaranteed to land at the landing pad.
>>>>
>>>> I've been working on an implementation of 'iload' and 'istore' which are in the attached patches, if you are interested. (They aren't ready to go in yet.) I have asm parsing/printing for both, and code generation for 'iload'. Would be interested in getting feedback on code generation as this is my first serious foray into the backend -- I haven't tried running the generated code yet and the DAG builder is a mashup of the DAG builders for 'invoke' and 'load', but I've eyeballed the asm it generates (e.g. llc produces iload-exception.s for the attached iload-exception.ll) and it looks reasonable.
>>>
>>> Hi Peter. All due respect, I don't think it's right to introduce new load/store instructions with equivalent semantics to and+icmp+br+load/store.
>>
>> I don't think the new instructions have equivalent semantics. If the null check fails with the iload/istore instructions, we need to throw the appropriate language-specific exception and evaluate it against the landing pad. There may also need to be an active exception that can be resumed.
>>
>> As far as I can tell, the frontend would need to emit IR that calls the language runtime to manually throw the exception. This IR would need to be recognized by the IR optimization that converts the icmp+br+load/store to a checked load/store. It seems to me that it would be simpler to just start with the checked load/store.
>
> I agree with Filip that implementing null checks as trapping loads is essentially a minor code-size optimization that is typically not worthwhile. However, my point is that this decision should not be made at IR level.
>
> It could make sense to have a null check intrinsic that encapsulates and+cmp+br+invoke. I just don't like the idea of it subsuming the load/store. I think that will complicate both null check optimization and load/store optimization.
>
> Whether the null check can be profitably implemented as a trapping load/store is specific to the target platform. This decision should be independent of optimization in the presence of null checks and independent of optimization of the null checks themselves. It is purely a codegen issue.
>
> Imagine someone else porting your frontend to a new platform. They should not be required to implement trapping loads/stores to achieve a working system. And either way, they should benefit from the same platform-independent optimization.
>
>>> 'invoke' is different. It is needed because there is no way for the caller to explicitly check for an exception.
>>>
>>> We do introduce intrinsics that encapsulate things like overflow checks. This is done to eliminate control flow edges that tend to inhibit LLVM's optimizer and instruction selection. But you're not removing the control flow, so this technique does not apply. Null checks should actually be exposed in IR so general optimizations can remove redundant checks.
>>
>> My idea for removing redundant checks is to teach the IR optimizer to treat iloads/istores as if they were null checks. Is there any reason why this wouldn't work?
>
> I think it would be poor IR design. The optimizer should be able to assume loads are loads and stores are stores. People writing optimization passes will not remember that there is now another special load/store operation they should consider.
>
> An intrinsic makes it fairly clear that optimization passes will typically ignore this operation. Ideally, language-specific intrinsics are lowered early into canonical IR and platform-specific intrinsics are generated late from canonical IR to avoid complicating optimization.
>
>>> Ideally this would just be a machine code pass that can hoist a load/store above a branch and nuke the compare. However, I can see how it's easier to relate the compare operand to the address arithmetic at IR level.
>>>
>>> To do this at IR level, you could introduce a pre-CodeGen pass that converts cmp+br+load/store into a checked load intrinsic. Since this intrinsic only exists for instruction selection, the optimizer doesn't need to know about it.
>>
>> I did initially consider implementing the checked load/store as an intrinsic. But there are relatively boring reasons why this won't work at present. For example, there is no current way to represent a struct load using an intrinsic, as there is no mangling for struct types. Also, named structs would need a mangling that is resistant to renames. Rather than solve these problems, I decided to avoid intrinsics entirely.
>
> Ok. This seems worth solving. Maybe it should be discussed in another thread.
>
>>> The intrinsic would need to be lowered to an MI pseudo-instruction that feels like a load/store to the backend, but is a terminator. During code emission you could grab the address of that pseudo load/store and its resulting branch target to inform the runtime.
>>
>> As far as I know, a load can lower to multiple machine instructions. This will definitely be the case for the Go frontend that I'm working with, as its IR tends to use struct loads/stores quite frequently. So I'm not sure if this will work. I think it needs to look a lot like how the lowering for invokes currently looks, with a pair of EH_LABELs around a set of ordinary load/store MIs -- which is how I implemented it.
>
> That sounds nice. But I'm not sure that's sufficient to prevent the MI level loads being optimized away or moved/combined when you still need the null check in the original position.
>
> -Andy
>
>> Thanks,
>> --
>> Peter
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
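As an illustration of the redundancy Philip describes, once one access has established that a pointer is non-null, a later check of the same pointer on the dominated path is provably dead. A sketch of the before/after (block names illustrative, 2014-era IR syntax):

```llvm
entry:
  %c1 = icmp eq i32* %p, null
  br i1 %c1, label %npe, label %ok1

ok1:
  %a = load i32* %p
  ; Redundant: %p is already known non-null on this path, so this
  ; compare folds to false and the branch to %npe can be deleted,
  ; leaving a straight-line fallthrough into %ok2.
  %c2 = icmp eq i32* %p, null
  br i1 %c2, label %npe, label %ok2

ok2:
  %b = load i32* %p
```

This kind of elimination only works if the checks are visible to the IR optimizer as ordinary icmp+br control flow, which is part of the argument against hiding them inside new load/store variants.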