Philip Reames via llvm-dev
2016-Jan-21 22:09 UTC
[llvm-dev] Adding support for self-modifying branches to LLVM?
On 01/21/2016 01:51 PM, Sean Silva wrote:
> On Thu, Jan 21, 2016 at 1:33 PM, Philip Reames <listmail at philipreames.com> wrote:
>> On 01/19/2016 09:04 PM, Sean Silva via llvm-dev wrote:
>>> AFAIK, the cost of a well-predicted, not-taken branch is the same as a nop on every x86 made in the last many years. See http://www.agner.org/optimize/instruction_tables.pdf. Generally speaking, a correctly-predicted not-taken branch is basically identical to a nop, and a correctly-predicted taken branch has an extra overhead similar to an "add" or other extremely cheap operation.
>>
>> Specifically on this point only: while absolutely true for most micro-benchmarks, this is less true at large scale. I've definitely seen removing a highly predictable branch (in many, many places, some of which are hot) benefit performance in the 5-10% range. For instance, removing highly predictable branches is the primary motivation of implicit null checking (http://llvm.org/docs/FaultMaps.html). Where exactly the performance improvement comes from is hard to say, but, empirically, it does matter.
>>
>> (Caveat to the above: I have not run an experiment that actually put in the same number of bytes in nops. It's possible the entire benefit I mentioned is code-size related, but I doubt it, given how many ticks a sample profiler will show on said branches.)
>
> Interesting. Another possible explanation is that these extra branches cause contention on branch-prediction resources.

I've heard and proposed this explanation in the past as well, but I've never heard of anyone able to categorically answer the question. The other explanation I've considered is that the processor has a finite speculation depth (i.e. a limit on how many predicted branches can be in flight), and the extra branches prevent it from speculating the "interesting" branches because the window is full of uninteresting ones. However, my hardware friends tell me this is a somewhat questionable explanation, since the check branches should be easy to satisfy and retire quickly.

> In the past when talking with Dan about WebAssembly sandboxing, IIRC he said that they found about 15% overhead, due primarily to branch-prediction resource contention.

15% seems a bit high to me, but I don't have anything concrete to share here, unfortunately.

> In fact I think they had a pretty clear idea of wanting a new instruction which is just a "statically predict never taken and don't use any branch-prediction resources" branch (this is on x86 IIRC; some arches actually do have such an instruction!).

This has been on my wish list for a while. It would make many things so much easier. The sickly amusing bit is that x86 has two different forms of this, neither of which actually works:

1) There are prefixes for branches which are supposed to control the prediction direction. My understanding is that code which tried using them was so often wrong that modern chips interpret them as nop padding. We actually use this to produce near-arbitrary-length nops. :)

2) x86 (but not x86-64) had an "into" instruction which triggered an interrupt if the overflow bit is set. (Hey, signal handlers are just weird branches, right? :p) However, this does not work as designed in x86-64. My understanding is that the original AMD implementation had a bug in this instruction, and the bug essentially got written into the spec for all future chips. :(

Philip
Jonas Wagner via llvm-dev
2016-Jan-21 22:52 UTC
[llvm-dev] Adding support for self-modifying branches to LLVM?
Hello,

>> There is some data on this, e.g., in “High System-Code Security with Low Overhead” <http://dslab.epfl.ch/proj/asap/#publications>. In this work we found that, for ASan as well as other instrumentation tools, most overhead comes from the checks. Especially for CPU-intensive applications, the cost of maintaining shadow memory is small.
>
> How did you measure this? If it was measured by removing the checks before optimization happens, then what you may have been measuring is not the execution overhead of the branches (which is what would be eliminated by nop’ing them out) but the effect on the optimizer.

Interesting. Indeed, this was measured by removing some checks and then re-optimizing the program.

I’m aware of some impact checks may have on optimization. For example, I’ve seen cases where much less inlining happens because functions with checks are larger. Do you know of other concrete examples? This is definitely something I’ll have to be careful about. Philip Reames confirms this, too.

On the other hand, we’ve also found that the benefit from removing a check is roughly proportional to the number of cycles spent executing that check’s instructions. Our model of this is not very precise, but it shows that the cost of executing the check’s instructions matters.

I'll try to measure this, and will come back when I have data.

Best,
Jonas
Sean Silva via llvm-dev
2016-Jan-22 00:53 UTC
[llvm-dev] Adding support for self-modifying branches to LLVM?
On Thu, Jan 21, 2016 at 2:52 PM, Jonas Wagner <jonas.wagner at epfl.ch> wrote:
> Hello,
>
> There is some data on this, e.g., in “High System-Code Security with Low Overhead” <http://dslab.epfl.ch/proj/asap/#publications>. In this work we found that, for ASan as well as other instrumentation tools, most overhead comes from the checks. Especially for CPU-intensive applications, the cost of maintaining shadow memory is small.
>
>> How did you measure this? If it was measured by removing the checks before optimization happens, then what you may have been measuring is not the execution overhead of the branches (which is what would be eliminated by nop’ing them out) but the effect on the optimizer.
>
> Interesting. Indeed this was measured by removing some checks and then re-optimizing the program.
>
> I’m aware of some impact checks may have on optimization. For example, I’ve seen cases where much less inlining happens because functions with checks are larger. Do you know other concrete examples? This is definitely something I’ll have to be careful about. Philip Reames confirms this, too.

Off the top of my head:

- the CFG is more complex
- register allocation is harder (although in theory it needn't do a worse job, since the branches are marked as unlikely; I'm not sure about in practice)
- the branch splits a basic block which would otherwise have stayed together, restricting e.g. the scheduler (which is BB-at-a-time, last I heard)

-- Sean Silva

> On the other hand, we’ve also found that the benefit from removing a check is roughly proportional to the number of cycles spent executing that check’s instructions. Our model of this is not very precise, but it shows that the cost of executing the check’s instructions matters.
>
> I'll try to measure this, and will come back when I have data.
>
> Best,
> Jonas
Jonas Wagner via llvm-dev
2016-Feb-09 14:57 UTC
[llvm-dev] Adding support for self-modifying branches to LLVM?
Hi,

I'm coming back to this old thread with data about the performance of NOPs. Recall that I was considering transforming NOP instructions into branches and back, in order to dynamically enable code. One use case for this was enabling/disabling individual sanitizer checks (ASan, UBSan) on demand.

I wrote a pass which takes an ASan-instrumented program and replaces each ASan check with an llvm.experimental.patchpoint intrinsic. This intrinsic inserts a NOP of configurable size. It has otherwise no effect on the program's semantics. It does prevent some optimizations, presumably because instructions cannot be moved across the patchpoint.

Some results:

- On SPEC, patchpoints introduce an overhead of ~25% compared to a version where ASan checks are removed.
- This is almost half of the cost of the checks themselves.
- The results are similar for NOPs of size 1 and 5 bytes.
- Interestingly, the results are similar for NOPs of 0 bytes, too. These are patchpoints that don't insert any code and only inhibit optimizations. I've only tested this on one benchmark, though.

To summarize, only part of the cost of NOPs is due to executing them. Their effect on optimizations is significant, too. I would guess this holds for branches and sanitizer checks as well.

Best,
Jonas

On Thu, Jan 21, 2016 at 11:52 PM Jonas Wagner <jonas.wagner at epfl.ch> wrote:
> Hello,
>
> There is some data on this, e.g., in “High System-Code Security with Low Overhead” <http://dslab.epfl.ch/proj/asap/#publications>. In this work we found that, for ASan as well as other instrumentation tools, most overhead comes from the checks. Especially for CPU-intensive applications, the cost of maintaining shadow memory is small.
>
>> How did you measure this? If it was measured by removing the checks before optimization happens, then what you may have been measuring is not the execution overhead of the branches (which is what would be eliminated by nop’ing them out) but the effect on the optimizer.
>
> Interesting. Indeed this was measured by removing some checks and then re-optimizing the program.
>
> I’m aware of some impact checks may have on optimization. For example, I’ve seen cases where much less inlining happens because functions with checks are larger. Do you know other concrete examples? This is definitely something I’ll have to be careful about. Philip Reames confirms this, too.
>
> On the other hand, we’ve also found that the benefit from removing a check is roughly proportional to the number of cycles spent executing that check’s instructions. Our model of this is not very precise, but it shows that the cost of executing the check’s instructions matters.
>
> I'll try to measure this, and will come back when I have data.
>
> Best,
> Jonas