Philip Reames via llvm-dev
2016-Jan-21 22:09 UTC
[llvm-dev] Adding support for self-modifying branches to LLVM?
On 01/21/2016 01:51 PM, Sean Silva wrote:
> On Thu, Jan 21, 2016 at 1:33 PM, Philip Reames <listmail at philipreames.com> wrote:
>> On 01/19/2016 09:04 PM, Sean Silva via llvm-dev wrote:
>>> AFAIK, the cost of a well-predicted, not-taken branch is the same as a nop on every x86 made in the last many years. See http://www.agner.org/optimize/instruction_tables.pdf. Generally speaking, a correctly-predicted not-taken branch is basically identical to a nop, and a correctly-predicted taken branch has an extra overhead similar to an "add" or other extremely cheap operation.
>>
>> Specifically on this point only: while absolutely true for most micro-benchmarks, this is less true at large scale. I've definitely seen removing a highly predictable branch (in many, many places, some of which are hot) benefit performance in the 5-10% range. For instance, removing highly predictable branches is the primary motivation of implicit null checking (http://llvm.org/docs/FaultMaps.html). Where exactly the performance improvement comes from is hard to say, but, empirically, it does matter.
>>
>> (Caveat to the above: I have not run an experiment that actually put in the same number of bytes in nops. It's possible the entire benefit I mentioned is code-size related, but I doubt it, given how many ticks a sample profiler will show on said branches.)
>
> Interesting. Another possible explanation is that these extra branches cause contention on branch-prediction resources.

I've heard and proposed this explanation in the past as well, but I've never heard of anyone able to categorically answer the question. The other explanation I've considered is that the processor has a finite speculation depth (i.e. a limit on how many predicted branches can be in flight), and the extra branches prevent it from speculating the "interesting" branches because the window is full of uninteresting ones. However, my hardware friends tell me this is a somewhat questionable explanation, since the check branches should be easy to satisfy and retire quickly.

> In the past when talking with Dan about WebAssembly sandboxing, IIRC he said that they found about 15% overhead, due primarily to branch-prediction resource contention.

15% seems a bit high to me, but I don't have anything concrete to share here, unfortunately.

> In fact I think they had a pretty clear idea of wanting a new instruction which is just a "statically predict never taken and don't use any branch-prediction resources" branch (this is on x86 IIRC; some arches actually do have such an instruction!).

This has been on my wish list for a while. It would make many things so much easier. The sickly amusing bit is that x86 has two different forms of this, neither of which actually works:

1) There are prefixes for branches which are supposed to control the prediction direction. My understanding is that code which tried using them was so often wrong that modern chips interpret them as nop padding. We actually use this to produce near-arbitrary-length nops. :)

2) x86 (but not x86-64) had an "into" instruction which triggered an interrupt if the overflow bit is set. (Hey, signal handlers are just weird branches, right? :p) However, this does not work as designed in x86-64. My understanding is that the original AMD implementation had a bug in this instruction, and the bug essentially got written into the spec for all future chips. :(

Philip
Jonas Wagner via llvm-dev
2016-Jan-21 22:52 UTC
[llvm-dev] Adding support for self-modifying branches to LLVM?
Hello,

>> There is some data on this, e.g., in “High System-Code Security with Low Overhead” <http://dslab.epfl.ch/proj/asap/#publications>. In this work we found that, for ASan as well as other instrumentation tools, most overhead comes from the checks. Especially for CPU-intensive applications, the cost of maintaining shadow memory is small.
>
> How did you measure this? If it was measured by removing the checks before optimization happens, then what you may have been measuring is not the execution overhead of the branches (which is what would be eliminated by nop’ing them out) but the effect on the optimizer.

Interesting. Indeed, this was measured by removing some checks and then re-optimizing the program.

I’m aware of some impact checks may have on optimization. For example, I’ve seen cases where much less inlining happens because functions with checks are larger. Do you know of other concrete examples? This is definitely something I’ll have to be careful about. Philip Reames confirms this, too.

On the other hand, we’ve also found that the benefit from removing a check is roughly proportional to the number of cycles spent executing that check’s instructions. Our model of this is not very precise, but it shows that the cost of executing the check’s instructions matters.

I'll try to measure this, and will come back when I have data.

Best,
Jonas
Sean Silva via llvm-dev
2016-Jan-22 00:53 UTC
[llvm-dev] Adding support for self-modifying branches to LLVM?
On Thu, Jan 21, 2016 at 2:52 PM, Jonas Wagner <jonas.wagner at epfl.ch> wrote:
> Hello,
>
> There is some data on this, e.g., in “High System-Code Security with Low Overhead” <http://dslab.epfl.ch/proj/asap/#publications>. In this work we found that, for ASan as well as other instrumentation tools, most overhead comes from the checks. Especially for CPU-intensive applications, the cost of maintaining shadow memory is small.
>
>> How did you measure this? If it was measured by removing the checks before optimization happens, then what you may have been measuring is not the execution overhead of the branches (which is what would be eliminated by nop’ing them out) but the effect on the optimizer.
>
> Interesting. Indeed this was measured by removing some checks and then re-optimizing the program.
>
> I’m aware of some impact checks may have on optimization. For example, I’ve seen cases where much less inlining happens because functions with checks are larger. Do you know other concrete examples? This is definitely something I’ll have to be careful about. Philip Reames confirms this, too.

Off the top of my head:

- the CFG is more complex
- register allocation is harder (although in theory it needn't do a worse job, since the branches are marked as unlikely; I'm not sure about in practice)
- the branch splits a basic block which would otherwise have stayed together, restricting e.g. the scheduler (which is BB-at-a-time, last I heard)

-- Sean Silva

> On the other hand, we’ve also found that the benefit from removing a check is roughly proportional to the number of cycles spent executing that check’s instructions. Our model of this is not very precise, but it shows that the cost of executing the check’s instructions matters.
>
> I'll try to measure this, and will come back when I have data.
>
> Best,
> Jonas
Jonas Wagner via llvm-dev
2016-Feb-09 14:57 UTC
[llvm-dev] Adding support for self-modifying branches to LLVM?
Hi,

I'm coming back to this old thread with data about the performance of NOPs. Recall that I was considering transforming NOP instructions into branches and back, in order to dynamically enable code. One use case for this was enabling/disabling individual sanitizer checks (ASan, UBSan) on demand.

I wrote a pass which takes an ASan-instrumented program and replaces each ASan check with an llvm.experimental.patchpoint intrinsic. This intrinsic inserts a NOP of configurable size. It has otherwise no effect on the program's semantics. It does prevent some optimizations, presumably because instructions cannot be moved across the patchpoint.

Some results:

- On SPEC, patchpoints introduce an overhead of ~25% compared to a version where ASan checks are removed.
- This is almost half of the cost of the checks themselves.
- The results are similar for NOPs of size 1 and 5 bytes.
- Interestingly, the results are similar for NOPs of 0 bytes, too. These are patchpoints that don't insert any code and only inhibit optimizations. I've only tested this on one benchmark, though.

To summarize, only part of the cost of NOPs is due to executing them. Their effect on optimizations is significant, too. I would guess this holds for branches and sanitizer checks as well.

Best,
Jonas

On Thu, Jan 21, 2016 at 11:52 PM Jonas Wagner <jonas.wagner at epfl.ch> wrote:
> Hello,
>
> There is some data on this, e.g., in “High System-Code Security with Low Overhead” <http://dslab.epfl.ch/proj/asap/#publications>. In this work we found that, for ASan as well as other instrumentation tools, most overhead comes from the checks. Especially for CPU-intensive applications, the cost of maintaining shadow memory is small.
>
>> How did you measure this? If it was measured by removing the checks before optimization happens, then what you may have been measuring is not the execution overhead of the branches (which is what would be eliminated by nop’ing them out) but the effect on the optimizer.
>
> Interesting. Indeed this was measured by removing some checks and then re-optimizing the program.
>
> I’m aware of some impact checks may have on optimization. For example, I’ve seen cases where much less inlining happens because functions with checks are larger. Do you know other concrete examples? This is definitely something I’ll have to be careful about. Philip Reames confirms this, too.
>
> On the other hand, we’ve also found that the benefit from removing a check is roughly proportional to the number of cycles spent executing that check’s instructions. Our model of this is not very precise, but it shows that the cost of executing the check’s instructions matters.
>
> I'll try to measure this, and will come back when I have data.
>
> Best,
> Jonas