thr3ads.net - llvm dev - [llvm-dev] [RFC] A new intrinsic, `llvm.blackbox`, to explicitly prevent constprop, die, etc optimizations [Nov 2015]


Richard Diamond via llvm-dev

2015-Nov-06 16:35 UTC

[llvm-dev] [RFC] A new intrinsic, `llvm.blackbox`, to explicitly prevent constprop, die, etc optimizations

On Tue, Nov 3, 2015 at 3:15 PM, Daniel Berlin <dberlin at dberlin.org>
wrote:
>
>
> On Tue, Nov 3, 2015 at 12:29 PM, Richard Diamond <
> wichard at vitalitystudios.com> wrote:
>
>>
>>
>> On Mon, Nov 2, 2015 at 9:16 PM, Daniel Berlin <dberlin at dberlin.org>
>> wrote:
>>
>>> I'm very unclear and why you think a generic black box intrinsic will
>>> have any different performance impact ;-)
>>>
>>>
>>> I'm also unclear on what the goal with this intrinsic is.
>>> I understand the symptoms you are trying to solve - what exactly is the
>>> disease.
>>>
>>> IE you say "
>>>
>>> I'd like to propose a new intrinsic for use in preventing optimizations
>>> from deleting IR due to constant propagation, dead code elimination, etc."
>>>
>>> But why are you trying to achieve this goal?
>>>
>>
>> It's a cleaner design than current solutions (as far as I'm aware).
>>
>
> For what, exact, well defined goal?
>
> Trying to make certain specific optimizations not work does not seem like
> a goal unto itself.
> It's a thing you are doing to achieve something else, right?
> (Because if not, it has a very well defined and well supported solutions -
> set up a pass manager that runs the passes you want)
>
> What is the something else?
>
> IE what is the problem that led you to consider this solution.
>
I apologize if I'm not being clear enough. This contrived example
```rust
#[bench]
fn bench_xor_1000_ints(b: &mut Bencher) {
    b.iter(|| {
        (0..1000).fold(0, |old, new| old ^ new);
    });
}
```
is completely optimized away. Granted, IRL production (ignoring the
question of why this code was ever used in production in the first place)
this optimization is desired, but here it leads to bogus measurements (ie
0ns per iteration). By using `test::black_box`, one would have

```rust
#[bench]
fn bench_xor_1000_ints(b: &mut Bencher) {
    b.iter(|| {
        let n = test::black_box(1000);  // optional
        test::black_box((0..n).fold(0, |old, new| old ^ new));
    });
}
```
and the microbenchmark wouldn't have bogus 0ns measurements anymore.

Now, as I stated in the proposal, `test::black_box` currently uses no-op
inline asm to "read" from its argument in a way the optimizations can't
see. Conceptually, this seems like something that should be modelled in
LLVM's IR rather than by hacks higher up the IR food chain because the root
problem is caused by LLVM's optimization passes (most of the time this code
optimization is desired, just not here). Plus, it seems others have used
other tricks to achieve similar effects (ie volatile), so why shouldn't
there be something to model this behaviour?
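
As a rough illustration of the no-op inline asm trick described above, the sketch below passes a value through an empty asm statement so the optimizer has to treat it as unknown. It uses today's `std::arch::asm!` syntax rather than the 2015-era one, it assumes an x86_64 or aarch64 target (falling back to `std::hint::black_box` elsewhere), and the names `opaque`, `xor_below` and `bench_body` are made up for this example. It is not the actual `test::black_box` source.

```rust
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
use std::arch::asm;

/// Returns `x` unchanged, but routes it through an asm statement that
/// contains only an assembler comment. The compiler must assume the
/// register may have been read and rewritten, so it can neither
/// constant-fold the result nor delete the computation feeding it.
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
pub fn opaque(mut x: u64) -> u64 {
    // SAFETY: the template expands to a comment; no instruction is executed.
    unsafe {
        asm!("/* {0} */", inout(reg) x, options(nomem, nostack, preserves_flags));
    }
    x
}

/// Fallback for other targets (an assumption of this sketch).
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
pub fn opaque(x: u64) -> u64 {
    std::hint::black_box(x)
}

/// The computation from the contrived benchmark: XOR of all integers in 0..n.
pub fn xor_below(n: u64) -> u64 {
    (0..n).fold(0, |old, new| old ^ new)
}

/// One benchmark iteration: hide the input, then "use" the output.
pub fn bench_body() -> u64 {
    let n = opaque(1000);
    opaque(xor_below(n))
}
```

Calling `bench_body()` in a timing loop should keep the fold from being removed outright, although the optimizer remains free to transform the loop itself.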

>>> Benchmarks that can be const prop'd/etc away are often meaningless.
>>>
>>
>> A benchmark that's completely removed is even more meaningless, and the
>> developer may not even know it's happening.
>>
>>
>
> Write good benchmarks?
>
> No, seriously, i mean, you want benchmarks that tests what users will see
> when the compiler works, not benchmarks that test what users see if the
> were to suddenly turn off parts of the optimizers ;)
>
But users are also not testing how fast deterministic code which LLVM is
completely removing can go. This intrinsic prevents LLVM from correctly
thinking the code is deterministic (or that a value isn't used) so that
measurements are (at the very least, the tiniest bit) meaningful.

>> I'm not saying this intrinsic will make all benchmarks meaningful (and I
>> can't), I'm saying that it would be useful in Rust in ensuring that
>> tests/benches aren't invalidated simply because a computation wasn't
>> performed.
>>
>>> Past that, if you want to ensure a particular optimization does a
>>> particular thing on a benchmark, ISTM it would be better to generate the
>>> IR, run opt (or build your own pass-by-pass harness), and then run "the
>>> passes you want on it" instead of "trying to stop certain passes from doing
>>> things to it".
>>>
>>
>> True, but why would you want to force that speed bump onto other
>> developers? I'd argue that's more hacky than the inline asm.
>>
> Speed bump? Hacky?
> It's a completely normal test harness?
>
> That's in fact, why llvm uses it as a test harness?
>
I mean I wouldn't write a harness or some other type of workaround for
something like this: Rust doesn't seem to be the first to have encountered
this issue, thus it is nonsensical to require every project using LLVM to
have a separate harness or other workaround so they don't run into this
issue. LLVM's own documentation suggests that adding an intrinsic is the
best choice moving forward anyway: "Adding an intrinsic function is far
easier than adding an instruction, and is transparent to optimization
passes. If your added functionality can be expressed as a function call, an
intrinsic function is the method of choice for LLVM extension." (from
http://llvm.org/docs/ExtendingLLVM.html). That sounds perfect to me.

At any rate, I apologize for my original hand-wavy-ness; I am young and
inexperienced.

Daniel Berlin via llvm-dev

2015-Nov-06 17:27 UTC


On Fri, Nov 6, 2015 at 8:35 AM, Richard Diamond <wichard at vitalitystudios.com>
wrote:
>
>
> On Tue, Nov 3, 2015 at 3:15 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>
>>
>>
>> On Tue, Nov 3, 2015 at 12:29 PM, Richard Diamond <
>> wichard at vitalitystudios.com> wrote:
>>
>>>
>>>
>>> On Mon, Nov 2, 2015 at 9:16 PM, Daniel Berlin <dberlin at dberlin.org>
>>> wrote:
>>>
>>>> I'm very unclear and why you think a generic black box intrinsic will
>>>> have any different performance impact ;-)
>>>>
>>>>
>>>> I'm also unclear on what the goal with this intrinsic is.
>>>> I understand the symptoms you are trying to solve - what exactly is the
>>>> disease.
>>>>
>>>> IE you say "
>>>>
>>>> I'd like to propose a new intrinsic for use in preventing optimizations
>>>> from deleting IR due to constant propagation, dead code elimination, etc."
>>>>
>>>> But why are you trying to achieve this goal?
>>>>
>>>
>>> It's a cleaner design than current solutions (as far as I'm aware).
>>>
>>
>> For what, exact, well defined goal?
>>
>
>> Trying to make certain specific optimizations not work does not seem like
>> a goal unto itself.
>> It's a thing you are doing to achieve something else, right?
>> (Because if not, it has a very well defined and well supported solutions
>> - set up a pass manager that runs the passes you want)
>>
>> What is the something else?
>>
>> IE what is the problem that led you to consider this solution.
>>
>
> I apologize if I'm not being clear enough. This contrived example
> ```rust
> #[bench]
> fn bench_xor_1000_ints(b: &mut Bencher) {
>     b.iter(|| {
>         (0..1000).fold(0, |old, new| old ^ new);
>     });
> }
> ```
> is completely optimized away.
>

Great!

You should then test that this happens, and additionally write a test that
can't be optimized away, since the above is apparently not a useful
microbenchmark for anything but the compiler ;-)

Seriously though, there are basically three cases (with a bit of handwaving)

1. You want to test that the compiler optimizes something in a certain
way.  For the above example, with nothing else added, what you actually want
to test is that the compiler optimizes it away completely.
This doesn't require anything except using something like FileCheck and
producing IR at the end of Rust's optimization pipeline.

2. You want to make the above code into a benchmark, and ensure the
compiler is required to keep the number and relative order of certain
operations.
Use volatile for this.

Volatile is not what you seem to think it is; you may be thinking about it
in terms of what people use it for in C/C++.
Volatile in LLVM has a well defined meaning:
http://llvm.org/docs/LangRef.html#volatile-memory-accesses

3. You want to get the compiler to only do certain optimizations to your
code.

Yes, you have to either write a test harness (even if that test harness is
"your normal compiler, with certain flags passed"), or use ours, for that
;-)

It seems like you want #2, so you should use volatile.

But don't conflate #2 and #3.

As said:
If you want the compiler to only do certain things to your code, you should
tell it to only do those things by giving it a pass pipeline that only does
those things.  Nothing else is going to solve this problem well.

If you want the compiler to do every optimization it knows to your code,
but want it to maintain the number and relative order of certain
operations, that's volatile.
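
To make the volatile suggestion (case #2) concrete from the Rust side, here is a minimal sketch using `std::ptr::read_volatile` and `std::ptr::write_volatile`, which are the standard library's volatile accessors. The function name `xor_below_volatile` and the choice of reading the bound and writing the result through volatile accesses are assumptions made for this example, not something proposed in the thread.

```rust
use std::ptr;

/// XORs all integers in 0..n, where `n` is loaded with a volatile read and
/// the result is stored with a volatile write. The compiler has to keep
/// both memory accesses, so it can no longer assume it knows `n` or that
/// the result is unused. It may still optimize the loop in between.
pub fn xor_below_volatile(n_src: &u32, sink: &mut u32) {
    // SAFETY: `n_src` is a valid, aligned reference for the whole call.
    let n = unsafe { ptr::read_volatile(n_src) };

    let result = (0..n).fold(0, |old, new| old ^ new);

    // SAFETY: `sink` is a valid, aligned, exclusive reference.
    unsafe { ptr::write_volatile(sink, result) };
}
```

A benchmark loop would call this once per iteration with the same `n_src` and `sink`; the number and relative order of the volatile load and store are then preserved, which is the guarantee described above.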

Mehdi Amini via llvm-dev

2015-Nov-06 18:17 UTC


> On Nov 6, 2015, at 8:35 AM, Richard Diamond via llvm-dev <llvm-dev at
> lists.llvm.org> wrote:
>
>
>
> On Tue, Nov 3, 2015 at 3:15 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>
>
> On Tue, Nov 3, 2015 at 12:29 PM, Richard Diamond <wichard at
> vitalitystudios.com> wrote:
>
>
> On Mon, Nov 2, 2015 at 9:16 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
> I'm very unclear and why you think a generic black box intrinsic will
> have any different performance impact ;-)
>
>
> I'm also unclear on what the goal with this intrinsic is.
> I understand the symptoms you are trying to solve - what exactly is the
> disease.
>
> IE you say "
>
> I'd like to propose a new intrinsic for use in preventing optimizations
> from deleting IR due to constant propagation, dead code elimination, etc."
>
> But why are you trying to achieve this goal?
>
> It's a cleaner design than current solutions (as far as I'm aware).
>
> For what, exact, well defined goal?
>
> Trying to make certain specific optimizations not work does not seem like a
> goal unto itself.
> It's a thing you are doing to achieve something else, right?
> (Because if not, it has a very well defined and well supported solutions -
> set up a pass manager that runs the passes you want)
> 
> What is the something else?
> 
> IE what is the problem that led you to consider this solution.
> 
> I apologize if I'm not being clear enough. This contrived example
> ```rust
> #[bench]
> fn bench_xor_1000_ints(b: &mut Bencher) {
>     b.iter(|| {
>         (0..1000).fold(0, |old, new| old ^ new);
>     });
> }
> ```
> is completely optimized away. Granted, IRL production (ignoring the
> question of why this code was ever used in production in the first place) this
> optimization is desired, but here it leads to bogus measurements (ie 0ns per
> iteration). By using `test::black_box`, one would have
> 
> ```rust
> #[bench]
> fn bench_xor_1000_ints(b: &mut Bencher) {
>     b.iter(|| {
>         let n = test::black_box(1000);  // optional
>         test::black_box((0..n).fold(0, |old, new| old ^ new));
>     });
> }
> ```
> and the microbenchmark wouldn't have bogos 0ns measurements anymore.
> 
> Now, as I stated in the proposal, `test::black_box` currently uses no-op
> inline asm to "read" from its argument in a way the optimizations
> can't see. Conceptually, this seems like something that should be modelled
> in LLVM's IR rather than by hacks higher up the IR food chain because the
> root problem is caused by LLVM's optimization passes (most of the time this
> code optimization is desired, just not here). Plus, it seems others have used
> other tricks to achieve similar effects (ie volatile), so why shouldn't
> there be something to model this behavior?

How would black_box be different from the existing mechanisms (inline asm,
volatile, …)?
If the effect on the optimizer is not different, then there is no reason to
introduce a new intrinsic just for the sake of it. It has a cost: every
optimization has to take the new intrinsic into account.

On this topic, I think Chandler’s talk at CppCon seems relevant:
https://www.youtube.com/watch?v=nXaxk27zwlk
>  
> Benchmarks that can be const prop'd/etc away are often meaningless. 
>  
> A benchmark that's completely removed is even more meaningless, and the
> developer may not even know it's happening.
>
> Write good benchmarks?
>
> No, seriously, i mean, you want benchmarks that tests what users will see
> when the compiler works, not benchmarks that test what users see if the were to
> suddenly turn off parts of the optimizers ;)
>
> But users are also not testing how fast deterministic code which LLVM is
> completely removing can go. This intrinsic prevents LLVM from correctly thinking
> the code is deterministic (or that a value isn't used) so that measurements
> are (at the very least, the tiniest bit) meaningful.
>
> I'm not saying this intrinsic will make all benchmarks meaningful (and
> I can't), I'm saying that it would be useful in Rust in ensuring that
> tests/benches aren't invalidated simply because a computation wasn't
> performed.
>
> Past that, if you want to ensure a particular optimization does a
> particular thing on a benchmark, ISTM it would be better to generate the IR, run
> opt (or build your own pass-by-pass harness), and then run "the passes you
> want on it" instead of "trying to stop certain passes from doing
> things to it".
>
> True, but why would you want to force that speed bump onto other
> developers? I'd argue that's more hacky than the inline asm.
>
> Speed bump? Hacky?
> It's a completely normal test harness?
>
> That's in fact, why llvm uses it as a test harness?
>
> I mean I wouldn't write a harness or some other type of workaround for
> something like this: Rust doesn't seem to be the first to have encountered
> this issue, thus it is nonsensical to require every project using LLVM to have a
> separate harness or other workaround so they don't run into this issue.
> LLVM's own documentation suggests that adding an intrinsic is the best
> choice moving forward anyway: "Adding an intrinsic function is far easier
> than adding an instruction, and is transparent to optimization passes. If your
> added functionality can be expressed as a function call, an intrinsic function
> is the method of choice for LLVM extension." (from
> http://llvm.org/docs/ExtendingLLVM.html). That sounds perfect to me.

The doc says that if you *need* to extend LLVM, you should try an intrinsic
before adding an instruction; it is the “need” part that is not clear here.
The doc also states that an intrinsic is transparent to optimization passes,
but that is not the case here, since you want to prevent optimizations from
happening (and you haven’t really specified how to decide what an optimization
can do around this intrinsic; if you don’t teach the optimizer about it, it
will treat it as an external function call).

— 
Mehdi



Sean Silva via llvm-dev

2015-Nov-07 03:03 UTC


On Fri, Nov 6, 2015 at 8:35 AM, Richard Diamond via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
>
>
> On Tue, Nov 3, 2015 at 3:15 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>
>>
>>
>> On Tue, Nov 3, 2015 at 12:29 PM, Richard Diamond <
>> wichard at vitalitystudios.com> wrote:
>>
>>>
>>>
>>> On Mon, Nov 2, 2015 at 9:16 PM, Daniel Berlin <dberlin at dberlin.org>
>>> wrote:
>>>
>>>> I'm very unclear and why you think a generic black box intrinsic will
>>>> have any different performance impact ;-)
>>>>
>>>>
>>>> I'm also unclear on what the goal with this intrinsic is.
>>>> I understand the symptoms you are trying to solve - what exactly is the
>>>> disease.
>>>>
>>>> IE you say "
>>>>
>>>> I'd like to propose a new intrinsic for use in preventing optimizations
>>>> from deleting IR due to constant propagation, dead code elimination, etc."
>>>>
>>>> But why are you trying to achieve this goal?
>>>>
>>>
>>> It's a cleaner design than current solutions (as far as I'm aware).
>>>
>>
>> For what, exact, well defined goal?
>>
>
>> Trying to make certain specific optimizations not work does not seem like
>> a goal unto itself.
>> It's a thing you are doing to achieve something else, right?
>> (Because if not, it has a very well defined and well supported solutions
>> - set up a pass manager that runs the passes you want)
>>
>> What is the something else?
>>
>> IE what is the problem that led you to consider this solution.
>>
>
> I apologize if I'm not being clear enough. This contrived example
> ```rust
> #[bench]
> fn bench_xor_1000_ints(b: &mut Bencher) {
>     b.iter(|| {
>         (0..1000).fold(0, |old, new| old ^ new);
>     });
> }
> ```
> is completely optimized away. Granted, IRL production (ignoring the
> question of why this code was ever used in production in the first place)
> this optimization is desired, but here it leads to bogus measurements (ie
> 0ns per iteration). By using `test::black_box`, one would have
>
> ```rust
> #[bench]
> fn bench_xor_1000_ints(b: &mut Bencher) {
>     b.iter(|| {
>         let n = test::black_box(1000);  // optional
>         test::black_box((0..n).fold(0, |old, new| old ^ new));
>     });
> }
> ```
> and the microbenchmark wouldn't have bogos 0ns measurements anymore.
>
I still don't understand what you are trying to test with this.

Are you trying to measure e.g. the performance of

xor %eax, %eax
1:
xor %esi, %eax
dec %esi
jnz 1b

?

Are you trying to measure whether the compiler will vectorize this? Are you
trying to test how well the compiler will vectorize this? Are you trying to
measure the compiler's unrolling heuristics? Are you trying to see if the
(0..n).fold(...) machinery gets lowered to a loop? Are you trying to see if
the compiler will reduce it to (n & ((n&1)-1)) ^ ((n ^ (n >> 1))&1) or a
similar closed form expression? (I'm sure that's not the simplest one; just
one I cooked up)

I'm honestly curious.
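
For readers who want to check the closed form quoted above, here is a small sketch that compares it against a plain loop. One caveat, stated as an observation of this sketch rather than of the thread: the expression matches the XOR of all integers from 0 through n inclusive (which is what the asm loop above computes), so the exclusive Rust range `0..n` corresponds to evaluating it at n - 1. The function names are made up for the example.

```rust
/// Reference implementation: XOR of every integer in 0..=n.
pub fn xor_through_loop(n: u32) -> u32 {
    (0..=n).fold(0, |acc, i| acc ^ i)
}

/// The closed form from the message above, transcribed for `u32`.
/// `(n & 1).wrapping_sub(1)` is all ones when `n` is even and zero when
/// `n` is odd, so the first term is `n` for even `n` and 0 otherwise.
/// The second term is bit 0 of `n` XOR bit 1 of `n`.
pub fn xor_through_closed_form(n: u32) -> u32 {
    (n & (n & 1).wrapping_sub(1)) ^ ((n ^ (n >> 1)) & 1)
}

/// Checks that both versions agree for every n up to `limit`.
pub fn closed_form_matches_up_to(limit: u32) -> bool {
    (0..=limit).all(|n| xor_through_loop(n) == xor_through_closed_form(n))
}
```

If an optimizer performed this reduction, a benchmark of the fold would be measuring a handful of bit operations rather than a 1000-iteration loop, which is part of why the question of what is being measured matters.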

-- Sean Silva

>
> Now, as I stated in the proposal, `test::black_box` currently uses no-op
> inline asm to "read" from its argument in a way the optimizations can't
> see. Conceptually, this seems like something that should be modelled in
> LLVM's IR rather than by hacks higher up the IR food chain because the root
> problem is caused by LLVM's optimization passes (most of the time this code
> optimization is desired, just not here). Plus, it seems others have used
> other tricks to achieve similar effects (ie volatile), so why shouldn't
> there be something to model this behaviour?
>
>
>>>> Benchmarks that can be const prop'd/etc away are often meaningless.
>>>>
>>>
>>> A benchmark that's completely removed is even more meaningless, and the
>>> developer may not even know it's happening.
>>>
>>
>> Write good benchmarks?
>>
>> No, seriously, i mean, you want benchmarks that tests what users will see
>> when the compiler works, not benchmarks that test what users see if the
>> were to suddenly turn off parts of the optimizers ;)
>>
>
> But users are also not testing how fast deterministic code which LLVM is
> completely removing can go. This intrinsic prevents LLVM from correctly
> thinking the code is deterministic (or that a value isn't used) so that
> measurements are (at the very least, the tiniest bit) meaningful.
>
>>> I'm not saying this intrinsic will make all benchmarks meaningful (and I
>>> can't), I'm saying that it would be useful in Rust in ensuring that
>>> tests/benches aren't invalidated simply because a computation wasn't
>>> performed.
>>>
>>>> Past that, if you want to ensure a particular optimization does a
>>>> particular thing on a benchmark, ISTM it would be better to generate the
>>>> IR, run opt (or build your own pass-by-pass harness), and then run "the
>>>> passes you want on it" instead of "trying to stop certain passes from doing
>>>> things to it".
>>>>
>>>
>>> True, but why would you want to force that speed bump onto other
>>> developers? I'd argue that's more hacky than the inline asm.
>>>
>> Speed bump? Hacky?
>> It's a completely normal test harness?
>>
>> That's in fact, why llvm uses it as a test harness?
>>
>
> I mean I wouldn't write a harness or some other type of workaround for
> something like this: Rust doesn't seem to be the first to have encountered
> this issue, thus it is nonsensical to require every project using LLVM to
> have a separate harness or other workaround so they don't run into this
> issue. LLVM's own documentation suggests that adding an intrinsic is the
> best choice moving forward anyway: "Adding an intrinsic function is far
> easier than adding an instruction, and is transparent to optimization
> passes. If your added functionality can be expressed as a function call, an
> intrinsic function is the method of choice for LLVM extension." (from
> http://llvm.org/docs/ExtendingLLVM.html). That sounds perfect to me.
>
> At anyrate, I apologize for my original hand-wavy-ness; I am young and
> inexperienced.
>

Alex Elsayed via llvm-dev

2015-Nov-10 01:56 UTC


On Fri, 06 Nov 2015 09:27:32 -0800, Daniel Berlin via llvm-dev wrote:

<snip>
>
> Great!
>
> You should then test that this happens, and additionally write a test
> that can't be optimized away, since the above is apparently not a useful
> microbenchmark for anything but the compiler ;-)
> 
> Seriously though, there are basically three cases (with a bit of
> handwaving)
> 
> 1. You want to test that the compiler optimizes something in a certain
> way.  The above example, without anything else, you actually want to
> test that the compiler optimizes this away completely.
> This doesn't require anything except using something like FIleCheck and
> producing IR at the end of Rust's optimization pipeline.
> 
> 2. You want to make the above code into a benchmark, and ensure the
> compiler is required to keep the number and relative order of certain
> operations.
> Use volatile for this.
> 
> Volatile is not what you seem to think it is, or may think about it in
> terms of what people use it for in C/C++.
> volatile in llvm has a well defined meaning:
> http://llvm.org/docs/LangRef.html#volatile-memory-accesses
> 
> 3. You want to get the compiler to only do certain optimizations to your
> code.
> 
> Yes, you have to either write a test harness (even if that test harness
> is "your normal compiler, with certain flags passed"), or use ours, for
> that ;-)
> 
> 
> It seems like you want #2, so you should use volatile.
> 
> But don't conflate #2 and #3.
> 
> As said:
> If you want the compiler to only do certain things to your code, you
> should tell it to only do those things by giving it a pass pipeline that
> only does those things.  Nothing else is going to solve this problem
> well.
> 
> If you want the compiler to do every optimization it knows to your code,
> but want it to maintain the number and relative order of certain
> operations, that's volatile.

I think the fundamental thing you're missing is that benchmarks are an
exercise in if/then:

*If* a user exercises this API, *then* how well would it perform?

Of course, in the case of a user, the data could come from anywhere, and
go anywhere - the terminal, a network socket, whatever.

However, in a benchmark, all the data comes from (and goes to) places the
compiler can see.

Thus, it's necessary to make the compiler _pretend_ the data came from
and goes to a "black box", in order for the benchmarks to even *remotely*
resemble what they're meant to test.

This is actually distinct from #1, #2, _and_ #3 above - quite simply, 
what is needed is a way to simulate a "real usage" scenario without 
actually contacting the external world.
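
A minimal sketch of that "pretend the data comes from and goes to a black box" idea, written against `std::hint::black_box`, which current Rust provides in the standard library (it is not mentioned in this thread and is used here as an assumption). The harness function `time_xor` is hypothetical and far cruder than a real benchmark runner.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs the XOR fold `iterations` times and returns the last result along
/// with the total elapsed time.
pub fn time_xor(iterations: u32) -> (u32, Duration) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iterations {
        // Input side: the compiler should treat `n` as if it arrived from
        // the outside world instead of being the literal 1000.
        let n = black_box(1000u32);
        // Output side: the result is handed to a consumer the compiler
        // cannot see into, so the fold is not dead code.
        last = black_box((0..n).fold(0, |old, new| old ^ new));
    }
    (last, start.elapsed())
}
```

Dividing the returned duration by `iterations` gives a per-iteration estimate. The standard library documents `black_box` as a best-effort hint, so the numbers are only as trustworthy as the compiler's cooperation.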
