Peter Collingbourne via llvm-dev
2018-Jan-24 00:44 UTC
[llvm-dev] RFC: Using link-time optimization to eliminate retpolines
The proposed mitigation for variant 2 of CVE-2017-5715, “branch target
injection”, is to send all indirect branches through an instruction
sequence known as a retpoline. Because the purpose of a retpoline is to
prevent attacker-controlled speculation, we also end up losing the benefits
of benign speculation, which can lead to a measurable loss of performance.
We can regain some of those benefits if we know that the set of possible
branch targets is fixed (this is sometimes known to be the case when using
whole-program devirtualization or CFI -- see
https://clang.llvm.org/docs/LTOVisibility.html). In that case, we can
construct a so-called “branch funnel” that selects one of the possible
targets by performing a binary search on an address associated with the
indirect branch (for virtual calls, this is the address of the vtable, and
for indirect calls via a function pointer, this is the function pointer
itself), eventually directly branching to the selected target. As a result,
the processor is free to speculatively execute the virtual call, but it can
only speculatively branch to addresses of valid implementations of the
virtual function, as opposed to arbitrary addresses.
For example, suppose that we have the following class hierarchy, which is
known to be closed:
struct Base { virtual void f() = 0; };
struct A : Base { virtual void f(); };
struct B : Base { virtual void f(); };
struct C : Base { virtual void f(); };
We can lay out the vtables for the derived classes in the order A, B, C,
and produce an instruction sequence that directs execution to one of the
targets A::f, B::f and C::f depending on the vtable address. In x86_64
assembly, a branch funnel would look like this:
lea B::vtable+16(%rip), %r11
cmp %r11, %r10
jb A::f
je B::f
jmp C::f
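
For more targets the comparisons nest, which is what makes the funnel a
binary search over the contiguously laid-out vtable addresses. As a
hand-written sketch only (not compiler output), extending the example with a
hypothetical fourth class D whose vtable follows C's might look like this:

lea C::vtable+16(%rip), %r11
cmp %r11, %r10
jb 0f              # below C's address point: target is A or B
je C::f            # equal to C's address point: C
jmp D::f           # above: only D remains
0:
lea B::vtable+16(%rip), %r11
cmp %r11, %r10
jb A::f            # below B's address point: A
jmp B::f           # otherwise: B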
A caller performs a virtual call by loading the vtable address into
register r10, setting up the other registers for the virtual call and
directly calling the branch funnel as if it were a regular function.
Because the branch funnel enforces control flow integrity by itself, we can
also avoid emitting CFI checks at call sites that use branch funnels when
CFI is enabled.
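
As an illustration, under the SysV x86_64 calling convention a call site for
base->f() might look roughly like the sketch below. The funnel's symbol name
is invented here; %r10 and %r11 are caller-saved scratch registers that are
not used for argument passing, which is why the funnel is free to use them:

mov (%rdi), %r10             # %rdi holds 'this'; for this simple hierarchy the
                             # first word of the object is the vtable address
call Base_f_branch_funnel    # hypothetical funnel symbol, called like a
                             # regular function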
To control the layout of vtables and function pointers, we can extend
existing mechanisms for controlling layout that are used to implement CFI
(see https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html) so that
they are also used whenever a branch funnel needs to be created.
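
To make the single-comparison scheme concrete, here is a simplified sketch of
the grouped read-only data for the example hierarchy, assuming the usual
Itanium C++ ABI vtable layout (offset-to-top, RTTI pointer, then function
slots). The address point a caller loads is 16 bytes past the start of each
vtable on x86_64, which is where the +16 in the branch funnel above comes
from; the compiler's actual output (symbol names, padding, alignment) will
differ:

A::vtable:  .quad 0              # offset-to-top
            .quad <RTTI for A>   # RTTI pointer (may be null)
            .quad A::f           # address point = A::vtable+16
B::vtable:  .quad 0
            .quad <RTTI for B>
            .quad B::f           # address point = B::vtable+16
C::vtable:  .quad 0
            .quad <RTTI for C>
            .quad C::f           # address point = C::vtable+16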
The compiler will only use branch funnels when both the retpoline mitigation
(-mretpoline) and whole-program devirtualization (-fwhole-program-vtables)
are enabled: the former because, without retpolines, a regular indirect call
is generally assumed to be less expensive than a branch funnel, and the
latter because it provides the necessary guarantee that the type hierarchy is
closed. Even when retpolines are enabled, there is

still a cost associated with executing a branch funnel that needs to be
balanced against the cost of a regular CFI check and retpoline, so branch
funnels are only used when there are <=10 targets (this number has not been
tuned yet). Because the implementation uses some of the same mechanisms
that are used to implement CFI and whole-program devirtualization, it
requires LTO (it is compatible with both full LTO and ThinLTO).
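
For reference, a CFI + retpoline + whole-program-vtables build similar to the
configuration benchmarked below would look roughly like the following sketch;
the precise flag set needed to trigger branch funnel generation may be
narrower, and lld is only one LTO-capable linker choice:

clang++ -flto=thin -fvisibility=hidden \
        -fsanitize=cfi-vcall -fwhole-program-vtables -mretpoline \
        -fuse-ld=lld a.cpp b.cpp -o prog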
To measure the performance impact of branch funnels, I ran a selection of
Chrome benchmark suites on Chrome binaries built with CFI, CFI + retpoline
and CFI + retpoline + branch funnels, and measured the median impact over
all benchmarks in each suite. The numbers are presented below. I should
preface these numbers by saying that these are largely microbenchmarks, so
the impact of retpoline on its own is unlikely to be characteristic of real
workloads. The numbers to focus on should be the impact of retpoline +
branch funnels relative to the impact of retpoline, where there is a median
5.7% regression as compared to the median 8% regression associated with
retpoline.
Benchmark suite        CFI + retpoline impact   CFI + retpoline + BF impact
                       (relative to CFI)        (relative to CFI)
blink_perf.bindings    0.9% improvement         9.8% improvement
blink_perf.dom         20.4% regression         17.5% regression
blink_perf.layout      17.4% regression         14.3% regression
blink_perf.parser      3.8% regression          5.7% regression
blink_perf.svg         8.0% regression          5.4% regression
Future work

- Implementation of branch funnels for architectures other than x86_64.

- Implementation of branch funnels for indirect calls via a function pointer
  (currently only implemented for virtual calls). This will probably require
  an implementation of whole-program “devirtualization” for indirect calls.

- Use profile data to order the comparisons in the branch funnel by
  frequency, to minimise the number of comparisons required for frequent
  virtual calls.
Thanks,
--
Peter
Das, Dibyendu via llvm-dev
2018-Jan-24 16:47 UTC
[llvm-dev] RFC: Using link-time optimization to eliminate retpolines
Can you run your new patch on the CPU2017 INT C/C++ benchmarks and report the
regression percentages?
Sean Silva via llvm-dev
2018-Jan-26 17:33 UTC
[llvm-dev] RFC: Using link-time optimization to eliminate retpolines
Wouldn't a branch funnel open the door to a type 1 attack?
E.g. if the code looks like this, then a branch funnel basically turns into
a standard type 1 pattern AFAICT:
struct Base {
  virtual int f(long) = 0;
};

struct A : Base {
  int f(long x) override {
    return 0;
  }
};

struct B : Base {
  int f(long x) override {
    // As in listing 1 in https://spectreattack.com/spectre.pdf
    return array2[array1[x] * 256];
  }
};
-- Sean Silva
Paul Rouschal via llvm-dev
2018-Jan-26 18:10 UTC
[llvm-dev] RFC: Using link-time optimization to eliminate retpolines
Hi,

Sean Silva via llvm-dev wrote:
> Wouldn't a branch funnel open the door to a type 1 attack?

Only if the code looks exactly as you wrote it. If I understand this
correctly, the problem with indirect branches is that the "gadget", the code
leaking the data, could be *anywhere* in the binary, giving the attacker much
more freedom. So restricting these calls to one of the known correct targets
will still be a (relative) win.

Hand-wavy example:

struct Base { virtual int f(long) = 0; };
struct A : Base { int f(long x) override { return 0; } };
struct B : Base { int f(long x) override { return 1; } };

static int aCompletelyUnrelatedFunction() {
  someOtherCode();
Gadget:
  int z = array2[array1[somethingInTheSameRegisterAsX] * 256];
  return z;
}

Here the attacker could train the predictor to continue execution at
"Gadget". To quote from [1]:

"To mistrain the BTB, the attacker finds the virtual address of the gadget in
the victim's address space, then performs indirect branches to this address.
This training is done from the attacker's address space, and it does not
matter what resides at the gadget address in the attacker's address space;
all that is required is that the branch used for training branches to the
same destination virtual address."

[1] Kocher et al.: Spectre Attacks: Exploiting Speculative Execution,
https://spectreattack.com/spectre.pdf

Best Regards,
Paul