thr3ads.net - llvm dev - [LLVMdev] Proposal to improve vzeroupper optimization strategy [Sep 2013]

Gao, Yunzhong

2013-Sep-19 18:53 UTC

[LLVMdev] Proposal to improve vzeroupper optimization strategy

Hi all,

I would like to make a proposal about changing the optimization strategy
regarding when to insert a vzeroupper instruction in the x86 backend.

Current implementation:
vzeroupper is inserted into any function that uses AVX instructions. The
insertion points are:
1) before a call instruction;
2) before a return instruction;
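
To make the rule concrete, here is a minimal sketch of the policy as described above, written against a toy instruction stream. The `insn_kind` enum and the function names are hypothetical and exist only for illustration; this is not the code of the x86 backend pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified instruction kinds; not LLVM's real opcode set. */
enum insn_kind { INSN_SCALAR, INSN_AVX, INSN_CALL, INSN_RET };

/* True if any instruction in the function body is an AVX instruction. */
static bool uses_avx(const enum insn_kind *body, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (body[i] == INSN_AVX)
            return true;
    return false;
}

/*
 * Model of the current policy as described in the post: if the function
 * uses AVX, a vzeroupper goes before every call and every return.
 * Writes the indices of the instructions that get a vzeroupper in front
 * of them into `out` (room for n entries) and returns how many were written.
 */
static size_t current_policy_points(const enum insn_kind *body, size_t n,
                                    size_t *out)
{
    assert(body != NULL && out != NULL);
    size_t count = 0;
    if (!uses_avx(body, n))
        return 0;
    for (size_t i = 0; i < n; i++)
        if (body[i] == INSN_CALL || body[i] == INSN_RET)
            out[count++] = i;
    return count;
}
```

For a body of { AVX, CALL, SCALAR, RET }, this model reports two insertion points, at indices 1 and 3; a body with no AVX instruction gets none.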

Rationale:
vzeroupper is an AVX instruction; it is inserted to avoid the performance
penalty incurred when switching between x86 AVX mode and SSE mode, e.g., when
an AVX function calls an SSE function.

My proposal:
By default, do not insert a vzeroupper instruction unless the function uses
legacy SSE instructions. By a legacy SSE instruction, I mean any vector
instruction that does not have a v- prefix and that writes an XMM register but
not a YMM register. If a legacy SSE instruction is spotted, then insert a
vzeroupper instruction:
1) before a call instruction;
2) before a return instruction;
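
The proposed test can be sketched as a predicate over a function body. The `vec_insn` struct and its fields are assumptions made for this illustration; a real backend would query instruction encodings and register operands instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one vector instruction. */
struct vec_insn {
    bool has_v_prefix; /* VEX-encoded form, e.g. vaddps rather than addps */
    bool writes_xmm;   /* destination is an XMM register */
    bool writes_ymm;   /* destination is a YMM register */
};

/* Legacy SSE per the definition in the post: no v- prefix, writes XMM
 * but not YMM. */
static bool is_legacy_sse(const struct vec_insn *insn)
{
    return !insn->has_v_prefix && insn->writes_xmm && !insn->writes_ymm;
}

/*
 * Proposed policy: a function gets vzeroupper (before its calls and
 * returns) only if at least one legacy SSE instruction is spotted in it.
 */
static bool function_needs_vzeroupper(const struct vec_insn *body, size_t n)
{
    assert(body != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (is_legacy_sse(&body[i]))
            return true;
    return false;
}
```

Under this sketch, a function made up only of v-prefixed instructions never receives a vzeroupper, which is the case the proposal expects to be common.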

Explanation:
If all applications and libraries are compiled with the same toolchain, then
with this proposal, a function can assume that incoming AVX registers have
their top 128 bits either specified or zeroed. Assuming that legacy SSE
instructions will seldom be generated, it should rarely be necessary to emit
vzeroupper instructions, which are slow in their own right.

Possible problem:
This proposal is biased towards the situation where all applications and
libraries are compiled with the same toolchain. If it is common to mix and
match applications built with different toolchains, this approach might lead to
situations where a vzeroupper instruction is missing when an LLVM-compiled AVX
function calls a foreign-compiled SSE function, which incurs a transition
penalty. One possible way around this issue is to add a function attribute
that specifies whether the caller and callee have the same architecture, e.g.,
extern int foo(void) __attribute__((nolegacy));
would declare an external function that does not use legacy SSE instructions.

Any thoughts?
- Gao.

Manny Ko

2013-Sep-19 19:04 UTC

Great idea. I reported this problem before, and I am glad to see someone
trying to tackle it.

cheers.

Bin Tzeng

2013-Sep-19 19:15 UTC

Great! Glad to see you are working on this.


Eli Friedman

2013-Sep-19 19:30 UTC

On Thu, Sep 19, 2013 at 11:53 AM, Gao, Yunzhong <yunzhong_gao at playstation.sony.com> wrote:
> My proposal:
> By default, do not insert a vzeroupper instruction unless the function uses
> legacy SSE instructions. By a legacy SSE instruction, I mean any vector
> instruction that does not have a v- prefix and that writes an XMM register
> but not a YMM register. If a legacy SSE instruction is spotted, then insert
> a vzeroupper instruction:
> 1) before a call instruction;
> 2) before a return instruction;

This is essentially equivalent to "don't insert vzeroupper anywhere", as
far as I can tell.  (The case of SSE instructions without a v- prefixed
equivalent is rare enough we can separate it from this discussion.)

The reason we need vzeroupper in the first place is because we can't assume
other functions won't use legacy SSE instructions; for example, on most
systems, calling sin() will use legacy SSE instructions.  I mean, if you
can make some unusual guarantee about your platform, it might make sense to
disable vzeroupper generation in general, but it simply doesn't make sense
on most platforms.
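
The concern can be illustrated with a deliberately coarse state model: 256-bit AVX instructions leave the upper YMM halves "dirty", vzeroupper cleans them, and a legacy SSE instruction executed while they are dirty is counted as a penalty. The `op` enum and function name are hypothetical, and real hardware behaviour differs between microarchitectures, so treat this as a sketch of the reasoning rather than a performance model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified trace events. */
enum op { OP_AVX256, OP_LEGACY_SSE, OP_VZEROUPPER };

/*
 * Counts legacy SSE instructions that execute while the upper halves of
 * the YMM registers are dirty in this toy model.
 */
static size_t count_transition_penalties(const enum op *trace, size_t n)
{
    assert(trace != NULL || n == 0);
    bool upper_dirty = false;
    size_t penalties = 0;
    for (size_t i = 0; i < n; i++) {
        switch (trace[i]) {
        case OP_AVX256:
            upper_dirty = true;   /* 256-bit write dirties the upper halves */
            break;
        case OP_VZEROUPPER:
            upper_dirty = false;  /* upper halves are zeroed again */
            break;
        case OP_LEGACY_SSE:
            if (upper_dirty)
                penalties++;      /* legacy SSE with dirty upper halves */
            break;
        }
    }
    return penalties;
}
```

In this model, an AVX caller that invokes a legacy SSE routine such as the sin() example above incurs a penalty unless a vzeroupper sits between the two.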

If you want a mechanism to disable vzeroupper generation for particular
function calls, that might make sense...

-Eli

Gao, Yunzhong

2013-Sep-20 21:52 UTC

Hi Manny,
Thanks! You said that you reported this problem before. Do you know whether
there is an existing LLVM Bugzilla ticket for this issue?
- Gao.

Gao, Yunzhong

2013-Sep-20 21:58 UTC

Hi Eli,
Thanks for the feedback. Please see below.
- Gao.

From: Eli Friedman [mailto:eli.friedman at gmail.com]
Sent: Thursday, September 19, 2013 12:31 PM
To: Gao, Yunzhong
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Proposal to improve vzeroupper optimization strategy
> This is essentially equivalent to "don't insert vzeroupper anywhere", as
> far as I can tell. (The case of SSE instructions without a v- prefixed
> equivalent is rare enough we can separate it from this discussion.)

So would you be interested in a patch that disables vzeroupper by default?

I implemented this possibly over-engineered solution in our local tree to work
around some bad instruction selection issues in the LLVM backend. When
benchmarking our game code, I noticed that legacy SSE instructions were
sometimes selected despite the existence of an AVX equivalent, in which case
the vzeroupper instruction was needed. It is also much easier to detect the
presence of a vzeroupper instruction than to detect every individual legacy
SSE instruction.

The instruction selection issues were later fixed in our tree (patches to be
submitted later), at least for the handful of games I tested. So a simple
change that just disables vzeroupper by default would be acceptable to us as
well.

> The reason we need vzeroupper in the first place is because we can't assume
> other functions won't use legacy SSE instructions; for example, on most
> systems, calling sin() will use legacy SSE instructions.  I mean, if you can
> make some unusual guarantee about your platform, it might make sense to
> disable vzeroupper generation in general, but it simply doesn't make sense
> on most platforms.

I am confused by this point. By "most systems," do you have in mind a platform
where the sin() function was compiled by gcc but the application code was
compiled by clang?

If the sin() function was compiled by clang for a platform that supports AVX
instructions, I would not expect it to contain legacy SSE instructions. Is
that not the case on your platform?

I just looked at the library code for our sin() function and I do not see any
legacy SSE instructions (because of license restrictions, I cannot share our
library code; sorry).
> If you want a mechanism to disable vzeroupper generation for particular
> function calls, that might make sense...
> -Eli