Gao, Yunzhong
2013-Sep-19 18:53 UTC
[LLVMdev] Proposal to improve vzeroupper optimization strategy
Hi all,

I would like to make a proposal about changing the optimization strategy regarding when to insert a vzeroupper instruction in the x86 backend.

Current implementation:
vzeroupper is inserted into any function that uses AVX instructions. The insertion points are:
1) before a call instruction;
2) before a return instruction.

Rationale:
vzeroupper is an AVX instruction; it is inserted to avoid a performance penalty when switching between x86 AVX mode and SSE mode, e.g., when an AVX function calls an SSE function.

My proposal:
By default, do not insert a vzeroupper instruction unless the function uses legacy SSE instructions. By a legacy SSE instruction, I mean any vector instruction that does not have a v- prefix and writes an XMM register but not a YMM register. If a legacy SSE instruction is spotted, then insert a vzeroupper instruction:
1) before a call instruction;
2) before a return instruction.

Explanation:
If all applications and libraries are compiled with the same toolchain, then with this proposal a function can assume that incoming AVX registers have their top 128 bits either specified or zeroed. Assuming that legacy SSE instructions will seldom be generated, it should be rare to have to emit vzeroupper instructions, which are slow in themselves.

Possible problem:
This proposal is biased towards the situation where all applications and libraries are compiled with the same toolchain. If it is common to mix and match applications built with different toolchains, this approach might lead to situations where a vzeroupper instruction is missing on a call from an LLVM-compiled AVX function to a foreign-compiled SSE function, hence a transition penalty. One possible way around this issue is to add a function attribute which specifies whether the caller and callee have the same architecture, e.g.,
extern int foo __attribute__((nolegacy));
would declare an external function that does not use legacy SSE instructions.

Any thoughts?
- Gao.
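The difference between the two policies can be sketched as a small C model. This is only an illustration under stated assumptions: the `insn_kind` classes and the `count_vzeroupper_*` functions are hypothetical names invented for this sketch, not LLVM's data structures or its actual pass, and a function body is reduced to a flat list of instruction kinds with no control flow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction classes for this sketch (not LLVM types). */
typedef enum {
    INSN_OTHER,       /* anything that touches no vector state */
    INSN_AVX,         /* VEX-encoded (v- prefixed) vector instruction */
    INSN_LEGACY_SSE,  /* non-VEX vector instruction writing an XMM register */
    INSN_CALL,
    INSN_RET
} insn_kind;

/* True if the function body contains at least one instruction of kind k. */
static bool body_uses(const insn_kind *body, size_t n, insn_kind k)
{
    assert(body != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (body[i] == k)
            return true;
    return false;
}

/* Number of calls and returns, i.e. the candidate insertion points. */
static size_t count_exit_points(const insn_kind *body, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (body[i] == INSN_CALL || body[i] == INSN_RET)
            count++;
    return count;
}

/* Current policy as described above: triggered by any AVX use. */
size_t count_vzeroupper_current(const insn_kind *body, size_t n)
{
    return body_uses(body, n, INSN_AVX) ? count_exit_points(body, n) : 0;
}

/* Proposed policy: triggered only by a legacy SSE instruction. */
size_t count_vzeroupper_proposed(const insn_kind *body, size_t n)
{
    return body_uses(body, n, INSN_LEGACY_SSE) ? count_exit_points(body, n) : 0;
}
```

For a body consisting of an AVX instruction, a call, another AVX instruction, and a return, this model gives two insertions under the current policy and none under the proposed one.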
Manny Ko
2013-Sep-19 19:04 UTC
[LLVMdev] Proposal to improve vzeroupper optimization strategy
Great idea. I reported on this problem before and glad to see someone trying to tackle this. cheers.
Bin Tzeng
2013-Sep-19 19:15 UTC
[LLVMdev] Proposal to improve vzeroupper optimization strategy
Great! Glad to see you are working on this.
Eli Friedman
2013-Sep-19 19:30 UTC
[LLVMdev] Proposal to improve vzeroupper optimization strategy
On Thu, Sep 19, 2013 at 11:53 AM, Gao, Yunzhong <yunzhong_gao at playstation.sony.com> wrote:

> My proposal:
> Default to not insert vzeroupper instruction unless a function is using legacy
> SSE instructions. By a legacy SSE instruction, I mean any vector instructions
> that do not have a v- prefix, write XMM register but not YMM register. If a
> legacy SSE instruction is spotted, then insert a vzeroupper instruction:
> 1) before a call instruction;
> 2) before a return instruction;

This is essentially equivalent to "don't insert vzeroupper anywhere", as far as I can tell. (The case of SSE instructions without a v- prefixed equivalent is rare enough we can separate it from this discussion.)

The reason we need vzeroupper in the first place is because we can't assume other functions won't use legacy SSE instructions; for example, on most systems, calling sin() will use legacy SSE instructions. I mean, if you can make some unusual guarantee about your platform, it might make sense to disable vzeroupper generation in general, but it simply doesn't make sense on most platforms.

If you want a mechanism to disable vzeroupper generation for particular function calls, that might make sense...

-Eli
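A per-call mechanism of the kind suggested here could be modelled as follows. This is a hedged sketch rather than anything that exists in LLVM: `callee_info`, its `may_use_legacy_sse` flag, and `needs_vzeroupper_before_call` are hypothetical names, and the flag stands in for whatever annotation (for example the `nolegacy` attribute floated in the original proposal) would convey the guarantee. A callee without such a guarantee, like a system sin(), is treated conservatively.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-callee knowledge available at a call site. */
typedef struct {
    const char *name;
    /* Conservative default is true: nothing is known about the callee. */
    bool may_use_legacy_sse;
} callee_info;

/*
 * Decide whether to emit vzeroupper before one particular call.
 * upper_halves_dirty: the caller may have written the upper 128 bits
 * of some YMM register before reaching this call.
 */
bool needs_vzeroupper_before_call(bool upper_halves_dirty,
                                  const callee_info *callee)
{
    assert(callee != NULL);
    if (!upper_halves_dirty)
        return false;  /* no AVX state that could trigger a transition */
    return callee->may_use_legacy_sse;
}
```

Under this model the general case keeps today's behaviour, and only calls to callees explicitly marked as free of legacy SSE instructions skip the vzeroupper.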
Gao, Yunzhong
2013-Sep-20 21:52 UTC
[LLVMdev] Proposal to improve vzeroupper optimization strategy
Hi Manny,

Thanks! You said that you reported on this problem before, do you know whether there is an existing LLVM bugzilla ticket for this issue?
- Gao.
Gao, Yunzhong
2013-Sep-20 21:58 UTC
[LLVMdev] Proposal to improve vzeroupper optimization strategy
Hi Eli,

Thanks for the feedback. Please see below.
- Gao.

From: Eli Friedman [mailto:eli.friedman at gmail.com]
Sent: Thursday, September 19, 2013 12:31 PM
To: Gao, Yunzhong
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Proposal to improve vzeroupper optimization strategy

> This is essentially equivalent to "don't insert vzeroupper anywhere", as
> far as I can tell. (The case of SSE instructions without a v- prefixed
> equivalent is rare enough we can separate it from this discussion.)

So would you be interested in a patch that disables vzeroupper by default? I implemented this possibly over-engineered solution in our local tree to work around some bad instruction selection issues in the LLVM backend. When benchmarking on our game code, I noticed that legacy SSE instructions were sometimes selected despite the existence of an AVX equivalent, in which case the vzeroupper instruction was needed. And it is much easier to detect the existence of a vzeroupper instruction than to detect every single legacy SSE instruction. The instruction selection issues were later fixed in our tree (patches to be submitted later), at least for the handful of games I tested. So a simple change to just disable vzeroupper by default would be acceptable to us as well.

> The reason we need vzeroupper in the first place is because we can't assume
> other functions won't use legacy SSE instructions; for example, on most
> systems, calling sin() will use legacy SSE instructions. I mean, if you can
> make some unusual guarantee about your platform, it might make sense to
> disable vzeroupper generation in general, but it simply doesn't make sense
> on most platforms.

I am confused by this point. By "most systems," do you have in mind a platform where the sin() function was compiled by gcc but the application code was compiled by clang? If the sin() function was compiled by clang for a platform that supports AVX instructions, I would not expect it to contain legacy SSE instructions. Is that not the case for your platform? I just looked at the library code for our sin() function and I do not see any legacy SSE instructions (but because of license restrictions I cannot share our library code; sorry).

> If you want a mechanism to disable vzeroupper generation for particular
> function calls, that might make sense...
> -Eli