Ben Simhon, Oren via llvm-dev
2016-Nov-30 15:20 UTC
[llvm-dev] RFC: Adding Support For Vectorcall Calling Convention
Adding Support For Vectorcall Calling Convention
====================================================

Vectorcall Calling Convention for x64
----------------------------------------------------
The __vectorcall calling convention specifies that arguments to
functions are to be passed in registers, when possible. __vectorcall
uses more registers for arguments than __fastcall or the default x64
calling convention. The __vectorcall calling convention is only
supported in native code on x86 and x64 processors that include
Streaming SIMD Extensions 2 (SSE2) and above.

The Definition of HVA Types
--------------------------------------
A Homogeneous Vector Aggregate (HVA) type is a composite type of up
to four data members that have identical vector types. An HVA type has
the same alignment requirement as the vector type of its members.

For example:
  typedef struct {
    __m256 x;
    __m256 y;
    __m256 z;
  } hva3; // HVA type with 3 __m256 elements

Vectorcall Extension
----------------------------
Vectorcall extends the standard x64 calling convention while adding
support for HVA and vector types.

There are four main differences:
- Floating-point types are considered vector types just like __m128,
  __m256 and __m512. The first 6 vector typed arguments are passed in
  physical registers XMM0/YMM0/ZMM0 through XMM5/YMM5/ZMM5.
- After vector types and integer types are allocated, HVA types are
  allocated, in ascending order, to unused vector registers
  XMM0/YMM0/ZMM0 to XMM5/YMM5/ZMM5.
- Just like in the default x64 CC, shadow space is allocated for
  vector/HVA types. The size is fixed to 8 bytes per argument.
- HVA types are returned in XMM0/YMM0/ZMM0 to XMM3/YMM3/ZMM3, while
  vector types are returned in XMM0/YMM0/ZMM0 and integers in RAX.

For more information or examples please see also:
https://msdn.microsoft.com/en-us/library/dn375768.aspx

Observations
------------------
- LLVM IR must preserve the original position of the arguments.
- Since HVA structures are allocated in lower priority than vector
  types, the vector types should be allocated first. Hence, one
  pass over the argument list is no longer sufficient, because HVA
  structures are allocated on a second pass.

Issues in Clang
--------------------
Structure Expansion
~~~~~~~~~~~~~~~~~~~
The current clang implementation expands HVA structures into multiple
vector types.

For example:
  C code:          int __vectorcall foo(hva3 a);
  LLVM IR Output:  define x86_vectorcallcc i32 @foo(__m256 %a.0, __m256 %a.1, __m256 %a.2);
  *The example omits the decoration that is added to the function name

Thus the backend can't differentiate between expanded HVA structures and
simple vector types, and doesn't know the original position of each
parameter in the argument list.

We cannot rely on debug information or updated argument names to
identify HVA structures.

HVA Classification
~~~~~~~~~~~~~~~~~~
Clang should determine whether each HVA should be expanded. In other
words, the FE should know if an HVA structure should be passed by value
(by codegen) or passed indirectly.

The current implementation doesn't follow the two-round approach of
vectorcall, in which Clang first goes over the integer and vector
types in the argument list and only after that over the HVA types. As
a result the HVA structures are passed incorrectly.

Proposed Solution
--------------------------
The ABI in LLVM IR must provide the argument position. This information
is needed in order to allocate the correct physical register.

The information can be provided by passing HVA structures by value. This
will replace the existing expansion of the HVA structure arguments.

For example:
  Instead of:          define x86_vectorcallcc i32 @foo(__m256 %a.0, __m256 %a.1, __m256 %a.2);
  Pass the following:  define x86_vectorcallcc i32 @foo(%struct.hva3 %a);

CodeGen needs to know if the structure is an HVA.
There are four possible ways to solve that:

1. CodeGen will analyze the structures, just as clang currently does,
   in order to identify HVA structures.

2. CodeGen can assume that structure arguments passed by value (not
   expanded) are HVA structures.

3. Clang will use an existing attribute that will mark that this HVA
   should be passed in registers.

4. Clang will pass a new attribute that will indicate if this is an HVA
   structure that should be expanded and passed in registers.

I propose to use the third option.
The existing attribute "InReg" has a similar meaning (the argument should
be passed in a register) and is defined to be target specific.

Other reasons why I prefer this option are:
- Avoiding code duplication between clang and codegen
- Avoiding making assumptions that are not necessarily true (for example
  the "long double _Complex" type, which is passed as a structure as
  well) or might be violated in the future
- Avoiding adding new keywords that are not necessary.

In case we encounter a structure passed by value with the InReg flag set,
we can safely assume that this is an HVA.

I will be happy to get your comments or inputs on the vectorcall calling
convention and the suggested solution.

Thanks,
Oren

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
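The two-pass allocation described in the message above can be sketched as a small C++ model. This is only an illustration under stated assumptions, not the Clang or LLVM implementation: the names Kind, Arg, Assignment and assignVectorRegs are hypothetical, a plain vector argument at position i is assumed to take vector register i (the x64 positional rule), an HVA that does not fit entirely in the remaining registers is simply left without registers (standing in for "passed indirectly"), and integer registers, shadow space and the stack are not modelled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical argument kinds for this sketch; these are not LLVM or Clang types.
enum class Kind { Integer, Vector, HVA };

struct Arg {
    Kind kind;
    std::size_t hvaElems; // number of vector members; meaningful only for Kind::HVA
};

// For each argument, the vector register indices it received (0..5 stand for
// XMM0..XMM5). An empty list means the argument received no vector register.
using Assignment = std::vector<std::vector<std::size_t>>;

Assignment assignVectorRegs(const std::vector<Arg> &args) {
    constexpr std::size_t NumVecRegs = 6;
    std::array<bool, NumVecRegs> used{}; // value-initialized: all false
    Assignment out(args.size());

    // Pass 1: plain vector arguments take the register matching their position.
    for (std::size_t i = 0; i < args.size() && i < NumVecRegs; ++i) {
        if (args[i].kind == Kind::Vector) {
            used[i] = true;
            out[i].push_back(i);
        }
    }

    // Pass 2: each HVA takes the lowest unused registers, all or nothing.
    for (std::size_t i = 0; i < args.size(); ++i) {
        if (args[i].kind != Kind::HVA)
            continue;
        assert(args[i].hvaElems >= 1 && args[i].hvaElems <= 4); // HVA: up to four members
        std::vector<std::size_t> freeRegs;
        for (std::size_t r = 0; r < NumVecRegs && freeRegs.size() < args[i].hvaElems; ++r)
            if (!used[r])
                freeRegs.push_back(r);
        if (freeRegs.size() < args[i].hvaElems)
            continue; // not enough registers left: modelled as "no registers"
        for (std::size_t r : freeRegs)
            used[r] = true;
        out[i] = freeRegs;
    }
    return out;
}
```

For an argument list of the shape (int, float, hva4, __m128, int), the model gives the float register 1, the __m128 register 3, and the four HVA members registers 0, 2, 4 and 5. This shows why the backend needs both the original argument positions and a second pass.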
Reid Kleckner via llvm-dev
2016-Nov-30 16:42 UTC
[llvm-dev] RFC: Adding Support For Vectorcall Calling Convention
Don't we already implement this correctly on Windows?

I agree Clang should do the HVA classification. LLVM just doesn't have
the information. Right now, Clang splits HVAs passed in registers, and
passes other structs, or HVAs that don't fit in the available vector
registers, with byval.

Traditionally, Clang has tried very hard to split aggregates passed by
value to make LLVM's job easier. Your proposal undoes a lot of that, but
that seems to be the direction we're going today. See ARM and AArch64,
which pass HVAs as arrays. I think either your suggestion or the array
suggestion is an improvement over the current situation.

One problem with passing the LLVM struct type directly and marking it
inreg is that it might be hard for the backend to figure out what the HVA
element type is. The array convention solves this because the element
type is obvious.
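The point about the array convention can be illustrated with a small C++ sketch of frontend-side classification. It is a toy model, not Clang code: HVAInfo, classifyHVA and asArrayType are hypothetical names, struct members are represented only by the spelling of their LLVM IR type, and scalar floating-point members and nested aggregates are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result of classifying a struct as an HVA: the shared element
// type (spelled as an LLVM IR vector type) and the number of elements.
struct HVAInfo {
    std::string elemType;
    std::size_t count;
};

// Returns true and fills `out` if `members` holds one to four identical
// vector types; returns false otherwise. Only types spelled "<N x T>" are
// treated as vector types in this sketch.
bool classifyHVA(const std::vector<std::string> &members, HVAInfo &out) {
    if (members.empty() || members.size() > 4)
        return false;
    const std::string &first = members.front();
    if (first.size() < 2 || first.front() != '<' || first.back() != '>')
        return false;
    for (const std::string &m : members)
        if (m != first)
            return false;
    out.elemType = first;
    out.count = members.size();
    return true;
}

// Spells the LLVM IR array type that carries the element type and count directly.
std::string asArrayType(const HVAInfo &hva) {
    assert(hva.count >= 1);
    return "[" + std::to_string(hva.count) + " x " + hva.elemType + "]";
}
```

With the hva3 struct from the RFC, whose three __m256 members are each `<8 x float>` in LLVM IR, the sketch produces `[3 x <8 x float>]`. The element type and count are then visible in the type itself, which is the property the array convention relies on.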
Ben Simhon, Oren via llvm-dev
2016-Dec-01 08:19 UTC
[llvm-dev] RFC: Adding Support For Vectorcall Calling Convention
Thanks Reid for your inputs (and code reviews BTW).

The current Vectorcall implementation is incomplete for x64 and x32.
Some of the issues in the current implementation are:
- It doesn’t take into account the original arguments’ position (before
  HVA expansion)
- It doesn’t allocate the HVAs in lower priority (compared to vector
  types and integer types)
- It doesn’t allocate a shadow register in case a vector type is assigned
- It doesn’t allocate shadow stack for the vector types

Whether it is a structure or an array, they both get to the same function
in codegen: ComputeValueVTs. In that function, elements are extracted in
a similar recursive way for both structures and arrays. So I really don’t
see much of a difference between the two approaches.

Thanks again,
Oren
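The claim that structures and arrays end up flattened the same way can be sketched with a toy recursive walk in C++. This is not the actual ComputeValueVTs code; the Type model and the helpers leaf, structOf, arrayOf and flatten are hypothetical and exist only to show that both shapes yield the same list of leaf types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, minimal type model: a leaf, a struct of members, or an array.
struct Type {
    enum Kind { Leaf, Struct, Array };
    Kind kind = Leaf;
    std::string name;        // leaf only: spelling of the type
    std::vector<Type> elems; // struct members, or the single array element type
    std::size_t count = 0;   // array only: number of elements
};

Type leaf(std::string name) {
    Type t;
    t.kind = Type::Leaf;
    t.name = std::move(name);
    return t;
}

Type structOf(std::vector<Type> members) {
    Type t;
    t.kind = Type::Struct;
    t.elems = std::move(members);
    return t;
}

Type arrayOf(Type elem, std::size_t count) {
    Type t;
    t.kind = Type::Array;
    t.elems.push_back(std::move(elem));
    t.count = count;
    return t;
}

// Recursively appends the leaf types of `t` to `out`, in order.
void flatten(const Type &t, std::vector<std::string> &out) {
    switch (t.kind) {
    case Type::Leaf:
        out.push_back(t.name);
        break;
    case Type::Struct:
        for (const Type &m : t.elems)
            flatten(m, out);
        break;
    case Type::Array:
        assert(t.elems.size() == 1); // exactly one element type
        for (std::size_t i = 0; i < t.count; ++i)
            flatten(t.elems.front(), out);
        break;
    }
}

std::vector<std::string> flatten(const Type &t) {
    std::vector<std::string> out;
    flatten(t, out);
    return out;
}
```

In this model a struct of three `<8 x float>` members and an array of three `<8 x float>` elements both flatten to the same three leaves. After flattening, what distinguishes them is the extra information attached to the argument, such as an attribute.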