thr3ads.net - llvm dev - [llvm-dev] RFC: Adding Support For Vectorcall Calling Convention [Nov 2016]


Ben Simhon, Oren via llvm-dev

2016-Nov-30 15:20 UTC

[llvm-dev] RFC: Adding Support For Vectorcall Calling Convention

Adding Support For Vectorcall Calling Convention
====================================================
Vectorcall Calling Convention for x64
----------------------------------------------------
The __vectorcall calling convention specifies that arguments to
functions are to be passed in registers, when possible. __vectorcall
uses more registers for arguments than __fastcall or the default x64
calling convention use. The __vectorcall calling convention is only
supported in native code on x86 and x64 processors that include
Streaming SIMD Extensions 2 (SSE2) and above.

The Definition of HVA Types
--------------------------------------
A Homogeneous Vector Aggregate (HVA) type is a composite type of up
to four data members that have identical vector types. An HVA type has
the same alignment requirement as the vector type of its members.

For example:
    typedef struct {
        __m256 x;
        __m256 y;
        __m256 z;
    } hva3; // HVA type with 3 __m256 elements
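
The definition can be sketched as a small check in C. This is only an
illustration of the rule as stated above, not Clang's classification code:
the MemberType enum and the is_hva helper are hypothetical names, only the
__m128/__m256/__m512 types are counted as vector members, and requiring at
least one member is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of member types; not a Clang or LLVM data structure. */
typedef enum { TY_INT, TY_FLOAT, TY_M128, TY_M256, TY_M512 } MemberType;

enum { MAX_HVA_MEMBERS = 4 };

/* Assumption: only the __m128/__m256/__m512 family counts here. */
static int is_vector_type(MemberType t) {
    return t == TY_M128 || t == TY_M256 || t == TY_M512;
}

/* Returns 1 if a struct with these members matches the HVA definition:
   at least one and at most four members, all of the same vector type. */
static int is_hva(const MemberType *members, size_t count) {
    assert(members != NULL || count == 0);
    if (count == 0 || count > MAX_HVA_MEMBERS)
        return 0;
    if (!is_vector_type(members[0]))
        return 0;
    for (size_t i = 1; i < count; i++) {
        if (members[i] != members[0])
            return 0;
    }
    return 1;
}
```

With this model, the three __m256 members of hva3 classify as an HVA, while
a struct mixing __m128 and __m256, or one with five members, does not.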

Vectorcall Extension
----------------------------
Vectorcall extends the standard x64 calling convention while adding
support for HVA and vector types.

There are four main differences:
-  Floating-point types are considered vector types, just like __m128,
       __m256 and __m512. The first 6 vector-typed arguments are
       passed in physical registers XMM0/YMM0/ZMM0 through XMM5/YMM5/ZMM5.
-  After vector types and integer types are allocated, HVA types are
       allocated, in ascending order, to unused vector registers
       XMM0/YMM0/ZMM0 to XMM5/YMM5/ZMM5.
-  Just like in the default x64 CC, shadow space is allocated for
       vector/HVA types. The size is fixed at 8 bytes per argument.
-  HVA types are returned in XMM0/YMM0/ZMM0 to XMM3/YMM3/ZMM3, while
       vector types are returned in XMM0/YMM0/ZMM0 and integers in RAX.
For more information or examples please see also:
https://msdn.microsoft.com/en-us/library/dn375768.aspx

Observations
------------------
-  LLVM IR must preserve the original position of the arguments.
-  Since HVA structures are allocated at a lower priority than vector
       types, the vector types must be allocated first. Hence a single
       pass over the argument list is no longer sufficient: HVA
       structures are allocated in a second pass.
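
The two passes can be sketched as a small simulation in C. This is a model
under stated assumptions, not the Clang or LLVM implementation: the Arg
struct and assign_vectorcall are hypothetical names, registers are chosen by
argument position (my reading of the Microsoft documentation linked above),
and an HVA that does not fit entirely in the free vector registers is simply
marked as passed in memory.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_VEC_REGS = 6, NUM_INT_REGS = 4, MAX_HVA_ELEMS = 4 };

typedef enum { ARG_INT, ARG_VECTOR, ARG_HVA } ArgKind;

/* Hypothetical argument record used only by this sketch. */
typedef struct {
    ArgKind kind;
    int hva_elems;               /* number of members, ARG_HVA only */
    int int_reg;                 /* 0..3 (RCX, RDX, R8, R9) or -1 */
    int vec_regs[MAX_HVA_ELEMS]; /* vector register numbers, -1 if unused */
    int in_memory;               /* 1 if not passed in registers */
} Arg;

static void assign_vectorcall(Arg *args, int n) {
    int vec_used[NUM_VEC_REGS] = {0};
    assert(args != NULL || n == 0);

    /* Pass 1: integer and vector arguments, by argument position. */
    for (int i = 0; i < n; i++) {
        Arg *a = &args[i];
        a->int_reg = -1;
        a->in_memory = 0;
        for (int k = 0; k < MAX_HVA_ELEMS; k++)
            a->vec_regs[k] = -1;
        if (a->kind == ARG_INT) {
            if (i < NUM_INT_REGS) a->int_reg = i;
            else a->in_memory = 1;
        } else if (a->kind == ARG_VECTOR) {
            if (i < NUM_VEC_REGS) { a->vec_regs[0] = i; vec_used[i] = 1; }
            else a->in_memory = 1;
        }
    }

    /* Pass 2: each HVA takes the lowest free vector registers, all or nothing. */
    for (int i = 0; i < n; i++) {
        Arg *a = &args[i];
        if (a->kind != ARG_HVA)
            continue;
        assert(a->hva_elems >= 1 && a->hva_elems <= MAX_HVA_ELEMS);
        int free_count = 0;
        for (int r = 0; r < NUM_VEC_REGS; r++)
            free_count += !vec_used[r];
        if (free_count < a->hva_elems) { a->in_memory = 1; continue; }
        int k = 0;
        for (int r = 0; r < NUM_VEC_REGS && k < a->hva_elems; r++) {
            if (!vec_used[r]) { vec_used[r] = 1; a->vec_regs[k++] = r; }
        }
    }
}
```

For a signature shaped like float f(int a, float b, hva4 c, __m128 d, int e),
this model gives c the vector registers 0, 2, 4 and 5, because b and d have
already claimed registers 1 and 3 in the first pass. That discontiguous
result is exactly what a backend cannot reproduce once the HVA has been
expanded into anonymous vector arguments.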

Issues in Clang
--------------------
Structure Expansion
~~~~~~~~~~~~~~~~~~~
The current clang implementation expands HVA structures into multiple
vector types.

For example:
C code: int __vectorcall foo(hva3 a);
LLVM IR output (with __m256 written as its IR type, <8 x float>):
    define x86_vectorcallcc i32 @foo(<8 x float> %a.0, <8 x float> %a.1, <8 x float> %a.2)
*The example omits the decoration that is added to the function name

Thus the backend can't differentiate between expanded HVA structures and
simple vector types, and doesn't know the original position of each
parameter in the argument list.

We cannot rely on debug information or updated argument names to
identify HVA structures.

HVA Classification
~~~~~~~~~~~~~~~~~~
Clang should determine whether each HVA should be expanded. In other words,
the FE should know whether an HVA structure should be passed by value (and
handled by codegen) or passed indirectly.

The current implementation doesn't follow vectorcall's two-pass handling of
the argument list, in which Clang first goes over the integer and vector
types and only afterwards over the HVA types. As a result, HVA
structures are passed incorrectly.

Proposed Solution
--------------------------
The ABI in LLVM IR must provide the argument position. This information is
needed in order to allocate the correct physical register.

The position can be preserved by passing HVA structures by value. This
will replace the existing expansion of the HVA structure arguments.

For example:
Instead of:
    define x86_vectorcallcc i32 @foo(<8 x float> %a.0, <8 x float> %a.1, <8 x float> %a.2)
Pass the following:
    define x86_vectorcallcc i32 @foo(%struct.hva3 %a)

CodeGen needs to know if the structure is an HVA.
There are four possible ways to solve that:

1. CodeGen will analyze the structures, just as clang currently does,
   in order to identify HVA structures

2. CodeGen can assume that structure arguments passed by value (not
   expanded) are HVA structures

3. Clang will use an existing attribute to mark that this HVA
   should be passed in registers.

4. Clang will pass a new attribute that indicates whether this is an HVA
   structure that should be expanded and passed in registers

I propose to use the third option.
The existing "InReg" attribute has a similar meaning (the argument should be
passed in a register) and is defined to be target specific.

Other reasons why I prefer this option are:
- Avoiding code duplication between clang and codegen
- Avoiding making assumptions that are not necessarily true (for example
"long double _Complex" type that is passed by structure as well) or
might be violated in the future
- Avoiding adding new keywords that are not necessary.

In case we encounter a structure passed by value with an InReg flag set,
we can surely assume that this is an HVA.

I will be happy to get your comments or inputs on the vectorcall calling
convention and the suggested solution.

Thanks,
Oren

---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Reid Kleckner via llvm-dev

2016-Nov-30 16:42 UTC


[llvm-dev] RFC: Adding Support For Vectorcall Calling Convention

Don't we already implement this correctly on Windows?

I agree Clang should do the HVA classification. LLVM just doesn't have the
information. Right now, Clang splits HVAs passed in registers and passes
other structs or HVAs that don't fit in the available vector registers with
byval.

Traditionally, Clang has tried very hard to split aggregates passed by
value to make LLVM's job easier. Your proposal undoes a lot of that, but
that seems to be the direction we're going today. See ARM and AArch64,
which pass HVAs as arrays.

I think either your suggestion or the array suggestion is an improvement
over the current situation. One problem with passing the LLVM struct type
directly and marking it inreg is that it might be hard for the backend to
figure out what the HVA element type is. The array convention solves this
because the element type is obvious.


Ben Simhon, Oren via llvm-dev

2016-Dec-01 08:19 UTC


[llvm-dev] RFC: Adding Support For Vectorcall Calling Convention

Thanks Reid for your inputs (and code reviews BTW).

The current Vectorcall implementation is incomplete for x64 and x32.
Some of the issues in the current implementation are:

-  It doesn't take into account the original arguments' position (before
       HVA expansion)
-  It doesn't allocate the HVAs at a lower priority (compared to vector
       types and integer types)
-  It doesn't allocate a shadow register when a vector type is assigned
-  It doesn't allocate shadow stack space for the vector types

Whether it is a structure or an array, both reach the same function in
codegen: ComputeValueVTs.
In that function, the elements are extracted in a similar recursive way for
both structures and arrays.

So I really don't see much of a difference between the two approaches.
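
To illustrate the point, here is a minimal sketch in C of that kind of
recursive extraction. It is not the code of ComputeValueVTs: the Type model
and the flatten helper are hypothetical, and leaf value types are
represented by a plain integer id.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { T_LEAF, T_STRUCT, T_ARRAY } TypeKind;

/* Hypothetical type tree used only by this sketch. */
typedef struct Type {
    TypeKind kind;
    int leaf_id;                      /* T_LEAF: identifies the value type */
    const struct Type *const *fields; /* T_STRUCT: member types */
    size_t num_fields;
    const struct Type *elem;          /* T_ARRAY: element type */
    size_t num_elems;
} Type;

/* Appends the leaf value types of t to out[n..] and returns the new count. */
static size_t flatten(const Type *t, int *out, size_t cap, size_t n) {
    assert(t != NULL && out != NULL);
    switch (t->kind) {
    case T_LEAF:
        assert(n < cap);
        out[n++] = t->leaf_id;
        break;
    case T_STRUCT:
        for (size_t i = 0; i < t->num_fields; i++)
            n = flatten(t->fields[i], out, cap, n);
        break;
    case T_ARRAY:
        for (size_t i = 0; i < t->num_elems; i++)
            n = flatten(t->elem, out, cap, n);
        break;
    }
    return n;
}
```

In this model, a struct with three members of one vector type and an array
of three elements of that same type flatten to the same list of three leaf
types, which is why the two representations look alike at that point in
codegen.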

Thanks again,
Oren

