thr3ads.net - llvm dev - [llvm-dev] [RFC] Half-Precision Support in the Arm Backends [Dec 2017]


Sjoerd Meijer via llvm-dev

2017-Dec-04 14:44 UTC

[llvm-dev] [RFC] Half-Precision Support in the Arm Backends

Hi,


I am working on C/C++ language support for the Armv8.2-A half-precision
instructions. I've added support for _Float16 as a new source language type
to Clang. _Float16 is a C11 extension type for which arithmetic is well
defined, as opposed to e.g. __fp16, which is a storage-only type. I then fixed
up the AArch64 backend, which was mostly straightforward: this involved making
operations on f16 legal when FullFP16 is supported, thus avoiding promotions
to f32. This enables generation of AArch64 FP16 instructions from C/C++. For
AArch64, this work is finished and does not show problems in our testing;
Solid Sands provided us with beta versions of their FP16 extension to
SuperTest, their C/C++ language conformance test suite. However, as more
testing can always be done, and there are not a lot of code bases using
_Float16, I would be interested in more testing/feedback.

This RFC is thus a quick status update on the AArch64 implementation, but is
mainly about the AArch32 implementation in the ARM backend, which is a lot
more interesting than AArch64 for a number of reasons. Most importantly,
there is no soft-float ABI for AArch64 and it has half-precision H-registers,
which is all very different for AArch32. So it's the different combinations
that make things interesting: soft float, softfp (FP support but argument
passing in integer registers), hard float, hard float with FP16, and hard
float with FullFP16.

My AArch32 implementation in the ARM backend is nearly complete and I am
working on fixing a handful of regression tests (the WIP diff can be found
here: https://reviews.llvm.org/D38315). My approach to handle f16 types should
not lead to any codegen differences for existing tests, but the way half types
are handled and legalized is totally different in some cases and from that
point of view the changes could be considered intrusive. Thus, this is a heads
up, and below I will discuss the approach and some implementation decisions,
for which feedback is welcome of course.

Half-Precision RegisterClass
-----------------------------------------

Half-precision values sit in the least-significant 16 bits of the
single-precision registers. I.e., each instruction that generates an FP16
result writes it to the bottom 16 bits of the associated 32-bit floating-point
register, and the top 16 bits of that register are written to 0.

I've added a new HPR half-precision register class. This new HPR register
class is an exact copy of SPR, but it avoids adding f16 and f32 type
information to the existing rules, which would be necessary if we added f16 to
the SPR register class.

Calling Conventions
-----------------------------

For the soft float case:
- half-precision values are returned in the least significant 16 bits of r0,
- half-precision arguments have their size set to 4 bytes, as if each had been
  copied to the least significant bits of a 32-bit register and the remaining
  bits filled with unspecified values.

That's why for CC_ARM_AAPCS, f16 arguments and return values are bitcast to
i16 types. I then had to make some changes to the lowering of the formal
arguments and create an i16 truncate, followed by an f16 bitcast, in order not
to interpret the high 16 bits of the i32 argument values.
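
To make the truncate-then-bitcast step concrete, here is a minimal portable C
sketch of the callee side. It is only a model of the idea, not code from the
patch: the function name half_bits_from_arg is made up for this example, and
the f16 value is represented by its raw 16-bit pattern rather than a real
half type.

```c
#include <stdint.h>

/* Hypothetical model of a callee receiving a half-precision argument in a
   32-bit integer register under the soft-float convention: only the low
   16 bits carry the value, the upper 16 bits are unspecified. */
static uint16_t half_bits_from_arg(uint32_t reg)
{
    /* Truncate to 16 bits first (the "i16 truncate"); the result is then
       treated as an f16 bit pattern (the "f16 bitcast"). The upper half of
       the register never influences the value. */
    return (uint16_t)(reg & 0xFFFFu);
}
```

For instance, 0x00003C00 and 0xDEAD3C00 both yield 0x3C00, the binary16
encoding of 1.0, whatever happens to sit in the top half of the register.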

For the hard float case and CC_ARM_AAPCS_VFP this is straightforward: like
f32 values, f16 values are passed in the S-registers and no further changes
are required.

f16 Type Legalization
------------------------------

The HPR register class and f16 type are added as a legal type when either:
- FullFP16 is enabled, which means support for the Armv8.2-A FP16
  instructions, or
- FP16 is enabled, which means support for the f16 <-> f32 conversion
  instructions, which are a VFP3 extension and part of VFP4.

It's obvious why f16 is legal in the former case, but the latter is perhaps
the more interesting/intrusive change. Making f16 legal when only FP16 is
enabled results in f16 LOADs/STOREs while we don't have instructions for them.
So the approach is to custom lower f16 LOAD/STORE nodes (see 'Custom Lowering'
below).

The reasons to make f16 legal when only FP16 is supported are:
- It avoids very early legalization/combining of f16 arguments to f32 types,
  which would again interpret the higher 16 bits in 32-bit registers, and that
  would be wrong. Instead of trying to undo this early legalization/combining,
  I found this approach easier and cleaner.
- As a consequence, the isel DAGs are in a more 'normal form', i.e. they rely
  less on the funny nodes FP_TO_FP16 and FP16_TO_FP, which are funny because
  they perform float up/down converts and produce i32 values by moving from/to
  integer and float registers. Instead, FP_EXTEND and FP_ROUND nodes will be
  introduced, so this is more of a clean-up than e.g. addressing a correctness
  issue. Unfortunately, I found that I can't completely get rid of FP16_TO_FP
  nodes; see 'Custom Lowering' below.
- When these FP_EXTEND and FP_ROUND nodes are introduced by the legalizer, and
  we don't have the FP16 conversion instructions available, they will be
  custom lowered to EABI calls h2f and f2h.
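
As a rough illustration of what the f16 -> f32 direction has to compute
(FP_EXTEND, or a library call when no conversion instruction is available),
here is a minimal portable C sketch. The function name half_bits_to_float is
made up for this example and is not the runtime library routine; the sketch
assumes IEEE 754 binary16 input and a 32-bit IEEE 754 float.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: widen an IEEE 754 binary16 bit pattern to a binary32
   value. Every binary16 value is exactly representable in binary32, so no
   rounding is involved in this direction. */
static float half_bits_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    float f;

    if (exp == 0x1Fu) {
        /* Infinity or NaN: all-ones exponent, payload moved up. */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man != 0) {
        /* Subnormal half: normalize the mantissa, adjusting the exponent. */
        uint32_t e = 113u; /* biased binary32 exponent of 2^-14 */
        while ((man & 0x400u) == 0) {
            man <<= 1;
            e--;
        }
        bits = sign | (e << 23) | ((man & 0x3FFu) << 13);
    } else {
        bits = sign; /* signed zero */
    }

    /* Reinterpret the assembled bit pattern as a float. */
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The opposite direction (f32 -> f16, FP_ROUND) additionally has to round and
handle overflow and underflow, which is why it is left out of this sketch.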

Custom Lowering
-------------------------

Making f16 legal without having native load/store instructions available
(no FullFP16 support) means custom lowering loads/stores:
1) Since we don't have FP16 load/store instructions available, we create
   integer half-word loads. I unfortunately need the FP16_TO_FP node here,
   because that "models" creating an integer value, which is what we need
   to create a "truncating i16" integer load instruction. Instead of using
   FP16_TO_FP, I have tried BITCASTs, but this can lead to stack
   loads/stores being generated, which I don't want.
2) Custom lowering f16 stores is very similar, and creates truncating
   half-word integer stores.
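
The effect of the two lowerings can be modelled in portable C as below. This
is only a sketch of the memory behaviour, not the DAG lowering itself: the
function names are made up for this example, and the 32-bit value stands in
for an integer register holding the f16 bit pattern.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an f16 load done as an integer half-word load:
   read 16 bits from memory and zero-extend them into a 32-bit register
   image. */
static uint32_t load_half_as_int(const unsigned char *addr)
{
    uint16_t bits;
    memcpy(&bits, addr, sizeof bits);
    return (uint32_t)bits;
}

/* Hypothetical model of an f16 store done as a truncating half-word integer
   store: only the low 16 bits of the register image reach memory. */
static void store_half_as_int(unsigned char *addr, uint32_t reg)
{
    uint16_t bits = (uint16_t)(reg & 0xFFFFu);
    memcpy(addr, &bits, sizeof bits);
}
```

Storing a register image of 0xDEAD3C00 and loading it back gives 0x00003C00,
and the two bytes following the stored half-word are left untouched.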

Relation with __fp16
-----------------------------

Usage of __fp16 results in IR that e.g. loads half-type values as i16 integer
types, and uses "llvm.convert.from.fp16.f32" to promote them to
single-precision values before any data processing is done. Now that f16 is
legal, the i16 loads and converts are combined/legalized into f16 operations,
and FP_EXTENDs and FP_ROUNDs are introduced when necessary, thus avoiding
FP16_TO_FP and FP_TO_FP16. Lowering works as described above. So LOADs and
STOREs are custom lowered, which is perhaps what you expect given that __fp16
is a storage-only type.



IMPORTANT NOTICE: The contents of this email and any attachments are
confidential and may also be privileged. If you are not the intended recipient,
please notify the sender immediately and do not disclose the contents to any
other person, use it for any purpose, or store or copy the information in any
medium. Thank you.

Friedman, Eli via llvm-dev

2017-Dec-04 20:20 UTC


On 12/4/2017 6:44 AM, Sjoerd Meijer via llvm-dev wrote:
>
> Custom Lowering
> -------------------------
>
> Making f16 legal and not having native load/stores instructions available,
> (no FullFP16 support) means custom lowering loads/stores:
> 1) Since we don't have FP16 load/store instructions available, we create
>    integer half-word loads. I unfortunately need the FP16_TO_FP node here,
>    because that "models" creating an integer value, which is what we need
>    to create a "truncating i16" integer load instructions. Instead, of using
>    FP16_TO_FP, I have tried BITCASTs, but this can lead to code generation
>    to stack loads/stores which I don't want.
> 2) Custom lowering f16 stores is very similar, and creates truncating
>    half-word integer stores.

Technically, there are no f16 load/store instructions, yes, but we can
use NEON vld1 and vst1 to get something roughly equivalent, right?

You probably want to custom-lower BITCAST instructions; the generic 
sequence emitted by the legalizer is pretty inefficient in most cases.

---

Overall, I think your approach makes sense.

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project


Sjoerd Meijer via llvm-dev

2017-Dec-06 08:32 UTC


Thanks a lot for the suggestions! I will look into using vld1/vst1, sounds good.

I am custom lowering the bitcasts; that's now the only place where FP_TO_FP16
and FP16_TO_FP nodes are created, to avoid inefficient code generation. I will
double check whether I can achieve the same without using these nodes (because
I really would like to get rid of them completely).


Cheers,

Sjoerd.


