thr3ads.net - llvm dev - [llvm-dev] RFC: A proposal for vectorizing loops with calls to math functions using SVML [Apr 2016]

Masten, Matt via llvm-dev

2016-Apr-01 00:20 UTC

[llvm-dev] RFC: A proposal for vectorizing loops with calls to math functions using SVML

RFC: A proposal for vectorizing loops with calls to math functions using SVML
(short vector math library).

========
Overview
========
Very simply, SVML (Intel short vector math library) functions are vector
variants of scalar math functions that take vector arguments, apply an
operation to each element, and store the result in a vector register. These
vector variants can be generated by the compiler, based on precision
requirements specified by the user, resulting in substantial performance gains.
This is an initial proposal to add a new LLVM IR transformation pass that will
translate scalar math calls to svml calls with the help of the loop vectorizer.

===================
Problem Description
===================
Currently, without the "#pragma clang loop vectorize(enable)", the loop
vectorizer will not vectorize loops with math calls due to cost model reasons.
Additionally, when the loop pragma is used, the loop vectorizer will widen the
math call using an intrinsic, but the resulting code is inefficient because the
intrinsic is replaced with scalarized function calls. Please see the example
below for a simple loop containing a sinf call. For demonstration purposes, the
example was compiled for an xmm target, thus VF = 4 given the float type.

Example: sinf.c

#define N 1000

#pragma clang loop vectorize(enable)
for (i = 0; i < N; i++) {
  array[i] = sinf((float)i);
}

Without the loop pragma the loop vectorizer's cost model rejects the loop.

clang -c -ffast-math -O2 -Rpass-analysis=loop-vectorize
-Rpass-missed=loop-vectorize sinf.c

sinf.c:19:3: remark: the cost-model indicates that vectorization is not
beneficial [-Rpass-analysis=loop-vectorize]
  for (i = 0; i < N; i++) {
  ^
sinf.c:19:3: remark: the cost-model indicates that interleaving is not
beneficial and is explicitly disabled or interleave count is set to 1
[-Rpass-analysis=loop-vectorize]

When the loop pragma is used, the loop is vectorized and the call to
@llvm.sin.v4f32 is generated, but the call is later scalarized with the
additional overhead of unpacking the scalar function arguments from a vector.
This can be seen from inspection of the resulting assembly code just below the
LLVM IR.

vector.body:                                ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !6
  %0 = trunc i64 %index to i32, !dbg !7
  %broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0,
    !dbg !7
  %broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
    <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
  %induction8 = add <4 x i32> %broadcast.splat7,
    <i32 0, i32 1, i32 2, i32 3>, !dbg !7
  %1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
  %2 = call <4 x float> @llvm.sin.v4f32(<4 x float> %1), !dbg !8
  %3 = getelementptr inbounds float, float* %array, i64 %index, !dbg !9
  %4 = bitcast float* %3 to <4 x float>*, !dbg !10
  store <4 x float> %2, <4 x float>* %4, align 4, !dbg !10,
    !tbaa !11
  %index.next = add i64 %index, 4, !dbg !6
  %5 = icmp eq i64 %index.next, 1000, !dbg !6
  br i1 %5, label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15


.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movd    %ebx, %xmm0
        pshufd  $0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0]
        paddd   .LCPI0_0(%rip), %xmm0
        cvtdq2ps        %xmm0, %xmm0
        movaps  %xmm0, 16(%rsp)         # 16-byte Spill
        shufps  $231, %xmm0, %xmm0      # xmm0 = xmm0[3,1,2,3]
        callq   sinf
        movaps  %xmm0, (%rsp)           # 16-byte Spill
        movaps  16(%rsp), %xmm0         # 16-byte Reload
        shufps  $229, %xmm0, %xmm0      # xmm0 = xmm0[1,1,2,3]
        callq   sinf
        unpcklps        (%rsp), %xmm0   # 16-byte Folded Reload
                                        # xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
        movaps  %xmm0, (%rsp)           # 16-byte Spill
        movaps  16(%rsp), %xmm0         # 16-byte Reload
        callq   sinf
        movaps  %xmm0, 32(%rsp)         # 16-byte Spill
        movapd  16(%rsp), %xmm0         # 16-byte Reload
        shufpd  $1, %xmm0, %xmm0        # xmm0 = xmm0[1,0]
        callq   sinf
        movaps  32(%rsp), %xmm1         # 16-byte Reload
        unpcklps        %xmm0, %xmm1    # xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
        unpcklps        (%rsp), %xmm1   # 16-byte Folded Reload
                                        # xmm1 = xmm1[0],mem[0],xmm1[1],mem[1]
        movups  %xmm1, (%r14,%rbx,4)
        addq    $4, %rbx
        cmpq    $1000, %rbx             # imm = 0x3E8
        jne     .LBB0_1
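
The assembly above can be read as the following lane-by-lane pattern. This is
only a rough C sketch of what scalarization amounts to, not compiler output:
the vec4f type and the scalarize_v4f32 and demo_halve names are hypothetical,
and demo_halve merely stands in for sinf so that the sketch does not depend on
libm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-lane type standing in for <4 x float>. */
typedef struct { float lane[4]; } vec4f;

/* Any scalar math routine with the shape of sinf. */
typedef float (*scalar_fn)(float);

/* Scalarized form of a 4-wide call: one scalar call per lane. */
vec4f scalarize_v4f32(scalar_fn f, vec4f x) {
    assert(f != NULL);
    vec4f r;
    for (size_t i = 0; i < 4; ++i) {
        float s = x.lane[i];  /* unpack one lane (shufps/reload in the asm) */
        r.lane[i] = f(s);     /* one scalar call per lane (callq sinf) */
    }
    return r;                 /* repacked result (unpcklps in the asm) */
}

/* Stand-in scalar function for illustration; an assumption, not sinf. */
static float demo_halve(float v) { return v * 0.5f; }
```

Passing sinf as the scalar_fn argument would reproduce the four calls per
vector iteration seen in the listing.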

==========================
Proposed New Functionality
==========================
In order to take advantage of the performance benefits of the svml library, the
proposed solution is to introduce a new LLVM IR pass that is capable of
translating the vector math intrinsics to svml calls. As an example, the LLVM IR
above for "vector.body", introduced in the Problem Description section, would
serve as input to the proposed pass and be transformed into the following LLVM
IR. Special attention should be paid to the "__svml_sinf4_ha" call in the LLVM
IR and resulting assembly code snippet.

vector.body:                                   ; preds = %vector.body, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ], !dbg !6
  %0 = trunc i64 %index to i32, !dbg !7
  %broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0,
    !dbg !7
  %broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
    <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
  %induction8 = add <4 x i32> %broadcast.splat7,
    <i32 0, i32 1, i32 2, i32 3>, !dbg !7
  %1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
  %vcall = call <4 x float> @__svml_sinf4_ha(<4 x float> %1)
  %2 = getelementptr inbounds float, float* %array, i64 %index, !dbg !8
  %3 = bitcast float* %2 to <4 x float>*, !dbg !9
  store <4 x float> %vcall, <4 x float>* %3, align 4, !dbg !9,
    !tbaa !10
  %index.next = add i64 %index, 4, !dbg !6
  %4 = icmp eq i64 %index.next, 1000, !dbg !6
  br i1 %4, label %for.end, label %vector.body, !dbg !6, !llvm.loop !14

The resulting assembly would appear as:

.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movd    %ebx, %xmm0
        pshufd  $0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0]
        paddd   .LCPI0_0(%rip), %xmm0
        cvtdq2ps        %xmm0, %xmm0
        callq   __svml_sinf4_ha
        movups  %xmm0, (%r14,%rbx,4)
        addq    $4, %rbx
        cmpq    $1000, %rbx             # imm = 0x3E8
        jne     .LBB0_1
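
For comparison with the scalarized form, the shape of the transformed loop can
be sketched in C as below. This is an illustration under assumptions: vec4f,
vector_fn, fill_vectorized and demo_double_v4 are hypothetical names, and
demo_double_v4 is a placeholder for a real vector routine such as
__svml_sinf4_ha, which is not linked here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-lane type standing in for <4 x float>. */
typedef struct { float lane[4]; } vec4f;

/* A vector math routine: one call handles all four lanes. */
typedef vec4f (*vector_fn)(vec4f);

/* Mirrors vector.body: build the induction vector, convert it to float,
   make a single vector call, store four results. Returns the call count. */
size_t fill_vectorized(float *array, size_t n, vector_fn vf) {
    assert(array != NULL && vf != NULL);
    assert(n % 4 == 0);  /* the RFC example uses N = 1000, a multiple of 4 */
    size_t calls = 0;
    for (size_t index = 0; index < n; index += 4) {
        vec4f arg;
        for (size_t l = 0; l < 4; ++l)
            arg.lane[l] = (float)(int)(index + l);  /* trunc + sitofp */
        vec4f res = vf(arg);                        /* one call per 4 lanes */
        ++calls;
        for (size_t l = 0; l < 4; ++l)
            array[index + l] = res.lane[l];         /* vector store */
    }
    return calls;
}

/* Placeholder vector routine for illustration; doubles each lane. */
static vec4f demo_double_v4(vec4f x) {
    for (size_t l = 0; l < 4; ++l)
        x.lane[l] *= 2.0f;
    return x;
}
```

With n elements the loop makes n/4 calls, where the scalarized version makes n.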

In order to perform the translation, several requirements must be met to guide
code generation. Those include:

1) In addition to the -ffast-math flag, support is needed from clang to allow
   the user to be able to specify the desired precision requirements. The
   additional flags needed include the following, where "imf" is shorthand for
   "Intel math function".

   -fimf-absolute-error=value[:funclist]
          define the maximum allowable absolute error for math library
          function results
            value    - a positive, floating-point number conforming to the
                       format [digits][.digits][{e|E}[sign]digits]
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-accuracy-bits=bits[:funclist]
          define the relative error, measured by the number of correct bits,
          for math library function results
            bits     - a positive, floating-point number
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-arch-consistency=value[:funclist]
          ensures that the math library functions produce consistent results
          across different implementations of the same architecture
            value    - true or false
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-max-error=ulps[:funclist]
          defines the maximum allowable relative error, measured in ulps, for
          math library function results
            ulps     - a positive, floating-point number conforming to the
                       format [digits][.digits][{e|E}[sign]digits]
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-precision=value[:funclist]
          defines the accuracy (precision) for math library functions
            value    - defined as one of the following values
                       high   - equivalent to max-error = 0.6
                       medium - equivalent to max-error = 4
                       low    - equivalent to accuracy-bits = 11 (single
                                precision); accuracy-bits = 26 (double
                                precision)
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-domain-exclusion=classlist[:funclist]
          indicates the input arguments domain on which math functions
          must provide correct results.
           classlist - defined as one of the following values
                         nans, infinities, denormals, zeros
                         all, none, common
           funclist - optional list of one or more math library
                      functions to which the attribute should be applied.

Information from the flags can then be encoded as function attributes at each
call site. In the future, this functionality will enable more fine-grained
control over specifying precision for individual calls/regions, instead of
setting the precision requirements for all call instances of a function. Please
note that the example translation presented so far does not have the IMF
attributes attached to the @llvm.sin.v4f32 call, and as a result the default is
set to an svml variant marked with "_ha" (max-error = 0.6), which is short for
high accuracy. Other supported variants will include low precision, enhanced
performance, bitwise reproducible, and correctly rounded. Please refer to the
IEEE-754 standard for additional information regarding supported precisions.
The compiler will select the most appropriate variant based on the IMF
attributes. See #2.

2) An interface to query for the appropriate svml function variant based on the
   scalar function name and IMF attributes.

3) For calls to math functions that store to memory (e.g., sincos), additional
   analysis of the pointer arguments is beneficial in order to generate the best
   performing load/store instructions.
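
As a rough sketch of the query interface in #2, the lookup could be modeled as
a function from (scalar name, VF, precision) to a variant name. Everything here
is hypothetical except the name "__svml_sinf4_ha" and the "_ha" suffix, which
come from the example above: the imf_precision enum, the "_la" and "_ep"
suffixes, and the assumption that every name follows the pattern
__svml_<scalar><VF><suffix> are for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical encoding of the -fimf-precision values. */
typedef enum { IMF_HIGH, IMF_MEDIUM, IMF_LOW } imf_precision;

/* "_ha" appears in the RFC; "_la" and "_ep" are assumed suffixes. */
static const char *variant_suffix(imf_precision p) {
    switch (p) {
    case IMF_HIGH:   return "_ha";
    case IMF_MEDIUM: return "_la";
    case IMF_LOW:    return "_ep";
    }
    return "_ha";  /* default to high accuracy, as the RFC does */
}

/* The six functions named in the RFC, single and double precision. */
static int is_supported_scalar(const char *name) {
    static const char *const supported[] = {
        "sin",  "cos",  "log",  "pow",  "exp",  "sincos",
        "sinf", "cosf", "logf", "powf", "expf", "sincosf"
    };
    for (size_t i = 0; i < sizeof supported / sizeof supported[0]; ++i)
        if (strcmp(name, supported[i]) == 0)
            return 1;
    return 0;
}

/* Writes the variant name into out. Returns 0 on success, -1 if the
   scalar function is unsupported or the buffer is too small. */
int svml_variant_name(const char *scalar, unsigned vf, imf_precision p,
                      char *out, size_t out_size) {
    assert(scalar != NULL && out != NULL);
    if (vf == 0 || !is_supported_scalar(scalar))
        return -1;
    int n = snprintf(out, out_size, "__svml_%s%u%s",
                     scalar, vf, variant_suffix(p));
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

For example, svml_variant_name("sinf", 4, IMF_HIGH, buf, sizeof buf) fills buf
with "__svml_sinf4_ha", while an unsupported name such as "tanf" yields -1 so
the caller can fall back to scalarization.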

=====================
GCC/ICC compatibility
=====================
The initial implementation will involve the translation of 6 svml functions,
which include sin, cos, log, pow, exp, and sincos (both single and double
precision variants). Support for these functions matches the current
capabilities of GCC and a subset of ICC. As more functions become open-sourced,
the plan is to support them as part of the final solution determined from this
proposal. The flags referenced in the Proposed New Functionality section are
required to maintain icc compatibility.

======================
Current Implementation
======================
To evaluate the feasibility of this proposal, a prototype transform pass has
been developed, which performs the following:

1) Searches for vector math intrinsics as candidates for translation to svml.

2) Reads function attributes to obtain precision requirements for the call. If
   none, default to attributes that will force the selection of a high accuracy
   variant.

3) Since the vectorization factor of the intrinsic can be wider than what is
   legally supported by the target, type legalization is performed so that the
   correct svml variant is selected. For example, if a call to
   @llvm.sin.v8f32(<8 x float> %1) is made for an xmm target, the pass will
   generate two __svml_sinf4 calls and will do the appropriate splitting of %1
   to create the new arguments for each call. In addition, the multiple return
   vectors are recombined and users of the original return vector are updated.
   The pass is also capable of handling less than full vector cases, e.g.,
   @llvm.sin.v2f32.

4) Special handling for sincos since the results are stored to a double wide
   vector and additional analysis is needed to optimize the stores to memory.
   Additional shuffling is required to obtain the sin and cos results from
   the double wide vector.

5) Vector intrinsics that are not translated to svml are scalarized.

6) The loop vectorizer has been taught to allow widening of sincos and
   additional utilities have been written to analyze arguments for sincos.
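
The type legalization in step 3 can be sketched in C as follows. This is an
assumption-laden illustration, not the prototype's code: vec4f, vector_fn,
legalize_call and demo_negate_v4 are hypothetical names, arrays stand in for IR
vectors, and padding a partial vector with zeros is a choice made here for
simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-lane type: the widest legal float vector on xmm. */
typedef struct { float lane[4]; } vec4f;
typedef vec4f (*vector_fn)(vec4f);

/* Applies a 4-wide routine to an n-lane input by splitting it into
   4-lane pieces and recombining the results. A piece with fewer than
   4 live lanes is padded and only its live lanes are written back.
   Returns the number of 4-wide calls made. */
size_t legalize_call(vector_fn vf, const float *in, float *out, size_t n) {
    assert(vf != NULL && in != NULL && out != NULL);
    size_t calls = 0;
    for (size_t base = 0; base < n; base += 4) {
        size_t live = (n - base < 4) ? n - base : 4;
        vec4f arg;
        for (size_t l = 0; l < 4; ++l)
            arg.lane[l] = (l < live) ? in[base + l] : 0.0f;  /* split/pad */
        vec4f res = vf(arg);
        ++calls;
        for (size_t l = 0; l < live; ++l)
            out[base + l] = res.lane[l];                     /* recombine */
    }
    return calls;
}

/* Placeholder for a 4-wide routine such as __svml_sinf4; negates lanes. */
static vec4f demo_negate_v4(vec4f x) {
    for (size_t l = 0; l < 4; ++l)
        x.lane[l] = -x.lane[l];
    return x;
}
```

An 8-lane input produces two 4-wide calls, matching the v8f32 example, and a
2-lane input produces one call whose upper lanes are ignored.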

========
Feedback
========
For those who are interested in this topic, I would like to ask for your review
of this proposal and would definitely appreciate any/all feedback on the 
proposed approach. Help is also very welcome and much appreciated in the
development process.

Sanjay Patel via llvm-dev

2016-Apr-04 17:57 UTC

[llvm-dev] RFC: A proposal for vectorizing loops with calls to math functions using SVML

Hi Matt -

Are you using the same TLI hook as Darwin's Accelerate framework:
addVectorizableFunctionsFromVecLib()? If not, why not?


Masten, Matt via llvm-dev

2016-Apr-04 23:39 UTC

[llvm-dev] RFC: A proposal for vectorizing loops with calls to math functions using SVML

Hi Sanjay,

For sincos calls, I’m currently just going through isTriviallyVectorizable(),
which was good enough to get things working so that I could test the
translation. I don’t see why this cannot be changed to use
addVectorizableFunctionsFromVecLib(). The other functions that I’m working with
are already vectorized using the loop pragma. Those include sin, cos, exp, log,
and pow.
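
To make the idea of a vector-library mapping concrete, here is a minimal
sketch in plain C of a table from (scalar name, VF) to a vector function name.
It is not the LLVM API: vec_mapping, svml_table and lookup_vector_name are
hypothetical names, and every table entry other than "__svml_sinf4_ha" is
assumed by analogy with the name used in the RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row: a scalar function, its vector variant, and the VF it covers. */
typedef struct {
    const char *scalar_name;
    const char *vector_name;
    unsigned vf;
} vec_mapping;

/* Only the sinf row is taken from the RFC; the others are assumptions. */
static const vec_mapping svml_table[] = {
    {"sinf", "__svml_sinf4_ha", 4},
    {"cosf", "__svml_cosf4_ha", 4},
    {"expf", "__svml_expf4_ha", 4},
};

/* Returns the vector variant for (scalar, vf), or NULL if none exists,
   in which case the caller would leave the call scalar. */
const char *lookup_vector_name(const char *scalar, unsigned vf) {
    assert(scalar != NULL);
    for (size_t i = 0; i < sizeof svml_table / sizeof svml_table[0]; ++i)
        if (svml_table[i].vf == vf &&
            strcmp(svml_table[i].scalar_name, scalar) == 0)
            return svml_table[i].vector_name;
    return NULL;
}
```

A table of this kind lets the vectorizer ask whether a given call is
vectorizable at a given VF before widening it, rather than discovering the
answer after the intrinsic has been emitted.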

From: Sanjay Patel [mailto:spatel at rotateright.com]
Sent: Monday, April 04, 2016 10:57 AM
To: Masten, Matt
Cc: llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] RFC: A proposal for vectorizing loops with calls to math
functions using SVML

Hi Matt -
Are you using the same TLI hook as Darwin's Accelerate framework:
addVectorizableFunctionsFromVecLib()? If not, why not?

On Thu, Mar 31, 2016 at 6:20 PM, Masten, Matt via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
RFC: A proposal for vectorizing loops with calls to math functions using SVML
(short
vector math library).

========Overview
========
Very simply, SVML (Intel short vector math library) functions are vector
variants of
scalar math functions that take vector arguments, apply an operation to each
element, and store the result in a vector register. These vector variants can be
generated by the compiler, based on precision requirements specified by the
user, resulting in substantial performance gains. This is an initial proposal to
add a new LLVM IR transformation pass that will translate scalar math calls to
svml calls with the help of the loop vectorizer.

===================Problem Description
===================
Currently, without the "#pragma clang loop vectorize(enable)", the
loop
vectorizer will not vectorize loops with math calls due to cost model reasons.
Additionally, When the loop pragma is used, the loop vectorizer will widen the
math call using an intrinsic, but the resulting code is inefficient because the
intrinsic is replaced with scalarized function calls. Please see the example
below for a simple loop containing a sinf call. For demonstration purposes, the
example was compiled for an xmm target, thus VF = 4 given the float type.

Example: sinf.c

#define N 1000

#pragma clang loop vectorize(enable)
for (i = 0; i < N; i++) {
  array[i] = sinf((float)i);
}

Without the loop pragma the loop vectorizer's cost model rejects the loop.

clang -c -ffast-math -O2 -Rpass-analysis=loop-vectorize
-Rpass-missed=loop-vectorize sinf.c

sinf.c:19:3: remark: the cost-model indicates that vectorization is not
beneficial [-Rpass-analysis=loop-vectorize]
  for (i = 0; i < N; i++) {
  ^
sinf.c:19:3: remark: the cost-model indicates that interleaving is not
beneficial and is explicitly disabled or interleave count is set to 1
[-Rpass-analysis=loop-vectorize]

When the loop pragma is used, the loop is vectorized and the call to
@llvm.sin.v4f32 is generated, but the call is later scalarized with the
additional overhead of unpacking the scalar function arguments from a vector.
This can be seen from inspection of the resulting assembly code just below the
LLVM IR.

vector.body:                                ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ], !dbg !6
  %0 = trunc i64 %index to i32, !dbg !7
  %broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0,
    !dbg !7
  %broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
    <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
  %induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>,
    !dbg !7
  %1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
  %2 = call <4 x float> @llvm.sin.v4f32(<4 x float> %1), !dbg !8
  %3 = getelementptr inbounds float, float* %array, i64 %index, !dbg !9
  %4 = bitcast float* %3 to <4 x float>*, !dbg !10
  store <4 x float> %2, <4 x float>* %4, align 4, !dbg !10, !tbaa !11
  %index.next = add i64 %index, 4, !dbg !6
  %5 = icmp eq i64 %index.next, 1000, !dbg !6
  br i1 %5, label %middle.block, label %vector.body, !dbg !6, !llvm.loop !15


.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movd    %ebx, %xmm0
        pshufd  $0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0]
        paddd   .LCPI0_0(%rip), %xmm0
        cvtdq2ps        %xmm0, %xmm0
        movaps  %xmm0, 16(%rsp)         # 16-byte Spill
        shufps  $231, %xmm0, %xmm0      # xmm0 = xmm0[3,1,2,3]
        callq   sinf
        movaps  %xmm0, (%rsp)           # 16-byte Spill
        movaps  16(%rsp), %xmm0         # 16-byte Reload
        shufps  $229, %xmm0, %xmm0      # xmm0 = xmm0[1,1,2,3]
        callq   sinf
        unpcklps        (%rsp), %xmm0   # 16-byte Folded Reload
                                        # xmm0 = xmm0[0],mem[0],xmm0[1],mem[1]
        movaps  %xmm0, (%rsp)           # 16-byte Spill
        movaps  16(%rsp), %xmm0         # 16-byte Reload
        callq   sinf
        movaps  %xmm0, 32(%rsp)         # 16-byte Spill
        movapd  16(%rsp), %xmm0         # 16-byte Reload
        shufpd  $1, %xmm0, %xmm0        # xmm0 = xmm0[1,0]
        callq   sinf
        movaps  32(%rsp), %xmm1         # 16-byte Reload
        unpcklps        %xmm0, %xmm1    # xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
        unpcklps        (%rsp), %xmm1   # 16-byte Folded Reload
                                        # xmm1 = xmm1[0],mem[0],xmm1[1],mem[1]
        movups  %xmm1, (%r14,%rbx,4)
        addq    $4, %rbx
        cmpq    $1000, %rbx             # imm = 0x3E8
        jne     .LBB0_1
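
To make the cost visible in source form, the scalarized sequence above can be
sketched in C as follows. This is only an illustration of the
unpack/call/repack pattern, not code taken from LLVM. The names scalar_fn,
halve and call_scalarized_v4 are hypothetical, and halve is a stand-in scalar
routine chosen so the sketch builds without libm.

```c
#include <assert.h>
#include <stddef.h>

/* Any scalar float -> float math routine, e.g. sinf from <math.h>. */
typedef float (*scalar_fn)(float);

/* Stand-in scalar routine (assumption) so the sketch needs no libm. */
float halve(float x) { return 0.5f * x; }

/*
 * Evaluate a 4-lane "vector" call the way the scalarized code does:
 * extract each lane, make one scalar call per lane, reinsert the result.
 */
void call_scalarized_v4(scalar_fn f, const float in[4], float out[4]) {
  assert(f != NULL);
  for (size_t lane = 0; lane < 4; ++lane) {
    float arg = in[lane]; /* unpack one lane (the shufps/shufpd shuffles) */
    float res = f(arg);   /* one scalar call per lane (each callq sinf)   */
    out[lane] = res;      /* repack the result (the unpcklps merges)      */
  }
}
```

Passing sinf from <math.h> as the first argument mirrors the loop body above:
four calls plus the lane shuffles, where a vector library routine would need a
single call.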

==========================
Proposed New Functionality
==========================
In order to take advantage of the performance benefits of the svml library, the
proposed solution is to introduce a new LLVM IR pass that is capable of
translating the vector math intrinsics to svml calls. As an example, the LLVM IR
above for "vector.body", introduced in the Problem Description section, would
serve as input to the proposed pass and be transformed into the following LLVM
IR. Special attention should be paid to the "__svml_sinf4_ha" call in the LLVM
IR and resulting assembly code snippet.

vector.body:                                   ; preds = %vector.body, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ], !dbg !6
  %0 = trunc i64 %index to i32, !dbg !7
  %broadcast.splatinsert6 = insertelement <4 x i32> undef, i32 %0, i32 0,
    !dbg !7
  %broadcast.splat7 = shufflevector <4 x i32> %broadcast.splatinsert6,
    <4 x i32> undef, <4 x i32> zeroinitializer, !dbg !7
  %induction8 = add <4 x i32> %broadcast.splat7, <i32 0, i32 1, i32 2, i32 3>,
    !dbg !7
  %1 = sitofp <4 x i32> %induction8 to <4 x float>, !dbg !7
  %vcall = call <4 x float> @__svml_sinf4_ha(<4 x float> %1)
  %2 = getelementptr inbounds float, float* %array, i64 %index, !dbg !8
  %3 = bitcast float* %2 to <4 x float>*, !dbg !9
  store <4 x float> %vcall, <4 x float>* %3, align 4, !dbg !9, !tbaa !10
  %index.next = add i64 %index, 4, !dbg !6
  %4 = icmp eq i64 %index.next, 1000, !dbg !6
  br i1 %4, label %for.end, label %vector.body, !dbg !6, !llvm.loop !14

The resulting assembly would appear as:

.LBB0_1:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        movd    %ebx, %xmm0
        pshufd  $0, %xmm0, %xmm0        # xmm0 = xmm0[0,0,0,0]
        paddd   .LCPI0_0(%rip), %xmm0
        cvtdq2ps        %xmm0, %xmm0
        callq   __svml_sinf4_ha
        movups  %xmm0, (%r14,%rbx,4)
        addq    $4, %rbx
        cmpq    $1000, %rbx             # imm = 0x3E8
        jne     .LBB0_1

In order to perform the translation, several requirements must be met to guide
code generation. Those include:

1) In addition to the -ffast-math flag, support is needed from clang to allow
   the user to specify the desired precision requirements. The additional flags
   needed include the following, where "imf" is shorthand for
   "Intel math function".

   -fimf-absolute-error=value[:funclist]
          define the maximum allowable absolute error for math library
          function results
            value    - a positive, floating-point number conforming to the
                       format [digits][.digits][{e|E}[sign]digits]
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-accuracy-bits=bits[:funclist]
          define the relative error, measured by the number of correct bits,
          for math library function results
            bits     - a positive, floating-point number
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-arch-consistency=value[:funclist]
          ensures that the math library functions produce consistent results
          across different implementations of the same architecture
            value    - true or false
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-max-error=ulps[:funclist]
          defines the maximum allowable relative error, measured in ulps, for
          math library function results
            ulps     - a positive, floating-point number conforming to the
                       format [digits][.digits][{e|E}[sign]digits]
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-precision=value[:funclist]
          defines the accuracy (precision) for math library functions
            value    - defined as one of the following values
                       high   - equivalent to max-error = 0.6
                       medium - equivalent to max-error = 4
                       low    - equivalent to accuracy-bits = 11 (single
                                precision); accuracy-bits = 26 (double
                                precision)
            funclist - optional comma separated list of one or more math
                       library functions to which the attribute should be
                       applied

   -fimf-domain-exclusion=classlist[:funclist]
          indicates the input arguments domain on which math functions
          must provide correct results.
           classlist - defined as one of the following values
                         nans, infinities, denormals, zeros
                         all, none, common
           funclist - optional list of one or more math library
                      functions to which the attribute should be applied.

Information from the flags can then be encoded as function attributes at each
call site. In the future, this functionality will enable more fine-grained
control over specifying precision for individual calls/regions, instead of
setting the precision requirements for all call instances of a function. Please
note that the example translation presented so far does not have the IMF
attributes attached to the @llvm.sin.v4f32 call, and as a result the default is
set to an svml variant marked with "_ha" (max-error = 0.6), which is short for
high accuracy. Other supported variants will include low precision, enhanced
performance, bitwise reproducible, and correctly rounded. Please refer to the
IEEE-754 standard for additional information regarding supported precisions.
The compiler will select the most appropriate variant based on the IMF
attributes. See #2.

2) An interface to query for the appropriate svml function variant based on the
   scalar function name and IMF attributes.

3) For calls to math functions that store to memory (e.g., sincos), additional
   analysis of the pointer arguments is beneficial in order to generate the best
   performing load/store instructions.
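
As a rough illustration of the query interface in requirement 2, the sketch
below builds a variant name from a scalar function name, a vector factor, and
an optional precision attribute. The types imf_precision and imf_attrs and the
function svml_variant_name are hypothetical and are not part of LLVM or SVML.
The name pattern is inferred only from the "__svml_sinf4_ha" example above.
Because the proposal does not name the non-"_ha" variants, the sketch reports
failure for them rather than guessing a symbol.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Precision levels named by -fimf-precision in the proposal. */
typedef enum {
  IMF_PRECISION_HIGH,
  IMF_PRECISION_MEDIUM,
  IMF_PRECISION_LOW
} imf_precision;

/* Hypothetical record of the IMF attributes attached to one call site. */
typedef struct {
  int has_precision;       /* 0 when no IMF attribute is attached */
  imf_precision precision;
} imf_attrs;

/*
 * Write the variant name for (scalar_name, vf) into buf, for example
 * ("sinf", 4) -> "__svml_sinf4_ha". A NULL attrs or an unset precision
 * selects the high accuracy default described in the proposal.
 * Returns 0 on success, or -1 when the requested variant is not covered
 * by this sketch or the name does not fit in buf.
 */
int svml_variant_name(const char *scalar_name, unsigned vf,
                      const imf_attrs *attrs, char *buf, size_t buflen) {
  assert(scalar_name != NULL && buf != NULL);
  if (attrs != NULL && attrs->has_precision &&
      attrs->precision != IMF_PRECISION_HIGH) {
    return -1; /* symbol names for other variants are not given */
  }
  int n = snprintf(buf, buflen, "__svml_%s%u_ha", scalar_name, vf);
  return (n >= 0 && (size_t)n < buflen) ? 0 : -1;
}
```

A real interface would presumably take the full set of IMF attributes
(max-error, accuracy-bits, domain-exclusion, and so on) and return a function
declaration rather than a string; the string form is used here only to keep
the example short.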

=====================
GCC/ICC compatibility
=====================
The initial implementation will involve the translation of 6 svml functions,
which include sin, cos, log, pow, exp, and sincos (both single and double
precision variants). Support for these functions matches the current
capabilities of GCC and a subset of ICC. As more functions become open-sourced,
the plan is to support them as part of the final solution determined from this
proposal. The flags referenced in the Proposed New Functionality section are
required to maintain icc compatibility.

======================
Current Implementation
======================
To evaluate the feasibility of this proposal, a prototype transform pass has
been developed, which performs the following:

1) Searches for vector math intrinsics as candidates for translation to svml.

2) Reads function attributes to obtain precision requirements for the call. If
   none, default to attributes that will force the selection of a high accuracy
   variant.

3) Since the vector factor of the intrinsic can be wider than what is legally
   supported by the target, type legalization is performed so that the correct
   svml variant is selected. For example, if a call to
   @llvm.sin.v8f32(<8 x float> %1) is made for an xmm target, the pass will
   generate two __svml_sinf4 calls and will do the appropriate splitting of %1
   to create the new arguments for each call. In addition, the multiple return
   vectors are recombined and users of the original return vector are updated.
   The pass is also capable of handling less than full vector cases. E.g.,
   @llvm.sin.v2f32.

4) Special handling for sincos since the results are stored to a double wide
   vector and additional analysis is needed to optimize the stores to memory.
   Additional shuffling is required to obtain the sin and cos results from
   the double wide vector.

5) Vector intrinsics that are not translated to svml are scalarized.

6) The loop vectorizer has been taught to allow widening of sincos and
   additional utilities have been written to analyze arguments for sincos.
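
The type legalization described in step 3 can be sketched in C roughly as
follows. This is a scalar model of the idea, not the prototype's code. The
names vec4_fn, demo_vec4_double and legalize_vector_call are hypothetical,
demo_vec4_double is a stand-in for a 4-lane library routine such as
__svml_sinf4 so the example builds without SVML, and padding unused lanes with
0.0f is an assumption (the pass may well leave them undefined).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LEGAL_VF 4 /* float lanes in one xmm register */

/* Shape of a 4-lane vector math routine such as __svml_sinf4. */
typedef void (*vec4_fn)(const float in[LEGAL_VF], float out[LEGAL_VF]);

/* Stand-in for the library routine (assumption): doubles each lane. */
void demo_vec4_double(const float in[LEGAL_VF], float out[LEGAL_VF]) {
  for (size_t i = 0; i < LEGAL_VF; ++i) out[i] = 2.0f * in[i];
}

/*
 * Legalize an n-lane call onto LEGAL_VF-lane calls:
 *  - a wide input (n = 8) is split into consecutive chunks, one call per
 *    chunk, and the partial results are written back in order;
 *  - a short input or tail (n = 2) is padded to a full vector and only
 *    the live lanes of the result are kept.
 */
void legalize_vector_call(vec4_fn f, const float *in, float *out, size_t n) {
  assert(f != NULL && in != NULL && out != NULL);
  for (size_t base = 0; base < n; base += LEGAL_VF) {
    size_t live = (n - base < LEGAL_VF) ? n - base : LEGAL_VF;
    float arg[LEGAL_VF] = {0}; /* padding lanes hold 0.0f */
    float res[LEGAL_VF];
    memcpy(arg, in + base, live * sizeof(float));  /* split / pad     */
    f(arg, res);                                   /* one legal call  */
    memcpy(out + base, res, live * sizeof(float)); /* recombine       */
  }
}
```

With n = 8 this makes two 4-lane calls, matching the @llvm.sin.v8f32 example;
with n = 2 it makes one call and discards the two padding lanes, matching the
@llvm.sin.v2f32 case. The sincos handling from step 4 is not modeled here.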

========
Feedback
========
For those who are interested in this topic, I would like to ask for your review
of this proposal and would definitely appreciate any/all feedback on the
proposed approach. Help is also very welcome and much appreciated in the
development process.