thr3ads.net - llvm dev - [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer [Dec 2016]

If this information is useful, please help other people find it:
Share via:

Tian, Xinmin via llvm-dev

2016-Dec-08 18:11 UTC

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Hi Francesco, a bit more information.  GCC veclib is implemented based on GCC
VectorABI for declare simd as well.

For name mangling, we have to follow certain rules of C/C++ (e.g. prefix needs
to _ZVG ....).  David Majnemer who is the owner and stakeholder for approval for
Clang and LLVM.  Also,  we need to pay attention to GCC compatibility.  I would
suggest you look into how GCC VectorABI can be extended support your Arch.

Thanks,
Xinmin

-----Original Message-----
From: Odeh, Saher 
Sent: Thursday, December 8, 2016 3:49 AM
To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at lists.llvm.org;
Francesco.Petrogalli at arm.com
Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>; Hal
Finkel <hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>;
a.bataev at hotmail.com
Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
LoopVectorizer

Hi Francesco,

As you stated in the RFC, when vectorizing a scalar function (e.g. when using
omp declare simd), one needs to incorporate attributes to the resulting
vectorized-function.
These attributes describe a) the behavior of the function, e.g. mask-able or
not, and b) the type of the parameters, e.g. scalar or linear or any other
option.

As this list is extensive, it is only logical to use an existing infrastructure
of ICC and GCC vectorABI which already covers all of these options as stated in
Xinmin's RFC
[http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html].
Moreover, when considering other compilers such as GCC, I do see that the
resulting assembly actually does incorporate this exact infrastructure. So if we
wish to link different parts of the program using clang and GCC we'll need
to adhere to the same name mangling/ABI. Please see the below result after
compiling an omp declare simd function using GCC.
Lastly, please note the two out of the three components of the implementation
have already been committed or submitted, and both are adhering the name
mangling proposed by Xinmin's RFC. A) committed - the FE portion by Alexey
[https://reviews.llvm.org/rL264853], it generates mangled names in the manner
described by Xinmin's RFC, See below B) Submitted - the callee side by Matt
[https://reviews.llvm.org/D22792], it uses these mangled names. and C) caller
which is covered by this patch.

In order to mitigate the needed effort and possible issues when implementing, I
believe it is best to follow the name mangling proposed in Xinmin's RFC.
What do you think?

GCC Example
----------------
Compiler version: GCC 6.1.0
Compile line: gcc -c omp.c -fopenmp -Wall -S -o - -O3 > omp.s

omp.c
#include <omp.h>

#pragma omp declare simd
int dowork(int* a, int idx)
{
 return a[idx] * a[idx]*7;
}

less omp.s | grep @function
        .type   dowork, @function
        .type   _ZGVbN4vv_dowork, @function
        .type   _ZGVbM4vv_dowork, @function
        .type   _ZGVcN4vv_dowork, @function
        .type   _ZGVcM4vv_dowork, @function
        .type   _ZGVdN8vv_dowork, @function
        .type   _ZGVdM8vv_dowork, @function
        .type   _ZGVeN16vv_dowork, @function
        .type   _ZGVeM16vv_dowork, @function

Clang on FE using Alexey's patch
---------------------------------------
Compile line: clang -c tst/omp_fun.c -fopenmp -mllvm -print-after-all >&
out

#pragma omp declare simd
extern int dowork(int* a, int idx)
{
  return a[idx]*7;
}


int main() {
  dowork(0,1);
}

attributes #0 = { nounwind uwtable "_ZGVbM4vv_dowork"
"_ZGVbN4vv_dowork" "_ZGVcM8vv_dowork"
"_ZGVcN8vv_dowork" "_ZGVdM8vv_dowork"
"_ZGVdN8vv_dowork" "_ZGVeM16vv_dowork"
"_ZGVeN16vv_dowork"
"correctly-rounded-divide-sqrt-fp-math"="false"
"disable-tail-calls"="false"
"less-precise-fpmad"="false"
"no-frame-pointer-elim"="true"
"no-frame-pointer-elim-non-leaf"
"no-infs-fp-math"="false"
"no-jump-tables"="false"
"no-nans-fp-math"="false"
"no-signed-zeros-fp-math"="false"
"no-trapping-math"="false"
"stack-protector-buffer-size"="8"
"target-cpu"="x86-64"
"target-features"="+fxsr,+mmx,+sse,+sse2,+x87"
"unsafe-fp-math"="false"
"use-soft-float"="false" }


Thanks Saher

-----Original Message-----
From: Francesco Petrogalli [mailto:Francesco.Petrogalli at arm.com] 
Sent: Tuesday, December 06, 2016 17:22
To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at lists.llvm.org
Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>; Hal
Finkel <hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>;
a.bataev at hotmail.com
Subject: Re: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
LoopVectorizer

Hi Xinmin,

Thank you for your email.

I have been catching up with the content of your proposal, and I have some
questions/remarks below that I'd like to discuss with you - see the final
section in the proposal.

I have specifically added Alexey B. to the mail so we can move our conversation
from phabricator to the mailing list.

Before we start, I just want to mention that the initial idea of using
llvm::FunctionType for vector function generation and matching has been proposed
by a colleague, Paul Walker, when we first tried out supporting this on AArch64
on an internal version of llvm. I received some input also from Amara Emerson.

In our case we had a slightly different problem to solve: we wanted to support
in the vectorizer a rich set of vector math routines provided with an external
library. We managed to do this by adding the pragma to the (scalar) function
declaration of the header file provided with the library, and as shown by the
patches I have submitted, by generating vector function signatures that the
vectorizer can search in the TargetLibraryInfo.

Here is an updated version of the proposal. Please let me know what you think,
and if you have any solution we could use for the final section.

# RFC for "pragma omp declare simd"

Hight level components:

A) Global variable generator (clang FE)
B) Parameter descriptors (as new enumerations in llvm::Attribute)
C) TLII methods and fields for the multimap (llvm middle-end)

## Workflow

Example user input, with a declaration and definition:

    #pragma omp declare simd
    #pragma omp declare simd uniform(y)
    extern double pow(double x, double y);

    #pragma omp declare simd
    #pragma omp declare simd linear(x:2)
    float foo(float x) {....}

    /// code using both functions

### Step 1


The compiler FE process these definition and declaration and generates a list of
globals as follows:

    @prefix_vector_pow1_midfix_pow_postfix = external global
                                             <4 x double>(<4 x
double>,
                                                          <4 x double>)
    @prefix_vector_pow2_midfix_pow_postfix = external global
                                             <4 x double>(<4 x
double>,
                                                          double)
    @prefix_vector_foo1_midfix_foo_postfix = external global
                                             <8 x float>(<8 x
float>,
                                                         <8 x float>)
    @prefix_vector_foo1_midfix_foo_postfix = external global
                                             <8 x float>(<8 x
float>,
                                                         <8 x float> #0)
    ...
    attribute #0  = {linear = 2}


Notes about step 1:

1. The mapping scalar name <-> vector name is in the
   prefix/midfix/postfix mangled name of the global variable.
2. The examples shows only a set of possible vector function for a
   sizeof(<4 x double>) vector extension. If multiple vector extension
   live in the same target (eg. NEON 64-bit or NEON 128-bit, or SSE
   and AVX512) the front end takes care to generate each of the
   associated functions (like it is done now).
3. Vector function parameters are rendered using the same
   Characteristic Data Type (CDT) rule already in the compiler FE.
4. Uniform parameters are rendered with the original scalar type.
5. Linear parameters are rendered with vectors using the same
   CDT-generated vector length, and decorated with proper
   attributes. I think we could extent the llvm::Attribute enumeration adding
the following:
   - linear : numeric, specify_the step
   - linear_var : numeric, specify the position of the uniform variable holding
the step
   - linear_uval[_var]: numeric as before, but for the "uval" modifier
(both constant step or variable step)
   - linear_val[_var]: numeric, as before, but for "val" modifier
   - linear_ref[_var] numeric, for "ref" modifier.

   For example, "attribute #0 = {linear = 2}" says that the vector of
   the associated parameter in the function signature has a linear
   step of 2.

### Step 2

The compiler FE invokes a TLII method in BackendUtils.cpp that populate a
multimap in the TLII by checking the globals created in the previous step.

Each global is processed, demangling the [pre/mid/post]fix name and generate a
mapping in the TLII as follows:

    struct VectorFnInfo {
       std::string Name;
       FunctionType *Signature;
    };
    std::multimap<std:string, VectorFnInfo> VFInfo;


For the initial example, the multimap in the TLI is populated as follows:

    "pow" -> [(vector_pow1, <4 x double>(<4 x double>,
<4 x double>)),
              (vector_pow2, <4 x double>(<4 x double>, double))]

    "foo" -> [(vector_foo1, <8 x float>(<8 x float>,
<8 x float>)),
              (vector_foo2, <8 x float>(<8 x float>, <8 x
float> #0))]

Notes about step 2:

Given the fact that the external globals that the FE have generated are removed
_before_ the vectorizer kicks in, I am not sure if the "attribute #0"
needed for one of the parameter is still present at this point. IF NOT, I think
that in this case we could enrich the "VectorFnInfo" as follows:

    struct VectorFnInfo {
       std::string Name;
       FunctionType *Signature;
       std::set<unsigned, llvm:Attribute> Attrs;
    };

The field "Attrs" maps the position of the parameter with the
correspondent llvm::Attribute present in the global variable.

I have added this note for the sake of completeness. I *think* that we won't
be needing this additional Attrs field: I have already shown in the llvm patch I
submitted that the function type "survives" after the global gets
removed, I don't see why the parameter attribute shouldn't survive too
(last famous words?).

### Step 3

This step happens in the LoopVectorizer. The InnerLoopVectorizer queries the
TargetLibraryInfo looking for a vectorized version of the function by scalar
name and function signature with the following method:

    TargetLibraryInfo::isFunctionVectorizable(std::string ScalarName,
FuncionType *FTy);

This is done in a way similar to what my current llvm patch does: the loop
vectorizer makes up the function signature it needs and look for it in the TLI.
If a match is found, vectorization is possible. Right now the compiler is not
aware of uniform/linear function attributes, but it still can refer to them in a
target agnostic way, by using scalar signatures for the uniform ones and using
llvm::Attributes for the linear ones.

Notice that the vector name here is not used at all, which is good as any
architecture can come up with it's own name mangling for vector functions,
without breaking the ability of the vectorizer to vectorize the same code with
the new name mangling.

## External libraries vs user provided code

The example with "pow" and "foo" I have provided before
shows a function declaration and a function definition. Although the TLII
mechanism I have described seems to be valid only for the former case, I think
that it is valid also for the latter.  In fact, in case of a function
definition, the compiler would have to generate also the body of the vector
function, but that external global variable could still be used to inform the
TLII of such function. The fact that the vector function needed by the
vectorizer is in some module instead of in an external library doesn't seems
to make all that difference at compile time to me.

# Some final notes (call for ideas!)

There is one level of target dependence that I still have to sort out, and for
this I need input from the community and in particular from the Intel folks.

I will start with this example:

    #pragma omp declare simd
    float foo(float x);

In case of NEON, this would generate 2 globals, one for vectors holding 2
floats, and one for vector holding 4 floats, corresponding to NEON 64-bit and
128-bit respectively. This means that the vectorizer have a unique function it
could choose from the list the TLI provides.

This is not the same on Intel, for example when this code generates vector names
for AVX and AVX2. The register width for these architecture extensions are the
same, so all the TLI has is a mapping between scalar name and (vectro_name,
function_type) who's two elements differ only in the vector_name string.

This breaks the target independence of the vectorizer, as it would require it to
parse the vector_name to be able to choose between the AVX or the AVX2
implementation.

Now, to make this work one should have to encode the SSE/SSE2/AVX/AVX2
information in the VectorFnInfo structure. Does anybody have an idea on how best
to do it? For the sake of keeping the vectorizer target independent, I would
like to avoid encoding this piece of information in the VectorFnInfo struct. I
have seen that in your code you are generating SSE/AVX/AVX2/AVX512 vector
functions, how do you plan to choose between them in the vectorizer? I could not
find how you planned to solve this problem in your proposal, or have I just
missed it?

Is there a way to do this in the TLII? The function type of the vector function
could use the "target-feature" attribute of function definitions, but
how coudl the vectorizer decide which one to use?

Anyway, that's it. Your feedback will be much appreciated.

Cheers,
Francesco

________________________________________
From: Tian, Xinmin <xinmin.tian at intel.com>
Sent: 30 November 2016 17:16:12
To: Francesco Petrogalli; llvm-dev at lists.llvm.org
Cc: nd; Masten, Matt; Hal Finkel; Zaks, Ayal
Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
LoopVectorizer

Hi Francesco,

Good to know, you are working on the support for this feature. I assume you knew
the RFC below.  The VectorABI mangling we proposed were approved by C++ Clang FE
name mangling owner David M from Google,  the ClangFE support was committed in
its main trunk by Alexey.

"Proposal for function vectorization and loop vectorization with function
calls", March 2, 2016. Intel Corp. 
http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html.

Matt submitted patch to generate vector variants for function definitions, not
just function declarations. You may want to take a look.  Ayal's RFC will be
also needed to support vectorization of function body in general.

I agreed, we should have an option -fopenmp-simd to enable SIMD only, both GCC
and ICC have similar options.

I would suggest we shall sync-up on these work, so we don't duplicate the
effort.

Thanks,
Xinmin

-----Original Message-----
From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
Francesco Petrogalli via llvm-dev
Sent: Wednesday, November 30, 2016 7:11 AM
To: llvm-dev at lists.llvm.org
Cc: nd <nd at arm.com>
Subject: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
LoopVectorizer

Dear all,

I have just created a couple of differential reviews to enable the vectorisation
of loops that have function calls to routines marked with "#pragma omp
declare simd".

They can be (re)viewed here:

* https://reviews.llvm.org/D27249

* https://reviews.llvm.org/D27250

The current implementation allows the loop vectorizer to generate vector code
for source file as:

  #pragma omp declare simd
  double f(double x);

  void aaa(double *x, double *y, int N) {
    for (int i = 0; i < N; ++i) {
      x[i] = f(y[i]);
    }
  }


by invoking clang with arguments:

  $> clang -fopenmp -c -O3 file.c [...]


Such functionality should provide a nice interface for vector libraries
developers that can be used to inform the loop vectorizer of the availability of
an external library with the vector implementation of the scalar functions in
the loops. For this, all is needed to do is to mark with "#pragma omp
declare simd" the function declaration in the header file of the library
and generate the associated symbols in the object file of the library according
to the name scheme of the vector ABI (see notes below).

I am interested in any feedback/suggestion/review the community might have
regarding this behaviour.

Below you find a description of the implementation and some notes.

Thanks,

Francesco

-----------

The functionality is implemented as follow:

1. Clang CodeGen generates a set of global external variables for each of the
function declarations marked with the OpenMP pragma. Each of such globals are
named according a mangling that is generated by llvm::TargetLibraryInfoImpl
(TLII), and holds the vector signature of the associated vector function. (See
examples in the tests of the clang patch.
Each scalar function can generate multiple vector functions depending on the
clauses of the declare simd directives) 2. When clang created the TLII, it
processes the llvm::Module and finds out which of the globals of the module have
the correct mangling and type so that they be added to the TLII as a list of
vector function that can be associated to the original scalar one.
3. The LoopVectorizer looks for the available vector functions through the TLII
not by scalar name and vectorisation factor but by scalar name and vector
function signature, thus enabling the vectorizer to be able to distinguish a
"vector vpow1(vector x, vector y)" from a "vector vpow2(vector x,
scalar y)". (The second one corresponds to a "declare simd
uniform(y)" for a "scalar pow(scalar x, scalar y)" declaration).
(Notice that the changes in the loop vectorizer are minimal.)


Notes:

1. To enable SIMD only for OpenMP, leaving all the multithread/target behaviour
behind, we should enable this also with a new option:
-fopenmp-simd
2. The AArch64 vector ABI in the code is essentially the same as for the Intel
one (apart from the prefix and the masking argument), and it is based on the
clauses associated to "declare simd" in OpenMP 4.0. For OpenMP4.5, the
parameters section of the mangled name should be updated.
This update will not change the vectorizer behaviour as all the vectorizer needs
to detect a vectorizable function is the original scalar name and a compatible
vector function signature. Of course, any changes/updates in the ABI will have
to be reflected in the symbols of the binary file of the library.
3. Whistle this is working only for function declaration, the same functionality
can be used when (if) clang will implement the declare simd OpenMP pragma for
function definitions.
4. I have enabled this for any loop that invokes the scalar function call, not
just for those annotated with "#pragma omp for simd". I don't have
any preference here, but at the same time I don't see any reason why this
shouldn't be enabled by default for non annotated loops. Let me know if you
disagree, I'd happily change the functionality if there are sound reasons
behind that.

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Renato Golin via llvm-dev

2016-Dec-08 22:08 UTC

head link

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

On 8 December 2016 at 18:11, Tian, Xinmin via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> For name mangling, we have to follow certain rules of C/C++ (e.g. prefix
needs to _ZVG ....). David Majnemer who is the owner and stakeholder for
approval for Clang and LLVM. Also, we need to pay attention to GCC
compatibility. I would suggest you look into how GCC VectorABI can be extended
support your Arch.
Hi Xinmin,

I only began to review this proposal, and like yours, I think this is
a really important feature to get in.

I agree with you on the name mangling need for C++, as well as
compatibility with GCC, but according to Francesco, there are some
problems that those two alone don't solve.

I'm still unsure how the simplistic mangling we have today will work
around the multiple versions we could have with NEON (and in the
future, SVE) without polluting the mangling quite a lot (have you seen
arm_neon.h?).

So, we may get away with it for now with some basic support and the
current style, but this should grow into a more flexible scheme.

About the current IR form, I don't particularly like how they're tied
up together, but other than having multiple global functions defined
(something like weak linkage?), I don't have a better idea right now.

Francesco,

Maybe the best thing to do right now would be to try and fit NEON
alternatives in this mangling scheme and see how it goes. If anything,
it'll give us an idea on what's broken, and hopefully, how to fix it.

cheers,
--renato

Francesco Petrogalli via llvm-dev

2016-Dec-12 13:44 UTC

head link

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Hi Xinmin,

I have updated the clang patch using the standard name mangling you
suggested - I was not fully aware of the C++ mangling convention “_ZVG”.

I am using “D” for 64-bit NEON and “Q” for 128-bit NEON, which makes NEON
vector symbols look as follows:

_ZVGQN2v__Z1fd
_ZVGDN2v__Z1ff
_ZVGQN4v__Z1ff


Here “Q” means -> NEON 128-bit, “D” means -> NEON 64-bit

Please notice that although I have changed the name mangling in clang [1],
there have been no need to update the relative llvm patch [2], as the
vectorisation process is _independent_ of the name mangling.

Regards,

Francesco

[1] https://reviews.llvm.org/D27250
[2] https://reviews.llvm.org/D27249, The only update was a bug fix in the
copy constructor of the TLII and in the return value of the TLII::mangle()
method. None of the underlying scalar/vector function matching algorithms
have been touched.


On 08/12/2016 18:11, "Tian, Xinmin" <xinmin.tian at intel.com>
wrote:
>Hi Francesco, a bit more information.  GCC veclib is implemented based on
>GCC VectorABI for declare simd as well.
>
>For name mangling, we have to follow certain rules of C/C++ (e.g. prefix
>needs to _ZVG ....).  David Majnemer who is the owner and stakeholder for
>approval for Clang and LLVM.  Also,  we need to pay attention to GCC
>compatibility.  I would suggest you look into how GCC VectorABI can be
>extended support your Arch.
>
>Thanks,
>Xinmin
>
>-----Original Message-----
>From: Odeh, Saher 
>Sent: Thursday, December 8, 2016 3:49 AM
>To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at
lists.llvm.org;
>Francesco.Petrogalli at arm.com
>Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>;
Hal Finkel
><hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>;
a.bataev at hotmail.com
>Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in
the
>LoopVectorizer
>
>Hi Francesco,
>
>As you stated in the RFC, when vectorizing a scalar function (e.g. when
>using omp declare simd), one needs to incorporate attributes to the
>resulting vectorized-function.
>These attributes describe a) the behavior of the function, e.g. mask-able
>or not, and b) the type of the parameters, e.g. scalar or linear or any
>other option.
>
>As this list is extensive, it is only logical to use an existing
>infrastructure of ICC and GCC vectorABI which already covers all of these
>options as stated in Xinmin's RFC
>[http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html].
>Moreover, when considering other compilers such as GCC, I do see that the
>resulting assembly actually does incorporate this exact infrastructure.
>So if we wish to link different parts of the program using clang and GCC
>we'll need to adhere to the same name mangling/ABI. Please see the below
>result after compiling an omp declare simd function using GCC.
>Lastly, please note the two out of the three components of the
>implementation have already been committed or submitted, and both are
>adhering the name mangling proposed by Xinmin's RFC. A) committed - the
>FE portion by Alexey [https://reviews.llvm.org/rL264853], it generates
>mangled names in the manner described by Xinmin's RFC, See below B)
>Submitted - the callee side by Matt [https://reviews.llvm.org/D22792], it
>uses these mangled names. and C) caller which is covered by this patch.
>
>In order to mitigate the needed effort and possible issues when
>implementing, I believe it is best to follow the name mangling proposed
>in Xinmin's RFC. What do you think?
>
>GCC Example
>----------------
>Compiler version: GCC 6.1.0
>Compile line: gcc -c omp.c -fopenmp -Wall -S -o - -O3 > omp.s
>
>omp.c
>#include <omp.h>
>
>#pragma omp declare simd
>int dowork(int* a, int idx)
>{
> return a[idx] * a[idx]*7;
>}
>
>less omp.s | grep @function
>        .type   dowork, @function
>        .type   _ZGVbN4vv_dowork, @function
>        .type   _ZGVbM4vv_dowork, @function
>        .type   _ZGVcN4vv_dowork, @function
>        .type   _ZGVcM4vv_dowork, @function
>        .type   _ZGVdN8vv_dowork, @function
>        .type   _ZGVdM8vv_dowork, @function
>        .type   _ZGVeN16vv_dowork, @function
>        .type   _ZGVeM16vv_dowork, @function
>
>Clang on FE using Alexey's patch
>---------------------------------------
>Compile line: clang -c tst/omp_fun.c -fopenmp -mllvm -print-after-all
>&
>out
>
>#pragma omp declare simd
>extern int dowork(int* a, int idx)
>{
>  return a[idx]*7;
>}
>
>
>int main() {
>  dowork(0,1);
>}
>
>attributes #0 = { nounwind uwtable "_ZGVbM4vv_dowork"
"_ZGVbN4vv_dowork"
>"_ZGVcM8vv_dowork" "_ZGVcN8vv_dowork"
"_ZGVdM8vv_dowork"
>"_ZGVdN8vv_dowork" "_ZGVeM16vv_dowork"
"_ZGVeN16vv_dowork"
>"correctly-rounded-divide-sqrt-fp-math"="false"
>"disable-tail-calls"="false"
"less-precise-fpmad"="false"
>"no-frame-pointer-elim"="true"
"no-frame-pointer-elim-non-leaf"
>"no-infs-fp-math"="false"
"no-jump-tables"="false"
>"no-nans-fp-math"="false"
"no-signed-zeros-fp-math"="false"
>"no-trapping-math"="false"
"stack-protector-buffer-size"="8"
>"target-cpu"="x86-64"
"target-features"="+fxsr,+mmx,+sse,+sse2,+x87"
>"unsafe-fp-math"="false"
"use-soft-float"="false" }
>
>
>Thanks Saher
>
>-----Original Message-----
>From: Francesco Petrogalli [mailto:Francesco.Petrogalli at arm.com]
>Sent: Tuesday, December 06, 2016 17:22
>To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at
lists.llvm.org
>Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>;
Hal Finkel
><hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>;
a.bataev at hotmail.com
>Subject: Re: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in
the
>LoopVectorizer
>
>Hi Xinmin,
>
>Thank you for your email.
>
>I have been catching up with the content of your proposal, and I have
>some questions/remarks below that I'd like to discuss with you - see the
>final section in the proposal.
>
>I have specifically added Alexey B. to the mail so we can move our
>conversation from phabricator to the mailing list.
>
>Before we start, I just want to mention that the initial idea of using
>llvm::FunctionType for vector function generation and matching has been
>proposed by a colleague, Paul Walker, when we first tried out supporting
>this on AArch64 on an internal version of llvm. I received some input
>also from Amara Emerson.
>
>In our case we had a slightly different problem to solve: we wanted to
>support in the vectorizer a rich set of vector math routines provided
>with an external library. We managed to do this by adding the pragma to
>the (scalar) function declaration of the header file provided with the
>library, and as shown by the patches I have submitted, by generating
>vector function signatures that the vectorizer can search in the
>TargetLibraryInfo.
>
>Here is an updated version of the proposal. Please let me know what you
>think, and if you have any solution we could use for the final section.
>
># RFC for "pragma omp declare simd"
>
>Hight level components:
>
>A) Global variable generator (clang FE)
>B) Parameter descriptors (as new enumerations in llvm::Attribute)
>C) TLII methods and fields for the multimap (llvm middle-end)
>
>## Workflow
>
>Example user input, with a declaration and definition:
>
>    #pragma omp declare simd
>    #pragma omp declare simd uniform(y)
>    extern double pow(double x, double y);
>
>    #pragma omp declare simd
>    #pragma omp declare simd linear(x:2)
>    float foo(float x) {....}
>
>    /// code using both functions
>
>### Step 1
>
>
>The compiler FE process these definition and declaration and generates a
>list of globals as follows:
>
>    @prefix_vector_pow1_midfix_pow_postfix = external global
>                                             <4 x double>(<4 x
double>,
>                                                          <4 x
double>)
>    @prefix_vector_pow2_midfix_pow_postfix = external global
>                                             <4 x double>(<4 x
double>,
>                                                          double)
>    @prefix_vector_foo1_midfix_foo_postfix = external global
>                                             <8 x float>(<8 x
float>,
>                                                         <8 x float>)
>    @prefix_vector_foo1_midfix_foo_postfix = external global
>                                             <8 x float>(<8 x
float>,
>                                                         <8 x float>
#0)
>    ...
>    attribute #0  = {linear = 2}
>
>
>Notes about step 1:
>
>1. The mapping scalar name <-> vector name is in the
>   prefix/midfix/postfix mangled name of the global variable.
>2. The examples shows only a set of possible vector function for a
>   sizeof(<4 x double>) vector extension. If multiple vector extension
>   live in the same target (eg. NEON 64-bit or NEON 128-bit, or SSE
>   and AVX512) the front end takes care to generate each of the
>   associated functions (like it is done now).
>3. Vector function parameters are rendered using the same
>   Characteristic Data Type (CDT) rule already in the compiler FE.
>4. Uniform parameters are rendered with the original scalar type.
>5. Linear parameters are rendered with vectors using the same
>   CDT-generated vector length, and decorated with proper
>   attributes. I think we could extent the llvm::Attribute enumeration
>adding the following:
>   - linear : numeric, specify_the step
>   - linear_var : numeric, specify the position of the uniform variable
>holding the step
>   - linear_uval[_var]: numeric as before, but for the "uval"
modifier
>(both constant step or variable step)
>   - linear_val[_var]: numeric, as before, but for "val" modifier
>   - linear_ref[_var] numeric, for "ref" modifier.
>
>   For example, "attribute #0 = {linear = 2}" says that the vector
of
>   the associated parameter in the function signature has a linear
>   step of 2.
>
>### Step 2
>
>The compiler FE invokes a TLII method in BackendUtils.cpp that populate a
>multimap in the TLII by checking the globals created in the previous step.
>
>Each global is processed, demangling the [pre/mid/post]fix name and
>generate a mapping in the TLII as follows:
>
>    struct VectorFnInfo {
>       std::string Name;
>       FunctionType *Signature;
>    };
>    std::multimap<std:string, VectorFnInfo> VFInfo;
>
>
>For the initial example, the multimap in the TLI is populated as follows:
>
>    "pow" -> [(vector_pow1, <4 x double>(<4 x
double>, <4 x double>)),
>              (vector_pow2, <4 x double>(<4 x double>, double))]
>
>    "foo" -> [(vector_foo1, <8 x float>(<8 x
float>, <8 x float>)),
>              (vector_foo2, <8 x float>(<8 x float>, <8 x
float> #0))]
>
>Notes about step 2:
>
>Given the fact that the external globals that the FE have generated are
>removed _before_ the vectorizer kicks in, I am not sure if the
"attribute
>#0" needed for one of the parameter is still present at this point. IF
>NOT, I think that in this case we could enrich the "VectorFnInfo"
as
>follows:
>
>    struct VectorFnInfo {
>       std::string Name;
>       FunctionType *Signature;
>       std::set<unsigned, llvm:Attribute> Attrs;
>    };
>
>The field "Attrs" maps the position of the parameter with the
>correspondent llvm::Attribute present in the global variable.
>
>I have added this note for the sake of completeness. I *think* that we
>won't be needing this additional Attrs field: I have already shown in
the
>llvm patch I submitted that the function type "survives" after the
global
>gets removed, I don't see why the parameter attribute shouldn't
survive
>too (last famous words?).
>
>### Step 3
>
>This step happens in the LoopVectorizer. The InnerLoopVectorizer queries
>the TargetLibraryInfo looking for a vectorized version of the function by
>scalar name and function signature with the following method:
>
>    TargetLibraryInfo::isFunctionVectorizable(std::string ScalarName,
>FuncionType *FTy);
>
>This is done in a way similar to what my current llvm patch does: the
>loop vectorizer makes up the function signature it needs and look for it
>in the TLI. If a match is found, vectorization is possible. Right now the
>compiler is not aware of uniform/linear function attributes, but it still
>can refer to them in a target agnostic way, by using scalar signatures
>for the uniform ones and using llvm::Attributes for the linear ones.
>
>Notice that the vector name here is not used at all, which is good as any
>architecture can come up with it's own name mangling for vector
>functions, without breaking the ability of the vectorizer to vectorize
>the same code with the new name mangling.
>
>## External libraries vs user provided code
>
>The example with "pow" and "foo" I have provided before
shows a function
>declaration and a function definition. Although the TLII mechanism I have
>described seems to be valid only for the former case, I think that it is
>valid also for the latter.  In fact, in case of a function definition,
>the compiler would have to generate also the body of the vector function,
>but that external global variable could still be used to inform the TLII
>of such function. The fact that the vector function needed by the
>vectorizer is in some module instead of in an external library doesn't
>seems to make all that difference at compile time to me.
>
># Some final notes (call for ideas!)
>
>There is one level of target dependence that I still have to sort out,
>and for this I need input from the community and in particular from the
>Intel folks.
>
>I will start with this example:
>
>    #pragma omp declare simd
>    float foo(float x);
>
>In case of NEON, this would generate 2 globals, one for vectors holding 2
>floats, and one for vector holding 4 floats, corresponding to NEON 64-bit
>and 128-bit respectively. This means that the vectorizer have a unique
>function it could choose from the list the TLI provides.
>
>This is not the same on Intel, for example when this code generates
>vector names for AVX and AVX2. The register width for these architecture
>extensions are the same, so all the TLI has is a mapping between scalar
>name and (vectro_name, function_type) who's two elements differ only in
>the vector_name string.
>
>This breaks the target independence of the vectorizer, as it would
>require it to parse the vector_name to be able to choose between the AVX
>or the AVX2 implementation.
>
>Now, to make this work one should have to encode the SSE/SSE2/AVX/AVX2
>information in the VectorFnInfo structure. Does anybody have an idea on
>how best to do it? For the sake of keeping the vectorizer target
>independent, I would like to avoid encoding this piece of information in
>the VectorFnInfo struct. I have seen that in your code you are generating
>SSE/AVX/AVX2/AVX512 vector functions, how do you plan to choose between
>them in the vectorizer? I could not find how you planned to solve this
>problem in your proposal, or have I just missed it?
>
>Is there a way to do this in the TLII? The function type of the vector
>function could use the "target-feature" attribute of function
>definitions, but how coudl the vectorizer decide which one to use?
>
>Anyway, that's it. Your feedback will be much appreciated.
>
>Cheers,
>Francesco
>
>________________________________________
>From: Tian, Xinmin <xinmin.tian at intel.com>
>Sent: 30 November 2016 17:16:12
>To: Francesco Petrogalli; llvm-dev at lists.llvm.org
>Cc: nd; Masten, Matt; Hal Finkel; Zaks, Ayal
>Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in
the
>LoopVectorizer
>
>Hi Francesco,
>
>Good to know, you are working on the support for this feature. I assume
>you knew the RFC below.  The VectorABI mangling we proposed were approved
>by C++ Clang FE name mangling owner David M from Google,  the ClangFE
>support was committed in its main trunk by Alexey.
>
>"Proposal for function vectorization and loop vectorization with
function
>calls", March 2, 2016. Intel Corp.
>http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html.
>
>Matt submitted patch to generate vector variants for function
>definitions, not just function declarations. You may want to take a look.
> Ayal's RFC will be also needed to support vectorization of function
body
>in general.
>
>I agreed, we should have an option -fopenmp-simd to enable SIMD only,
>both GCC and ICC have similar options.
>
>I would suggest we shall sync-up on these work, so we don't duplicate
the
>effort.
>
>Thanks,
>Xinmin
>
>-----Original Message-----
>From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
>Francesco Petrogalli via llvm-dev
>Sent: Wednesday, November 30, 2016 7:11 AM
>To: llvm-dev at lists.llvm.org
>Cc: nd <nd at arm.com>
>Subject: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
>LoopVectorizer
>
>Dear all,
>
>I have just created a couple of differential reviews to enable the
>vectorisation of loops that have function calls to routines marked with
>"#pragma omp declare simd".
>
>They can be (re)viewed here:
>
>* https://reviews.llvm.org/D27249
>
>* https://reviews.llvm.org/D27250
>
>The current implementation allows the loop vectorizer to generate vector
>code for source file as:
>
>  #pragma omp declare simd
>  double f(double x);
>
>  void aaa(double *x, double *y, int N) {
>    for (int i = 0; i < N; ++i) {
>      x[i] = f(y[i]);
>    }
>  }
>
>
>by invoking clang with arguments:
>
>  $> clang -fopenmp -c -O3 file.c [...]
>
>
>Such functionality should provide a nice interface for vector libraries
>developers that can be used to inform the loop vectorizer of the
>availability of an external library with the vector implementation of the
>scalar functions in the loops. For this, all is needed to do is to mark
>with "#pragma omp declare simd" the function declaration in the
header
>file of the library and generate the associated symbols in the object
>file of the library according to the name scheme of the vector ABI (see
>notes below).
>
>I am interested in any feedback/suggestion/review the community might
>have regarding this behaviour.
>
>Below you find a description of the implementation and some notes.
>
>Thanks,
>
>Francesco
>
>-----------
>
>The functionality is implemented as follow:
>
>1. Clang CodeGen generates a set of global external variables for each of
>the function declarations marked with the OpenMP pragma. Each of such
>globals are named according a mangling that is generated by
>llvm::TargetLibraryInfoImpl (TLII), and holds the vector signature of the
>associated vector function. (See examples in the tests of the clang patch.
>Each scalar function can generate multiple vector functions depending on
>the clauses of the declare simd directives) 2. When clang created the
>TLII, it processes the llvm::Module and finds out which of the globals of
>the module have the correct mangling and type so that they be added to
>the TLII as a list of vector function that can be associated to the
>original scalar one.
>3. The LoopVectorizer looks for the available vector functions through
>the TLII not by scalar name and vectorisation factor but by scalar name
>and vector function signature, thus enabling the vectorizer to be able to
>distinguish a "vector vpow1(vector x, vector y)" from a
"vector
>vpow2(vector x, scalar y)". (The second one corresponds to a
"declare
>simd uniform(y)" for a "scalar pow(scalar x, scalar y)"
declaration).
>(Notice that the changes in the loop vectorizer are minimal.)
>
>
>Notes:
>
>1. To enable SIMD only for OpenMP, leaving all the multithread/target
>behaviour behind, we should enable this also with a new option:
>-fopenmp-simd
>2. The AArch64 vector ABI in the code is essentially the same as for the
>Intel one (apart from the prefix and the masking argument), and it is
>based on the clauses associated to "declare simd" in OpenMP 4.0.
For
>OpenMP4.5, the parameters section of the mangled name should be updated.
>This update will not change the vectorizer behaviour as all the
>vectorizer needs to detect a vectorizable function is the original scalar
>name and a compatible vector function signature. Of course, any
>changes/updates in the ABI will have to be reflected in the symbols of
>the binary file of the library.
>3. Whistle this is working only for function declaration, the same
>functionality can be used when (if) clang will implement the declare simd
>OpenMP pragma for function definitions.
>4. I have enabled this for any loop that invokes the scalar function
>call, not just for those annotated with "#pragma omp for simd". I
don't
>have any preference here, but at the same time I don't see any reason
why
>this shouldn't be enabled by default for non annotated loops. Let me
know
>if you disagree, I'd happily change the functionality if there are sound
>reasons behind that.
>
>_______________________________________________
>LLVM Developers mailing list
>llvm-dev at lists.llvm.org
>http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Renato Golin via llvm-dev

2016-Dec-12 14:32 UTC

head link

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

)On 12 December 2016 at 13:44, Francesco Petrogalli
<Francesco.Petrogalli at arm.com> wrote:> I am using “D” for 64-bit NEON and “Q” for 128-bit NEON, which makes NEON
> vector symbols look as follows:
>
> _ZVGQN2v__Z1fd
> _ZVGDN2v__Z1ff
> _ZVGQN4v__Z1ff
Hi Francesco,

The ARM AAPCS (A.2.1) says:

"For C++ the mangled name for parameters is as though the equivalent
type name was used."

Clang is already able to mangle NEON vectors of any length
(CXXNameMangler::mangleNeonVectorType), you should use that, as this
is very likely to be compatible with other compilers as well.

cheers,
--renato

Francesco Petrogalli via llvm-dev

2016-Dec-12 16:49 UTC

head link

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Xinmin,

Allow me to share a couple of comments about what Renato is saying.

On 08/12/2016 22:08, "Renato Golin" <renato.golin at linaro.org>
wrote:
>I'm still unsure how the simplistic mangling we have today will work
>around the multiple versions we could have with NEON (and in the
>future, SVE) without polluting the mangling quite a lot (have you seen
>arm_neon.h?).
Reconstructing the vector parameter types from the name mangling works for
fixed-width vector architectures, including NEON.
For SVE, the alternative method I am proposing of using IR types will make
easier the handling of width agnostic
vector function types.

With SVE in we could have multiple version of the same function for
different vector lengths, plus a totally width agnostic version that would
work on any SVE implementation. All these information could be potentially
used 
by the compiler, I see an advantage in having them encoded in IR
structures (FunctionType and VectorType) instead of strings, as is would
make the information directly accessible by other parts of the compiler.
There is a proposal in the ML for extending the IR vector type to support
width agnostic 
vectors. Whatever will be the final shape of such vectors, I suspect it
would be easier to handle multiple width agnostic version of functions by
classifying them with IR types.
>
>So, we may get away with it for now with some basic support and the
>current style, but this should grow into a more flexible scheme.
>
>About the current IR form, I don't particularly like how they're
tied
>up together, but other than having multiple global functions defined
>(something like weak linkage?), I don't have a better idea right now.
I am not sure I understand here. In my patch, all I am doing is “vector
symbol awareness generation”. There are no globals that are generated in
the final object file, it is just the TargetLibraryInfoImpl that is being
populated with the info needed by the vectorizer.

Tian, Xinmin via llvm-dev

2016-Dec-12 17:32 UTC

head link

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Francesco, thanks for updating the patch.

GCC used b, c, d,   you used  Q for ARM 128-bit which seems fine. For D
(64-bit), do you have to use it, or you can find another letter to avoid the
future conflict / confusion if they need D vs. d?  Is GCC community ok with them
for compatibility for ARM?

Thanks,
Xinmin

-----Original Message-----
From: Francesco Petrogalli [mailto:Francesco.Petrogalli at arm.com] 
Sent: Monday, December 12, 2016 5:45 AM
To: Tian, Xinmin <xinmin.tian at intel.com>; Odeh, Saher <saher.odeh at
intel.com>; llvm-dev at lists.llvm.org
Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>; Hal
Finkel <hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>;
a.bataev at hotmail.com; David Majnemer <david.majnemer at gmail.com>;
Renato Golin <renato.golin at linaro.org>
Subject: Re: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
LoopVectorizer

Hi Xinmin,

I have updated the clang patch using the standard name mangling you suggested -
I was not fully aware of the C++ mangling convention “_ZVG”.

I am using “D” for 64-bit NEON and “Q” for 128-bit NEON, which makes NEON vector
symbols look as follows:

_ZVGQN2v__Z1fd
_ZVGDN2v__Z1ff
_ZVGQN4v__Z1ff


Here “Q” means -> NEON 128-bit, “D” means -> NEON 64-bit

Please notice that although I have changed the name mangling in clang [1], there
have been no need to update the relative llvm patch [2], as the vectorisation
process is _independent_ of the name mangling.

Regards,

Francesco

[1] https://reviews.llvm.org/D27250
[2] https://reviews.llvm.org/D27249, The only update was a bug fix in the copy
constructor of the TLII and in the return value of the TLII::mangle() method.
None of the underlying scalar/vector function matching algorithms have been
touched.


On 08/12/2016 18:11, "Tian, Xinmin" <xinmin.tian at intel.com>
wrote:
>Hi Francesco, a bit more information.  GCC veclib is implemented based 
>on GCC VectorABI for declare simd as well.
>
>For name mangling, we have to follow certain rules of C/C++ (e.g. 
>prefix needs to _ZVG ....).  David Majnemer who is the owner and 
>stakeholder for approval for Clang and LLVM.  Also,  we need to pay 
>attention to GCC compatibility.  I would suggest you look into how GCC 
>VectorABI can be extended support your Arch.
>
>Thanks,
>Xinmin
>
>-----Original Message-----
>From: Odeh, Saher
>Sent: Thursday, December 8, 2016 3:49 AM
>To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at
lists.llvm.org;
>Francesco.Petrogalli at arm.com
>Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>;
Hal Finkel
><hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>; 
>a.bataev at hotmail.com
>Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in
the
>LoopVectorizer
>
>Hi Francesco,
>
>As you stated in the RFC, when vectorizing a scalar function (e.g. when 
>using omp declare simd), one needs to incorporate attributes to the 
>resulting vectorized-function.
>These attributes describe a) the behavior of the function, e.g. 
>mask-able or not, and b) the type of the parameters, e.g. scalar or 
>linear or any other option.
>
>As this list is extensive, it is only logical to use an existing 
>infrastructure of ICC and GCC vectorABI which already covers all of 
>these options as stated in Xinmin's RFC 
>[http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html].
>Moreover, when considering other compilers such as GCC, I do see that 
>the resulting assembly actually does incorporate this exact infrastructure.
>So if we wish to link different parts of the program using clang and 
>GCC we'll need to adhere to the same name mangling/ABI. Please see the 
>below result after compiling an omp declare simd function using GCC.
>Lastly, please note the two out of the three components of the 
>implementation have already been committed or submitted, and both are 
>adhering the name mangling proposed by Xinmin's RFC. A) committed - the 
>FE portion by Alexey [https://reviews.llvm.org/rL264853], it generates 
>mangled names in the manner described by Xinmin's RFC, See below B) 
>Submitted - the callee side by Matt [https://reviews.llvm.org/D22792], 
>it uses these mangled names. and C) caller which is covered by this patch.
>
>In order to mitigate the needed effort and possible issues when 
>implementing, I believe it is best to follow the name mangling proposed 
>in Xinmin's RFC. What do you think?
>
>GCC Example
>----------------
>Compiler version: GCC 6.1.0
>Compile line: gcc -c omp.c -fopenmp -Wall -S -o - -O3 > omp.s
>
>omp.c
>#include <omp.h>
>
>#pragma omp declare simd
>int dowork(int* a, int idx)
>{
> return a[idx] * a[idx]*7;
>}
>
>less omp.s | grep @function
>        .type   dowork, @function
>        .type   _ZGVbN4vv_dowork, @function
>        .type   _ZGVbM4vv_dowork, @function
>        .type   _ZGVcN4vv_dowork, @function
>        .type   _ZGVcM4vv_dowork, @function
>        .type   _ZGVdN8vv_dowork, @function
>        .type   _ZGVdM8vv_dowork, @function
>        .type   _ZGVeN16vv_dowork, @function
>        .type   _ZGVeM16vv_dowork, @function
>
>Clang on FE using Alexey's patch
>---------------------------------------
>Compile line: clang -c tst/omp_fun.c -fopenmp -mllvm -print-after-all 
>>& out
>
>#pragma omp declare simd
>extern int dowork(int* a, int idx)
>{
>  return a[idx]*7;
>}
>
>
>int main() {
>  dowork(0,1);
>}
>
>attributes #0 = { nounwind uwtable "_ZGVbM4vv_dowork"
"_ZGVbN4vv_dowork"
>"_ZGVcM8vv_dowork" "_ZGVcN8vv_dowork"
"_ZGVdM8vv_dowork"
>"_ZGVdN8vv_dowork" "_ZGVeM16vv_dowork"
"_ZGVeN16vv_dowork"
>"correctly-rounded-divide-sqrt-fp-math"="false"
>"disable-tail-calls"="false"
"less-precise-fpmad"="false"
>"no-frame-pointer-elim"="true"
"no-frame-pointer-elim-non-leaf"
>"no-infs-fp-math"="false"
"no-jump-tables"="false"
>"no-nans-fp-math"="false"
"no-signed-zeros-fp-math"="false"
>"no-trapping-math"="false"
"stack-protector-buffer-size"="8"
>"target-cpu"="x86-64"
"target-features"="+fxsr,+mmx,+sse,+sse2,+x87"
>"unsafe-fp-math"="false"
"use-soft-float"="false" }
>
>
>Thanks Saher
>
>-----Original Message-----
>From: Francesco Petrogalli [mailto:Francesco.Petrogalli at arm.com]
>Sent: Tuesday, December 06, 2016 17:22
>To: Tian, Xinmin <xinmin.tian at intel.com>; llvm-dev at
lists.llvm.org
>Cc: nd <nd at arm.com>; Masten, Matt <matt.masten at intel.com>;
Hal Finkel
><hfinkel at anl.gov>; Zaks, Ayal <ayal.zaks at intel.com>; 
>a.bataev at hotmail.com
>Subject: Re: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in
the
>LoopVectorizer
>
>Hi Xinmin,
>
>Thank you for your email.
>
>I have been catching up with the content of your proposal, and I have 
>some questions/remarks below that I'd like to discuss with you - see 
>the final section in the proposal.
>
>I have specifically added Alexey B. to the mail so we can move our 
>conversation from phabricator to the mailing list.
>
>Before we start, I just want to mention that the initial idea of using 
>llvm::FunctionType for vector function generation and matching has been 
>proposed by a colleague, Paul Walker, when we first tried out 
>supporting this on AArch64 on an internal version of llvm. I received 
>some input also from Amara Emerson.
>
>In our case we had a slightly different problem to solve: we wanted to 
>support in the vectorizer a rich set of vector math routines provided 
>with an external library. We managed to do this by adding the pragma to 
>the (scalar) function declaration of the header file provided with the 
>library, and as shown by the patches I have submitted, by generating 
>vector function signatures that the vectorizer can search in the 
>TargetLibraryInfo.
>
>Here is an updated version of the proposal. Please let me know what you 
>think, and if you have any solution we could use for the final section.
>
># RFC for "pragma omp declare simd"
>
>Hight level components:
>
>A) Global variable generator (clang FE)
>B) Parameter descriptors (as new enumerations in llvm::Attribute)
>C) TLII methods and fields for the multimap (llvm middle-end)
>
>## Workflow
>
>Example user input, with a declaration and definition:
>
>    #pragma omp declare simd
>    #pragma omp declare simd uniform(y)
>    extern double pow(double x, double y);
>
>    #pragma omp declare simd
>    #pragma omp declare simd linear(x:2)
>    float foo(float x) {....}
>
>    /// code using both functions
>
>### Step 1
>
>
>The compiler FE process these definition and declaration and generates 
>a list of globals as follows:
>
>    @prefix_vector_pow1_midfix_pow_postfix = external global
>                                             <4 x double>(<4 x
double>,
>                                                          <4 x
double>)
>    @prefix_vector_pow2_midfix_pow_postfix = external global
>                                             <4 x double>(<4 x
double>,
>                                                          double)
>    @prefix_vector_foo1_midfix_foo_postfix = external global
>                                             <8 x float>(<8 x
float>,
>                                                         <8 x float>)
>    @prefix_vector_foo1_midfix_foo_postfix = external global
>                                             <8 x float>(<8 x
float>,
>                                                         <8 x float>
#0)
>    ...
>    attribute #0  = {linear = 2}
>
>
>Notes about step 1:
>
>1. The mapping scalar name <-> vector name is in the
>   prefix/midfix/postfix mangled name of the global variable.
>2. The examples shows only a set of possible vector function for a
>   sizeof(<4 x double>) vector extension. If multiple vector extension
>   live in the same target (eg. NEON 64-bit or NEON 128-bit, or SSE
>   and AVX512) the front end takes care to generate each of the
>   associated functions (like it is done now).
>3. Vector function parameters are rendered using the same
>   Characteristic Data Type (CDT) rule already in the compiler FE.
>4. Uniform parameters are rendered with the original scalar type.
>5. Linear parameters are rendered with vectors using the same
>   CDT-generated vector length, and decorated with proper
>   attributes. I think we could extent the llvm::Attribute enumeration 
>adding the following:
>   - linear : numeric, specify_the step
>   - linear_var : numeric, specify the position of the uniform variable 
>holding the step
>   - linear_uval[_var]: numeric as before, but for the "uval"
modifier
>(both constant step or variable step)
>   - linear_val[_var]: numeric, as before, but for "val" modifier
>   - linear_ref[_var] numeric, for "ref" modifier.
>
>   For example, "attribute #0 = {linear = 2}" says that the vector
of
>   the associated parameter in the function signature has a linear
>   step of 2.
>
>### Step 2
>
>The compiler FE invokes a TLII method in BackendUtils.cpp that populate 
>a multimap in the TLII by checking the globals created in the previous step.
>
>Each global is processed, demangling the [pre/mid/post]fix name and 
>generate a mapping in the TLII as follows:
>
>    struct VectorFnInfo {
>       std::string Name;
>       FunctionType *Signature;
>    };
>    std::multimap<std:string, VectorFnInfo> VFInfo;
>
>
>For the initial example, the multimap in the TLI is populated as follows:
>
>    "pow" -> [(vector_pow1, <4 x double>(<4 x
double>, <4 x double>)),
>              (vector_pow2, <4 x double>(<4 x double>, double))]
>
>    "foo" -> [(vector_foo1, <8 x float>(<8 x
float>, <8 x float>)),
>              (vector_foo2, <8 x float>(<8 x float>, <8 x
float> #0))]
>
>Notes about step 2:
>
>Given the fact that the external globals that the FE have generated are 
>removed _before_ the vectorizer kicks in, I am not sure if the 
>"attribute #0" needed for one of the parameter is still present at
this
>point. IF NOT, I think that in this case we could enrich the 
>"VectorFnInfo" as
>follows:
>
>    struct VectorFnInfo {
>       std::string Name;
>       FunctionType *Signature;
>       std::set<unsigned, llvm:Attribute> Attrs;
>    };
>
>The field "Attrs" maps the position of the parameter with the 
>correspondent llvm::Attribute present in the global variable.
>
>I have added this note for the sake of completeness. I *think* that we 
>won't be needing this additional Attrs field: I have already shown in 
>the llvm patch I submitted that the function type "survives" after
the
>global gets removed, I don't see why the parameter attribute
shouldn't
>survive too (last famous words?).
>
>### Step 3
>
>This step happens in the LoopVectorizer. The InnerLoopVectorizer 
>queries the TargetLibraryInfo looking for a vectorized version of the 
>function by scalar name and function signature with the following method:
>
>    TargetLibraryInfo::isFunctionVectorizable(std::string ScalarName, 
>FuncionType *FTy);
>
>This is done in a way similar to what my current llvm patch does: the 
>loop vectorizer makes up the function signature it needs and look for 
>it in the TLI. If a match is found, vectorization is possible. Right 
>now the compiler is not aware of uniform/linear function attributes, 
>but it still can refer to them in a target agnostic way, by using 
>scalar signatures for the uniform ones and using llvm::Attributes for the
linear ones.
>
>Notice that the vector name here is not used at all, which is good as 
>any architecture can come up with it's own name mangling for vector 
>functions, without breaking the ability of the vectorizer to vectorize 
>the same code with the new name mangling.
>
>## External libraries vs user provided code
>
>The example with "pow" and "foo" I have provided before
shows a
>function declaration and a function definition. Although the TLII 
>mechanism I have described seems to be valid only for the former case, 
>I think that it is valid also for the latter.  In fact, in case of a 
>function definition, the compiler would have to generate also the body 
>of the vector function, but that external global variable could still 
>be used to inform the TLII of such function. The fact that the vector 
>function needed by the vectorizer is in some module instead of in an 
>external library doesn't seems to make all that difference at compile
time to me.
>
># Some final notes (call for ideas!)
>
>There is one level of target dependence that I still have to sort out, 
>and for this I need input from the community and in particular from the 
>Intel folks.
>
>I will start with this example:
>
>    #pragma omp declare simd
>    float foo(float x);
>
>In case of NEON, this would generate 2 globals, one for vectors holding 
>2 floats, and one for vector holding 4 floats, corresponding to NEON 
>64-bit and 128-bit respectively. This means that the vectorizer have a 
>unique function it could choose from the list the TLI provides.
>
>This is not the same on Intel, for example when this code generates 
>vector names for AVX and AVX2. The register width for these 
>architecture extensions are the same, so all the TLI has is a mapping 
>between scalar name and (vectro_name, function_type) who's two elements 
>differ only in the vector_name string.
>
>This breaks the target independence of the vectorizer, as it would 
>require it to parse the vector_name to be able to choose between the 
>AVX or the AVX2 implementation.
>
>Now, to make this work one should have to encode the SSE/SSE2/AVX/AVX2 
>information in the VectorFnInfo structure. Does anybody have an idea on 
>how best to do it? For the sake of keeping the vectorizer target 
>independent, I would like to avoid encoding this piece of information 
>in the VectorFnInfo struct. I have seen that in your code you are 
>generating
>SSE/AVX/AVX2/AVX512 vector functions, how do you plan to choose between 
>them in the vectorizer? I could not find how you planned to solve this 
>problem in your proposal, or have I just missed it?
>
>Is there a way to do this in the TLII? The function type of the vector 
>function could use the "target-feature" attribute of function 
>definitions, but how coudl the vectorizer decide which one to use?
>
>Anyway, that's it. Your feedback will be much appreciated.
>
>Cheers,
>Francesco
>
>________________________________________
>From: Tian, Xinmin <xinmin.tian at intel.com>
>Sent: 30 November 2016 17:16:12
>To: Francesco Petrogalli; llvm-dev at lists.llvm.org
>Cc: nd; Masten, Matt; Hal Finkel; Zaks, Ayal
>Subject: RE: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in
the
>LoopVectorizer
>
>Hi Francesco,
>
>Good to know, you are working on the support for this feature. I assume 
>you knew the RFC below.  The VectorABI mangling we proposed were 
>approved by C++ Clang FE name mangling owner David M from Google,  the 
>ClangFE support was committed in its main trunk by Alexey.
>
>"Proposal for function vectorization and loop vectorization with 
>function calls", March 2, 2016. Intel Corp.
>http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html.
>
>Matt submitted patch to generate vector variants for function 
>definitions, not just function declarations. You may want to take a look.
> Ayal's RFC will be also needed to support vectorization of function 
>body in general.
>
>I agreed, we should have an option -fopenmp-simd to enable SIMD only, 
>both GCC and ICC have similar options.
>
>I would suggest we shall sync-up on these work, so we don't duplicate 
>the effort.
>
>Thanks,
>Xinmin
>
>-----Original Message-----
>From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of 
>Francesco Petrogalli via llvm-dev
>Sent: Wednesday, November 30, 2016 7:11 AM
>To: llvm-dev at lists.llvm.org
>Cc: nd <nd at arm.com>
>Subject: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
>LoopVectorizer
>
>Dear all,
>
>I have just created a couple of differential reviews to enable the 
>vectorisation of loops that have function calls to routines marked with 
>"#pragma omp declare simd".
>
>They can be (re)viewed here:
>
>* https://reviews.llvm.org/D27249
>
>* https://reviews.llvm.org/D27250
>
>The current implementation allows the loop vectorizer to generate 
>vector code for source file as:
>
>  #pragma omp declare simd
>  double f(double x);
>
>  void aaa(double *x, double *y, int N) {
>    for (int i = 0; i < N; ++i) {
>      x[i] = f(y[i]);
>    }
>  }
>
>
>by invoking clang with arguments:
>
>  $> clang -fopenmp -c -O3 file.c [...]
>
>
>Such functionality should provide a nice interface for vector libraries 
>developers that can be used to inform the loop vectorizer of the 
>availability of an external library with the vector implementation of 
>the scalar functions in the loops. For this, all is needed to do is to 
>mark with "#pragma omp declare simd" the function declaration in
the
>header file of the library and generate the associated symbols in the 
>object file of the library according to the name scheme of the vector 
>ABI (see notes below).
>
>I am interested in any feedback/suggestion/review the community might 
>have regarding this behaviour.
>
>Below you find a description of the implementation and some notes.
>
>Thanks,
>
>Francesco
>
>-----------
>
>The functionality is implemented as follow:
>
>1. Clang CodeGen generates a set of global external variables for each 
>of the function declarations marked with the OpenMP pragma. Each of 
>such globals are named according a mangling that is generated by 
>llvm::TargetLibraryInfoImpl (TLII), and holds the vector signature of 
>the associated vector function. (See examples in the tests of the clang
patch.
>Each scalar function can generate multiple vector functions depending 
>on the clauses of the declare simd directives) 2. When clang created 
>the TLII, it processes the llvm::Module and finds out which of the 
>globals of the module have the correct mangling and type so that they 
>be added to the TLII as a list of vector function that can be 
>associated to the original scalar one.
>3. The LoopVectorizer looks for the available vector functions through 
>the TLII not by scalar name and vectorisation factor but by scalar name 
>and vector function signature, thus enabling the vectorizer to be able 
>to distinguish a "vector vpow1(vector x, vector y)" from a
"vector
>vpow2(vector x, scalar y)". (The second one corresponds to a
"declare
>simd uniform(y)" for a "scalar pow(scalar x, scalar y)"
declaration).
>(Notice that the changes in the loop vectorizer are minimal.)
>
>
>Notes:
>
>1. To enable SIMD only for OpenMP, leaving all the multithread/target 
>behaviour behind, we should enable this also with a new option:
>-fopenmp-simd
>2. The AArch64 vector ABI in the code is essentially the same as for 
>the Intel one (apart from the prefix and the masking argument), and it 
>is based on the clauses associated to "declare simd" in OpenMP
4.0. For
>OpenMP4.5, the parameters section of the mangled name should be updated.
>This update will not change the vectorizer behaviour as all the 
>vectorizer needs to detect a vectorizable function is the original 
>scalar name and a compatible vector function signature. Of course, any 
>changes/updates in the ABI will have to be reflected in the symbols of 
>the binary file of the library.
>3. Whistle this is working only for function declaration, the same 
>functionality can be used when (if) clang will implement the declare 
>simd OpenMP pragma for function definitions.
>4. I have enabled this for any loop that invokes the scalar function 
>call, not just for those annotated with "#pragma omp for simd". I
don't
>have any preference here, but at the same time I don't see any reason 
>why this shouldn't be enabled by default for non annotated loops. Let 
>me know if you disagree, I'd happily change the functionality if there 
>are sound reasons behind that.
>
>_______________________________________________
>LLVM Developers mailing list
>llvm-dev at lists.llvm.org
>http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

Tian, Xinmin via llvm-dev

2016-Dec-12 18:45 UTC

head link

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Thanks Renato.  Per the latest email from Francesco, it seems the current
mangling mechanism works for ARM as well, except we need to use different arch
"letter" for Neon-64-bit and Neon-128-bit.

Cheers
Xinmin



-----Original Message-----
From: Renato Golin [mailto:renato.golin at linaro.org] 
Sent: Thursday, December 8, 2016 2:09 PM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Odeh, Saher <saher.odeh at intel.com>; llvm-dev at lists.llvm.org;
Francesco.Petrogalli at arm.com; a.bataev at hotmail.com; Masten, Matt
<matt.masten at intel.com>; nd <nd at arm.com>
Subject: Re: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the
LoopVectorizer

On 8 December 2016 at 18:11, Tian, Xinmin via llvm-dev <llvm-dev at
lists.llvm.org> wrote:> For name mangling, we have to follow certain rules of C/C++ (e.g. prefix
needs to _ZVG ....).  David Majnemer who is the owner and stakeholder for
approval for Clang and LLVM.  Also,  we need to pay attention to GCC
compatibility.  I would suggest you look into how GCC VectorABI can be extended
support your Arch.
Hi Xinmin,

I only began to review this proposal, and like yours, I think this is a really
important feature to get in.

I agree with you on the name mangling need for C++, as well as compatibility
with GCC, but according to Francesco, there are some problems that those two
alone don't solve.

I'm still unsure how the simplistic mangling we have today will work around
the multiple versions we could have with NEON (and in the future, SVE) without
polluting the mangling quite a lot (have you seen arm_neon.h?).

So, we may get away with it for now with some basic support and the current
style, but this should grow into a more flexible scheme.

About the current IR form, I don't particularly like how they're tied up
together, but other than having multiple global functions defined (something
like weak linkage?), I don't have a better idea right now.

Francesco,

Maybe the best thing to do right now would be to try and fit NEON alternatives
in this mangling scheme and see how it goes. If anything, it'll give us an
idea on what's broken, and hopefully, how to fix it.

cheers,
--renato

Apparently Analagous Threads

Search for more possibly parallel threads

llvm dev - Dec 2016 - [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

[llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Apparently Analagous Threads