Saito, Hideki via llvm-dev
2018-Jun-29 02:11 UTC
[llvm-dev] [RFC][VECLIB] how should we legalize VECLIB calls?
Illustrative Example:
clang -fveclib=SVML -O3 svml.c -mavx
#include <math.h>
void foo(double *a, int N){
int i;
#pragma clang loop vectorize_width(8)
for (i=0;i<N;i++){
a[i] = sin(i);
}
}
Currently, this results in a call to <8 x double> __svml_sin8(<8 x double>)
after the vectorizer.
This is the 8-element SVML sin() called with an 8-element argument. On the surface,
this looks very good.
Later on, standard vector type legalization kicks in, but only the argument and
return data are legalized.
vmovaps %ymm0, %ymm1
vcvtdq2pd %xmm1, %ymm0
vextractf128 $1, %ymm1, %xmm1
vcvtdq2pd %xmm1, %ymm1
callq __svml_sin8
vmovups %ymm1, 32(%r15,%r12,8)
vmovups %ymm0, (%r15,%r12,8)
Unfortunately, __svml_sin8() doesn't use this form of input/output. It takes
its argument in zmm0 and returns its result in zmm0,
i.e., it is not legal to use for AVX.
What we need to see instead is two calls to __svml_sin4(), like below.
vmovaps %ymm0, %ymm1
vcvtdq2pd %xmm1, %ymm0
vextractf128 $1, %ymm1, %xmm1
vcvtdq2pd %xmm1, %ymm1
callq __svml_sin4
vmovups %ymm0, 32(%r15,%r12,8)
vmovups %ymm1, %ymm0
callq __svml_sin4
vmovups %ymm0, (%r15,%r12,8)
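The same split can be sketched at the source level. This is only an illustration of the data flow, not the actual lowering: `vec4_fn`, `square4_placeholder`, and `call8_as_two_4` are hypothetical names, and the placeholder kernel computes x*x per lane so that the sketch needs no math library.

```c
#include <assert.h>
#include <stddef.h>

/* Signature of a 4-wide vector routine: reads 4 doubles, writes 4 doubles. */
typedef void (*vec4_fn)(const double *in, double *out);

/* Hypothetical placeholder kernel (x*x per lane), standing in for a 4-wide
   library routine so that this sketch has no library dependency. */
static void square4_placeholder(const double *in, double *out) {
    for (size_t k = 0; k < 4; ++k)
        out[k] = in[k] * in[k];
}

/* Split one logical 8-wide call into two 4-wide calls: low half, then high. */
static void call8_as_two_4(vec4_fn f, const double *in, double *out) {
    assert(f != NULL);
    f(in, out);         /* lanes 0..3 */
    f(in + 4, out + 4); /* lanes 4..7 */
}
```

Each half is passed and returned at a width the 4-wide routine expects, which is the property the single 8-wide call lacks on AVX.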
What would be the most acceptable way to make this happen? Has anybody had a
similar need previously?
The easiest workaround is to serialize the call whenever the vectorization
factor is above the "type legal" one. This can be done with a few lines of code,
plus the code to recognize that the call is "SVML" (which is currently a
string match against the "__svml" prefix in my local workspace).
If a higher VF is not forced, the cost model will likely favor a lower VF. This is
functionally correct, but obviously not an ideal solution.
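A minimal sketch of that workaround's decision logic, written as plain C rather than as LLVM code. The helper names `is_svml_name` and `should_serialize` are hypothetical, and `max_legal_vf` is an assumed input that the target would supply (4 for doubles in 256-bit AVX registers).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Prefix check described in the text: treat a callee as SVML if its name
   starts with "__svml". */
static bool is_svml_name(const char *name) {
    assert(name != NULL);
    return strncmp(name, "__svml", 6) == 0;
}

/* Workaround decision: serialize an SVML call whose VF exceeds the widest
   type-legal VF. max_legal_vf is an assumed, target-provided value. */
static bool should_serialize(const char *callee, unsigned vf,
                             unsigned max_legal_vf) {
    return is_svml_name(callee) && vf > max_legal_vf;
}
```

With these assumptions, `should_serialize("__svml_sin8", 8, 4)` is true, while the 4-wide call is left alone.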
Here are a few ideas I thought about:
1) Standard LegalizeVectorType() in CodeGen/SelectionDAG doesn't seem
to work. We could define a generic ISD::VECLIB
and try to split it into two or more VECLIB nodes, but at that point we have
lost the information about which function to call.
We can't define an ISD opcode per function. There would be too many libm entries
to deal with. We need a scalable solution.
2) We could write an IR-to-IR pass to perform IR-level legalization. This
essentially duplicates the functionality of LegalizeVectorType(),
but we could make it available for other similar things that can't use ISD-level
vector type legalization. It looks attractive enough
from that perspective.
3) We have implemented something similar to 2), but the legalization code is
specialized for SVML. This was much quicker than
trying to generalize the legalization scheme, but I'd imagine the community
won't like it.
4) The vectorizer emits legalized VECLIB calls. Since it can already emit
instructions in scalarized form, adding legalized call functionality is in some
sense similar to that. The vectorizer can't simply pair a type-legal function
name with an illegal vector type, since LegalizeVectorType() would still
end up using one call instead of two.
Anything else?
Also, doing any of this requires a reverse mapping from the VECLIB name to the
scalar function name. What's the recommended way to do so?
Can we use TableGen to create a reverse map?
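To make the question concrete, here is a rough sketch of what such a map could look like as a plain C table with lookups in both directions. The struct and function names are hypothetical, the table holds only the two entry points named in this message, and a real table would presumably be generated rather than written by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct veclib_entry {
    const char *vector_name; /* vector library entry point */
    const char *scalar_name; /* scalar function it implements */
    unsigned vf;             /* number of lanes */
};

/* Illustrative table: only the two names used in this message. */
static const struct veclib_entry veclib_table[] = {
    {"__svml_sin4", "sin", 4},
    {"__svml_sin8", "sin", 8},
};

#define VECLIB_TABLE_SIZE (sizeof(veclib_table) / sizeof(veclib_table[0]))

/* Reverse lookup: vector name -> table entry, or NULL if unknown. */
static const struct veclib_entry *lookup_by_vector_name(const char *name) {
    assert(name != NULL);
    for (size_t i = 0; i < VECLIB_TABLE_SIZE; ++i)
        if (strcmp(veclib_table[i].vector_name, name) == 0)
            return &veclib_table[i];
    return NULL;
}

/* Forward lookup: scalar name + VF -> vector name, or NULL if none. */
static const char *lookup_vector_name(const char *scalar, unsigned vf) {
    assert(scalar != NULL);
    for (size_t i = 0; i < VECLIB_TABLE_SIZE; ++i)
        if (veclib_table[i].vf == vf &&
            strcmp(veclib_table[i].scalar_name, scalar) == 0)
            return veclib_table[i].vector_name;
    return NULL;
}
```

Legalizing an 8-wide call would then be a reverse lookup to recover "sin", followed by a forward lookup at the legal VF.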
Your input is greatly appreciated. Is there a real need/desire for 2) outside of
VECLIB (or outside of SVML)?
Thanks,
Hideki Saito
Intel Corporation
Nema, Ashutosh via llvm-dev
2018-Jun-29 06:36 UTC
[llvm-dev] [RFC][VECLIB] how should we legalize VECLIB calls?
Hi Saito,
At AMD we have our own vector library and faced similar problems. We
followed the SVML path and generated the respective vector calls from the
vectorizer. Once the vectorizer has generated those calls, e.g. __svml_sin_4 or
__amdlibm_sin_4, a later pass can only use string matching to identify the
vector library call. I'm not sure that is the proper way. Maybe instead of
generating the library-specific calls, it is better to generate some standard
call (maybe an intrinsic) and lower it later. A late IR pass can be introduced
to perform the lowering; it would lower the intrinsic calls to specific library
calls (__svml_sin_4 or __amdlibm_sin_4 or ...). This can be table-driven,
deciding the action based on the vector library, function name, VF, and target
information. The action can be full-serialize, partial-serialize (VF8 to
2 x VF4), or generating the library call with the same VF.
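The three actions could be chosen roughly as follows. This is a simplified sketch, not an existing LLVM interface: `plan_call` and its types are hypothetical, `max_legal_vf` stands for the widest VF the target can pass in one register, and `min_lib_vf` assumes the library provides every power-of-two width from that value upward.

```c
#include <assert.h>

enum veclib_action { KEEP_SAME_VF, PARTIAL_SERIALIZE, FULL_SERIALIZE };

struct veclib_plan {
    enum veclib_action action;
    unsigned call_vf;   /* VF of each emitted call (1 means scalar) */
    unsigned num_calls; /* how many calls replace the original one */
};

/* Decide how to emit a call with vectorization factor vf.
   max_legal_vf and min_lib_vf are assumed inputs (see the lead-in). */
static struct veclib_plan plan_call(unsigned vf, unsigned max_legal_vf,
                                    unsigned min_lib_vf) {
    assert(vf > 0 && max_legal_vf > 0 && min_lib_vf > 0);
    struct veclib_plan p;
    if (vf <= max_legal_vf && vf >= min_lib_vf) {
        /* Legal as is: one library call at the requested VF. */
        p.action = KEEP_SAME_VF;
        p.call_vf = vf;
        p.num_calls = 1;
    } else if (vf > max_legal_vf && max_legal_vf >= min_lib_vf &&
               vf % max_legal_vf == 0) {
        /* Too wide: split into several calls at the widest legal VF. */
        p.action = PARTIAL_SERIALIZE;
        p.call_vf = max_legal_vf;
        p.num_calls = vf / max_legal_vf;
    } else {
        /* No usable vector variant: one scalar call per lane. */
        p.action = FULL_SERIALIZE;
        p.call_vf = 1;
        p.num_calls = vf;
    }
    return p;
}
```

For the VF8-on-AVX case, `plan_call(8, 4, 2)` yields a partial serialization into two VF4 calls.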
Thanks,
Ashutosh
Saito, Hideki via llvm-dev
2018-Jun-29 20:15 UTC
[llvm-dev] [RFC][VECLIB] how should we legalize VECLIB calls?
Ashutosh,
Thanks for the reply.
A related earlier discussion of this topic appears in the review of the SVML
patch (@mmasten). Adding a few names from there.
https://reviews.llvm.org/D19544
There, I see Hal's review comment "let's start only with the
directly-legal calls". Apparently, what we have right now
in the trunk is "not legal enough". I'll work on a patch to stop the
bleeding while we continue to discuss the legalization topic.
I suppose
1) An LV-only solution (let LV emit already-legalized VECLIB calls) is
certainly not scalable. It won't help if VECLIB calls
are generated elsewhere. Also, keeping the VF low enough to prevent the
legalization problem is only a workaround,
not a solution.
2) Assuming that we have to go the IR-to-IR pass route, there are 3 ways to
think about it:
a. Go with a very generic IR-to-IR legalization pass comparable to ISD-level
legalization. This is the most general,
but I'd think it has the highest development cost.
b. Go with intrinsic-only legalization and then apply VECLIB afterwards.
This requires all scalar functions
with a VECLIB mapping to be added as intrinsics.
c. Go with generic enough function call legalization, with the ability to
add custom legalization for each VECLIB
(and, if needed, for each VECLIB or non-VECLIB entry).
I think the costs of 2.b) and 2.c) are similar, and 2.c) seems to be more
flexible. So, I guess we don't really have to tie this
discussion to "letting LV emit widened math call instead of VECLIB",
even though I strongly favor that over LV emitting
VECLIB calls.
@Davide, in D19544, @spatel thought LibCallSimplifier has relevance to this
legalization topic. Do you know enough about
LibCallSimplifier to tell whether it can be extended to deal with 2.b) or 2.c)?
If we think 2.b)/2.c) are reasonable directions, I can clean up what we have
and upload it to Phabricator as a starting point
to get to 2.b)/2.c).
I'll continue waiting for more feedback. I guess I shouldn't expect a lot this
week and next due to the big holiday in the U.S.
Thanks,
Hideki