Wang, Pengfei via llvm-dev
2021-Mar-17 10:11 UTC
[llvm-dev] Does a middle-end pass need to consider special types when doing optimization? Or should the back-end revert the optimization accordingly?
Hi,

We are developing prototypes for the Intel Advanced Matrix Extensions (AMX) [1] programming model in Clang and LLVM [2]. We have met several cases where the type we added gets optimized unexpectedly in the middle-end, e.g. optimizing phi + bitcast + load:

From

    %a = load <256 x i32>, <256 x i32>* %mem, align 64
    ...
    %b = phi <256 x i32> [ %a, %label1 ], [ %someother, %label2 ]
    %c = bitcast <256 x i32> %b to x86_amx

To

    %a = bitcast <256 x i32>* %mem to x86_amx*
    %b = load x86_amx, x86_amx* %a, align 64
    ...
    %c = phi x86_amx [ %b, %label1 ], [ %someother, %label2 ]

To prevent such unexpected transforms, we added a type check at each of the optimization points. Roman pointed out that these changes are not the right direction [3] and considers it a backend bug. While we agree the backend might be able to handle it for correctness, we think it is better to handle this in the middle-end, since these are negative optimizations for AMX.

First, let me put some background here:

1. x86_amx* is different from trivial pointers.

   The AMX load instruction is quite different from other load instructions: it needs not only the memory address but also the shape / stride of the tile register. We did some extra work in the backend to deduce the shape information from the context. We don't want passes to add new x86_amx-related uses, because that makes the deduction harder. That is to say, bitcasting other pointer types to x86_amx* is not as trivial as assumed here. (A sketch of the tile-load intrinsic follows this message.)

2. The physical tile registers have more limitations:

   * There is no copy instruction between tile registers.
   * Spilling / reloading a tile register is expensive, given its size of 1024 bytes.
   * The shapes of tile registers need to be configured before use, and all data in tile registers becomes invalid once they are re-configured. That is to say, one configure instruction should dominate as many tile registers as possible so that their shapes are set together; otherwise we need to spill and reload the live registers each time we re-configure.
   * The number of tile registers is rather small (only 8), and registers of different shapes cannot be reused.

   Based on these limitations, we need to reduce the uses / live ranges of tile registers, but optimizations may increase them. So even if we can handle some combined operations on the AMX type, we still prefer to prevent them from the beginning, unless we can completely roll back the optimization, which is also not a good solution in my opinion.

3. For more information, please refer to the discussion in [3]. For other optimization points, please refer to [4][5].

I think the main controversy from Roman is whether a middle-end pass should consider special types when doing optimization. I tend to let the middle-end do the type check on account of the peculiarity of the AMX type, but I'm not sure whether there is precedent for handling a similar issue on other targets. I'm open and glad to do it either way, as long as we have an elegant solution. Any suggestions are welcome.

[1] https://software.intel.com/content/www/us/en/develop/articles/intel-sdm.html#architecture
[2] https://lists.llvm.org/pipermail/llvm-dev/2020-November/146770.html
[3] https://reviews.llvm.org/D98247
[4] https://reviews.llvm.org/D98595
[5] https://reviews.llvm.org/D98757

Thanks
Pengfei
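For readers unfamiliar with the programming model, a minimal sketch of what point 1 refers to: in the prototype, a tile load is an intrinsic that carries shape and stride operands, so every x86_amx load must have a deducible shape. The concrete shape values below are illustrative assumptions, not taken from this thread:

    ; the tile load takes (rows, bytes per row, pointer, stride) and returns
    ; an x86_amx value; a plain "load x86_amx" carries none of this, so the
    ; backend must deduce a shape for it from the surrounding code
    declare x86_amx @llvm.x86.tileloadd64.internal(i16, i16, i8*, i64)

    define void @tile_load(i8* %buf, i64 %stride) {
      ; load a 16-row x 64-byte tile (shape chosen for illustration)
      %t = call x86_amx @llvm.x86.tileloadd64.internal(i16 16, i16 64, i8* %buf, i64 %stride)
      ret void
    }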
Zhang, Xiang1 via llvm-dev
2021-Mar-18 00:56 UTC
[llvm-dev] Does a middle-end pass need to consider special types when doing optimization? Or should the back-end revert the optimization accordingly?
@lebedev.ri at gmail.com

Currently we see that "if (Ty.isVectorTy()) {...}" makes sense in the mid-end. Why can't "if (Ty.isX86_AMXTy()) {...}" make sense too? Just because more targets support the vector type while fewer targets (only x86) support the AMX type? That logic does not make sense.

-xiang

From: Wang, Pengfei <pengfei.wang at intel.com>
Sent: Wednesday, March 17, 2021 6:11 PM
To: llvm-dev <llvm-dev at lists.llvm.org>
Cc: lebedev.ri at gmail.com; Luo, Yuanke <yuanke.luo at intel.com>; Zhang, Xiang1 <xiang1.zhang at intel.com>
Subject: Does a middle-end pass need to consider special types when doing optimization? Or should the back-end revert the optimization accordingly?

[...]
Nuno Lopes via llvm-dev
2021-Mar-18 09:54 UTC
[llvm-dev] Does a middle-end pass need to consider special types when doing optimization? Or should the back-end revert the optimization accordingly?
I don't know anything about AMX, but let me give you some pointers (no pun intended).

Regarding pointers, the direction LLVM is taking is to have just 2 pointer types: a data pointer type and a function pointer type. That's it. That allows us to remove a lot of bitcasts between pointers. You'll notice that load instructions now carry the loaded type as an argument; for now this is redundant with the pointer type, but it won't be once pointer types disappear.

So if you need a special pointer type that can't be cast to other pointer types, the way to do it in LLVM is with a different address space. You can then configure how many bits it takes, etc. More importantly, pointers in that space can't be cast to another space without a special instruction, which LLVM optimizers won't introduce (see the sketch after this message).

FYI, by using a different address space you may lose a few optimizations, because optimizers assume nothing about non-default address spaces. We have discussed an API to let folks express assumptions optimizers could make (e.g., is null == (void*)0?), but nothing has been implemented so far.

Nuno

From: Wang, Pengfei
Sent: 17 March 2021 10:11
To: llvm-dev <llvm-dev at lists.llvm.org>
Cc: Luo, Yuanke <yuanke.luo at intel.com>
Subject: [llvm-dev] Does a middle-end pass need to consider special types when doing optimization? Or should the back-end revert the optimization accordingly?

[...]
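A minimal sketch of the address-space approach Nuno describes, assuming a hypothetical address space 7 set aside for AMX tile data (the number and the datalayout entry are illustrative choices, not an established convention):

    ; hypothetical layout string: p7:64:64 gives addrspace(7) pointers a
    ; 64-bit size and ABI alignment
    target datalayout = "e-m:e-p7:64:64-i64:64-n8:16:32:64-S128"

    define <256 x i32> @load_tile(<256 x i32> addrspace(7)* %mem) {
      ; bitcast cannot change the address space of a pointer; crossing into
      ; or out of addrspace(7) requires an explicit addrspacecast, which
      ; optimizers will not introduce on their own
      %v = load <256 x i32>, <256 x i32> addrspace(7)* %mem, align 64
      ret <256 x i32> %v
    }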
Florian Hahn via llvm-dev
2021-Mar-18 10:03 UTC
[llvm-dev] Does a middle-end pass need to consider special types when doing optimization? Or should the back-end revert the optimization accordingly?
> On Mar 17, 2021, at 10:11, Wang, Pengfei via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> [...]
>
> 1. x86_amx* is different from trivial pointers.
> The AMX load instruction is quite different from other load instructions: it needs not only the memory address but also the shape / stride of the tile register. We did some extra work in the backend to deduce the shape information from the context. We don't want passes to add new x86_amx-related uses, because that makes the deduction harder. That is to say, bitcasting other pointer types to x86_amx* is not as trivial as assumed here.

The problem appears to be that this difference is not modeled or specified in LLVM IR, AFAICT. The current LangRef does not appear to specify `x86_amx` to start with. If pointers to `x86_amx` have different semantics than regular LLVM pointer types, then using regular LLVM pointer types for pointers to `x86_amx` may not be appropriate. I've not followed the previous AMX discussions closely, but it sounds like it may be good to reconsider how x86_amx pointers are modeled in LLVM IR.

Also note that `bitcast` is specified as a no-op (https://llvm.org/docs/LangRef.html#id293), except for pointers in different address spaces, but from what you mentioned above this does not match the semantics of `x86_amx*` (the distinction is illustrated after this message). It sounds like this is the underlying problem that should be addressed, because trying to update various middle-end optimizations to enforce the special semantics does not seem to be a scalable solution. As Nuno mentioned, you could try to use a separate address space for `x86_amx` pointers to avoid pointer optimizations.

> [...]
>
> I think the main controversy from Roman is whether a middle-end pass should consider special types when doing optimization. I tend to let the middle-end do the type check on account of the peculiarity of the AMX type. [...]

IIUC, the main problem is not whether middle-end passes perform optimizations based on certain types. To me it sounds like the actual problem is that pointers to `x86_amx` do not behave like regular LLVM IR pointers, and you are trying to enforce extra restrictions on bitcasts.

Cheers,
Florian
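To make the no-op distinction concrete, a small sketch (address space 7 is again a hypothetical choice, as in the earlier example):

    ; within one address space, pointer bitcasts are no-ops that optimizers
    ; may freely introduce -- exactly how the unwanted x86_amx* cast arises:
    %p = bitcast <256 x i32>* %mem to x86_amx*

    ; across address spaces, bitcast is invalid IR; an explicit addrspacecast
    ; is required, and passes will not create one speculatively:
    %q = addrspacecast <256 x i32>* %mem to <256 x i32> addrspace(7)*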
Juneyoung Lee via llvm-dev
2021-Mar-20 13:13 UTC
[llvm-dev] Does a middle-end pass need to consider special types when doing optimization? Or should the back-end revert the optimization accordingly?
I also think the pointee type shouldn't matter; my impression was that ty* and ty'* should be treated equivalently, and bitcasting between them should not have any side effects.

But when the pointer is used by a load, which takes a type that determines how the loaded value is interpreted, I don't think it's safe in general to convert a load of ty into a load of ty' of the same bit width (a sketch with the thread's own types follows this message).

A relevant bug in gcc: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58416 ; the transformation also happens in LLVM: https://bugs.llvm.org/show_bug.cgi?id=45152

On Thu, Mar 18, 2021 at 5:56 PM Wang, Pengfei via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> [...]
--
Juneyoung Lee
Software Foundation Lab, Seoul National University
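Restating the hazard with the types from this thread: both sides of the fold below are 8192 bits wide, yet the two loads are not interchangeable, because the load type changes what the backend must do (a sketch of the pattern already shown in the first message, viewed as a load-type change):

    ; an ordinary vector load: just a memory operation
    %v = load <256 x i32>, <256 x i32>* %mem, align 64

    ; after the fold, the load itself produces x86_amx; the backend must now
    ; lower it as a tile load, which requires a shape and stride that this
    ; instruction does not carry
    %pa = bitcast <256 x i32>* %mem to x86_amx*
    %t  = load x86_amx, x86_amx* %pa, align 64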