thr3ads.net - llvm dev - [llvm-dev] Intel AMX programming model discussion. [Sep 2020]

Luo, Yuanke via llvm-dev

2020-Sep-04 13:50 UTC

[llvm-dev] Intel AMX programming model discussion.

Fix typo

From: Luo, Yuanke
Sent: Friday, September 4, 2020 9:47 PM
To: 'Hal Finkel' <hfinkel at anl.gov>; Topper, Craig
<craig.topper at intel.com>; Kaylor, Andrew <andrew.kaylor at
intel.com>; Philip Reames <listmail at philipreames.com>; llvm-dev at
lists.llvm.org; florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at
intel.com>
Subject: RE: [llvm-dev] Intel AMX programming model discussion.

Hi Hal,
Generally, your proposal to adapt tile RA to Greedy RA looks good to me. Thank
you! I plan to prototype the proposal. Since there are 3 register allocators in
the LLVM infrastructure, we need 3 schemes, one to adapt tile RA to each
existing RA. Would you like to finalize the 3 schemes first, or would you
rather review the remaining part of the AMX programming model? We have some
limitations in supporting dynamic shapes, and I'd like to hear your advice. A
dynamic shape requires the ldtilecfg to post-dominate the point that defines
the shape, so we encourage users to define their shapes at the entry of the
function. Take the code below as an example. Ideally, we would insert ldtilecfg
at line 57 to configure a, b and c, but in this function c's shape {row, col}
is defined in each if/else clause, so at line 57 the shape of c is unknown. Do
you have any advice for this problem?
52 void kernel(int cond) {
53   _tile a = {row, 8};
54   _tile b = {8, col};
55
56   // copy shape to stack slot
57   // ldtilecfg a, b, c
58   if(cond) {
59     short row = get_row();
60     short col = get_row();
61     _tile c = {row, col};
62     __tile_loadd(&a, buf, STRIDE);
63     __tile_loadd(&b, buf, STRIDE);
64     __tile_loadd(&c, buf, STRIDE);
65   } else {
66     short row = get_row();
67     short col = get_row();
68     _tile c = {row, col};
69     __tile_loadd(&a, buf2, STRIDE);
70     __tile_loadd(&b, buf2, STRIDE);
71     __tile_loadd(&c, buf2, STRIDE);
72   }
73   __tile_dpbsud(&c, a, b);
74   __tile_stored(buf, STRIDE, c);
75 }

Thanks
Yuanke
From: Hal Finkel <hfinkel at anl.gov>
Sent: Friday, September 4, 2020 5:59 PM
To: Luo, Yuanke <yuanke.luo at intel.com>; Topper, Craig
<craig.topper at intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>;
Philip Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.

On 9/4/20 3:37 AM, Luo, Yuanke wrote:
Hi Hal,
Thank you for the ideas that help us improve the design, and sorry for the
late reply. There are some things I have not been able to figure out, and tile
RA has some special traits.

You're quite welcome.

1.       X86RegisterInfo::getRegAllocationHints can tell RA which physical
register is preferred, but it can't force RA to allocate only the hinted
register. If the hint cannot be met, RA allocates another register.

I addressed this below, but I could have been clearer. Like
SystemZRegisterInfo::getRegAllocationHints does sometimes, when hinting the tile
registers, the function will return true. This turns the preference into a hard
constraint, and the allocator will not allocate any other register. That's
my understanding from reading the code.
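
The difference between a soft and a hard hint can be shown with a toy model.
This is only a sketch and not LLVM's allocation-order code: candidateOrder is
a hypothetical helper, and its `hard` flag stands in for
getRegAllocationHints returning true.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model (not LLVM code): build the list of physical registers the
// allocator may try for one virtual register.
// `hard` plays the role of getRegAllocationHints returning true.
std::vector<int> candidateOrder(const std::vector<int> &classOrder,
                                const std::vector<int> &hints, bool hard) {
  std::vector<int> result(hints); // hinted registers are always tried first
  if (hard)
    return result;                // hard constraint: nothing else is allowed
  for (int reg : classOrder)
    if (std::find(hints.begin(), hints.end(), reg) == hints.end())
      result.push_back(reg);      // soft hint: fall back to the rest of the class
  return result;
}
```

With classOrder {0, 1, 2, 3} and hints {2}, the soft variant yields
{2, 0, 1, 3}, while the hard variant yields only {2}.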

2.       The shape information should be attached to each virtual register and
to each physical register that is allocated. How can we store and retrieve the
shape information with limited code changes to the existing RA?

For each virtual register, getRegAllocationHints could just recompute the shape
information. If this isn't a constant-time operation, however, you'll
probably want to cache the computed shape requirements in
X86MachineFunctionInfo. You can add a map from registers to shape information in
that class, and access it from getRegAllocationHints. You can store
information about the physical registers there too.
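
A minimal sketch of such a cache, kept independent of LLVM so it compiles on
its own. TileShapeCache and TileShape are hypothetical names; in a real patch
the map would presumably live in X86MachineFunctionInfo and be keyed by
Register.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>

struct TileShape {
  unsigned rows;
  unsigned cols;
};

// Hypothetical stand-in for a per-function map from register number to shape.
class TileShapeCache {
  std::unordered_map<unsigned, TileShape> shapes;

public:
  // Return the cached shape, computing and storing it on first use so that
  // repeated queries from the hinting hook stay cheap.
  TileShape get(unsigned reg,
                const std::function<TileShape(unsigned)> &compute) {
    auto it = shapes.find(reg);
    if (it != shapes.end())
      return it->second;
    TileShape s = compute(reg);
    assert(s.rows <= 16 && "an AMX tile has at most 16 rows");
    shapes.emplace(reg, s);
    return s;
  }

  bool contains(unsigned reg) const { return shapes.count(reg) != 0; }
};
```

The compute callback is only invoked on a miss, which is the point of caching
when the shape lookup is not constant-time.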

Regarding the physical registers, you can grab this information in the
pre-rewrite phase. Override addPreRewrite in X86TargetMachine.cpp. You'll
need a small pass that records relevant information about the assignments
(which, I imagine, is the same small pass that updates the LDTILECFG
instructions). For an example of such a pass, see AMDGPU/GCNNSAReassign.cpp.

3.       When a tile register is spilled, the shape should also be bound to the
corresponding spill stack slot, so that the slot can later be assigned a
physical tile register with the same shape.

I'm not sure what you mean. If you don't want to just be conservative
about the spill size allocation, you do need to know the shape in order to
compute the spill-location size. I assume that you can grab that out of
X86MachineFunctionInfo from storeRegToStackSlot/loadRegFromStackSlot or
eliminateFrameIndex (or copyPhysReg) as needed.
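
To make the size trade-off concrete, here is a small sketch. spillSlotBytes is
a hypothetical helper, and it assumes the column count is expressed in bytes
per row. An AMX tile holds at most 16 rows of 64 bytes, so 1 KiB is the
conservative slot size when the shape is unknown.

```cpp
#include <cassert>
#include <cstddef>

struct TileShape {
  unsigned rows;
  unsigned colBytes; // bytes per row (assumption: the shape is kept in bytes)
};

// An AMX tile holds at most 16 rows of 64 bytes, so 1 KiB always suffices.
const std::size_t kMaxTileBytes = 16 * 64;

// Size of the stack slot needed to spill one tile. A null shape means the
// shape is not known, so fall back to the conservative maximum.
std::size_t spillSlotBytes(const TileShape *shape) {
  if (!shape)
    return kMaxTileBytes;
  assert(shape->rows <= 16 && shape->colBytes <= 64);
  return static_cast<std::size_t>(shape->rows) * shape->colBytes;
}
```

A 3-row tile with 16 bytes per row needs 48 bytes, against 1024 bytes for the
conservative choice.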

4.       There is no mov/copy instruction for tile registers. To copy a tile
register, we need to store it to memory and load the data from memory into
another register. So a lot of the live-interval splitting code in Greedy RA is
unnecessary for tile register allocation.

Yes, but this just means that you need to support copying through memory.
Setting CopyCost = -1 in X86RegisterInfo.td might help as well.

5.       The compiler can support register spills, but spills should be avoided
for performance reasons. We prefer to report a warning on a register spill, so
that users notice it and can adjust their code to avoid the spill.

If you want to emit a diagnostic, you may be able to do that from
storeRegToStackSlot. In any case, please make use of the optimization-remark
infrastructure. For an example of how to do this, see
RAGreedy::reportNumberOfSplillsReloads in RegAllocGreedy.cpp.

If there is no easy way to take advantage of the current RA infrastructure,
there are some advantages to having a separate RA for tile registers.

1.       We can limit the risk of breaking RA for general registers on each
architecture. If there are bugs in tile RA, only applications that use AMX are
affected.

That's true. But I also worry about that. Any time you need to write
non-trivial code that will be used relatively rarely, it's likely to have
bugs that take a long time to show up. If you can plug into the generic
infrastructure, you benefit from the fact that it's highly-covered,
often-used code. Not that you might not run into bugs, of course, especially if
you're using it in a new way, but the base logic is likely to already be
robust.

2.       We can customize the special traits (config, split, spill) of tile
registers more freely in a separate RA.

True.

 -Hal

For RegAllocFast, I agree with you. Each register region is small, and since
performance is not the first priority there, we can insert multiple configs,
one for each small region.
As you recommend looking at the PBQP solver, I'll take some time to
investigate it and get back to you.

Thanks
-Yuanke

From: Hal Finkel <hfinkel at anl.gov>
Sent: Monday, August 24, 2020 5:03 PM
To: Luo, Yuanke <yuanke.luo at intel.com>; Topper, Craig
<craig.topper at intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>;
Philip Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.

Hi, Yuanke,

Thanks for writing this up. Let me back up a bit because the scheme I proposed
last week doesn't work without further modification: within a particular
"configuration region" (i.e., the code in between the LDTILECFG and
the TILERELEASE (or next LDTILECFG)), each tile register can only be used with
one shape, and in addition, no register can have its shape changed without
zeroing out all of the tile registers. Thus, just using different register
classes for the different shapes, as I had suggested, isn't sufficient to
model the allocation requirements. That would not prevent the same register from
essentially being assigned to differently-shaped virtual registers with
non-overlapping live ranges within one configuration region.

Also, as you point out, when multiple non-static tile shapes are in use, if you
use one register class for each shape, you would need different register classes
for these too. Luckily, I don't think that using the separate register
classes actually buys us anything, so please disregard that suggestion of mine.
Use only one register class.

Once the configuration regions are identified, you'll know how many tile
register shapes are required. If this number is greater than eight, then
you'll need to cut the region (requiring all live tiles to be spilled and
restored around each re-configuration point). After that, we'll assume that
we have eight or fewer distinct shapes.
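
As a sketch of that check (hypothetical helper names, with shapes modelled as
plain {rows, cols} pairs known up front):

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Shape = std::pair<unsigned, unsigned>; // {rows, cols}

// Number of distinct shapes used by the virtual tile registers of one region.
std::size_t distinctShapes(const std::vector<Shape> &vregShapes) {
  return std::set<Shape>(vregShapes.begin(), vregShapes.end()).size();
}

// There are eight tile registers (tmm0..tmm7) and each keeps a single shape
// per configuration, so a region needing more than eight shapes must be cut.
bool regionNeedsCut(const std::vector<Shape> &vregShapes) {
  return distinctShapes(vregShapes) > 8;
}
```

Dynamic shapes are not modelled here; a conservative implementation would
presumably count every shape it cannot prove equal to another as distinct.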

Now the problem is that you need to allocate registers, satisfying all of the
usual constraints (non-overlapping live ranges, etc.), but with an additional
constraint: once a physical register has been used with some particular tile
shape, it cannot be assigned to any other tile shape.

I think that the current infrastructure can support this as follows:

 1. Add an override X86RegisterInfo::getRegAllocationHints. Like
SystemZRegisterInfo::getRegAllocationHints does sometimes, when hinting the tile
registers, the function will return true (to indicate a hard constraint). As
registers are assigned in RegAllocGreedy, getRegAllocationHints is called for
each virtual register. For virtual tile registers, look at the passed
VirtRegMap, etc. for already-assigned tile virtual registers whose shape
requirements differ from those of the current virtual register (you'll need to
cache the shape requirements in X86MachineFunctionInfo for this to be
efficient), and return a hints list consisting of all the other non-reserved
tile registers, that is, those not already holding a different shape.

 2. To support RegAllocFast, which doesn't use getRegAllocationHints, you
would need to make the configuration regions small enough that it doesn't
matter (and if you're doing this around every tile instruction, this is
automatically true).

 3. To support RegAllocPBQP (which is likely a good thing to do, but probably
not required), I believe you can support this by adding custom constraints to
the solver (kind of like what AArch64PBQPRegAlloc.cpp does).
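
The filtering in item 1 can be sketched as follows. This is a standalone
model, not an implementation of the LLVM hook: PhysTileState and
allowedTileRegs are hypothetical names, and live-range interference is
assumed to be checked by the allocator as usual.

```cpp
#include <cassert>
#include <vector>

struct TileShape {
  unsigned rows, cols;
  bool operator==(const TileShape &o) const {
    return rows == o.rows && cols == o.cols;
  }
};

// One entry per physical tile register. `used` stays false until some virtual
// register has been assigned to it in the current configuration region.
struct PhysTileState {
  bool used;
  TileShape shape;
};

// Registers that may be offered as hard hints for a virtual register of shape
// `wanted`: every register that is not already tied to a different shape.
std::vector<unsigned> allowedTileRegs(const std::vector<PhysTileState> &regs,
                                      TileShape wanted) {
  assert(regs.size() <= 8 && "there are eight tile registers");
  std::vector<unsigned> hints;
  for (unsigned i = 0; i < regs.size(); ++i)
    if (!regs[i].used || regs[i].shape == wanted)
      hints.push_back(i);
  return hints;
}
```

If tmm0 already holds a 1x1 tile and tmm1 a 1x2 tile, a new 1x2 virtual
register is offered tmm1 through tmm7 but never tmm0.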

Once the allocation process is complete, you'll need to go back and update
the LDTILECFG data to reflect the chosen shape -> register mapping.
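
For illustration, a sketch of filling the 64-byte LDTILECFG memory operand
from such a mapping. buildTileConfig is a hypothetical helper, and the byte
layout is the palette 1 layout as I understand it from Intel's documentation,
so it should be checked against the SDM before being relied on.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct TileShape {
  std::uint8_t rows;      // number of rows, at most 16
  std::uint16_t colBytes; // bytes per row, at most 64
};

// Fill the 64-byte configuration block for palette 1:
//   byte 0      palette id
//   byte 1      start_row
//   bytes 16..  16-bit bytes-per-row, one entry per tile
//   bytes 48..  8-bit row count, one entry per tile
std::array<std::uint8_t, 64>
buildTileConfig(const std::array<TileShape, 8> &shapes) {
  std::array<std::uint8_t, 64> cfg{}; // zero-filled: unused tiles stay 0x0
  cfg[0] = 1;                         // palette 1
  for (unsigned i = 0; i < 8; ++i) {
    assert(shapes[i].rows <= 16 && shapes[i].colBytes <= 64);
    cfg[16 + 2 * i] = static_cast<std::uint8_t>(shapes[i].colBytes & 0xff);
    cfg[17 + 2 * i] = static_cast<std::uint8_t>(shapes[i].colBytes >> 8);
    cfg[48 + i] = shapes[i].rows;
  }
  return cfg;
}
```

Index i is the physical register number (tmm0..tmm7), so the pass that runs
after allocation only has to look up the shape chosen for each register.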

What I don't know, however, is how well the getRegAllocationHints method
will work. The benefit is that you don't need to write a custom
pre-allocator allocator. On the other hand, it might visit the virtual registers
to assign in a suboptimal order because it doesn't really understand the
constraint being imposed (generally, we just assign larger live ranges first).
Then again, it is a greedy algorithm, and if you want something
systematically closer to optimal, maybe you should be using PBQP anyway. If you
do end up needing a custom allocator for these, I recommend looking at the PBQP
solver (which, as I recall, is independently reusable).

Hopefully, this is more-helpful advice.

 -Hal
On 8/21/20 9:54 PM, Luo, Yuanke wrote:
It seems I made a mistake about sharing register units. Can we share register
units among tile registers that are in different tile register classes (each
register class has a different tile shape)? Consider two virtual tile registers,
%2:vtile1x1 and %3:vtile1x2. First %2 is allocated to $tmm0; after %2 is killed,
%3 is allocated to $tmm0. This is not allowed, because when $tmm0 is allocated
to %2 its shape is configured as 1x1. If we reallocate $tmm0 to %3, we need to
reconfigure $tmm0 to 1x2, which clobbers $tmm0~$tmm7.

Yuanke

From: Luo, Yuanke
Sent: Friday, August 21, 2020 2:12 PM
To: Hal Finkel <hfinkel at anl.gov>; Topper, Craig
<craig.topper at intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>;
Philip Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: RE: [llvm-dev] Intel AMX programming model discussion.

Hi Hal,
The proposal is attractive to me, but there is something I still can't
figure out. Let's take the MIR below as an example. We assume we have 256
register classes (vtile1x1, vtile1x2, ..., vtile16x16).

1.       After instruction selection, the pseudo AMX instructions are generated.
The names of the pseudo instructions have a 'P' prefix. At this point all the
AMX pseudo instructions take vtile as their register class. Let's assume %13 is
the constant 3, %10 is the constant 4, and %14 is a variable.
  %1:vtile = PTILELOADDV %13:gr16, %10:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %2:vtile = PTILELOADDV %10:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %3:vtile = PTILELOADDV %13:gr16, %14:gr16, %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %21:vtile = PTDPBSSDV %13:gr16, %10:gr16, %14:gr16, %3:vtile(tied-def 0), %1:vtile, %2:vtile

2.       The configuration-placement pass looks at all of the AMX
pseudo-instructions and identifies regions in which the pseudo-instructions use
the same configuration parameters. It first replaces the register class of all
tile registers whose shape is known at compile time. Since the shape of %1 is
constant, it replaces %1:vtile with %1:vtile3x4, which changes the register
class and morphs the pseudo instruction into a real AMX instruction. The shapes
of %2 and %3 are unknown at compile time, so the pass arbitrarily picks tile
register classes that have not been assigned before and assigns them to %2 and
%3. After register class allocation, the code is transformed as follows, with
%2 assigned vtile1x1 and %3 assigned vtile1x2.
  PLDTILECFG
  %1:vtile3x4 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %2:vtile1x1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %3:vtile1x2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  %21:vtile1x2 = TDPBSSDV %3:vtile1x2(tied-def 0), %1:vtile3x4, %2:vtile1x1
There are some things I have not figured out.

1.       I am not sure whether an AMX instruction's inputs and outputs can fit
multiple register classes (vtile1x1, ..., vtile16x16); otherwise we need 256
pseudo instructions.

2.       Whether 256 register classes are enough. There may be more than 256
tile registers of unknown shape.

3.       In this pass we also find the proper point (common dominator) at which
to insert ldtilecfg, but at this time the registers are not yet allocated, so
we don't know the shape of each physical tile register. So we just insert a
pseudo tile config instruction.

3.       All tile register classes share the same register units. We do register
allocation with the existing framework, and the code is transformed as follows.
  $tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1

4.       Run the config pass to collect the shape of each physical tile
register and configure them. The code can be generated as below. Here is the
problem: how can we know the shape of each physical tile register?
  MOV row, col info to %stack.0 for each physical tile register   ??????
  LDTILECFG %stack.0, 1, $noreg, 0, $noreg, implicit-def $tmm0, implicit-def
    $tmm1, implicit-def $tmm2, implicit-def $tmm3, implicit-def $tmm4,
    implicit-def $tmm5, implicit-def $tmm6, implicit-def $tmm7
  $tmm0 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg
  $tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1

Thanks
Yuanke

...

--

Hal Finkel

Lead, Compiler Technology and Programming Languages

Leadership Computing Facility

Argonne National Laboratory

Hal Finkel via llvm-dev

2020-Sep-05 01:30 UTC

[llvm-dev] Intel AMX programming model discussion.

On 9/4/20 8:50 AM, Luo, Yuanke wrote:
> Fix typo
>
> *From:* Luo, Yuanke
> *Sent:* Friday, September 4, 2020 9:47 PM
> *To:* 'Hal Finkel' <hfinkel at anl.gov>; Topper, Craig
> <craig.topper at intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>;
> Philip Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
> florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
> *Subject:* RE: [llvm-dev] Intel AMX programming model discussion.
>
> Hi Hal,
>
> Generally, your proposal to adapt tile RA to Greedy RA looks good to
> me. Thank you! I plan to prototype the proposal. Since there are 3
> register allocators in the LLVM infrastructure, we need 3 schemes, one
> to adapt tile RA to each existing RA. Would you like to finalize the 3
> schemes first, or would you rather review the remaining part of the AMX
> programming model? We have some limitations in supporting dynamic
> shapes, and I'd like to hear your advice. A dynamic shape requires the
> ldtilecfg to post-dominate the point that defines the shape, so we
> encourage users to define their shapes at the entry of the function.
> Take the code below as an example. Ideally, we would insert ldtilecfg
> at line 57 to configure a, b and c, but in this function c's shape
> {row, col} is defined in each if/else clause, so at line 57 the shape
> of c is unknown. Do you have any advice for this problem?

In the example below, I'm going to assume that the function calls are
actually to get_row1() and get_row2(), neither of which can be hoisted.

Just to think about this: First, we're starting the MIR with intrinsics 
that take the shape parameters directly. Now you need to:

  1. Identify "configuration regions". Because reconfiguring must be
done for all registers at once, and because reconfiguring zeros all of
the tile registers, each configuration region is a connected component
in the union of the live ranges of all virtual tile registers. Thus,
first collect the configuration regions via trivial clustering (two
instructions are part of the same configuration region if they share any
live range of a tile register).

  2. If the region will require more than eight distinct shapes, then
you'll need to calculate a min cut of the region and split the region by
inserting spills/restores, so that each resulting region requires at most
eight shapes.

  3. If you do it this way, all of the instructions in your code below 
will be part of one, big configuration region. Generally, you want to 
put the ldtilecfg at the common dominating point of all of the tile 
instructions in the region. Now, as you point out in your example below, 
we can't simply put the ldtilecfg at the common dominating point: that 
point might not actually be dominated by the definitions of all of the 
shape inputs needed.

  4. One thing that you might do is iterative splitting. If not all of
the definitions of the shape inputs dominate the desired insertion
point, first you might try iteratively hoisting the defining instructions
to make it so the definitions do dominate. If they still don't, then
split the ldtilecfg into each successor of the desired insertion point.
Do this recursively until, for each ldtilecfg, the inputs for each
dynamic-shape tile register size dominate the insertion point.

  5. This procedure, alone, might fail in the case where the ldtilecfg 
is sunk past the point of definition of one of the tile registers. 
Imagine, in your example below, that there was some use of the tile 
registers a and b before the if. In that case, you'll need to split 
those live ranges by spilling into memory around the desired ldtilecfg 
insertion point. That creates a new configuration region that you'll 
insert into the queue of configuration regions to process.
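
The clustering in step 1 can be sketched for the simplest case. This is a
simplification and not the general algorithm: it assumes straight-line code
in which each virtual tile register has a single live interval given as
inclusive instruction indices, and LiveRange and configRegions are
hypothetical names. Real live ranges span a CFG and may have holes.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct LiveRange {
  int start, end; // inclusive instruction indices in a linear order
};

// Group live ranges into configuration regions: ranges that overlap, directly
// or through a chain of other ranges, end up with the same region id.
std::vector<int> configRegions(const std::vector<LiveRange> &ranges) {
  std::vector<int> order(ranges.size());
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return ranges[a].start < ranges[b].start; });

  std::vector<int> region(ranges.size(), -1);
  int current = -1, currentEnd = 0;
  for (int idx : order) {
    assert(ranges[idx].start <= ranges[idx].end);
    if (current < 0 || ranges[idx].start > currentEnd) {
      ++current; // gap: no tile is live here, so a new region starts
      currentEnd = ranges[idx].end;
    } else {
      currentEnd = std::max(currentEnd, ranges[idx].end);
    }
    region[idx] = current;
  }
  return region;
}
```

Each gap between regions is a point where no tile register is live, which is
where a reconfiguration costs nothing in spills.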

I'm sure that this is not the only possible heuristic. This would be 
easier, I think, if the hardware did not zero all of the registers when 
you reconfigured any of them, but I suppose that it is what it is at 
this point.

  -Hal

> 52 void kernel(int cond) {
> 53   _tile a = {row, 8};
> 54   _tile b = {8, col};
> 55
> 56   // copy shape to stack slot
> 57   // ldtilecfg a, b, c
> 58   if(cond) {
> 59     short row = get_row();
> 60     short col = get_row();
> 61     _tile c = {row, col};
> 62     __tile_loadd(&a, buf, STRIDE);
> 63     __tile_loadd(&b, buf, STRIDE);
> 64     __tile_loadd(&c, buf, STRIDE);
> 65   } else {
> 66     short row = get_row();
> 67     short col = get_row();
> 68     _tile c = {row, col};
> 69     __tile_loadd(&a, buf2, STRIDE);
> 70     __tile_loadd(&b, buf2, STRIDE);
> 71     __tile_loadd(&c, buf2, STRIDE);
> 72   }
> 73   __tile_dpbsud(&c, a, b);
> 74   __tile_stored(buf, STRIDE, c);
> 75 }
>
> Thanks
>
> Yuanke
>
> *From:* Hal Finkel <hfinkel at anl.gov <mailto:hfinkel at
anl.gov>>
> *Sent:* Friday, September 4, 2020 5:59 PM
> *To:* Luo, Yuanke <yuanke.luo at intel.com 
> <mailto:yuanke.luo at intel.com>>; Topper, Craig <craig.topper
at intel.com
> <mailto:craig.topper at intel.com>>; Kaylor, Andrew 
> <andrew.kaylor at intel.com <mailto:andrew.kaylor at
intel.com>>; Philip
> Reames <listmail at philipreames.com <mailto:listmail at
philipreames.com>>;
> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>; 
> florian_hahn at apple.com <mailto:florian_hahn at apple.com>; Lu,
Hongjiu
> <hongjiu.lu at intel.com <mailto:hongjiu.lu at intel.com>>
> *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
> On 9/4/20 3:37 AM, Luo, Yuanke wrote:
>
>     Hi Hal,
>
>     Thank you for the ideas that help us to improve the design, and
>     sorry for replying late. There is something I am not able to
>     figure out and there some special trait for tile RA.
>
> You're quite welcome.
>
>     1.X86RegisterInfo::getRegAllocationHints can tell RA which
>     physical register is preferred, but it can’t force RA to just
>     allocate the hinted register. If the hinted register is not meet,
>     RA would allocate other register.
>
> I addressed this below, but I could have been clearer. Like 
> SystemZRegisterInfo::getRegAllocationHints does sometimes, when 
> hinting the tile registers, the function will return true. This turns 
> the preference into a hard constraint, and the allocator will not 
> allocate any other register. That's my understanding from reading the 
> code.
>
>     2.The shape information should be attached to each virtual
>     register and physical register which is allocated. How to store
>     and get the shape information with limited code change on existing RA?
>
> For each virtual register, getRegAllocationHints could just recompute 
> the shape information. If this isn't a constant-time operation, 
> however, you'll probably want to cache the computed shape requirements 
> in X86MachineFunctionInfo. You can add a map from registers to shape 
> information in that class, and accesses it from getRegAllocationHints. 
> You can store information about the physical registers there too.
>
> Regarding the physical registers, you can grab this information in the 
> pre-rewrite phase. Override addPreRewrite in X86TargetMachine.cpp. 
> You'll need a small pass that records relevant information about the 
> assignments (which, I imagine, is the same small pass that updates the 
> LDTILECFG instructions). For an example of such a pass, see 
> AMDGPU/GCNNSAReassign.cpp
>
>     3.When a tile register is spilled, the shape should also be bound
>     the corresponding spill stack slot, so that it can be assigned the
>     physical tile register with the same shape.
>
> I'm not sure what you mean. If you don't want to just be
conservative
> about the spill size allocation, you do need to know the shape in 
> order to compute the spill-location size. I assume that you can grab 
> that out of X86MachineFunctionInfo from 
> storeRegToStackSlot/loadRegFromStackSlot or eliminateFrameIndex (or 
> copyPhysReg) as needed.
>
>     4.There is no mov/copy instruction for tile register. To copy tile
>     register, we need to store the tile register to memory and load
>     the data from memory to another register. So a lot of code for
>     live interval split in Greedy RA is unnecessary for tile register
>     allocation.
>
> Yes, but this just means that you need to support copying through 
> memory. Setting CopyCost = -1 in X86RegisterInfo.td might help as well.
>
>     5.Compiler can support register spill, but spill should be avoided
>     for performance benefit. We prefer reporting warning on register
>     spill, so that user can realize it and adjust their code to avoid
>     register spill.
>
> If you want to emit a diagnostic, you may be able to do that from 
> storeRegToStackSlot. In any case, please make use of the 
> optimization-remark infrastructure. For an example of how to do this, 
> see RAGreedy::reportNumberOfSplillsReloads in RegAllocGreedy.cpp.
>
>     If there is no easy way to take the advantage of current RA
>     infrastructure, there are some pros to have a separate RA for tile
>     register.
>
>     1.We can limit the risk to break RA for general register on each
>     arch. If there are some bugs on tile RA, only application that use
>     AMX is affected.
>
> That's true. But I also worry about that. Any time you need to write 
> non-trivial code that will be used relatively rarely, it's likely to 
> have bugs that take a long time to show up. If you can plug into the 
> generic infrastructure, you benefit from the fact that it's 
> highly-covered, often-used code. Not that you might not run into bugs, 
> of course, especially if you're using it in a new way, but the base 
> logic is likely to already be robust.
>
>     2.We can customize the special trait (config, spilt, spill) of
>     tile register in the sperate RA more freely.
>
> True.
>
>  -Hal
>
>     For RegAllocFast, I agree with you. Each region of register is
>     small, and since the performance is not the first priority, we can
>     insert multiply config for each small region.
>
>     As you recommend looking at the PBQP solver, I’ll take some time
>     to investigate it and go back to you.
>
>     Thanks
>
>     -Yuanke
>
>     *From:* Hal Finkel <hfinkel at anl.gov> <mailto:hfinkel at
anl.gov>
>     *Sent:* Monday, August 24, 2020 5:03 PM
>     *To:* Luo, Yuanke <yuanke.luo at intel.com>
>     <mailto:yuanke.luo at intel.com>; Topper, Craig
>     <craig.topper at intel.com> <mailto:craig.topper at
intel.com>; Kaylor,
>     Andrew <andrew.kaylor at intel.com> <mailto:andrew.kaylor at
intel.com>;
>     Philip Reames <listmail at philipreames.com>
>     <mailto:listmail at philipreames.com>; llvm-dev at lists.llvm.org
>     <mailto:llvm-dev at lists.llvm.org>; florian_hahn at apple.com
>     <mailto:florian_hahn at apple.com>; Lu, Hongjiu
>     <hongjiu.lu at intel.com> <mailto:hongjiu.lu at intel.com>
>     *Subject:* Re: [llvm-dev] Intel AMX programming model discussion.
>
>     Hi, Yuanke,
>
>     Thanks for writing this up. Let me back up a bit because the
>     scheme I proposed last week doesn't work without further
>     modification: within a particular "configuration region"
(i.e.,
>     the code in between the LDTILECFG and the TILERELEASE (or next
>     LDTILECFG)), each tile register can only be used with one shape,
>     and in addition, no register can have its shape changed without
>     zeroing out all of the tile registers. Thus, just using different
>     register classes for the different shapes, as I had suggested,
>     isn't sufficient to model the allocation requirements. That would
>     not prevent the same register from essentially being assigned to
>     differently-shaped virtual registers with non-overlapping live
>     ranges within one configuration region.
>
>     Also, as you point out, when multiple non-static tile shapes are
>     in use, if you use one register class for each shape, you would
>     need different register classes for these too. Luckily, I don't
>     think that using the separate register classes actually buys us
>     anything, so please disregard that suggestion of mine. Use only
>     one register class.
>
>     Once the configuration regions are identified, you'll know how
>     many tile register shapes are required. If this number is greater
>     than eight, then you'll need to cut the region (requiring all live
>     tiles to be spilled and restored around each re-configuration
>     point). After that, we'll assume that we have eight or fewer
>     distinct shapes.
>
>     Now the problem is that you need to allocate registers, satisfying
>     all of the usual constraints (non-overlapping live ranges, etc.),
>     but with an additional constraint: once a physical register has
>     been used with some particular tile shape, it cannot be assigned
>     to any other tile shape.
>
>     I think that the current infrastructure can support this as follows:
>
>      1. Add an override X86RegisterInfo::getRegAllocationHints. Like
>     SystemZRegisterInfo::getRegAllocationHints does sometimes, when
>     hinting the tile registers, the function will return true (to
>     indicate a hard constraint). As registers are assigned in
>     RegAllocGreedy, getRegAllocationHints is called for each virtual
>     register. For virtual tile registers, look at the passed
>     VirtRegMap, etc. for already-assigned tile virtual registers with
>     different shape requirements as the current virtual register
>     (you'll need to cache the shape requirements in
>     X86MachineFunctionInfo for this to be efficient), and return a
>     hints list consisting of all other non-reserved tile registers.
>
>      2. To support RegAllocFast, which doesn't use
>     getRegAllocationHints, you would need to make the configuration
>     regions small enough that it doesn't matter (and if you're
doing
>     this around every tile instruction, this is automatically true).
>
>      3. To support RegAllocPBQP (which is likely a good thing to do,
>     but probably not required), I believe you can support this by
>     adding custom constraints to the solver (kind of like what
>     AArch64PBQPRegAlloc.cpp does).
>
>     Once the allocation process is complete, you'll need to go back
>     and update the LDTILECFG data to reflect the chosen shape ->
>     register mapping.
>
>     What I don't know, however, is how well the getRegAllocationHints
>     method will work. The benefit is that you don't need to write a
>     custom pre-allocator allocator. On the other hand, it might visit
>     the virtual registers to assign in a suboptimal order because it
>     doesn't really understand the constraint being imposed (generally,
>     we just assign larger live ranges first). On the other hand, it is
>     a greedy algorithm and if you want something systematically closer
>     to optimal, maybe you should be using PBQP anyway. If you do end
>     up needing a custom allocator for these, I recommend looking at
>     the PBQP solver (which, as I recall, is independently reusable).
>
>     Hopefully, this is more-helpful advice.
>
>      -Hal
>
>     On 8/21/20 9:54 PM, Luo, Yuanke wrote:
>
>         It seems I made a mistake about sharing a register unit. Can
>         tile registers in different tile register classes (each class
>         has a different tile shape) share a register unit? Consider two
>         virtual tile registers, /%2:vtile1x1/ and /%3:vtile1x2/. First
>         %2 is allocated to $tmm0; after %2 is killed, %3 is allocated
>         to $tmm0. This is not allowed, because when $tmm0 is allocated
>         to %2, its shape is configured as 1x1. If we reallocate $tmm0
>         to %3, we need to reconfigure $tmm0 to 1x2, which clobbers
>         $tmm0~$tmm7.
>
>         Yuanke
>
>         *From:* Luo, Yuanke
>         *Sent:* Friday, August 21, 2020 2:12 PM
>         *To:* Hal Finkel <hfinkel at anl.gov>; Topper, Craig
>         <craig.topper at intel.com>; Kaylor, Andrew
>         <andrew.kaylor at intel.com>; Philip Reames
>         <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
>         florian_hahn at apple.com; Lu, Hongjiu
>         <hongjiu.lu at intel.com>
>         *Subject:* RE: [llvm-dev] Intel AMX programming model discussion.
>
>         Hi Hal,
>
>         The proposal is attractive to me, but there is something I
>         still can’t figure out. Let’s take the MIR below as an example.
>         We assume we have 256 register classes (vtile1x1, vtile1x2, …,
>         vtile16x16).
>
>         1. After instruction selection, the pseudo AMX instructions are
>         generated. The names of the pseudo instructions have a ‘P’
>         prefix. At this point all the AMX pseudo instructions take
>         vtile as their register class. Let’s assume %13 is the constant
>         3, %10 is the constant 4, and %14 is a variable.
>
>         /  %1:vtile = *P*TILELOADDV %13:gr16, %10:gr16, %17:gr64, 1,
>         %18:gr64_nosp, 0, $noreg/
>
>         /  %2:vtile = *P*TILELOADDV %10:gr16, %14:gr16, %17:gr64, 1,
>         %18:gr64_nosp, 0, $noreg/
>
>         /  %3:vtile = *P*TILELOADDV %13:gr16, %14:gr16, %17:gr64, 1,
>         %18:gr64_nosp, 0, $noreg/
>
>         /%21:vtile = *P*TDPBSSDV %13:gr16, %10:gr16, %14:gr16,
>         %3:vtile(tied-def 0), %1:vtile, %2:vtile /
>
>         2. The configuration-placement pass looks at all of the AMX
>         pseudo-instructions and identifies regions in which the
>         pseudo-instructions use the same configuration parameters. It
>         first replaces the register class of every tile register whose
>         shape is known at compile time. Since the shape of %1 is
>         constant, it replaces %1:vtile with %1:vtile3x4, which changes
>         the register class and morphs the pseudo instruction into a
>         real AMX instruction. The shapes of %2 and %3 are unknown at
>         compile time, so the pass arbitrarily picks tile register
>         classes that have not been assigned before and assigns them to
>         %2 and %3. After register class allocation, the code is
>         transformed as follows, with %2:vtile1x1 and %3:vtile1x2 as the
>         allocated register classes.
>
>         /*P*LDTILECFG/
>
>         /  %1:vtile3x4  = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0,
>         $noreg/
>
>         /  %2:vtile1x1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
>         /  %3:vtile1x2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
>         /%21:vtile1x2 = TDPBSSDV %3:vtile1x2(tied-def 0), %1:vtile3x4,
>         %2:vtile1x1 /
>
>         Some things I have not figured out:
>
>         1. I am not sure whether an AMX instruction’s inputs and
>         outputs can fit multiple register classes (vtile1x1, …,
>         vtile16x16); otherwise we need 256 pseudo instructions.
>
>         2. Whether 256 register classes are enough. There may be more
>         than 256 tile registers of unknown shape.
>
>         3. In this pass we also find the proper point (common
>         dominator) to insert ldtilecfg, but at this time the registers
>         are not yet allocated, so we don’t know the shape of each
>         physical tile register. So we just insert a pseudo tile config
>         instruction.
>
>         3. All tile register classes share the same register unit. We
>         run register allocation through the framework, and the code is
>         transformed as follows.
>
>         /  $tmm0  = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
>         /  $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
>         /  $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
>         /$tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1/
>
>         4. Run the config pass to collect the shape of each physical
>         tile register and configure them. The code can be generated as
>         below. Here is the problem: how can we know the shape of each
>         physical tile register?
>
>         */   MOV row, col info to %stack.0 for each physical tile
>         register   ??????/*
>
>         */  LDTILECFG %stack.0, 1, $noreg, 0, $noreg, implicit-def
>         $tmm0, implicit-def $tmm1, implicit-def $tmm2, implicit-def
>         $tmm3, implicit-def $tmm4, implicit-def $tmm5, implicit-def
>         $tmm6, implicit-def $tmm7/*
>
>         /  $tmm0  = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
>         /  $tmm1 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
>         /  $tmm2 = TILELOADDV %17:gr64, 1, %18:gr64_nosp, 0, $noreg/
>
>         /$tmm2 = TDPBSSDV $tmm2(tied-def 0), $tmm0, $tmm1/
>
>         Thanks
>
>         Yuanke
>
>         ...
>
>     -- 
>
>     Hal Finkel
>
>     Lead, Compiler Technology and Programming Languages
>
>     Leadership Computing Facility
>
>     Argonne National Laboratory
>


Luo, Yuanke via llvm-dev

2020-Sep-20 04:46 UTC

[llvm-dev] Intel AMX programming model discussion.

Hi Hal,
The RA scheme that you proposed basically works. I have a prototype patch at
https://reviews.llvm.org/D87981. I'd like to defer identifying and splitting
the configuration regions to the next prototyping phase, since that part is a
little complex. At the language level, is it possible to require end users to
define their shapes at the entry of the function?
Thanks
Yuanke

From: Hal Finkel <hfinkel at anl.gov>
Sent: Saturday, September 5, 2020 9:31 AM
To: Luo, Yuanke <yuanke.luo at intel.com>; Topper, Craig <craig.topper
at intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>; Philip
Reames <listmail at philipreames.com>; llvm-dev at lists.llvm.org;
florian_hahn at apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 9/4/20 8:50 AM, Luo, Yuanke wrote:
Fix typo

From: Luo, Yuanke
Sent: Friday, September 4, 2020 9:47 PM
To: 'Hal Finkel' <hfinkel at anl.gov>; Topper, Craig <craig.topper at
intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>; Philip Reames
<listmail at philipreames.com>; llvm-dev at lists.llvm.org; florian_hahn at
apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: RE: [llvm-dev] Intel AMX programming model discussion.

Hi Hal,
Generally, your proposal to adapt tile RA to Greedy RA looks good to me. Thank
you! I plan to prototype the proposal. Since there are 3 register allocators in
the LLVM infrastructure, we need 3 schemes to adapt tile RA to each of them.
Would you like to finalize the 3 schemes first, or would you rather review the
remaining part of the AMX programming model? We have some limitations in
supporting dynamic shapes and I'd like to hear your advice. A dynamic shape
requires that the ldtilecfg come after the point that defines the shape (the
definition must dominate the ldtilecfg), so we encourage users to define their
shapes at the entry of the function. Take the code below as an example. Ideally,
we would insert ldtilecfg at line 57 to configure a, b, and c, but in this
function c's shape {row, col} is defined in each if/else clause. So at line 57,
the shape of c is unknown. Do you have any advice for this problem?

In the example below, I'm going to assume that the function calls are
actually to get_row1() and get_row2(), neither of which can be hoisted.

Just to think about this: First, we're starting the MIR with intrinsics that
take the shape parameters directly. Now you need to:

 1. Identify "configuration regions". Because reconfiguring must be
done for all registers at once, and because reconfiguring zeros all of the tile
registers, each configuration region is a connected component in the union of
the live ranges of all virtual tile registers. Thus, first collect the
configuration regions via trivial clustering (two instructions are part of the
same configuration region if they share any live range of a tile register).

 2. If the region will require more than eight distinct shapes, then you'll
need to calculate a min cut of the region and split the region by inserting
spill/restores, so that each resulting region requires at most eight shapes.

 3. If you do it this way, all of the instructions in your code below will be
part of one, big configuration region. Generally, you want to put the ldtilecfg
at the common dominating point of all of the tile instructions in the region.
Now, as you point out in your example below, we can't simply put the
ldtilecfg at the common dominating point: that point might not actually be
dominated by the definitions of all of the shape inputs needed.

 4. One thing that you might do is iterative splitting. If not all of the
definitions of the shape inputs dominate the desired insertion point, first you
might try iteratively hoisting the defining instructions to make it so the
definitions do dominate. If they still don't, then split the ldtilecfg into
each successor of the desired insertion point. Do this recursively until, for
each ldtilecfg, the inputs for each dynamic-shape tile register size dominate
the insertion point.

 5. This procedure, alone, might fail in the case where the ldtilecfg is sunk
past the point of definition of one of the tile registers. Imagine, in your
example below, that there was some use of the tile registers a and b before the
if. In that case, you'll need to split those live ranges by spilling into
memory around the desired ldtilecfg insertion point. That creates a new
configuration region that you'll insert into the queue of configuration
regions to process.
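
As a rough illustration of the clustering in step 1, here is a minimal Python sketch. It is not LLVM code: the live-range model (a dict from a virtual tile register name to the set of instruction indices it is live across) and the function name configuration_regions are assumptions made only for this example. It groups instructions with a union-find so that any two instructions sharing a tile live range end up in the same region.

```python
from collections import defaultdict


def configuration_regions(live_ranges):
    """Cluster instruction indices into configuration regions.

    live_ranges is a hypothetical model: tile vreg name -> set of
    instruction indices across which that vreg is live.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Every instruction covered by one live range joins one cluster.
    for insts in live_ranges.values():
        ordered = sorted(insts)
        for inst in ordered:
            find(inst)
        for other in ordered[1:]:
            union(ordered[0], other)

    regions = defaultdict(set)
    for inst in list(parent):
        regions[find(inst)].add(inst)
    return sorted(sorted(r) for r in regions.values())


# %a and %b overlap at instruction 2, so they form one region;
# %c is live elsewhere and forms its own region.
live = {"%a": {0, 1, 2}, "%b": {2, 3}, "%c": {7, 8}}
regions = configuration_regions(live)
```

A real pass would work on LiveIntervals and slot indexes rather than plain integers, but the connected-component structure is the same.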

I'm sure that this is not the only possible heuristic. This would be easier,
I think, if the hardware did not zero all of the registers when you reconfigured
any of them, but I suppose that it is what it is at this point.
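
The iterative splitting in step 4 can be sketched on a toy CFG. This is only a model under stated assumptions: the CFG is a dict from block name to successor list, def_block and uses are hypothetical maps (shape value -> defining block, block -> shape values needed by tile definitions in that block), a definition in the same block is treated as available at the insertion point, and neither hoisting (step 4) nor live-range spilling (step 5) is modelled.

```python
def dominators(cfg, entry):
    """Classic iterative dominator sets: block -> set of dominating blocks."""
    blocks = set(cfg)
    preds = {b: set() for b in blocks}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].add(b)
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks - {entry}:
            new = set(blocks)
            for p in preds[b]:
                new &= dom[p]
            new |= {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom


def reachable(cfg, start):
    seen, stack = set(), [start]
    while stack:
        b = stack.pop()
        if b not in seen:
            seen.add(b)
            stack.extend(cfg[b])
    return seen


def place_ldtilecfg(cfg, entry, start, def_block, uses):
    """Sink the config from `start` into successors until every shape
    input needed below the insertion block dominates that block."""
    dom = dominators(cfg, entry)
    placed, seen, work = [], set(), [start]
    while work:
        b = work.pop()
        if b in seen:
            continue
        seen.add(b)
        needed = set()
        for r in reachable(cfg, b):
            needed |= uses.get(r, set())
        if not needed:
            continue  # no tile definitions at or below this block
        if all(def_block[v] in dom[b] for v in needed):
            placed.append(b)
        elif cfg[b]:
            work.extend(cfg[b])  # split into each successor
        else:
            raise ValueError(f"cannot place ldtilecfg in {b}")
    return sorted(placed)


# Shape of the kernel() example: c's row/col are defined in each branch.
cfg = {"entry": ["then", "else"], "then": ["join"], "else": ["join"], "join": []}
def_block = {"row_a": "entry", "col_b": "entry",
             "row_c1": "then", "col_c1": "then",
             "row_c2": "else", "col_c2": "else"}
uses = {"then": {"row_a", "col_b", "row_c1", "col_c1"},
        "else": {"row_a", "col_b", "row_c2", "col_c2"}}
placement = place_ldtilecfg(cfg, "entry", "entry", def_block, uses)
```

In this model the config cannot stay in the entry block, so it is split into one ldtilecfg per branch.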

 -Hal


52 void kernel(int cond) {
53   _tile a = {row, 8};
54   _tile b = {8, col};
55
56   // copy shape to stack slot
57   // ldtilecfg a, b, c
58   if(cond) {
59     short row = get_row();
60     short col = get_row();
61     _tile c = {row, col};
62     __tile_loadd(&a, buf, STRIDE);
63     __tile_loadd(&b, buf, STRIDE);
64     __tile_loadd(&c, buf, STRIDE);
65   } else {
66     short row = get_row();
67     short col = get_row();
68     _tile c = {row, col};
69     __tile_loadd(&a, buf2, STRIDE);
70     __tile_loadd(&b, buf2, STRIDE);
71     __tile_loadd(&c, buf2, STRIDE);
72   }
73   __tile_dpbsud(&c, a, b);
74   __tile_stored(buf, STRIDE, c);
75 }

Thanks
Yuanke
From: Hal Finkel <hfinkel at anl.gov>
Sent: Friday, September 4, 2020 5:59 PM
To: Luo, Yuanke <yuanke.luo at intel.com>; Topper, Craig <craig.topper at
intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>; Philip Reames
<listmail at philipreames.com>; llvm-dev at lists.llvm.org; florian_hahn at
apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.



On 9/4/20 3:37 AM, Luo, Yuanke wrote:
Hi Hal,
Thank you for the ideas that help us to improve the design, and sorry for
replying late. There are some things I am not able to figure out, and tile RA
has some special traits.



You're quite welcome.



1.       X86RegisterInfo::getRegAllocationHints can tell RA which physical
register is preferred, but it can't force RA to allocate only the hinted
register. If the hint cannot be met, RA would allocate another register.



I addressed this below, but I could have been clearer. Like
SystemZRegisterInfo::getRegAllocationHints does sometimes, when hinting the tile
registers, the function will return true. This turns the preference into a hard
constraint, and the allocator will not allocate any other register. That's
my understanding from reading the code.



2.       The shape information should be attached to each virtual register and
to each allocated physical register. How can we store and retrieve the shape
information with limited code changes to the existing RA?



For each virtual register, getRegAllocationHints could just recompute the shape
information. If this isn't a constant-time operation, however, you'll
probably want to cache the computed shape requirements in
X86MachineFunctionInfo. You can add a map from registers to shape information in
that class, and access it from getRegAllocationHints. You can store
information about the physical registers there too.

Regarding the physical registers, you can grab this information in the
pre-rewrite phase. Override addPreRewrite in X86TargetMachine.cpp. You'll
need a small pass that records relevant information about the assignments
(which, I imagine, is the same small pass that updates the LDTILECFG
instructions). For an example of such a pass, see AMDGPU/GCNNSAReassign.cpp.



3.       When a tile register is spilled, the shape should also be bound to the
corresponding spill stack slot, so that it can be assigned a physical tile
register with the same shape.



I'm not sure what you mean. If you don't want to just be conservative
about the spill size allocation, you do need to know the shape in order to
compute the spill-location size. I assume that you can grab that out of
X86MachineFunctionInfo from storeRegToStackSlot/loadRegFromStackSlot or
eliminateFrameIndex (or copyPhysReg) as needed.
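
To make the spill-size point concrete, here is a small Python sketch. It assumes a shape is recorded as (rows, bytes per row) and relies on the AMX limits of 16 rows and 64 bytes per row; the function name spill_slot_bytes is made up for this example and is not an LLVM API.

```python
MAX_ROWS = 16    # maximum rows per AMX tile
MAX_COLSB = 64   # maximum bytes per tile row


def spill_slot_bytes(shape=None):
    """Bytes needed to spill one tile register.

    With no shape known, fall back to the conservative full-tile size;
    otherwise size the slot from the (rows, colsb) shape.
    """
    if shape is None:
        return MAX_ROWS * MAX_COLSB
    rows, colsb = shape
    if not (1 <= rows <= MAX_ROWS and 1 <= colsb <= MAX_COLSB):
        raise ValueError(f"invalid tile shape {shape}")
    return rows * colsb


conservative = spill_slot_bytes()      # full 1 KiB tile
exact = spill_slot_bytes((3, 16))      # 3 rows of 16 bytes
```

The conservative choice needs no shape lookup at all, which is why it is the simpler option if the extra stack space is acceptable.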



4.       There is no mov/copy instruction for tile register. To copy tile
register, we need to store the tile register to memory and load the data from
memory to another register. So a lot of code for live interval split in Greedy
RA is unnecessary for tile register allocation.



Yes, but this just means that you need to support copying through memory.
Setting CopyCost = -1 in X86RegisterInfo.td might help as well.



5.       The compiler can support register spills, but spills should be avoided
for performance reasons. We prefer to report a warning on a register spill, so
that users notice it and can adjust their code to avoid the spill.



If you want to emit a diagnostic, you may be able to do that from
storeRegToStackSlot. In any case, please make use of the optimization-remark
infrastructure. For an example of how to do this, see
RAGreedy::reportNumberOfSplillsReloads in RegAllocGreedy.cpp.



If there is no easy way to take advantage of the current RA infrastructure,
there are some pros to having a separate RA for tile registers.

1.       We can limit the risk of breaking RA for general registers on each arch.
If there are bugs in tile RA, only applications that use AMX are affected.



That's true. But I also worry about that. Any time you need to write
non-trivial code that will be used relatively rarely, it's likely to have
bugs that take a long time to show up. If you can plug into the generic
infrastructure, you benefit from the fact that it's highly-covered,
often-used code. Not that you might not run into bugs, of course, especially if
you're using it in a new way, but the base logic is likely to already be
robust.



2.       We can customize the special traits (config, split, spill) of tile
registers in the separate RA more freely.



True.

 -Hal



For RegAllocFast, I agree with you. Each configuration region is small, and
since performance is not the first priority there, we can insert multiple
configs, one for each small region.
Since you recommend looking at the PBQP solver, I'll take some time to
investigate it and get back to you.

Thanks
-Yuanke


From: Hal Finkel <hfinkel at anl.gov>
Sent: Monday, August 24, 2020 5:03 PM
To: Luo, Yuanke <yuanke.luo at intel.com>; Topper, Craig <craig.topper at
intel.com>; Kaylor, Andrew <andrew.kaylor at intel.com>; Philip Reames
<listmail at philipreames.com>; llvm-dev at lists.llvm.org; florian_hahn at
apple.com; Lu, Hongjiu <hongjiu.lu at intel.com>
Subject: Re: [llvm-dev] Intel AMX programming model discussion.


Hi, Yuanke,

Thanks for writing this up. Let me back up a bit because the scheme I proposed
last week doesn't work without further modification: within a particular
"configuration region" (i.e., the code in between the LDTILECFG and
the TILERELEASE (or next LDTILECFG)), each tile register can only be used with
one shape, and in addition, no register can have its shape changed without
zeroing out all of the tile registers. Thus, just using different register
classes for the different shapes, as I had suggested, isn't sufficient to
model the allocation requirements. That would not prevent the same register from
essentially being assigned to differently-shaped virtual registers with
non-overlapping live ranges within one configuration region.

Also, as you point out, when multiple non-static tile shapes are in use, if you
use one register class for each shape, you would need different register classes
for these too. Luckily, I don't think that using the separate register
classes actually buys us anything, so please disregard that suggestion of mine.
Use only one register class.

Once the configuration regions are identified, you'll know how many tile
register shapes are required. If this number is greater than eight, then
you'll need to cut the region (requiring all live tiles to be spilled and
restored around each re-configuration point). After that, we'll assume that
we have eight or fewer distinct shapes.
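
As a simplified illustration of cutting a region that needs more than eight shapes, the sketch below walks a straight-line instruction sequence and starts a new region whenever one more instruction would push the number of distinct shapes past the limit. This is a greedy linear split, not the min cut mentioned elsewhere in this thread, and it does not model the spills and restores needed at each cut. The input format (one collection of shapes per instruction) and the function name are assumptions for the example.

```python
MAX_SHAPES = 8  # one shape per physical tile register


def split_by_shape_limit(inst_shapes, limit=MAX_SHAPES):
    """Split instruction indices into consecutive regions, each using at
    most `limit` distinct tile shapes (greedy, straight-line code only)."""
    regions, current, shapes = [], [], set()
    for idx, used in enumerate(inst_shapes):
        used = set(used)
        if len(used) > limit:
            raise ValueError(f"instruction {idx} alone needs too many shapes")
        if current and len(shapes | used) > limit:
            # Cut here: live tiles would be spilled and restored
            # around a new ldtilecfg at this point.
            regions.append(current)
            current, shapes = [], set()
        current.append(idx)
        shapes |= used
    if current:
        regions.append(current)
    return regions


# Ten instructions, each defining a tile with a different row count.
insts = [[(rows, 64)] for rows in range(1, 11)]
split = split_by_shape_limit(insts)
```

Each cut corresponds to a re-configuration point, so fewer cuts means fewer full spill/restore sequences.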

Now the problem is that you need to allocate registers, satisfying all of the
usual constraints (non-overlapping live ranges, etc.), but with an additional
constraint: once a physical register has been used with some particular tile
shape, it cannot be assigned to any other tile shape.

I think that the current infrastructure can support this as follows:

 1. Add an override X86RegisterInfo::getRegAllocationHints. Like
SystemZRegisterInfo::getRegAllocationHints does sometimes, when hinting the tile
registers, the function will return true (to indicate a hard constraint). As
registers are assigned in RegAllocGreedy, getRegAllocationHints is called for
each virtual register. For virtual tile registers, look at the passed
VirtRegMap, etc. for already-assigned tile virtual registers with different
shape requirements than the current virtual register (you'll need to cache the
shape requirements in X86MachineFunctionInfo for this to be efficient), and
return a hints list consisting of all other non-reserved tile registers.

 2. To support RegAllocFast, which doesn't use getRegAllocationHints, you
would need to make the configuration regions small enough that it doesn't
matter (and if you're doing this around every tile instruction, this is
automatically true).

 3. To support RegAllocPBQP (which is likely a good thing to do, but probably
not required), I believe you can support this by adding custom constraints to
the solver (kind of like what AArch64PBQPRegAlloc.cpp does).
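
The candidate filtering described in item 1 can be modelled in a few lines of Python. This is a sketch of the idea only, not of the LLVM interface: vreg_shape and assignment are hypothetical dicts standing in for the cached shape information and the VirtRegMap, and live-range overlap checking is left to the allocator as usual.

```python
TILE_REGS = [f"tmm{i}" for i in range(8)]


def allowed_tile_regs(vreg, vreg_shape, assignment, reserved=()):
    """Physical tile registers that `vreg` may use.

    A register already assigned to a vreg with a different shape is
    excluded, since reusing it would require a reconfiguration.
    """
    blocked = {phys for other, phys in assignment.items()
               if vreg_shape[other] != vreg_shape[vreg]}
    return [r for r in TILE_REGS if r not in blocked and r not in reserved]


# The %2 / %3 case from this thread: %2 (1x1) already holds $tmm0.
vreg_shape = {"%2": (1, 1), "%3": (1, 2), "%4": (1, 1)}
assignment = {"%2": "tmm0"}
hints_3 = allowed_tile_regs("%3", vreg_shape, assignment)
hints_4 = allowed_tile_regs("%4", vreg_shape, assignment)
```

Here %3 is steered away from tmm0 because its shape differs, while %4 shares the 1x1 shape and may still reuse tmm0 once %2 is dead.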

Once the allocation process is complete, you'll need to go back and update
the LDTILECFG data to reflect the chosen shape -> register mapping.
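
For reference, the final update amounts to filling in the 64-byte memory operand of LDTILECFG from the chosen register-to-shape mapping. The sketch below assumes the palette 1 layout as I understand it from the AMX documentation (byte 0 palette, byte 1 start_row, 16-bit bytes-per-row fields from byte 16, 8-bit row counts from byte 48); the function name and the dict input are made up for the example.

```python
def build_tile_config(phys_shapes, palette=1):
    """Build the 64-byte LDTILECFG operand.

    phys_shapes maps a physical tile index (0..7) to (rows, colsb).
    Unlisted tiles stay zeroed, which leaves them unconfigured.
    """
    cfg = bytearray(64)
    cfg[0] = palette            # byte 1 (start_row) stays 0
    for reg, (rows, colsb) in phys_shapes.items():
        if not 0 <= reg < 8:
            raise ValueError(f"no such tile register: tmm{reg}")
        if not (1 <= rows <= 16 and 1 <= colsb <= 64):
            raise ValueError(f"invalid shape for tmm{reg}: {(rows, colsb)}")
        cfg[16 + 2 * reg:18 + 2 * reg] = colsb.to_bytes(2, "little")
        cfg[48 + reg] = rows
    return bytes(cfg)


# tmm0 holds a 3-row, 16-byte-wide tile; tmm2 a 3-row, 64-byte-wide tile.
config = build_tile_config({0: (3, 16), 2: (3, 64)})
```

In the compiler the row and column values may be dynamic, so the equivalent stores into the stack slot would be emitted as instructions before the LDTILECFG rather than computed at compile time.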

What I don't know, however, is how well the getRegAllocationHints method
will work. The benefit is that you don't need to write a custom
pre-allocator allocator. On the other hand, it might visit the virtual registers
to assign in a suboptimal order because it doesn't really understand the
constraint being imposed (generally, we just assign larger live ranges first).
On the other hand, it is a greedy algorithm and if you want something
systematically closer to optimal, maybe you should be using PBQP anyway. If you
do end up needing a custom allocator for these, I recommend looking at the PBQP
solver (which, as I recall, is independently reusable).

Hopefully, this is more-helpful advice.

 -Hal