thr3ads.net - llvm dev - [LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs [Feb 2014]

If this information is useful, please help other people find it:
Share via:

Stefan Hepp

2014-Feb-25 18:22 UTC

[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

Hi Tom,

Thanks a lot for your explanations, now it makes a lot more sense ;)

I had a slightly closer look at the R600 packetizer, and the issue is 
that the third LSHL instruction has both an implicit use and 
*afterwards* an implicit def of T1_XYZW. The latter def causes the 
current ScheduleDAGInstrs implementation to ignore the implicit use, 
thus the ScheduleDAG only contains an anti-dependency from the second to 
the third LSHL and the packetizer can bundle the instructions.

If the order of the implicit-defs and implicit-use would be different 
(e.g., like TableGen adds them), or if I apply my patch so that the 
implicit-use is not ignored, there is a true dependency edge between 
those two instructions, preventing bundling.

Removing only the implicit-defs alone has no effect in this example, as 
then the implicit-use of T1_XYZW is then no longer removed and a true 
dependency to the def of T1_W is added. Removing all implicit operands 
however solves that (see attached patch/hack).

I would argue that the behaviour of ScheduleDAGInstrs::buildSchedGraph 
for implicit operands is not correct, simply because changing the order 
of the operands would already prevent the R600 packetizer from bundling 
in this example.

Cheers,
  Stefan

On 2014-02-25 17:35, Tom Stellard wrote:> On Tue, Feb 25, 2014 at 04:38:46PM +0100, Stefan Hepp wrote:
>> Hello,
>>
>> The ScheduleDAGInstrs::buildSchedGraph() function creates def/uses
lists by iterating over all instruction operands and calls addPhysRegDeps() if
used post-RA (line ~770 ff.). If an operand is a def, the uses of that registers
are cleared (ScheduleDAGInstrs.cpp:333:  Uses.eraseAll(Reg); ).
>>
>> As a consequence, if an instruction has an explicit use of a register
and an implicit def of the same register, the implicit def causes the use to be
removed. In our case this leads to missing scheduling dependencies for indirect
calls where a return register is used as operand of the indirect call. Therefore
I patched the buildSchedGraph() function to iterate over all defs first, then
over all uses in a second iteration (maybe not the most efficient solution, but
it solves the issue).
>> I am attaching the corresponding patch.
>>
>> However, while upgrading our code to LLVM 3.4, I found that the
test/CodeGen/R600/store.ll test fails for -mcpu=redwood due to this change. The
difference is in the following line in @store_i8: With my patch applied I get:
>>
>>      LSHL                  * T0.W, PV.W, literal.x,
>>    3(4.203895e-45), 0(0.000000e+00)
>>      LSHL                  * T1.X, T1.W, PV.W,
>>      LSHL                  * T1.W, literal.x, T0.W,
>>    255(3.573311e-43), 0(0.000000e+00)
>>
>> versus the orginal output of LLVM 3.4:
>>
>>      LSHL                  * T0.W, PV.W, literal.x,
>>    3(4.203895e-45), 0(0.000000e+00)
>>      LSHL                    T1.X, T1.W, PV.W,
>>      LSHL                  * T1.W, literal.x, PV.W,
>>    255(3.573311e-43), 0(0.000000e+00)
>>
>> The difference is the '*' in the second LSHL and the T0.W
versus PV.W in the third LSHL.
>>
>> Looking at the output of -print-after-all, llc generates the following
MC in both plain LLVM 3.4 and LLVM 3.4 with my patch applied right after the
"Finalize machine instruction bundles" pass:
>>
>>      %T0_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %T0_W<kill>, 0,
0, 0, -1, %ALU_LITERAL_X, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 3, 0
>>      %T1_X<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %T1_W<kill>, 0,
0, 0, -1, %T0_W, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 0, 0,
%T1_XYZW<imp-def>
>>      %T1_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %ALU_LITERAL_X, 0, 0,
0, -1, %T0_W<kill>, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 255, 0,
%T1_XYZW<imp-use,kill>, %T1_XYZW<imp-def>
>>
>> Hence, my patch affects either the R600 Packetizer or the R600 Control
Flow Finalizer pass. A closer look reveals that the 17th operand (the one before
the 'pred:$PRED_SEL_OFF' operand) controls the '*' output and is
actually set for all three LSHL operands before the Packetizer/Control Flow
Finalizer pass, and from a quick glance at the AMD ISA docs I gather that
'PV' versus 'T0' seems to be related to this, but the packetizer
(?) clears this flag for the second LSHR for redwood/VLIW5 GPUs.
>>
>>
>> Now, here is my question: Is the R600/store.ll test actually correct (I
am not yet familiar enough with the R600 ISA to tell), or does the R600 backend
rely on the fact that implicit defs kill the uses of registers, or is there a
bug/.. in the R600 backend that counteracts the implicit-defs scheduling issue?
>> And is removing the explicit use by an implicit def at an instruction a
bug or a feature?
>>
>> If this is an actual bug in the ScheduleDAGInstrs implementation, I
would like to adapt my patch so that it can be applied (i.e., include a fix for
the store.ll test), although there might be a better way of doing this (e.g.,
only erase uses that are not from the same SU, or iterate over defs,
implicit-defs, uses, implicit-uses instead of iterating over all operands
twice), and I would need someone more familiar with the R600 backend to check if
the new output of LLC for store.ll is actually correct or not.
>>
>
> Hi Stefan,
>
> Thanks for the detailed analysis.  The '*' marks the end of a
packet, so
> with your patch, the sequence uses 3 packets, and it uses only 2 with
> LLVM 3.4.  So, this is a performance regression.
>
> The root of this problem is that LLVM assumes it is impossible to
> concurrently access two sub-registers of the same super-register, so
> it creates a scheduling dependency between the two sub-register defs.
>
> This is a problem with R600's VLIW architecture, because we want to be
> able to access all four sub-registers of the same packet, but scheduling
> dependencies prevent this.
>
> This is why there are implicit uses and defs of the T*_XYZW registers
> added to these instructions.  Some of these are added by LLVM passes
> and I think some are added by the R600MachineScheduler to work around
> the issue described (I'm not that familiar with the scheduler, so I may
> be wrong about this).
>
> I think there was an effort recently to try allow concurrent
> sub-register accesses, but I don't remember who was working on this.
>
> In this case, it looks like problem is that the last two instructions
> both define the same super-register, so the packetizer isn't putting
> them in the same packet.
>
> So, I think the conclusion is that your patch increases the exposure of
> the R600 to an already present problem in the register allocator /
> scheduler.  I wonder if we could work around this by stripping all the
> implicit defs and uses off the instructions once scheduling and register
> allocation has completed.  They are technically incorrect, and I'm not
> sure if any other passes use them.  Do you have any suggestions?
>
> Thanks,
> Tom
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Remove-implicit-defs-and-uses-before-packetizing-in-.patch
Type: text/x-patch
Size: 1011 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140225/71b6e693/attachment.bin>

Stefan Hepp

2014-Feb-25 18:28 UTC

head link

[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

I should probably mention that with those two patches all the LLVM tests 
are successful now for me, but it might be a good idea to adapt the 
patch to check if the removed operands are actually super-registers, in 
case the R600 backend uses implicit registers for other purposes as well.

Kind regards,
  Stefan

On 2014-02-25 19:22, Stefan Hepp wrote:> Hi Tom,
>
> Thanks a lot for your explanations, now it makes a lot more sense ;)
>
> I had a slightly closer look at the R600 packetizer, and the issue is
> that the third LSHL instruction has both an implicit use and
> *afterwards* an implicit def of T1_XYZW. The latter def causes the
> current ScheduleDAGInstrs implementation to ignore the implicit use,
> thus the ScheduleDAG only contains an anti-dependency from the second to
> the third LSHL and the packetizer can bundle the instructions.
>
> If the order of the implicit-defs and implicit-use would be different
> (e.g., like TableGen adds them), or if I apply my patch so that the
> implicit-use is not ignored, there is a true dependency edge between
> those two instructions, preventing bundling.
>
> Removing only the implicit-defs alone has no effect in this example, as
> then the implicit-use of T1_XYZW is then no longer removed and a true
> dependency to the def of T1_W is added. Removing all implicit operands
> however solves that (see attached patch/hack).
>
> I would argue that the behaviour of ScheduleDAGInstrs::buildSchedGraph
> for implicit operands is not correct, simply because changing the order
> of the operands would already prevent the R600 packetizer from bundling
> in this example.
>
> Cheers,
>   Stefan
>
> On 2014-02-25 17:35, Tom Stellard wrote:
>> On Tue, Feb 25, 2014 at 04:38:46PM +0100, Stefan Hepp wrote:
>>> Hello,
>>>
>>> The ScheduleDAGInstrs::buildSchedGraph() function creates def/uses
>>> lists by iterating over all instruction operands and calls
>>> addPhysRegDeps() if used post-RA (line ~770 ff.). If an operand is
a
>>> def, the uses of that registers are cleared
>>> (ScheduleDAGInstrs.cpp:333:  Uses.eraseAll(Reg); ).
>>>
>>> As a consequence, if an instruction has an explicit use of a
register
>>> and an implicit def of the same register, the implicit def causes
the
>>> use to be removed. In our case this leads to missing scheduling
>>> dependencies for indirect calls where a return register is used as
>>> operand of the indirect call. Therefore I patched the
>>> buildSchedGraph() function to iterate over all defs first, then
over
>>> all uses in a second iteration (maybe not the most efficient
>>> solution, but it solves the issue).
>>> I am attaching the corresponding patch.
>>>
>>> However, while upgrading our code to LLVM 3.4, I found that the
>>> test/CodeGen/R600/store.ll test fails for -mcpu=redwood due to this
>>> change. The difference is in the following line in @store_i8: With
my
>>> patch applied I get:
>>>
>>>      LSHL                  * T0.W, PV.W, literal.x,
>>>    3(4.203895e-45), 0(0.000000e+00)
>>>      LSHL                  * T1.X, T1.W, PV.W,
>>>      LSHL                  * T1.W, literal.x, T0.W,
>>>    255(3.573311e-43), 0(0.000000e+00)
>>>
>>> versus the orginal output of LLVM 3.4:
>>>
>>>      LSHL                  * T0.W, PV.W, literal.x,
>>>    3(4.203895e-45), 0(0.000000e+00)
>>>      LSHL                    T1.X, T1.W, PV.W,
>>>      LSHL                  * T1.W, literal.x, PV.W,
>>>    255(3.573311e-43), 0(0.000000e+00)
>>>
>>> The difference is the '*' in the second LSHL and the T0.W
versus PV.W
>>> in the third LSHL.
>>>
>>> Looking at the output of -print-after-all, llc generates the
>>> following MC in both plain LLVM 3.4 and LLVM 3.4 with my patch
>>> applied right after the "Finalize machine instruction
bundles" pass:
>>>
>>>      %T0_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0,
%T0_W<kill>, 0, 0, 0, -1,
>>> %ALU_LITERAL_X, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 3, 0
>>>      %T1_X<def> = LSHL_eg 0, 0, 1, 0, 0, 0,
%T1_W<kill>, 0, 0, 0, -1,
>>> %T0_W, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 0, 0,
%T1_XYZW<imp-def>
>>>      %T1_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %ALU_LITERAL_X,
0, 0, 0,
>>> -1, %T0_W<kill>, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 255, 0,
>>> %T1_XYZW<imp-use,kill>, %T1_XYZW<imp-def>
>>>
>>> Hence, my patch affects either the R600 Packetizer or the R600
>>> Control Flow Finalizer pass. A closer look reveals that the 17th
>>> operand (the one before the 'pred:$PRED_SEL_OFF' operand)
controls
>>> the '*' output and is actually set for all three LSHL
operands before
>>> the Packetizer/Control Flow Finalizer pass, and from a quick glance
>>> at the AMD ISA docs I gather that 'PV' versus 'T0'
seems to be
>>> related to this, but the packetizer (?) clears this flag for the
>>> second LSHR for redwood/VLIW5 GPUs.
>>>
>>>
>>> Now, here is my question: Is the R600/store.ll test actually
correct
>>> (I am not yet familiar enough with the R600 ISA to tell), or does
the
>>> R600 backend rely on the fact that implicit defs kill the uses of
>>> registers, or is there a bug/.. in the R600 backend that
counteracts
>>> the implicit-defs scheduling issue?
>>> And is removing the explicit use by an implicit def at an
instruction
>>> a bug or a feature?
>>>
>>> If this is an actual bug in the ScheduleDAGInstrs implementation, I
>>> would like to adapt my patch so that it can be applied (i.e.,
include
>>> a fix for the store.ll test), although there might be a better way
of
>>> doing this (e.g., only erase uses that are not from the same SU, or
>>> iterate over defs, implicit-defs, uses, implicit-uses instead of
>>> iterating over all operands twice), and I would need someone more
>>> familiar with the R600 backend to check if the new output of LLC
for
>>> store.ll is actually correct or not.
>>>
>>
>> Hi Stefan,
>>
>> Thanks for the detailed analysis.  The '*' marks the end of a
packet, so
>> with your patch, the sequence uses 3 packets, and it uses only 2 with
>> LLVM 3.4.  So, this is a performance regression.
>>
>> The root of this problem is that LLVM assumes it is impossible to
>> concurrently access two sub-registers of the same super-register, so
>> it creates a scheduling dependency between the two sub-register defs.
>>
>> This is a problem with R600's VLIW architecture, because we want to
be
>> able to access all four sub-registers of the same packet, but
scheduling
>> dependencies prevent this.
>>
>> This is why there are implicit uses and defs of the T*_XYZW registers
>> added to these instructions.  Some of these are added by LLVM passes
>> and I think some are added by the R600MachineScheduler to work around
>> the issue described (I'm not that familiar with the scheduler, so I
may
>> be wrong about this).
>>
>> I think there was an effort recently to try allow concurrent
>> sub-register accesses, but I don't remember who was working on
this.
>>
>> In this case, it looks like problem is that the last two instructions
>> both define the same super-register, so the packetizer isn't
putting
>> them in the same packet.
>>
>> So, I think the conclusion is that your patch increases the exposure of
>> the R600 to an already present problem in the register allocator /
>> scheduler.  I wonder if we could work around this by stripping all the
>> implicit defs and uses off the instructions once scheduling and
register
>> allocation has completed.  They are technically incorrect, and I'm
not
>> sure if any other passes use them.  Do you have any suggestions?
>>
>> Thanks,
>> Tom
>>
>>
>

Tom Stellard

2014-Feb-25 18:45 UTC

head link

[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

On Tue, Feb 25, 2014 at 07:22:54PM +0100, Stefan Hepp
wrote:> Hi Tom,
> 
> Thanks a lot for your explanations, now it makes a lot more sense ;)
> 
> I had a slightly closer look at the R600 packetizer, and the issue is 
> that the third LSHL instruction has both an implicit use and 
> *afterwards* an implicit def of T1_XYZW. The latter def causes the 
> current ScheduleDAGInstrs implementation to ignore the implicit use, 
> thus the ScheduleDAG only contains an anti-dependency from the second to 
> the third LSHL and the packetizer can bundle the instructions.
> 
> If the order of the implicit-defs and implicit-use would be different 
> (e.g., like TableGen adds them), or if I apply my patch so that the 
> implicit-use is not ignored, there is a true dependency edge between 
> those two instructions, preventing bundling.
> 
> Removing only the implicit-defs alone has no effect in this example, as 
> then the implicit-use of T1_XYZW is then no longer removed and a true 
> dependency to the def of T1_W is added. Removing all implicit operands 
> however solves that (see attached patch/hack).
> 
> I would argue that the behaviour of ScheduleDAGInstrs::buildSchedGraph 
> for implicit operands is not correct, simply because changing the order 
> of the operands would already prevent the R600 packetizer from bundling 
> in this example.
>
It does sound like a bug.  You might want to ping the scheduling code
owner, Andrew Trick.

> Cheers,
>   Stefan
> 
> From 451410290c9c1ad60d5b827e930ef949078ec6c6 Mon Sep 17 00:00:00 2001
> From: Stefan Hepp <stefan at stefant.org>
> Date: Tue, 25 Feb 2014 19:08:58 +0100
> Subject: [PATCH] Remove implicit defs and uses before packetizing in R600
>  backend to avoid dependencies between super-registers
> 
> ---
>  lib/Target/R600/R600Packetizer.cpp | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/lib/Target/R600/R600Packetizer.cpp
b/lib/Target/R600/R600Packetizer.cpp
> index cd9b6ea..8754bbb1 100644
> --- a/lib/Target/R600/R600Packetizer.cpp
> +++ b/lib/Target/R600/R600Packetizer.cpp
> @@ -356,6 +356,15 @@ bool
R600Packetizer::runOnMachineFunction(MachineFunction &Fn) {
>          End = MBB->end();
>          continue;
>        }
> +      for (size_t idx = MI->getNumExplicitOperands();
> +           idx < MI->getNumOperands(); idx++)
> +      {
> +        MachineOperand &MO = MI->getOperand(idx);
> +        if (MO.isImplicit()) {
This should be:

           if (MO.isImplicit() &&
               AMDGPU::R600_Reg128.contains(MO.getReg())) {


That way it will only remove super-register defs and uses.

-Tom
> +          MI->RemoveOperand(idx);
> +          idx--;
> +        }
> +      }
>        ++MI;
>      }
>    }
> -- 
> 1.8.3.2
>

Hal Finkel

2014-Feb-25 18:55 UTC

head link

[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

----- Original Message -----> From: "Stefan Hepp" <stefan at stefant.org>
> To: "Tom Stellard" <tom at stellard.net>
> Cc: "llvmdev at cs.uiuc.edu List" <llvmdev at cs.uiuc.edu>
> Sent: Tuesday, February 25, 2014 12:22:54 PM
> Subject: Re: [LLVMdev] ScheduleDAGInstrs/R600 test potential issue with
implicit defs
> 
> Hi Tom,
> 
> Thanks a lot for your explanations, now it makes a lot more sense ;)
> 
> I had a slightly closer look at the R600 packetizer, and the issue is
> that the third LSHL instruction has both an implicit use and
> *afterwards* an implicit def of T1_XYZW. The latter def causes the
> current ScheduleDAGInstrs implementation to ignore the implicit use,
> thus the ScheduleDAG only contains an anti-dependency from the second
> to
> the third LSHL and the packetizer can bundle the instructions.
> 
> If the order of the implicit-defs and implicit-use would be different
> (e.g., like TableGen adds them), or if I apply my patch so that the
> implicit-use is not ignored, there is a true dependency edge between
> those two instructions, preventing bundling.
> 
> Removing only the implicit-defs alone has no effect in this example,
> as
> then the implicit-use of T1_XYZW is then no longer removed and a true
> dependency to the def of T1_W is added. Removing all implicit
> operands
> however solves that (see attached patch/hack).
> 
> I would argue that the behaviour of
> ScheduleDAGInstrs::buildSchedGraph
> for implicit operands is not correct, simply because changing the
> order
> of the operands would already prevent the R600 packetizer from
> bundling
> in this example.
Unless there is supposed to be an ordering, I agree that this does not make
sense.

 -Hal
> 
> Cheers,
>   Stefan
> 
> On 2014-02-25 17:35, Tom Stellard wrote:
> > On Tue, Feb 25, 2014 at 04:38:46PM +0100, Stefan Hepp wrote:
> >> Hello,
> >>
> >> The ScheduleDAGInstrs::buildSchedGraph() function creates def/uses
> >> lists by iterating over all instruction operands and calls
> >> addPhysRegDeps() if used post-RA (line ~770 ff.). If an operand
> >> is a def, the uses of that registers are cleared
> >> (ScheduleDAGInstrs.cpp:333:  Uses.eraseAll(Reg); ).
> >>
> >> As a consequence, if an instruction has an explicit use of a
> >> register and an implicit def of the same register, the implicit
> >> def causes the use to be removed. In our case this leads to
> >> missing scheduling dependencies for indirect calls where a return
> >> register is used as operand of the indirect call. Therefore I
> >> patched the buildSchedGraph() function to iterate over all defs
> >> first, then over all uses in a second iteration (maybe not the
> >> most efficient solution, but it solves the issue).
> >> I am attaching the corresponding patch.
> >>
> >> However, while upgrading our code to LLVM 3.4, I found that the
> >> test/CodeGen/R600/store.ll test fails for -mcpu=redwood due to
> >> this change. The difference is in the following line in
> >> @store_i8: With my patch applied I get:
> >>
> >>      LSHL                  * T0.W, PV.W, literal.x,
> >>    3(4.203895e-45), 0(0.000000e+00)
> >>      LSHL                  * T1.X, T1.W, PV.W,
> >>      LSHL                  * T1.W, literal.x, T0.W,
> >>    255(3.573311e-43), 0(0.000000e+00)
> >>
> >> versus the orginal output of LLVM 3.4:
> >>
> >>      LSHL                  * T0.W, PV.W, literal.x,
> >>    3(4.203895e-45), 0(0.000000e+00)
> >>      LSHL                    T1.X, T1.W, PV.W,
> >>      LSHL                  * T1.W, literal.x, PV.W,
> >>    255(3.573311e-43), 0(0.000000e+00)
> >>
> >> The difference is the '*' in the second LSHL and the T0.W
versus
> >> PV.W in the third LSHL.
> >>
> >> Looking at the output of -print-after-all, llc generates the
> >> following MC in both plain LLVM 3.4 and LLVM 3.4 with my patch
> >> applied right after the "Finalize machine instruction
bundles"
> >> pass:
> >>
> >>      %T0_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0,
%T0_W<kill>, 0, 0, 0,
> >>      -1, %ALU_LITERAL_X, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 3, 0
> >>      %T1_X<def> = LSHL_eg 0, 0, 1, 0, 0, 0,
%T1_W<kill>, 0, 0, 0,
> >>      -1, %T0_W, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 0, 0,
> >>      %T1_XYZW<imp-def>
> >>      %T1_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0, %ALU_LITERAL_X,
0, 0,
> >>      0, -1, %T0_W<kill>, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF,
255,
> >>      0, %T1_XYZW<imp-use,kill>, %T1_XYZW<imp-def>
> >>
> >> Hence, my patch affects either the R600 Packetizer or the R600
> >> Control Flow Finalizer pass. A closer look reveals that the 17th
> >> operand (the one before the 'pred:$PRED_SEL_OFF' operand)
> >> controls the '*' output and is actually set for all three
LSHL
> >> operands before the Packetizer/Control Flow Finalizer pass, and
> >> from a quick glance at the AMD ISA docs I gather that 'PV'
versus
> >> 'T0' seems to be related to this, but the packetizer (?)
clears
> >> this flag for the second LSHR for redwood/VLIW5 GPUs.
> >>
> >>
> >> Now, here is my question: Is the R600/store.ll test actually
> >> correct (I am not yet familiar enough with the R600 ISA to tell),
> >> or does the R600 backend rely on the fact that implicit defs kill
> >> the uses of registers, or is there a bug/.. in the R600 backend
> >> that counteracts the implicit-defs scheduling issue?
> >> And is removing the explicit use by an implicit def at an
> >> instruction a bug or a feature?
> >>
> >> If this is an actual bug in the ScheduleDAGInstrs implementation,
> >> I would like to adapt my patch so that it can be applied (i.e.,
> >> include a fix for the store.ll test), although there might be a
> >> better way of doing this (e.g., only erase uses that are not from
> >> the same SU, or iterate over defs, implicit-defs, uses,
> >> implicit-uses instead of iterating over all operands twice), and
> >> I would need someone more familiar with the R600 backend to check
> >> if the new output of LLC for store.ll is actually correct or not.
> >>
> >
> > Hi Stefan,
> >
> > Thanks for the detailed analysis.  The '*' marks the end of a
> > packet, so
> > with your patch, the sequence uses 3 packets, and it uses only 2
> > with
> > LLVM 3.4.  So, this is a performance regression.
> >
> > The root of this problem is that LLVM assumes it is impossible to
> > concurrently access two sub-registers of the same super-register,
> > so
> > it creates a scheduling dependency between the two sub-register
> > defs.
> >
> > This is a problem with R600's VLIW architecture, because we want
to
> > be
> > able to access all four sub-registers of the same packet, but
> > scheduling
> > dependencies prevent this.
> >
> > This is why there are implicit uses and defs of the T*_XYZW
> > registers
> > added to these instructions.  Some of these are added by LLVM
> > passes
> > and I think some are added by the R600MachineScheduler to work
> > around
> > the issue described (I'm not that familiar with the scheduler, so
I
> > may
> > be wrong about this).
> >
> > I think there was an effort recently to try allow concurrent
> > sub-register accesses, but I don't remember who was working on
> > this.
> >
> > In this case, it looks like problem is that the last two
> > instructions
> > both define the same super-register, so the packetizer isn't
> > putting
> > them in the same packet.
> >
> > So, I think the conclusion is that your patch increases the
> > exposure of
> > the R600 to an already present problem in the register allocator /
> > scheduler.  I wonder if we could work around this by stripping all
> > the
> > implicit defs and uses off the instructions once scheduling and
> > register
> > allocation has completed.  They are technically incorrect, and I'm
> > not
> > sure if any other passes use them.  Do you have any suggestions?
> >
> > Thanks,
> > Tom
> >
> >
> 
> 
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

Andrew Trick

2014-Feb-26 03:39 UTC

head link

[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

On Feb 25, 2014, at 10:55 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
>> From: "Stefan Hepp" <stefan at stefant.org>
>> To: "Tom Stellard" <tom at stellard.net>
>> Cc: "llvmdev at cs.uiuc.edu List" <llvmdev at
cs.uiuc.edu>
>> Sent: Tuesday, February 25, 2014 12:22:54 PM
>> Subject: Re: [LLVMdev] ScheduleDAGInstrs/R600 test potential issue with
implicit defs
>> 
>> Hi Tom,
>> 
>> Thanks a lot for your explanations, now it makes a lot more sense ;)
>> 
>> I had a slightly closer look at the R600 packetizer, and the issue is
>> that the third LSHL instruction has both an implicit use and
>> *afterwards* an implicit def of T1_XYZW. The latter def causes the
>> current ScheduleDAGInstrs implementation to ignore the implicit use,
>> thus the ScheduleDAG only contains an anti-dependency from the second
>> to
>> the third LSHL and the packetizer can bundle the instructions.
>> 
>> If the order of the implicit-defs and implicit-use would be different
>> (e.g., like TableGen adds them), or if I apply my patch so that the
>> implicit-use is not ignored, there is a true dependency edge between
>> those two instructions, preventing bundling.
>> 
>> Removing only the implicit-defs alone has no effect in this example,
>> as
>> then the implicit-use of T1_XYZW is then no longer removed and a true
>> dependency to the def of T1_W is added. Removing all implicit
>> operands
>> however solves that (see attached patch/hack).
>> 
>> I would argue that the behaviour of
>> ScheduleDAGInstrs::buildSchedGraph
>> for implicit operands is not correct, simply because changing the
>> order
>> of the operands would already prevent the R600 packetizer from
>> bundling
>> in this example.
> 
> Unless there is supposed to be an ordering, I agree that this does not make
sense.
IIRC the implicit operands shouldn’t need to be ordered. Anyway, if the
MachineVerifier allows it, I think we should support it. That makes it difficult
to optimize liveness tracking routines like this, but so be it.

I don’t know of a clever way to handle this other than collecting all the Uses
in a vector, and inserting them in the Uses set after processing all defs.
Patches welcome.

-Andy
> 
> -Hal
> 
>> 
>> Cheers,
>>  Stefan
>> 
>> On 2014-02-25 17:35, Tom Stellard wrote:
>>> On Tue, Feb 25, 2014 at 04:38:46PM +0100, Stefan Hepp wrote:
>>>> Hello,
>>>> 
>>>> The ScheduleDAGInstrs::buildSchedGraph() function creates
def/uses
>>>> lists by iterating over all instruction operands and calls
>>>> addPhysRegDeps() if used post-RA (line ~770 ff.). If an operand
>>>> is a def, the uses of that registers are cleared
>>>> (ScheduleDAGInstrs.cpp:333:  Uses.eraseAll(Reg); ).
>>>> 
>>>> As a consequence, if an instruction has an explicit use of a
>>>> register and an implicit def of the same register, the implicit
>>>> def causes the use to be removed. In our case this leads to
>>>> missing scheduling dependencies for indirect calls where a
return
>>>> register is used as operand of the indirect call. Therefore I
>>>> patched the buildSchedGraph() function to iterate over all defs
>>>> first, then over all uses in a second iteration (maybe not the
>>>> most efficient solution, but it solves the issue).
>>>> I am attaching the corresponding patch.
>>>> 
>>>> However, while upgrading our code to LLVM 3.4, I found that the
>>>> test/CodeGen/R600/store.ll test fails for -mcpu=redwood due to
>>>> this change. The difference is in the following line in
>>>> @store_i8: With my patch applied I get:
>>>> 
>>>>     LSHL                  * T0.W, PV.W, literal.x,
>>>>   3(4.203895e-45), 0(0.000000e+00)
>>>>     LSHL                  * T1.X, T1.W, PV.W,
>>>>     LSHL                  * T1.W, literal.x, T0.W,
>>>>   255(3.573311e-43), 0(0.000000e+00)
>>>> 
>>>> versus the orginal output of LLVM 3.4:
>>>> 
>>>>     LSHL                  * T0.W, PV.W, literal.x,
>>>>   3(4.203895e-45), 0(0.000000e+00)
>>>>     LSHL                    T1.X, T1.W, PV.W,
>>>>     LSHL                  * T1.W, literal.x, PV.W,
>>>>   255(3.573311e-43), 0(0.000000e+00)
>>>> 
>>>> The difference is the '*' in the second LSHL and the
T0.W versus
>>>> PV.W in the third LSHL.
>>>> 
>>>> Looking at the output of -print-after-all, llc generates the
>>>> following MC in both plain LLVM 3.4 and LLVM 3.4 with my patch
>>>> applied right after the "Finalize machine instruction
bundles"
>>>> pass:
>>>> 
>>>>     %T0_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0,
%T0_W<kill>, 0, 0, 0,
>>>>     -1, %ALU_LITERAL_X, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 3,
0
>>>>     %T1_X<def> = LSHL_eg 0, 0, 1, 0, 0, 0,
%T1_W<kill>, 0, 0, 0,
>>>>     -1, %T0_W, 0, 0, 0, -1, 1, pred:%PRED_SEL_OFF, 0, 0,
>>>>     %T1_XYZW<imp-def>
>>>>     %T1_W<def> = LSHL_eg 0, 0, 1, 0, 0, 0,
%ALU_LITERAL_X, 0, 0,
>>>>     0, -1, %T0_W<kill>, 0, 0, 0, -1, 1,
pred:%PRED_SEL_OFF, 255,
>>>>     0, %T1_XYZW<imp-use,kill>, %T1_XYZW<imp-def>
>>>> 
>>>> Hence, my patch affects either the R600 Packetizer or the R600
>>>> Control Flow Finalizer pass. A closer look reveals that the
17th
>>>> operand (the one before the 'pred:$PRED_SEL_OFF'
operand)
>>>> controls the '*' output and is actually set for all
three LSHL
>>>> operands before the Packetizer/Control Flow Finalizer pass, and
>>>> from a quick glance at the AMD ISA docs I gather that
'PV' versus
>>>> 'T0' seems to be related to this, but the packetizer
(?) clears
>>>> this flag for the second LSHR for redwood/VLIW5 GPUs.
>>>> 
>>>> 
>>>> Now, here is my question: Is the R600/store.ll test actually
>>>> correct (I am not yet familiar enough with the R600 ISA to
tell),
>>>> or does the R600 backend rely on the fact that implicit defs
kill
>>>> the uses of registers, or is there a bug/.. in the R600 backend
>>>> that counteracts the implicit-defs scheduling issue?
>>>> And is removing the explicit use by an implicit def at an
>>>> instruction a bug or a feature?
>>>> 
>>>> If this is an actual bug in the ScheduleDAGInstrs
implementation,
>>>> I would like to adapt my patch so that it can be applied (i.e.,
>>>> include a fix for the store.ll test), although there might be a
>>>> better way of doing this (e.g., only erase uses that are not
from
>>>> the same SU, or iterate over defs, implicit-defs, uses,
>>>> implicit-uses instead of iterating over all operands twice),
and
>>>> I would need someone more familiar with the R600 backend to
check
>>>> if the new output of LLC for store.ll is actually correct or
not.
>>>> 
>>> 
>>> Hi Stefan,
>>> 
>>> Thanks for the detailed analysis.  The '*' marks the end of
a
>>> packet, so
>>> with your patch, the sequence uses 3 packets, and it uses only 2
>>> with
>>> LLVM 3.4.  So, this is a performance regression.
>>> 
>>> The root of this problem is that LLVM assumes it is impossible to
>>> concurrently access two sub-registers of the same super-register,
>>> so
>>> it creates a scheduling dependency between the two sub-register
>>> defs.
>>> 
>>> This is a problem with R600's VLIW architecture, because we
want to
>>> be
>>> able to access all four sub-registers of the same packet, but
>>> scheduling
>>> dependencies prevent this.
>>> 
>>> This is why there are implicit uses and defs of the T*_XYZW
>>> registers
>>> added to these instructions.  Some of these are added by LLVM
>>> passes
>>> and I think some are added by the R600MachineScheduler to work
>>> around
>>> the issue described (I'm not that familiar with the scheduler,
so I
>>> may
>>> be wrong about this).
>>> 
>>> I think there was an effort recently to try allow concurrent
>>> sub-register accesses, but I don't remember who was working on
>>> this.
>>> 
>>> In this case, it looks like problem is that the last two
>>> instructions
>>> both define the same super-register, so the packetizer isn't
>>> putting
>>> them in the same packet.
>>> 
>>> So, I think the conclusion is that your patch increases the
>>> exposure of
>>> the R600 to an already present problem in the register allocator /
>>> scheduler.  I wonder if we could work around this by stripping all
>>> the
>>> implicit defs and uses off the instructions once scheduling and
>>> register
>>> allocation has completed.  They are technically incorrect, and
I'm
>>> not
>>> sure if any other passes use them.  Do you have any suggestions?
>>> 
>>> Thanks,
>>> Tom
>>> 
>>> 
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> 
> 
> -- 
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20140225/320c6081/attachment.html>

llvm dev - Feb 2014 - [LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs

[LLVMdev] ScheduleDAGInstrs/R600 test potential issue with implicit defs