On Apr 25, 2013, at 8:51 AM, "Relph, Richard" <Richard.Relph at amd.com> wrote:
> We have a function that has 256 loads and 256 fmuladds. This block of operations is bounded at either end by an OpenCL barrier (an AMDIL fence instruction). The loads and multiply/adds are ordinarily interleaved… that is, the IR going into code generation looks like:
>   %39 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 0), align 4
>   %40 = call float @llvm.fmuladd.f32(float %37, float %39, float %c0.037) nounwind
>   %41 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 1), align 4
>   %42 = call float @llvm.fmuladd.f32(float %37, float %41, float %c1.036) nounwind
> … and 254 more of these pairs.
>
> %39 and %41 (and 254 more loads) are dead after they are used in the immediately following fmuladd.
>
> RegReductionPQBase::getNodePriority() (in CodeGen/SelectionDAG/ScheduleDAGRRList.cpp) normally returns the SethiUllmanNumber for a node, but there are a few special cases. ISD::TokenFactor and ISD::CopyToReg return 0, to push them closer to their uses, and similarly for TargetOpcode::EXTRACT_SUBREG, TargetOpcode::SUBREG_TO_REG, and TargetOpcode::INSERT_SUBREG.
> There is also a special case for instructions at the beginning or end of a computational chain, based on whether the instruction has 0 predecessors or 0 successors.
The TargetOpcode checks are likely incorrect because they're not checking getMachineOpcode(); it's just that no one wants to change this nearly obsolete code and hunt down regressions. I would be happy to remove those checks altogether, though, if they cause problems. In your case I think it's unrelated.
> Our fence instruction has 2 (constant) predecessors and no successors. This causes getNodePriority() to think it is the end of a computational chain and return 0xffff instead of the normal SethiUllmanNumber for the node, to try to get the instruction closer to where its constants are manifested.
> The result coming out of code generation is that the loads and fmuladds are separated… We end up with a block of 256 loads, then the fence instruction that was at the end of the block, then the 256 fmuladd operations.
> This causes the live ranges of all 256 loads to increase GREATLY, driving register pressure up so much that we end up with absolutely awful performance.
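The pressure blow-up is easy to see with a toy model (illustrative C++, not LLVM code): treat each load as a def and each fmuladd as the kill of that def, ignore the accumulator live-ranges (identical in both orderings), and compare the peak number of simultaneously-live values under the two schedules.

```cpp
#include <algorithm>
#include <vector>

// Toy model of the effect described above: Def = a load producing a
// value, Kill = the fmuladd that consumes it.
enum class Ev { Def, Kill };

// Peak number of values live at once for a given schedule.
unsigned peakPressure(const std::vector<Ev> &Sched) {
  unsigned Live = 0, Peak = 0;
  for (Ev E : Sched) {
    if (E == Ev::Def)
      Peak = std::max(Peak, ++Live);
    else
      --Live;
  }
  return Peak;
}

// Load/fmuladd pairs kept adjacent: each value dies immediately.
std::vector<Ev> interleaved(unsigned N) {
  std::vector<Ev> S;
  for (unsigned i = 0; i < N; ++i) {
    S.push_back(Ev::Def);
    S.push_back(Ev::Kill);
  }
  return S;
}

// All loads hoisted above the fence, all fmuladds after it.
std::vector<Ev> grouped(unsigned N) {
  std::vector<Ev> S(N, Ev::Def);
  S.insert(S.end(), N, Ev::Kill);
  return S;
}
```

With N = 256, the interleaved schedule peaks at 1 live value while the grouped schedule peaks at 256 — far beyond any realistic register file, hence the spilling and the awful performance.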
>
> We have a local quick fix for this (return the SethiUllmanNumber), but I wanted to get the advice of the list because I’d rather not carry local modifications to “target independent” code generation.
> Also, it feels like we must be doing something wrong, either in describing our target or in later code generation, to get this bad a result.
As we discussed off-list, please use -pre-RA-sched=source if possible, and
introduce target-specific scheduling in the MachineScheduler pass. There are
multiple ways to "plug in" to MachineScheduler.
-pre-RA-sched=source is currently being fixed to work as advertised; a patch is in the works, and I expect to see it posted fairly soon. It's still usable as-is, but doesn't always preserve ordering.
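For a quick experiment, the flag goes straight on the llc command line (or through clang via -mllvm). A sketch of an invocation — the input file name here is a placeholder:

```shell
# Use source-order pre-RA scheduling; sgemm.ll is a hypothetical input.
llc -pre-RA-sched=source sgemm.ll -o sgemm.s
```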
-Andy