On Apr 25, 2013, at 8:51 AM, "Relph, Richard" <Richard.Relph at amd.com> wrote:
> We have a function that has 256 loads and 256 fmuladds. This block of operations is bounded at either end by an OpenCL barrier (an AMDIL fence instruction). The loads and multiply/adds are ordinarily interleaved… that is, the IR going into code generation looks like:
>   %39 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 0), align 4
>   %40 = call float @llvm.fmuladd.f32(float %37, float %39, float %c0.037) nounwind
>   %41 = load float addrspace(3)* getelementptr inbounds ([16 x [17 x float]] addrspace(3)* @sgemm.b, i32 0, i32 0, i32 1), align 4
>   %42 = call float @llvm.fmuladd.f32(float %37, float %41, float %c1.036) nounwind
> … and 254 more of these pairs.
>
> %39 and %41 (and 254 more loads) are dead after they are used in the immediately following fmuladd.
>
> RegReductionPQBase::getNodePriority() (in CodeGen/SelectionDAG/ScheduleDAGRRList.cpp) normally returns the SethiUllmanNumber for a node, but there are a few special cases. ISD::TokenFactor and ISD::CopyToReg return 0, to push them closer to their uses, and similarly for TargetOpcode::EXTRACT_SUBREG, TargetOpcode::SUBREG_TO_REG, and TargetOpcode::INSERT_SUBREG.
> There is also a special case for instructions at the beginning or end of a computational chain, based on whether the instruction has 0 predecessors or 0 successors.
The TargetOpcode checks are likely incorrect because they're not checking getMachineOpcode(); it's just that no one wants to change this nearly obsolete code and hunt down regressions. I would be happy to remove those checks altogether, though, if they cause problems. In your case I think it's unrelated.
> Our fence instruction has 2 (constant) predecessors and no successors. This causes getNodePriority() to think it is the end of a computational chain and return 0xffff instead of the normal SethiUllmanNumber for the node, to try to get the instruction closer to where its constants are manifested.
> The result coming out of code generation is that the loads and fmuladds are separated… We end up with a block of 256 loads, then the fence instruction that was at the end of the block, then the 256 fmuladd operations.
> This causes the live ranges of all 256 loads to increase GREATLY, driving register pressure up so much that we end up with absolutely awful performance.
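The pressure blow-up is easy to see with a toy model (illustrative C++, not LLVM code): treat each load as a def and each fmuladd as the kill of that def, ignore the accumulator live-ranges (identical in both orderings), and compare the peak number of simultaneously-live values under the two schedules.

```cpp
#include <algorithm>
#include <vector>

// Toy model of the effect described above: Def = a load producing a
// value, Kill = the fmuladd that consumes it.
enum class Ev { Def, Kill };

// Peak number of values live at once for a given schedule.
unsigned peakPressure(const std::vector<Ev> &Sched) {
  unsigned Live = 0, Peak = 0;
  for (Ev E : Sched) {
    if (E == Ev::Def)
      Peak = std::max(Peak, ++Live);
    else
      --Live;
  }
  return Peak;
}

// Load/fmuladd pairs kept adjacent: each value dies immediately.
std::vector<Ev> interleaved(unsigned N) {
  std::vector<Ev> S;
  for (unsigned i = 0; i < N; ++i) {
    S.push_back(Ev::Def);
    S.push_back(Ev::Kill);
  }
  return S;
}

// All loads hoisted above the fence, all fmuladds after it.
std::vector<Ev> grouped(unsigned N) {
  std::vector<Ev> S(N, Ev::Def);
  S.insert(S.end(), N, Ev::Kill);
  return S;
}
```

With N = 256, the interleaved schedule peaks at 1 live value while the grouped schedule peaks at 256 — far beyond any realistic register file, hence the spilling and the awful performance.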
>
> We have a local quick fix for this (return the SethiUllmanNumber), but I wanted to get the advice of the list because I’d rather not carry local modifications to “target independent” code generation.
> Also, it feels like we must be doing something wrong, either in describing our target or in later code generation, to get this bad a result.
As we discussed off-list, please use -pre-RA-sched=source if possible, and
introduce target-specific scheduling in the MachineScheduler pass. There are
multiple ways to "plug in" to MachineScheduler.
-pre-RA-sched=source is currently being fixed to work as advertised; a patch is in the works, and I expect to see it posted fairly soon. It's still usable as-is, but doesn't always preserve ordering.
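For a quick experiment, the flag goes straight on the llc command line (or through clang via -mllvm). A sketch of an invocation — the input file name here is a placeholder:

```shell
# Use source-order pre-RA scheduling; sgemm.ll is a hypothetical input.
llc -pre-RA-sched=source sgemm.ll -o sgemm.s
```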
-Andy