Hi Hal,
We came across many missed opportunities to generate madd instructions in
AArch64.
In this particular test case, a mul instruction that feeds into a madd
instruction as its accumulator should have a reduced latency. But when
dumping MachineCombiner decisions we do not see the reduced latency value
(should be 1).
We debugged a bit and noticed the difference in how NewRoot and Root latency
and depth are computed. We were not sure they were leading to the same
queries to the SchedMod. You can see for example that in one instance it
queries with the machine instruction pointer, in the other with only the
machine instruction opcode:
TSchedModel.computeInstrLatency(Root)
TSchedModel.computeInstrLatency(NewRoot->getOpcode())
We made a local change that adds the new candidate instructions to the basic
block (with different output virtual register for the NewRoot),  and then
ran the trace analysis on the modified BB. This allowed us to use the same
code for computing latency and depth of both NewRoot and Root. Still we did
not get the reduced latency value for the mult.
We asked Dave Estes to take a look at our reduced test and he may have found
a bug in SchedAlias,. He will bring it up to the community in a separate
discussion. With his fix now we can see the reduced mult latency and madd
being generated for the reduced test case. We are verifying the fix with
other complex tests.
But we still are left with this question about MachineCombiner: Shouldn't
the trace analysis run on a modified basic block that contains the new
candidate instructions and NewRoot/Root latency and depth be computed with
the same SchedMod apis? In our local change we rerun the trace analysis for
each candidate instruction we add to the basic block, and remove them from
the basic block if they are not profitable.
Thanks,
Ana.
-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
Behalf Of Hal Finkel
Sent: Wednesday, February 04, 2015 5:33 PM
To: Mandeep Singh Grang
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Question on Machine Combiner Pass
----- Original Message -----> From: "Mandeep Singh Grang" <mgrang at codeaurora.org>
> To: llvmdev at cs.uiuc.edu
> Sent: Wednesday, February 4, 2015 12:58:27 PM
> Subject: Re: [LLVMdev] Question on Machine Combiner Pass
> 
> 
> 
> 
> 
> Ping
> 
> From: Mandeep Singh Grang [mailto:mgrang at codeaurora.org]
> Sent: Tuesday, February 03, 2015 4:34 PM
> To: 'llvmdev at cs.uiuc.edu'
> Cc: 'ghoflehner at apple.com'; 'apazos at codeaurora.org'; 
> mgrang at codeaurora.org
> Subject: Question on Machine Combiner Pass
> 
> 
> 
> Hi,
> 
> 
> 
> In the file lib/CodeGen/MachineCombiner.cpp I see that in the function 
> MachineCombiner::preservesCriticalPathLen
> we try to determine whether the new combined instruction lengthens the 
> critical path or not.
> 
> 
> 
> In order to do this we compute the depth and latency for the current 
> instruction (MUL+ADD) and the alternate instruction (MADD).
> 
> But we call two different set of APIs for the current and new
> instructions:
> 
> 
> 
> For new instruction we use:
> 
> unsigned NewRootDepth = getDepth(InsInstrs, InstrIdxForVirtReg, 
> BlockTrace);
> 
> unsigned NewRootLatency = getLatency(Root, NewRoot, BlockTrace);
> 
> 
> 
> While for the current instruction we use:
> 
> unsigned RootDepth = BlockTrace.getInstrCycles(Root).Depth;
> 
> unsigned RootLatency = TSchedModel.computeInstrLatency(Root);
> 
The BlockTrace comes from MachineTraceMetrics, which is an analysis pass,
and thus might have its data pre-computed. This only strictly applies to the
current instruction sequence. We need to use a different method to compute
information for potential new instructions. Are you finding that these
methods compute inconsistent results?
 -Hal
> 
> 
> This is related to the following commit:
> 
> commit e4fa341dde3c9521b7f11bd53ecdcbeb3f8fcbda
> 
> Author: Gerolf Hoflehner < ghoflehner at apple.com >
> 
> Date: Thu Aug 7 21:40:58 2014 +0000
> 
> MachineCombiner Pass for selecting faster instruction sequence on
> AArch64
> 
> 
> For this example code sequence:
> 
> %mul = mul nuw nsw i32 %conv2, %conv
> 
> %mul7 = mul nuw nsw i32 %conv6, %conv4
> 
> %add = add nuw nsw i32 %mul7, %mul
> 
> ret i32 %add
> 
> 
> 
> We generate the following assembly:
> mul w8, w0, w1
> 
> mul w9, w2, w3
> 
> add w0, w9, w8
> 
> ret
> 
> 
> 
> Whereas I expected the MUL+ADD to be combined to MADD otherwise I see 
> degraded performance in several of my tests.
> 
> Could someone please explain why we use two different APIs to compute 
> depth and latency for the two instructions?
> 
> Also if I try to use the same APIs for both then the depth for the 
> NewRoot is 0.
> 
> 
> 
> Thanks,
> 
> Mandeep
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
_______________________________________________
LLVM Developers mailing list
LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev