Andrea Di Biagio
2013-Dec-05 15:35 UTC
[LLVMdev] X86 - Help on fixing a poor code generation bug
Hi all,

I noticed that the x86 backend tends to emit unnecessary vector insert
instructions immediately after SSE scalar fp instructions like
addss/mulss.

For example:

/////////////////////////////////
__m128 foo(__m128 A, __m128 B) {
  return _mm_add_ss(A, B);
}
/////////////////////////////////

produces the sequence:

  addss %xmm0, %xmm1
  movss %xmm1, %xmm0

which could easily be optimized into:

  addss %xmm1, %xmm0

The first step is to understand why the compiler produces the
redundant instructions.

For the example above, the DAG sequence looks like this:

  a0     : f32   = extract_vector_elt ( A, 0 )
  b0     : f32   = extract_vector_elt ( B, 0 )
  r0     : f32   = fadd a0, b0
  result : v4f32 = insert_vector_elt ( A, r0, 0 )

(with A and B of type v4f32).

The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
X86ISD::INSERTPS, depending on the target's SSE feature level.

To start, I checked whether this bug was caused simply by the lack of
specific tablegen patterns to match the complex sequence described
above into a single ADDSS instruction.

However, X86InstrSSE.td already defines X86::ADDSSrr as a commutative
SSE scalar fp add instruction (operating on f32 ValueTypes).
X86::ADDSSrr is used to select 'fadd' nodes and is emitted as 'addss'
in assembly.

At this stage the MOVSS/INSERTPS is still required, since ADDSS alone
would not be equivalent to the hardware 'movss' instruction.

I then started investigating the possibility of adding a pass that
runs at the 'postRegAlloc' stage.

Before RegAlloc it may not be safe to remove the redundant MOVSSrr
because of the TwoAddressInstruction pass, which may decide to commute
the operands of the ADDSS/MULSS.

It is possible to write a pass that scans through each basic block in
a function looking for opportunities to fold instructions based on the
following patterns:

//////
  B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
  A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
    ==>
  A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
//////

//////
  B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
  A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
    ==>
  A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
//////

with 'NAME' in {ADD, SUB, MUL, DIV}.

I managed to create a very simple prototype pass which simplifies the
code according to pattern #1 (see the attached patch).
However, identifying patterns at the postRegAlloc stage is definitely
more complicated than matching patterns on DAGs: instructions are no
longer in SSA form; the pass must keep track of defs and kills and
leave the liveness info in a consistent state; some instructions may
have side effects; and so on.
Last but not least, the register assignment chosen by the register
allocator and the way instructions are scheduled may strongly affect
the ability of a pass like this to pattern-match specific sequences of
instructions.

In conclusion, my questions are:
Would a patch be acceptable that introduces a new MachineFunctionPass
that runs at the 'postRegAlloc' stage?
If not, is there a better way to fix this bug?
Thanks,
Andrea Di Biagio
SN Systems - Sony Computer Entertainment Group

-------------- next part --------------
Index: lib/Target/X86/X86.h
===================================================================
--- lib/Target/X86/X86.h	(revision 196508)
+++ lib/Target/X86/X86.h	(working copy)
@@ -47,6 +47,8 @@
 ///
 FunctionPass *createX86FloatingPointStackifierPass();
 
+FunctionPass *createX86FoldRedundantInsertsPass();
+
 /// createX86IssueVZeroUpperPass - This pass inserts AVX vzeroupper instructions
 /// before each call to avoid transition penalty between functions encoded with
 /// AVX and SSE.
Index: lib/Target/X86/X86TargetMachine.cpp
===================================================================
--- lib/Target/X86/X86TargetMachine.cpp	(revision 196508)
+++ lib/Target/X86/X86TargetMachine.cpp	(working copy)
@@ -197,6 +197,10 @@
 
 bool X86PassConfig::addPostRegAlloc() {
   addPass(createX86FloatingPointStackifierPass());
+
+  if (getOptLevel() != CodeGenOpt::None)
+    addPass(createX86FoldRedundantInsertsPass());
+
   return true;  // -print-machineinstr should print after this.
 }
@@ -222,7 +226,6 @@
     addPass(createX86FixupLEAs());
     ShouldPrint = true;
   }
-
   return ShouldPrint;
 }
 
Index: lib/Target/X86/CMakeLists.txt
===================================================================
--- lib/Target/X86/CMakeLists.txt	(revision 196508)
+++ lib/Target/X86/CMakeLists.txt	(working copy)
@@ -18,6 +18,7 @@
   X86CodeEmitter.cpp
   X86FastISel.cpp
   X86FloatingPoint.cpp
+  X86FoldRedundantInserts.cpp
   X86FrameLowering.cpp
   X86ISelDAGToDAG.cpp
   X86ISelLowering.cpp
Index: lib/Target/X86/X86FoldRedundantInserts.cpp
===================================================================
--- lib/Target/X86/X86FoldRedundantInserts.cpp	(revision 0)
+++ lib/Target/X86/X86FoldRedundantInserts.cpp	(revision 0)
@@ -0,0 +1,185 @@
+//===-- X86FoldRedundantInserts - Remove redundant vector inserts -----------===//
+//
+//                     The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines a pass that searches for opportunities to fold
+// the following patterns:
+//
+//   1)  B = #OPNAME#SSrr B, A
+//       A = #OPINSERT#SSrr A, B<kill>
+//       ==>
+//       A = #OPNAME#SSrr A, B<kill>
+//
+//   2)  B = #OPNAME#PSrr B, A
+//       A = #OPINSERT#SSrr A, B<kill>
+//       ==>
+//       A = #OPNAME#SSrr A, B<kill>
+//
+// With #OPNAME#   in { ADD, MUL, SUB, DIV }
+//      #OPINSERT# in { MOV, INSERT }
+//
+// FIXME: this is just a simple prototype. It only implements pattern 1.
+//
+//===----------------------------------------------------------------------===//
+
+#define DEBUG_TYPE "x86-codegen"
+#include "X86.h"
+#include "X86InstrInfo.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/Statistic.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/Passes.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Target/TargetInstrInfo.h"
+#include "llvm/Target/TargetMachine.h"
+
+using namespace llvm;
+
+STATISTIC(NumDeletedInserts, "Number of inserts deleted");
+
+namespace {
+  class X86FoldRedundantInserts : public MachineFunctionPass {
+    const TargetRegisterInfo *TRI;
+    const TargetInstrInfo *TII;
+    MachineRegisterInfo *MRI;
+
+  public:
+    static char ID; // Pass identification, replacement for typeid
+    X86FoldRedundantInserts() : MachineFunctionPass(ID) { }
+
+    virtual void getAnalysisUsage(AnalysisUsage &AU) const {
+      AU.setPreservesCFG();
+      AU.addPreservedID(MachineLoopInfoID);
+      AU.addPreservedID(MachineDominatorsID);
+      MachineFunctionPass::getAnalysisUsage(AU);
+    }
+
+    virtual bool runOnMachineFunction(MachineFunction &MF);
+
+    virtual const char *getPassName() const {
+      return "X86 Remove Redundant Inserts";
+    }
+
+  private:
+    // Candidates - Keep track of potential candidates for removal.
+    DenseMap<unsigned, MachineInstr*> Candidates;
+
+    bool isValidINSERTPS(const MachineInstr &MI) const;
+    bool isValidMOVSS(const MachineInstr &MI) const;
+    bool isValidCandidate(const MachineInstr &MI) const;
+    bool SimplifyBlock(MachineBasicBlock &MBB);
+  };
+}
+
+char X86FoldRedundantInserts::ID = 0;
+
+FunctionPass *llvm::createX86FoldRedundantInsertsPass() {
+  return new X86FoldRedundantInserts();
+}
+
+bool X86FoldRedundantInserts::isValidINSERTPS(const MachineInstr &MI) const {
+  const MachineOperand &Op2 = MI.getOperand(2);
+  const MachineOperand &Op3 = MI.getOperand(3);
+
+  return ((Op3.getType() == MachineOperand::MO_Immediate) &&
+          (Op3.getImm() == 0) &&
+          (Op2.isKill() && !MRI->isReserved(Op2.getReg())));
+}
+
+bool X86FoldRedundantInserts::isValidMOVSS(const MachineInstr &MI) const {
+  const MachineOperand &MO = MI.getOperand(2);
+  unsigned Reg = MO.getReg();
+
+  return !MRI->isReserved(Reg) && MO.isKill();
+}
+
+bool X86FoldRedundantInserts::isValidCandidate(const MachineInstr &MI) const {
+  switch (MI.getOpcode()) {
+  default : return false;
+  case X86::MOVSSrr :
+    return isValidMOVSS(MI);
+  case X86::INSERTPSrr :
+    return isValidINSERTPS(MI);
+  }
+
+  llvm_unreachable(0);
+}
+
+bool X86FoldRedundantInserts::SimplifyBlock(MachineBasicBlock &MBB) {
+  bool Changed = false;
+  MachineBasicBlock::iterator I = MBB.end(), E = MBB.begin();
+
+  while (I != E) {
+    --I;
+    if (I->hasUnmodeledSideEffects()) {
+      // Invalidate the Map.
+      Candidates.clear();
+      continue;
+    }
+
+    if (isValidCandidate(*I)) {
+      // Add this entry to the Candidates map.
+      unsigned RegDef = I->getOperand(0).getReg();
+      Candidates.insert(std::make_pair(RegDef, I));
+      continue;
+    }
+
+    // Simplify according to the pattern
+    //   X<def,tied1> = #OPCODE#SSrr X<kill,tied0>, Y
+    //   Y<def,tied1> = #INSERT#SSrr Y<kill,tied0>, X<kill>
+    //   ==>
+    //   Y<def,tied1> = #OPCODE#SSrr Y<kill,tied0>, X<kill>
+    if (I->getOpcode() == X86::ADDSSrr ||
+        I->getOpcode() == X86::MULSSrr) {
+      unsigned Def = I->getOperand(2).getReg();
+
+      DenseMap<unsigned, MachineInstr*>::iterator CI = Candidates.find(Def);
+      if (CI != Candidates.end()) {
+        MachineOperand &MO = I->getOperand(0);
+        MachineInstr *CMI = CI->second;
+
+        if (MO.getReg() == CMI->getOperand(2).getReg()) {
+          DEBUG(dbgs() << "Commuting: " << *I);
+          MachineInstr *NewMI = TII->commuteInstruction(I, false);
+          assert(NewMI && "failed to commute the operands!");
+
+          DEBUG(dbgs() << "NewMI is: " << *NewMI);
+          DEBUG(dbgs() << "Removing redundant insert: " << *CMI);
+          CMI->eraseFromParent();
+          NumDeletedInserts++;
+        }
+      }
+
+      Candidates.erase(Def);
+      continue;
+    }
+
+    // FIXME: simplify more patterns and update the Candidates
+    // set based on what registers are clobbered by the current
+    // instruction instead of just clearing the map of candidates.
+    Candidates.clear();
+  }
+
+  return Changed;
+}
+
+bool X86FoldRedundantInserts::runOnMachineFunction(MachineFunction &MF) {
+  bool Changed = false;
+
+  TRI = MF.getTarget().getRegisterInfo();
+  TII = MF.getTarget().getInstrInfo();
+  MRI = &MF.getRegInfo();
+
+  for (MachineFunction::iterator I = MF.begin(), E = MF.end(); I != E; ++I)
+    Changed |= SimplifyBlock(*I);
+
+  return Changed;
+}
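An illustrative regression test for the desired codegen could look like the
sketch below. The test file name and CHECK lines are hypothetical (they are
not part of the attached patch) and assume an x86-64 SysV target, where A
arrives in %xmm0 and B in %xmm1.

; Hypothetical test, e.g. test/CodeGen/X86/sse-scalar-fp-arith.ll
; RUN: llc -mtriple=x86_64-unknown-unknown < %s | FileCheck %s

define <4 x float> @test_add_ss(<4 x float> %A, <4 x float> %B) {
; CHECK-LABEL: test_add_ss
; CHECK:       addss %xmm1, %xmm0
; CHECK-NOT:   movss
; CHECK:       ret
  %1 = extractelement <4 x float> %B, i32 0
  %2 = extractelement <4 x float> %A, i32 0
  %3 = fadd float %2, %1
  %4 = insertelement <4 x float> %A, float %3, i32 0
  ret <4 x float> %4
}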
Cameron McInally
2013-Dec-05 16:54 UTC
[LLVMdev] X86 - Help on fixing a poor code generation bug
Hey Andrea,

On Thu, Dec 5, 2013 at 10:35 AM, Andrea Di Biagio
<andrea.dibiagio at gmail.com> wrote:
> Hi all,
>
> I noticed that the x86 backend tends to emit unnecessary vector insert
> instructions immediately after SSE scalar fp instructions like
> addss/mulss.
>
> For example:
> /////////////////////////////////
> __m128 foo(__m128 A, __m128 B) {
>   return _mm_add_ss(A, B);
> }
> /////////////////////////////////
>
> produces the sequence:
>   addss %xmm0, %xmm1
>   movss %xmm1, %xmm0
>
> which could easily be optimized into:
>   addss %xmm1, %xmm0
>
> The first step is to understand why the compiler produces the
> redundant instructions.
>
> For the example above, the DAG sequence looks like this:
>
>   a0     : f32   = extract_vector_elt ( A, 0 )
>   b0     : f32   = extract_vector_elt ( B, 0 )
>   r0     : f32   = fadd a0, b0
>   result : v4f32 = insert_vector_elt ( A, r0, 0 )
>
> (with A and B of type v4f32).
>
> The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
> X86ISD::INSERTPS, depending on the target's SSE feature level.
>
> To start, I checked whether this bug was caused simply by the lack of
> specific tablegen patterns to match the complex sequence described
> above into a single ADDSS instruction.
>
> However, X86InstrSSE.td already defines X86::ADDSSrr as a commutative
> SSE scalar fp add instruction (operating on f32 ValueTypes).
> X86::ADDSSrr is used to select 'fadd' nodes and is emitted as 'addss'
> in assembly.
>
> At this stage the MOVSS/INSERTPS is still required, since ADDSS alone
> would not be equivalent to the hardware 'movss' instruction.
>
> I then started investigating the possibility of adding a pass that
> runs at the 'postRegAlloc' stage.
>
> Before RegAlloc it may not be safe to remove the redundant MOVSSrr
> because of the TwoAddressInstruction pass, which may decide to commute
> the operands of the ADDSS/MULSS.
>
> It is possible to write a pass that scans through each basic block in
> a function looking for opportunities to fold instructions based on the
> following patterns:
>
> //////
>   B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
>   A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>     ==>
>   A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
> //////
>
> //////
>   B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
>   A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>     ==>
>   A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
> //////
>
> with 'NAME' in {ADD, SUB, MUL, DIV}.
>
> I managed to create a very simple prototype pass which simplifies the
> code according to pattern #1 (see the attached patch).
> However, identifying patterns at the postRegAlloc stage is definitely
> more complicated than matching patterns on DAGs: instructions are no
> longer in SSA form; the pass must keep track of defs and kills and
> leave the liveness info in a consistent state; some instructions may
> have side effects; and so on.
> Last but not least, the register assignment chosen by the register
> allocator and the way instructions are scheduled may strongly affect
> the ability of a pass like this to pattern-match specific sequences of
> instructions.
>
> In conclusion, my questions are:
> Would a patch be acceptable that introduces a new MachineFunctionPass
> that runs at the 'postRegAlloc' stage?
> If not, is there a better way to fix this bug?
>
> Thanks,
> Andrea Di Biagio
> SN Systems - Sony Computer Entertainment Group

With my compiler, which has a proprietary frontend and builtin lowerings,
we do produce the expected assembly:

> # BB#0:              # %file test.c, line 3, bb1
>   addss %xmm1, %xmm0 # test.c:5
>   ret                # test.c:5

The IR produced is:

> %r = load <4 x float>* %A, align 16, !dbg !11
> %r3 = load <4 x float>* %B, align 16, !dbg !11
> %r4 = call <4 x float> @llvm.x86.sse.add.ss(<4 x float> %r, <4 x float> %r3), !dbg !11
> ret <4 x float> %r4, !dbg !11

Since the int_x86_sse_add_ss intrinsic operates on vectors, not scalars,
there's no need for the inserts and extracts. Of course, with my compiler,
we miss out on any optimization opportunities that may arise from using the
IRBuilder. So, this is not ideal either. But we do avoid partial register
updates... so we've got that going for us. ;)

I do realize that this does not solve your problem. Just thought that it
might help design an ideal solution.

HTH,
Cameron
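Cameron's contrast between the two lowering strategies can be sketched with
the IRBuilder. The helper functions below are hypothetical and are written
against the LLVM 3.4-era C++ API (CreateCall2); they are only meant to
illustrate the trade-off he describes, not code from any frontend.

// Sketch: two ways a frontend could lower _mm_add_ss.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Strategy 1: call the target intrinsic directly. No extracts/inserts are
// emitted, but mid-level optimizations largely treat the call as opaque.
static Value *emitAddSSIntrinsic(IRBuilder<> &Builder, Module *M,
                                 Value *A, Value *B) {
  Function *AddSS = Intrinsic::getDeclaration(M, Intrinsic::x86_sse_add_ss);
  return Builder.CreateCall2(AddSS, A, B);
}

// Strategy 2: generic IR. The scalar fadd is visible to the optimizer, but
// the backend must fold the final insertelement back into the scalar
// instruction to avoid the redundant movss discussed in this thread.
static Value *emitAddSSGeneric(IRBuilder<> &Builder, Value *A, Value *B) {
  Value *A0 = Builder.CreateExtractElement(A, Builder.getInt32(0));
  Value *B0 = Builder.CreateExtractElement(B, Builder.getInt32(0));
  Value *Sum = Builder.CreateFAdd(A0, B0);
  return Builder.CreateInsertElement(A, Sum, Builder.getInt32(0));
}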
Nadav Rotem
2013-Dec-05 17:58 UTC
[LLVMdev] X86 - Help on fixing a poor code generation bug
Hi Andrea,

Thanks for working on this. I can see two approaches to solving this
problem. The first one (that you suggested) is to catch this pattern after
register allocation. The second approach is to eliminate this redundancy
during instruction selection. Can you please look into catching this
pattern during ISel? The idea is that ADDSS does an ADD plus a BLEND
operation, and you can easily catch them. You can add a new target-specific
DAGCombine or a tablegen pattern. You should also handle mul/add/sub.

define <4 x float> @foo(<4 x float> %A, <4 x float> %B) nounwind readnone ssp uwtable {
  %1 = extractelement <4 x float> %B, i32 0
  %2 = extractelement <4 x float> %A, i32 0
  %3 = fadd float %2, %1                              ; Both the fadd and the insertelement below should be matched into
  %4 = insertelement <4 x float> %A, float %3, i32 0  ; an ADDSS, which does an ADD and a BLEND in one instruction.
  ret <4 x float> %4
}

Thanks,
Nadav

On Dec 5, 2013, at 7:35 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> Hi all,
>
> I noticed that the x86 backend tends to emit unnecessary vector insert
> instructions immediately after SSE scalar fp instructions like
> addss/mulss.
>
> For example:
> /////////////////////////////////
> __m128 foo(__m128 A, __m128 B) {
>   return _mm_add_ss(A, B);
> }
> /////////////////////////////////
>
> produces the sequence:
>   addss %xmm0, %xmm1
>   movss %xmm1, %xmm0
>
> which could easily be optimized into:
>   addss %xmm1, %xmm0
>
> The first step is to understand why the compiler produces the
> redundant instructions.
>
> For the example above, the DAG sequence looks like this:
>
>   a0     : f32   = extract_vector_elt ( A, 0 )
>   b0     : f32   = extract_vector_elt ( B, 0 )
>   r0     : f32   = fadd a0, b0
>   result : v4f32 = insert_vector_elt ( A, r0, 0 )
>
> (with A and B of type v4f32).
>
> The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
> X86ISD::INSERTPS, depending on the target's SSE feature level.
>
> To start, I checked whether this bug was caused simply by the lack of
> specific tablegen patterns to match the complex sequence described
> above into a single ADDSS instruction.
>
> However, X86InstrSSE.td already defines X86::ADDSSrr as a commutative
> SSE scalar fp add instruction (operating on f32 ValueTypes).
> X86::ADDSSrr is used to select 'fadd' nodes and is emitted as 'addss'
> in assembly.
>
> At this stage the MOVSS/INSERTPS is still required, since ADDSS alone
> would not be equivalent to the hardware 'movss' instruction.
>
> I then started investigating the possibility of adding a pass that
> runs at the 'postRegAlloc' stage.
>
> Before RegAlloc it may not be safe to remove the redundant MOVSSrr
> because of the TwoAddressInstruction pass, which may decide to commute
> the operands of the ADDSS/MULSS.
>
> It is possible to write a pass that scans through each basic block in
> a function looking for opportunities to fold instructions based on the
> following patterns:
>
> //////
>   B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
>   A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>     ==>
>   A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
> //////
>
> //////
>   B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
>   A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>     ==>
>   A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
> //////
>
> with 'NAME' in {ADD, SUB, MUL, DIV}.
>
> I managed to create a very simple prototype pass which simplifies the
> code according to pattern #1 (see the attached patch).
> However, identifying patterns at the postRegAlloc stage is definitely
> more complicated than matching patterns on DAGs: instructions are no
> longer in SSA form; the pass must keep track of defs and kills and
> leave the liveness info in a consistent state; some instructions may
> have side effects; and so on.
> Last but not least, the register assignment chosen by the register
> allocator and the way instructions are scheduled may strongly affect
> the ability of a pass like this to pattern-match specific sequences of
> instructions.
>
> In conclusion, my questions are:
> Would a patch be acceptable that introduces a new MachineFunctionPass
> that runs at the 'postRegAlloc' stage?
> If not, is there a better way to fix this bug?
>
> Thanks,
> Andrea Di Biagio
> SN Systems - Sony Computer Entertainment Group
> <patch.diff>
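A tablegen pattern of the kind Nadav suggests could take roughly the shape
below. This is an untested sketch: it assumes the insert_vector_elt has
already been lowered to X86ISD::MOVSS (the X86Movss fragment) and borrows
node and instruction names (vector_extract, ADDSSrr_Int, COPY_TO_REGCLASS)
from X86InstrSSE.td of that era, so it may need adjustment for any
particular LLVM revision.

// Sketch of a possible X86InstrSSE.td pattern: fold
//   (movss A, (scalar_to_vector (fadd (extract A, 0), b0)))
// into a single scalar add performed on the full vector register.
def : Pat<(v4f32 (X86Movss (v4f32 VR128:$dst),
                           (v4f32 (scalar_to_vector
                                   (fadd (f32 (vector_extract (v4f32 VR128:$dst),
                                                              (iPTR 0))),
                                         FR32:$src))))),
          (ADDSSrr_Int VR128:$dst, (COPY_TO_REGCLASS FR32:$src, VR128))>;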
Andrea Di Biagio
2013-Dec-06 12:19 UTC
[LLVMdev] X86 - Help on fixing a poor code generation bug
Hi Nadav and Cameron,

On Thu, Dec 5, 2013 at 5:58 PM, Nadav Rotem <nrotem at apple.com> wrote:
> Hi Andrea,
>
> Thanks for working on this. I can see two approaches to solving this
> problem. The first one (that you suggested) is to catch this pattern after
> register allocation. The second approach is to eliminate this redundancy
> during instruction selection. Can you please look into catching this
> pattern during ISel? The idea is that ADDSS does an ADD plus a BLEND
> operation, and you can easily catch them. You can add a new target-specific
> DAGCombine or a tablegen pattern. You should also handle mul/add/sub.
>
> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) nounwind readnone ssp uwtable {
>   %1 = extractelement <4 x float> %B, i32 0
>   %2 = extractelement <4 x float> %A, i32 0
>   %3 = fadd float %2, %1                              ; Both the fadd and the insertelement below should be matched into
>   %4 = insertelement <4 x float> %A, float %3, i32 0  ; an ADDSS, which does an ADD and a BLEND in one instruction.
>   ret <4 x float> %4
> }

I found out how to catch the pattern during ISel, and I have a patch which
I think fixes the problem. I will upload it as soon as I have finished
testing it.

Thanks for the feedback!
Andrea Di Biagio
SN Systems - Sony Computer Entertainment Group