Andrea Di Biagio
2013-Dec-05 15:35 UTC
[LLVMdev] X86 - Help on fixing a poor code generation bug
Hi all,

I noticed that the x86 backend tends to emit unnecessary vector insert
instructions immediately after SSE scalar fp instructions like
addss/mulss.

For example:

/////////////////////////////////
__m128 foo(__m128 A, __m128 B) {
  return _mm_add_ss(A, B);
}
/////////////////////////////////

produces the sequence:

  addss %xmm0, %xmm1
  movss %xmm1, %xmm0

which could easily be optimized into:

  addss %xmm1, %xmm0

The first step is to understand why the compiler produces the
redundant instructions.

For the example above, the DAG sequence looks like this:

  a0     : f32   = extract_vector_elt ( A, 0 )
  b0     : f32   = extract_vector_elt ( B, 0 )
  r0     : f32   = fadd a0, b0
  result : v4f32 = insert_vector_elt ( A, r0, 0 )

(with A and B of type v4f32).

The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
X86ISD::INSERTPS, depending on the target's SSE feature level.

To start, I checked whether this bug was caused simply by the lack of
specific tablegen patterns to match the complex sequence described
above into a single ADDSS instruction.

However, X86InstrSSE.td already defines X86::ADDSSrr as a commutative
SSE scalar fp add instruction (operating on f32 ValueTypes).
X86::ADDSSrr is used to select 'fadd' nodes and is emitted as 'addss'
in assembly.

At this stage the MOVSS/INSERTPS is still required, since ADDSS alone
would not be equivalent to the hardware 'movss' instruction.

I then started investigating the possibility of adding a pass that
runs at the 'postRegAlloc' stage.

Before RegAlloc it may not be safe to remove the redundant MOVSSrr
because of the TwoAddressInstruction pass, which may decide to commute
the operands of the ADDSS/MULSS.

It is possible to write a pass that scans through each basic block in
a function looking for opportunities to fold instructions based on the
following patterns:

//////
  B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
  A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
    ==>
  A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
//////

//////
  B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
  A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
    ==>
  A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
//////

with 'NAME' in {ADD, SUB, MUL, DIV}.

I managed to create a very simple prototype pass which simplifies the
code according to pattern #1 (see the attached patch).
However, identifying patterns at the postRegAlloc stage is definitely
more complicated than matching patterns on DAGs: instructions are no
longer in SSA form; the pass must keep track of defs and kills and
leave the liveness info in a consistent state; some instructions may
have side effects; and so on.
Last but not least, the register assignment chosen by the register
allocator and the way instructions are scheduled may strongly affect
the ability of a pass like this to pattern-match specific sequences of
instructions.

In conclusion, my questions are:
Would a patch be acceptable that introduces a new MachineFunctionPass
that runs at the 'postRegAlloc' stage?
If not, is there a better way to fix this bug?
Thanks,
Andrea Di Biagio
SN Systems - Sony Computer Entertainment Group

-------------- next part --------------
Index: lib/Target/X86/X86.h
===================================================================
--- lib/Target/X86/X86.h	(revision 196508)
+++ lib/Target/X86/X86.h	(working copy)
@@ -47,6 +47,8 @@
 ///
 FunctionPass *createX86FloatingPointStackifierPass();
 
+FunctionPass *createX86FoldRedundantInsertsPass();
+
 /// createX86IssueVZeroUpperPass - This pass inserts AVX vzeroupper instructions
 /// before each call to avoid transition penalty between functions encoded with
 /// AVX and SSE.
Index: lib/Target/X86/X86TargetMachine.cpp
===================================================================
--- lib/Target/X86/X86TargetMachine.cpp	(revision 196508)
+++ lib/Target/X86/X86TargetMachine.cpp	(working copy)
@@ -197,6 +197,10 @@
 
 bool X86PassConfig::addPostRegAlloc() {
   addPass(createX86FloatingPointStackifierPass());
+
+  if (getOptLevel() != CodeGenOpt::None)
+    addPass(createX86FoldRedundantInsertsPass());
+
   return true;  // -print-machineinstr should print after this.
 }
@@ -222,7 +226,6 @@
     addPass(createX86FixupLEAs());
     ShouldPrint = true;
   }
-
   return ShouldPrint;
 }
 
Index: lib/Target/X86/CMakeLists.txt
===================================================================
--- lib/Target/X86/CMakeLists.txt	(revision 196508)
+++ lib/Target/X86/CMakeLists.txt	(working copy)
@@ -18,6 +18,7 @@
   X86CodeEmitter.cpp
   X86FastISel.cpp
   X86FloatingPoint.cpp
+  X86FoldRedundantInserts.cpp
   X86FrameLowering.cpp
   X86ISelDAGToDAG.cpp
   X86ISelLowering.cpp
Index: lib/Target/X86/X86FoldRedundantInserts.cpp
===================================================================
--- lib/Target/X86/X86FoldRedundantInserts.cpp	(revision 0)
+++ lib/Target/X86/X86FoldRedundantInserts.cpp	(revision 0)
@@ -0,0 +1,185 @@
+//===-- X86FoldRedundantInserts - Remove redundant vector inserts -----------===//
+//
+//                     The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
+//===----------------------------------------------------------------------===//
+//
+// This file defines a pass that searches for opportunities to fold
+// the following patterns:
+//
+//   1)  B = #OPNAME#SSrr B, A
+//       A = #OPINSERT#SSrr A, B<kill>
+//       ==>
+//       A = #OPNAME#SSrr A, B<kill>
+//
+//   2)  B = #OPNAME#PSrr B, A
+//       A = #OPINSERT#SSrr A, B<kill>
+//       ==>
+//       A = #OPNAME#SSrr A, B<kill>
+//
+// With #OPNAME#   in { ADD, MUL, SUB, DIV }
+//      #OPINSERT# in { MOV, INSERT }
+//
+// FIXME: this is just a simple prototype. It only implements pattern 1.
+//
+//===----------------------------------------------------------------------===//
+
+#define DEBUG_TYPE "x86-codegen"
+#include "X86.h"
+#include "X86InstrInfo.h"
+#include "llvm/ADT/DenseMap.h"
+#include "llvm/ADT/Statistic.h"
+#include "llvm/CodeGen/MachineFunctionPass.h"
+#include "llvm/CodeGen/MachineRegisterInfo.h"
+#include "llvm/CodeGen/Passes.h"
+#include "llvm/Support/Debug.h"
+#include "llvm/Support/raw_ostream.h"
+#include "llvm/Target/TargetInstrInfo.h"
+#include "llvm/Target/TargetMachine.h"
+
+using namespace llvm;
+
+STATISTIC(NumDeletedInserts, "Number of inserts deleted");
+
+namespace {
+  class X86FoldRedundantInserts : public MachineFunctionPass {
+    const TargetRegisterInfo *TRI;
+    const TargetInstrInfo *TII;
+    MachineRegisterInfo *MRI;
+
+  public:
+    static char ID; // Pass identification, replacement for typeid
+    X86FoldRedundantInserts() : MachineFunctionPass(ID) { }
+
+    virtual void getAnalysisUsage(AnalysisUsage &AU) const {
+      AU.setPreservesCFG();
+      AU.addPreservedID(MachineLoopInfoID);
+      AU.addPreservedID(MachineDominatorsID);
+      MachineFunctionPass::getAnalysisUsage(AU);
+    }
+
+    virtual bool runOnMachineFunction(MachineFunction &MF);
+
+    virtual const char *getPassName() const {
+      return "X86 Remove Redundant Inserts";
+    }
+
+  private:
+    // Candidates - Keep track of potential candidates for removal.
+    DenseMap<unsigned, MachineInstr*> Candidates;
+
+    bool isValidINSERTPS(const MachineInstr &MI) const;
+    bool isValidMOVSS(const MachineInstr &MI) const;
+    bool isValidCandidate(const MachineInstr &MI) const;
+    bool SimplifyBlock(MachineBasicBlock &MBB);
+  };
+}
+
+char X86FoldRedundantInserts::ID = 0;
+
+FunctionPass *llvm::createX86FoldRedundantInsertsPass() {
+  return new X86FoldRedundantInserts();
+}
+
+bool X86FoldRedundantInserts::isValidINSERTPS(const MachineInstr &MI) const {
+  const MachineOperand &Op2 = MI.getOperand(2);
+  const MachineOperand &Op3 = MI.getOperand(3);
+
+  return ((Op3.getType() == MachineOperand::MO_Immediate) &&
+          (Op3.getImm() == 0) &&
+          (Op2.isKill() && !MRI->isReserved(Op2.getReg())));
+}
+
+bool X86FoldRedundantInserts::isValidMOVSS(const MachineInstr &MI) const {
+  const MachineOperand &MO = MI.getOperand(2);
+  unsigned Reg = MO.getReg();
+
+  return !MRI->isReserved(Reg) && MO.isKill();
+}
+
+bool X86FoldRedundantInserts::isValidCandidate(const MachineInstr &MI) const {
+  switch (MI.getOpcode()) {
+  default : return false;
+  case X86::MOVSSrr :
+    return isValidMOVSS(MI);
+  case X86::INSERTPSrr :
+    return isValidINSERTPS(MI);
+  }
+
+  llvm_unreachable(0);
+}
+
+bool X86FoldRedundantInserts::SimplifyBlock(MachineBasicBlock &MBB) {
+  bool Changed = false;
+  MachineBasicBlock::iterator I = MBB.end(), E = MBB.begin();
+
+  while (I != E) {
+    --I;
+    if (I->hasUnmodeledSideEffects()) {
+      // Invalidate the Map.
+      Candidates.clear();
+      continue;
+    }
+
+    if (isValidCandidate(*I)) {
+      // Add this entry to the Candidates map.
+      unsigned RegDef = I->getOperand(0).getReg();
+      Candidates.insert(std::make_pair(RegDef, I));
+      continue;
+    }
+
+    // Simplify according to the pattern
+    //   X<def,tied1> = #OPCODE#SSrr X<kill,tied0>, Y
+    //   Y<def,tied1> = #INSERT#SSrr Y<kill,tied0>, X<kill>
+    //   ==>
+    //   Y<def,tied1> = #OPCODE#SSrr Y<kill,tied0>, X<kill>
+    if (I->getOpcode() == X86::ADDSSrr ||
+        I->getOpcode() == X86::MULSSrr) {
+      unsigned Def = I->getOperand(2).getReg();
+
+      DenseMap<unsigned, MachineInstr*>::iterator CI = Candidates.find(Def);
+      if (CI != Candidates.end()) {
+        MachineOperand &MO = I->getOperand(0);
+        MachineInstr *CMI = CI->second;
+
+        if (MO.getReg() == CMI->getOperand(2).getReg()) {
+          DEBUG(dbgs() << "Commuting: " << *I);
+          MachineInstr *NewMI = TII->commuteInstruction(I, false);
+          assert(NewMI && "failed to commute the operands!");
+
+          DEBUG(dbgs() << "NewMI is: " << *NewMI);
+          DEBUG(dbgs() << "Removing redundant insert: " << *CMI);
+          CMI->eraseFromParent();
+          NumDeletedInserts++;
+        }
+      }
+
+      Candidates.erase(Def);
+      continue;
+    }
+
+    // FIXME: simplify more patterns and update the Candidates
+    // set based on what registers are clobbered by the current
+    // instruction instead of just clearing the map of candidates.
+    Candidates.clear();
+  }
+
+  return Changed;
+}
+
+bool X86FoldRedundantInserts::runOnMachineFunction(MachineFunction &MF) {
+  bool Changed = false;
+
+  TRI = MF.getTarget().getRegisterInfo();
+  TII = MF.getTarget().getInstrInfo();
+  MRI = &MF.getRegInfo();
+
+  for (MachineFunction::iterator I = MF.begin(), E = MF.end(); I != E; ++I)
+    Changed |= SimplifyBlock(*I);
+
+  return Changed;
+}
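An illustrative regression test for the desired codegen could look like the
sketch below. The test file name and CHECK lines are hypothetical (they are
not part of the attached patch) and assume an x86-64 SysV target, where A
arrives in %xmm0 and B in %xmm1.

; Hypothetical test, e.g. test/CodeGen/X86/sse-scalar-fp-arith.ll
; RUN: llc -mtriple=x86_64-unknown-unknown < %s | FileCheck %s

define <4 x float> @test_add_ss(<4 x float> %A, <4 x float> %B) {
; CHECK-LABEL: test_add_ss
; CHECK:       addss %xmm1, %xmm0
; CHECK-NOT:   movss
; CHECK:       ret
  %1 = extractelement <4 x float> %B, i32 0
  %2 = extractelement <4 x float> %A, i32 0
  %3 = fadd float %2, %1
  %4 = insertelement <4 x float> %A, float %3, i32 0
  ret <4 x float> %4
}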
Cameron McInally
2013-Dec-05 16:54 UTC
[LLVMdev] X86 - Help on fixing a poor code generation bug
Hey Andrea,

On Thu, Dec 5, 2013 at 10:35 AM, Andrea Di Biagio
<andrea.dibiagio at gmail.com> wrote:
> Hi all,
>
> I noticed that the x86 backend tends to emit unnecessary vector insert
> instructions immediately after SSE scalar fp instructions like
> addss/mulss.
>
> For example:
> /////////////////////////////////
> __m128 foo(__m128 A, __m128 B) {
>   return _mm_add_ss(A, B);
> }
> /////////////////////////////////
>
> produces the sequence:
>   addss %xmm0, %xmm1
>   movss %xmm1, %xmm0
>
> which could easily be optimized into:
>   addss %xmm1, %xmm0
>
> The first step is to understand why the compiler produces the
> redundant instructions.
>
> For the example above, the DAG sequence looks like this:
>
>   a0     : f32   = extract_vector_elt ( A, 0 )
>   b0     : f32   = extract_vector_elt ( B, 0 )
>   r0     : f32   = fadd a0, b0
>   result : v4f32 = insert_vector_elt ( A, r0, 0 )
>
> (with A and B of type v4f32).
>
> The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
> X86ISD::INSERTPS, depending on the target's SSE feature level.
>
> To start, I checked whether this bug was caused simply by the lack of
> specific tablegen patterns to match the complex sequence described
> above into a single ADDSS instruction.
>
> However, X86InstrSSE.td already defines X86::ADDSSrr as a commutative
> SSE scalar fp add instruction (operating on f32 ValueTypes).
> X86::ADDSSrr is used to select 'fadd' nodes and is emitted as 'addss'
> in assembly.
>
> At this stage the MOVSS/INSERTPS is still required, since ADDSS alone
> would not be equivalent to the hardware 'movss' instruction.
>
> I then started investigating the possibility of adding a pass that
> runs at the 'postRegAlloc' stage.
>
> Before RegAlloc it may not be safe to remove the redundant MOVSSrr
> because of the TwoAddressInstruction pass, which may decide to commute
> the operands of the ADDSS/MULSS.
>
> It is possible to write a pass that scans through each basic block in
> a function looking for opportunities to fold instructions based on the
> following patterns:
>
> //////
>   B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
>   A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>     ==>
>   A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
> //////
>
> //////
>   B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
>   A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>     ==>
>   A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
> //////
>
> with 'NAME' in {ADD, SUB, MUL, DIV}.
>
> I managed to create a very simple prototype pass which simplifies the
> code according to pattern #1 (see the attached patch).
> However, identifying patterns at the postRegAlloc stage is definitely
> more complicated than matching patterns on DAGs: instructions are no
> longer in SSA form; the pass must keep track of defs and kills and
> leave the liveness info in a consistent state; some instructions may
> have side effects; and so on.
> Last but not least, the register assignment chosen by the register
> allocator and the way instructions are scheduled may strongly affect
> the ability of a pass like this to pattern-match specific sequences of
> instructions.
>
> In conclusion, my questions are:
> Would a patch be acceptable that introduces a new MachineFunctionPass
> that runs at the 'postRegAlloc' stage?
> If not, is there a better way to fix this bug?
>
> Thanks,
> Andrea Di Biagio
> SN Systems - Sony Computer Entertainment Group

With my compiler, which has a proprietary frontend and builtin lowerings,
we do produce the expected assembly:

> # BB#0:              # %file test.c, line 3, bb1
>   addss %xmm1, %xmm0 # test.c:5
>   ret                # test.c:5

The IR produced is:

> %r = load <4 x float>* %A, align 16, !dbg !11
> %r3 = load <4 x float>* %B, align 16, !dbg !11
> %r4 = call <4 x float> @llvm.x86.sse.add.ss(<4 x float> %r, <4 x float> %r3), !dbg !11
> ret <4 x float> %r4, !dbg !11

Since the int_x86_sse_add_ss intrinsic operates on vectors, not scalars,
there's no need for the inserts and extracts. Of course, with my compiler,
we miss out on any optimization opportunities that may arise from using the
IRBuilder. So, this is not ideal either. But we do avoid partial register
updates... so we've got that going for us. ;)

I do realize that this does not solve your problem. Just thought that it
might help design an ideal solution.

HTH,
Cameron
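Cameron's contrast between the two lowering strategies can be sketched with
the IRBuilder. The helper functions below are hypothetical and are written
against the LLVM 3.4-era C++ API (CreateCall2); they are only meant to
illustrate the trade-off he describes, not code from any frontend.

// Sketch: two ways a frontend could lower _mm_add_ss.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Intrinsics.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Strategy 1: call the target intrinsic directly. No extracts/inserts are
// emitted, but mid-level optimizations largely treat the call as opaque.
static Value *emitAddSSIntrinsic(IRBuilder<> &Builder, Module *M,
                                 Value *A, Value *B) {
  Function *AddSS = Intrinsic::getDeclaration(M, Intrinsic::x86_sse_add_ss);
  return Builder.CreateCall2(AddSS, A, B);
}

// Strategy 2: generic IR. The scalar fadd is visible to the optimizer, but
// the backend must fold the final insertelement back into the scalar
// instruction to avoid the redundant movss discussed in this thread.
static Value *emitAddSSGeneric(IRBuilder<> &Builder, Value *A, Value *B) {
  Value *A0 = Builder.CreateExtractElement(A, Builder.getInt32(0));
  Value *B0 = Builder.CreateExtractElement(B, Builder.getInt32(0));
  Value *Sum = Builder.CreateFAdd(A0, B0);
  return Builder.CreateInsertElement(A, Sum, Builder.getInt32(0));
}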
Nadav Rotem
2013-Dec-05 17:58 UTC
[LLVMdev] X86 - Help on fixing a poor code generation bug
Hi Andrea,

Thanks for working on this. I can see two approaches to solving this
problem. The first one (that you suggested) is to catch this pattern after
register allocation. The second approach is to eliminate this redundancy
during instruction selection. Can you please look into catching this
pattern during ISel? The idea is that ADDSS does an ADD plus a BLEND
operation, and you can easily catch them. You can add a new target-specific
DAGCombine or a tablegen pattern. You should also handle mul/add/sub.

define <4 x float> @foo(<4 x float> %A, <4 x float> %B) nounwind readnone ssp uwtable {
  %1 = extractelement <4 x float> %B, i32 0
  %2 = extractelement <4 x float> %A, i32 0
  %3 = fadd float %2, %1                              ; Both the fadd and the insertelement below should be matched into
  %4 = insertelement <4 x float> %A, float %3, i32 0  ; an ADDSS, which does an ADD and a BLEND in one instruction.
  ret <4 x float> %4
}

Thanks,
Nadav

On Dec 5, 2013, at 7:35 AM, Andrea Di Biagio <andrea.dibiagio at gmail.com> wrote:

> Hi all,
>
> I noticed that the x86 backend tends to emit unnecessary vector insert
> instructions immediately after SSE scalar fp instructions like
> addss/mulss.
>
> For example:
> /////////////////////////////////
> __m128 foo(__m128 A, __m128 B) {
>   return _mm_add_ss(A, B);
> }
> /////////////////////////////////
>
> produces the sequence:
>   addss %xmm0, %xmm1
>   movss %xmm1, %xmm0
>
> which could easily be optimized into:
>   addss %xmm1, %xmm0
>
> The first step is to understand why the compiler produces the
> redundant instructions.
>
> For the example above, the DAG sequence looks like this:
>
>   a0     : f32   = extract_vector_elt ( A, 0 )
>   b0     : f32   = extract_vector_elt ( B, 0 )
>   r0     : f32   = fadd a0, b0
>   result : v4f32 = insert_vector_elt ( A, r0, 0 )
>
> (with A and B of type v4f32).
>
> The 'insert_vector_elt' is custom lowered into either X86ISD::MOVSS or
> X86ISD::INSERTPS, depending on the target's SSE feature level.
>
> To start, I checked whether this bug was caused simply by the lack of
> specific tablegen patterns to match the complex sequence described
> above into a single ADDSS instruction.
>
> However, X86InstrSSE.td already defines X86::ADDSSrr as a commutative
> SSE scalar fp add instruction (operating on f32 ValueTypes).
> X86::ADDSSrr is used to select 'fadd' nodes and is emitted as 'addss'
> in assembly.
>
> At this stage the MOVSS/INSERTPS is still required, since ADDSS alone
> would not be equivalent to the hardware 'movss' instruction.
>
> I then started investigating the possibility of adding a pass that
> runs at the 'postRegAlloc' stage.
>
> Before RegAlloc it may not be safe to remove the redundant MOVSSrr
> because of the TwoAddressInstruction pass, which may decide to commute
> the operands of the ADDSS/MULSS.
>
> It is possible to write a pass that scans through each basic block in
> a function looking for opportunities to fold instructions based on the
> following patterns:
>
> //////
>   B<def, tied1> = #NAME#SSrr B<kill, tied0>, A
>   A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>     ==>
>   A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
> //////
>
> //////
>   B<def, tied1> = #NAME#PSrr B<kill, tied0>, A
>   A<def, tied1> = MOVSSrr A<kill, tied0>, B<kill>
>     ==>
>   A<def, tied1> = #NAME#SSrr A<kill, tied0>, B<kill>
> //////
>
> with 'NAME' in {ADD, SUB, MUL, DIV}.
>
> I managed to create a very simple prototype pass which simplifies the
> code according to pattern #1 (see the attached patch).
> However, identifying patterns at the postRegAlloc stage is definitely
> more complicated than matching patterns on DAGs: instructions are no
> longer in SSA form; the pass must keep track of defs and kills and
> leave the liveness info in a consistent state; some instructions may
> have side effects; and so on.
> Last but not least, the register assignment chosen by the register
> allocator and the way instructions are scheduled may strongly affect
> the ability of a pass like this to pattern-match specific sequences of
> instructions.
>
> In conclusion, my questions are:
> Would a patch be acceptable that introduces a new MachineFunctionPass
> that runs at the 'postRegAlloc' stage?
> If not, is there a better way to fix this bug?
>
> Thanks,
> Andrea Di Biagio
> SN Systems - Sony Computer Entertainment Group
> <patch.diff>
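A tablegen pattern of the kind Nadav suggests could take roughly the shape
below. This is an untested sketch: it assumes the insert_vector_elt has
already been lowered to X86ISD::MOVSS (the X86Movss fragment) and borrows
node and instruction names (vector_extract, ADDSSrr_Int, COPY_TO_REGCLASS)
from X86InstrSSE.td of that era, so it may need adjustment for any
particular LLVM revision.

// Sketch of a possible X86InstrSSE.td pattern: fold
//   (movss A, (scalar_to_vector (fadd (extract A, 0), b0)))
// into a single scalar add performed on the full vector register.
def : Pat<(v4f32 (X86Movss (v4f32 VR128:$dst),
                           (v4f32 (scalar_to_vector
                                   (fadd (f32 (vector_extract (v4f32 VR128:$dst),
                                                              (iPTR 0))),
                                         FR32:$src))))),
          (ADDSSrr_Int VR128:$dst, (COPY_TO_REGCLASS FR32:$src, VR128))>;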
Andrea Di Biagio
2013-Dec-06 12:19 UTC
[LLVMdev] X86 - Help on fixing a poor code generation bug
Hi Nadav and Cameron,

On Thu, Dec 5, 2013 at 5:58 PM, Nadav Rotem <nrotem at apple.com> wrote:
> Hi Andrea,
>
> Thanks for working on this. I can see two approaches to solving this
> problem. The first one (that you suggested) is to catch this pattern after
> register allocation. The second approach is to eliminate this redundancy
> during instruction selection. Can you please look into catching this
> pattern during ISel? The idea is that ADDSS does an ADD plus a BLEND
> operation, and you can easily catch them. You can add a new target-specific
> DAGCombine or a tablegen pattern. You should also handle mul/add/sub.
>
> define <4 x float> @foo(<4 x float> %A, <4 x float> %B) nounwind readnone ssp uwtable {
>   %1 = extractelement <4 x float> %B, i32 0
>   %2 = extractelement <4 x float> %A, i32 0
>   %3 = fadd float %2, %1                              ; Both the fadd and the insertelement below should be matched into
>   %4 = insertelement <4 x float> %A, float %3, i32 0  ; an ADDSS, which does an ADD and a BLEND in one instruction.
>   ret <4 x float> %4
> }

I found out how to catch the pattern during ISel, and I have a patch which
I think fixes the problem. I will upload it as soon as I have finished
testing it.

Thanks for the feedback!
Andrea Di Biagio
SN Systems - Sony Computer Entertainment Group