Sure. Attached is the file, but here are the functions. The first uses a
fixed bit offset; the second uses an indexed bit offset.

Compiling with llc -O3 (LLVM version 3.7.0svn), the IR from IsBitSetB()
compiles to btq %rsi, %rdi. Good. But IsBitSetA() compiles to shrq/andq,
which is pretty much what Clang had generated as IR:

    shrq $25, %rdi
    andq $1, %rdi

LLVM should be able to replace these two with a single X86_64 instruction:

    btq reg,25

The generated code is correct in both cases. It just isn't optimized in the
immediate operand case.

    unsigned long long IsBitSetA(unsigned long long val)
    {
        return (val & (1ULL<<25)) != 0ULL;
    }

    unsigned long long IsBitSetB(unsigned long long val, int index)
    {
        return (val & (1ULL<<index)) != 0ULL;
    }

On Sun, Jan 18, 2015 at 10:02 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
> Hi,
>
> Can you provide a reproducible example? I feel especially your first IR
> sample is incomplete.
> If you can also make more explicit how is the generated code wrong?
>
> You can give a C file if you are sure that it is reproducible with the
> current clang.
>
> Thanks,
>
> Mehdi
>
> On Jan 18, 2015, at 5:13 PM, Chris Sears <chris.sears at gmail.com> wrote:
>
> I'm tracking down an X86 code generation malfeasance regarding BT (bit
> test) and I have some questions.
>
> This IR *matches* and then X86TargetLowering::LowerToBT *is called*:
>
>     %and = and i64 %shl, %val    ; (val & (1 << index)) != 0 ; bit test with a *register* index
>
> This IR *does not match* and so X86TargetLowering::LowerToBT *is not
> called*:
>
>     %and = lshr i64 %val, 25     ; (val & (1 << 25)) != 0 ; bit test with an *immediate* index
>     %conv = and i64 %and, 1
>
> Let's back that up a bit. Clang emits this IR. These expressions start out
> life in C as *and with a left shifted masking bit*, and are then
> converted into IR as *right shifted values anded with a masking bit*.
>
> This IR then remains untouched until *Expand ISel Pseudo-instructions* in
> llc (-O3). At that point, LowerToBT is called on the REGISTER version
> and substitutes in a BT reg,reg instruction:
>
>     btq %rsi, %rdi    ## <MCInst #312 BT64rr
>
> The IMMEDIATE version doesn't match the pattern and so LowerToBT is not
> called.
>
> *Question*: This is during *pseudo instruction expansion*. How could
> LowerToBT's caller have enough context to match the immediate IR
> version? In fact, lli isn't calling LowerToBT so it isn't matching. But
> isn't this really a *peephole optimization* issue?
>
> LLVM has a generic peephole optimizer, CodeGen/PeepholeOptimizer.cpp, which
> has exactly one subclass in NVPTXTargetMachine.cpp.
>
> But isn't it better to deal with X86 LowerToBT in a PeepholeOptimizer
> subclass where you have a small window of instructions rather than during
> pseudo instruction expansion where you have really one instruction?
> PeepholeOptimizer doesn't seem to be getting much attention and
> certainly no attention at the subclass level.
>
> Bluntly, expansion is about expansion. Peephole optimization is the
> opposite.
>
> *Question*: Regardless, why is LowerToBT not being called for the
> IMMEDIATE version? I suppose you could look at the preceding instruction in
> the DAG. That seems a bit hacky.
>
> Another approach using LowerToBT would be to match *lshr reg/imm* first
> and then if the *following* instruction was an *and reg,1* replace both
> with a BT. It doesn't look like LowerToBT as is can do that right now
> since it is matching the *and* instruction.
>
>     SDValue X86TargetLowering::LowerToBT(SDValue And, ISD::CondCode CC,
>                                          SDLoc dl, SelectionDAG &DAG) const { ... }
>
> But I think this is better done in a subclass of
> CodeGen/PeepholeOptimizer.cpp.
>
> thanks.
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-- 
Ite Ursi
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tst.c
Type: text/x-csrc
Size: 207 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150118/9927a6c4/attachment.c>
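For readers following along: the message above relies on the mask form written in C, (val & (1ULL << n)) != 0, and the shift form that shows up in the IR, (val >> n) & 1, computing the same bit. A minimal sketch in C that checks this on a few sample values is below. The *_shift functions and count_mismatches() are hypothetical names introduced here for illustration; they are not part of tst.c. The sketch says nothing about which instructions a given compiler emits for either form.

```c
#include <assert.h>
#include <stddef.h>

/* Mask form with a fixed bit offset, as in the original message. */
static unsigned long long IsBitSetA(unsigned long long val)
{
    return (val & (1ULL << 25)) != 0ULL;
}

/* Shift form mirroring the lshr/and IR (hypothetical helper). */
static unsigned long long IsBitSetA_shift(unsigned long long val)
{
    return (val >> 25) & 1ULL;
}

/* Mask form with an indexed bit offset, as in the original message.
   The assert documents the assumption that index is in 0..63; shifting
   a 64-bit value by 64 or more is undefined behaviour in C. */
static unsigned long long IsBitSetB(unsigned long long val, int index)
{
    assert(index >= 0 && index < 64);
    return (val & (1ULL << index)) != 0ULL;
}

/* Shift form of the indexed test (hypothetical helper). */
static unsigned long long IsBitSetB_shift(unsigned long long val, int index)
{
    assert(index >= 0 && index < 64);
    return (val >> index) & 1ULL;
}

/* Counts inputs on which the mask and shift forms disagree. */
static int count_mismatches(void)
{
    static const unsigned long long samples[] = {
        0ULL, 1ULL, 1ULL << 25, ~(1ULL << 25),
        0xDEADBEEFCAFEF00DULL, ~0ULL
    };
    int mismatches = 0;

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        if (IsBitSetA(samples[i]) != IsBitSetA_shift(samples[i]))
            ++mismatches;
        for (int index = 0; index < 64; ++index) {
            if (IsBitSetB(samples[i], index) !=
                IsBitSetB_shift(samples[i], index))
                ++mismatches;
        }
    }
    return mismatches;
}
```

Calling count_mismatches() from a small main() and checking that it returns 0 confirms the two forms agree on these samples, which is consistent with the statement that the generated code is correct in both cases.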
Do we want to use btq? On many x86_64 processors, shrq/andq is "hard-coded",
but btq will execute in microcode and will likely perform worse.

On Mon, Jan 19, 2015 at 1:57 AM, Chris Sears <chris.sears at gmail.com> wrote:
> But then it compiles IsBitSetA() with shrq/andq, which is pretty much
> what Clang had generated as IR.
>
>     shrq $25, %rdi
>     andq $1, %rdi
>
> LLVM should be able to replace these two with a single X86_64 instruction:
>
>     btq reg,25
Which BTQ? There are three flavors:

    BTQ reg/reg
    BTQ reg/mem
    BTQ reg/imm

I can imagine that the reg/reg and especially the reg/mem versions would be
slow. However, the shrq/and versions *with the same operands* would be slow
as well. There's even a compiler comment about the reg/mem version saying
"this is for disassembly only".

But I doubt BTQ reg/imm would be microcoded.

-- 
Ite Ursi
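To make the three operand shapes concrete, here is a small C sketch with one source-level function per shape: value and index both in registers, value in memory with a register index, and a constant index. The function names are hypothetical and chosen only for this illustration. Which instruction a compiler selects for each one (bt, shr/and, or something else) depends on the target and compiler version, so the sketch makes no claim about the emitted code or its speed.

```c
/* Value and index both arrive as arguments (register/register shape).
   The index is masked to 0..63 so the shift is always defined in C. */
static unsigned long long bit_test_reg_reg(unsigned long long val,
                                           unsigned index)
{
    return (val >> (index & 63u)) & 1ULL;
}

/* Value is read through a pointer, index is an argument
   (memory/register shape). */
static unsigned long long bit_test_mem_reg(const unsigned long long *p,
                                           unsigned index)
{
    return (*p >> (index & 63u)) & 1ULL;
}

/* Index is the compile-time constant 25 (register/immediate shape),
   matching IsBitSetA() from earlier in the thread. */
static unsigned long long bit_test_reg_imm(unsigned long long val)
{
    return (val >> 25) & 1ULL;
}
```

Compiling a file like this with llc or clang -S and reading the assembly for each function is one way to see which of the three cases a given LLVM build handles with BT.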