I'm tracking down an X86 code generation malfeasance regarding BT (bit test) and I have some questions.

This IR matches and then X86TargetLowering::LowerToBT is called:

    %and = and i64 %shl, %val      ; (val & (1 << index)) != 0 ; bit test with a register index

This IR does not match and so X86TargetLowering::LowerToBT is not called:

    %and = lshr i64 %val, 25       ; (val & (1 << 25)) != 0 ; bit test with an immediate index
    %conv = and i64 %and, 1

Let's back that up a bit. Clang emits this IR. These expressions start out life in C as an and with a left-shifted masking bit, and are then converted into IR as right-shifted values anded with a masking bit.

This IR then remains untouched until Expand ISel Pseudo-instructions in llc (-O3). At that point, LowerToBT is called on the REGISTER version and substitutes in a BT reg,reg instruction:

    btq %rsi, %rdi ## <MCInst #312 BT64rr

The IMMEDIATE version doesn't match the pattern and so LowerToBT is not called.

Question: This is during pseudo instruction expansion. How could LowerToBT's caller have enough context to match the immediate IR version? In fact, lli isn't calling LowerToBT so it isn't matching. But isn't this really a peephole optimization issue?

LLVM has a generic peephole optimizer, CodeGen/PeepholeOptimizer.cpp, which has exactly one subclass in NVPTXTargetMachine.cpp.

But isn't it better to deal with X86 LowerToBT in a PeepholeOptimizer subclass, where you have a small window of instructions, rather than during pseudo instruction expansion, where you have really one instruction? PeepholeOptimizer doesn't seem to be getting much attention and certainly no attention at the subclass level.

Bluntly, expansion is about expansion. Peephole optimization is the opposite.

Question: Regardless, why is LowerToBT not being called for the IMMEDIATE version? I suppose you could look at the preceding instruction in the DAG. That seems a bit hacky.

Another approach using LowerToBT would be to match lshr reg/imm first and then, if the following instruction was an and reg,1, replace both with a BT. It doesn't look like LowerToBT as is can do that right now since it is matching the and instruction.

    SDValue X86TargetLowering::LowerToBT(SDValue And, ISD::CondCode CC, SDLoc dl, SelectionDAG &DAG) const { ... }

But I think this is better done in a subclass of CodeGen/PeepholeOptimizer.cpp.

thanks.
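The two IR shapes in the message above compute the same value. The following is a minimal C sketch of that equivalence, not code from Clang or LLVM; the function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mask form, as written in the C source: and with a left-shifted bit.
   The name is hypothetical. Valid for index in 0..63. */
static uint64_t bit_set_mask_form(uint64_t val, unsigned index) {
    return (val & (UINT64_C(1) << index)) != 0;
}

/* Shift form, as it appears in the IR for the immediate case:
   a logical right shift followed by an and with 1. */
static uint64_t bit_set_shift_form(uint64_t val, unsigned index) {
    return (val >> index) & 1;
}

/* Check that both forms agree for every bit position of a few sample values. */
static void check_forms_agree(void) {
    const uint64_t samples[] = {0, 1, UINT64_C(1) << 25,
                                UINT64_C(0xDEADBEEFCAFEF00D), UINT64_MAX};
    for (size_t s = 0; s < sizeof samples / sizeof samples[0]; ++s)
        for (unsigned index = 0; index < 64; ++index)
            assert(bit_set_mask_form(samples[s], index) ==
                   bit_set_shift_form(samples[s], index));
}
```

Calling check_forms_agree() from a test driver exercises all 64 bit positions. Because the shift form already yields 0 or 1, no explicit comparison against zero is left in it, which is relevant to why the immediate case is not matched.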
Hi,

Can you provide a reproducible example? Your first IR sample in particular looks incomplete to me. Could you also make more explicit how the generated code is wrong?

You can give a C file if you are sure that it is reproducible with the current clang.

Thanks,

Mehdi

> On Jan 18, 2015, at 5:13 PM, Chris Sears <chris.sears at gmail.com> wrote:
>
> I'm tracking down an X86 code generation malfeasance regarding BT (bit test) and I have some questions.
>
> [...]
Sure. Attached is the file, but here are the functions. The first uses a fixed bit offset. The second has an indexed bit offset. Compiling with llc -O3, LLVM version 3.7.0svn, it compiles the IR from IsBitSetB() using btq %rsi, %rdi. Good. But then it compiles IsBitSetA() with shrq/andq, which is pretty much what Clang had generated as IR.

    shrq $25, %rdi
    andq $1, %rdi

LLVM should be able to replace these two with a single X86_64 instruction: btq reg,25

The generated code is correct in both cases. It just isn't optimized in the immediate operand case.

    unsigned long long IsBitSetA(unsigned long long val)
    {
        return (val & (1ULL<<25)) != 0ULL;
    }

    unsigned long long IsBitSetB(unsigned long long val, int index)
    {
        return (val & (1ULL<<index)) != 0ULL;
    }

On Sun, Jan 18, 2015 at 10:02 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
> Hi,
>
> Can you provide a reproducible example? I feel especially your first IR
> sample is incomplete.
> If you can also make more explicit how is the generated code wrong?
>
> [...]

--
Ite Ursi

Attachment: tst.c (text/x-csrc, 207 bytes)
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150118/9927a6c4/attachment.c>
On Sun, Jan 18, 2015 at 5:13 PM, Chris Sears <chris.sears at gmail.com> wrote:
> [...]
>
> Another approach using LowerToBT would be to match lshr reg/imm first and
> then if the following instruction was an and reg,1 replace both with a BT.
> It doesn't look like LowerToBT as is can do that right now since it is
> matching the and instruction.

I think it's actually matching the comparison: LowerToBT is called by LowerSetCC, which has a comment saying:

    // Optimize to BT if possible.
    // Lower (X & (1 << N)) == 0 to BT(X, N).
    // Lower ((X >>u N) & 1) != 0 to BT(X, N).
    // Lower ((X >>s N) & 1) != 0 to BT(X, N).

This doesn't match the immediate/LSHR version, because the ANDed result is returned directly, and there's no comparison with 0.

If it is indeed profitable to generate the BT (a quick glance at Agner's tables for Merom/Haswell shows it probably is), I would start by looking at PerformAndCombine, to replace the two nodes with an X86ISD::BT.

-Ahmed
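As an illustration of the three patterns in the LowerSetCC comment quoted above, here is a small C model. It is a sketch under stated assumptions, not LLVM code: bt_model() is a hypothetical software stand-in for BT with a 64-bit register operand, where the carry flag receives bit (N mod 64) of X, and the pattern_* names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of BT r64, r64: returns what the carry flag would hold,
   i.e. bit (n mod 64) of x. */
static int bt_model(uint64_t x, uint64_t n) {
    return (int)((x >> (n & 63)) & 1);
}

/* (X & (1 << N)) == 0 : true when the tested bit is clear (carry == 0). */
static int pattern_and_shl_eq0(uint64_t x, unsigned n) {
    return (x & (UINT64_C(1) << n)) == 0;
}

/* ((X >>u N) & 1) != 0 : true when the tested bit is set (carry == 1). */
static int pattern_lshr_ne0(uint64_t x, unsigned n) {
    return ((x >> n) & 1) != 0;
}

/* ((X >>s N) & 1) != 0 : same bit, reached through a signed shift.
   Right-shifting a negative value is implementation-defined in C; mainstream
   compilers shift arithmetically, and bit N of the result is the same either way. */
static int pattern_ashr_ne0(int64_t x, unsigned n) {
    return ((x >> n) & 1) != 0;
}

/* Check each pattern against the model for all bit positions 0..63. */
static void check_patterns(void) {
    const uint64_t samples[] = {0, 1, UINT64_C(1) << 25,
                                UINT64_C(0x8000000000000001), UINT64_MAX};
    for (size_t s = 0; s < sizeof samples / sizeof samples[0]; ++s) {
        for (unsigned n = 0; n < 64; ++n) {
            uint64_t x = samples[s];
            assert(pattern_and_shl_eq0(x, n) == !bt_model(x, n));
            assert(pattern_lshr_ne0(x, n) == bt_model(x, n));
            /* Conversion to int64_t assumes two's complement wrapping. */
            assert(pattern_ashr_ne0((int64_t)x, n) == bt_model(x, n));
        }
    }
}
```

All three patterns end in a comparison with 0. The immediate case in this thread returns (val >> 25) & 1 directly, so it produces the same bit as bt_model(val, 25) but has no comparison node for the matcher to start from.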
According to Agner’s docs, many CPUs have slower BT than TEST: on Haswell, BT has an inverse throughput of only 0.5 as opposed to 0.25 for TEST; on Atom it is 1 instead of 0.5; and Silvermont can’t even dual-issue BT (it locks both ALUs). So while BT does seem to have a shorter instruction encoding than TEST for TEST reg, imm32 where imm32 has one bit set, it might not be the best idea to always change TEST reg, 0x1000 to BT reg, 12…

Fiona

> On Jan 22, 2015, at 1:17 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
>> Begin forwarded message:
>>
>> Date: January 18, 2015 at 10:57:33 PM PST
>> Subject: Re: [LLVMdev] X86TargetLowering::LowerToBT
>> From: Chris Sears <chris.sears at gmail.com>
>> To: Mehdi Amini <mehdi.amini at apple.com>
>> Cc: LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
>>
>> Sure. Attached is the file but here are the functions.
>>
>> [...]
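To make the encoding-size remark concrete, here is a small C sketch that stores hand-assembled bytes for the two instructions from the message above, using a 32-bit register. The byte sequences are my own encoding of "test edi, 0x1000" (opcode F7 /0 with a 32-bit immediate) and "bt edi, 12" (opcode 0F BA /4 with an 8-bit immediate); the choice of EDI and all identifier names are assumptions made for illustration, so check the bytes against an assembler before relying on them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* test edi, 0x1000 : F7 /0 id, ModRM C7 selects EDI, then a little-endian imm32. */
static const unsigned char test_edi_imm32[] = {0xF7, 0xC7, 0x00, 0x10, 0x00, 0x00};

/* bt edi, 12 : 0F BA /4 ib, ModRM E7 selects EDI, then an imm8 bit index. */
static const unsigned char bt_edi_imm8[] = {0x0F, 0xBA, 0xE7, 0x0C};

/* Number of bytes saved by the BT encoding in this particular case. */
static size_t bytes_saved_by_bt(void) {
    return sizeof test_edi_imm32 - sizeof bt_edi_imm8;
}

/* Decode the little-endian 32-bit mask carried by the TEST encoding. */
static uint32_t test_mask(void) {
    return (uint32_t)test_edi_imm32[2] |
           (uint32_t)test_edi_imm32[3] << 8 |
           (uint32_t)test_edi_imm32[4] << 16 |
           (uint32_t)test_edi_imm32[5] << 24;
}

/* Bit index carried by the BT encoding. */
static unsigned bt_index(void) {
    return bt_edi_imm8[3];
}

/* Both encodings should refer to the same single bit. */
static void check_same_bit(void) {
    assert(test_mask() == UINT32_C(1) << bt_index());
}
```

In this sketch the BT form is two bytes shorter (four bytes versus six). That only covers size; the throughput figures cited above are a separate consideration.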
Even more importantly, TEST benefits from macro-fusion; BT does not. BT is vulnerable to partial-flags update stalls; TEST is not.

Use TEST. Don’t use BT.

– Steve

> On Jan 22, 2015, at 4:22 PM, Fiona Glaser <fglaser at apple.com> wrote:
>
> According to Agner’s docs, many CPUs have slower BT than TEST;
>
> [...]
> On Jan 22, 2015, at 1:22 PM, Fiona Glaser <fglaser at apple.com> wrote:
>
> [...] So while BT does seem to have a shorter instruction encoding than TEST for TEST reg, imm32 where imm32 has one bit set, it might not be the best idea to always change TEST reg, 0x1000 to BT reg, 12…

Sounds like we should use BT with -Os, but TEST otherwise. This is probably a common enough instruction that it might make a good impact on code size.

Pete
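The policy suggested in the message above (BT when optimizing for size, TEST otherwise) could be sketched as a small selection function. This is purely a hypothetical illustration in C, not LLVM's API or its actual heuristic; the enum, the function names, and the optimize_for_size flag are all invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: which instruction to emit for a masked test. */
typedef enum { USE_TEST, USE_BT } bit_test_choice;

/* True when exactly one bit of mask is set. */
static bool is_single_bit(uint32_t mask) {
    return mask != 0 && (mask & (mask - 1)) == 0;
}

/* Prefer BT only when size matters and the mask tests a single bit;
   BT cannot express a multi-bit mask, and TEST is favored for speed. */
static bit_test_choice choose_bit_test(uint32_t mask, bool optimize_for_size) {
    if (optimize_for_size && is_single_bit(mask))
        return USE_BT;
    return USE_TEST;
}

/* Small self-check of the policy. */
static void check_choices(void) {
    assert(choose_bit_test(0x1000, true) == USE_BT);
    assert(choose_bit_test(0x1000, false) == USE_TEST);
    assert(choose_bit_test(0x1001, true) == USE_TEST);
}
```

A real implementation would also need to weigh the per-CPU costs raised earlier in the thread, which this sketch ignores.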