Which BTQ? There are three flavors:

    BTQ reg/reg
    BTQ reg/mem
    BTQ reg/imm

I can imagine that the reg/reg and especially the reg/mem versions would be
slow. However, the shrq/and versions *with the same operands* would be slow
as well. There's even a compiler comment about the reg/mem version saying
"this is for disassembly only". But I doubt BTQ reg/imm would be microcoded.

-- Ite Ursi

Looking at the Intel Optimization Reference Manual, page 14-14, for Atom:

    BT m16, imm8; BT mem, imm8    latency 2, 1     throughput 1
    BT m16, r16;  BT mem, reg     latency 10, 9    throughput 8
    BT reg, imm8; BT reg, reg     latency 1        throughput 1

On page C-26 they lower that throughput to 0.5 clock cycles for BT reg, imm8.

The posted functions were simplified for tracking down the code generation
problem. In general, the comparison between using BTQ reg,imm vs SHRQ/ANDQ
for bit testing is even worse because you have to MOVE the tested reg to a
temporary before the SHRQ/ANDQ. And all of these instructions require a REX
prefix (well, not the AND). The result is some code bloat (3 instructions
vs 1) and a little register pressure.
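
For reference, the kind of source that leads to these two lowerings can be
sketched in C as below. This is only an illustration, not the functions that
were originally posted: the names test_bit_var and test_bit_25 are made up
here, and which instruction sequence a given compiler emits for them is an
assumption that depends on the compiler version and target.

```c
#include <stdint.h>

/* Hypothetical helper: test a bit whose index is only known at run time.
   A compiler may lower this to BT reg,reg (result in the carry flag) or
   to a variable shift followed by a mask. */
uint64_t test_bit_var(uint64_t x, unsigned n)
{
    /* Mask the index so the shift is well defined for any n. */
    return (x >> (n & 63)) & 1;
}

/* Hypothetical helper: test a fixed bit (bit 25, as in the SHRQ $25
   sequence). This is the case where BT reg,imm8 could replace the
   shift-and-mask pair. */
uint64_t test_bit_25(uint64_t x)
{
    return (x >> 25) & 1;
}
```

Both helpers return 0 or 1 in a register, so a BT-based lowering still needs
something to move the carry flag into the result when the value is returned
rather than branched on.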

> On Jan 19, 2015, at 10:29 AM, Chris Sears <chris.sears at gmail.com> wrote:
>
> Looking at the Intel Optimization Reference Manual, page 14-14, for Atom
>
> BT m16, imm8, BT mem, imm8 latency 2,1 throughput 1
> BT m16, r16, BT mem, reg latency 10, 9, throughput 8
> BT reg, imm8, BT reg, reg latency 1, throughput 1
>
> On C-26 they lower that throughput to 0.5 clock cycle for BT reg, imm8.
>
> The posted functions were simplified for tracking down the code generation
> problem. In general, the comparison between using BTQ reg,imm vs SHRQ/ANDQ
> for bit testing is even worse because you have to MOVE the tested reg to a
> temporary before the SHRQ/ANDQ. And all of these instructions require a REX
> prefix (well, not the AND). The result is some code bloat (3 instructions
> vs 1) and a little register pressure.

I'm not an X86 expert, but I'd still like to understand why you are comparing
1 instruction to 3. The result does not seem exactly the same, since (if I
understand correctly) BT only sets the carry flag, while the other combination
provides the result in a register.

The full sequence is:

    btq %rsi, %rdi
    sbbq %rax, %rax
    andq $1, %rax
    popq %rbp

vs:

    shrq $25, %rdi
    andq $1, %rdi
    movq %rdi, %rax
    popq %rbp

(I'm not saying that btq is not preferable, just that I don't see a difference
in the number of instructions needed to get the result.)

I agree with Ahmed that you probably should look into PerformAndCombine()
(X86ISelLowering.cpp).

Best,
Mehdi