I suspect that this is because the mask in your example is the result of a variable shift, which (a) has its own performance and flags hazards pre-SHLX and (b) requires additional µops to do with TEST. I expect that ICC is putting a dummy TEST or XOR ahead of the BT to break the false flags dependency, as well.

If the mask were constant, I expect ICC would generate TEST instead (but I don’t have it handy to check).

– Steve

> On Jan 23, 2015, at 11:32 AM, Sanjay Patel <spatel at rotateright.com> wrote:
>
> If 'bt' is a perf sin, icc doesn't seem to know it:
>
> $ icc -v
> icc version 15.0.1 (gcc version 4.9.0 compatibility)
>
> $ cat bt.c
> unsigned long long IsBitSetB_64(unsigned long long val, int index) { return (val & (1ULL<<index)) != 0ULL; }
> unsigned int IsBitSetB_32(unsigned int val, int index) { return (val & (1U<<index)) != 0U; }
>
> $ icc -O3 -S bt.c -o - | grep bt
> .file "bt.c"
> btq %rsi, %rdi
> btl %esi, %edi
>
> Does anyone at Intel have guidance for us?
>
> On Thu, Jan 22, 2015 at 4:34 PM, Eric Christopher <echristo at gmail.com> wrote:
>
> On Thu Jan 22 2015 at 3:32:53 PM Chris Sears <chris.sears at gmail.com> wrote:
> The status quo is:
>
> a) 40b REX+BT instruction for the 64b case
> b) 48b TEST for the 32b case
> c) unless it's small TEST
>
> You are currently paying a 16b penalty for TEST vs BT in the 32b case.
> That may be worth testing the -Os flag.
>
> You'll want -Oz here, Os isn't supposed to affect the runtime as much as this is going to.
>
> -eric
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
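To make the variable-mask point above concrete, here is a minimal C sketch of the two ways the 32-bit test can be expressed. The function names are hypothetical, and reducing the index modulo 32 is an assumption of this sketch (the thread's bt.c shifts by a plain int, which is undefined in C for out-of-range values). It only shows that the two forms compute the same answer; it says nothing about which one a given compiler or CPU prefers.

```c
#include <assert.h>
#include <stdint.h>

/* TEST-style form: materialize the mask with a variable shift, then AND it
   with the value and compare against zero.
   Assumption of this sketch: the index is reduced modulo 32 to keep the
   shift well defined in C. */
unsigned is_bit_set_mask(uint32_t val, unsigned index) {
    uint32_t mask = UINT32_C(1) << (index & 31u); /* variable shift builds the mask */
    return (val & mask) != 0u;                     /* AND, then compare with zero */
}

/* BT-style form: select the bit directly. This models the value that BT
   would leave in the carry flag. */
unsigned is_bit_set_bt(uint32_t val, unsigned index) {
    return (val >> (index & 31u)) & 1u;
}

/* Self-check: both forms agree for every bit position of a sample value. */
void check_variable_mask_forms(void) {
    const uint32_t sample = UINT32_C(0xA5A5F00F);
    for (unsigned i = 0; i < 32; ++i) {
        assert(is_bit_set_mask(sample, i) == is_bit_set_bt(sample, i));
    }
}
```

The mask form needs the shift as a separate step before the AND, which is the extra work referred to in point (b) above.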
Full icc code sequence (for the 32-bit case):

    xorl %eax, %eax
    movl $1, %edx
    btl %esi, %edi
    cmovc %edx, %eax
    ret

Chris's code example is actually returning the result, so no 'test' or 'bt' in the constant mask case:

    unsigned int IsBitSetA_32(unsigned int val) { return (val & (1U<<25)) != 0U; }

    andl $33554432, %edi
    shrl $25, %edi
    movl %edi, %eax
    ret
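For the constant-mask case, the quoted icc output (andl $33554432 followed by shrl $25) can be read as a plain and-then-shift. The sketch below, with hypothetical function names, checks that this form returns the same 0 or 1 as the compare-against-zero source form; 33554432 is 1 << 25.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Source form from the thread: test bit 25 and return 0 or 1. */
unsigned is_bit25_set_compare(uint32_t val) {
    return (val & (UINT32_C(1) << 25)) != 0u;
}

/* Form mirroring the quoted icc output: AND with 33554432, then shift right
   by 25 so the isolated bit lands in bit 0. */
unsigned is_bit25_set_and_shift(uint32_t val) {
    return (val & UINT32_C(33554432)) >> 25;
}

/* Self-check over a few hand-picked values with bit 25 set and clear. */
void check_constant_mask_forms(void) {
    const uint32_t samples[] = {
        UINT32_C(0x00000000), /* nothing set */
        UINT32_C(0x02000000), /* only bit 25 set */
        UINT32_C(0x01FFFFFF), /* bits 0..24 set, bit 25 clear */
        UINT32_C(0xFDFFFFFF), /* everything except bit 25 */
        UINT32_C(0xFFFFFFFF)  /* everything set */
    };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        assert(is_bit25_set_compare(samples[i]) ==
               is_bit25_set_and_shift(samples[i]));
    }
}
```

Because the function returns the bit rather than branching on it, no flag-setting instruction is needed at all, which matches the observation above.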
Right, so the xor breaks the false dependency on the previous flags state. Compare to what we get from clang with a variable mask:

    bt %esi, %edi
    sbb %eax, %eax // HAZARD: false dependency on flags state prior to BT.
    and $1, %eax

If we instead generated:

    xor %eax, %eax
    bt %esi, %edi
    adc %eax, %eax

we’d mostly avoid the partial-flags hazard, though we’d still get one extra µop generated. Targeting Haswell, I’d probably rather see:

    shrx %esi, %edi, %eax
    and $1, %eax

but a reasonable case can be made for the bt sequence under -Oz.

As I understand it though, this whole discussion is actually about the constant mask case, for which clang already generates reasonable code.

– Steve
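The three instruction sequences above all produce the same 0 or 1 result; they differ only in how they get there. As a rough way to check that, here is a small C model of each one. The function names and the explicit carry-flag variable are hypothetical conveniences of this sketch, and it models results only, not µop counts or flag dependencies.

```c
#include <assert.h>
#include <stdint.h>

/* Models BT with register operands: the carry flag receives the selected
   bit, with the index taken modulo 32. */
unsigned model_bt_carry(uint32_t val, uint32_t index) {
    return (val >> (index & 31u)) & 1u;
}

/* bt; sbb %eax,%eax; and $1,%eax */
uint32_t model_sbb_and(uint32_t val, uint32_t index) {
    uint32_t eax = UINT32_C(0xDEADBEEF);      /* arbitrary prior register value */
    unsigned cf = model_bt_carry(val, index); /* bt sets the carry flag */
    eax = eax - eax - cf;                     /* sbb: 0 or 0xFFFFFFFF */
    return eax & 1u;                          /* and $1 */
}

/* xor %eax,%eax; bt; adc %eax,%eax */
uint32_t model_xor_adc(uint32_t val, uint32_t index) {
    uint32_t eax = 0u;                        /* xor zeroes the register first */
    unsigned cf = model_bt_carry(val, index); /* bt sets the carry flag */
    eax = eax + eax + cf;                     /* adc: 0 + 0 + carry */
    return eax;
}

/* shrx %esi,%edi,%eax; and $1,%eax (shift count taken modulo 32) */
uint32_t model_shrx_and(uint32_t val, uint32_t index) {
    uint32_t eax = val >> (index & 31u);      /* shrx does not touch flags */
    return eax & 1u;                          /* and $1 */
}

/* Self-check: all three models agree for every index on a few values. */
void check_bt_sequences(void) {
    const uint32_t samples[] = {0u, UINT32_C(0x02000000), UINT32_C(0xA5A5F00F)};
    for (unsigned s = 0; s < 3; ++s) {
        for (uint32_t i = 0; i < 32; ++i) {
            uint32_t expected = model_bt_carry(samples[s], i);
            assert(model_sbb_and(samples[s], i) == expected);
            assert(model_xor_adc(samples[s], i) == expected);
            assert(model_shrx_and(samples[s], i) == expected);
        }
    }
}
```

Note that in the xor/adc variant the xor has to come before the bt, since xor itself writes the flags.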