Mat Hostetter via llvm-dev

2021-Feb-11 13:29 UTC

[llvm-dev] [X86] How do I set just the low byte of an x86_64 register?

I work on a compiler that uses LLVM for its back end. I'm interested in
setting just the low byte of a register, leaving the other bits alone, for some
GC tag bit shenanigans, e.g.:

long replace_low_byte_with_37(long* a) {
  return (*a & ~0xFFL) | 37;
}

x86_64 has a movb instruction that does exactly this, but I can't get clang
(or any other compiler) to use movb for this purpose, even at -Os.

Here is the -Os -march=sandybridge compiler output for gcc-10.2, icc-21.1.9, and
clang-11.0.1 (all different!), as well as how a simple movb assembles:

0000000000000000 <gcc>:
   0:   48 8b 07                mov    (%rdi),%rax
   3:   30 c0                   xor    %al,%al
   5:   48 83 c8 25             or     $0x25,%rax
   9:   c3                      retq
000000000000000a <icc>:
   a:   48 8b 07                mov    (%rdi),%rax
   d:   48 25 00 ff ff ff       and    $0xffffffffffffff00,%rax
  13:   48 83 c0 25             add    $0x25,%rax
  17:   c3                      retq
0000000000000018 <clang>:
  18:   48 c7 c0 00 ff ff ff    mov    $0xffffffffffffff00,%rax
  1f:   48 23 07                and    (%rdi),%rax
  22:   48 83 c8 25             or     $0x25,%rax
  26:   c3                      retq
0000000000000027 <simple_movb_by_hand>:
  27:   48 8b 07                mov    (%rdi),%rax
  2a:   b0 25                   mov    $0x25,%al
  2c:   c3                      retq

As you can see, the movb version would be the smallest at 6 bytes, versus 10
for gcc, 14 for icc, and 15 for clang (LLVM's is the biggest). Size is
important for my use case.

So why don't these compilers generate movb? Perhaps the concern is partial
register stalls and how %rax and %al interact with the register renamer. As I
understand the background
<https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to>
from Peter Cordes, referenced by #34707
<https://bugs.llvm.org/show_bug.cgi?id=34707>, the punchline is that since
Sandy Bridge, and especially Skylake, the partial register stall is no big
deal for an actual RMW operation like this.

But even on CPUs where there is a stall that's worse than the added
instructions from not using movb, -Os should still prefer movb.

I'm not advocating using this for %ah (etc.), which is famously incorrect
<http://gallium.inria.fr/blog/intel-skylake-bug/> in some Skylake and
Kaby Lake CPUs without a microcode patch.

Is there a way to get LLVM to generate movb to set just the low byte?
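
One possible workaround, which nobody in this thread confirms or recommends, is GNU-style extended inline assembly to force the byte move by hand. The sketch below assumes a GCC- or Clang-compatible compiler targeting x86_64; the helper names set_low_byte_portable and set_low_byte_movb are hypothetical and chosen only for illustration. Inline asm is opaque to the optimizer, so whether this is a net size win in real code is an assumption to check against the actual disassembly.

```c
#include <assert.h>
#include <stdint.h>

/* Portable form from the post: clear the low byte, then OR in the new value. */
uint64_t set_low_byte_portable(uint64_t word) {
    return (word & ~UINT64_C(0xFF)) | 37;
}

/* Hypothetical helper: writes 37 into the low byte of whichever register
 * holds `word`, leaving the upper 56 bits alone. Falls back to the portable
 * form when not building for x86_64 with a GNU-compatible compiler. */
uint64_t set_low_byte_movb(uint64_t word) {
#if defined(__x86_64__) && defined(__GNUC__)
    /* %b0 names the 8-bit low part of operand 0; "+r" marks the operand as
     * both read and written, so the compiler keeps the upper bits live. */
    __asm__("movb $37, %b0" : "+r"(word));
    return word;
#else
    return set_low_byte_portable(word);
#endif
}
```

Both helpers should return the same value for any input; the only intended difference is the instruction sequence the compiler emits for the second one.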

Tim Northover via llvm-dev

2021-Feb-11 18:13 UTC

On Thu, 11 Feb 2021 at 13:29, Mat Hostetter via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> But even on CPUs where there is a stall that's worse than the added
> instructions from not using movb, -Os should still prefer movb.

It doesn't really change anything (we still don't do movb), but
Clang's real size optimization option is "-Oz". -Os is much closer to
-O2 with just a hint of caring about size.

Cheers.

Tim.
