thr3ads.net - llvm dev - [llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types [May 2020]

If this information is useful, please help other people find it:
Share via:

Arthur O'Dwyer via llvm-dev

2020-May-27 00:31 UTC

[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types

On Tue, May 26, 2020 at 7:32 PM John McCall via cfe-dev <
cfe-dev at lists.llvm.org> wrote:
> On 26 May 2020, at 18:28, Bill Wendling via llvm-dev wrote:
> > [...] The store is a byte:
> >
> >     orb    $0x1,0x4a(%rbx)
> >
> > while the read is a word:
> >
> >     movzwl 0x4a(%r12),%r15d
> >
> > The problem is that between the store and the load the value
hasn't
> > been retired / placed in the cache. One would expect store-to-load
> > forwarding to kick in, but on x86 that doesn't happen because x86
> > requires the store to be of equal or greater size than the load. So
> > instead the load takes the slow path, causing unacceptable slowdowns.
> [...]
>
> Clang used to generate narrower loads and stores for bit-fields, but a
> long time ago it was intentionally changed to generate wider loads
> and stores, IIRC by Chandler.  There are some cases where I think the
> “new” code goes overboard, but in this case I don’t particularly have
> an issue with the wider loads and stores.  I guess we could make a
> best-effort attempt to stick to the storage-unit size when the
> bit-fields break evenly on a boundary.  But mostly I think the frontend’s
> responsibility ends with it generating same-size accesses in both
> places, and if inconsistent access sizes trigger poor performance,
> the backend should be more careful about intentionally changing access
> sizes.
>
FWIW, when I was at Green Hills, I recall the rule being "Always use the
declared type of the bitfield to govern the size of the read or write."
(There was a similar rule for the meaning of `volatile`. I hope I'm not
just getting confused between the two. Actually, since of the compilers on
Godbolt, only MSVC follows this rule <https://godbolt.org/z/Aq_APH>,
I'm
*probably* wrong.)  That is, if the bitfield is declared `int16_t`, then
use 16-bit loads and stores for it; if it's declared `int32_t`, then use
32-bit loads and stores. This gives the programmer a reason to prefer one
declared type over another. For example, in

template<class T>
struct A {
    T w : 5;
    T x : 3;
    T y : 4;
    T z : 4;
};

the only differences between A<char> and A<short> are
- whether the struct's alignment is 1 or 2, and
- whether you use 8-bit or 16-bit accesses to modify its fields.

"The backend should be more careful about intentionally changing access
sizes" sounds like absolutely the correct diagnosis, to me.

my $.02,
Arthur
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200526/59c929f8/attachment.html>

John McCall via llvm-dev

2020-May-27 01:29 UTC

head link

[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types

On 26 May 2020, at 20:31, Arthur O'Dwyer wrote:> On Tue, May 26, 2020 at 7:32 PM John McCall via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> On 26 May 2020, at 18:28, Bill Wendling via llvm-dev wrote:
>>> [...] The store is a byte:
>>>
>>>     orb    $0x1,0x4a(%rbx)
>>>
>>> while the read is a word:
>>>
>>>     movzwl 0x4a(%r12),%r15d
>>>
>>> The problem is that between the store and the load the value
hasn't
>>> been retired / placed in the cache. One would expect store-to-load
>>> forwarding to kick in, but on x86 that doesn't happen because
x86
>>> requires the store to be of equal or greater size than the load. So
>>> instead the load takes the slow path, causing unacceptable 
>>> slowdowns.
>> [...]
>>
>> Clang used to generate narrower loads and stores for bit-fields, but 
>> a
>> long time ago it was intentionally changed to generate wider loads
>> and stores, IIRC by Chandler.  There are some cases where I think the
>> “new” code goes overboard, but in this case I don’t 
>> particularly have
>> an issue with the wider loads and stores.  I guess we could make a
>> best-effort attempt to stick to the storage-unit size when the
>> bit-fields break evenly on a boundary.  But mostly I think the 
>> frontend’s
>> responsibility ends with it generating same-size accesses in both
>> places, and if inconsistent access sizes trigger poor performance,
>> the backend should be more careful about intentionally changing 
>> access
>> sizes.
>>
>
> FWIW, when I was at Green Hills, I recall the rule being "Always use 
> the
> declared type of the bitfield to govern the size of the read or 
> write."
> (There was a similar rule for the meaning of `volatile`. I hope I'm 
> not
> just getting confused between the two. Actually, since of the 
> compilers on
> Godbolt, only MSVC follows this rule <https://godbolt.org/z/Aq_APH>, 
> I'm
> *probably* wrong.)  That is, if the bitfield is declared `int16_t`, 
> then
> use 16-bit loads and stores for it; if it's declared `int32_t`, then 
> use
> 32-bit loads and stores.
I’ve always liked MSVC’s bit-field rules as a coherent whole, but 
they are
quite different from the standard Unix rules.  On Windows, `T x : 3`
literally allocates an entire `T` in the structure, and successive
bit-fields get packed into that `T` only if their base type is of the
same size (and they haven’t exhausted the original `T`).  So of course
all accesses to that bit-field are basically of the full size of the 
`T`;
there’s no overlap to be concerned with.  On Unix, bit-fields will 
typically
get packed together regardless of the base type; the base type does have
some influence, but it’s target-specific and somewhat odd.

I’d prefer if we degraded to a Windows-like access behavior as much
as we can, but it’s not always possible because of that packing.

John.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200526/a1ee328b/attachment.html>

James Y Knight via llvm-dev

2020-May-27 02:49 UTC

head link

[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types

At least in this test-case, the "bitfield" part of this seems to be a
distraction. As Eli notes, Clang has lowered the function to LLVM IR
containing consistent i16 operations. Despite that being a different choice
from GCC, it should still be correct and consistent.

Of course that insight does mean it's quite easy to create a test-case with
the exact same problematic store->load mismatch which doesn't use
bit-fields at all. For example:
short f2(short *bfs) {
    *bfs &= ~0x1;
    g();
    return *bfs;
}

creates the same bad sequence:
        movq    %rdi, %rbx
        andb    $-2, (%rdi)
        callq   g()
        movzwl  (%rbx), %eax

On Tue, May 26, 2020 at 9:30 PM John McCall via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> On 26 May 2020, at 20:31, Arthur O'Dwyer wrote:
>
> On Tue, May 26, 2020 at 7:32 PM John McCall via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
> On 26 May 2020, at 18:28, Bill Wendling via llvm-dev wrote:
>
> [...] The store is a byte:
>
> orb $0x1,0x4a(%rbx)
>
> while the read is a word:
>
> movzwl 0x4a(%r12),%r15d
>
> The problem is that between the store and the load the value hasn't
> been retired / placed in the cache. One would expect store-to-load
> forwarding to kick in, but on x86 that doesn't happen because x86
> requires the store to be of equal or greater size than the load. So
> instead the load takes the slow path, causing unacceptable slowdowns.
>
> [...]
>
> Clang used to generate narrower loads and stores for bit-fields, but a
> long time ago it was intentionally changed to generate wider loads
> and stores, IIRC by Chandler. There are some cases where I think the
> “new” code goes overboard, but in this case I don’t particularly have
> an issue with the wider loads and stores. I guess we could make a
> best-effort attempt to stick to the storage-unit size when the
> bit-fields break evenly on a boundary. But mostly I think the frontend’s
> responsibility ends with it generating same-size accesses in both
> places, and if inconsistent access sizes trigger poor performance,
> the backend should be more careful about intentionally changing access
> sizes.
>
> FWIW, when I was at Green Hills, I recall the rule being "Always use
the
> declared type of the bitfield to govern the size of the read or
write."
> (There was a similar rule for the meaning of `volatile`. I hope I'm not
> just getting confused between the two. Actually, since of the compilers on
> Godbolt, only MSVC follows this rule <https://godbolt.org/z/Aq_APH>,
I'm
> *probably* wrong.) That is, if the bitfield is declared `int16_t`, then
> use 16-bit loads and stores for it; if it's declared `int32_t`, then
use
> 32-bit loads and stores.
>
> I’ve always liked MSVC’s bit-field rules as a coherent whole, but they are
> quite different from the standard Unix rules. On Windows, T x : 3
> literally allocates an entire T in the structure, and successive
> bit-fields get packed into that T only if their base type is of the
> same size (and they haven’t exhausted the original T). So of course
> all accesses to that bit-field are basically of the full size of the T;
> there’s no overlap to be concerned with. On Unix, bit-fields will typically
> get packed together regardless of the base type; the base type does have
> some influence, but it’s target-specific and somewhat odd.
>
> I’d prefer if we degraded to a Windows-like access behavior as much
> as we can, but it’s not always possible because of that packing.
>
> John.
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200526/03d20cab/attachment.html>

llvm dev - May 2020 - [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types

[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types

[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types

[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types