Arthur O'Dwyer via llvm-dev
2020-May-27 00:31 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
On Tue, May 26, 2020 at 7:32 PM John McCall via cfe-dev < cfe-dev at lists.llvm.org> wrote:> On 26 May 2020, at 18:28, Bill Wendling via llvm-dev wrote: > > [...] The store is a byte: > > > > orb $0x1,0x4a(%rbx) > > > > while the read is a word: > > > > movzwl 0x4a(%r12),%r15d > > > > The problem is that between the store and the load the value hasn't > > been retired / placed in the cache. One would expect store-to-load > > forwarding to kick in, but on x86 that doesn't happen because x86 > > requires the store to be of equal or greater size than the load. So > > instead the load takes the slow path, causing unacceptable slowdowns. > [...] > > Clang used to generate narrower loads and stores for bit-fields, but a > long time ago it was intentionally changed to generate wider loads > and stores, IIRC by Chandler. There are some cases where I think the > “new” code goes overboard, but in this case I don’t particularly have > an issue with the wider loads and stores. I guess we could make a > best-effort attempt to stick to the storage-unit size when the > bit-fields break evenly on a boundary. But mostly I think the frontend’s > responsibility ends with it generating same-size accesses in both > places, and if inconsistent access sizes trigger poor performance, > the backend should be more careful about intentionally changing access > sizes. >FWIW, when I was at Green Hills, I recall the rule being "Always use the declared type of the bitfield to govern the size of the read or write." (There was a similar rule for the meaning of `volatile`. I hope I'm not just getting confused between the two. Actually, since of the compilers on Godbolt, only MSVC follows this rule <https://godbolt.org/z/Aq_APH>, I'm *probably* wrong.) That is, if the bitfield is declared `int16_t`, then use 16-bit loads and stores for it; if it's declared `int32_t`, then use 32-bit loads and stores. This gives the programmer a reason to prefer one declared type over another. For example, in template<class T> struct A { T w : 5; T x : 3; T y : 4; T z : 4; }; the only differences between A<char> and A<short> are - whether the struct's alignment is 1 or 2, and - whether you use 8-bit or 16-bit accesses to modify its fields. "The backend should be more careful about intentionally changing access sizes" sounds like absolutely the correct diagnosis, to me. my $.02, Arthur -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200526/59c929f8/attachment.html>
John McCall via llvm-dev
2020-May-27 01:29 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
On 26 May 2020, at 20:31, Arthur O'Dwyer wrote:> On Tue, May 26, 2020 at 7:32 PM John McCall via cfe-dev < > cfe-dev at lists.llvm.org> wrote: > >> On 26 May 2020, at 18:28, Bill Wendling via llvm-dev wrote: >>> [...] The store is a byte: >>> >>> orb $0x1,0x4a(%rbx) >>> >>> while the read is a word: >>> >>> movzwl 0x4a(%r12),%r15d >>> >>> The problem is that between the store and the load the value hasn't >>> been retired / placed in the cache. One would expect store-to-load >>> forwarding to kick in, but on x86 that doesn't happen because x86 >>> requires the store to be of equal or greater size than the load. So >>> instead the load takes the slow path, causing unacceptable >>> slowdowns. >> [...] >> >> Clang used to generate narrower loads and stores for bit-fields, but >> a >> long time ago it was intentionally changed to generate wider loads >> and stores, IIRC by Chandler. There are some cases where I think the >> “new” code goes overboard, but in this case I don’t >> particularly have >> an issue with the wider loads and stores. I guess we could make a >> best-effort attempt to stick to the storage-unit size when the >> bit-fields break evenly on a boundary. But mostly I think the >> frontend’s >> responsibility ends with it generating same-size accesses in both >> places, and if inconsistent access sizes trigger poor performance, >> the backend should be more careful about intentionally changing >> access >> sizes. >> > > FWIW, when I was at Green Hills, I recall the rule being "Always use > the > declared type of the bitfield to govern the size of the read or > write." > (There was a similar rule for the meaning of `volatile`. I hope I'm > not > just getting confused between the two. Actually, since of the > compilers on > Godbolt, only MSVC follows this rule <https://godbolt.org/z/Aq_APH>, > I'm > *probably* wrong.) That is, if the bitfield is declared `int16_t`, > then > use 16-bit loads and stores for it; if it's declared `int32_t`, then > use > 32-bit loads and stores.I’ve always liked MSVC’s bit-field rules as a coherent whole, but they are quite different from the standard Unix rules. On Windows, `T x : 3` literally allocates an entire `T` in the structure, and successive bit-fields get packed into that `T` only if their base type is of the same size (and they haven’t exhausted the original `T`). So of course all accesses to that bit-field are basically of the full size of the `T`; there’s no overlap to be concerned with. On Unix, bit-fields will typically get packed together regardless of the base type; the base type does have some influence, but it’s target-specific and somewhat odd. I’d prefer if we degraded to a Windows-like access behavior as much as we can, but it’s not always possible because of that packing. John. -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200526/a1ee328b/attachment.html>
James Y Knight via llvm-dev
2020-May-27 02:49 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
At least in this test-case, the "bitfield" part of this seems to be a distraction. As Eli notes, Clang has lowered the function to LLVM IR containing consistent i16 operations. Despite that being a different choice from GCC, it should still be correct and consistent. Of course that insight does mean it's quite easy to create a test-case with the exact same problematic store->load mismatch which doesn't use bit-fields at all. For example: short f2(short *bfs) { *bfs &= ~0x1; g(); return *bfs; } creates the same bad sequence: movq %rdi, %rbx andb $-2, (%rdi) callq g() movzwl (%rbx), %eax On Tue, May 26, 2020 at 9:30 PM John McCall via llvm-dev < llvm-dev at lists.llvm.org> wrote:> On 26 May 2020, at 20:31, Arthur O'Dwyer wrote: > > On Tue, May 26, 2020 at 7:32 PM John McCall via cfe-dev < > cfe-dev at lists.llvm.org> wrote: > > On 26 May 2020, at 18:28, Bill Wendling via llvm-dev wrote: > > [...] The store is a byte: > > orb $0x1,0x4a(%rbx) > > while the read is a word: > > movzwl 0x4a(%r12),%r15d > > The problem is that between the store and the load the value hasn't > been retired / placed in the cache. One would expect store-to-load > forwarding to kick in, but on x86 that doesn't happen because x86 > requires the store to be of equal or greater size than the load. So > instead the load takes the slow path, causing unacceptable slowdowns. > > [...] > > Clang used to generate narrower loads and stores for bit-fields, but a > long time ago it was intentionally changed to generate wider loads > and stores, IIRC by Chandler. There are some cases where I think the > “new” code goes overboard, but in this case I don’t particularly have > an issue with the wider loads and stores. I guess we could make a > best-effort attempt to stick to the storage-unit size when the > bit-fields break evenly on a boundary. But mostly I think the frontend’s > responsibility ends with it generating same-size accesses in both > places, and if inconsistent access sizes trigger poor performance, > the backend should be more careful about intentionally changing access > sizes. > > FWIW, when I was at Green Hills, I recall the rule being "Always use the > declared type of the bitfield to govern the size of the read or write." > (There was a similar rule for the meaning of `volatile`. I hope I'm not > just getting confused between the two. Actually, since of the compilers on > Godbolt, only MSVC follows this rule <https://godbolt.org/z/Aq_APH>, I'm > *probably* wrong.) That is, if the bitfield is declared `int16_t`, then > use 16-bit loads and stores for it; if it's declared `int32_t`, then use > 32-bit loads and stores. > > I’ve always liked MSVC’s bit-field rules as a coherent whole, but they are > quite different from the standard Unix rules. On Windows, T x : 3 > literally allocates an entire T in the structure, and successive > bit-fields get packed into that T only if their base type is of the > same size (and they haven’t exhausted the original T). So of course > all accesses to that bit-field are basically of the full size of the T; > there’s no overlap to be concerned with. On Unix, bit-fields will typically > get packed together regardless of the base type; the base type does have > some influence, but it’s target-specific and somewhat odd. > > I’d prefer if we degraded to a Windows-like access behavior as much > as we can, but it’s not always possible because of that packing. > > John. > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200526/03d20cab/attachment.html>
Reasonably Related Threads
- [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
- [RFC] Loading Bitfields with Smallest Needed Types
- [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
- [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
- [Bug 93557] New: Kernel Panic on Linux Kernel 4.4 when loading KDE/KDM on Nvidia GeForce 7025 / nForce 630a