James Y Knight via llvm-dev
2020-May-27 02:49 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
At least in this test-case, the "bitfield" part of this seems to be a
distraction. As Eli notes, Clang has lowered the function to LLVM IR
containing consistent i16 operations. Despite that being a different
choice from GCC, it should still be correct and consistent.

Of course that insight does mean it's quite easy to create a test-case
with the exact same problematic store->load mismatch which doesn't use
bit-fields at all. For example:

  short f2(short *bfs) {
    *bfs &= ~0x1;
    g();
    return *bfs;
  }

creates the same bad sequence:

  movq    %rdi, %rbx
  andb    $-2, (%rdi)
  callq   g()
  movzwl  (%rbx), %eax

On Tue, May 26, 2020 at 9:30 PM John McCall via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> On 26 May 2020, at 20:31, Arthur O'Dwyer wrote:
>
> On Tue, May 26, 2020 at 7:32 PM John McCall via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
> On 26 May 2020, at 18:28, Bill Wendling via llvm-dev wrote:
>
> [...] The store is a byte:
>
>   orb $0x1,0x4a(%rbx)
>
> while the read is a word:
>
>   movzwl 0x4a(%r12),%r15d
>
> The problem is that between the store and the load, the value hasn't
> been retired / placed in the cache. One would expect store-to-load
> forwarding to kick in, but on x86 that doesn't happen because x86
> requires the store to be of equal or greater size than the load. So
> instead the load takes the slow path, causing unacceptable slowdowns.
>
> [...]
>
> Clang used to generate narrower loads and stores for bit-fields, but a
> long time ago it was intentionally changed to generate wider loads
> and stores, IIRC by Chandler. There are some cases where I think the
> “new” code goes overboard, but in this case I don’t particularly have
> an issue with the wider loads and stores. I guess we could make a
> best-effort attempt to stick to the storage-unit size when the
> bit-fields break evenly on a boundary. But mostly I think the
> frontend’s responsibility ends with it generating same-size accesses
> in both places, and if inconsistent access sizes trigger poor
> performance, the backend should be more careful about intentionally
> changing access sizes.
>
> FWIW, when I was at Green Hills, I recall the rule being "Always use
> the declared type of the bitfield to govern the size of the read or
> write." (There was a similar rule for the meaning of `volatile`. I
> hope I'm not just getting confused between the two. Actually, since,
> of the compilers on Godbolt, only MSVC follows this rule
> <https://godbolt.org/z/Aq_APH>, I'm *probably* wrong.) That is, if the
> bitfield is declared `int16_t`, then use 16-bit loads and stores for
> it; if it's declared `int32_t`, then use 32-bit loads and stores.
>
> I’ve always liked MSVC’s bit-field rules as a coherent whole, but they
> are quite different from the standard Unix rules. On Windows, `T x : 3`
> literally allocates an entire T in the structure, and successive
> bit-fields get packed into that T only if their base type is of the
> same size (and they haven’t exhausted the original T). So of course
> all accesses to that bit-field are basically of the full size of the T;
> there’s no overlap to be concerned with. On Unix, bit-fields will
> typically get packed together regardless of the base type; the base
> type does have some influence, but it’s target-specific and somewhat
> odd.
>
> I’d prefer if we degraded to a Windows-like access behavior as much
> as we can, but it’s not always possible because of that packing.
>
> John.
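[A minimal sketch of the MSVC-vs-Unix packing difference John describes
above; the layout results noted in the comments are assumptions about
typical x86-64 behavior, worth verifying locally, not measurements from
the thread:]

  #include <stdio.h>

  struct bf {
      int   a : 3;  /* MSVC: allocates a full 4-byte int unit            */
      short b : 4;  /* MSVC: base type size differs, so a NEW 2-byte
                       unit; Unix ABIs typically pack it next to 'a'     */
  };

  int main(void) {
      /* Commonly 8 under MSVC and 4 under GCC/Clang on Linux x86-64;
         these values are assumptions to check, not from the thread. */
      printf("sizeof(struct bf) = %zu\n", sizeof(struct bf));
      return 0;
  }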
Craig Topper via llvm-dev
2020-May-27 03:37 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
Similar issue with aligned extload widening. The load and store are both
i16 in IR.

  struct s {
    int x; // force alignment
    short y;
  };

  void g();
  short f2(s *a, short b) {
    a->y = b;
    g();
    return a->y - b;
  }

produces this:

  movq    %rdi, %rbx
  movw    %bp, 4(%rdi)   // 2 byte store
  callq   g()
  movl    4(%rbx), %eax  // 4 byte load
  subl    %ebp, %eax

~Craig

On Tue, May 26, 2020 at 7:49 PM James Y Knight via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> At least in this test-case, the "bitfield" part of this seems to be a
> distraction. As Eli notes, Clang has lowered the function to LLVM IR
> containing consistent i16 operations. Despite that being a different
> choice from GCC, it should still be correct and consistent.
>
> [...]
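[The widened 4-byte load here is only in-bounds because of the struct's
layout; a minimal sketch of why, assuming the usual x86-64 ABI. The
offsets and padding below are assumptions, not taken from the thread:]

  #include <assert.h>
  #include <stddef.h>

  struct s {
      int x;    /* offset 0; forces 4-byte alignment of the struct */
      short y;  /* offset 4; followed by 2 bytes of tail padding   */
  };

  int main(void) {
      /* With the assumed layout, a 4-byte load at the offset of 'y'
         stays inside the object, which is what makes the backend's
         extload widening legal (even if not fast here). */
      assert(offsetof(struct s, y) == 4);
      assert(sizeof(struct s) == 8);
      return 0;
  }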
Xinliang David Li via llvm-dev
2020-May-27 16:25 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
There is a related discussion about the load-widening (byte read -->
64-bit read) transformation from 9 years ago:
http://lists.llvm.org/pipermail/llvm-dev/2011-December/046322.html

David

On Tue, May 26, 2020 at 8:37 PM Craig Topper via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> Similar issue with aligned extload widening. The load and store are
> both i16 in IR.
>
> [...]
Bill Wendling via llvm-dev
2020-May-28 22:42 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
On Tue, May 26, 2020 at 7:49 PM James Y Knight via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> At least in this test-case, the "bitfield" part of this seems to be a
> distraction. As Eli notes, Clang has lowered the function to LLVM IR
> containing consistent i16 operations. Despite that being a different
> choice from GCC, it should still be correct and consistent.
>
I suspect that this is more prevalent with bitfields, as they're more
likely to have load / bitwise op / store operations done on them,
resulting in an access type that can be shortened. But yes, it's not
specific to just bitfields.

I'm more interested in consistency, to be honest. If the loads and
stores for the bitfields (or other such shorten-able objects) were the
same size, then we wouldn't run into the store-to-load forwarding issue
on x86 (I don't know about other platforms, but I suspect that
consistency wouldn't hurt there either). I liked Arthur's idea of
accessing the object using the size of the type the bitfield was
declared with (i8, i16, i256). It would help with improving the
heuristic. The downside is that it could lead to suboptimal code, but
that's the situation we have now, so...

-bw

> Of course that insight does mean it's quite easy to create a test-case
> with the exact same problematic store->load mismatch which doesn't use
> bit-fields at all. [...]
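[As a concrete reading of the declared-type idea Bill endorses above:
a hypothetical bit-field whose declared type is 16 bits wide would get
16-bit accesses on both the store and the load side. The struct and the
codegen notes below are illustrative assumptions, not code from the
thread:]

  #include <stdint.h>

  struct flags {
      uint16_t ready  : 1;  /* declared uint16_t => 16-bit accesses */
      uint16_t count  : 7;
      uint16_t flavor : 8;
  };

  void set_ready(struct flags *f) {
      /* Hoped-for codegen under the declared-type rule (a sketch):
         a 16-bit read-modify-write such as 'orw $1, (%rdi)', never a
         narrowed byte store that defeats store-to-load forwarding on
         a later 16-bit load. */
      f->ready = 1;
  }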
John McCall via llvm-dev
2020-May-29 18:05 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
On 28 May 2020, at 18:42, Bill Wendling wrote:

> On Tue, May 26, 2020 at 7:49 PM James Y Knight via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>>
>> At least in this test-case, the "bitfield" part of this seems to be a
>> distraction. As Eli notes, Clang has lowered the function to LLVM IR
>> containing consistent i16 operations. Despite that being a different
>> choice from GCC, it should still be correct and consistent.
>>
> I suspect that this is more prevalent with bitfields, as they're more
> likely to have load / bitwise op / store operations done on them,
> resulting in an access type that can be shortened. But yes, it's not
> specific to just bitfields.
>
> I'm more interested in consistency, to be honest. If the loads and
> stores for the bitfields (or other such shorten-able objects) were the
> same size, then we wouldn't run into the store-to-load forwarding
> issue on x86 [...] I liked Arthur's idea of accessing the object using
> the size of the type the bitfield was declared with (i8, i16, i256).
> It would help with improving the heuristic. The downside is that it
> could lead to suboptimal code, but that's the situation we have now,
> so...

Okay, but what concretely are you suggesting here? Clang IRGen is
emitting accesses with consistent sizes, and LLVM is making them
inconsistent. Are you just asking Clang to emit smaller accesses in the
hope that LLVM won’t mess them up?

John.
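[To restate John's point against James's earlier test-case: the IR
widths match, and it is backend lowering that makes them diverge. A
sketch with the IR and the resulting asm as comments; which backend
combine performs the narrowing is an assumption on my part, not
something identified in the thread:]

  void g(void);

  short f2(short *p) {
      /* Clang IRGen emits consistent widths here:
         %v = load i16 ; %m = and i16 %v, -2 ; store i16 %m */
      *p &= ~0x1;   /* backend narrows only the store: andb $-2, (%rdi)  */
      g();          /* the call forces the store to be live in memory    */
      return *p;    /* the reload stays 16 bits: movzwl (%rbx), %eax     */
  }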