Bill Wendling via llvm-dev
2020-May-26 22:28 UTC
[llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types
We're running into an interesting issue with the Linux kernel, and wanted advice on how to proceed.

Example of what we're talking about: https://godbolt.org/z/ABGySq

The issue is that when working with a bitfield, a load may happen quickly after a store. For instance:

    struct napi_gro_cb {
        void *frag;
        unsigned int frag_len;
        u16 flush;
        u16 flush_id;
        u16 count;
        u16 gro_remcsum_start;
        unsigned long age;
        u16 proto;
        u8 same_flow : 1;
        u8 encap_mark : 1;
        u8 csum_valid : 1;
        u8 csum_cnt : 3;
        u8 free : 2;
        u8 is_ipv6 : 1;
        u8 is_fou : 1;
        u8 is_atomic : 1;
        u8 recursive_counter : 4;
        __wsum csum;
        struct sk_buff *last;
    };

    void dev_gro_receive(struct sk_buff *skb)
    {
        ...
        same_flow = NAPI_GRO_CB(skb)->same_flow;
        ...
    }

Right before the "same_flow = ...->same_flow;" statement is executed, a store is made to the bitfield at the end of a called function:

    NAPI_GRO_CB(skb)->same_flow = 1;

The store is a byte:

    orb $0x1,0x4a(%rbx)

while the read is a word:

    movzwl 0x4a(%r12),%r15d

The problem is that between the store and the load the value hasn't been retired / placed in the cache. One would expect store-to-load forwarding to kick in, but on x86 that doesn't happen, because x86 requires the store to be of equal or greater size than the load. So instead the load takes the slow path, causing unacceptable slowdowns.

GCC gets around this by using the smallest load for a bitfield. It seems to use a byte for everything, at least in our examples. From the comments, this is intentional: according to the comments (which are never wrong), C++0x doesn't allow one to touch bits outside of the bitfield. (I'm not a language lawyer, but I take this to mean that GCC is trying to minimize which bits it's accessing by using byte stores and loads whenever possible.)

The question I have is what should we do to fix this issue? Once we get to LLVM IR, the information saying that we're accessing a bitfield is gone. We have a few options:

* We can glean this information from how the loaded value is used and fix it during DAG combine, but that seems a bit brittle.

* We could correct the size of the load during the front-end's code generation. This benefits from using all of LLVM's passes on the code.

* We could perform the transformation in another place, possibly in MIR or MC.

What do people think?

-bw
Eli Friedman via llvm-dev
2020-May-26 22:54 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
By default, clang emits all bitfield load/store operations using the width of the entire sequence of bitfield members. If you look at the LLVM IR for your testcase, all the bitfield operations are i16. (For thread safety, the C/C++ standards treat a sequence of bitfield members as a single "field".)

If you look at the assembly, though, an "andb $-2, (%rdi)" slips in. This is specific to the x86 backend: it's narrowing the store to save a couple of bytes in the encoding, and a potential decoding stall due to a 2-byte immediate. Maybe we shouldn't do that, or we should guard it with a better heuristic.

-Eli

> [Bill Wendling's original message quoted in full; snipped]
Craig Topper via llvm-dev
2020-May-26 23:00 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
We also have the opposite problem of the store shrinking: we'll also try to widen a 4-byte-aligned i16/i8 extload to i32. I didn't count out the alignment in this particular struct layout.

~Craig

On Tue, May 26, 2020 at 3:54 PM Eli Friedman via cfe-dev <cfe-dev at lists.llvm.org> wrote:

> [Eli's reply and the quoted original message snipped]
Bill Wendling via llvm-dev
2020-May-26 23:17 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
When writing my initial email, I forgot another option, which Eli pointed out: don't shrink the store's size. That would be acceptable for our purposes. If it's something that needs further consideration, perhaps we could disable it via a flag (not an official "-m..." flag, but "-mllvm -disable-store-shortening" or whatever)?

-bw

On Tue, May 26, 2020 at 3:54 PM Eli Friedman <efriedma at quicinc.com> wrote:

> [Eli's reply and the quoted original message snipped]
John McCall via llvm-dev
2020-May-26 23:32 UTC
[llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types
On 26 May 2020, at 18:28, Bill Wendling via llvm-dev wrote:

> [original message snipped]

Clang used to generate narrower loads and stores for bit-fields, but a long time ago it was intentionally changed to generate wider loads and stores, IIRC by Chandler. There are some cases where I think the "new" code goes overboard, but in this case I don't particularly have an issue with the wider loads and stores. I guess we could make a best-effort attempt to stick to the storage-unit size when the bit-fields break evenly on a boundary. But mostly I think the frontend's responsibility ends with it generating same-size accesses in both places, and if inconsistent access sizes trigger poor performance, the backend should be more careful about intentionally changing access sizes.

John.
Craig Topper via llvm-dev
2020-May-26 23:38 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
I spent some time looking at this back in March when Bill mentioned it on IRC. I think I saw a write to one bit in one of the 8-bit pieces and then a read of that bit together with a bit from the adjacent byte. So we used a narrow store and then a wider load due to the 2 bits.

~Craig

On Tue, May 26, 2020 at 4:32 PM John McCall via cfe-dev <cfe-dev at lists.llvm.org> wrote:

> [John's reply and the quoted original message snipped]
Arthur O'Dwyer via llvm-dev
2020-May-27 00:31 UTC
[llvm-dev] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types
On Tue, May 26, 2020 at 7:32 PM John McCall via cfe-dev <cfe-dev at lists.llvm.org> wrote:

> [John's reply, quoting the original message; snipped]

FWIW, when I was at Green Hills, I recall the rule being "Always use the declared type of the bitfield to govern the size of the read or write." (There was a similar rule for the meaning of `volatile`. I hope I'm not just getting confused between the two. Actually, since of the compilers on Godbolt only MSVC follows this rule <https://godbolt.org/z/Aq_APH>, I'm *probably* wrong.) That is, if the bitfield is declared `int16_t`, then use 16-bit loads and stores for it; if it's declared `int32_t`, then use 32-bit loads and stores. This gives the programmer a reason to prefer one declared type over another.

For example, in

    template<class T>
    struct A {
        T w : 5;
        T x : 3;
        T y : 4;
        T z : 4;
    };

the only differences between A<char> and A<short> are

- whether the struct's alignment is 1 or 2, and
- whether you use 8-bit or 16-bit accesses to modify its fields.

"The backend should be more careful about intentionally changing access sizes" sounds like absolutely the correct diagnosis, to me.

my $.02,
Arthur