Alexander Potapenko via llvm-dev
2019-Aug-01 16:50 UTC
[llvm-dev] Dead store elimination in the backend for -ftrivial-auto-var-init
On Thu, Aug 1, 2019 at 6:38 PM JF Bastien <jfbastien at apple.com> wrote:> > > > > On Aug 1, 2019, at 9:20 AM, Alexander Potapenko <glider at google.com> wrote: > > > > On Thu, Aug 1, 2019 at 6:09 PM JF Bastien <jfbastien at apple.com> wrote: > >> > >> Hi Alexander, > >> > >> The code doesn’t compile. Could you send a godbolt.org link that shows the issue? > > Sorry about that, here's the link: https://godbolt.org/z/-PinQP > > > > Lines 4 to 8 are initializing |acpar|. > > If I'm understanding correctly, the store to 8(%rsp) at line 7 can be > > removed because of the store at line 35. > > These are in different basic blocks. IIRC there are a bunch of cases where DCE just fails when there’s any control flow. I think this is what you’re seeing here.I think Vitaly's patches (https://reviews.llvm.org/D61879) should allow DSE to work across basic blocks. But, as you mentioned, the problem is that we only have a memset at IR level.> > > Same for store to 24(%rsp), which is later overwritten at lines 11 and 16. > > I’m not sure what we should do here… If it were a constant store we’d fold it in, but here both stores are variables so it’s not clear that we really should be doing much in MemCpyOptimizer. > The problem I see here is: there’s a small memset, followed by variable stores to the same location. DCE’ing requires realizing that the memset is small enough that the backend will expand it, at which point DCE will kick in. MemCpyOptimizer doesn’t know what “small” is, partly because it’s based on alignment and optimization level. Maybe it can be a target hook (“what would you do with this memset?”), that MemCpyOptimizer uses when it sees a memset followed by variable stores to the same location? We already fold in constant stores (replace the m() call by one)… but again this isn’t clearly a win: if you enable SSE you’ll get wide stores, followed by stores of smaller size. With SSE I’m not sure I’d want us to remove the MOVAPS instructions.Yes, this depends very much on the backend flags, therefore it might be too early to lower that memset before DSE. I was thinking about some kind of a peephole optimization, that deletes stores to the same memory location (at least for %rsp-based addresses) in the backend. I suspect AArch64 has something like this already, at least we don't see these redundant stores there.> > > >> Thanks, > >> > >> JF > >> > >>> On Aug 1, 2019, at 7:21 AM, Alexander Potapenko <glider at google.com> wrote: > >>> > >>> Hi folks, > >>> > >>> When compiling the attached example with -ftrivial-auto-var-init=zero: > >>> > >>> $ clang -no-integrated-as -mno-sse -m64 -mstack-alignment=8 -O2 > >>> -ftrivial-auto-var-init=zero > >>> -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang > >>> -g -o ipt.ll -c ipt.i -w -S -emit-llvm > >>> > >>> , Clang generates an initialization memset() call for |acpar| in the IR: > >>> > >>> %0 = bitcast %struct.xt_action_param* %acpar to i8*, !dbg !27 > >>> call void @llvm.memset.p0i8.i64(i8* nonnull align 8 %0, i8 0, i64 > >>> 40, i1 false), !dbg !28 > >>> > >>> Clang only splits memsets into series of stores under certain > >>> conditions, so it's hard to remove redundant stores on the IR level > >>> (even with the recent DSE patches: https://reviews.llvm.org/D61879). > >>> In the resulting assembly, however, the memset is lowered into a > >>> series of MOVQ instructions (see the attached ipt.s file), of which at > >>> least the stores to 24(%rsp) and 16(%rsp) are redundant. > >>> > >>> Given that at MCInst level we already know if a memset is split into a > >>> sequence of stores, can we do a better job by making the backend > >>> delete these redundant stores for us? > >>> > >>> Alex > >>> > >>> -- > >>> Alexander Potapenko > >>> Software Engineer > >>> > >>> Google Germany GmbH > >>> Erika-Mann-Straße, 33 > >>> 80636 München > >>> > >>> Geschäftsführer: Paul Manicle, Halimah DeLaine Prado > >>> Registergericht und -nummer: Hamburg, HRB 86891 > >>> Sitz der Gesellschaft: Hamburg > >>> <ipt.i><ipt.s> > >> > > > > > > -- > > Alexander Potapenko > > Software Engineer > > > > Google Germany GmbH > > Erika-Mann-Straße, 33 > > 80636 München > > > > Geschäftsführer: Paul Manicle, Halimah DeLaine Prado > > Registergericht und -nummer: Hamburg, HRB 86891 > > Sitz der Gesellschaft: Hamburg >-- Alexander Potapenko Software Engineer Google Germany GmbH Erika-Mann-Straße, 33 80636 München Geschäftsführer: Paul Manicle, Halimah DeLaine Prado Registergericht und -nummer: Hamburg, HRB 86891 Sitz der Gesellschaft: Hamburg
Vitaly Buka via llvm-dev
2019-Aug-02 06:29 UTC
[llvm-dev] Dead store elimination in the backend for -ftrivial-auto-var-init
On a first look case like this should nor be a problem. The tail of the memset here is unused because it's replaced immediately without reads . So it can be trimmed. I will try the patch and let you know. On Thu, Aug 1, 2019 at 9:51 AM Alexander Potapenko <glider at google.com> wrote:> On Thu, Aug 1, 2019 at 6:38 PM JF Bastien <jfbastien at apple.com> wrote: > > > > > > > > > On Aug 1, 2019, at 9:20 AM, Alexander Potapenko <glider at google.com> > wrote: > > > > > > On Thu, Aug 1, 2019 at 6:09 PM JF Bastien <jfbastien at apple.com> wrote: > > >> > > >> Hi Alexander, > > >> > > >> The code doesn’t compile. Could you send a godbolt.org link that > shows the issue? > > > Sorry about that, here's the link: https://godbolt.org/z/-PinQP > > > > > > Lines 4 to 8 are initializing |acpar|. > > > If I'm understanding correctly, the store to 8(%rsp) at line 7 can be > > > removed because of the store at line 35. > > > > These are in different basic blocks. IIRC there are a bunch of cases > where DCE just fails when there’s any control flow. I think this is what > you’re seeing here. > I think Vitaly's patches (https://reviews.llvm.org/D61879) should > allow DSE to work across basic blocks. But, as you mentioned, the > problem is that we only have a memset at IR level. > > > > > Same for store to 24(%rsp), which is later overwritten at lines 11 and > 16. > > > > I’m not sure what we should do here… If it were a constant store we’d > fold it in, but here both stores are variables so it’s not clear that we > really should be doing much in MemCpyOptimizer. > > The problem I see here is: there’s a small memset, followed by variable > stores to the same location. DCE’ing requires realizing that the memset is > small enough that the backend will expand it, at which point DCE will kick > in. MemCpyOptimizer doesn’t know what “small” is, partly because it’s based > on alignment and optimization level. Maybe it can be a target hook (“what > would you do with this memset?”), that MemCpyOptimizer uses when it sees a > memset followed by variable stores to the same location? We already fold in > constant stores (replace the m() call by one)… but again this isn’t clearly > a win: if you enable SSE you’ll get wide stores, followed by stores of > smaller size. With SSE I’m not sure I’d want us to remove the MOVAPS > instructions. > Yes, this depends very much on the backend flags, therefore it might > be too early to lower that memset before DSE. > I was thinking about some kind of a peephole optimization, that > deletes stores to the same memory location (at least for %rsp-based > addresses) in the backend. > I suspect AArch64 has something like this already, at least we don't > see these redundant stores there. > > > > > > > > >> Thanks, > > >> > > >> JF > > >> > > >>> On Aug 1, 2019, at 7:21 AM, Alexander Potapenko <glider at google.com> > wrote: > > >>> > > >>> Hi folks, > > >>> > > >>> When compiling the attached example with > -ftrivial-auto-var-init=zero: > > >>> > > >>> $ clang -no-integrated-as -mno-sse -m64 -mstack-alignment=8 -O2 > > >>> -ftrivial-auto-var-init=zero > > >>> > -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang > > >>> -g -o ipt.ll -c ipt.i -w -S -emit-llvm > > >>> > > >>> , Clang generates an initialization memset() call for |acpar| in the > IR: > > >>> > > >>> %0 = bitcast %struct.xt_action_param* %acpar to i8*, !dbg !27 > > >>> call void @llvm.memset.p0i8.i64(i8* nonnull align 8 %0, i8 0, i64 > > >>> 40, i1 false), !dbg !28 > > >>> > > >>> Clang only splits memsets into series of stores under certain > > >>> conditions, so it's hard to remove redundant stores on the IR level > > >>> (even with the recent DSE patches: https://reviews.llvm.org/D61879). > > >>> In the resulting assembly, however, the memset is lowered into a > > >>> series of MOVQ instructions (see the attached ipt.s file), of which > at > > >>> least the stores to 24(%rsp) and 16(%rsp) are redundant. > > >>> > > >>> Given that at MCInst level we already know if a memset is split into > a > > >>> sequence of stores, can we do a better job by making the backend > > >>> delete these redundant stores for us? > > >>> > > >>> Alex > > >>> > > >>> -- > > >>> Alexander Potapenko > > >>> Software Engineer > > >>> > > >>> Google Germany GmbH > > >>> Erika-Mann-Straße, 33 > > >>> 80636 München > > >>> > > >>> Geschäftsführer: Paul Manicle, Halimah DeLaine Prado > > >>> Registergericht und -nummer: Hamburg, HRB 86891 > > >>> Sitz der Gesellschaft: Hamburg > > >>> <ipt.i><ipt.s> > > >> > > > > > > > > > -- > > > Alexander Potapenko > > > Software Engineer > > > > > > Google Germany GmbH > > > Erika-Mann-Straße, 33 > > > 80636 München > > > > > > Geschäftsführer: Paul Manicle, Halimah DeLaine Prado > > > Registergericht und -nummer: Hamburg, HRB 86891 > > > Sitz der Gesellschaft: Hamburg > > > > > -- > Alexander Potapenko > Software Engineer > > Google Germany GmbH > Erika-Mann-Straße, 33 > 80636 München > > Geschäftsführer: Paul Manicle, Halimah DeLaine Prado > Registergericht und -nummer: Hamburg, HRB 86891 > Sitz der Gesellschaft: Hamburg >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190801/fb1332f2/attachment.html>
Vitaly Buka via llvm-dev
2019-Aug-07 01:42 UTC
[llvm-dev] Dead store elimination in the backend for -ftrivial-auto-var-init
There are two problems: 1. padding after union and call to q(), without LTO we can't remove that store. 2. shortcut which I have which ignores all instructions q() . this assume that memset to acpar.match, acpar.matchinfo also useful which is not true. I should be able to improve this case. On Thu, Aug 1, 2019 at 11:29 PM Vitaly Buka <vitalybuka at google.com> wrote:> On a first look case like this should nor be a problem. The tail of the > memset here is unused because it's replaced immediately without reads . So > it can be trimmed. I will try the patch and let you know. > > On Thu, Aug 1, 2019 at 9:51 AM Alexander Potapenko <glider at google.com> > wrote: > >> On Thu, Aug 1, 2019 at 6:38 PM JF Bastien <jfbastien at apple.com> wrote: >> > >> > >> > >> > > On Aug 1, 2019, at 9:20 AM, Alexander Potapenko <glider at google.com> >> wrote: >> > > >> > > On Thu, Aug 1, 2019 at 6:09 PM JF Bastien <jfbastien at apple.com> >> wrote: >> > >> >> > >> Hi Alexander, >> > >> >> > >> The code doesn’t compile. Could you send a godbolt.org link that >> shows the issue? >> > > Sorry about that, here's the link: https://godbolt.org/z/-PinQP >> > > >> > > Lines 4 to 8 are initializing |acpar|. >> > > If I'm understanding correctly, the store to 8(%rsp) at line 7 can be >> > > removed because of the store at line 35. >> > > >> > These are in different basic blocks. IIRC there are a bunch of cases >> where DCE just fails when there’s any control flow. I think this is what >> you’re seeing here. >> I think Vitaly's patches (https://reviews.llvm.org/D61879) should >> allow DSE to work across basic blocks. But, as you mentioned, the >> problem is that we only have a memset at IR level. >> > >> > > Same for store to 24(%rsp), which is later overwritten at lines 11 >> and 16. >> > >> > I’m not sure what we should do here… If it were a constant store we’d >> fold it in, but here both stores are variables so it’s not clear that we >> really should be doing much in MemCpyOptimizer. >> > The problem I see here is: there’s a small memset, followed by variable >> stores to the same location. DCE’ing requires realizing that the memset is >> small enough that the backend will expand it, at which point DCE will kick >> in. MemCpyOptimizer doesn’t know what “small” is, partly because it’s based >> on alignment and optimization level. Maybe it can be a target hook (“what >> would you do with this memset?”), that MemCpyOptimizer uses when it sees a >> memset followed by variable stores to the same location? We already fold in >> constant stores (replace the m() call by one)… but again this isn’t clearly >> a win: if you enable SSE you’ll get wide stores, followed by stores of >> smaller size. With SSE I’m not sure I’d want us to remove the MOVAPS >> instructions. >> Yes, this depends very much on the backend flags, therefore it might >> be too early to lower that memset before DSE. >> I was thinking about some kind of a peephole optimization, that >> deletes stores to the same memory location (at least for %rsp-based >> addresses) in the backend. >> I suspect AArch64 has something like this already, at least we don't >> see these redundant stores there. >> >> >> > >> > >> > >> Thanks, >> > >> >> > >> JF >> > >> >> > >>> On Aug 1, 2019, at 7:21 AM, Alexander Potapenko <glider at google.com> >> wrote: >> > >>> >> > >>> Hi folks, >> > >>> >> > >>> When compiling the attached example with >> -ftrivial-auto-var-init=zero: >> > >>> >> > >>> $ clang -no-integrated-as -mno-sse -m64 -mstack-alignment=8 >> -O2 >> > >>> -ftrivial-auto-var-init=zero >> > >>> >> -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang >> > >>> -g -o ipt.ll -c ipt.i -w -S -emit-llvm >> > >>> >> > >>> , Clang generates an initialization memset() call for |acpar| in >> the IR: >> > >>> >> > >>> %0 = bitcast %struct.xt_action_param* %acpar to i8*, !dbg !27 >> > >>> call void @llvm.memset.p0i8.i64(i8* nonnull align 8 %0, i8 0, i64 >> > >>> 40, i1 false), !dbg !28 >> > >>> >> > >>> Clang only splits memsets into series of stores under certain >> > >>> conditions, so it's hard to remove redundant stores on the IR level >> > >>> (even with the recent DSE patches: https://reviews.llvm.org/D61879 >> ). >> > >>> In the resulting assembly, however, the memset is lowered into a >> > >>> series of MOVQ instructions (see the attached ipt.s file), of which >> at >> > >>> least the stores to 24(%rsp) and 16(%rsp) are redundant. >> > >>> >> > >>> Given that at MCInst level we already know if a memset is split >> into a >> > >>> sequence of stores, can we do a better job by making the backend >> > >>> delete these redundant stores for us? >> > >>> >> > >>> Alex >> > >>> >> > >>> -- >> > >>> Alexander Potapenko >> > >>> Software Engineer >> > >>> >> > >>> Google Germany GmbH >> > >>> Erika-Mann-Straße, 33 >> > >>> 80636 München >> > >>> >> > >>> Geschäftsführer: Paul Manicle, Halimah DeLaine Prado >> > >>> Registergericht und -nummer: Hamburg, HRB 86891 >> > >>> Sitz der Gesellschaft: Hamburg >> > >>> <ipt.i><ipt.s> >> > >> >> > > >> > > >> > > -- >> > > Alexander Potapenko >> > > Software Engineer >> > > >> > > Google Germany GmbH >> > > Erika-Mann-Straße, 33 >> > > 80636 München >> > > >> > > Geschäftsführer: Paul Manicle, Halimah DeLaine Prado >> > > Registergericht und -nummer: Hamburg, HRB 86891 >> > > Sitz der Gesellschaft: Hamburg >> > >> >> >> -- >> Alexander Potapenko >> Software Engineer >> >> Google Germany GmbH >> Erika-Mann-Straße, 33 >> 80636 München >> >> Geschäftsführer: Paul Manicle, Halimah DeLaine Prado >> Registergericht und -nummer: Hamburg, HRB 86891 >> Sitz der Gesellschaft: Hamburg >> >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190806/50b20285/attachment.html>