Andres Freund via llvm-dev
2018-Sep-10 20:42 UTC
[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores
Hi,

I have, in postgres, a piece of IR that, after inlining and constant propagation, boils (when cooked on really high heat) down to the following (also attached for your convenience):

source_filename = "pg"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

define void @evalexpr_0_0(i8* align 8 noalias, i32* align 8 noalias) {
entry:
  %a01 = getelementptr i8, i8* %0, i16 0
  store i8 0, i8* %a01

  ; in the real case this also loads data
  %b01 = getelementptr i32, i32* %1, i16 0
  store i32 0, i32* %b01

  %a02 = getelementptr i8, i8* %0, i16 1
  store i8 0, i8* %a02

  ; in the real case this also loads data
  %b02 = getelementptr i32, i32* %1, i16 1
  store i32 0, i32* %b02

  ; in the real case this also loads data
  %a03 = getelementptr i8, i8* %0, i16 2
  store i8 0, i8* %a03

  ; in the real case this also loads data
  %b03 = getelementptr i32, i32* %1, i16 2
  store i32 0, i32* %b03

  %a04 = getelementptr i8, i8* %0, i16 3
  store i8 0, i8* %a04

  ; in the real case this also loads data
  %b04 = getelementptr i32, i32* %1, i16 3
  store i32 0, i32* %b04

  ret void
}

I expected LLVM to be able to coalesce the four i8 stores into a single i32 store (or into a number of i64 stores in my actual case). But it turns out it doesn't really do so.

In postgres' current use the optimization pipeline doesn't contain SLP (mostly because enabling it via PassManagerBuilder isn't exposed to C). In that case optimization doesn't yield anything interesting with:

opt -mcpu=native -disable-slp-vectorization -O3 -S /tmp/combine.ll

With SLP enabled, this turns out a bit better:

define void @evalexpr_0_0(i8* noalias nocapture align 8, i32* noalias nocapture align 8) local_unnamed_addr #0 {
entry:
  store i8 0, i8* %0, align 8
  %a02 = getelementptr i8, i8* %0, i64 1
  store i8 0, i8* %a02, align 1
  %a03 = getelementptr i8, i8* %0, i64 2
  store i8 0, i8* %a03, align 2
  %a04 = getelementptr i8, i8* %0, i64 3
  store i8 0, i8* %a04, align 1
  %2 = bitcast i32* %1 to <4 x i32>*
  store <4 x i32> zeroinitializer, <4 x i32>* %2, align 8
  ret void
}

but note that the i8 stores *still* haven't been coalesced, although without the interspersed stores, llc/lowering is able to do so.
If I run another round of opt on it, then "MemCpy Optimization" manages to also optimize this on the IR level:

*** IR Dump After Global Value Numbering ***
; Function Attrs: norecurse nounwind
define void @evalexpr_0_0(i8* noalias nocapture align 8, i32* noalias nocapture align 8) local_unnamed_addr #0 {
entry:
  store i8 0, i8* %0, align 8
  %a02 = getelementptr i8, i8* %0, i64 1
  store i8 0, i8* %a02, align 1
  %a03 = getelementptr i8, i8* %0, i64 2
  store i8 0, i8* %a03, align 2
  %a04 = getelementptr i8, i8* %0, i64 3
  store i8 0, i8* %a04, align 1
  %2 = bitcast i32* %1 to <4 x i32>*
  store <4 x i32> zeroinitializer, <4 x i32>* %2, align 8
  ret void
}

*** IR Dump After MemCpy Optimization ***
; Function Attrs: norecurse nounwind
define void @evalexpr_0_0(i8* noalias nocapture align 8, i32* noalias nocapture align 8) local_unnamed_addr #0 {
entry:
  %a02 = getelementptr i8, i8* %0, i64 1
  %a03 = getelementptr i8, i8* %0, i64 2
  %a04 = getelementptr i8, i8* %0, i64 3
  %2 = bitcast i32* %1 to <4 x i32>*
  call void @llvm.memset.p0i8.i64(i8* align 8 %0, i8 0, i64 4, i1 false)
  store <4 x i32> zeroinitializer, <4 x i32>* %2, align 8
  ret void
}

which later gets turned into a normal i32 store:

*** IR Dump After Combine redundant instructions ***
; Function Attrs: norecurse nounwind
define void @evalexpr_0_0(i8* noalias nocapture align 8, i32* noalias nocapture align 8) local_unnamed_addr #0 {
entry:
  %2 = bitcast i32* %1 to <4 x i32>*
  %3 = bitcast i8* %0 to i32*
  store i32 0, i32* %3, align 8
  store <4 x i32> zeroinitializer, <4 x i32>* %2, align 8
  ret void
}

So, here we finally come to my question: Is it really expected that, unless largely independent optimizations (SLP in this case) happen to move instructions *within the same basic block* out of the way, these stores don't get coalesced? And that they then get coalesced only if either the optimization pipeline is run again, or instruction selection can do so?

On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725 which might address this indirectly. But I'm somewhat doubtful that that's the most straightforward way to optimize this kind of code?

Greetings,

Andres Freund
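PS: For anyone who wants to reproduce the second round, this is roughly the recipe I used (file names are just what I had locally; -print-after-all writes the per-pass dumps shown above to stderr):

opt -mcpu=native -O3 -S /tmp/combine.ll -o once.ll
opt -mcpu=native -O3 -S -print-after-all once.ll -o twice.ll 2>dumps.txt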
Andres Freund via llvm-dev
2018-Sep-11 00:33 UTC
[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores
Hi,

On 2018-09-10 13:42:21 -0700, Andres Freund wrote:
> I have, in postgres, a piece of IR that, after inlining and constant
> propagation, boils (when cooked on really high heat) down to the
> following (also attached for your convenience):
>
> source_filename = "pg"
> target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> target triple = "x86_64-pc-linux-gnu"
>
> define void @evalexpr_0_0(i8* align 8 noalias, i32* align 8 noalias) {
> entry:
>   %a01 = getelementptr i8, i8* %0, i16 0
>   store i8 0, i8* %a01
>
>   ; in the real case this also loads data
>   %b01 = getelementptr i32, i32* %1, i16 0
>   store i32 0, i32* %b01
>
>   %a02 = getelementptr i8, i8* %0, i16 1
>   store i8 0, i8* %a02
>
>   ; in the real case this also loads data
>   %b02 = getelementptr i32, i32* %1, i16 1
>   store i32 0, i32* %b02
>
>   ; in the real case this also loads data
>   %a03 = getelementptr i8, i8* %0, i16 2
>   store i8 0, i8* %a03
>
>   ; in the real case this also loads data
>   %b03 = getelementptr i32, i32* %1, i16 2
>   store i32 0, i32* %b03
>
>   %a04 = getelementptr i8, i8* %0, i16 3
>   store i8 0, i8* %a04
>
>   ; in the real case this also loads data
>   %b04 = getelementptr i32, i32* %1, i16 3
>   store i32 0, i32* %b04
>
>   ret void
> }

> So, here we finally come to my question: Is it really expected that,
> unless largely independent optimizations (SLP in this case) happen to
> move instructions *within the same basic block* out of the way, these
> stores don't get coalesced? And that they then get coalesced only if
> either the optimization pipeline is run again, or instruction
> selection can do so?
>
> On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725 which
> might address this indirectly. But I'm somewhat doubtful that that's
> the most straightforward way to optimize this kind of code?

That doesn't help, but it turns out that https://reviews.llvm.org/D30703 can somewhat help by adding a redundant

%i32ptr = bitcast i8* %0 to i32*
store i32 0, i32* %i32ptr

at the start. Then dse-partial-store-merging does its magic and optimizes the sub-stores away. But it's fairly ugly to have to manually add superfluous stores at the right granularity (a larger llvm.memset doesn't work).

gcc, since 7, detects such cases in its "new" -fstore-merging pass.

- Andres
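PS: Spelled out, the workaround looks roughly like this — the function from upthread with the covering store added by hand up front; with D30703 applied, dse-partial-store-merging should then delete the narrow i8 stores as fully overwritten:

define void @evalexpr_0_0(i8* align 8 noalias, i32* align 8 noalias) {
entry:
  ; redundant covering store, added only so partial store merging can kick in
  %i32ptr = bitcast i8* %0 to i32*
  store i32 0, i32* %i32ptr

  %a01 = getelementptr i8, i8* %0, i16 0
  store i8 0, i8* %a01
  %b01 = getelementptr i32, i32* %1, i16 0
  store i32 0, i32* %b01

  %a02 = getelementptr i8, i8* %0, i16 1
  store i8 0, i8* %a02
  %b02 = getelementptr i32, i32* %1, i16 1
  store i32 0, i32* %b02

  %a03 = getelementptr i8, i8* %0, i16 2
  store i8 0, i8* %a03
  %b03 = getelementptr i32, i32* %1, i16 2
  store i32 0, i32* %b03

  %a04 = getelementptr i8, i8* %0, i16 3
  store i8 0, i8* %a04
  %b04 = getelementptr i32, i32* %1, i16 3
  store i32 0, i32* %b04

  ret void
}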
Nirav Davé via llvm-dev
2018-Sep-11 15:16 UTC
[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores
Andres:

FWIW, codegen will do the merge if you turn on global alias analysis for it: "-combiner-global-alias-analysis". That said, we should be able to do this merging earlier.

-Nirav

On Mon, Sep 10, 2018 at 8:33 PM, Andres Freund via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Hi,
>
> On 2018-09-10 13:42:21 -0700, Andres Freund wrote:
> > I have, in postgres, a piece of IR that, after inlining and constant
> > propagation, boils (when cooked on really high heat) down to the
> > following (also attached for your convenience):
> > [...]
> > So, here we finally come to my question: Is it really expected that,
> > unless largely independent optimizations (SLP in this case) happen to
> > move instructions *within the same basic block* out of the way, these
> > stores don't get coalesced? And that they then get coalesced only if
> > either the optimization pipeline is run again, or instruction
> > selection can do so?
> >
> > On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725 which
> > might address this indirectly. But I'm somewhat doubtful that that's
> > the most straightforward way to optimize this kind of code?
>
> That doesn't help, but it turns out that https://reviews.llvm.org/D30703 can
> somewhat help by adding a redundant
>
> %i32ptr = bitcast i8* %0 to i32*
> store i32 0, i32* %i32ptr
>
> at the start. Then dse-partial-store-merging does its magic and
> optimizes the sub-stores away. But it's fairly ugly to have to manually
> add superfluous stores at the right granularity (a larger llvm.memset
> doesn't work).
>
> gcc, since 7, detects such cases in its "new" -fstore-merging pass.
>
> - Andres
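For reference, -combiner-global-alias-analysis is a codegen (SelectionDAG) flag rather than an opt pass option, so trying Nirav's suggestion on the module from upthread would look something like the following (assuming the IR is saved as /tmp/combine.ll, as in the earlier opt invocations; the resulting assembly depends on the subtarget):

llc -O2 -combiner-global-alias-analysis /tmp/combine.ll -o -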