> On 29 Sep 2016, at 03:23, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>
> Hi Artur,
>
> Artur Pilipenko via llvm-dev wrote:
> > One of the arguments for doing this earlier is the inline cost
> > perception of the original pattern. Reading an i32/i64 by bytes looks
> > much more expensive than it is and can prevent inlining of an
> > interesting function.
>
> I don't think this is just a perception issue -- if the loads have not
> been widened then inlining the containing function _is_ expensive, and
> the inliner cost analysis is doing the right thing.
>
> > The concern about inhibiting other optimizations can be addressed by
> > careful selection of the pattern we’d like to match. I limit the
> > transformation to the case when all the individual loads have no uses
> > other than forming the wider load. In this case it’s less likely to
> > lose information during this transformation.
>
> I agree -- I think widening loads in "obvious" cases like:
>
>   i16 *a = ...
>   i32 val = a[i] | (a[i + 1] << 16)
>
> is more defensible than trying to widen the example that broke in
> https://llvm.org/bugs/show_bug.cgi?id=29110.
>
> Regarding atomicity, the only real optimization that we'll lose (that
> I can think of) is PRE. Additionally, it may be more expensive to
> lower wider atomic loads / stores, but that can be indicated by a
> target hook. For instance, on x86, I don't think:
>
>   load atomic i16, i16* %ptr unordered, align 2
>
> is cheaper than
>
>   load atomic i32, i32* %ptr.bitcast unordered, align 4
>
> so from a lowering perspective there is no reason to prefer the former.

BTW, do we really need to emit an atomic load if all the individual
components are bytes?

Artur

> Given this, perhaps scheduling load widening after one pass of GVN/PRE
> is fine?
>
> -- Sanjoy
>
> > I didn’t think of the atomicity aspect though.
> >
> > Artur
> >
> >> On 28 Sep 2016, at 18:50, Philip Reames <listmail at philipreames.com> wrote:
> >>
> >> There's a bit of additional context worth adding here...
> >>
> >> Up until very recently, we had a form of widening implemented in GVN.
> >> We decided to remove this in https://reviews.llvm.org/D24096 precisely
> >> because its placement in the pass pipeline was inhibiting other
> >> optimizations. There's also a major problem with doing widening at the
> >> IR level, which is that widening a pair of atomic loads into a single
> >> wider atomic load cannot be undone. This creates a major pass ordering
> >> problem of its own.
> >>
> >> At this point, my general view is that widening transformations of any
> >> kind should be done very late. Ideally, this is something the backend
> >> would do, but doing it as a CGP-like fixup pass over the IR is also
> >> reasonable.
> >>
> >> With that in mind, I feel both the current placement of LoadCombine
> >> (within the inliner iteration) and the proposed InstCombine rule are
> >> undesirable.
> >>
> >> Philip
> >>
> >> On 09/28/2016 08:22 AM, Artur Pilipenko wrote:
> >>> Hi,
> >>>
> >>> I'm trying to optimize a pattern like this into a single i16 load:
> >>>   %1 = bitcast i16* %pData to i8*
> >>>   %2 = load i8, i8* %1, align 1
> >>>   %3 = zext i8 %2 to i16
> >>>   %4 = shl nuw i16 %3, 8
> >>>   %5 = getelementptr inbounds i8, i8* %1, i16 1
> >>>   %6 = load i8, i8* %5, align 1
> >>>   %7 = zext i8 %6 to i16
> >>>   %8 = shl nuw nsw i16 %7, 0
> >>>   %9 = or i16 %8, %4
> >>>
> >>> I came across the load combine pass, which is motivated by virtually
> >>> the same pattern. Load combine optimizes the pattern by combining
> >>> adjacent loads into one load and lets the rest of the optimizer clean
> >>> up the rest. From what I see in the initial review for load combine
> >>> (https://reviews.llvm.org/D3580) it was not enabled by default because
> >>> it caused some performance regressions. That's not very surprising; I
> >>> can see how this type of widening can obscure some facts for the rest
> >>> of the optimizer.
> >>>
> >>> I can't find any backstory for this pass. Why was it chosen to optimize
> >>> the pattern in question in this way? What is the current status of the
> >>> pass?
> >>>
> >>> I have an alternative implementation for it locally. I implemented an
> >>> InstCombine rule similar to the bswap/bitreverse idiom recognition. It
> >>> relies on collectBitParts (Local.cpp) to determine the origin of the
> >>> bits in a given or value. If all the bits happen to be loaded from
> >>> adjacent locations, it replaces the or with a single load, or a load
> >>> plus bswap.
> >>>
> >>> If the alternative approach sounds reasonable I'll post my patches for
> >>> review.
> >>>
> >>> Artur
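For concreteness, here is a minimal sketch of what the widened form of the
two-byte pattern above could look like on a little-endian target. The function
name and the exact IR are illustrative, not taken from the thread; the point is
that because the byte at the lower address lands in the high bits (the shl by
8), the single wide load has to be followed by a byte swap:

  define i16 @load_be16(i16* %pData) {
    ; one i16 load replaces the two i8 loads; the original alignment of 1 is kept
    %wide = load i16, i16* %pData, align 1
    ; restore the b0 << 8 | b1 ordering of the original pattern
    %result = call i16 @llvm.bswap.i16(i16 %wide)
    ret i16 %result
  }

  declare i16 @llvm.bswap.i16(i16)

On a big-endian target the same bit tracking (collectBitParts) would presumably
conclude that no bswap is needed.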
Hi Artur,

Artur Pilipenko wrote:
> BTW, do we really need to emit an atomic load if all the individual
> components are bytes?

Depends -- do you mean at the hardware level or at the IR level?

If you mean at the IR level, then I think yes, since otherwise it is
legal to do transforms that break byte-wise atomicity in the IR, e.g.:

  i32* ptr = ...
  i32 val = *ptr

  =>  // Since no threads can be legally racing on *ptr

  i32* ptr = ...
  i32 val0 = *ptr
  i32 val1 = *ptr
  i32 val = (val0 & 1) | (val1 & ~1);

If you're talking about the hardware level, then I'm not sure; my
guess is that the answer is almost certainly arch-dependent.

-- Sanjoy
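A sketch contrasting the two widened forms under discussion, illustrative
rather than anything proposed in the thread. The plain load is the one that
later passes may legally split apart (the (val0 & 1) | (val1 & ~1) rewrite
above); the unordered atomic load may not be torn that way, which is what
preserves atomicity at the IR level. Atomic loads need an explicit alignment,
and a naturally aligned pointer is assumed here:

  define i32 @widened_plain(i32* %ptr) {
    ; non-atomic: a racing access yields an undefined value, so splitting
    ; this load into several loads is legal
    %v = load i32, i32* %ptr, align 4
    ret i32 %v
  }

  define i32 @widened_unordered(i32* %ptr) {
    ; unordered atomic: the value must come from a single store, so the
    ; load cannot be torn by later transforms
    %v = load atomic i32, i32* %ptr unordered, align 4
    ret i32 %v
  }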
> On 29 Sep 2016, at 21:01, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>
> Hi Artur,
>
> Artur Pilipenko wrote:
> > BTW, do we really need to emit an atomic load if all the individual
> > components are bytes?
>
> Depends -- do you mean at the hardware level or at the IR level?
>
> If you mean at the IR level, then I think yes, since otherwise it is
> legal to do transforms that break byte-wise atomicity in the IR, e.g.:
>
>   i32* ptr = ...
>   i32 val = *ptr
>
>   =>  // Since no threads can be legally racing on *ptr
>
>   i32* ptr = ...
>   i32 val0 = *ptr
>   i32 val1 = *ptr
>   i32 val = (val0 & 1) | (val1 & ~1);
>
> If you're talking about the hardware level, then I'm not sure; my
> guess is that the answer is almost certainly arch-dependent.

I meant the case where we have a load-by-bytes pattern like this:

  i8* p = ...
  i8 b0 = *p++;
  i8 b1 = *p++;
  i8 b2 = *p++;
  i8 b3 = *p++;
  i32 result = b0 << 24 | b1 << 16 | b2 << 8 | b3 << 0;

When we fold it to an i32 load, should this load be atomic?

Artur

> -- Sanjoy
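For completeness, a sketch of the two candidate folds of the four-byte pattern
above, again illustrative rather than anything decided in the thread (a
little-endian target is assumed, hence the bswap, and the function names are
made up). The intuition behind the question seems to be that a single-byte
access cannot tear at the hardware level; at the IR level, per Sanjoy's earlier
reply, whether the folded load needs to be atomic depends on whether the narrow
loads carried atomicity to begin with. Note also that the unordered form
assumes the pointer can be shown to be 4-byte aligned, which the byte-wise
original did not require:

  define i32 @fold_plain(i8* %p) {
    %p32 = bitcast i8* %p to i32*
    ; plain widened load; only byte alignment is needed
    %wide = load i32, i32* %p32, align 1
    %result = call i32 @llvm.bswap.i32(i32 %wide)
    ret i32 %result
  }

  define i32 @fold_unordered(i8* %p) {
    %p32 = bitcast i8* %p to i32*
    ; unordered atomic widened load; natural (4-byte) alignment assumed
    %wide = load atomic i32, i32* %p32 unordered, align 4
    %result = call i32 @llvm.bswap.i32(i32 %wide)
    ret i32 %result
  }

  declare i32 @llvm.bswap.i32(i32)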