thr3ads.net - llvm dev - [llvm-dev] Load combine pass [Sep 2016]

If this information is useful, please help other people find it:
Share via:

Artur Pilipenko via llvm-dev

2016-Sep-29 18:08 UTC

[llvm-dev] Load combine pass

> On 29 Sep 2016, at 21:01, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
> 
> Hi Artur,
> 
> Artur Pilipenko wrote:
> 
> > BTW, do we really need to emit an atomic load if all the individual
> > components are bytes?
> 
> Depends -- do you mean at the at the hardware level or at the IR
> level?
> 
> If you mean at the IR level, then I think yes; since otherwise it is
> legal to do transforms that break byte-wise atomicity in the IR, e.g.:
> 
>  i32* ptr = ...
>  i32  val = *ptr
> 
> =>  // Since no threads can be legally racing on *ptr
> 
>  i32* ptr = ...
>  i32 val0 = *ptr
>  i32 val1 = *ptr
>  i32 val = (val0 & 1) | (val1 & ~1);
> 
> 
> If you're talking about the hardware level, then I'm not sure; and
my
> guess is that the answer is almost certainly arch-dependent.I meant the case when we have a load by bytes pattern like this:
i8* p = ...
i8 b0 = *p++;
i8 b1 = *p++;
i8 b2 = *p++;
i8 b3 = *p++;
i32 result = b0 << 24 | b1 << 16 | b2 << 8 | b << 0;

When we fold it to a i32 load, should this load be atomic?

Artur > 
> -- Sanjoy
> 
> 
> 
> 
> 
> 
> 
> >
> > Artur
> >> Given this, perhaps scheduling load widening after one pass of
GVN/PRE
> >> is fine?
> >>
> >> -- Sanjoy
> >>
> >>> I didn’t think of atomicity aspect though.
> >>>
> >>> Artur
> >>>
> >>>> On 28 Sep 2016, at 18:50, Philip Reames<listmail at
philipreames.com>  wrote:
> >>>>
> >>>> There's a bit of additional context worth adding
here...
> >>>>
> >>>> Up until very recently, we had a form of widening
implemented in GVN.  We decided to remove this in
https://reviews.llvm.org/D24096 precisely because its placement in the pass
pipeline was inhibiting other optimizations. There's also a major problem
with doing widening at the IR level which is that widening a pair of atomic
loads into a single wider atomic load can not be undone. This creates a major
pass ordering problem of its own.
> >>>>
> >>>> At this point, my general view is that widening
transformations of any kind should be done very late.  Ideally, this is
something the backend would do, but doing it as a CGP like fixup pass over the
IR is also reasonable.
> >>>>
> >>>> With that in mind, I feel both the current placement of
LoadCombine (within the inliner iteration) and the proposed InstCombine rule are
undesirable.
> >>>>
> >>>> Philip
> >>>>
> >>>>
> >>>> On 09/28/2016 08:22 AM, Artur Pilipenko wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I'm trying to optimize a pattern like this into a
single i16 load:
> >>>>>    %1 = bitcast i16* %pData to i8*
> >>>>>    %2 = load i8, i8* %1, align 1
> >>>>>    %3 = zext i8 %2 to i16
> >>>>>    %4 = shl nuw i16 %3, 8
> >>>>>    %5 = getelementptr inbounds i8, i8* %1, i16 1
> >>>>>    %6 = load i8, i8* %5, align 1
> >>>>>    %7 = zext i8 %6 to i16
> >>>>>    %8 = shl nuw nsw i16 %7, 0
> >>>>>    %9 = or i16 %8, %4
> >>>>>
> >>>>> I came across load combine pass which is motivated by
virtualliy the same pattern. Load combine optimizes the pattern by combining
adjacent loads into one load and lets the rest of the optimizer cleanup the
rest. From what I see on the initial review for load combine
(https://reviews.llvm.org/D3580) it was not enabled by default because it caused
some performance regressions. It's not very surprising, I see how this type
of widening can obscure some facts for the rest of the optimizer.
> >>>>>
> >>>>> I can't find any backstory for this pass, why was
it chosen to optimize the pattern in question in this way? What is the current
status of this pass?
> >>>>>
> >>>>> I have an alternative implementation for it locally. I
implemented an instcombine rule similar to recognise bswap/bitreverse idiom. It
relies on collectBitParts (Local.cpp) to determine the origin of the bits in a
given or value. If all the bits are happen to be loaded from adjacent locations
it replaces the or with a single load or a load plus bswap.
> >>>>>
> >>>>> If the alternative approach sounds reasonable I'll
post my patches for review.
> >>>>>
> >>>>> Artur
> >>> _______________________________________________
> >>> LLVM Developers mailing list
> >>> llvm-dev at lists.llvm.org
> >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >

Sanjoy Das via llvm-dev

2016-Sep-29 18:16 UTC

head link

[llvm-dev] Load combine pass

Hi Artur,

Artur Pilipenko wrote:
 >> On 29 Sep 2016, at 21:01, Sanjoy Das<sanjoy at
playingwithpointers.com>
  wrote:
 >>
 >> Hi Artur,
 >>
 >> Artur Pilipenko wrote:
 >>
 >>> BTW, do we really need to emit an atomic load if all the
individual
 >>> components are bytes?
 >> Depends -- do you mean at the at the hardware level or at the IR
 >> level?
 >>
 >> If you mean at the IR level, then I think yes; since otherwise it is
 >> legal to do transforms that break byte-wise atomicity in the IR, e.g.:
 >>
 >>   i32* ptr = ...
 >>   i32  val = *ptr
 >>
 >> =>   // Since no threads can be legally racing on *ptr
 >>
 >>   i32* ptr = ...
 >>   i32 val0 = *ptr
 >>   i32 val1 = *ptr
 >>   i32 val = (val0&  1) | (val1&  ~1);
 >>
 >>
 >> If you're talking about the hardware level, then I'm not sure;
and my
 >> guess is that the answer is almost certainly arch-dependent.
 > I meant the case when we have a load by bytes pattern like this:
 > i8* p = ...
 > i8 b0 = *p++;
 > i8 b1 = *p++;
 > i8 b2 = *p++;
 > i8 b3 = *p++;
 > i32 result = b0<<  24 | b1<<  16 | b2<<  8 | b<< 
0;
 >
 > When we fold it to a i32 load, should this load be atomic?

If we do fold it to a non-atomic i32 load, then it would be legal for
LLVM to do the IR transform I mentioned above.  That breaks the
byte-wise atomicity you had in the original program.

That is, in:

   i8* p = ...
   i8 b0 = *p++;
   i8 b1 = *p++;
   i8 b2 = *p++;
   i8 b3 = *p++;
   // Note: I changed this to be little endian, and I've assumed
   // that we're compiling for a little endian system
   i32 result = b3<< 24 | b2<<  16 | b1<<  8 | b0<<  0;

say all of p[0..3] are 0, and you have a thread racing to set b0 to
-1.  Then result can either be 0 or 255.

However, say you first transform this to a non-atomic i32 load:

   i8* p = ...
   i32* p.i32 = (i32*)p
   i32 result = *p.i32

and we do the transform above

   i8* p = ...
   i32* p.i32 = (i32*)p
   i32 result0 = *p.i32
   i32 result1 = *p.i32
   i32 result = (result0 & 1) | (result1 & ~1);

then it is possible for result to be 254 (by result0 observing 0 and
result observing 255).

-- Sanjoy

Sanjoy Das via llvm-dev

2016-Sep-29 18:17 UTC

head link

[llvm-dev] Load combine pass

Correcting obvious typo:

Sanjoy Das wrote:
[snip]> i32 result0 = *p.i32
> i32 result1 = *p.i32
> i32 result = (result0 & 1) | (result1 & ~1);
>
> then it is possible for result to be 254 (by result0 observing 0 and
> result observing 255).
Should be: "by result0 observing 0 and result1 observing 255"


-- Sanjoy

Artur Pilipenko via llvm-dev

2016-Sep-29 18:25 UTC

head link

[llvm-dev] Load combine pass

> On 29 Sep 2016, at 21:16, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
> 
> Hi Artur,
> 
> Artur Pilipenko wrote:
> >> On 29 Sep 2016, at 21:01, Sanjoy Das<sanjoy at
playingwithpointers.com>  wrote:
> >>
> >> Hi Artur,
> >>
> >> Artur Pilipenko wrote:
> >>
> >>> BTW, do we really need to emit an atomic load if all the
individual
> >>> components are bytes?
> >> Depends -- do you mean at the at the hardware level or at the IR
> >> level?
> >>
> >> If you mean at the IR level, then I think yes; since otherwise it
is
> >> legal to do transforms that break byte-wise atomicity in the IR,
e.g.:
> >>
> >>   i32* ptr = ...
> >>   i32  val = *ptr
> >>
> >> =>   // Since no threads can be legally racing on *ptr
> >>
> >>   i32* ptr = ...
> >>   i32 val0 = *ptr
> >>   i32 val1 = *ptr
> >>   i32 val = (val0&  1) | (val1&  ~1);
> >>
> >>
> >> If you're talking about the hardware level, then I'm not
sure; and my
> >> guess is that the answer is almost certainly arch-dependent.
> > I meant the case when we have a load by bytes pattern like this:
> > i8* p = ...
> > i8 b0 = *p++;
> > i8 b1 = *p++;
> > i8 b2 = *p++;
> > i8 b3 = *p++;
> > i32 result = b0<<  24 | b1<<  16 | b2<<  8 |
b<<  0;
> >
> > When we fold it to a i32 load, should this load be atomic?
> 
> If we do fold it to a non-atomic i32 load, then it would be legal for
> LLVM to do the IR transform I mentioned above.  That breaks the
> byte-wise atomicity you had in the original program.
> 
> That is, in:
> 
>  i8* p = ...
>  i8 b0 = *p++;
>  i8 b1 = *p++;
>  i8 b2 = *p++;
>  i8 b3 = *p++;
>  // Note: I changed this to be little endian, and I've assumed
>  // that we're compiling for a little endian system
>  i32 result = b3<< 24 | b2<<  16 | b1<<  8 | b0<< 
0;
> 
> say all of p[0..3] are 0, and you have a thread racing to set b0 to
> -1.  Then result can either be 0 or 255.
> 
> However, say you first transform this to a non-atomic i32 load:
> 
>  i8* p = ...
>  i32* p.i32 = (i32*)p
>  i32 result = *p.i32
> 
> and we do the transform above
> 
>  i8* p = ...
>  i32* p.i32 = (i32*)p
>  i32 result0 = *p.i32
>  i32 result1 = *p.i32
>  i32 result = (result0 & 1) | (result1 & ~1);
> 
> then it is possible for result to be 254 (by result0 observing 0 and
> result observing 255).I see. For some reason I was assuming byte-wise atomicity for non-atomic loads.

So, if any of the components are atomic, the resulting load must be atomic as
well.

Artur > 
> -- Sanjoy

llvm dev - Sep 2016 - Load combine pass

[llvm-dev] Load combine pass

[llvm-dev] Load combine pass

[llvm-dev] Load combine pass

[llvm-dev] Load combine pass