James Price via llvm-dev
2017-Jan-02 22:12 UTC
[llvm-dev] Optimisation passes introducing address space casts
Hi Mehdi,

Thanks for the reply - I’ve finally got round to trying to fix this based on your suggestion. I’ve got something that mostly works, but I just wanted to double-check something about the regression tests before I post a patch.

> The memcpy is supposed to be equivalent to a sequence of load and store. Here we are just failing to keep the property that the load is performed through addrspace(2).

Based on this comment, I am suspicious of the validity of a couple of existing instcombine regression tests in `memcpy-from-global.ll`. Specifically, there are two tests that look like this:

define void @test3_addrspacecast() {
  %A = alloca %T
  %a = bitcast %T* %A to i8*
  call void @llvm.memcpy.p0i8.p1i8.i64(i8* %a, i8 addrspace(1)* addrspacecast (%T* @G to i8 addrspace(1)*), i64 124, i32 4, i1 false)
  call void @bar(i8* %a) readonly
; CHECK-LABEL: @test3_addrspacecast(
; CHECK-NEXT: call void @bar(i8* getelementptr inbounds (%T, %T* @G, i64 0, i32 0))
  ret void
}

Here, there is a global variable in `addrspace(0)`, which is passed as a source operand to `llvm.memcpy` via an `addrspacecast` to `addrspace(1)`. The memcpy destination operand (an `alloca` in `addrspace(0)`) is then passed to a function. The test asserts that this memcpy should be removed and the global should just be passed directly to the function, but doesn’t this lose the property that the load should be performed through `addrspace(1)`, as per your comment above?

I can (and have) fixed this as part of the patch I’m working on, but since this implies that a couple of existing regression tests would be incorrect I just wanted to double-check that I’m not misinterpreting something.

CCing Matt Arsenault, who added those specific tests originally in r207054.

James

> On 9 Nov 2016, at 21:11, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> Hi,
>
>> On Nov 9, 2016, at 10:13 AM, James Price via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hi,
>>
>> I’ve recently encountered an issue where the `instcombine` pass replaces an `llvm.memcpy` between two distinct address spaces with an `addrspacecast` instruction.
>>
>> As an example, see the trivial OpenCL kernel attached. I’m compiling like this:
>>
>>   clang -cc1 -triple spir64-unknown-unknown -x cl -O0 -emit-llvm array_init.cl -o before.ll
>>
>> This yields an `llvm.memcpy` to copy the array initialiser data from the global variable (in `addrspace(2)`) to the `alloca` result (in `addrspace(0)`).
>>
>> I then apply the `instcombine` pass via:
>>
>>   opt -S -instcombine before.ll -o after.ll
>>
>> This results in the memcpy being nuked, and the `addrspace(2)` data is now accessed directly via an `addrspacecast` to `addrspace(0)`.
>>
>> It seems to me that this sort of optimisation is only valid if it is guaranteed that the two address spaces alias for the given target triple (which for SPIR, they do not).
>
> I’m not sure “alias” is the right consideration here.
>
>> This particular optimisation is coming from lines ~290-300 of InstCombineLoadStoreAlloca.cpp, although I suspect this isn’t the only case where this might happen.
>>
>> Adding a check to only perform this replacement if the two address spaces are equal fixes the issue for me, but this is probably too conservative since many targets with flat address spaces will probably benefit from this optimisation.
>>
>> It feels like passes should query the target about whether two address spaces alias before introducing an `addrspacecast`, but I’m not familiar enough with LLVM internals to know if this is information that is easy to make available (if it isn’t already).
>>
>> Is there something we can do here to avoid this sort of optimisation causing problems for targets with segmented address spaces?
>
> This is a bug, we can’t assume any memory layout I believe.
>
> The memcpy is supposed to be equivalent to a sequence of load and store. Here we are just failing to keep the property that the load is performed through addrspace(2). The fix is not “trivial” though, the transformation can only be performed if the address of the alloca does not escape, and some rewriting of the uses is needed to propagate the address space through GEPs for instance.
>
> —
> Mehdi
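A minimal IR sketch of the transformation described in the quoted report, for readers without the original attachment. The global name, element type, and sizes below are illustrative placeholders rather than the actual output of compiling array_init.cl; only the overall shape (memcpy from an addrspace(2) constant into an addrspace(0) alloca, folded into an addrspacecast) is taken from the report.

@data = addrspace(2) constant [4 x i32] [i32 1, i32 2, i32 3, i32 4]

declare void @llvm.memcpy.p0i8.p2i8.i64(i8*, i8 addrspace(2)*, i64, i32, i1)

; Before instcombine: the initialiser is copied from the addrspace(2) global
; into the addrspace(0) alloca, and all loads go through addrspace(0).
define i32 @read_element(i64 %i) {
  %buf = alloca [4 x i32], align 4
  %p = bitcast [4 x i32]* %buf to i8*
  call void @llvm.memcpy.p0i8.p2i8.i64(i8* %p, i8 addrspace(2)* bitcast ([4 x i32] addrspace(2)* @data to i8 addrspace(2)*), i64 16, i32 4, i1 false)
  %gep = getelementptr inbounds [4 x i32], [4 x i32]* %buf, i64 0, i64 %i
  %v = load i32, i32* %gep
  ret i32 %v
}

; After instcombine (the problematic form for segmented address spaces): the
; alloca and memcpy are gone and the addrspace(2) data is reached through an
; addrspacecast to addrspace(0), roughly:
;   %cast = addrspacecast [4 x i32] addrspace(2)* @data to [4 x i32]*
;   %gep  = getelementptr inbounds [4 x i32], [4 x i32]* %cast, i64 0, i64 %i
;   %v    = load i32, i32* %gep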
Mehdi Amini via llvm-dev
2017-Jan-02 22:52 UTC
[llvm-dev] Optimisation passes introducing address space casts
> On Jan 2, 2017, at 2:12 PM, James Price <J.Price at bristol.ac.uk> wrote:
>
> Hi Mehdi,
>
> Thanks for the reply - I’ve finally got round to trying to fix this based on your suggestion.

Glad to read this!

> I’ve got something that mostly works, but I just wanted to double-check something about the regression tests before I post a patch.
>
>> The memcpy is supposed to be equivalent to a sequence of load and store. Here we are just failing to keep the property that the load is performed through addrspace(2).
>
> Based on this comment, I am suspicious of the validity of a couple of existing instcombine regression tests in `memcpy-from-global.ll`. Specifically, there are two tests that look like this:
>
> define void @test3_addrspacecast() {
>   %A = alloca %T
>   %a = bitcast %T* %A to i8*
>   call void @llvm.memcpy.p0i8.p1i8.i64(i8* %a, i8 addrspace(1)* addrspacecast (%T* @G to i8 addrspace(1)*), i64 124, i32 4, i1 false)
>   call void @bar(i8* %a) readonly
> ; CHECK-LABEL: @test3_addrspacecast(
> ; CHECK-NEXT: call void @bar(i8* getelementptr inbounds (%T, %T* @G, i64 0, i32 0))
>   ret void
> }
>
> Here, there is a global variable in `addrspace(0)`, which is passed as a source operand to `llvm.memcpy` via an `addrspacecast` to `addrspace(1)`. The memcpy destination operand (an `alloca` in `addrspace(0)`) is then passed to a function. The test asserts that this memcpy should be removed and the global should just be passed directly to the function, but doesn’t this lose the property that the load should be performed through `addrspace(1)`, as per your comment above?

Let’s desugar the memcpy into a sequence of load and store, as my previous comment implied (simplified to a single i8; otherwise we need a loop or a load of %T, but the reasoning seems the same to me):

define void @test3_addrspacecast() {
  %A = alloca %T
  %a = bitcast %T* %A to i8*
  %tmp = load i8, i8 addrspace(1)* addrspacecast (%T* @G to i8 addrspace(1)*) ; Load from addrspace(1)
  store i8 %tmp, i8* %a ; Store to the alloca in addrspace(0); now %A and @G have the same content.
  call void @bar(i8* %a) readonly
  ret void
}

If I understand correctly, there is no load involved here after optimization. Before the optimization, the function bar() takes a pointer to the memory location %A, in address space 0. After the optimization, the function still takes a pointer to address space 0. So this part is valid.

Since the LangRef for addrspacecast says "if the address space conversion is legal then both result and operand refer to the same memory location", the optimizer is free to assume that a load of @G through either addrspace(0) or addrspace(1) yields the same value, and thus can conclude that the alloca's contents are the same as @G's. Since bar() is read-only, exchanging a pointer for another one in the same address space, when both point to the same contents, seems valid to me.

—
Mehdi

> I can (and have) fixed this as part of the patch I’m working on, but since this implies that a couple of existing regression tests would be incorrect I just wanted to double-check that I’m not misinterpreting something.
>
> CCing Matt Arsenault who added those specific tests originally in r207054.
>
> James
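A compact way to see the LangRef property being relied on here. This sketch uses a hypothetical i32 global @G2 (not from the test file) and assumes the addrspace(0)-to-addrspace(1) conversion is legal on the target; under that assumption both functions must return the same value.

@G2 = constant i32 42

; Both functions read the same memory location: if the address space
; conversion is legal, LangRef says the cast pointer refers to the same
; location as @G2, so both loads must observe the same 42.
define i32 @load_via_as0() {
  %v = load i32, i32* @G2
  ret i32 %v
}

define i32 @load_via_as1() {
  %v = load i32, i32 addrspace(1)* addrspacecast (i32* @G2 to i32 addrspace(1)*)
  ret i32 %v
}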
James Price via llvm-dev
2017-Jan-03 01:24 UTC
[llvm-dev] Optimisation passes introducing address space casts
> On 2 Jan 2017, at 22:52, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> Since the LangRef for addrspacecast says "if the address space conversion is legal then both result and operand refer to the same memory location", the optimizer is free to assume that a load of @G through either addrspace(0) or addrspace(1) yields the same value, and thus can conclude that the alloca's contents are the same as @G's.
> Since bar() is read-only, exchanging a pointer for another one in the same address space, when both point to the same contents, seems valid to me.

OK, that makes sense, thanks for the explanation.

With that in mind, I *think* these two new tests are all that is needed to expose the bug that I’m trying to fix (they both fail with 3.9). I’ve got them passing with a patch to 4.0, so I’ll try and post something for review soon.

define void @test10(i64 %index, i32* %out) {
  %A = alloca %U, align 16
  %a = bitcast %U* %A to i8*
  call void @llvm.memcpy.p0i8.p1i8.i64(i8* %a, i8 addrspace(1)* bitcast ([2 x %U] addrspace(1)* @I to i8 addrspace(1)*), i64 20, i32 16, i1 false)
  %tmp1 = bitcast i8* %a to i32*
  %tmp2 = getelementptr i32, i32* %tmp1, i64 %index
  %tmp3 = load i32, i32* %tmp2
  store i32 %tmp3, i32* %out
; CHECK-LABEL: @test10(
; CHECK-NOT: alloca
; CHECK-NOT: memcpy
; CHECK-NOT: addrspacecast
; Ensure that the load still happens through addrspace(1).
; CHECK: load i32, i32 addrspace(1)*
  ret void
}

define void @test12() {
  %A = alloca %U, align 16
  %a = bitcast %U* %A to i8*
  call void @llvm.memcpy.p0i8.p1i8.i64(i8* %a, i8 addrspace(1)* bitcast ([2 x %U] addrspace(1)* @I to i8 addrspace(1)*), i64 20, i32 16, i1 false)
  call void @bar(i8* %a) readonly
; Must retain the memcpy, as the source address space doesn't match the function argument.
; CHECK-LABEL: @test12(
; CHECK: llvm.memcpy
; CHECK: bar
  ret void
}

James
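For orientation, the CHECK lines in @test10 imply that the fixed output keeps the access in addrspace(1), roughly as sketched below. The %U and @I definitions are placeholders standing in for the real ones in memcpy-from-global.ll, and the exact GEP/bitcast shape is a guess; the test itself only pins down the absence of the alloca, memcpy, and addrspacecast, plus the addrspace(1) load.

; Placeholder definitions; the real %U and @I live in memcpy-from-global.ll.
%U = type { i32, i32, i32, i32, i32 }
@I = addrspace(1) constant [2 x %U] zeroinitializer

define void @test10_expected(i64 %index, i32* %out) {
  ; The copy is folded away and the element is read directly from @I,
  ; still through addrspace(1).
  %p = bitcast [2 x %U] addrspace(1)* @I to i32 addrspace(1)*
  %gep = getelementptr i32, i32 addrspace(1)* %p, i64 %index
  %v = load i32, i32 addrspace(1)* %gep
  store i32 %v, i32* %out
  ret void
}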
James Price via llvm-dev
2017-Jan-03 22:32 UTC
[llvm-dev] Optimisation passes introducing address space casts
OK, I’ve hit one more existing regression test that I’m wary of:

define void @test2_addrspacecast() {
  %A = alloca %T
  %B = alloca %T
  %a = addrspacecast %T* %A to i8 addrspace(1)*
  %b = addrspacecast %T* %B to i8 addrspace(1)*
  call void @llvm.memcpy.p1i8.p0i8.i64(i8 addrspace(1)* %a, i8* bitcast (%T* @G to i8*), i64 124, i32 4, i1 false)
  call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %b, i8 addrspace(1)* %a, i64 124, i32 4, i1 false)
  call void @bar_as1(i8 addrspace(1)* %b)
  ret void
}

This time there is no pre-existing addrspacecast on the global @G to convince the optimiser that loading from @G through anything other than addrspace(0) is legal, but the current instcombine code produces this (which is what the test expects):

  %B1 = alloca [124 x i8], align 8
  %B1.sub = getelementptr inbounds [124 x i8], [124 x i8]* %B1, i64 0, i64 0
  %b = addrspacecast i8* %B1.sub to i8 addrspace(1)*
  call void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* %b, i8 addrspace(1)* addrspacecast (i8* getelementptr inbounds (%T, %T* @G, i64 0, i32 0) to i8 addrspace(1)*), i64 124, i32 8, i1 false)
  call void @bar_as1(i8 addrspace(1)* %b)

Surely this load from @G via addrspace(1) is not valid, since there is no prior addrspacecast in the unoptimised IR that would allow the optimiser to conclude that this is legal?

James

> On 2 Jan 2017, at 22:52, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> Since the LangRef for addrspacecast says "if the address space conversion is legal then both result and operand refer to the same memory location", the optimizer is free to assume that a load of @G through either addrspace(0) or addrspace(1) yields the same value, and thus can conclude that the alloca's contents are the same as @G's.
> Since bar() is read-only, exchanging a pointer for another one in the same address space, when both point to the same contents, seems valid to me.
>
> —
> Mehdi
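Applying the same load/store desugaring used earlier for @test3_addrspacecast makes the concern explicit. A single-byte sketch, with placeholder %T and @G definitions standing in for the ones in memcpy-from-global.ll and the bar_as1 call omitted for brevity:

; Placeholder definitions; the real %T and @G live in memcpy-from-global.ll.
%T = type { i8, [123 x i8] }
@G = constant %T zeroinitializer

define void @test2_addrspacecast_desugared() {
  %B1 = alloca [124 x i8], align 8
  %B1.sub = getelementptr inbounds [124 x i8], [124 x i8]* %B1, i64 0, i64 0
  ; The residual memcpy is equivalent to loading @G through addrspace(1),
  ; even though the unoptimised IR only ever accessed @G through addrspace(0).
  %tmp = load i8, i8 addrspace(1)* addrspacecast (i8* getelementptr inbounds (%T, %T* @G, i64 0, i32 0) to i8 addrspace(1)*)
  store i8 %tmp, i8* %B1.sub
  ret void
}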