thr3ads.net - llvm dev - [LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?) [Jan 2013]

If this information is useful, please help other people find it:
Share via:

Dmitry Mikushin

2013-Jan-04 00:52 UTC

[LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

Hi,

Here's another case, different in high-level, but similar in low-level.
When Fortran allocatable array is defined in module, its actual dimensions
are kept in internal structure. Loads originated from reading these
dimensions confuse Polly on any use of this array.

Attachments:
1) Sample Fortran source code (to be compiled with and without -DMODULE to
see failing and working version, respectively).
2) LLVM IR for both cases right before Polly analysis (initialization loop
"array = 2")

Below is diff for quick look:

marcusmae at M17xR4:~/forge/kernelgen/tests/behavior/module_array$ diff -u
array.loop.9.ll module_array.loop.9.ll
--- array.loop.9.ll    2013-01-04 01:37:40.312259953 +0100
+++ module_array.loop.9.ll    2013-01-04 01:37:50.036259544 +0100
@@ -12,34 +12,39 @@
   br label %"10.cloned"

 "16.cloned":                                      ; preds =
%"15.cloned"
-  %1 = add i64 %indvar3, 1
-  %exitcond12 = icmp eq i64 %1, 64
-  br i1 %exitcond12, label %"17.exitStub", label
%"10.cloned"
+  %1 = add i64 %indvar1, 1
+  %exitcond9 = icmp eq i64 %1, 64
+  br i1 %exitcond9, label %"17.exitStub", label
%"10.cloned"

 "10.cloned":                                      ; preds =
%"Loop
Function Root.split", %"16.cloned"
-  %indvar3 = phi i64 [ 0, %"Loop Function Root.split" ], [ %1,
%"16.cloned" ]
-  %2 = mul i64 %indvar3, 4096
+  %indvar1 = phi i64 [ 0, %"Loop Function Root.split" ], [ %1,
%"16.cloned" ]
+  %indvar.next2 = add i64 %indvar1, 1
+  %2 = load i64* inttoptr (i64 47280713800 to i64*), align 8
+  %3 = mul i64 %2, %indvar.next2
+  %4 = add i64 %3, -4160
   br label %"12.cloned"

 "15.cloned":                                      ; preds =
%"14.cloned"
-  %3 = add i64 %indvar1, 1
-  %exitcond = icmp eq i64 %3, 64
+  %5 = add i64 %indvar3, 1
+  %exitcond = icmp eq i64 %5, 64
   br i1 %exitcond, label %"16.cloned", label %"12.cloned"

 "12.cloned":                                      ; preds =
%"10.cloned",
%"15.cloned"
-  %indvar1 = phi i64 [ 0, %"10.cloned" ], [ %3,
%"15.cloned" ]
-  %4 = mul i64 %indvar1, 64
-  %5 = add i64 %2, %4
+  %indvar3 = phi i64 [ 0, %"10.cloned" ], [ %5,
%"15.cloned" ]
+  %indvar.next4 = add i64 %indvar3, 1
+  %6 = load i64* inttoptr (i64 47280713776 to i64*), align 16
+  %7 = mul i64 %6, %indvar.next4
+  %8 = add i64 %4, %7
   br label %"14.cloned"

 "14.cloned":                                      ; preds =
%"12.cloned",
%"14.cloned"
-  %indvar = phi i64 [ 0, %"12.cloned" ], [ %7, %"14.cloned"
]
-  %6 = add i64 %5, %indvar
-  %scevgep = getelementptr float* inttoptr (i64 47246749696 to float*),
i64 %6
+  %indvar = phi i64 [ 0, %"12.cloned" ], [ %10,
%"14.cloned" ]
+  %9 = add i64 %8, %indvar
+  %scevgep = getelementptr float* inttoptr (i64 47246749696 to float*),
i64 %9
   store float 2.000000e+00, float* %scevgep, align 4
-  %7 = add i64 %indvar, 1
-  %exitcond9 = icmp eq i64 %7, 64
-  br i1 %exitcond9, label %"15.cloned", label %"14.cloned"
+  %10 = add i64 %indvar, 1
+  %exitcond7 = icmp eq i64 %10, 64
+  br i1 %exitcond7, label %"15.cloned", label %"14.cloned"

 "17.exitStub":                                    ; preds =
%"16.cloned"
   ret void

2013/1/2 Dmitry Mikushin <dmitry at kernelgen.org>
> Hi Duncan & Tobi,
>
> Thanks a lot for your interest, and for pointing out differences in GIMPLE
> I missed.
>
> Attached is simplified test case. Is it good?
>
> Tobi, regarding runtime alias analysis: in KernelGen we already do it
> along with runtime values substitution. For example:
>
> <------------------ __kernelgen_main_loop_17: compile started
> --------------------->
>     Integer args substituted:
>         offset = 32, ptrValue = 47248855040
>         offset = 40, ptrValue = 47246749696
>         offset = 48, ptrValue = 47247802368
>         offset = 16, value = 64
>         offset = 20, value = 64
>         offset = 24, value = 64
> MemoryAccess to pointer: float* inttoptr (i64 47246749696 to float*)
>     { Stmt__12_cloned_[i0, i1, i2] -> MemRef_nttoptr (i64 47246749696 to
> float*)[4096i0 + 64i1 + i2] }
>         allocSize: 4 storeSize: 4
>     replacedBy: { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >>
47246749696 + 16384i0 + 256i1 + 4i2 and o0 <= 47246749699 + 16384i0 + 256i1
> + 4i2 }
> MemoryAccess to pointer: float* inttoptr (i64 47247802368 to float*)
>     { Stmt__12_cloned_[i0, i1, i2] -> MemRef_nttoptr (i64 47247802368 to
> float*)[4096i0 + 64i1 + i2] }
>         allocSize: 4 storeSize: 4
>     replacedBy: { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >>
47247802368 + 16384i0 + 256i1 + 4i2 and o0 <= 47247802371 + 16384i0 + 256i1
> + 4i2 }
> MemoryAccess to pointer: float* inttoptr (i64 47248855040 to float*)
>     { Stmt__12_cloned_[i0, i1, i2] -> MemRef_nttoptr (i64 47248855040 to
> float*)[4096i0 + 64i1 + i2] }
>         allocSize: 4 storeSize: 4
>     replacedBy: { Stmt__12_cloned_[i0, i1, i2] -> NULL[o0] : o0 >>
47248855040 + 16384i0 + 256i1 + 4i2 and o0 <= 47248855043 + 16384i0 + 256i1
> + 4i2 }
>
>     Number of good nested parallel loops: 3
>     Average size of loops: 64 64 64
>
> <------------------------------ Scop: end
> ----------------------------------->
>
> <------------------------------ Scop: start
> --------------------------------->
> <------------------- Cloog AST of Scop ------------------->
> for (c2=0;c2<=63;c2++) {
>   for (c4=0;c4<=63;c4++) {
>     for (c6=0;c6<=63;c6++) {
>       Stmt__12_cloned_(c2,c4,c6);
>     }
>   }
> }
> <--------------------------------------------------------->
>     Context:
>     {  :  }
>     Statements {
>         Stmt__12_cloned_
>             Domain :>                 { Stmt__12_cloned_[i0, i1, i2] :
i0 >= 0 and i0 <= 63 and
> i1 >= 0 and i1 <= 63 and i2 >= 0 and i2 <= 63 };
>             Scattering :>                 { Stmt__12_cloned_[i0, i1, i2]
-> scattering[0, i0, 0, i1,
> 0, i2, 0] };
>             ReadAccess :>                 { Stmt__12_cloned_[i0, i1, i2]
-> NULL[o0] : o0 >> 47246749696 + 16384i0 + 256i1 + 4i2 and o0 <=
47246749699 + 16384i0 + 256i1
> + 4i2 };
>             ReadAccess :>                 { Stmt__12_cloned_[i0, i1, i2]
-> NULL[o0] : o0 >> 47247802368 + 16384i0 + 256i1 + 4i2 and o0 <=
47247802371 + 16384i0 + 256i1
> + 4i2 };
>             WriteAccess :>                 { Stmt__12_cloned_[i0, i1,
i2] -> NULL[o0] : o0 >> 47248855040 + 16384i0 + 256i1 + 4i2 and o0
<= 47248855043 + 16384i0 + 256i1
> + 4i2 };
>     }
> <------------------------------ Scop: end
> ----------------------------------->
> <------------------------------ Scop: dependences
> --------------------------->
> Write after read dependences:
>     {  }
> Read after write dependences:
>     {  }
> Write after write dependences:
>     {  }
>     loop is parallel
>         loop is parallel
>             loop is parallel
> <------------------------------ Scop: dependences end
> ----------------------->
> 1 polly-detect - Number of regions that a valid part of Scop
> <------------------ __kernelgen_main_loop_17: compile completed
> ------------------->
>
> It works pretty well in many situations, but in this particular case it
> does not help. Those problematic "Fortran scalar values referred by
> pointers" (FSVRPs) can only substituted (replaced by actual value)
after
> proper memory analysis. According to current design, memory analysis
> operates on SCoPs, but Polly is already unable to detect SCoP for the whole
> group of nested loops due to presence of those FSVRPs. So, chicken and egg
> problem.
>
> - D.
>
>
> 2013/1/2 Tobias Grosser <tobias at grosser.es>
>
>> On 01/01/2013 02:45 PM, Duncan Sands wrote:
>>
>>> Hi Dmitry,
>>>
>>>
>>>> In our compiler we use a modified version LLVM Polly, which is
very
>>>> sensitive to
>>>> proper code generation. Among the number of limitations, the
loop region
>>>> (enclosed by phi node on induction variable and branch) is
required to
>>>> be free
>>>> of additional memory-dependent branches. In other words, there
must be
>>>> no
>>>> conditional "br" instructions below phi nodes. The
problem we are
>>>> facing is that
>>>> from *identical* GIMPLE for 3d loop used in different contexts
>>>> DragonEgg may
>>>> generate LLVM IR either conforming the described limitation, or
>>>> violating it.
>>>>
>>>
>>> the gimple isn't the same at all (see below).  The differences
are
>>> directly
>>> reflected in the unoptimized LLVM IR, turning up as additional
memory
>>> loads
>>> in the "bad" version.  In addition, the Fortran isn't
really the same
>>> either:
>>> Fortran semantics allows the compiler to assume that the parameters
of
>>> your
>>> new function "compute" (which are all passed in by
reference, i.e. as
>>> pointers)
>>> do not alias each other or anything else in sight (i.e. they get
the
>>> "restrict"
>>> qualifier in the gimple, noalias in the LLVM IR).  Thus by
factorizing
>>> the loop
>>> into "compute" you are actually giving the compiler more
information.
>>>
>>> Summary:
>>>    (1) as far as I can see the unoptimized LLVM IR is a direct
>>> reflection of
>>> the gimple: the differences for the loop part come directly from
>>> differences
>>> in the gimple;
>>>    (2) the optimizers do a better good when you have
"compute" partly
>>> because you
>>> provided them with additional aliasing information; this better
optimized
>>> version then gets inlined into MAIN__.
>>>    (3) this leaves the question of whether in the bad version it is
>>> logically
>>> possible for the optimizers to deduce the same aliasing information
as is
>>> handed to them for free in the good version.  To work this out it
would
>>> be
>>> nice to have a smaller testcase.
>>>
>>
>> I would also be interested in a minimal test case. If e.g. only the
alias
>> check is missing, we could introduce run-time alias checks such that
Polly
>> would be able to optimize both versions. It is probably not as simple,
but
>> a reduced test case would make it easier to figure out the exact
problems.
>>
>> Thanks
>> Tobi
>>
>>
>> ______________________________**_________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>
http://lists.cs.uiuc.edu/**mailman/listinfo/llvmdev<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>
>>
>
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130104/5659f506/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: array.loop.9.ll
Type: application/octet-stream
Size: 1913 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130104/5659f506/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: module_array.loop.9.ll
Type: application/octet-stream
Size: 2146 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130104/5659f506/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: module_array.F90
Type: application/octet-stream
Size: 2685 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130104/5659f506/attachment-0002.obj>

Duncan Sands

2013-Jan-04 14:49 UTC

head link

[LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

Hi Dmitry,

 > Here's another case, different in high-level, but similar in
low-level.

I think it's completely different at low-level too!

  When> Fortran allocatable array is defined in module, its actual dimensions are
kept
> in internal structure. Loads originated from reading these dimensions
confuse
> Polly on any use of this array.
>
> Attachments:
> 1) Sample Fortran source code (to be compiled with and without -DMODULE to
see
> failing and working version, respectively).
> 2) LLVM IR for both cases right before Polly analysis (initialization loop
> "array = 2")
>
> Below is diff for quick look:
The problem here seems to be:

- at the start of @main the externally visible global variable @__mod_MOD_array
is initialized with explicit values;
- later, functions like _gfortran_st_write are called, which as far as the
optimizers know may be modifying the contents of @__mod_MOD_array;
- thus after these calls the array contents are reloaded.

Remarks:

(1) Does @__mod_MOD_array have to be externally visible?  If you give it
internal linkage and reoptimize then all of the loads you don't like go
away.
It is the Fortran front-end that gives it external linkage, but maybe that's
a bug in the Fortran front-end?

(2) If it does have to be externally visible, maybe LLVM's alias analysis
can be
taught stuff about _gfortran_st_write and friends, which I think only read/write
memory via their arguments (so won't be sneakily accessing
@__mod_MOD_array).

Ciao, Duncan.

Duncan Sands

2013-Jan-04 14:58 UTC

head link

[LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

PS: Another possibility is to do link-time optimization, since at that point the
optimizers are capable of finding out if that global is used anywhere else or
not.

Dmitry Mikushin

2013-Jan-04 16:23 UTC

head link

[LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

Hi Duncan,

Thanks for looking into this!
> (1) Does @__mod_MOD_array have to be externally visible?
Fortran modules by definition share their entire contents with user sites
(functions, other modules), unless some module member is explicitly marked
private. So, all data coming from modules should be essentially global by
default.
> (2) If it does have to be externally visible, maybe LLVM's alias
analysiscan be
taught stuff about _gfortran_st_write and friends
> Another possibility is to do link-time optimization
Right, but this is a limited solution, so is LTO. The parallizing compiler
with are developing must be able to handle the general case, otherwise
people would never be happy with it.
> I think it's completely different at low-level too!
I meant these cases are common with respect to unexpected loads. Trying to
eliminate them at compile-time using function-specific knowledge may suite
only well-known APIs, e.g. C99 math, libgfortran, etc. But for general case
this approach leads nowhere. I agree here with Tobi, that runtime analysis
injection may help very well, if inserted *after* memory analysis from ISL
is available, but *before* Polly starts to drop away invalid SCoPs. This is
currently impossible, because ISL only works on valid scopes supplied by
Polly. I'm looking for design routes to let ISL work on regions, that are
not yet valid SCoPs, but may become valid, if according to memory analysis
we are eligible to replace problematic loads with constants.

- D.

2013/1/4 Duncan Sands <baldrick at free.fr>
> PS: Another possibility is to do link-time optimization, since at that
> point the
> optimizers are capable of finding out if that global is used anywhere else
> or
> not.
>

2013/1/4 Duncan Sands <baldrick at free.fr>
> Hi Dmitry,
>
>
> > Here's another case, different in high-level, but similar in
low-level.
>
> I think it's completely different at low-level too!
>
>
>  When
>
>> Fortran allocatable array is defined in module, its actual dimensions
are
>> kept
>> in internal structure. Loads originated from reading these dimensions
>> confuse
>> Polly on any use of this array.
>>
>> Attachments:
>> 1) Sample Fortran source code (to be compiled with and without -DMODULE
>> to see
>> failing and working version, respectively).
>> 2) LLVM IR for both cases right before Polly analysis (initialization
loop
>> "array = 2")
>>
>> Below is diff for quick look:
>>
>
> The problem here seems to be:
>
> - at the start of @main the externally visible global variable
> @__mod_MOD_array
> is initialized with explicit values;
> - later, functions like _gfortran_st_write are called, which as far as the
> optimizers know may be modifying the contents of @__mod_MOD_array;
> - thus after these calls the array contents are reloaded.
>
> Remarks:
>
> (1) Does @__mod_MOD_array have to be externally visible?  If you give it
> internal linkage and reoptimize then all of the loads you don't like go
> away.
> It is the Fortran front-end that gives it external linkage, but maybe
> that's
> a bug in the Fortran front-end?
>
> (2) If it does have to be externally visible, maybe LLVM's alias
analysis
> can be
> taught stuff about _gfortran_st_write and friends, which I think only
> read/write
> memory via their arguments (so won't be sneakily accessing
> @__mod_MOD_array).
>
> Ciao, Duncan.
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20130104/ab63de3e/attachment.html>

Duncan Sands

2013-Jan-08 16:11 UTC

head link

[LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

On 04/01/13 15:58, Duncan Sands wrote:> PS: Another possibility is to do link-time optimization, since at that
point the
> optimizers are capable of finding out if that global is used anywhere else
or
> not.
By the way, do the GCC optimizers get this, i.e. remove the loads?

Ciao, Duncan.

Apparently Analagous Threads

Search for more maybe matching threads

llvm dev - Jan 2013 - [LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

[LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

[LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

[LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

[LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

[LLVMdev] [Polly] Aliasing problems escalation (WAS: Re: [DragonEgg] [Polly] Should we expect DragonEgg to produce identical LLVM IR for identical GIMPLE?)

Apparently Analagous Threads