thr3ads.net - llvm dev - [LLVMdev] ScheduleDAGInstrs.cpp [Dec 2014]

If this information is useful, please help other people find it:
Share via:

Jonas Paulsson

2014-Dec-16 09:50 UTC

[LLVMdev] ScheduleDAGInstrs.cpp

Hi,

Thank you for the reply.
>It looks to me like we can choose any subset of edges here and be correct.
We're basically trying to prune/pinch the DAG edges here. They can easily
blow up with AA sched. I would guess that isCtrl() edges are good ones to bypass
because they could be a low-latecy edges, whereas true data dependencies from a
load are expected to be >higher latency, so they are good candidates to stop
the search.
I cannot say I understand your answer, unfortunately, and to clarify all this
eventually, I will explain my point of view:

I think there are cases where adjustSchedDeps() MUST make a call to
MIsNeedChainEdge() between to specific SUs, or the DAG will remain illegal. Here
is my little example again, where SU(0) and SU(2) have a store - load memory
dependence which must be modelled with an edge.

SUnits:
SU(0) Store to address A
SU(1) Store to address B
SU(2) Load from address A

DAG building, bottom-up:
SU(2) gets pushed onto PendingLoads.
SU(1) becomes AliasChain. A MayAlias edge is added between SU(2) and SU(1).
SU(0) becomes AliasChain. TII->areMemAccessesTriviallyDisjoint() says that 
SU(0) and SU(1) are disjoint, and gets no edge, and SU(1) is inserted into
RejectMemNodes (SU(1) and SU(2) were not trivially disjoint). At this moment
there is a missing dep between SU(0) and SU(2), and therefore RejectMemNodes
MUST be checked so that SU(2) is found and added as a successor to SU(0). In
this case,all deps which are isNormalMemory() or isBarrier() MUST be followed.

So following "any subset of edges" would not work, as there is only an
Order edge between SU(1) and SU(2).

Regarding data, anti and output edges, I would say they should *not* be
followed, since then MIsNeedChainEdge() will be called also on non-memory
instructions. This is not good, since such an instruction does not have a memory
operand, and MIsNeedChainEdge() will return true conservatively.

My humble guess is then that only memory deps are relevant during DAG building
when traversing successors in iterateChainSucc(), because there is only interest
in memory deps and those edges have been added already going bottom-up. They are
absolutely needed, at least because TII->areMemAccessesTriviallyDisjoint()
does not guarantee to be consistent, as it was not in the example above.

Sergei?

Best regards,

Jonas Paulsson


-----Original Message-----
From: Andrew Trick [mailto:atrick at apple.com] 
Sent: den 16 december 2014 08:15
To: Jonas Paulsson
Cc: Hal Finkel; Mattias Eriksson V; llvmdev at cs.uiuc.edu; Sanjin Sijaric; Tom
Stellard; Sergei Larin
Subject: Re: ScheduleDAGInstrs.cpp

> On Dec 14, 2014, at 3:17 AM, Jonas Paulsson <jonas.paulsson at
ericsson.com> wrote:
> 
> Hello again,
> 
> Sorry -- I think I found the problem somewhere else. I was a bit 
> confused and missed the fact that adjustChainDeps() is called a few 
> lines down and does just what I wanted :-)
> 
> I would like to instead ask another question:
> 
> Why is I->isCtrl() used in code like
> 
> // Iterate over chain dependencies only.
> for (SUnit::const_succ_iterator I = SUb->Succs.begin(), E =
SUb->Succs.end();
>     I != E; ++I)
>  if (I->isCtrl())
>    iterateChainSucc (AA, MFI, SUa, I->getSUnit(), ExitSU, Depth, 
> Visited);
> 
> ?
> 
> I thought only chain edges are relevant, and would instead use if 
> (J->getKind() == SDep::Order)
> 
> I got strange edges, from memory accesses to normal operation instructions
that do not touch memory, because also anti and output edges are followed.
You may be right. I’ve never debugged this code, because I haven’t enabled AA
scheduler. CCing the author, Sergei.

It looks to me like we can choose any subset of edges here and be correct.
We're basically trying to prune/pinch the DAG edges here. They can easily
blow up with AA sched. I would guess that isCtrl() edges are good ones to bypass
because they could be a low-latecy edges, whereas true data dependencies from a
load are expected to be higher latency, so they are good candidates to stop the
search.

If Sergei are someone can confirm this is the case, we should add a comment. If
you can determine experimentally that some other condition results in smaller
DAGs without sacrificing performance, that's fine with me.

-Andy
> Best regards,
> 
> Jonas Paulsson
> 
> 
> 
> 
> -----Original Message-----
> From: Jonas Paulsson
> Sent: den 13 december 2014 14:47
> To: 'Hal Finkel'
> Cc: Mattias Eriksson V; llvmdev at cs.uiuc.edu; Sanjin Sijaric; Tom 
> Stellard; Andrew Trick
> Subject: RE: ScheduleDAGInstrs.cpp
> 
> Hi,
> 
> Thank you for your reply. I have tried Tom Stellards patch, but it did not
resolve my particular problem.
> 
> I have tried to expose this problem on public target without success. I am
still however fearing that the scheduler is incorrect.
> 
> I have a (bit unoptimized) case where the LLVM I/R code looks 
> something like
> 
> %_tmp255 = load i64** %cse6.103
> store i64 %_tmp250, i64* %_tmp255
> %_tmp256 = load i64** %cse6.103
> %_tmp257 = load i64* %_tmp256
> 
> which means the same address is loaded twice, first to store a value, and
then to load it back.
> 
> The MachineInstructions to load an i64 are first pseudos and are then 
> split post regalloc and become
> 
> SU(0): addrReg0 = load *fp(-50)   // Two address registers with same value
> SU(1): addrReg1 = load *fp(-50)
> SU(2): store dataReg0:lo, *addrReg0(2) // Store a register in two 
> parts
> SU(3): store dataReg0:hi, *addrReg0
> SU(4): dataReg1:lo = load *addrReg1(2) // Load a register in two parts
> SU(5): dataReg1:hi = load *addrReg1
> 
> Since the addresses are loaded from memory, the underlying Objs become
empty for all MIs.
> 
> SU(4) and SU(5) get pushed to PendingLoads.
> SU(3) becomes AliasChain, and edges are added to SU(4) and SU(5).
PendingLoads is cleared.
> SU(2) becomes AliasChain. This time
TII->areMemAccessesTriviallyDisjoint() return true, since it can see that
addrReg0 with and without offset are disjoint. This was not the case between
SU(3) and SU(4), since they were different address registers (with the same
value). The result is that there is no edge added between SU(2) and SU(3) and
not to SU(4) since PendlingLoads was cleared earlier. SU(2) and SU(4) are
however a store and a load against the same address.
> 
> The problem is that TII->areMemAccessesTriviallyDisjoint() is incomplete
and therefore inconsistent depending on if the two memory accesses are trivial
to analyze or not.
> 
> Since TII->areMemAccessesTriviallyDisjoint() are called when resetting
AliasChain, I think it is necessary to call iterateChainSucs() on the old
AliasChain, to make sure there are no SUs succeeding the old AliasChain (and are
thus not in RejectMemNodes) that have a dependency on the new AliasChain.
> 
> Or am I missing something? Very interested in your opinion.
> 
> I am thinking of something like
> 
>    new_alias_chain:
>      // Chain all possibly aliasing memory references through SU.
>      if (AliasChain) {
>        unsigned ChainLatency = 0;
>        if (AliasChain->getInstr()->mayLoad())           ChainLatency
= TrueMemOrderLatency;
>         addChainDependency(AAForDep, MFI, SU, AliasChain, RejectMemNodes,
>                            ChainLatency);
> +
> +        // Since TII->areMemAccessesTriviallyDisjoint() is not
> +        // complete, iterate all chain successors of old
> +        // AliasChain. Otherwise, if an address is stored in two
> +        // registers, and only one of the registers is trivially
> +        // disjoint to old AliasChain, a dependency might be missed.
> +        // SU(0: Address A) -> old AliasChain (no edge) SU(2: Address
A)
> +        // SU(0) and SU(2) get a missed dependency.
> +        SmallPtrSet<const SUnit*, 16> Visited;
> +        unsigned Depth = 0;
> +        for (SUnit::const_succ_iterator J = AliasChain->Succs.begin(),
> +               JE = AliasChain->Succs.end(); J != JE; ++J)
> +          if (J->isCtrl())
> +            iterateChainSucc (AA, MFI, SU, J->getSUnit(),
> +                              &ExitSU, &Depth, Visited);
>       }
>       AliasChain = SU;
> 
> 
> Best regards,
> 
> Jonas Paulsson
> 
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: den 9 december 2014 19:43
> To: Jonas Paulsson
> Cc: Mattias Eriksson V; llvmdev at cs.uiuc.edu; Sanjin Sijaric; Tom 
> Stellard; Andrew Trick
> Subject: Re: ScheduleDAGInstrs.cpp
> 
> Hi Jonas,
> 
> Tom Stellard (cc'd) just recently fixed a bug whereby dependencies on
the barrier chain would not be added when the SU had no underlying objects
(r223717). Does this fix the problem you're highlighting?
> 
> I've also cc'd Sanjin who developed the
areMemAccessesTriviallyDisjoint() callback (which I'll note is a
relatively-recent addition -- added on Sept. 8th of this year).
> 
> In short, if r223717 does not resolve the problem, it sounds like you
likely found a bug. What target are you using? I'm trying to get some idea
of what your areMemAccessesTriviallyDisjoint() callback is doing.
> 
> -Hal
> 
> ----- Original Message -----
>> From: "Jonas Paulsson" <jonas.paulsson at ericsson.com>
>> To: hfinkel at anl.gov, llvmdev at cs.uiuc.edu
>> Cc: "Mattias Eriksson V" <mattias.v.eriksson at
ericsson.com>
>> Sent: Monday, December 8, 2014 8:06:27 AM
>> Subject: ScheduleDAGInstrs.cpp
>> 
>> Hi,
>> 
>> Can anyone help me to understand the
>> ScheduleDAGInstrs::buildSchedGraph() method?
>> 
>> 
>> 
>> I find the handling of AliasChain is disturbing since:
>> 
>> 
>> 
>> 1. A new alias chain add deps to all possibly aliasing SUs, and then 
>> clears those lists.
>> 
>> 2. When AliasChain is present, the addChainDependency() method is 
>> called,
>> 
>> but the target hook areMemAccessesTriviallyDisjoint() called inside
>> 
>> MIsNeedChainEdge() allows this edge to be skipped.
>> 
>> 
>> 
>> This means that I get a case where
>> 
>> 
>> 
>> SU0
>> 
>> 
>> 
>> SU1, AliasChain
>> 
>> /\
>> 
>> / \ // Aliasing memory accesses
>> 
>> SU2
>> 
>> 
>> 
>> all SUs have memory operands, but the underlying objects vectors are 
>> empty.
>> 
>> SU1 became AliasChain, added dep to SU2, and cleared the lists for 
>> aliasing memory accesses.
>> 
>> SU0 has a dependency towards SU2, but towards SU1 it is trivially 
>> disjoint, and therefore it gets
>> 
>> no dependency to SU1 and * neither to SU2 *. The AliasChain concept 
>> is bypassed.
>> 
>> 
>> 
>> I don’t understand how it can first be assumed that an SU becomes 
>> AliasChain, and then an SU
>> 
>> with lower NodeNum that may alias is allowed to skip its dep to the 
>> AliasChain?
>> 
>> The BarrierChain is never skipped because addPred() is called 
>> directly, and I don’t see how it is
>> 
>> safe to skip the AliasChain for aliasing SUs, I think it should 
>> always be added? In other words,
>> 
>> it may be that SU0 has a dep towards SU2 in the example, but not 
>> towards SU1, so therefore it
>> 
>> is not safe to skip this dep.
>> 
>> 
>> 
>> Calling addPred() instead of addChainDependency() for AliasChain 
>> should fix this, however it
>> 
>> would lead to overly constrained DAGs compared to putting the edges 
>> where they actually should
>> 
>> be (SU0<-SU2 instead of SU0<-SU1).
>> 
>> 
>> 
>> Most thankful for any help,
>> 
>> 
>> 
>> Jonas Paulsson
>> 
>> 
>> 
>> 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory

Andrew Trick

2014-Dec-16 16:45 UTC

head link

[LLVMdev] ScheduleDAGInstrs.cpp

> On Dec 16, 2014, at 1:50 AM, Jonas Paulsson <jonas.paulsson at
ericsson.com> wrote:
> 
> Hi,
> 
> Thank you for the reply.
> 
>> It looks to me like we can choose any subset of edges here and be
correct. We're basically trying to prune/pinch the DAG edges here. They can
easily blow up with AA sched. I would guess that isCtrl() edges are good ones to
bypass because they could be a low-latecy edges, whereas true data dependencies
from a load are expected to be >higher latency, so they are good candidates
to stop the search.
I can see this works differently than I thought it did last night. I thought the
algorithm would skip a node, then propagate the potential dependence downward to
successors. I thought that if it didn’t propagate dependence through a node,
then it would conservatively add an edge to that node. But now I can see that’s
not happening.

So I’m going to assume the following: all visited memory instructions must be
either in RejectNodes, PendingLoads, or transitively dependent on the
AliasChain. i.e. Adding a non-chain dependence edge to a store does not allow us
to remove the store from RejectNodes. IOW you can have RejectNodes depending on
other RejectNodes. That changes my answer!
> I cannot say I understand your answer, unfortunately, and to clarify all
this eventually, I will explain my point of view:
> 
> I think there are cases where adjustSchedDeps() MUST make a call to
MIsNeedChainEdge() between to specific SUs, or the DAG will remain illegal. Here
is my little example again, where SU(0) and SU(2) have a store - load memory
dependence which must be modelled with an edge.
> 
> SUnits:
> SU(0) Store to address A
> SU(1) Store to address B
> SU(2) Load from address A
> 
> DAG building, bottom-up:
> SU(2) gets pushed onto PendingLoads.
> SU(1) becomes AliasChain. A MayAlias edge is added between SU(2) and SU(1).
> SU(0) becomes AliasChain. TII->areMemAccessesTriviallyDisjoint() says
that  SU(0) and SU(1) are disjoint, and gets no edge, and SU(1) is inserted into
RejectMemNodes (SU(1) and SU(2) were not trivially disjoint). At this moment
there is a missing dep between SU(0) and SU(2), and therefore RejectMemNodes
MUST be checked so that SU(2) is found and added as a successor to SU(0). In
this case,all deps which are isNormalMemory() or isBarrier() MUST be followed.
> 
> So following "any subset of edges" would not work, as there is
only an Order edge between SU(1) and SU(2).
Yes. If you’re not going to conservatively add an edge to nodes that aren’t
followed, then you must follow any edge that may have resulted in the successor
being removed from AliasChain, RejectNodes, or PendingLoads.
> Regarding data, anti and output edges, I would say they should *not* be
followed, since then MIsNeedChainEdge() will be called also on non-memory
instructions. This is not good, since such an instruction does not have a memory
operand, and MIsNeedChainEdge() will return true conservatively.
Yes, I can see now that we don’t “peek through” non-memory instructions, so it
doesn’t make sense to follow non-chain edges.
> My humble guess is then that only memory deps are relevant during DAG
building when traversing successors in iterateChainSucc(), because there is only
interest in memory deps and those edges have been added already going bottom-up.
They are absolutely needed, at least because
TII->areMemAccessesTriviallyDisjoint() does not guarantee to be consistent,
as it was not in the example above.
I think you have a pretty good understanding of the algorithm. I could be
missing something, but your proposal makes sense. If you write a patch, please
add comments explaining your reasoning. And of course make sure the test suite
passes with -enable-aa-sched-mi.

-Andy
> Sergei?
> 
> Best regards,
> 
> Jonas Paulsson
> 
> 
> -----Original Message-----
> From: Andrew Trick [mailto:atrick at apple.com] 
> Sent: den 16 december 2014 08:15
> To: Jonas Paulsson
> Cc: Hal Finkel; Mattias Eriksson V; llvmdev at cs.uiuc.edu; Sanjin Sijaric;
Tom Stellard; Sergei Larin
> Subject: Re: ScheduleDAGInstrs.cpp
> 
> 
>> On Dec 14, 2014, at 3:17 AM, Jonas Paulsson <jonas.paulsson at
ericsson.com> wrote:
>> 
>> Hello again,
>> 
>> Sorry -- I think I found the problem somewhere else. I was a bit 
>> confused and missed the fact that adjustChainDeps() is called a few 
>> lines down and does just what I wanted :-)
>> 
>> I would like to instead ask another question:
>> 
>> Why is I->isCtrl() used in code like
>> 
>> // Iterate over chain dependencies only.
>> for (SUnit::const_succ_iterator I = SUb->Succs.begin(), E =
SUb->Succs.end();
>>    I != E; ++I)
>> if (I->isCtrl())
>>   iterateChainSucc (AA, MFI, SUa, I->getSUnit(), ExitSU, Depth, 
>> Visited);
>> 
>> ?
>> 
>> I thought only chain edges are relevant, and would instead use if 
>> (J->getKind() == SDep::Order)
>> 
>> I got strange edges, from memory accesses to normal operation
instructions that do not touch memory, because also anti and output edges are
followed.
> 
> You may be right. I’ve never debugged this code, because I haven’t enabled
AA scheduler. CCing the author, Sergei.
> 
> It looks to me like we can choose any subset of edges here and be correct.
We're basically trying to prune/pinch the DAG edges here. They can easily
blow up with AA sched. I would guess that isCtrl() edges are good ones to bypass
because they could be a low-latecy edges, whereas true data dependencies from a
load are expected to be higher latency, so they are good candidates to stop the
search.
> 
> If Sergei are someone can confirm this is the case, we should add a
comment. If you can determine experimentally that some other condition results
in smaller DAGs without sacrificing performance, that's fine with me.
> 
> -Andy
> 
>> Best regards,
>> 
>> Jonas Paulsson
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Jonas Paulsson
>> Sent: den 13 december 2014 14:47
>> To: 'Hal Finkel'
>> Cc: Mattias Eriksson V; llvmdev at cs.uiuc.edu; Sanjin Sijaric; Tom 
>> Stellard; Andrew Trick
>> Subject: RE: ScheduleDAGInstrs.cpp
>> 
>> Hi,
>> 
>> Thank you for your reply. I have tried Tom Stellards patch, but it did
not resolve my particular problem.
>> 
>> I have tried to expose this problem on public target without success. I
am still however fearing that the scheduler is incorrect.
>> 
>> I have a (bit unoptimized) case where the LLVM I/R code looks 
>> something like
>> 
>> %_tmp255 = load i64** %cse6.103
>> store i64 %_tmp250, i64* %_tmp255
>> %_tmp256 = load i64** %cse6.103
>> %_tmp257 = load i64* %_tmp256
>> 
>> which means the same address is loaded twice, first to store a value,
and then to load it back.
>> 
>> The MachineInstructions to load an i64 are first pseudos and are then 
>> split post regalloc and become
>> 
>> SU(0): addrReg0 = load *fp(-50)   // Two address registers with same
value
>> SU(1): addrReg1 = load *fp(-50)
>> SU(2): store dataReg0:lo, *addrReg0(2) // Store a register in two 
>> parts
>> SU(3): store dataReg0:hi, *addrReg0
>> SU(4): dataReg1:lo = load *addrReg1(2) // Load a register in two parts
>> SU(5): dataReg1:hi = load *addrReg1
>> 
>> Since the addresses are loaded from memory, the underlying Objs become
empty for all MIs.
>> 
>> SU(4) and SU(5) get pushed to PendingLoads.
>> SU(3) becomes AliasChain, and edges are added to SU(4) and SU(5).
PendingLoads is cleared.
>> SU(2) becomes AliasChain. This time
TII->areMemAccessesTriviallyDisjoint() return true, since it can see that
addrReg0 with and without offset are disjoint. This was not the case between
SU(3) and SU(4), since they were different address registers (with the same
value). The result is that there is no edge added between SU(2) and SU(3) and
not to SU(4) since PendlingLoads was cleared earlier. SU(2) and SU(4) are
however a store and a load against the same address.
>> 
>> The problem is that TII->areMemAccessesTriviallyDisjoint() is
incomplete and therefore inconsistent depending on if the two memory accesses
are trivial to analyze or not.
>> 
>> Since TII->areMemAccessesTriviallyDisjoint() are called when
resetting AliasChain, I think it is necessary to call iterateChainSucs() on the
old AliasChain, to make sure there are no SUs succeeding the old AliasChain (and
are thus not in RejectMemNodes) that have a dependency on the new AliasChain.
>> 
>> Or am I missing something? Very interested in your opinion.
>> 
>> I am thinking of something like
>> 
>>   new_alias_chain:
>>     // Chain all possibly aliasing memory references through SU.
>>     if (AliasChain) {
>>       unsigned ChainLatency = 0;
>>       if (AliasChain->getInstr()->mayLoad())          
ChainLatency = TrueMemOrderLatency;
>>        addChainDependency(AAForDep, MFI, SU, AliasChain,
RejectMemNodes,
>>                           ChainLatency);
>> +
>> +        // Since TII->areMemAccessesTriviallyDisjoint() is not
>> +        // complete, iterate all chain successors of old
>> +        // AliasChain. Otherwise, if an address is stored in two
>> +        // registers, and only one of the registers is trivially
>> +        // disjoint to old AliasChain, a dependency might be missed.
>> +        // SU(0: Address A) -> old AliasChain (no edge) SU(2:
Address A)
>> +        // SU(0) and SU(2) get a missed dependency.
>> +        SmallPtrSet<const SUnit*, 16> Visited;
>> +        unsigned Depth = 0;
>> +        for (SUnit::const_succ_iterator J =
AliasChain->Succs.begin(),
>> +               JE = AliasChain->Succs.end(); J != JE; ++J)
>> +          if (J->isCtrl())
>> +            iterateChainSucc (AA, MFI, SU, J->getSUnit(),
>> +                              &ExitSU, &Depth, Visited);
>>      }
>>      AliasChain = SU;
>> 
>> 
>> Best regards,
>> 
>> Jonas Paulsson
>> 
>> -----Original Message-----
>> From: Hal Finkel [mailto:hfinkel at anl.gov]
>> Sent: den 9 december 2014 19:43
>> To: Jonas Paulsson
>> Cc: Mattias Eriksson V; llvmdev at cs.uiuc.edu; Sanjin Sijaric; Tom 
>> Stellard; Andrew Trick
>> Subject: Re: ScheduleDAGInstrs.cpp
>> 
>> Hi Jonas,
>> 
>> Tom Stellard (cc'd) just recently fixed a bug whereby dependencies
on the barrier chain would not be added when the SU had no underlying objects
(r223717). Does this fix the problem you're highlighting?
>> 
>> I've also cc'd Sanjin who developed the
areMemAccessesTriviallyDisjoint() callback (which I'll note is a
relatively-recent addition -- added on Sept. 8th of this year).
>> 
>> In short, if r223717 does not resolve the problem, it sounds like you
likely found a bug. What target are you using? I'm trying to get some idea
of what your areMemAccessesTriviallyDisjoint() callback is doing.
>> 
>> -Hal
>> 
>> ----- Original Message -----
>>> From: "Jonas Paulsson" <jonas.paulsson at
ericsson.com>
>>> To: hfinkel at anl.gov, llvmdev at cs.uiuc.edu
>>> Cc: "Mattias Eriksson V" <mattias.v.eriksson at
ericsson.com>
>>> Sent: Monday, December 8, 2014 8:06:27 AM
>>> Subject: ScheduleDAGInstrs.cpp
>>> 
>>> Hi,
>>> 
>>> Can anyone help me to understand the
>>> ScheduleDAGInstrs::buildSchedGraph() method?
>>> 
>>> 
>>> 
>>> I find the handling of AliasChain is disturbing since:
>>> 
>>> 
>>> 
>>> 1. A new alias chain add deps to all possibly aliasing SUs, and
then
>>> clears those lists.
>>> 
>>> 2. When AliasChain is present, the addChainDependency() method is 
>>> called,
>>> 
>>> but the target hook areMemAccessesTriviallyDisjoint() called inside
>>> 
>>> MIsNeedChainEdge() allows this edge to be skipped.
>>> 
>>> 
>>> 
>>> This means that I get a case where
>>> 
>>> 
>>> 
>>> SU0
>>> 
>>> 
>>> 
>>> SU1, AliasChain
>>> 
>>> /\
>>> 
>>> / \ // Aliasing memory accesses
>>> 
>>> SU2
>>> 
>>> 
>>> 
>>> all SUs have memory operands, but the underlying objects vectors
are
>>> empty.
>>> 
>>> SU1 became AliasChain, added dep to SU2, and cleared the lists for 
>>> aliasing memory accesses.
>>> 
>>> SU0 has a dependency towards SU2, but towards SU1 it is trivially 
>>> disjoint, and therefore it gets
>>> 
>>> no dependency to SU1 and * neither to SU2 *. The AliasChain concept
>>> is bypassed.
>>> 
>>> 
>>> 
>>> I don’t understand how it can first be assumed that an SU becomes 
>>> AliasChain, and then an SU
>>> 
>>> with lower NodeNum that may alias is allowed to skip its dep to the
>>> AliasChain?
>>> 
>>> The BarrierChain is never skipped because addPred() is called 
>>> directly, and I don’t see how it is
>>> 
>>> safe to skip the AliasChain for aliasing SUs, I think it should 
>>> always be added? In other words,
>>> 
>>> it may be that SU0 has a dep towards SU2 in the example, but not 
>>> towards SU1, so therefore it
>>> 
>>> is not safe to skip this dep.
>>> 
>>> 
>>> 
>>> Calling addPred() instead of addChainDependency() for AliasChain 
>>> should fix this, however it
>>> 
>>> would lead to overly constrained DAGs compared to putting the edges
>>> where they actually should
>>> 
>>> be (SU0<-SU2 instead of SU0<-SU1).
>>> 
>>> 
>>> 
>>> Most thankful for any help,
>>> 
>>> 
>>> 
>>> Jonas Paulsson
>>> 
>>> 
>>> 
>>> 
>> 
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>

Sanjin Sijaric

2014-Dec-18 11:28 UTC

head link

[LLVMdev] ScheduleDAGInstrs.cpp

Hi,

The callback areMemAccessesTriviallyDisjoint should catch trivial cases that
would otherwise be discarded by the call to isUnsafeMemoryObject from
MIsNeedChainEdge.  So, we would never hit the AA checks later on in
MIsNeedChainEdge.  Could we pass the AA information to isUnsafeMemoryObject, and
only call isIdentifiedObject from isUnsafeMemoryObject if AA isn't being
used?  However, if AA is being used, we'll go into aliasing checks inside
MIsNeedChainEdge and let the dependency be handled there.  Then there may not be
a need for the areMemAccessesTriviallyDisjoint callback when when AA is on.  I
think Andy inquired a while ago about why the aliasing checks didn't catch
these cases.  This may still result in the problem that Jonas is seeing, as the
same edges may now get discarded due to aliasing checks.

Jonas,

Are you working on the AArch64 target?  Sorry if I missed that information in
the email chain.

Thanks,
Sanjin

-----Original Message-----
From: Andrew Trick [mailto:atrick at apple.com] 
Sent: Tuesday, December 16, 2014 8:46 AM
To: Jonas Paulsson
Cc: Hal Finkel; Mattias Eriksson V; llvmdev at cs.uiuc.edu; Sanjin Sijaric; Tom
Stellard; Sergei Larin
Subject: Re: ScheduleDAGInstrs.cpp

> On Dec 16, 2014, at 1:50 AM, Jonas Paulsson <jonas.paulsson at
ericsson.com> wrote:
> 
> Hi,
> 
> Thank you for the reply.
> 
>> It looks to me like we can choose any subset of edges here and be
correct. We're basically trying to prune/pinch the DAG edges here. They can
easily blow up with AA sched. I would guess that isCtrl() edges are good ones to
bypass because they could be a low-latecy edges, whereas true data dependencies
from a load are expected to be >higher latency, so they are good candidates
to stop the search.
I can see this works differently than I thought it did last night. I thought the
algorithm would skip a node, then propagate the potential dependence downward to
successors. I thought that if it didn’t propagate dependence through a node,
then it would conservatively add an edge to that node. But now I can see that’s
not happening.

So I’m going to assume the following: all visited memory instructions must be
either in RejectNodes, PendingLoads, or transitively dependent on the
AliasChain. i.e. Adding a non-chain dependence edge to a store does not allow us
to remove the store from RejectNodes. IOW you can have RejectNodes depending on
other RejectNodes. That changes my answer!
> I cannot say I understand your answer, unfortunately, and to clarify all
this eventually, I will explain my point of view:
> 
> I think there are cases where adjustSchedDeps() MUST make a call to
MIsNeedChainEdge() between to specific SUs, or the DAG will remain illegal. Here
is my little example again, where SU(0) and SU(2) have a store - load memory
dependence which must be modelled with an edge.
> 
> SUnits:
> SU(0) Store to address A
> SU(1) Store to address B
> SU(2) Load from address A
> 
> DAG building, bottom-up:
> SU(2) gets pushed onto PendingLoads.
> SU(1) becomes AliasChain. A MayAlias edge is added between SU(2) and SU(1).
> SU(0) becomes AliasChain. TII->areMemAccessesTriviallyDisjoint() says
that  SU(0) and SU(1) are disjoint, and gets no edge, and SU(1) is inserted into
RejectMemNodes (SU(1) and SU(2) were not trivially disjoint). At this moment
there is a missing dep between SU(0) and SU(2), and therefore RejectMemNodes
MUST be checked so that SU(2) is found and added as a successor to SU(0). In
this case,all deps which are isNormalMemory() or isBarrier() MUST be followed.
> 
> So following "any subset of edges" would not work, as there is
only an Order edge between SU(1) and SU(2).
Yes. If you’re not going to conservatively add an edge to nodes that aren’t
followed, then you must follow any edge that may have resulted in the successor
being removed from AliasChain, RejectNodes, or PendingLoads.
> Regarding data, anti and output edges, I would say they should *not* be
followed, since then MIsNeedChainEdge() will be called also on non-memory
instructions. This is not good, since such an instruction does not have a memory
operand, and MIsNeedChainEdge() will return true conservatively.
Yes, I can see now that we don’t “peek through” non-memory instructions, so it
doesn’t make sense to follow non-chain edges.
> My humble guess is then that only memory deps are relevant during DAG
building when traversing successors in iterateChainSucc(), because there is only
interest in memory deps and those edges have been added already going bottom-up.
They are absolutely needed, at least because
TII->areMemAccessesTriviallyDisjoint() does not guarantee to be consistent,
as it was not in the example above.
I think you have a pretty good understanding of the algorithm. I could be
missing something, but your proposal makes sense. If you write a patch, please
add comments explaining your reasoning. And of course make sure the test suite
passes with -enable-aa-sched-mi.

-Andy
> Sergei?
> 
> Best regards,
> 
> Jonas Paulsson
> 
> 
> -----Original Message-----
> From: Andrew Trick [mailto:atrick at apple.com]
> Sent: den 16 december 2014 08:15
> To: Jonas Paulsson
> Cc: Hal Finkel; Mattias Eriksson V; llvmdev at cs.uiuc.edu; Sanjin 
> Sijaric; Tom Stellard; Sergei Larin
> Subject: Re: ScheduleDAGInstrs.cpp
> 
> 
>> On Dec 14, 2014, at 3:17 AM, Jonas Paulsson <jonas.paulsson at
ericsson.com> wrote:
>> 
>> Hello again,
>> 
>> Sorry -- I think I found the problem somewhere else. I was a bit 
>> confused and missed the fact that adjustChainDeps() is called a few 
>> lines down and does just what I wanted :-)
>> 
>> I would like to instead ask another question:
>> 
>> Why is I->isCtrl() used in code like
>> 
>> // Iterate over chain dependencies only.
>> for (SUnit::const_succ_iterator I = SUb->Succs.begin(), E =
SUb->Succs.end();
>>    I != E; ++I)
>> if (I->isCtrl())
>>   iterateChainSucc (AA, MFI, SUa, I->getSUnit(), ExitSU, Depth, 
>> Visited);
>> 
>> ?
>> 
>> I thought only chain edges are relevant, and would instead use if
>> (J->getKind() == SDep::Order)
>> 
>> I got strange edges, from memory accesses to normal operation
instructions that do not touch memory, because also anti and output edges are
followed.
> 
> You may be right. I’ve never debugged this code, because I haven’t enabled
AA scheduler. CCing the author, Sergei.
> 
> It looks to me like we can choose any subset of edges here and be correct.
We're basically trying to prune/pinch the DAG edges here. They can easily
blow up with AA sched. I would guess that isCtrl() edges are good ones to bypass
because they could be a low-latecy edges, whereas true data dependencies from a
load are expected to be higher latency, so they are good candidates to stop the
search.
> 
> If Sergei are someone can confirm this is the case, we should add a
comment. If you can determine experimentally that some other condition results
in smaller DAGs without sacrificing performance, that's fine with me.
> 
> -Andy
> 
>> Best regards,
>> 
>> Jonas Paulsson
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Jonas Paulsson
>> Sent: den 13 december 2014 14:47
>> To: 'Hal Finkel'
>> Cc: Mattias Eriksson V; llvmdev at cs.uiuc.edu; Sanjin Sijaric; Tom 
>> Stellard; Andrew Trick
>> Subject: RE: ScheduleDAGInstrs.cpp
>> 
>> Hi,
>> 
>> Thank you for your reply. I have tried Tom Stellards patch, but it did
not resolve my particular problem.
>> 
>> I have tried to expose this problem on public target without success. I
am still however fearing that the scheduler is incorrect.
>> 
>> I have a (bit unoptimized) case where the LLVM I/R code looks 
>> something like
>> 
>> %_tmp255 = load i64** %cse6.103
>> store i64 %_tmp250, i64* %_tmp255
>> %_tmp256 = load i64** %cse6.103
>> %_tmp257 = load i64* %_tmp256
>> 
>> which means the same address is loaded twice, first to store a value,
and then to load it back.
>> 
>> The MachineInstructions to load an i64 are first pseudos and are then 
>> split post regalloc and become
>> 
>> SU(0): addrReg0 = load *fp(-50)   // Two address registers with same
value
>> SU(1): addrReg1 = load *fp(-50)
>> SU(2): store dataReg0:lo, *addrReg0(2) // Store a register in two 
>> parts
>> SU(3): store dataReg0:hi, *addrReg0
>> SU(4): dataReg1:lo = load *addrReg1(2) // Load a register in two 
>> parts
>> SU(5): dataReg1:hi = load *addrReg1
>> 
>> Since the addresses are loaded from memory, the underlying Objs become
empty for all MIs.
>> 
>> SU(4) and SU(5) get pushed to PendingLoads.
>> SU(3) becomes AliasChain, and edges are added to SU(4) and SU(5).
PendingLoads is cleared.
>> SU(2) becomes AliasChain. This time
TII->areMemAccessesTriviallyDisjoint() return true, since it can see that
addrReg0 with and without offset are disjoint. This was not the case between
SU(3) and SU(4), since they were different address registers (with the same
value). The result is that there is no edge added between SU(2) and SU(3) and
not to SU(4) since PendlingLoads was cleared earlier. SU(2) and SU(4) are
however a store and a load against the same address.
>> 
>> The problem is that TII->areMemAccessesTriviallyDisjoint() is
incomplete and therefore inconsistent depending on if the two memory accesses
are trivial to analyze or not.
>> 
>> Since TII->areMemAccessesTriviallyDisjoint() are called when
resetting AliasChain, I think it is necessary to call iterateChainSucs() on the
old AliasChain, to make sure there are no SUs succeeding the old AliasChain (and
are thus not in RejectMemNodes) that have a dependency on the new AliasChain.
>> 
>> Or am I missing something? Very interested in your opinion.
>> 
>> I am thinking of something like
>> 
>>   new_alias_chain:
>>     // Chain all possibly aliasing memory references through SU.
>>     if (AliasChain) {
>>       unsigned ChainLatency = 0;
>>       if (AliasChain->getInstr()->mayLoad())          
ChainLatency = TrueMemOrderLatency;
>>        addChainDependency(AAForDep, MFI, SU, AliasChain,
RejectMemNodes,
>>                           ChainLatency);
>> +
>> +        // Since TII->areMemAccessesTriviallyDisjoint() is not
>> +        // complete, iterate all chain successors of old
>> +        // AliasChain. Otherwise, if an address is stored in two
>> +        // registers, and only one of the registers is trivially
>> +        // disjoint to old AliasChain, a dependency might be missed.
>> +        // SU(0: Address A) -> old AliasChain (no edge) SU(2:
Address A)
>> +        // SU(0) and SU(2) get a missed dependency.
>> +        SmallPtrSet<const SUnit*, 16> Visited;
>> +        unsigned Depth = 0;
>> +        for (SUnit::const_succ_iterator J =
AliasChain->Succs.begin(),
>> +               JE = AliasChain->Succs.end(); J != JE; ++J)
>> +          if (J->isCtrl())
>> +            iterateChainSucc (AA, MFI, SU, J->getSUnit(),
>> +                              &ExitSU, &Depth, Visited);
>>      }
>>      AliasChain = SU;
>> 
>> 
>> Best regards,
>> 
>> Jonas Paulsson
>> 
>> -----Original Message-----
>> From: Hal Finkel [mailto:hfinkel at anl.gov]
>> Sent: den 9 december 2014 19:43
>> To: Jonas Paulsson
>> Cc: Mattias Eriksson V; llvmdev at cs.uiuc.edu; Sanjin Sijaric; Tom 
>> Stellard; Andrew Trick
>> Subject: Re: ScheduleDAGInstrs.cpp
>> 
>> Hi Jonas,
>> 
>> Tom Stellard (cc'd) just recently fixed a bug whereby dependencies
on the barrier chain would not be added when the SU had no underlying objects
(r223717). Does this fix the problem you're highlighting?
>> 
>> I've also cc'd Sanjin who developed the
areMemAccessesTriviallyDisjoint() callback (which I'll note is a
relatively-recent addition -- added on Sept. 8th of this year).
>> 
>> In short, if r223717 does not resolve the problem, it sounds like you
likely found a bug. What target are you using? I'm trying to get some idea
of what your areMemAccessesTriviallyDisjoint() callback is doing.
>> 
>> -Hal
>> 
>> ----- Original Message -----
>>> From: "Jonas Paulsson" <jonas.paulsson at
ericsson.com>
>>> To: hfinkel at anl.gov, llvmdev at cs.uiuc.edu
>>> Cc: "Mattias Eriksson V" <mattias.v.eriksson at
ericsson.com>
>>> Sent: Monday, December 8, 2014 8:06:27 AM
>>> Subject: ScheduleDAGInstrs.cpp
>>> 
>>> Hi,
>>> 
>>> Can anyone help me to understand the
>>> ScheduleDAGInstrs::buildSchedGraph() method?
>>> 
>>> 
>>> 
>>> I find the handling of AliasChain is disturbing since:
>>> 
>>> 
>>> 
>>> 1. A new alias chain add deps to all possibly aliasing SUs, and
then
>>> clears those lists.
>>> 
>>> 2. When AliasChain is present, the addChainDependency() method is 
>>> called,
>>> 
>>> but the target hook areMemAccessesTriviallyDisjoint() called inside
>>> 
>>> MIsNeedChainEdge() allows this edge to be skipped.
>>> 
>>> 
>>> 
>>> This means that I get a case where
>>> 
>>> 
>>> 
>>> SU0
>>> 
>>> 
>>> 
>>> SU1, AliasChain
>>> 
>>> /\
>>> 
>>> / \ // Aliasing memory accesses
>>> 
>>> SU2
>>> 
>>> 
>>> 
>>> all SUs have memory operands, but the underlying objects vectors
are
>>> empty.
>>> 
>>> SU1 became AliasChain, added dep to SU2, and cleared the lists for 
>>> aliasing memory accesses.
>>> 
>>> SU0 has a dependency towards SU2, but towards SU1 it is trivially 
>>> disjoint, and therefore it gets
>>> 
>>> no dependency to SU1 and * neither to SU2 *. The AliasChain concept
>>> is bypassed.
>>> 
>>> 
>>> 
>>> I don’t understand how it can first be assumed that an SU becomes 
>>> AliasChain, and then an SU
>>> 
>>> with lower NodeNum that may alias is allowed to skip its dep to the
>>> AliasChain?
>>> 
>>> The BarrierChain is never skipped because addPred() is called 
>>> directly, and I don’t see how it is
>>> 
>>> safe to skip the AliasChain for aliasing SUs, I think it should 
>>> always be added? In other words,
>>> 
>>> it may be that SU0 has a dep towards SU2 in the example, but not 
>>> towards SU1, so therefore it
>>> 
>>> is not safe to skip this dep.
>>> 
>>> 
>>> 
>>> Calling addPred() instead of addChainDependency() for AliasChain 
>>> should fix this, however it
>>> 
>>> would lead to overly constrained DAGs compared to putting the edges
>>> where they actually should
>>> 
>>> be (SU0<-SU2 instead of SU0<-SU1).
>>> 
>>> 
>>> 
>>> Most thankful for any help,
>>> 
>>> 
>>> 
>>> Jonas Paulsson
>>> 
>>> 
>>> 
>>> 
>> 
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>

Jonas Paulsson

2014-Dec-19 14:51 UTC

head link

[LLVMdev] ScheduleDAGInstrs.cpp

Hi,

I write again regarding buildSchedGraph(), as I am still not happy about things
there.

I have found at least two examples which do not work out:

1)

SU(2) Store "Value A"

SU(1) Store "Value A"

SU(0) Load  "Value A"

If MIsNeedChainEdge() returns false for SU(0) and SU(1), SU(0) is inserted into
RejectedMemNodes and removed from its MemUses SU list, as this list is cleared.
Therefore SU(2) must be handled with adjustChainDeps(), because it needs an edge
from SU(0).
For some reason adjustChainDeps() was only called for may-aliasing stores. I
think this is wrong, as a store will clear the MemUses SU list also in the
non-aliasing case.

If MIsNeedChainEdge() returns true for SU(0) and SU(1), what happens if
MIsNeedChainEdge() returns false between SU(2) and SU(1)? The dependency against
SU(0) will not be checked, because it is not in RejectMemNodes, nor in the
MemUses SU list.

2)

SU(2) Load  "Value A"

SU(1) Store "Value A"

SU(0) Store "Value A"

If not using alias analysis, then the MemDefs list is cleared after *maybe*
having inserted an edge from SU(0) to SU(1) with addChainDependency(). If there
was not an edge inserted, then SU(2) must get a chance to check its deps against
SU(0) by use of adjustChainDeps(). Again, this is only done if MayAlias is true.

I suspect therefore that areMemAccessesTriviallyDisjoint() is currently used in
a very dangerous way, since it might get lucky in just one of two (equivalent)
queries, and a dependency might be missed later on as seen above.

Or am I missing something here?

Senjin, I am working on an out-of-tree target. The problem I had was the first
scenario in #1, and I fixed it by moving the adjustChainDeps() call to be called
also in the non-aliasing case.

Another question I have is regarding adjustChainDeps(). If MIsNeedChainEdge()
returns true, is it still then always necessary to continue recursively with
successors? If this is a latency issue, perhaps it would be ok to do
"continue" after adding the pred if the latency is non-zero? Right now
this seems to be always zero, or one in the store-load case per the comment on
line 856.

On line 947: "A store to a specific PseudoSourceValue" should probably
say "... a specific Value / PseudoSourceValue", right? The same goes
for line 1029.

/Jonas

llvm dev - Dec 2014 - [LLVMdev] ScheduleDAGInstrs.cpp

[LLVMdev] ScheduleDAGInstrs.cpp

[LLVMdev] ScheduleDAGInstrs.cpp

[LLVMdev] ScheduleDAGInstrs.cpp

[LLVMdev] ScheduleDAGInstrs.cpp