thr3ads.net - llvm dev - [LLVMdev] "Anti" scheduling with OoO cores? [Nov 2014]


James Molloy

2014-Nov-02 12:46 UTC

[LLVMdev] "Anti" scheduling with OoO cores?

Hi Andy, Dave,

I've been doing a bit of experimentation trying to understand the
schedmodel a bit better and improving modelling of FDIV (on Cortex-A57).

FDIV is not pipelined, and blocks other FDIV operations (FDIVDrr and
FDIVSrr). This seems to be already semi-modelled, with a
"ResourceCycles=[18]" line in the SchedWriteRes for this instruction. This
doesn't seem to work (a poor schedule is produced) so I changed it to also
require another resource that I modelled as unbuffered (BufferSize=0), in
the hope that this would "block" other FDIVs... no joy.

Then I noticed that the MicroOpBufferSize is set to 128, which is wildly
high as Cortex-A57 has separated smaller reorder buffers, not one larger
reorder buffer.

Even reducing it down to "2" had no effect: the divs were scheduled in a
clump together. But "1" and "0" (denoting in-order) produced a nice
schedule.

I'd expect an OoO machine with a buffer of 2 ops to produce a very
similar schedule to an in-order machine. So where am I going wrong?

Sample attached - I'd expect the FDIVs to be equally spread across the
MULs. (The extension to this I want to model is that we can have 2
S-register FDIVs in parallel but only one D-reg FDIV, and never both, but
that can wait until I've understood what's going on here!).

Cheers,

James

James Molloy

2014-Nov-02 13:02 UTC

... now with added sample!

-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-scheduling.c
Type: text/x-csrc
Size: 563 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20141102/a947cb90/attachment.c>

Andrew Trick

2014-Nov-04 08:34 UTC

> On Nov 2, 2014, at 4:46 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
>
> Hi Andy, Dave,
>
> I've been doing a bit of experimentation trying to understand the
> schedmodel a bit better and improving modelling of FDIV (on Cortex-A57).
>
> FDIV is not pipelined, and blocks other FDIV operations (FDIVDrr and
> FDIVSrr). This seems to be already semi-modelled, with a
> "ResourceCycles=[18]" line in the SchedWriteRes for this instruction.

Pretty typical - we should be able to handle this.

> This doesn't seem to work (a poor schedule is produced) so I changed it
> to also require another resource that I modelled as unbuffered
> (BufferSize=0), in the hope that this would "block" other FDIVs... no joy.

That should create a hazard that blocks scheduling of the FDIVs. So that was the
right thing to do, assuming that’s what you want - register pressure could
suffer in some cases.

ResourceCycles is an ordered list. It’s only going to stall if the unbuffered
resource is the one taking 18 cycles. You didn’t attach your patch though, so I
can’t be sure what you actually did...
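
Editorial sketch: the point that ResourceCycles is an ordered list can be illustrated with a small Python toy. It pairs each resource with its cycle count by position and reports the longest time an unbuffered resource is held. The resource names (`FPPipe`, `FDivUnit`), the `ProcResource` class and the `hazard_cycles` helper are hypothetical and only mimic the idea; this is not how LLVM computes hazards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcResource:
    name: str
    buffer_size: int  # 0 = unbuffered; -1 = "use the processor-wide buffer" (toy convention)


def hazard_cycles(resources, resource_cycles):
    """Longest occupancy of any unbuffered resource, or 0 if there is none.

    The two lists are matched positionally, which is the point being made:
    the 18 must sit in the slot that belongs to the unbuffered resource.
    """
    if len(resources) != len(resource_cycles):
        raise ValueError("need exactly one cycle count per resource")
    return max(
        (cycles for res, cycles in zip(resources, resource_cycles)
         if res.buffer_size == 0),
        default=0,
    )


fp_pipe = ProcResource("FPPipe", buffer_size=-1)     # hypothetical buffered pipeline
fdiv_unit = ProcResource("FDivUnit", buffer_size=0)  # hypothetical unbuffered divider

# The 18 cycles land on the buffered pipe, so the divider is held for 1 cycle only.
mispaired = hazard_cycles([fp_pipe, fdiv_unit], [18, 1])
# The 18 cycles land on the unbuffered divider, so a second FDIV must wait.
paired = hazard_cycles([fp_pipe, fdiv_unit], [1, 18])
```

In this toy, only the second pairing produces a long in-order stall, which matches the remark that the unbuffered resource has to be the one taking 18 cycles.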

> Then I noticed that the MicroOpBufferSize is set to 128, which is wildly
> high as Cortex-A57 has separated smaller reorder buffers, not one larger
> reorder buffer.
>
> Even reducing it down to "2" had no effect: the divs were scheduled in a
> clump together. But "1" and "0" (denoting in-order) produced a nice
> schedule.

There’s a huge difference between 0, 1, and > 1. Beyond that, the generic
scheduler only cares in some cases of very tight loops. Your example is straight
line code so it won’t matter. You could model buffers on the individual
resources instead to be more precise, but I don’t think it will matter much
unless you start customizing heuristics by plugging in a new scheduling
strategy.
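
Editorial sketch: the three regimes mentioned above (0, 1, and anything larger) can be written down as a tiny Python classifier. The function name `scheduling_mode` and the label strings are hypothetical. The "latency-prioritized" label for 1 is borrowed from the later remark about BufferSize=1, so applying it to MicroOpBufferSize is an assumption rather than something the thread states.

```python
def scheduling_mode(micro_op_buffer_size: int) -> str:
    """Bucket a MicroOpBufferSize value into the regimes the thread describes."""
    if micro_op_buffer_size < 0:
        raise ValueError("buffer size must be non-negative")
    if micro_op_buffer_size == 0:
        return "in-order"             # described in the thread as denoting in-order
    if micro_op_buffer_size == 1:
        return "latency-prioritized"  # assumption, see the note above
    # Beyond 1, the exact value is said to matter only for some very tight loops.
    return "out-of-order"


modes = {n: scheduling_mode(n) for n in (0, 1, 2, 128)}
```

Under this reading, 2 and 128 fall into the same bucket, which is consistent with the observation that dropping the value from 128 to 2 did not change the schedule for straight-line code.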

> I'd expect an OoO machine with a buffer of 2 ops to produce a very
> similar schedule to an in-order machine. So where am I going wrong?

See above. The machine model is much more precise than the scheduler’s internal
model. It would be possible to approximately simulate the behavior of the
reorder buffer, but since most OoO machines have such large buffers now, it’s
not worth adding the cost and complexity to the generic scheduler. At least I
wasn’t able to find real examples where it mattered.

> Sample attached - I'd expect the FDIVs to be equally spread across the
> MULs.

The stalls should be modeled as long as the FDIV uses an unbuffered resource for
18 cycles and the MUL does not use the same resource at all. But the way
in-order hazards work in the scheduler, you may end up with three MULs strangely
smashed between two FDIVs.

To get a more even dispersal, you can try BufferSize=1. That basically
prioritizes for latency, but is very sensitive to a bunch of heuristics.
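
Editorial sketch: a toy single-issue scheduler in Python shows how an unbuffered 18-cycle divider separates FDIVs while MULs, which do not touch the divider, fill the gap. Everything here is an assumption made for illustration: the greedy policy, the single issue slot, the op names and the `toy_schedule` helper. It is not the generic scheduler's algorithm.

```python
def toy_schedule(ops, div_busy=18):
    """Greedy single-issue schedule: at most one op per cycle.

    An FDIV may issue only when the (unbuffered) divider is free, and then
    holds it for `div_busy` cycles. A MUL never uses the divider.
    """
    pending = list(ops)
    divider_free_at = 0
    cycle = 0
    schedule = []  # list of (issue cycle, op name)
    while pending:
        if "FDIV" in pending and cycle >= divider_free_at:
            op = "FDIV"
            divider_free_at = cycle + div_busy
        elif "MUL" in pending:
            op = "MUL"
        else:
            # Only FDIVs remain and the divider is busy: hazard stall.
            cycle = divider_free_at
            continue
        pending.remove(op)
        schedule.append((cycle, op))
        cycle += 1
    return schedule


sched = toy_schedule(["FDIV"] * 3 + ["MUL"] * 6)
fdiv_cycles = [c for c, op in sched if op == "FDIV"]
mul_cycles = [c for c, op in sched if op == "MUL"]
```

With this policy the FDIVs end up 18 cycles apart, but all six MULs are packed directly behind the first FDIV instead of being spread evenly. That uneven placement is the kind of result the reply warns about, although the real heuristics differ.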

> (The extension to this I want to model is that we can have 2 S-register
> FDIVs in parallel but only one D-reg FDIV, and never both, but that can wait
> until I've understood what's going on here!).

Hmm. The implementation of inorder scheduling with the new machine model is
pretty lame. It was a quick fix to get something working. It needs to be
extended so that it separately counts cycles for multiple units of the same
resource. It would be straightforward enough to do that. I can’t really
volunteer at the moment though.

-Andy
> 
> Cheers,
> 
> James

James Molloy

2014-Nov-04 13:54 UTC

Hi Andy,

Thanks for the reply!

>> This doesn't seem to work (a poor schedule is produced) so I changed it
>> to also require another resource that I modelled as unbuffered
>> (BufferSize=0), in the hope that this would "block" other FDIVs... no joy.
>
> That should create a hazard that blocks scheduling of the FDIVs. So that
> was the right thing to do, assuming that’s what you want - register
> pressure could suffer in some cases.

This didn't work. From looking at the misched output, it seemed to see the
unbuffered resource use, then assume nothing could be done for 18 cycles
and then carry on 18 cycles in, resulting in the FDIVs being clustered in
the final schedule.

> The machine model is much more precise than the scheduler’s internal
> model. It would be possible to approximately simulate the behavior of the
> reorder buffer, but since most OoO machines have such large buffers now,
> it’s not worth adding the cost and complexity to the generic scheduler. At
> least I wasn’t able to find real examples where it mattered.

I think this is the real crux of the matter. Cortex-A57 doesn't have a
unified reorder buffer at all. It has a separate 8-entry reorder buffer per
pipeline. So scheduling really matters, and putting 8 dependent operations
in a row can completely kill the out-of-order execution. Every dependent
operation we put in eats up a queue slot, so scheduling really can make a
difference. If we changed the machine model to have a MicroOpBufferSize of
"small", and modelled a buffersize of 8 on each of the pipeline resources -
how much of that information would the generic scheduler use?
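
Editorial sketch: the effect of a small per-pipeline queue can be illustrated with a toy Python model. Ops are dispatched in program order into a queue, at most one per cycle, and at most one ready op issues per cycle. A long dependent chain fills an 8-entry queue and delays an independent op that follows it. The single-pipeline model, the 4-cycle latency and the `issue_cycle_of_last` helper are all assumptions and do not describe the real Cortex-A57 pipeline.

```python
def issue_cycle_of_last(ops, queue_size=8, latency=4):
    """Return the cycle at which the last op in `ops` issues.

    `ops[i]` is the index of the op that op i depends on, or None.
    Each cycle: issue the oldest queued op whose producer has finished,
    then dispatch the next op in program order if the queue has a free slot.
    """
    done_at = {}    # op index -> cycle at which its result is available
    issued_at = {}  # op index -> cycle at which it issued
    queue = []      # dispatched but not yet issued, oldest first
    next_dispatch = 0
    cycle = 0
    while len(issued_at) < len(ops):
        for i in queue:
            dep = ops[i]
            if dep is None or done_at.get(dep, float("inf")) <= cycle:
                queue.remove(i)
                issued_at[i] = cycle
                done_at[i] = cycle + latency
                break  # at most one issue per cycle
        if next_dispatch < len(ops) and len(queue) < queue_size:
            queue.append(next_dispatch)
            next_dispatch += 1
        cycle += 1
    return issued_at[len(ops) - 1]


# Twelve ops in a dependent chain, followed by one independent op.
program = [None] + list(range(11)) + [None]
small_queue = issue_cycle_of_last(program, queue_size=8)
large_queue = issue_cycle_of_last(program, queue_size=64)
```

In this model the independent op issues later with the 8-entry queue than with the 64-entry one, because stalled dependent ops occupy every slot and block dispatch. That is the sense in which each dependent operation "eats up a queue slot".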

(Also, what is "small"? It's out of order so "2"? But it's not massively
out of order so maybe model it as in-order ("0")? We do still want to
consider register pressure though... ("1")?)

> Hmm. The implementation of inorder scheduling with the new machine model
> is pretty lame.

OK, this needs to be added. That's fair enough.

Cheers,

James
