Hello,

I am interested in writing an analysis pass that looks at the stride used for loads in a loop and passes that information down so that it can be used by the instruction scheduler. The reason is that if the load stride is greater than the cache line size, then I would expect the load to always miss the cache, and, as a result, the scheduler should use a much larger effective latency when scheduling the load and its dependencies. Cache-miss metadata might also be a good supplemental option. I can add methods to TLI that can convert the access stride information into effective latency information, but what is the best way to annotate the loads so that the information will be available to the SDNodes?

Has anyone tried something like this before?

A related issue is automatically adding prefetching to loops. The trick here is to accurately estimate the number of cycles the loop body will take to execute (so that you prefetch the correct amount ahead). This information is not really available until instruction scheduling, and so prefetch insertion cannot really complete until just before MC generation (the prefetch instructions can be scheduled, but their constant offset needs to be held free for a while). In addition, estimating the number of cycles also requires relatively accurate load/store latencies, and this, in turn, requires cache-miss latencies to be accounted for (which must then account for the prefetches).

If anyone has thoughts on these ideas, I would like to hear them.

Thanks again,
Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
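As a rough sketch of the TLI-style hook described above, a stride-to-effective-latency mapping might look like the following. The function name, cache-line size, and latency numbers are all illustrative assumptions, not values from any real target or existing LLVM API:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative parameters only; a real target would supply these.
constexpr uint64_t CacheLineSize = 64;  // bytes
constexpr unsigned L1HitLatency  = 3;   // cycles
constexpr unsigned MissLatency   = 40;  // cycles

// If the stride between successive loop-iteration loads exceeds the
// cache line size, no fetched line is reused by the next iteration,
// so model every access as a guaranteed miss and charge the full
// miss latency; otherwise assume the common L1-hit case.
unsigned effectiveLoadLatency(uint64_t StrideInBytes) {
  if (StrideInBytes > CacheLineSize)
    return MissLatency;
  return L1HitLatency;
}
```

A scheduler consuming this would simply substitute the returned value for the load's default latency when building its dependence graph.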
On Mar 2, 2012, at 9:01 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> I am interested in writing an analysis pass that looks at the stride
> used for loads in a loop and passes that information down so that it
> can be used by the instruction scheduler. [...] what is the best
> way to annotate the loads so that the information will be available
> to the SDNodes?

If you annotate loads with their expected latency, the upcoming MachineScheduler will be able to use the information. In the short term (next couple of months), you're free to hack the SDScheduler as well.

Although the scheduler can use the information, I don't think it can do much good with it when scheduling for mainstream targets.
It would be more interesting when scheduling for an in-order machine without a hardware prefetch unit.

An acyclic instruction scheduler can schedule for L1 and L2 latency at most, but the out-of-order engine should be able to compensate for these latencies.

L2 misses within a high-trip-count loop will benefit greatly from stride prefetching, but regular strides should already be handled in hardware.

So my suggestions are:

1) If you have an in-order machine, your workload actually fits in L2, and you care deeply about every stall cycle, it may be useful for the scheduler to distinguish between expected L1 vs. L2 latency. Try to issue multiple L1 misses in parallel or cover their latency. You can consider offsets from aligned objects in addition to induction variable strides.

2) If you have a machine without hardware prefetching, you really need to insert prefetches. This is a much bigger bang for the buck than scheduling for L1/L2 latency. To cover the latency of an L2 miss, which is what really matters, you need to prefetch many iterations ahead. Rather than trying to predict the number of cycles each iteration takes, you're better off prefetching as many iterations ahead as possible, up to the hardware's limit on outstanding loads. If the loop has a constant trip count, you can probably do something clever. Otherwise, I think branch profiling could help by telling you which loops have a very high trip count. I think prefetch insertion is more closely tied to loop unrolling than instruction scheduling.

-Andy
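Suggestion (2) above can be sketched as a small distance computation. This helper and its parameter names are hypothetical, assuming a known miss latency, a per-iteration cycle estimate, and a hardware limit on outstanding loads:

```cpp
#include <algorithm>

// How many iterations ahead to issue a prefetch: enough iterations to
// hide the L2 miss latency, but never more requests than the hardware
// can keep in flight at once.
unsigned prefetchDistance(unsigned MissLatencyCycles,
                          unsigned CyclesPerIteration,
                          unsigned MaxOutstandingLoads) {
  // Iterations needed to cover the miss latency, rounded up.
  unsigned Needed =
      (MissLatencyCycles + CyclesPerIteration - 1) / CyclesPerIteration;
  // Clamp to the hardware's outstanding-request limit.
  return std::min(Needed, MaxOutstandingLoads);
}
```

With a 200-cycle miss and a 30-cycle loop body, this prefetches 7 iterations ahead; shrinking the loop body pushes the distance up until the outstanding-load limit caps it, which matches Andy's point that the limit, not a cycle prediction, should drive the distance.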
On Fri, 02 Mar 2012 13:49:48 -0800
Andrew Trick <atrick at apple.com> wrote:

> On Mar 2, 2012, at 9:01 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> > [original message quoted above]

Andy,

Thank you for writing such a detailed response.

> If you annotate loads with their expected latency, the upcoming
> MachineScheduler will be able to use the information.
> In the short term (next couple of months), you're free to hack the
> SDScheduler as well.

Alright, sounds good. If I add metadata to the load, can I get to it through the Value * in the associated MachineMemOperand object?

> Although the scheduler can use the information, I don't think it can
> do much good with it scheduling for mainstream targets. It would be
> more interesting scheduling for an in-order machine without a
> hardware prefetch unit.
>
> An acyclic instruction scheduler can schedule for L1 and L2 latency
> at most. But the out-of-order engine should be able to compensate
> for these latencies.
>
> L2 misses within a high trip count loop will benefit greatly from
> stride prefetching. But regular strides should already be handled in
> hardware.

I agree. For the machine with which I'm working, the hardware prefetch unit only works if you access N consecutive cache lines. Any pattern that does not do that will need explicit prefetch instructions. Also, I need to be careful not to prefetch too much, because the request buffer is fairly small (it handles < 10 outstanding requests).

> So my suggestions are:
>
> 1) If you have an in-order machine, your workload actually fits in
> L2, and you care deeply about every stall cycle, it may be useful
> for the scheduler to distinguish between expected L1 vs. L2 latency.
> [...]

Issuing multiple L1 misses in parallel is exactly what I would like to do. Offsets from aligned objects are also a good idea.

> 2) If you have a machine without hardware prefetching, you really
> need to insert prefetches. This is a much bigger bang for the buck
> than scheduling for L1/L2 latency. To cover the latency of an L2
> miss, which is what really matters, you need to prefetch many
> iterations ahead.
> Rather than trying to predict the number of cycles each iteration
> takes, you're better off prefetching as many iterations ahead as
> possible, up to the hardware's limit on outstanding loads. [...] I
> think prefetch insertion is more closely tied to loop unrolling than
> instruction scheduling.

I'm afraid that "as many iterations ahead as possible" may turn out to be too few if I start at the very next iteration, because the request buffer is small. Nevertheless, it is certainly worth a try.

Thanks again,
Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
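Hal's concern about the small request buffer suggests clamping the prefetch distance by every limit at once: the latency-covering distance, the buffer's outstanding-request capacity, and (for a known trip count) the iterations actually remaining. This helper and its parameter names are purely illustrative, not part of any proposed LLVM interface:

```cpp
#include <algorithm>

// Take the most restrictive of the three limits discussed in the
// thread: the distance needed to hide the miss latency, the request
// buffer's capacity (< 10 slots on Hal's machine), and, when the trip
// count is a compile-time constant, the iterations left to prefetch.
unsigned clampedPrefetchDistance(unsigned DesiredAhead,
                                 unsigned RequestBufferSlots,
                                 unsigned RemainingIterations) {
  return std::min({DesiredAhead, RequestBufferSlots,
                   RemainingIterations});
}
```

The constant-trip-count clamp is what makes the "something clever" case concrete: near the end of the loop there is no point issuing prefetches past the final iteration.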