Hello,

I am interested in writing an analysis pass that looks at the stride used for loads in a loop and passes that information down so that it can be used by the instruction scheduler. The reason is that if the load stride is greater than the cache line size, then I would expect the load to always miss the cache, and, as a result, the scheduler should use a much larger effective latency when scheduling the load and its dependencies. Cache-miss metadata might also be a good supplemental option. I can add methods to TLI that can convert the access stride information into effective latency information, but what is the best way to annotate the loads so that the information will be available to the SDNodes?

Has anyone tried something like this before?

A related issue is automatically adding prefetching to loops. The trick here is to accurately estimate the number of cycles the loop body will take to execute (so that you prefetch the correct amount ahead). This information is not really available until instruction scheduling, and so prefetch insertion cannot really complete until just before MC generation (the prefetch instructions can be scheduled, but their constant offset needs to be held free for a while). In addition, estimating the number of cycles also requires relatively accurate load/store latencies, and this, in turn, requires cache-miss latencies to be accounted for (which must then account for the prefetches).

If anyone has thoughts on these ideas, I would like to hear them.

Thanks again,
Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
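As a rough sketch of the TLI-style hook described above, a stride-to-effective-latency mapping might look like the following. The function name, cache-line size, and latency numbers are all illustrative assumptions, not values from any real target or existing LLVM API:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative parameters only; a real target would supply these.
constexpr uint64_t CacheLineSize = 64;  // bytes
constexpr unsigned L1HitLatency  = 3;   // cycles
constexpr unsigned MissLatency   = 40;  // cycles

// If the stride between successive loop-iteration loads exceeds the
// cache line size, no fetched line is reused by the next iteration,
// so model every access as a guaranteed miss and charge the full
// miss latency; otherwise assume the common L1-hit case.
unsigned effectiveLoadLatency(uint64_t StrideInBytes) {
  if (StrideInBytes > CacheLineSize)
    return MissLatency;
  return L1HitLatency;
}
```

A scheduler consuming this would simply substitute the returned value for the load's default latency when building its dependence graph.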
On Mar 2, 2012, at 9:01 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> I am interested in writing an analysis pass that looks at the stride
> used for loads in a loop and passes that information down so that it
> can be used by the instruction scheduler. [...] what is the best
> way to annotate the loads so that the information will be available
> to the SDNodes?

If you annotate loads with their expected latency, the upcoming MachineScheduler will be able to use the information. In the short term (next couple of months), you're free to hack the SDScheduler as well.

Although the scheduler can use the information, I don't think it can do much good with it when scheduling for mainstream targets.
It would be more interesting when scheduling for an in-order machine without a hardware prefetch unit.

An acyclic instruction scheduler can schedule for L1 and L2 latency at most, but the out-of-order engine should be able to compensate for these latencies.

L2 misses within a high-trip-count loop will benefit greatly from stride prefetching, but regular strides should already be handled in hardware.

So my suggestions are:

1) If you have an in-order machine, your workload actually fits in L2, and you care deeply about every stall cycle, it may be useful for the scheduler to distinguish between expected L1 vs. L2 latency. Try to issue multiple L1 misses in parallel or cover their latency. You can consider offsets from aligned objects in addition to induction variable strides.

2) If you have a machine without hardware prefetching, you really need to insert prefetches. This is a much bigger bang for the buck than scheduling for L1/L2 latency. To cover the latency of an L2 miss, which is what really matters, you need to prefetch many iterations ahead. Rather than trying to predict the number of cycles each iteration takes, you're better off prefetching as many iterations ahead as possible, up to the hardware's limit on outstanding loads. If the loop has a constant trip count, you can probably do something clever. Otherwise, I think branch profiling could help by telling you which loops have a very high trip count. I think prefetch insertion is more closely tied to loop unrolling than instruction scheduling.

-Andy
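Suggestion (2) above can be sketched as a small distance computation. This helper and its parameter names are hypothetical, assuming a known miss latency, a per-iteration cycle estimate, and a hardware limit on outstanding loads:

```cpp
#include <algorithm>

// How many iterations ahead to issue a prefetch: enough iterations to
// hide the L2 miss latency, but never more requests than the hardware
// can keep in flight at once.
unsigned prefetchDistance(unsigned MissLatencyCycles,
                          unsigned CyclesPerIteration,
                          unsigned MaxOutstandingLoads) {
  // Iterations needed to cover the miss latency, rounded up.
  unsigned Needed =
      (MissLatencyCycles + CyclesPerIteration - 1) / CyclesPerIteration;
  // Clamp to the hardware's outstanding-request limit.
  return std::min(Needed, MaxOutstandingLoads);
}
```

With a 200-cycle miss and a 30-cycle loop body, this prefetches 7 iterations ahead; shrinking the loop body pushes the distance up until the outstanding-load limit caps it, which matches Andy's point that the limit, not a cycle prediction, should drive the distance.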
On Fri, 02 Mar 2012 13:49:48 -0800
Andrew Trick <atrick at apple.com> wrote:

> On Mar 2, 2012, at 9:01 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> > [original message quoted above]

Andy,

Thank you for writing such a detailed response.

> If you annotate loads with their expected latency, the upcoming
> MachineScheduler will be able to use the information.
> In the short term (next couple of months), you're free to hack the
> SDScheduler as well.

Alright, sounds good. If I add metadata to the load, can I get to it through the Value * in the associated MachineMemOperand object?

> Although the scheduler can use the information, I don't think it can
> do much good with it scheduling for mainstream targets. It would be
> more interesting scheduling for an in-order machine without a
> hardware prefetch unit.
>
> An acyclic instruction scheduler can schedule for L1 and L2 latency
> at most. But the out-of-order engine should be able to compensate
> for these latencies.
>
> L2 misses within a high trip count loop will benefit greatly from
> stride prefetching. But regular strides should already be handled in
> hardware.

I agree. For the machine with which I'm working, the hardware prefetch unit only works if you access N consecutive cache lines. Any pattern that does not do that will need explicit prefetch instructions. Also, I need to be careful not to prefetch too much, because the request buffer is fairly small (it handles < 10 outstanding requests).

> So my suggestions are:
>
> 1) If you have an in-order machine, your workload actually fits in
> L2, and you care deeply about every stall cycle, it may be useful
> for the scheduler to distinguish between expected L1 vs. L2 latency.
> [...]

Issuing multiple L1 misses in parallel is exactly what I would like to do. Offsets from aligned objects are also a good idea.

> 2) If you have a machine without hardware prefetching, you really
> need to insert prefetches. This is a much bigger bang for the buck
> than scheduling for L1/L2 latency. To cover the latency of an L2
> miss, which is what really matters, you need to prefetch many
> iterations ahead.
> Rather than trying to predict the number of cycles each iteration
> takes, you're better off prefetching as many iterations ahead as
> possible, up to the hardware's limit on outstanding loads. [...] I
> think prefetch insertion is more closely tied to loop unrolling than
> instruction scheduling.

I'm afraid that "as many iterations ahead as possible" may turn out to be too few if I start at the very next iteration, because the request buffer is small. Nevertheless, it is certainly worth a try.

Thanks again,
Hal

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
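Hal's concern about the small request buffer suggests clamping the prefetch distance by every limit at once: the latency-covering distance, the buffer's outstanding-request capacity, and (for a known trip count) the iterations actually remaining. This helper and its parameter names are purely illustrative, not part of any proposed LLVM interface:

```cpp
#include <algorithm>

// Take the most restrictive of the three limits discussed in the
// thread: the distance needed to hide the miss latency, the request
// buffer's capacity (< 10 slots on Hal's machine), and, when the trip
// count is a compile-time constant, the iterations left to prefetch.
unsigned clampedPrefetchDistance(unsigned DesiredAhead,
                                 unsigned RequestBufferSlots,
                                 unsigned RemainingIterations) {
  return std::min({DesiredAhead, RequestBufferSlots,
                   RemainingIterations});
}
```

The constant-trip-count clamp is what makes the "something clever" case concrete: near the end of the loop there is no point issuing prefetches past the final iteration.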