Adam Nemet via llvm-dev
2015-Dec-23 17:42 UTC
[llvm-dev] RFC: Extend PowerPC SW prefetching pass to other targets
Hi,

I’d like to add SW prefetching capability for our ARM64 micro-architectures. My immediate goal is to add support for constant large-strided accesses (>= 2KB), which are problematic for the HW prefetcher to handle.

The direct motivation is 433.milc in SPECfp 2006. The benchmark iterates through very small matrices and multiplies them with a vector. However, each matrix is part of a large structure, so the stride is large and, as a result, we miss in L1 on every new matrix.

My plan is to take Hal’s PowerPC prefetching pass[1] and make it available for other targets on an opt-in basis. Specifically, move the pass under lib/Transforms/Scalar and add a TTI interface to query the target parameters. To opt in, a target would have to provide: the stride threshold, the cache line size, and how many iterations ahead the prefetching should occur for a given loop.

For OOO architectures, the last of these is pretty hard to estimate. You pretty much have to compute the initiation interval (II) in the software pipelining sense. I think that I will just use the instruction count to estimate a resource-constrained II (ResII), possibly also checking that there are no recurrences in the loop other than the short ones for the induction variable.

This may err on the side of issuing the prefetches earlier than necessary, but hopefully not so early as to cause cache thrashing.

The current pass operates on LLVM IR, so besides having SCEVs to work with, we can also check recurrences across memory.

Please let me know if you have any comments.

Thanks,
Adam

[1] http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20150216/260805.html
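As a rough illustration of the heuristic described above, here is a minimal standalone sketch, not the pass itself. It assumes a hypothetical PrefetchParams struct standing in for whatever the TTI interface would return, and the issue width and miss latency fields are assumptions introduced for the example (the message only names the stride threshold, cache line size and iterations ahead; cache line size is left out here). ResII is approximated as the instruction count divided by the issue width, and the iterations-ahead value as the miss latency divided by ResII.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-target parameters. In the proposal these would be
// queried through TTI; the names and fields here are illustrative only.
struct PrefetchParams {
  int64_t MinStrideBytes;     // only strides at least this large qualify
  unsigned IssueWidth;        // assumed instructions issued per cycle
  unsigned MemLatencyCycles;  // assumed cost of an L1 miss, in cycles
};

// Resource-constrained II estimate: ceil(instruction count / issue width).
// Recurrences are ignored, so this is a lower bound on the real II.
inline unsigned estimateResII(unsigned LoopInstCount, const PrefetchParams &P) {
  assert(P.IssueWidth > 0 && "issue width must be positive");
  unsigned II = (LoopInstCount + P.IssueWidth - 1) / P.IssueWidth;
  return II == 0 ? 1 : II;
}

// Iterations ahead: ceil(miss latency / ResII), clamped to at least 1.
// Underestimating II makes this larger, i.e. prefetches are issued earlier.
inline unsigned iterationsAhead(unsigned LoopInstCount, const PrefetchParams &P) {
  unsigned II = estimateResII(LoopInstCount, P);
  unsigned Ahead = (P.MemLatencyCycles + II - 1) / II;
  return Ahead == 0 ? 1 : Ahead;
}

// A constant stride qualifies when its magnitude reaches the threshold.
inline bool isLargeStride(int64_t StrideBytes, const PrefetchParams &P) {
  int64_t Abs = StrideBytes < 0 ? -StrideBytes : StrideBytes;
  return Abs >= P.MinStrideBytes;
}

// Byte offset from the current access to the address worth prefetching.
inline int64_t prefetchOffsetBytes(int64_t StrideBytes, unsigned LoopInstCount,
                                   const PrefetchParams &P) {
  return StrideBytes * static_cast<int64_t>(iterationsAhead(LoopInstCount, P));
}
```

With made-up values (2KB threshold, issue width 4, 200-cycle miss latency), a 40-instruction loop gives a ResII of 10 and a distance of 20 iterations. Because ResII ignores recurrences, the real II can only be larger, which is why the estimate errs toward prefetching early.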
Hal Finkel via llvm-dev
2016-Jan-05 20:18 UTC
[llvm-dev] RFC: Extend PowerPC SW prefetching pass to other targets
Hi Adam,

Your plan makes perfect sense to me. As you point out, estimating how far ahead to prefetch is the difficult part, especially on OOO targets.

One thing you might want to consider: instead of using the 'user costs' from TTI, as is currently done (these are geared more toward generating costs for the inliner and unroller), use the costs we provide to the vectorizer. These costs should be more directly related to throughput, and could provide a better estimate of cycles per iteration.

-Hal

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
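To make the difference between the two kinds of cost concrete, here is a small standalone sketch. It does not use the real TTI API: the Op enum, both cost functions and every number in them are invented for illustration. The point is only that a size-like cost (every instruction counts as 1) and a throughput-like cost can give quite different cycles-per-iteration estimates, and therefore different prefetch distances.

```cpp
#include <cassert>
#include <vector>

// Hypothetical instruction classes; not LLVM opcodes.
enum class Op { Add, Load, FMul, FDiv };

// Size-like cost: every instruction counts as 1, as plain instruction
// counting would.
inline unsigned sizeCost(Op) { return 1; }

// Throughput-like cost: assumed reciprocal throughput in cycles.
// The numbers are invented for the example.
inline unsigned throughputCost(Op O) {
  switch (O) {
  case Op::Add:  return 1;
  case Op::Load: return 1;
  case Op::FMul: return 2;
  case Op::FDiv: return 10;
  }
  return 1;
}

// Sum a cost function over the loop body to estimate cycles per iteration.
template <typename CostFn>
unsigned cyclesPerIteration(const std::vector<Op> &Body, CostFn Cost) {
  unsigned Sum = 0;
  for (Op O : Body)
    Sum += Cost(O);
  return Sum == 0 ? 1 : Sum;
}

// Iterations ahead: ceil(latency / cycles per iteration), at least 1.
inline unsigned iterationsAheadFor(unsigned LatencyCycles, unsigned CyclesPerIter) {
  assert(CyclesPerIter > 0 && "cycles per iteration must be positive");
  unsigned Ahead = (LatencyCycles + CyclesPerIter - 1) / CyclesPerIter;
  return Ahead == 0 ? 1 : Ahead;
}
```

For a body of one load, one multiply, one divide and one add, the size-like cost gives 4 cycles per iteration and the throughput-like cost gives 14. With an assumed 200-cycle latency, that is 50 iterations ahead versus 15, so the size-like estimate would prefetch much earlier than needed in a divide-heavy loop.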