Sean Silva via llvm-dev
2016-Nov-27 11:23 UTC
[llvm-dev] A couple metrics of LLD/ELF's performance
These numbers were collected on Rafael's clang-fsds test case (however, I removed -O3 and --gc-sections) with a command like:
```
sudo perf record --event=cache-misses --call-graph=dwarf -- /home/sean/pg/llvm/release/bin/ld.lld @response.txt -o /tmp/t --no-threads
```
And then:
```
sudo perf report --no-children --sort dso,srcfile
```
One annoying thing about these numbers from perf is that they usually don't sum to 100%, so just treat them as relative to each other. Overall I'm not very happy with perf; I don't fully trust its output.
Also, keep in mind that clang-fsds doesn't have debug info, so the heavy string-handling costs don't show up in this profile.

--event=cycles
This is the perf default and correlates with overall runtime. One interesting thing this shows is that LLD is currently quite bottlenecked on the kernel.
https://reviews.llvm.org/P7944

The other metrics below are harder to improve. Improving them will require macro-scale optimizations to our data structures and I/O, so we should be aware of them so that we avoid settling into a local minimum of performance.

--event=cache-misses
I believe these are L2 misses. getOffset shows up here quite a bit.
One useful property of this metric is that L2 is core-private (my CPU is an i7-6700HQ, but this applies to all recent big Intel cores), so hits in it don't contend with other cores for the shared L3. Misses here are where cores start to feel each other's presence.
https://reviews.llvm.org/P7943

--event=LLC-load-misses
These are misses in the last-level cache (LLC), i.e. times that we have to go to DRAM (SLOOOW).
The getVA codepath shows up strongly, and we see the memcpy into the output. We may want to consider a nontemporal memcpy to at least avoid polluting the cache.
These misses contend on the DRAM bus (although it may currently be underutilized, so adding more parallelism will help keep it busy, but only up to a point).
https://reviews.llvm.org/P7947

--event=dTLB-load-misses
These are dTLB misses for loads (on my machine, this corresponds to any time the hardware page table walker kicks in: https://github.com/torvalds/linux/blob/f92b7604149a55cb601fc0b52911b1e11f0f2514/arch/x86/events/intel/core.c#L434).
Here we also see the getVA codepath (which is basically doing a random lookup into a huge hash table, so it will dTLB miss) and the memcpy into the output.
https://reviews.llvm.org/P7945

--event=minor-faults
This metric essentially shows where new pages of memory are touched and the kernel has to either allocate them or fix up the page tables.
Here the memcpy into the output is a huge part. There are also, obviously, lots of minor faults as malloc allocates memory from the kernel.
https://reviews.llvm.org/P7946

-- Sean Silva
Rui Ueyama via llvm-dev
2016-Nov-27 17:57 UTC
[llvm-dev] A couple metrics of LLD/ELF's performance
On Sun, Nov 27, 2016 at 3:23 AM, Sean Silva <chisophugis at gmail.com> wrote:

> --event=LLC-load-misses
> These are misses in the last-level cache (LLC), i.e. times that we have to go to DRAM (SLOOOW).
> The getVA codepath shows up strongly, and we see the memcpy into the output. We may want to consider a nontemporal memcpy to at least avoid polluting the cache.
> These misses contend on the DRAM bus (although it may currently be underutilized, so adding more parallelism will help keep it busy, but only up to a point).
> https://reviews.llvm.org/P7947

Will a nontemporal memcpy make any difference? After we memcpy an input section to an output section, we apply relocations to the output section, so we write to the same memory region twice.

Is there any easy way to experiment with that? I guess writing a well-optimized nontemporal memcpy for an experiment is not an easy task, so I wonder if there's already any code for that.
mats petersson via llvm-dev
2016-Nov-27 18:45 UTC
[llvm-dev] A couple metrics of LLD/ELF's performance
On 27 November 2016 at 17:57, Rui Ueyama via llvm-dev <llvm-dev at lists.llvm.org> wrote:

> Will a nontemporal memcpy make any difference? After we memcpy an input section to an output section, we apply relocations to the output section, so we write to the same memory region twice.
>
> Is there any easy way to experiment with that? I guess writing a well-optimized nontemporal memcpy for an experiment is not an easy task, so I wonder if there's already any code for that.

From past experience (and this is going back quite some time, but I have spent a little bit of time doing similar things since, and the results were pretty similar), non-temporal stores work well when the total size processed is bigger than the cache (L2 in this case, I'd say). If you memcpy a section that is bigger than the L2 cache size, then doing a non-temporal copy will most likely give better results, even if you continue to work on the same big lump of memory.

The code doesn't necessarily have to be highly optimised at this point. Something like:
```
if (size < cacheSize)
  memcpy(dest, src, size);
else
  memcpy_nontemporal(dest, src, size);
```
I think memcpy_nontemporal() can be written using intrinsics, without needing to be hugely optimised, since it will run at "memory speed" rather than "cache speed".
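To make that concrete, here is a minimal sketch of what such a memcpy_nontemporal() could look like using SSE2 streaming-store intrinsics. It is an illustration for experimentation, not LLD code; it assumes a 16-byte-aligned destination and a size that is a multiple of 16, and a real version would also need to handle the unaligned head and tail:

```cpp
#include <cstddef>
#include <emmintrin.h> // SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence

// Hypothetical helper, not LLD code. Copies with non-temporal (streaming)
// stores so the destination data bypasses the caches instead of evicting
// useful lines.
// Assumes: dest is 16-byte aligned and size is a multiple of 16.
void memcpy_nontemporal(void *dest, const void *src, size_t size) {
  char *d = static_cast<char *>(dest);
  const char *s = static_cast<const char *>(src);
  for (size_t i = 0; i < size; i += 16) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i *>(s + i));
    _mm_stream_si128(reinterpret_cast<__m128i *>(d + i), v);
  }
  _mm_sfence(); // order the streaming stores before any later reads/writes
}
```

Whether this actually helps LLD's write pattern (a copy followed by relocation of the same region, as Rui points out) would need to be measured.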
(If you know the alignment is right, you could use vector NT load/store instructions to get better throughput, but it's still largely limited by how fast the memory can deliver/receive the data.) At least as an experiment.

In an ideal world, you would do all the processing on small chunks at a time (something that easily fits in the L1 cache), then move to the next small chunk, instead of doing the work on a large block and then going over that whole thing again. I do realize this may not be viable for this particular project. I have only vague notions of how LLD and other linkers work; my main reason for entering this discussion was the idea of non-temporal memcpy.

-- Mats
Sean Silva via llvm-dev
2016-Nov-27 23:51 UTC
[llvm-dev] A couple metrics of LLD/ELF's performance
On Sun, Nov 27, 2016 at 9:57 AM, Rui Ueyama <ruiu at google.com> wrote:

> Will a nontemporal memcpy make any difference? After we memcpy an input section to an output section, we apply relocations to the output section, so we write to the same memory region twice.

You can do nontemporal loads from the input, at least. Also, like mats said, doing the final copy+relocate in blocks could allow using nontemporal stores more easily. For some sections (such as strings) we know we won't relocate them, so we can use nontemporal stores for those without any special handling.

-- Sean Silva
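A minimal sketch of the blocked copy+relocate idea above: each block is copied into a small scratch buffer, relocated while it is still hot in L1, and then streamed to the output with non-temporal stores. This is illustrative only, not LLD's actual code; copyAndRelocate(), relocateBlock(), and the 4 KiB block size are assumptions, and it assumes a 16-byte-aligned output and a size that is a multiple of 16:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>

constexpr size_t kBlockSize = 4096; // illustrative; roughly "fits in L1" sized

// Stand-in for LLD's relocation logic; a real linker would patch the
// relocations whose targets fall inside [sectionOff, sectionOff + len).
static void relocateBlock(uint8_t *buf, size_t len, size_t sectionOff) {
  (void)buf; (void)len; (void)sectionOff;
}

// Copy an input section to the output in cache-sized blocks.
// Assumes out is 16-byte aligned and size is a multiple of 16
// (a real version would handle the unaligned tail).
void copyAndRelocate(uint8_t *out, const uint8_t *in, size_t size) {
  alignas(16) uint8_t buf[kBlockSize];
  for (size_t off = 0; off < size; off += kBlockSize) {
    size_t len = std::min(kBlockSize, size - off);
    std::memcpy(buf, in + off, len); // block stays hot in L1
    relocateBlock(buf, len, off);    // apply relocations in-cache
    for (size_t i = 0; i < len; i += 16)
      _mm_stream_si128(reinterpret_cast<__m128i *>(out + off + i),
                       _mm_load_si128(reinterpret_cast<const __m128i *>(buf + i)));
  }
  _mm_sfence(); // make the streaming stores visible before the output is read
}
```

For sections that need no relocation (e.g. strings, as noted above), the intermediate buffer could be skipped and the input streamed to the output directly.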