Kostya Serebryany
2014-Apr-17 12:21 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Hi,

The current design of -fprofile-instr-generate has the same fundamental flaw as the old gcc's gcov instrumentation: it has contention on counters. A trivial synthetic test case was described here:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-October/066116.html

For the problem to appear we need a hot function that is simultaneously executed by multiple threads -- then we get high contention on the racy profile counters.

Such a situation is not necessarily frequent, but when it happens, -fprofile-instr-generate becomes barely usable due to a huge slowdown (5x-10x).

An example is the multi-threaded vp9 video encoder:

  git clone https://chromium.googlesource.com/webm/libvpx
  cd libvpx/
  F="-no-integrated-as -fprofile-instr-generate"; CC="clang $F" CXX="clang++ $F" LD="clang++ $F" ./configure
  make -j32
  # get sample video from https://media.xiph.org/video/derf/y4m/akiyo_cif.y4m
  time ./vpxenc -o /dev/null -j 8 akiyo_cif.y4m

When running single-threaded, -fprofile-instr-generate adds a reasonable ~15% overhead (8.5 vs 10 seconds). When running with 8 threads, it has 7x overhead (3.5 seconds vs 26 seconds).

I am not saying that this flaw is a showstopper, but with the continued move towards multithreading it will hurt more and more users of coverage and PGO. AFAICT, most of our PGO users simply cannot run their software in single-threaded mode, and some of them surely have hot functions running in all threads at once.

At the very least we should document this problem, but better to try fixing it. Some ideas:

- Per-thread counters. Solves the problem, at a huge cost in RAM per thread.
- 8-bit per-thread counters, dumping into central counters on overflow.
- Per-cpu counters (not portable, requires a very modern kernel with lots of patches).
- Sharded counters: each counter represented as N counters sitting in different cache lines. Every thread accesses the counter with index TID%N. Solves the problem partially, better with larger values of N, but then again it costs RAM.
- Reduce contention on hot counters by not incrementing them if they are big enough: {if (counter < 65536) counter++;}. This reduces the accuracy though. Is that bad for PGO?
- Self-cooling logarithmic counters: if ((fast_random() % (1 << counter)) == 0) counter++;

Other thoughts?

--kcc
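The last two ideas can be sketched in C. This is only an illustration: fast_random() here is a stand-in thread-local xorshift PRNG (the message does not specify one), and the 65536 threshold is the one from the text.

```c
#include <stdint.h>

/* Bounded counter: stop incrementing once the value is "big enough".
   Hot counters stop being written, so their cache lines can stay in the
   shared state and the contention disappears; the cost is accuracy. */
static inline void bounded_inc(uint64_t *counter) {
  if (*counter < 65536)
    (*counter)++;
}

/* Stand-in for fast_random(): a thread-local xorshift64 PRNG. */
static inline uint64_t fast_random(void) {
  static __thread uint64_t state = 88172645463325252ULL;
  state ^= state << 13;
  state ^= state >> 7;
  state ^= state << 17;
  return state;
}

/* Self-cooling logarithmic counter: the chance of an increment halves
   every time the counter grows, so the stored value approximates
   log2 of the true execution count. */
static inline void log_inc(uint8_t *counter) {
  if ((fast_random() % (1ULL << *counter)) == 0)
    (*counter)++;
}
```

Note that the logarithmic counter makes contention self-limiting: the hotter the counter, the more rarely it is actually written.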
Yaron Keren
2014-Apr-17 14:10 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
If accuracy is not critical, incrementing the counters without any guards might be good enough. Hot areas will still be hot and cold areas will not be affected.

Yaron

2014-04-17 15:21 GMT+03:00 Kostya Serebryany <kcc at google.com>:
> The current design of -fprofile-instr-generate has the same fundamental
> flaw as the old gcc's gcov instrumentation: it has contention on counters.
> [...]
> Other thoughts?
>
> --kcc
Kostya Serebryany
2014-Apr-17 14:13 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Thu, Apr 17, 2014 at 6:10 PM, Yaron Keren <yaron.keren at gmail.com> wrote:
> If accuracy is not critical, incrementing the counters without any guards
> might be good enough. Hot areas will still be hot and cold areas will not
> be affected.

No. Contention on the counters leads to 5x-10x slowdown. This is never good enough.

--kcc
Xinliang David Li
2014-Apr-17 17:38 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Thu, Apr 17, 2014 at 5:21 AM, Kostya Serebryany <kcc at google.com> wrote:
> Such situation is not necessary very frequent, but when it happens
> -fprofile-instr-generate becomes barely usable due to huge slowdown
> (5x-10x)

We have seen even larger slowdowns, but it is uncommon, and I have not heard many complaints about it.

> At the very least we should document this problem, but better try fixing
> it.

Users can also select a smaller (but still representative) training data set to sidestep the problem.

> - per-thread counters. Solves the problem at huge cost in RAM per-thread

It is not practical. Especially for TLS counters -- it creates huge pressure on stack memory.

> - 8-bit per-thread counters, dumping into central counters on overflow.

The overflow will happen very quickly with an 8-bit counter.

> - sharded counters: each counter represented as N counters sitting in
> different cache lines. Every thread accesses the counter with index TID%N.

Interesting idea. This may work well.

> - reduce contention on hot counters by not incrementing them if they are
> big enough: {if (counter < 65536) counter++}; This reduces the accuracy
> though. Is that bad for PGO?

Yes, it will be bad.

Another idea is to use stack-local counters per function -- synced up with global counters on entry and exit. The problem with it is that for deeply recursive calls, the stack pressure can be too high.

In Google GCC, we implemented another technique which proved to be very effective -- it is called FDO sampling. Basically, counters are only updated every N samples.

David
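A minimal sketch of the sampling idea David describes (the names and the period are invented for illustration; the actual Google GCC implementation is not shown here): each thread decrements a private countdown and only touches the shared counter once per period, cutting write traffic -- and thus contention -- by that factor.

```c
#include <stdint.h>

/* Hypothetical sketch of sampled profile updates. */
enum { SAMPLE_PERIOD = 128 };

static __thread uint32_t sample_countdown = SAMPLE_PERIOD;

static inline void sampled_inc(uint64_t *counter) {
  if (--sample_countdown == 0) {
    sample_countdown = SAMPLE_PERIOD;
    *counter += SAMPLE_PERIOD;  /* scale up to approximate the true count */
  }
}
```

The shared counter is written 128x less often, at the cost of quantizing counts to multiples of the period.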
Duncan P. N. Exon Smith
2014-Apr-17 17:58 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On 2014-Apr-17, at 10:38, Xinliang David Li <xinliangli at gmail.com> wrote:
> Another idea is to use stack local counters per function -- synced up with
> global counters on entry and exit. the problem with it is for deeply
> recursive calls, stack pressure can be too high.

I think they'd need to be synced with global counters before function calls as well, since any function call can call "exit()".
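The stack-local-counter pattern David proposed, with the pre-call flush Duncan suggests, might look roughly like this (a hand-written illustration of what the compiler would emit; all names are invented):

```c
#include <stdint.h>

uint64_t global_counters[2];          /* shared, racy counters */

static void callee(void) { /* in real code this might call exit() */ }

/* Illustration of instrumentation with stack-local counters that are
   flushed to the globals around calls. */
void some_function(void) {
  uint64_t local[2] = {0, 0};         /* per-invocation, contention-free */
  local[0]++;                          /* entry block executed */

  /* Flush before the call: the callee may never return (it could call
     exit()), and counts still sitting in locals would then be lost. */
  global_counters[0] += local[0];
  global_counters[1] += local[1];
  local[0] = local[1] = 0;
  callee();

  local[1]++;                          /* exit block executed */
  global_counters[0] += local[0];      /* flush again on return */
  global_counters[1] += local[1];
}
```

The flush-at-every-call requirement erodes much of the benefit for call-heavy code, which is part of why the scheme is debatable.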
Chandler Carruth
2014-Apr-17 20:06 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Having thought a bit about the best strategy to solve this, I think we should trade memory to reduce contention. I don't really like any of the other options as much, if we can get that one to work. Here is my specific suggestion:

On Thu, Apr 17, 2014 at 5:21 AM, Kostya Serebryany <kcc at google.com> wrote:
> - per-cpu counters (not portable, requires very modern kernel with lots of
> patches)
> - sharded counters: each counter represented as N counters sitting in
> different cache lines. Every thread accesses the counter with index TID%N.
> Solves the problem partially, better with larger values of N, but then
> again it costs RAM.

I think we should combine these somewhat.

At an abstract level, I think we should design the profiling to support up to N shards of counters.

I think we should have a dynamic number of shards if possible. The goal here is that if you never need multiple shards (single threaded) you pay essentially zero cost. I would have a global number of shards that changes rarely, and re-compute it on entry to each function with something along the lines of:

  if (thread-ID != main's thread-ID && shard_count == 1) {
    shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
    // if shard_count changed with this, we can also call a library routine
    // here that does the work of allocating the actual extra shards.
  }

MAX is a fixed cap so even on systems with 100s of cores we don't do something silly. NUMBER_OF_THREADS, if supported on the OS, can limit the shards when we only have a small number of threads in the program. NUMBER_OF_CORES, if supported on the OS, can limit the shards. If we don't have the number of threads, we just use the number of cores. If we don't have the number of cores, we can just guess 8 (or something).

Then, we can gracefully fall back on the following strategies to pick an index into the shards:

- Per-core non-atomic counter updates (if we support them) using restartable sequences
- Use a core-ID as the index into the shards to statistically minimize the contention, and do the increments atomically so it remains correct even if the core-ID load happens before a core migration and the counter increment occurs afterward
- Use (thread-ID % number of cores) if we don't have support for getting a core-ID from the target OS. This will still have a reasonable distribution I suspect, even if not perfect.

Finally, I wouldn't merge on shutdown if possible. I would just write larger raw profile data for multithreaded runs, and let the profdata tool merge them.

If this is still too much memory, then I would suggest doing the above, but doing it independently for each function so that only those functions actually called via multithreaded code end up sharding their counters.

I think this would be reasonably straightforward to implement, not significantly grow the cost of single-threaded instrumentation, and largely mitigate the contention on the counters. It can benefit from advanced hooks into the OS when those are available, but seems pretty easy to implement on roughly any OS with reasonable tradeoffs. The only really hard requirement is the ability to get a thread-id, but I think that is fairly reasonable (C++ even makes this essentially mandatory).

Thoughts?
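A rough C sketch of the scheme described above, under stated assumptions: all names are illustrative, sysconf(_SC_NPROCESSORS_ONLN) stands in for the "number of cores, if supported" query, the thread-count query is omitted, and the thread-ID fallback relies on pthread_t being an integral type (true on Linux/glibc, not guaranteed by POSIX).

```c
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* shard_count starts at 1 and only grows the first time a non-main
   thread enters instrumented code, so a single-threaded run never
   pays for sharding. */
enum { MAX_SHARDS = 8 };             /* fixed cap ("MAX" in the text) */

static int shard_count = 1;
static pthread_t main_thread;

static void profile_init(void) { main_thread = pthread_self(); }

static void enter_function(void) {
  if (!pthread_equal(pthread_self(), main_thread) && shard_count == 1) {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);  /* "NUMBER_OF_CORES" */
    if (ncores < 1)
      ncores = 8;                    /* no core count available: guess */
    shard_count = ncores < MAX_SHARDS ? (int)ncores : MAX_SHARDS;
    /* a real runtime would allocate the extra shards here */
  }
}

/* Portable fallback for picking a shard: thread-ID modulo shard count
   (a core ID, where the OS exposes one, would distribute better). */
static inline unsigned shard_index(void) {
  return (unsigned)((uintptr_t)pthread_self() % (unsigned)shard_count);
}
```

Counter i for shard s would then live at counters[s * counters_per_shard + i], with shards padded out to cache-line boundaries.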
Justin Bogner
2014-Apr-17 20:27 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Chandler Carruth <chandlerc at google.com> writes:
> I think we should have a dynamic number of shards if possible. The goal
> here is that if you never need multiple shards (single threaded) you pay
> essentially zero cost. I would have a global number of shards that changes
> rarely, and re-compute it on entry to each function with something along
> the lines of:
>
>   if (thread-ID != main's thread-ID && shard_count == 1) {
>     shard_count = std::min(MAX, std::max(NUMBER_OF_THREADS, NUMBER_OF_CORES));
>     // if shard_count changed with this, we can also call a library routine
>     // here that does the work of allocating the actual extra shards.
>   }

Is it possible to hook on something more clever than function entry? Choosing to do this based on thread creation or something like that could make this extra check very cheap.

> MAX is a fixed cap so even on systems with 100s of cores we don't do
> something silly. [...]

I like the general idea of this approach. The obvious TLS approach worried me in that it explodes memory usage by much more than you'll generally need.

> Finally, I wouldn't merge on shutdown if possible. I would just write
> larger raw profile data for multithreaded runs, and let the profdata tool
> merge them.

Agreed. Doing the merging offline is a clear win. The space overhead is short-lived enough that it shouldn't be a big issue.

> If this is still too much memory, then I would suggest doing the above,
> but doing it independently for each function so that only those functions
> actually called via multithreaded code end up sharding their counters.

I'd worry that doing this per-function might be too fine a granularity, but there are certainly use cases where large chunks of the code are only exercised by one (of many) threads, so something along these lines could be worthwhile.

> I think this would be reasonably straightforward to implement, not
> significantly grow the cost of single-threaded instrumentation, and largely
> mitigate the contention on the counters. [...]
> Thoughts?
Xinliang David Li
2014-Apr-17 20:51 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
Good thinking, but why do you think runtime selection of the shard count is better than compile-time selection?

For single-threaded apps, the shard count is always 1, so why pay the penalty of checking the thread ID each time a function is entered? For multi-threaded apps, I would expect MAX to be smaller than NUM_OF_CORES to avoid excessive memory consumption, so you always end up with N == MAX. If MAX is larger than NUM_OF_CORES, then for large MT apps the number of threads tends to be larger than NUM_OF_CORES, so it also ends up with N == MAX. For the rare remaining cases, the shard count may switch between MAX and NUM_OF_CORES, but then you also pay the penalty of reallocating and copying the counter arrays each time it changes.

Making N a non-compile-time constant also makes the indexing more expensive. Of course we can ignore thread migration and do CSE on it.

David

On Thu, Apr 17, 2014 at 1:06 PM, Chandler Carruth <chandlerc at google.com> wrote:
> I think we should have a dynamic number of shards if possible. The goal
> here is that if you never need multiple shards (single threaded) you pay
> essentially zero cost.
> [...]
> Thoughts?
Joerg Sonnenberger
2014-Apr-18 01:24 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Thu, Apr 17, 2014 at 04:21:46PM +0400, Kostya Serebryany wrote:
> - sharded counters: each counter represented as N counters sitting in
> different cache lines. Every thread accesses the counter with index TID%N.
> Solves the problem partially, better with larger values of N, but then
> again it costs RAM.

I'd strongly go with this scheme, with one tweak: use the stack pointer as the base value, with some fudging to not just use the lowest bits. It is typically easier to get than a thread ID.

Joerg
Kostya Serebryany
2014-Apr-18 07:25 UTC
[LLVMdev] multithreaded performance disaster with -fprofile-instr-generate (contention on profile counters)
On Fri, Apr 18, 2014 at 5:24 AM, Joerg Sonnenberger <joerg at britannica.bec.de> wrote:
> I'd strongly go with this schema with one tweak. Use the stack pointer
> as base value with some fudging to not just use the lowest bits. It is
> typically easier to get.

Indeed, several middle bits of %sp may be used instead of TID%N. This would heavily depend on the pthread implementation (how it allocates stacks) though. It may be tricky to come up with the same constant scheme across all platforms.
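A sketch of what the stack-pointer trick might look like. The shift and shard count here are invented constants that, per the caveat about pthread stack allocation, would need per-platform tuning.

```c
#include <stdint.h>

enum { NUM_SHARDS = 8 };             /* must be a power of two */

/* Derive a shard index from the stack pointer. Thread stacks are
   typically spaced megabytes apart, so a few middle bits of a stack
   address differ between threads, while the low bits (position within
   a frame) carry little information. */
static inline unsigned shard_index(void) {
  char marker;                        /* its address approximates %sp */
  uintptr_t sp = (uintptr_t)&marker;
  return (unsigned)((sp >> 20) & (NUM_SHARDS - 1));
}
```

Unlike a TID lookup, this costs only one address computation, but two threads whose stacks happen to share those middle bits would still collide on a shard.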