Kostya Serebryany
2013-Nov-20 05:01 UTC
[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
On Tue, Nov 19, 2013 at 10:20 PM, David Chisnall <David.Chisnall at cl.cam.ac.uk> wrote:

> On 19 Nov 2013, at 17:58, Kostya Serebryany <kcc at google.com> wrote:
> > On Tue, Nov 19, 2013 at 9:56 PM, Kuperstein, Michael M <michael.m.kuperstein at intel.com> wrote:
> >> What I’m trying to say is that according to my understanding of the
> >> C++11 memory model, even in that small reproducer, the store to g and
> >> the load from g are in fact a data race.
> >>
> >> (This is regardless of the fact the load is protected by a branch that
> >> is not taken.)
> >
> > My understanding of the standard is quite the opposite.
>
> I agree. It is a potential data race: if the branch is taken then it
> definitely is a race, because the standard explicitly does not define an
> ordering on loads and stores of non-atomic variables unless a
> happens-before relationship is established by some other operation (which
> does not occur in this example). In the case where the branch is not
> taken, there is no data race, because in the C++ abstract machine there
> is no load, whether or not there is one in the underlying concrete machine.
>
> I believe that this transform is valid, because there are only two
> possible cases here:
>
> - Some other code has established a happens-before relationship between
> the load and the stores, giving a well-defined ordering; however, this
> code is not visible in the function foo(), and so the load may happen
> anywhere.
>
> - No other code establishes a happens-before relationship, and therefore
> the order of the load and the store is completely undefined; as long as
> the load doesn't trap, it is completely safe to hoist it.
>
> HOWEVER, I would dispute describing this transform as an optimisation in
> this case.

You have a good point: evaluating such a transformation on single-threaded
benchmarks (e.g. SPEC) makes no sense if we care about multi-threaded code.
And evaluating the performance of anything like this on multi-threaded code
is hard too.
And it's very easy to prepare a synthetic test case where this optimization
causes a 4x slowdown (see below).
But if we disable this transformation someone will surely come and complain :)
Another data point: gcc 4.6.3 does this too.

==> if-bench.cc <==
#include <pthread.h>
#include <unistd.h>

extern int foo(int);
extern void bar(int);

#ifndef N
#define N (1UL << 30)
#endif

void *Thread(void *p) {
  for (long i = 0; i < N; i++)
    foo(0);
  return 0;
}

int main() {
  pthread_t t[3];
  pthread_create(&t[0], 0, Thread, 0);
  pthread_create(&t[1], 0, Thread, 0);
  pthread_create(&t[2], 0, Thread, 0);
  for (long i = 0; i < N; i++)
    bar(i);
  pthread_join(t[0], 0);
  pthread_join(t[1], 0);
  pthread_join(t[2], 0);
}

==> g.cc <==
int g;

int foo(int cond) {
  if (cond)
    return g;
  // If we replace 'g' with g*g*g*g, the benchmark becomes 4x faster.
  return 0;
}

void bar(int i) { g = i; }

% clang if-bench.cc g.cc -lpthread -O1 && time ./a.out

--kcc

> If the two threads are running on different cores, then we have likely
> introduced some false sharing and have forced the two cores to exchange
> some bus traffic for cache coherency, even though it is completely valid
> in the C++ abstract machine model for the load of g to return a stale
> cached value. This is only a real problem on strongly ordered
> architectures (e.g. x86), but given the relative cost of a cache
> shootdown and everything else in this test case (with the exception of
> the thread creation), I wouldn't be surprised if it ended up slowing
> things down. Especially given that a modern superscalar CPU will
> speculatively execute the load ANYWAY if it can do so from cache, and if
> it can't, then the performance improvement from doing it before the
> branch will likely be negligible.
>
> For single-core, in-order, single-issue architectures, or multicore,
> weakly ordered, in-order, single-issue architectures, it's probably a
> win...
>
> David
Kostya Serebryany
2013-Nov-20 11:04 UTC
[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
FTR: http://llvm-reviews.chandlerc.com/D2227 sent for review to fix the
tsan-hostile load speculation.

--kcc

On Wed, Nov 20, 2013 at 9:01 AM, Kostya Serebryany <kcc at google.com> wrote:
> [...]
Hal Finkel
2013-Nov-21 07:14 UTC
[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
----- Original Message -----
> From: "Kostya Serebryany" <kcc at google.com>
> To: "David Chisnall" <David.Chisnall at cl.cam.ac.uk>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> Sent: Tuesday, November 19, 2013 11:01:12 PM
> Subject: Re: [LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
>
> [...]
>
> You have a good point: evaluating such a transformation on single-threaded
> benchmarks (e.g. SPEC) makes no sense if we care about multi-threaded
> code. And evaluating the performance of anything like this on
> multi-threaded code is hard too.
> And it's very easy to prepare a synthetic test case where this
> optimization will cause a 4x slowdown (see below).

Out of curiosity, can you explain why? Is this cache-line bouncing? Or just an increase in overall memory-bandwidth use?

 -Hal

> [...]

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
Kostya Serebryany
2013-Nov-21 07:38 UTC
[LLVMdev] Curiosity about transform changes under Sanitizers (Was: [PATCH] Disable branch folding with MemorySanitizer)
On Thu, Nov 21, 2013 at 11:14 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> [...]
> > And it's very easy to prepare a synthetic test case where this
> > optimization will cause a 4x slowdown (see below).
>
> Out of curiosity, can you explain why? Is this cache-line bouncing? Or
> just an increase in overall memory-bandwidth use?

Exactly, this is cache-line bouncing.

> [...]
> -Hal