I think gcc is right. It inserted a branch for n == 0 (the cbz at the top), so that's not a problem. In all other regards, this is safe: if you examine the sequence of loads and stores, it eliminated all but the first load and all but the last store. How's that unsafe? If I had to guess, the bug here is that LLVM doesn't want to hoist the load over the condition (which it is right to want to avoid) but it fails to realize that the condition is true for the first iteration. -Filip> On Sep 2, 2014, at 4:23 PM, Robin Morisset <morisset at google.com> wrote: > > The problem here is that doing this can introduce a race which is undefined at the IR level. > In the example you gave above I suspect that this is a bug in GCC. I may have missed something in the assembly, but it appears that gcc loads a copy of globalvar in w3, does stuff with w3 and then stores w3 in globalvar. This is unsound in the case where the loop is run with n = 0. > For an example, if you have a thread run foo(0,0) in a loop (so not doing anything) and another thread doing: > for (int i = 0 ; i <1000000 ; ++i) { > globalvar = i; > assert(globalvar == i); > } > the code should never assert. But if you host the load/store out of the loop (as GCC appears to do), then occasionaly there will be something like > thread 2: globalvar = i (= 42) > thread 1: w3 = globalvar (= 42) > thread 2: ++i > thread 2: globalvar = i (= 43) > thread 1: globalvar =w3 (= 42) > thread 2: assert(globalvar == i) > And it will assert (42 == 43) and crash. > > Shorter answer: > - speculatively executing stores is unsound because the value may have been modified behind your back by another thread. > - speculatively executing loads might not be observable in some specific case but may always introduce races, thus introducing undefined behaviour and breaking assumptions that other passes may rely on. > > Best regards, > Robin > > >> On Tue, Sep 2, 2014 at 3:46 PM, Balaram Makam <bmakam at codeaurora.org> wrote: >> All, >> >> If we can speculatively execute a load instruction, why isn’t it safe to hoist it out by promoting it to a scalar in LICM pass? >> >> >> There is a comment in LICM pass that if a load/store is conditional then it is not safe because it would break the LLVM concurrency model (See commit 73bfa4a). >> It has an IR test for checking this in test/Transforms/LICM/scalar-promote-memmodel.ll >> >> However, I have a sample code where GCC is able to promote the memory to scalar and hoist/sink load/store out of loop but LLVM cannot. >> Is GCC being aggressive here or LLVM missing out an opportunity? >> >> Here is my sample code: >> >> extern int globalvar; >> void foo(int n , int incr) { >> unsigned int i; >> for (i = 0 ; i < n; i += incr ) { >> if (i < n/2) >> globalvar += incr; >> } >> return; >> } >> >> GCC output: >> >> $ aarch64-linux-gnu-g++ -S -o - -O3 -ffast-math -march=armv8-a+simd test.cpp >> .arch armv8-a+fp+simd >> .file "test.cpp" >> .text >> .align 2 >> .global _Z3fooii >> .type _Z3fooii, %function >> _Z3fooii: >> .LFB0: >> .cfi_startproc >> cbz w0, .L1 >> adrp x6, globalvar >> add w5, w0, w0, lsr 31 >> ldr w3, [x6,#:lo12:globalvar] <== hoist load of globalvar >> mov w2, 0 >> asr w5, w5, 1 >> .L4: >> cmp w5, w2 >> add w2, w2, w1 >> add w4, w3, w1 >> csel w3, w4, w3, hi >> cmp w2, w0 >> bcc .L4 >> str w3, [x6,#:lo12:globalvar] <== sink store of globalvar >> .L1: >> ret >> .cfi_endproc >> .LFE0: >> .size _Z3fooii, .-_Z3fooii >> .ident "GCC: (crosstool-NG linaro-1.13.1-4.8-2014.01 - Linaro GCC 2013.11) 4.9.0" >> >> LLVM output: >> >> $ clang-aarch64-x++ -S -o - -O3 -ffast-math -fslp-vectorize test.cpp >> .text >> .file "test.cpp" >> .globl _Z3fooii >> .align 2 >> .type _Z3fooii, at function >> _Z3fooii: // @_Z3fooii >> // BB#0: // %entry >> cbz w0, .LBB0_5 >> // BB#1: // %for.body.lr.ph >> mov w8, wzr >> cmp w0, #0 // =0 >> cinc w9, w0, lt >> asr w9, w9, #1 >> adrp x10, globalvar >> .LBB0_2: // %for.body >> // =>This Inner Loop Header: Depth=1 >> cmp w8, w9 >> b.hs .LBB0_4 >> // BB#3: // %if.then >> // in Loop: Header=BB0_2 Depth=1 >> ldr w11, [x10, :lo12:globalvar] <===== load inside loop >> add w11, w11, w1 >> str w11, [x10, :lo12:globalvar] <==== store inside loop >> .LBB0_4: // %for.inc >> // in Loop: Header=BB0_2 Depth=1 >> add w8, w8, w1 >> cmp w8, w0 >> b.lo .LBB0_2 >> .LBB0_5: // %for.end >> ret >> .Ltmp1: >> .size _Z3fooii, .Ltmp1-_Z3fooii >> >> >> .ident "clang version 3.6.0 " >> >> Thanks, >> Balaram >> >> >> _______________________________________________ >> LLVM Developers mailing list >> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140902/114f2eaa/attachment.html>
Ah right, I had missed the cbz, my bad. It is indeed sound under that condition. My point was that just hoisting/sinking unconditionally the memory accesses is unsound. It is good news that gcc learnt how to do it carefully :) On Tue, Sep 2, 2014 at 4:36 PM, Filip Pizlo <fpizlo at apple.com> wrote:> I think gcc is right. > > It inserted a branch for n == 0 (the cbz at the top), so that's not a > problem. > > In all other regards, this is safe: if you examine the sequence of loads > and stores, it eliminated all but the first load and all but the last > store. How's that unsafe? > > If I had to guess, the bug here is that LLVM doesn't want to hoist the > load over the condition (which it is right to want to avoid) but it fails > to realize that the condition is true for the first iteration. > > -Filip > > On Sep 2, 2014, at 4:23 PM, Robin Morisset <morisset at google.com> wrote: > > The problem here is that doing this can introduce a race which is > undefined at the IR level. > In the example you gave above I suspect that this is a bug in GCC. I may > have missed something in the assembly, but it appears that gcc loads a copy > of globalvar in w3, does stuff with w3 and then stores w3 in globalvar. > This is unsound in the case where the loop is run with n = 0. > For an example, if you have a thread run foo(0,0) in a loop (so not doing > anything) and another thread doing: > for (int i = 0 ; i <1000000 ; ++i) { > globalvar = i; > assert(globalvar == i); > } > the code should never assert. But if you host the load/store out of the > loop (as GCC appears to do), then occasionaly there will be something like > thread 2: globalvar = i (= 42) > thread 1: w3 = globalvar (= 42) > thread 2: ++i > thread 2: globalvar = i (= 43) > thread 1: globalvar =w3 (= 42) > thread 2: assert(globalvar == i) > And it will assert (42 == 43) and crash. > > Shorter answer: > - speculatively executing stores is unsound because the value may have > been modified behind your back by another thread. > - speculatively executing loads might not be observable in some specific > case but may always introduce races, thus introducing undefined behaviour > and breaking assumptions that other passes may rely on. > > Best regards, > Robin > > > On Tue, Sep 2, 2014 at 3:46 PM, Balaram Makam <bmakam at codeaurora.org> > wrote: > >> All, >> >> If we can speculatively execute a load instruction, why isn’t it safe to >> hoist it out by promoting it to a scalar in LICM pass? >> >> >> There is a comment in LICM pass that if a load/store is conditional then >> it is not safe because it would break the LLVM concurrency model (See >> commit 73bfa4a). >> It has an IR test for checking this in >> test/Transforms/LICM/scalar-promote-memmodel.ll >> >> However, I have a sample code where GCC is able to promote the memory to >> scalar and hoist/sink load/store out of loop but LLVM cannot. >> Is GCC being aggressive here or LLVM missing out an opportunity? >> >> Here is my sample code: >> >> extern int globalvar; >> void foo(int n , int incr) { >> unsigned int i; >> for (i = 0 ; i < n; i += incr ) { >> if (i < n/2) >> globalvar += incr; >> } >> return; >> } >> >> GCC output: >> >> $ aarch64-linux-gnu-g++ -S -o - -O3 -ffast-math -march=armv8-a+simd >> test.cpp >> .arch armv8-a+fp+simd >> .file "test.cpp" >> .text >> .align 2 >> .global _Z3fooii >> .type _Z3fooii, %function >> _Z3fooii: >> .LFB0: >> .cfi_startproc >> cbz w0, .L1 >> adrp x6, globalvar >> add w5, w0, w0, lsr 31 >> ldr w3, [x6,#:lo12:globalvar] <=>> hoist load of globalvar >> mov w2, 0 >> asr w5, w5, 1 >> .L4: >> cmp w5, w2 >> add w2, w2, w1 >> add w4, w3, w1 >> csel w3, w4, w3, hi >> cmp w2, w0 >> bcc .L4 >> str w3, [x6,#:lo12:globalvar] <== sink >> store of globalvar >> .L1: >> ret >> .cfi_endproc >> .LFE0: >> .size _Z3fooii, .-_Z3fooii >> .ident "GCC: (crosstool-NG linaro-1.13.1-4.8-2014.01 - Linaro >> GCC 2013.11) 4.9.0" >> >> LLVM output: >> >> $ clang-aarch64-x++ -S -o - -O3 -ffast-math -fslp-vectorize test.cpp >> .text >> .file "test.cpp" >> .globl _Z3fooii >> .align 2 >> .type _Z3fooii, at function >> _Z3fooii: // @_Z3fooii >> // BB#0: // %entry >> cbz w0, .LBB0_5 >> // BB#1: // %for.body.lr.ph >> mov w8, wzr >> cmp w0, #0 // =0 >> cinc w9, w0, lt >> asr w9, w9, #1 >> adrp x10, globalvar >> .LBB0_2: // %for.body >> // =>This Inner Loop Header: >> Depth=1 >> cmp w8, w9 >> b.hs .LBB0_4 >> // BB#3: // %if.then >> // in Loop: Header=BB0_2 Depth=1 >> ldr w11, [x10, :lo12:globalvar] <====>> load inside loop >> add w11, w11, w1 >> str w11, [x10, :lo12:globalvar] <===>> store inside loop >> .LBB0_4: // %for.inc >> // in Loop: Header=BB0_2 Depth=1 >> add w8, w8, w1 >> cmp w8, w0 >> b.lo .LBB0_2 >> .LBB0_5: // %for.end >> ret >> .Ltmp1: >> .size _Z3fooii, .Ltmp1-_Z3fooii >> >> >> .ident "clang version 3.6.0 " >> >> Thanks, >> Balaram >> >> >> _______________________________________________ >> LLVM Developers mailing list >> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu >> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev >> > > _______________________________________________ > LLVM Developers mailing list > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev > >-------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140902/a6bb249b/attachment.html>
Thanks for the background on the concurrent memory model. So, is it sufficient that the loop entry is guarded by condition (cbz at top) for preventing the race? The loop entry will be guarded by condition if loop has been rotated by loop rotate pass. Since LICM runs after loop rotate, we can use ScalarEvolution::isLoopEntryGuardedByCond to check if we can speculatively execute load without causing a race. Is it heavy-handed solution for this problem? I can create a bug if you would like to. Thanks, Balaram From: Robin Morisset [mailto:morisset at google.com] Sent: Tuesday, September 02, 2014 7:40 PM To: Filip Pizlo Cc: Balaram Makam; LLVM Developers Mailing List Subject: Re: [LLVMdev] LICM promoting memory to scalar Ah right, I had missed the cbz, my bad. It is indeed sound under that condition. My point was that just hoisting/sinking unconditionally the memory accesses is unsound. It is good news that gcc learnt how to do it carefully :) On Tue, Sep 2, 2014 at 4:36 PM, Filip Pizlo <fpizlo at apple.com> wrote: I think gcc is right. It inserted a branch for n == 0 (the cbz at the top), so that's not a problem. In all other regards, this is safe: if you examine the sequence of loads and stores, it eliminated all but the first load and all but the last store. How's that unsafe? If I had to guess, the bug here is that LLVM doesn't want to hoist the load over the condition (which it is right to want to avoid) but it fails to realize that the condition is true for the first iteration. -Filip On Sep 2, 2014, at 4:23 PM, Robin Morisset <morisset at google.com> wrote: The problem here is that doing this can introduce a race which is undefined at the IR level. In the example you gave above I suspect that this is a bug in GCC. I may have missed something in the assembly, but it appears that gcc loads a copy of globalvar in w3, does stuff with w3 and then stores w3 in globalvar. This is unsound in the case where the loop is run with n = 0. For an example, if you have a thread run foo(0,0) in a loop (so not doing anything) and another thread doing: for (int i = 0 ; i <1000000 ; ++i) { globalvar = i; assert(globalvar == i); } the code should never assert. But if you host the load/store out of the loop (as GCC appears to do), then occasionaly there will be something like thread 2: globalvar = i (= 42) thread 1: w3 = globalvar (= 42) thread 2: ++i thread 2: globalvar = i (= 43) thread 1: globalvar =w3 (= 42) thread 2: assert(globalvar == i) And it will assert (42 == 43) and crash. Shorter answer: - speculatively executing stores is unsound because the value may have been modified behind your back by another thread. - speculatively executing loads might not be observable in some specific case but may always introduce races, thus introducing undefined behaviour and breaking assumptions that other passes may rely on. Best regards, Robin On Tue, Sep 2, 2014 at 3:46 PM, Balaram Makam <bmakam at codeaurora.org> wrote: All, If we can speculatively execute a load instruction, why isn’t it safe to hoist it out by promoting it to a scalar in LICM pass? There is a comment in LICM pass that if a load/store is conditional then it is not safe because it would break the LLVM concurrency model (See commit 73bfa4a). It has an IR test for checking this in test/Transforms/LICM/scalar-promote-memmodel.ll However, I have a sample code where GCC is able to promote the memory to scalar and hoist/sink load/store out of loop but LLVM cannot. Is GCC being aggressive here or LLVM missing out an opportunity? Here is my sample code: extern int globalvar; void foo(int n , int incr) { unsigned int i; for (i = 0 ; i < n; i += incr ) { if (i < n/2) globalvar += incr; } return; } GCC output: $ aarch64-linux-gnu-g++ -S -o - -O3 -ffast-math -march=armv8-a+simd test.cpp .arch armv8-a+fp+simd .file "test.cpp" .text .align 2 .global _Z3fooii .type _Z3fooii, %function _Z3fooii: .LFB0: .cfi_startproc cbz w0, .L1 adrp x6, globalvar add w5, w0, w0, lsr 31 ldr w3, [x6,#:lo12:globalvar] <== hoist load of globalvar mov w2, 0 asr w5, w5, 1 .L4: cmp w5, w2 add w2, w2, w1 add w4, w3, w1 csel w3, w4, w3, hi cmp w2, w0 bcc .L4 str w3, [x6,#:lo12:globalvar] <== sink store of globalvar .L1: ret .cfi_endproc .LFE0: .size _Z3fooii, .-_Z3fooii .ident "GCC: (crosstool-NG linaro-1.13.1-4.8-2014.01 - Linaro GCC 2013.11) 4.9.0" LLVM output: $ clang-aarch64-x++ -S -o - -O3 -ffast-math -fslp-vectorize test.cpp .text .file "test.cpp" .globl _Z3fooii .align 2 .type _Z3fooii, at function _Z3fooii: // @_Z3fooii // BB#0: // %entry cbz w0, .LBB0_5 // BB#1: // %for.body.lr.ph mov w8, wzr cmp w0, #0 // =0 cinc w9, w0, lt asr w9, w9, #1 adrp x10, globalvar .LBB0_2: // %for.body // =>This Inner Loop Header: Depth=1 cmp w8, w9 b.hs .LBB0_4 // BB#3: // %if.then // in Loop: Header=BB0_2 Depth=1 ldr w11, [x10, :lo12:globalvar] <===== load inside loop add w11, w11, w1 str w11, [x10, :lo12:globalvar] <==== store inside loop .LBB0_4: // %for.inc // in Loop: Header=BB0_2 Depth=1 add w8, w8, w1 cmp w8, w0 b.lo .LBB0_2 .LBB0_5: // %for.end ret .Ltmp1: .size _Z3fooii, .Ltmp1-_Z3fooii .ident "clang version 3.6.0 " Thanks, Balaram _______________________________________________ LLVM Developers mailing list LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev _______________________________________________ LLVM Developers mailing list LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev -------------- next part -------------- A non-text attachment was scrubbed... Name: winmail.dat Type: application/ms-tnef Size: 5374 bytes Desc: not available URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140903/319bb074/attachment.bin>