When I request a write-before-read memory barrier on x86 I would expect
to get an assembly instruction that would enforce this ordering (mfence,
xchg, cas), but it just turns into a nop.

; ModuleID = 'test.bc'
target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:128:128"
target triple = "i686-apple-darwin9"
@a = external global i32		; <i32*> [#uses=1]
@b = external global i32		; <i32*> [#uses=1]

define i32 @_Z3foov() nounwind {
entry:
	store i32 1, i32* @a, align 4
	tail call void @llvm.memory.barrier(i1 true, i1 true, i1 true, i1 true, i1 false)
	%0 = load i32* @b, align 4		; <i32> [#uses=1]
	ret i32 %0
}

declare void @llvm.memory.barrier(i1, i1, i1, i1, i1) nounwind

compiled with:

	llc -mcpu=core2 -mattr=+sse2,+sse3 -f -o test.s test.bc

becomes:

	.text
	.align	4,0x90
	.globl	__Z3foov
__Z3foov:
	movl	L_a$non_lazy_ptr, %eax
	movl	$1, (%eax)
	nop
	movl	L_b$non_lazy_ptr, %eax
	movl	(%eax), %eax
	ret

	.section	__IMPORT,__pointers,non_lazy_symbol_pointers
L_a$non_lazy_ptr:
	.indirect_symbol	_a
	.long	0
L_b$non_lazy_ptr:
	.indirect_symbol	_b
	.long	0
	.subsections_via_symbols

Is the problem related to the fact that I get i386 from uname -m? If
so, how can I override this setting during compilation?

Thanks,
Luke
On Thu, 2008-09-25 at 10:28 -0400, Luke Dalessandro wrote:
> When I request a write-before-read memory barrier on x86 I would expect
> to get an assembly instruction that would enforce this ordering (mfence,
> xchg, cas), but it just turns into a nop.

In its usual configuration, an x86 family CPU implements a strong memory
ordering constraint for all loads and stores, so as long as the ordering
of the read and write operations is preserved no atomic operation is
required between them. XCHG and CAS only become necessary when you are
coordinating reads and writes across processors. MFENCE similarly.

So the current behavior of LLVM is correct, but there is a valid concern
hiding here: there exist programs that intentionally alter the strong
ordering contract in high-performance applications for the sake of
performance, and in those applications it really is necessary to do some
operation that suitably serializes the memory subsystem on the
processor.

The LLVM team may already have a better answer for this, but my first
reaction is that this is effectively a different target architecture. My
second, and possibly more interesting, reaction is that

  a) There needs to be some means (through annotation) to insist that
     these instructions are not removed. Perhaps some means already
     exists; I have not looked.

  b) It might be interesting to examine whether coherency behavior
     could be handled as an attribute of address spaces in LLVM.
     Offhand, this would seem to require a notion of address spaces
     that are exact duplicates of each other except for coherency
     behavior, but there might be some cleaner way to handle that.

The entire LLVM address space notion intrigues me, and I just haven't
had any chance to dig in to it.

shap
Jonathan S. Shapiro wrote:
> On Thu, 2008-09-25 at 10:28 -0400, Luke Dalessandro wrote:
>> When I request a write-before-read memory barrier on x86 I would expect
>> to get an assembly instruction that would enforce this ordering (mfence,
>> xchg, cas), but it just turns into a nop.
>
> In its usual configuration, an x86 family CPU implements a strong memory
> ordering constraint for all loads and stores, so as long as the ordering
> of the read and write operations is preserved no atomic operation is
> required between them. XCHG and CAS only become necessary when you are
> coordinating reads and writes across processors. MFENCE similarly.

IA32 (http://www.intel.com/products/processor/manuals/318147.pdf) always
allows load bypassing.

I found the problem. llvm-gcc compiles __sync_synchronize() ("a full
memory barrier") as:

	tail call void @llvm.memory.barrier(i1 true, i1 true, i1 true, i1 true, i1 false)

As pointed out on IRC, that 5th parameter being false is what is
generating the nop. If I go in and manually change it to true, I get
the mfence.

Did llvm.memory.barrier always have 5 parameters? What's the purpose of
the 5th? Why isn't requesting an ls barrier enough? I think this might
be a change that llvm-gcc doesn't know about yet (yet == r56496).

Luke
Consider the following example (A and B are global variables that
initially contain 0):

Processor 1:
	store 1, A
	x = load B

Processor 2:
	store 1, B
	y = load A

Is it possible to end up with x = 0 and y = 0? Yes! This is exactly the
example in table 2.3.a of
http://www.intel.com/products/processor/manuals/318147.pdf. Yet it seems
impossible to use gcc memory barriers to prevent this, since they
compile to nothing on x86...

Ciao,

Duncan.
On Thursday 25 September 2008 09:41, Jonathan S. Shapiro wrote:
> In its usual configuration, an x86 family CPU implements a strong memory
> ordering constraint for all loads and stores, so as long as the ordering
> of the read and write operations is preserved no atomic operation is
> required between them. XCHG and CAS only become necessary when you are
> coordinating reads and writes across processors. MFENCE similarly.

That's not quite true. If you use non-temporal stores you need a way to
generate a real mfence.

> So the current behavior of LLVM is correct, but there is a valid concern
> hiding here: there exist programs that intentionally alter the strong
> ordering contract in high-performance applications for the sake of
> performance, and in those applications it really is necessary to do some
> operation that suitably serializes the memory subsystem on the
> processor.

This is going to become more and more common on x86.

> The LLVM team may already have a better answer for this, but my first
> reaction is that this is effectively a different target architecture. My
> second, and possibly more interesting reaction is that

No, it's not a separate target architecture. That would be overkill.

> a) There needs to be some means (through annotation) to insist that
>    these instructions are not removed. Perhaps some means already
>    exists; I have not looked.

As Luke discovered, it's the argument to llvm.memory.barrier that makes
the difference. See X86InstrSSE.td.

> b) It might be interesting to examine whether coherency behavior
>    could be handled as an attribute of address spaces in LLVM.
>    Offhand, this would seem to require a notion of address spaces that
>    are exact duplicates of each other except for coherency behavior,
>    but there might be some cleaner way to handle that.

That's an interesting thought, as that's exactly how WC and non-WC
memory is described in the Opteron manuals.

-Dave
> In its usual configuration, an x86 family CPU implements a strong memory
> ordering constraint for all loads and stores, so as long as the ordering
> of the read and write operations is preserved no atomic operation is
> required between them. XCHG and CAS only become necessary when you are
> coordinating reads and writes across processors. MFENCE similarly.

So... gcc's memory barriers are of no use on a multi-processor system?
These are pretty common nowadays, so that sounds very bad...

Ciao,

Duncan.