Paul Peet via llvm-dev
2016-Dec-29  23:20 UTC
[llvm-dev] [compiler-rt] Improve atomic locking?
Hey, I am wondering if there wouldn't be more room for improving the locking of a pointer when an atomic operation is being made since I've noticed that one could increase the SPINLOCK_COUNT in lib/builtins/atomic.c to (1 << 13) which is a 8x increase of available locks if we also change the type of the atomic lock which currently is uintptr_t to a single byte (uint8_t) which I think is more appropriate since the lock can only have two states for other systems than freebsd, macos. Am I missing something? This seems like choosing uintptr_t was intentional. ~ Paul
David Chisnall via llvm-dev
2016-Dec-30  10:01 UTC
[llvm-dev] [compiler-rt] Improve atomic locking?
On 30 Dec 2016, at 00:20, Paul Peet via llvm-dev <llvm-dev at lists.llvm.org> wrote:> > Hey, > > I am wondering if there wouldn't be more room for improving the > locking of a pointer when an atomic operation is being made since I've > noticed that one could increase the SPINLOCK_COUNT in > lib/builtins/atomic.c to (1 << 13) which is a 8x increase of available > locks if we also change the type of the atomic lock which currently is > uintptr_t to a single byte (uint8_t) which I think is more appropriate > since the lock can only have two states for other systems than > freebsd, macos. > > Am I missing something? This seems like choosing uintptr_t was intentional.It’s mostly about false sharing and depends a bit on how the atomics are implemented for any given architecture and microarchitecture. In many implementations, the atomicity is guaranteed by the cache coherency protocol, so the granularity is one cache line. With 64 byte cache lines and an 8-byte lock, you have 8 locks per cache line, so with 2^10 locks there’s a 1/128 chance of false sharing causing contention on the locks even when two addresses map to different locks. With int8_t on the same system you now have 64 locks per cache line, so a 1/16 chance of false sharing. Additionally, on many architectures there are no atomic instructions for atomics that work on quantities smaller than a register (or a 32-bit value on 64-bit systems). For example, on MIPS you have 32- and 64-bit load-linked and store-conditional. If you want to do an 8-bit compare and swap, then you will encounter contention with other operations on the same 32-bit value, even on a microarchitecture that supports word-granularity atomics (using something finer grained than the cache coherency protocol). Even on systems where the ISA supports 8-bit atomics, they’re often microcoded as shift and mask sequences. Note: On Linux, we should be using futexes, but no one who uses Linux appears to care enough to implement this code path. Do you have an example in which lock contention here is a bottleneck? I’d be interested in seeing it. If you do encounter contention here, then you’d likely see more of a speedup simply bumping SPINLOCK_COUNT without changing the underlying type. There’s no reason on most systems not to allocate 64KB for the spinlock array: memory is cheap, most of it won’t be in the cache, and a more sparse structure will reduce contention. David
Reasonably Related Threads
- [LLVMdev] [PATCH][Compiler-rt] Windows implementation of mmap functionality in clear_cache_test and enable_execute_stack_test
- [LLVMdev] [llvm-commits] [PATCH][Compiler-rt] Windows implementation of mmap functionality in clear_cache_test and enable_execute_stack_test
- Handling of __c11_atomic_is_lock_free({1, 2, 4, 8}) in compiler-rt atomic.c
- [LLVMdev] Clang-Format all LLVM Projects
- [PATCH -next] vdpasim: remove unused variable 'ret'