Replied too early... Below:

On Tue, May 10, 2016 at 2:04 PM, JF Bastien <jfb at google.com> wrote:

> Thanks for the writeup, that is indeed pretty ugly. Simple
> asm(:::"memory") isn't sufficient either, since the regalloc can decide to
> spill :-(
>
> On Tue, May 10, 2016 at 12:22 PM, James Knight via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Unfortunately, neither ARM nor PPC appear to precisely document the
>> architectural constraints under which forward progress must be guaranteed
>> by the implementation. They certainly have the same underlying
>> implementation issues that give rise to the above rules -- that much seems
>> documented -- they just don't appear to make explicit guarantees on how you
>> can guarantee success. ARM does "recommend" that LL/SC loops fit within 128
>> bytes, though.
>
> For ARMv7 from the ARM ARM:
>
> A Load-Exclusive instruction tags a small block of memory for exclusive
> access. The size of the tagged block is IMPLEMENTATION DEFINED, see Tagging
> and the size of the tagged memory block on page A3-121. A Store-Exclusive
> instruction to the same address clears the tag.
>
> And:
>
> The value of a in this assignment is IMPLEMENTATION DEFINED, between a
> minimum value of 3 and a maximum value of 11. For example, in an
> implementation where a is 4, a successful LDREX of address 0x000341B4 gives
> a tag value of bits[31:4] of the address, giving 0x000341B. This means that
> the four words of memory from 0x000341B0 to 0x000341BF are tagged for
> exclusive access.
>
> The size of the tagged memory block is called the Exclusives Reservation
> Granule. The Exclusives Reservation Granule is IMPLEMENTATION DEFINED in
> the range 2-512 words:
> • 2 words in an implementation where a is 3
> • 512 words in an implementation where a is 11

There's a bit more info here:

When a processor writes using any instruction other than a Store-Exclusive:
• if the write is to a physical address that is not covered by its local
  monitor the write does not affect the state of the local monitor
• if the write is to a physical address that is covered by its local
  monitor it is IMPLEMENTATION DEFINED whether the write affects the state
  of the local monitor.

If the local monitor is in the Exclusive Access state and the processor
performs a Store-Exclusive to any address other than the last one from
which it performed a Load-Exclusive, it is IMPLEMENTATION DEFINED whether
the store updates memory, but in all cases the local monitor is reset to
the Open Access state. This mechanism:
• is used on a context switch, see Context switch support on page A3-122
• must be treated as a software programming error in all other cases

And around similar parts of the manual. You can search the web for these,
they're all "superseded" versions of the docs and I can't find the
canonical one!
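To make the tagging arithmetic quoted above concrete, here is a minimal
sketch in C. The helper names (erg_tag, erg_base, erg_last, erg_words) are
made up for illustration and do not come from the ARM ARM; the only inputs
taken from the quoted text are that the tag is bits[31:a] of the address and
that a is IMPLEMENTATION DEFINED in the range 3..11.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (names are illustrative, not from the ARM ARM).
 * 'a' is the IMPLEMENTATION DEFINED value from the quoted text, 3..11. */

/* Tag recorded by a Load-Exclusive: bits[31:a] of the address. */
static uint32_t erg_tag(uint32_t addr, unsigned a) {
    assert(a >= 3 && a <= 11);
    return addr >> a;
}

/* First byte address covered by the reservation granule. */
static uint32_t erg_base(uint32_t addr, unsigned a) {
    assert(a >= 3 && a <= 11);
    return addr & ~((UINT32_C(1) << a) - 1);
}

/* Last byte address covered by the reservation granule. */
static uint32_t erg_last(uint32_t addr, unsigned a) {
    assert(a >= 3 && a <= 11);
    return addr | ((UINT32_C(1) << a) - 1);
}

/* Granule size in 32-bit words: 2^a bytes is 2^(a-2) words. */
static uint32_t erg_words(unsigned a) {
    assert(a >= 3 && a <= 11);
    return UINT32_C(1) << (a - 2);
}
```

With a = 4 and address 0x000341B4, erg_base and erg_last give 0x000341B0 and
0x000341BF, matching the manual's example, and erg_words gives 2 for a = 3
and 512 for a = 11. The practical consequence is that any ordinary store
landing in that range between the Load-Exclusive and the Store-Exclusive may
(IMPLEMENTATION DEFINED) disturb the local monitor.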
Yeah, I found that info in the ARM ARM, however, it provides no
information about forward progress guarantees. There's no subset of
instructions, maximal length of loop, or other such info about what sorts
of LL/SC loops can be guaranteed not to live-lock. You can infer from
erratum
<http://lists.infradead.org/pipermail/linux-arm-kernel/2014-May/254392.html>
and FAQs
<http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka8404.html>
that at least some of the ARM implementations likely do have some kind of
cache control delay within which your loop should fit, but that's about
all I can find.

The lack of documented rules is perhaps only a theoretical concern, as if
you follow all of the rules collected from other architectures, it's a
pretty sure bet you'll be as safe as possible on ARM and PPC too. The only
question is if it's safe to be more lax, and e.g. allowing different branch
sequences, or not. (And of course that'd only be interesting if there were
an actual performance advantage to doing so).

But, it is certainly unfortunate that there appears to be no actual
specification which a user can rely on, as MIPS has.

BTW, on another note, the "no taken branches" restriction from Alpha is
explained like this:

"Branch instructions between the LDx_L/STx_C pair may be mispredicted,
introducing load and store instructions that evict the locked cache block.
To prevent that from happening, there is a bit in the instruction fetcher
that is set for a LDx_L reference and cleared on any other memory
reference. When this bit is set, the branch predictor predicts all branches
to fall through."

That seems like an interesting detail. I now wonder if other architectures
have a similar issue with a mis-predicted branch in the LL/SC loop, and how
they deal with it.

> On May 10, 2016, at 5:08 PM, JF Bastien <jfb at google.com> wrote:
>
> Replied too early...
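For readers who have not written one, the kind of LL/SC retry loop under
discussion can be sketched portably with C11 atomics. This is only an
illustration under stated assumptions: fetch_add_via_cas is a made-up helper
name, and the claim that a weak compare-exchange is lowered to a
load-exclusive/store-exclusive pair describes what compilers typically do on
LL/SC targets such as ARMv7, not something this thread or the ARM ARM
guarantees. Nothing here establishes a forward-progress guarantee.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: atomically add 'delta' to *p and return the
 * previous value, using a compare-exchange retry loop. */
static uint32_t fetch_add_via_cas(_Atomic uint32_t *p, uint32_t delta) {
    assert(p != NULL);
    uint32_t old = atomic_load_explicit(p, memory_order_relaxed);
    /* The weak form may fail spuriously (e.g. when the reservation is
     * lost on an LL/SC machine), so it must always sit in a loop. On
     * failure 'old' is refreshed with the value currently in *p. */
    while (!atomic_compare_exchange_weak_explicit(
               p, &old, old + delta,
               memory_order_seq_cst, memory_order_relaxed)) {
        /* Retry with the refreshed 'old'; keep the body empty so that
         * little work sits between the load and the store attempt. */
    }
    return old;
}
```

Writing the loop this way leaves instruction selection, register allocation
and any spills between the exclusive load and store up to the compiler, which
is exactly where the concerns raised in this thread apply.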
On Tue, May 10, 2016 at 3:49 PM, James Knight <jyknight at google.com> wrote:

> Yeah, I found that info in the ARM ARM, however, it provides no
> information about forward progress guarantees. There's no subset of
> instructions, maximal length of loop, or other such info about what sorts
> of LL/SC loops can be guaranteed not to live-lock. You can infer from
> erratum
> <http://lists.infradead.org/pipermail/linux-arm-kernel/2014-May/254392.html>
> and FAQs
> <http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka8404.html>
> that at least some of the ARM implementations likely do have some kind of
> cache control delay within which your loop should fit, but that's about
> all I can find.
>
> The lack of documented rules is perhaps only a theoretical concern, as if
> you follow all of the rules collected from other architectures, it's a
> pretty sure bet you'll be as safe as possible on ARM and PPC too. The only
> question is if it's safe to be more lax, and e.g. allowing different branch
> sequences, or not. (And of course that'd only be interesting if there were
> an actual performance advantage to doing so).
>
> But, it is certainly unfortunate that there appears to be no actual
> specification which a user can rely on, as MIPS has.

True, we should probably get that clarified.

> BTW, on another note, the "no taken branches" restriction from Alpha is
> explained like this:
> "Branch instructions between the LDx_L/STx_C pair may be mispredicted,
> introducing load and store instructions that evict the locked cache block.
> To prevent that from happening, there is a bit in the instruction fetcher
> that is set for a LDx_L reference and cleared on any other memory
> reference. When this bit is set, the branch predictor predicts all branches
> to fall through."
>
> That seems like an interesting detail. I now wonder if other architectures
> have a similar issue with a mis-predicted branch in the LL/SC loop, and how
> they deal with it.

I'm not sure we care about Alpha, the memory model sure doesn't ;-)