thr3ads.net - llvm dev - [llvm-dev] RFC: Element-atomic memory intrinsics [May 2017]

If this information is useful, please help other people find it:
Share via:

Daniel Neilson via llvm-dev

2017-May-08 14:54 UTC

[llvm-dev] RFC: Element-atomic memory intrinsics

Greetings all,
 I am picking up the work that was started in https://reviews.llvm.org/D27133 —
adding support for an element-atomic memcpy/memset/memmove to LLVM. I would
appreciate suggestions/thoughts/advice/comments on how to best proceed with this
work in a way that will be acceptable to the LLVM group.

 I apologize in advance; this is going to be a long one...

**Background**

 Loads/stores of shared data in Java must be done in an atomic-like manner — it
must essentially look like the entire variable is loaded/stored in a single
operation. So, no doing stuff like splitting the load of a uint32 into four
loads of uint8’s, or two loads of uint16’s; such a splitting could result in a
data-race where each of the split loads can load a different version of the
variable’s value (due to the variable being stored-to in other threads of
execution). In LLVM, the atomic-like behaviour is achieved by emitting an atomic
load/store with the ‘unordered’ ordering constraint.

 In LLVM there is a loop idiom recognition pass that will convert loops that
look like memcpy/memmove/memset into a direct call to the corresponding
intrinsic — which allows the rest of the LLVM passes to reason about the
operation better, and for codegen to materialize probably-better optimized
versions of the loop (ex: one that uses wider loads/stores).  But, as one might
expect, this idiom recognition doesn’t touch atomic operations — i.e. if a loop
that’s doing a memcpy is doing so with unordered loads & stores then loop
idiom should not convert that into a memcpy. Reason being that
memcpy/memset/memmove implementations are allowed to operate on the byte-level
when doing the op, and that would break our requirement that our data be
loaded/stored in an atomic-like manner if we’re, say, copying arrays of
uint32’s.

**Goal**

 We would like for loop idiom to be able to convert unordered load/store loops
into memory intrinsics, but doing so requires that we have some way of
expressing that the resulting memory intrinsic be *unordered* and what the
element-size is. We would also like all/most of the existing LLVM passes that
understand the semantics of the memory intrinsics to understand the semantics of
these ‘unordered’ versions.

**Method**

 Clearly we are going to have to teach LLVM about unordered memory intrinsics.
There are, as I can see it, a couple of ways to accomplish this. I’d like your
opinions on which are preferable, or if you can think of any other options. In
no particular order…

Option 1)
 Introduce a new unordered/element-atomic version of each of the memory
intrinsics.
  Ex: @llvm.memcpy_element_atomic — work was already started to introduce this
one in D27133, but could be backed out and restarted.
  Intrinsic prototype: @llvm.memcpy_element_atomic.<overload
desc>(<ty>* dest, <ty>* src, <ty> len, i32 align, i2
isunordered, i16 element_size)
    Semantics:
       * Will do a memcpy of len bytes from src to dest.
       * len must = k * lcm( #bytes in dest type, #bytes in src type), for some
non-negative integer k [note: lcm = least-common multiple]
       * load/store size given by the constant power-of-2 parameter
“element_size”; expected to be the lcm(sizeof(dest_ty), sizeof(src_ty))
       * isunordered param: bit 0 == 1 => stores to dest must be marked
‘unordered’; bit 1 == 1 => loads from src must be marked ‘unordered'
       * expanded memcpy code has this sort of form (ignoring alignment for
simplicity):
                template<typename len_ty, uint16_t element_size>
                void memcpy_element_atomic(char *dest, char *src, len_ty len) {
                    copy_ty = <int type with sizeof(type) ==
element_size>;
                    len_ty num_iters = len / sizeof(copy_ty);
                    for (len_ty i=0; i < num_iters; i++)
                      *(copy_ty*)(dest + i*sizeof(copy_ty)) = *(copy_ty*)(src +
i*sizeof(copy_ty));
                    // Note: restriction on len means we don’t need a residual
loop.
                }

   We would inject the new intrinsic into the MemIntrinsic introspection
hierarchy (IR/InstrinsicInst.h).

     Pros:
       * New intrinsic doesn’t interfere with existing memory intrinsics code.
       * Adding new intrinsic to MemIntrinsic hierarchy means that many LLVM
passes will handle the new intrinsic “for free” (though, not *really* free; some
work would need to be done)

     Cons:
       * Passes that check intrinsic ID against Intrinsics::memcpy, etc would
need to be updated to handle the new intrinsic ID.
       * When new code is introduced by others to exploit/handle memcpy, then
they may not remember to handle the memcpy_element_atomic version.

Option 2)
  Expand the current/existing memory intrinsics to identify the unordered
constraint, if one exists, in much the same way that volatility is expressed —
i.e. add an ‘isunordered’ parameter(s) to the intrinsic.
  This option has the same semantics as option 1; the only difference is,
literally, that we expand the existing memcpy/memset/memmove intrinsics to have
an ‘isunordered’ parameter and an ‘element_size’ parameter, so the prototype
becomes something like:
   @llvm.memcpy.<overload desc>(<ty>* dest, <ty>* src,
<ty> len, i32 align, i1 isvolatile, i2 isunordered, i16 element_size)

 Pros:
   * Minimal extra work to handle the new version in existing passes — only need
to change passes that create calls to memory intrinsics, expand memory
intrinsics, or that need to care about unordered (which none should that are
reasoning about memory intrinsic semantics).
   * New code that’s introduced by others to exploit/handle memory intrinsics
should just handle unordered for free — unordered being a part of the memory
intrinsic means it’s more likely that the person will realize that they have to
think about it (i.e. it raises the profile of unordered memory intrinsics).

 Cons:
   * Breaks backward compatibility of the IR — is there a mechanism for
migrating old IR into a new IR version when loading the IR into LLVM?
   * Memory intrinsic prototypes changing could break some 3rd party consumers
of LLVM-IR until they update their parsers.
   * Some risk that existing passes (LLVM or 3rd party) don’t handle the new
attribute of memory intrinsic properly — introducing bugs. Though, I suspect
that this risk is small/tiny as a functional issue would only materialize if
they’re materializing unordered-atomic load/stores (which only Java does, I
think).

**Note on generalizing to other orderings**

 We had some internal discussion on generalizing this to the other LLVM atomic
orderings: monotonic, acquire, release, acquire_release, and
sequentially_consistent.

 We don’t think that it’s worth generalizing the memory intrinsics to general
memory atomic-ordering constraints. The only benefit that we can see is that
doing so, and teaching loop idiom about them, would allow passes to reason about
memcpy/memset/memmove semantics via the intrinsic instead of inferring info from
the loop behaviour — arguably, a big benefit. However, there are complications
in doing so:

 i) The atomic ordering constraints are memory barriers — this would mean that
all passes that have to know about atomic-load/store vs load/store would also
have to be taught about atomic memory intrinsics. i.e. The memory intrinsic
itself would become a memory barrier, the type of which would be a union of the
src & dest atomic-orderings.

 ii) Rematerialzing the loop from the memory intrinsic will be tricky and have a
lot of special cases. The rematerialized loop would have to exactly match the
loop that the memory intrinsic was created from due to the requirements of the
memory ordering barriers. We would probably have to have a separate
atomic-ordering for the src & dest parameters of the memory intrinsic to be
able to faithfully reproduce the original loop semantics.

 iii) The loop would *always* have to be rematerialized from the intrinsic in
codegen — if the intrinsic ever becomes a lib call, then there’s a bug. No
different from the unordered case that we want to handle, really, but it’s worth
pointing out since this bug would show as a data race and would be a beast to
debug.

 iv) I doubt that any real-world code will *ever* exist that would become one of
these atomic-ordered memory intrinsics, so it’s probably not worth the trouble.


Thanks for reading! Thoughts, questions, and comments are welcome.

-Daniel

---
Daniel Neilson, Ph.D.
Azul Systems



-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170508/4f6708eb/attachment.html>

Sanjoy Das via llvm-dev

2017-May-08 17:49 UTC

head link

[llvm-dev] RFC: Element-atomic memory intrinsics

Hi Daniel,

[+CC Mehdi, Vedant for the auto upgrade issue]

On Mon, May 8, 2017 at 7:54 AM, Daniel Neilson via llvm-dev
<llvm-dev at lists.llvm.org> wrote:> **Method**
>
>  Clearly we are going to have to teach LLVM about unordered memory
> intrinsics. There are, as I can see it, a couple of ways to accomplish
this.
> I’d like your opinions on which are preferable, or if you can think of any
> other options. In no particular order…
>
> Option 1)
>  Introduce a new unordered/element-atomic version of each of the memory
> intrinsics.
>   Ex: @llvm.memcpy_element_atomic — work was already started to introduce
> this one in D27133, but could be backed out and restarted.
>   Intrinsic prototype: @llvm.memcpy_element_atomic.<overload
desc>(<ty>*
> dest, <ty>* src, <ty> len, i32 align, i2 isunordered, i16
element_size)
>     Semantics:
>        * Will do a memcpy of len bytes from src to dest.
>        * len must = k * lcm( #bytes in dest type, #bytes in src type), for
> some non-negative integer k [note: lcm = least-common multiple]
>        * load/store size given by the constant power-of-2 parameter
> “element_size”; expected to be the lcm(sizeof(dest_ty), sizeof(src_ty))
I'm not sure if sizeof(dest_ty) and sizeof(src_ty) adds anything here.

LLVM is moving towards "typeless pointers" (i.e. pointers will not
have pointee types, instead they will just be a "generic pointer" in
some address space), so working in the types of dest and src into the
specification seems awkward.

Also, does the non-overlap restriction of src and dest (as in the
regular llvm.memcpy) apply here as well?
>        * isunordered param: bit 0 == 1 => stores to dest must be marked
> ‘unordered’; bit 1 == 1 => loads from src must be marked ‘unordered'
What if the bits are zero -- will the stores / loads (depending on
which bit) be "ordered" in that case, or something stronger?
> Option 2)
>   Expand the current/existing memory intrinsics to identify the unordered
> constraint, if one exists, in much the same way that volatility is
expressed
> — i.e. add an ‘isunordered’ parameter(s) to the intrinsic.
>   This option has the same semantics as option 1; the only difference is,
> literally, that we expand the existing memcpy/memset/memmove intrinsics to
> have an ‘isunordered’ parameter and an ‘element_size’ parameter, so the
> prototype becomes something like:
>    @llvm.memcpy.<overload desc>(<ty>* dest, <ty>* src,
<ty> len, i32 align,
> i1 isvolatile, i2 isunordered, i16 element_size)
>
>  Pros:
>    * Minimal extra work to handle the new version in existing passes — only
> need to change passes that create calls to memory intrinsics, expand memory
> intrinsics, or that need to care about unordered (which none should that
are
> reasoning about memory intrinsic semantics).
>    * New code that’s introduced by others to exploit/handle memory
> intrinsics should just handle unordered for free — unordered being a part
of
> the memory intrinsic means it’s more likely that the person will realize
> that they have to think about it (i.e. it raises the profile of unordered
> memory intrinsics).
I like the second point, but (unfortunately) I suspect in practice
you'll see new code do:

  if (MCI->isOrdered())
    return false;  // be conservative
>  Cons:
>    * Breaks backward compatibility of the IR — is there a mechanism for
> migrating old IR into a new IR version when loading the IR into LLVM?
I think the migration here will be fairly straightforward -- you can
just auto-upgrade old calls to memcpy to pass in 0 for the isordered
argument.  But I've CC'd Mehdi and Vedant to help shed some light on
this.

-- Sanjoy

Vedant Kumar via llvm-dev

2017-May-08 18:22 UTC

head link

[llvm-dev] RFC: Element-atomic memory intrinsics

> On May 8, 2017, at 10:49 AM, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
> 
> Hi Daniel,
> 
> [+CC Mehdi, Vedant for the auto upgrade issue]
> 
> On Mon, May 8, 2017 at 7:54 AM, Daniel Neilson via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> **Method**
>> 
>> Clearly we are going to have to teach LLVM about unordered memory
>> intrinsics. There are, as I can see it, a couple of ways to accomplish
this.
>> I’d like your opinions on which are preferable, or if you can think of
any
>> other options. In no particular order…
>> 
>> Option 1)
>> Introduce a new unordered/element-atomic version of each of the memory
>> intrinsics.
>> Ex: @llvm.memcpy_element_atomic — work was already started to introduce
>> this one in D27133, but could be backed out and restarted.
I'm curious about this -- do you know why the decision was made in D27133 to
go with a new intrinsic, instead of extending the existing one? Would the same
rationale apply here?
>> Intrinsic prototype: @llvm.memcpy_element_atomic.<overload
desc>(<ty>*
>> dest, <ty>* src, <ty> len, i32 align, i2 isunordered, i16
element_size)
>>   Semantics:
>>      * Will do a memcpy of len bytes from src to dest.
>>      * len must = k * lcm( #bytes in dest type, #bytes in src type),
for
>> some non-negative integer k [note: lcm = least-common multiple]
>>      * load/store size given by the constant power-of-2 parameter
>> “element_size”; expected to be the lcm(sizeof(dest_ty), sizeof(src_ty))
> 
> I'm not sure if sizeof(dest_ty) and sizeof(src_ty) adds anything here.
> 
> LLVM is moving towards "typeless pointers" (i.e. pointers will
not
> have pointee types, instead they will just be a "generic pointer"
in
> some address space), so working in the types of dest and src into the
> specification seems awkward.
> 
> Also, does the non-overlap restriction of src and dest (as in the
> regular llvm.memcpy) apply here as well?
> 
>>      * isunordered param: bit 0 == 1 => stores to dest must be
marked
>> ‘unordered’; bit 1 == 1 => loads from src must be marked
‘unordered'
> 
> What if the bits are zero -- will the stores / loads (depending on
> which bit) be "ordered" in that case, or something stronger?
> 
>> Option 2)
>> Expand the current/existing memory intrinsics to identify the unordered
>> constraint, if one exists, in much the same way that volatility is
expressed
>> — i.e. add an ‘isunordered’ parameter(s) to the intrinsic.
>> This option has the same semantics as option 1; the only difference is,
>> literally, that we expand the existing memcpy/memset/memmove intrinsics
to
>> have an ‘isunordered’ parameter and an ‘element_size’ parameter, so the
>> prototype becomes something like:
>>  @llvm.memcpy.<overload desc>(<ty>* dest, <ty>* src,
<ty> len, i32 align,
>> i1 isvolatile, i2 isunordered, i16 element_size)
>> 
>> Pros:
>>  * Minimal extra work to handle the new version in existing passes —
only
>> need to change passes that create calls to memory intrinsics, expand
memory
>> intrinsics, or that need to care about unordered (which none should
that are
>> reasoning about memory intrinsic semantics).
>>  * New code that’s introduced by others to exploit/handle memory
>> intrinsics should just handle unordered for free — unordered being a
part of
>> the memory intrinsic means it’s more likely that the person will
realize
>> that they have to think about it (i.e. it raises the profile of
unordered
>> memory intrinsics).
> 
> I like the second point, but (unfortunately) I suspect in practice
> you'll see new code do:
> 
> if (MCI->isOrdered())
>   return false;  // be conservative
> 
>> Cons:
>>  * Breaks backward compatibility of the IR — is there a mechanism for
>> migrating old IR into a new IR version when loading the IR into LLVM?
> 
> I think the migration here will be fairly straightforward -- you can
> just auto-upgrade old calls to memcpy to pass in 0 for the isordered
> argument.  But I've CC'd Mehdi and Vedant to help shed some light
on
> this.
LLVM has one test which for backwards-compatibility with old versions of the
memcpy intrinsic, which provides limited coverage (standardCIntrinsic.3.2.ll).

Whichever option you choose, it would be helpful to add uses of the new
intrinsic to test/Bitcode/compatibility.ll. If you choose Option 2, it would
also help to copy these uses into the 4.0 bitcode compatibility test (with
"is_unordered" dropped), and to re-generate the bitcode for the test.
I can help out with this if you'd like.

vedant
> 
> -- Sanjoy

Daniel Neilson via llvm-dev

2017-May-08 19:08 UTC

head link

[llvm-dev] RFC: Element-atomic memory intrinsics

Hi Sanjoy,
 Responses inlined…
> On May 8, 2017, at 12:49 PM, Sanjoy Das <sanjoy at
playingwithpointers.com> wrote:
> 
> Hi Daniel,
> 
> [+CC Mehdi, Vedant for the auto upgrade issue]
> 
> On Mon, May 8, 2017 at 7:54 AM, Daniel Neilson via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> **Method**
>> 
>> Clearly we are going to have to teach LLVM about unordered memory
>> intrinsics. There are, as I can see it, a couple of ways to accomplish
this.
>> I’d like your opinions on which are preferable, or if you can think of
any
>> other options. In no particular order…
>> 
>> Option 1)
>> Introduce a new unordered/element-atomic version of each of the memory
>> intrinsics.
>>  Ex: @llvm.memcpy_element_atomic — work was already started to
introduce
>> this one in D27133, but could be backed out and restarted.
>>  Intrinsic prototype: @llvm.memcpy_element_atomic.<overload
desc>(<ty>*
>> dest, <ty>* src, <ty> len, i32 align, i2 isunordered, i16
element_size)
>>    Semantics:
>>       * Will do a memcpy of len bytes from src to dest.
>>       * len must = k * lcm( #bytes in dest type, #bytes in src type),
for
>> some non-negative integer k [note: lcm = least-common multiple]
>>       * load/store size given by the constant power-of-2 parameter
>> “element_size”; expected to be the lcm(sizeof(dest_ty), sizeof(src_ty))
> 
> I'm not sure if sizeof(dest_ty) and sizeof(src_ty) adds anything here.
> 
> LLVM is moving towards "typeless pointers" (i.e. pointers will
not
> have pointee types, instead they will just be a "generic pointer"
in
> some address space), so working in the types of dest and src into the
> specification seems awkward.
> 
 Poor choice of wording on my part. By sizeof(<thing>) I mean the
element-size of the load/store that appeared in the original loop from which the
memory intrinsic was materialized — arguably, these will be the same in any real
cases so it probably doesn’t make any sense to even mention it...
> Also, does the non-overlap restriction of src and dest (as in the
> regular llvm.memcpy) apply here as well?
> 
 Yes, I would think so.
>>       * isunordered param: bit 0 == 1 => stores to dest must be
marked
>> ‘unordered’; bit 1 == 1 => loads from src must be marked
‘unordered'
> 
> What if the bits are zero -- will the stores / loads (depending on
> which bit) be "ordered" in that case, or something stronger?
> 
 This is partly why I prefer option 2. An ‘isunordered’ value of 0 is nonsense
for the standalone atomic-unordered memory intrinsic. It would imply that
neither the source nor dest needs to be loaded/stored via unordered-atomic ops,
and so the memory intrinsic is identical to the ordinary non-atomic one.
>> Option 2)
>>  Expand the current/existing memory intrinsics to identify the
unordered
>> constraint, if one exists, in much the same way that volatility is
expressed
>> — i.e. add an ‘isunordered’ parameter(s) to the intrinsic.
>>  This option has the same semantics as option 1; the only difference
is,
>> literally, that we expand the existing memcpy/memset/memmove intrinsics
to
>> have an ‘isunordered’ parameter and an ‘element_size’ parameter, so the
>> prototype becomes something like:
>>   @llvm.memcpy.<overload desc>(<ty>* dest, <ty>* src,
<ty> len, i32 align,
>> i1 isvolatile, i2 isunordered, i16 element_size)
>> 
>> Pros:
>>   * Minimal extra work to handle the new version in existing passes —
only
>> need to change passes that create calls to memory intrinsics, expand
memory
>> intrinsics, or that need to care about unordered (which none should
that are
>> reasoning about memory intrinsic semantics).
>>   * New code that’s introduced by others to exploit/handle memory
>> intrinsics should just handle unordered for free — unordered being a
part of
>> the memory intrinsic means it’s more likely that the person will
realize
>> that they have to think about it (i.e. it raises the profile of
unordered
>> memory intrinsics).
> 
> I like the second point, but (unfortunately) I suspect in practice
> you'll see new code do:
> 
>  if (MCI->isOrdered())
>    return false;  // be conservative
> 
Yes, that would be an unfortunate reality, but one can hope. :-)
>> Cons:
>>   * Breaks backward compatibility of the IR — is there a mechanism for
>> migrating old IR into a new IR version when loading the IR into LLVM?
> 
> I think the migration here will be fairly straightforward -- you can
> just auto-upgrade old calls to memcpy to pass in 0 for the isordered
> argument.  But I've CC'd Mehdi and Vedant to help shed some light
on
> this.
> 
> — Sanjoy
Thanks!

-Daniel

---
Daniel Neilson, Ph.D.
Azul Systems

Maybe Matching Threads

Search for more apparently analagous threads

llvm dev - May 2017 - RFC: Element-atomic memory intrinsics

[llvm-dev] RFC: Element-atomic memory intrinsics

[llvm-dev] RFC: Element-atomic memory intrinsics

[llvm-dev] RFC: Element-atomic memory intrinsics

[llvm-dev] RFC: Element-atomic memory intrinsics

Maybe Matching Threads