Artur Pilipenko via llvm-dev
2020-Sep-18 23:51 UTC
[llvm-dev] GC-parseable element atomic memcpy/memmove
TLDR: a proposal to add a GC-parseable lowering for the element atomic memcpy/memmove intrinsics, controlled by a new "requires-statepoint" call attribute.

Currently llvm.{memcpy|memmove}.element.unordered.atomic calls are treated as GC leaf functions (like most other intrinsics). As a result, a GC cannot occur while a copy operation is in progress, which can hurt GC latencies when large amounts of data are copied. To avoid this, a large copy can be done in chunks with GC safepoints in between. We'd like to be able to represent such a copy using the existing intrinsics [1].

For that I'd like to propose a new attribute on llvm.{memcpy|memmove}.element.unordered.atomic calls, "requires-statepoint". This attribute on a call will result in a different lowering, which makes it possible to take a GC safepoint during the copy operation.

There are three parts to the new lowering:

1) Calls with the new attribute will be wrapped into a statepoint by RewriteStatepointsForGC (RS4GC). This way the stack at these calls will be GC-parseable.

2) Currently these intrinsics are lowered to GC leaf calls to the symbols __llvm_{memcpy|memmove}_element_unordered_atomic_<element_size>. Calls with the new attribute will instead be lowered to calls to different symbols, say __llvm_{memcpy|memmove}_element_unordered_atomic_safepoint_<element_size>. This way the runtime can provide copy implementations with safepoints.

3) Currently memcpy/memmove calls take derived pointers as arguments. If we copy with safepoints, we might need to relocate the underlying source/destination objects at a safepoint, and for that we need to know the base pointers as well. How do we make the base pointers available in the copy routine? I suggest we add them explicitly as arguments during lowering. For example:

  __llvm_memcpy_element_unordered_atomic_safepoint_1(
      dest_base, dest_derived, src_base, src_derived, length)

It will be up to RS4GC to do the new lowering and prepare the arguments. RS4GC knows how to compute the base pointer for a given derived pointer. It also already does lowering for the deoptimize intrinsic by replacing an intrinsic call with a symbol call, so there is a precedent here.

Other alternatives:
- Change the llvm.{memcpy|memmove}.element.unordered.atomic API to accept base pointers + offsets instead of derived pointers. This would require an autoupgrade of the old representation. Changing the API of a generic intrinsic to facilitate a GC-specific lowering doesn't look like the best idea, and it would not work if we want to do the same for the non-atomic intrinsics.
- Teach the GC infrastructure to record base pointers for all derived pointer arguments. This looks like overkill for a single use case.

Here is the proposed implementation in a single patch: https://reviews.llvm.org/D87954
If there are no objections I will split it into individual reviews and add LangRef changes.

Thoughts?

Artur

[1] An alternative approach would be to make the frontend generate a chunked copy loop with a safepoint inside. The downsides are:
- It's harder for the optimizer to see that this loop is just a copy of a range of bytes.
- It forces one particular lowering, with the chunked loop inlined in compiled code. We can't outline the copy loop into the copy routine. With the intrinsic representation of a chunked copy we can choose different lowering strategies if we want.
- In our system we have to outline the copy loop into the copy routine due to interactions with deoptimization.
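To make the runtime side of (2) and (3) concrete, here is a rough C++ sketch of what a safepoint-aware copy routine with the proposed signature might look like. Only the symbol name and the argument order come from the proposal; the chunk size, the poll hook and its behaviour are invented for illustration, and a plain memcpy stands in for the per-element unordered-atomic copy.

  #include <algorithm>
  #include <cstddef>
  #include <cstring>

  // Hypothetical runtime hook: polls for a pending GC and, if one occurs, may
  // relocate the source/destination objects, updating all four pointers in place.
  extern "C" void gc_poll_and_update(char **dest_base, char **dest_derived,
                                     char **src_base, char **src_derived);

  // Illustrative only: the real routine is provided by the managed runtime.
  extern "C" void __llvm_memcpy_element_unordered_atomic_safepoint_1(
      char *dest_base, char *dest_derived,
      char *src_base, char *src_derived,
      std::size_t length) {
    const std::size_t ChunkBytes = 4096; // illustrative chunk size
    std::size_t copied = 0;
    while (copied < length) {
      std::size_t n = std::min(ChunkBytes, length - copied);
      // Element size 1: byte-wise copy of the current chunk.
      std::memcpy(dest_derived + copied, src_derived + copied, n);
      copied += n;
      if (copied < length)
        gc_poll_and_update(&dest_base, &dest_derived, &src_base, &src_derived);
    }
  }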
Artur Pilipenko via llvm-dev
2020-Sep-25 02:28 UTC
[llvm-dev] GC-parseable element atomic memcpy/memmove
Ping?

Artur
Philip Reames via llvm-dev
2020-Sep-28 17:56 UTC
[llvm-dev] GC-parseable element atomic memcpy/memmove
In general, I am supportive of this direction. It seems like an entirely reasonable solution. I do have some comments below, but they're mostly of the "how do we generalize this?" variety.

First, let's touch on the attribute. My first concern is naming; I think the use of "statepoint" here is problematic because the attribute isn't about the particular lowering strategy (i.e. statepoints), but about the conceptual requirement (i.e. a safepoint). This could be resolved by simply renaming it to "require-safepoint". But that brings us to a broader point. We've chosen to build in the assumption that intrinsics don't require safepoints. If all we want is for some intrinsics *to* require safepoints, why isn't this simply a tweak to the existing code? callsGCLeafFunction already has a small list of intrinsics which can have safepoints. I think you can completely remove the need for this attribute by a) adding the atomic memcpy variants to the exclude list in callsGCLeafFunction, and b) using the existing "gc-leaf-function" attribute on most of the calls the frontend generates.

Second, let's discuss the signature for the runtime function. I think you should use a signature for the runtime call which takes base pointers and offsets, not base pointers and derived pointers. Why? Because passing derived pointers in registers as arguments presumes that the runtime knows how to map a record in the stackmap to wherever the callee might have shuffled the argument. Some runtimes may support this, others may not. Given that the offset scheme is just as simple to implement, being considerate and minimizing the runtime support required seems worthwhile. On x86, the cost of a subtract (to produce the offset in the worst case) and an LEA (to produce the derived pointer again inside the runtime routine) is pretty minimal, particularly since the former is likely to be optimized away and the latter folded into the addressing mode.

Finally, it's also worth noting that some (but not all) GCs can convert an interior derived pointer to the base of the containing object, and with the memcpy family we know that either the pointers are all interior derived or the length must be zero. But since not all GCs can do this, we don't want to rely on it.

Philip
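To spell out the base + offset variant (same caveats as the earlier sketch; everything except the symbol name is invented for illustration): the caller passes offsets rather than derived pointers, so the stack map only needs to describe the two base pointers, and the routine re-derives its working pointers with a single add (an LEA on x86).

  #include <algorithm>
  #include <cstddef>
  #include <cstring>

  // Hypothetical poll, as in the earlier sketch; only the bases need to be
  // visible to (and updated by) the GC.
  extern "C" void gc_poll_and_update(char **dest_base, char **src_base);

  extern "C" void __llvm_memcpy_element_unordered_atomic_safepoint_1(
      char *dest_base, std::size_t dest_offset,
      char *src_base, std::size_t src_offset,
      std::size_t length) {
    const std::size_t ChunkBytes = 4096;
    std::size_t copied = 0;
    while (copied < length) {
      std::size_t n = std::min(ChunkBytes, length - copied);
      // Re-derive from base + offset each iteration, so relocation of the
      // bases at a poll is handled automatically.
      std::memcpy(dest_base + dest_offset + copied,
                  src_base + src_offset + copied, n);
      copied += n;
      if (copied < length)
        gc_poll_and_update(&dest_base, &src_base);
    }
  }

  // Caller side (conceptually emitted by RS4GC): the offsets are plain
  // pointer differences, a subtraction the optimizer can often fold away.
  //   std::size_t dest_offset = std::size_t(dest_derived - dest_base);
  //   std::size_t src_offset  = std::size_t(src_derived  - src_base);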
Artur Pilipenko via llvm-dev
2020-Sep-30 04:11 UTC
[llvm-dev] GC-parseable element atomic memcpy/memmove
Thanks for the feedback. I think both of the suggestions are very reasonable. I'll incorporate them.

Given there were no objections for two weeks, I'm going to go ahead with posting individual patches for review.

One small question inline:

On Sep 28, 2020, at 10:56 AM, Philip Reames <listmail at philipreames.com> wrote:

> Second, let's discuss the signature for the runtime function. I think you
> should use a signature for the runtime call which takes base pointers and
> offsets, not base pointers and derived pointers. [...]

Do you think it makes sense to control this aspect of the lowering (derived pointers vs base+offset in the memcpy args) using GCStrategy?

Artur
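For what it's worth, a purely hypothetical sketch of what that could look like; llvm::GCStrategy has no such property today, the subclass and method names are made up, and the header has lived under llvm/CodeGen/ in some older releases.

  #include "llvm/IR/GCStrategy.h"

  // Hypothetical sketch only. UseStatepoints is an existing GCStrategy knob
  // consulted by RS4GC; the query below does not exist in LLVM and merely
  // illustrates letting the GC strategy, rather than an attribute or flag,
  // pick between (base, derived) and (base, offset) arguments for the
  // safepoint-aware copy routines.
  class ExampleGCStrategy : public llvm::GCStrategy {
  public:
    ExampleGCStrategy() { UseStatepoints = true; }
    bool prefersBasePlusOffsetCopyArgs() const { return true; }
  };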