Max Kazantsev via llvm-dev
2021-Aug-06 10:16 UTC
[llvm-dev] Element atomic vector stores: just do it?
Hi everyone,
Recently I started working on optimizing the
@llvm.memcpy.element.unordered.atomic intrinsic to make it available and
widely usable. Currently, LLVM provides a LangRef description for it, but no
baked-in lowering. To change this situation, some steps need to be taken.
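For reference, the intrinsic takes a destination, a source, a length in bytes,
and a constant element size, with mandatory align attributes on both pointers.
A call looks roughly like this (a sketch; names are illustrative):
declare void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i64(i8*, i8*, i64, i32)
; copy %len bytes as unordered-atomic 4-byte elements; the align attributes
; are required and must be no smaller than element_size (here 4)
call void @llvm.memcpy.element.unordered.atomic.p0i8.p0i8.i64(i8* align 4 %dst, i8* align 4 %src, i64 %len, i32 4)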
The obvious way to lower @llvm.memcpy.element.unordered.atomic is something like:
for (i = 0; i < len; i++) {
  v = load atomic src[i]
  store atomic v, dest[i]
}
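In IR, the body of that loop is just a pair of unordered atomic accesses per
element. A minimal sketch, assuming element_size == 4 and illustrative names:
; %src.i and %dst.i point at the current i32 element
%v = load atomic i32, i32* %src.i unordered, align 4
store atomic i32 %v, i32* %dst.i unordered, align 4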
But this naive per-element loop is painfully slow compared to efficient
implementations of regular memcpy. What we would really like to do is:
for (i = 0; i < len; i += stride) {
  vector_v = load element atomic <stride x i32> src[i]
  store element atomic vector_v, dest[i]
}
However, there is currently no way to express this concept in LLVM IR. What we
can do is:
for (i = 0; i < len; i += stride) {
  vector_v = load <stride x i32> src[i]
  store vector_v, dest[i]
}
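In IR, one iteration of that loop is just an ordinary, non-atomic vector
load/store pair, e.g. for a stride of 8 i32 elements (again, names are
illustrative):
; plain 32-byte vector copy; nothing here promises the hardware access
; will not be split
%vec = load <8 x i32>, <8 x i32>* %src.i, align 32
store <8 x i32> %vec, <8 x i32>* %dst.i, align 32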
If the platform supports vectors of that width, we can hope (but only hope!)
that this will be lowered into the corresponding vector loads and stores, such
as xmm/ymm loads/stores on X86. But:
* There is no guarantee of that. An IR pass may, in theory, split the
vectors however it pleases, because there is no atomicity requirement.
  * I don't think any pass actually does this today, but they have the
right to.
* Even if it is lowered into xmm/ymm operations or something similar, there is
no guarantee of atomicity for xmm/ymm accesses. So even if this code is
lowered into the corresponding ymm stores, the whole store might not be atomic
(especially if it crosses a cache-line boundary).
* At the codegen level, an xmm/ymm load/store may be torn. For example, I
found this pass, whose comment says:
https://github.com/llvm-mirror/llvm/blob/master/lib/Target/X86/X86AvoidStoreForwardingBlocks.cpp
// The pass currently only handles cases where memcpy is lowered to
// XMM/YMM registers, it tries to break the memcpy into smaller copies.
// breaking the memcpy should be possible since there is no atomicity
// guarantee for loads and stores to XMM/YMM.
So the splitting of non-atomic xmm/ymm loads and stores is a real thing that
has to be accounted for.
However, there is a silver lining. The specification of
@llvm.memcpy.element.unordered.atomic says:
> For each of the input pointers align parameter attribute must be specified.
It must be a power of two no less than the element_size. Caller guarantees that
both the source and destination pointers are aligned to that boundary.
So if we somehow enforce lowering of the vector loads/stores into
hardware-supported operations and prohibit other passes from tearing them
apart, we will have xmm/ymm accesses aligned to some basic type (say i32). It
is widely believed that on X86, although xmm/ymm stores are not atomic as a
whole, they do not tear individual words as long as those words are aligned to
their own width (please correct me if that's not true!). So by simply
enforcing a straightforward lowering of such accesses into xmm/ymm
loads/stores, and prohibiting all passes (including codegen) from touching
them, we should get the desired element-atomic behavior on X86.
This might not be true for other platforms, but we should start somewhere.
Proposal
We could add another flag to load/store instructions, similar to atomic,
meaning element-atomic, which would only matter for vector loads and stores.
We could guarantee that such loads and stores only survive to codegen on
platforms that support them. We would also go and update all passes that can
potentially tear vectors (I don't think there are many) and prohibit them from
touching these loads and stores. And we would also need an assert (on X86)
that the pointers are properly aligned.
It doesn't look like a very hard task (the only hard part is finding all such
places), but maybe there are pitfalls I don't know about.
Please share your opinion on how best to do this, and on what problems you
anticipate.
Thanks,
Max
Steve (Numerics) Canon via llvm-dev
2021-Aug-06 12:54 UTC
[llvm-dev] Element atomic vector stores: just do it?
> On Aug 6, 2021, at 6:16 AM, Max Kazantsev via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> So if we somehow enforce lowering of the vector stores into hardware supported operations and prohibit other passes from tearing it apart, we’ll have ymm loads aligned by some basic type (say i32). It’s a widely known that on X86, despite xmm/ymm stores are not atomic, they don’t tear words if they are aligned by the width of the word (please correct me if it’s not true!).

As a practical matter of implementation, it is hard to imagine that an
implementation wouldn’t use at least word-size aligned accesses if it split an
SSE/AVX op, and as far as I know every actual implementation has done so, but
this is not actually guaranteed by the x86 ISA, nor by Intel’s or AMD’s
documentation.

– Steve
Artur Pilipenko via llvm-dev
2021-Aug-07 02:00 UTC
[llvm-dev] Element atomic vector stores: just do it?
I guess there are two separate questions here.
1) Currently element atomic memcpys are lowered to calls to predefined symbols.
If we want to provide an efficient lowering instead, how do we do this?
We can follow the same scheme as the regular memcpy. There is a target-specific
hook in SelectionDAGTargetInfo::EmitTargetCodeForMemcpy. We can introduce a
similar hook for element-atomic memcpy. I’m not sure though how hard it would be
to emit a vectorized loop in SelectionDAG representation.
We can have a middle-end pass which does the lowering instead. We can either
emit a hand-rolled vectorized loop or we can emit a naive loop and rely on the
vectorizer to optimize it.
If we do the middle-end lowering, we would need to think about interactions
with the backend optimizations for element-atomic memcpys. E.g. an
element-atomic memcpy with a small constant length is currently converted into
a series of loads and stores in the backend (roughly as sketched below). If
you rewrite all element-atomic memcpys into vectorized loops in the
middle-end, you might obscure this optimization.
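For example (a rough sketch, not the exact backend output), a copy with
len = 8 and element_size = 4 ends up as something like:
; assuming %src and %dst are already i32 pointers
%v0 = load atomic i32, i32* %src unordered, align 4
store atomic i32 %v0, i32* %dst unordered, align 4
%src.1 = getelementptr i32, i32* %src, i64 1
%dst.1 = getelementptr i32, i32* %dst, i64 1
%v1 = load atomic i32, i32* %src.1 unordered, align 4
store atomic i32 %v1, i32* %dst.1 unordered, align 4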
2) If we decide to do the middle-end lowering we would need to solve the element
atomic vector problem. If we emit a naive loop today, it won’t be vectorized
because an element-wise atomic vector can’t be represented in the IR. If we
decide to write a hand-rolled vectorized loop we will have the same issue.
Here is some previous discussion on the topic of introducing element-wise atomic
vectors.
https://lists.llvm.org/pipermail/llvm-dev/2015-December/093237.html
Philip might have better insight on what it will take to introduce this
construct, since he attempted it back then.
Artur