thr3ads.net - llvm dev - [llvm-dev] Vectorization of single loop with AoSoA layout [Feb 2021]

Bernhard Manfred Gruber via llvm-dev

2021-Feb-24 17:43 UTC

[llvm-dev] Vectorization of single loop with AoSoA layout

Hi everybody!

I have a question for the vectorization experts and would like to ask for
some insight please.

I am working on an LLVM-independent library that offers various memory
layouts for arrays of plain structs in C++. One of these layouts is an
AoSoA (Array of Struct of Arrays). E.g.:
constexpr auto lanes = 8;
struct Block {
    float a[lanes];
    float b[lanes];
    float c[lanes];
};

Single loops that iterate over arrays of this layout fail to vectorize with
clang/LLVM (also with recent g++, icc and MSVC).
E.g. adding the vectors of floats a and b into c, where a, b and c are
stored in one memory block as AoSoA:
constexpr auto alignment = lanes * sizeof(float);
void aosoa1(Block* ubuf, size_t n) {
    auto* buf = std::assume_aligned<alignment>(ubuf);
    for (size_t i = 0; i < (n/lanes)*lanes; i++) {
        const auto block = i / lanes;
        const auto lane = i % lanes;
        buf[block].c[lane] = buf[block].a[lane] + buf[block].b[lane];
    }
}
Flags for clang: -std=c++20 -O3 -mavx2 -Rpass-analysis=loop-vectorize
-Rpass-missed=loop-vectorize
clang gives me this remark: "loop not vectorized: cannot identify array
bounds [-Rpass-analysis=loop-vectorize]". I tried browsing through the LLVM
source to figure out whether I could get it working, but that quickly went
over my head :)

With two nested loops, the inner one vectorizes fine:
void aosoa2(Block* ubuf, size_t n) {
    auto* buf = std::assume_aligned<alignment>(ubuf);
    for (size_t block = 0; block < n/lanes; block++) {
        for (size_t lane = 0; lane < lanes; lane++) {
            buf[block].c[lane] = buf[block].a[lane] + buf[block].b[lane];
        }
    }
}
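
For reference, a minimal sketch that checks that the flat-index loop and the nested loops compute the same result. The names add_flat, add_nested and same_result are made up for this illustration, and the std::assume_aligned hint is left out so the snippet does not depend on C++20; it only demonstrates that the two loop shapes are equivalent, not how either one is vectorized.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t lanes = 8;

struct Block {
    float a[lanes];
    float b[lanes];
    float c[lanes];
};

// Flat index split into block and lane, same shape as aosoa1().
void add_flat(Block* buf, std::size_t n) {
    for (std::size_t i = 0; i < (n / lanes) * lanes; i++) {
        const auto block = i / lanes;
        const auto lane = i % lanes;
        buf[block].c[lane] = buf[block].a[lane] + buf[block].b[lane];
    }
}

// Explicit block and lane loops, same shape as aosoa2().
void add_nested(Block* buf, std::size_t n) {
    for (std::size_t block = 0; block < n / lanes; block++) {
        for (std::size_t lane = 0; lane < lanes; lane++) {
            buf[block].c[lane] = buf[block].a[lane] + buf[block].b[lane];
        }
    }
}

// Hypothetical helper: runs both variants on identical input and
// returns true if every c value matches.
bool same_result(std::size_t n) {
    const std::size_t blocks = n / lanes + 1; // one spare block for the tail
    std::vector<Block> x(blocks);
    for (std::size_t i = 0; i < blocks * lanes; i++) {
        x[i / lanes].a[i % lanes] = static_cast<float>(i);
        x[i / lanes].b[i % lanes] = 2.0f * static_cast<float>(i);
        x[i / lanes].c[i % lanes] = -1.0f;
    }
    std::vector<Block> y = x;

    add_flat(x.data(), n);
    add_nested(y.data(), n);

    for (std::size_t i = 0; i < blocks * lanes; i++) {
        if (x[i / lanes].c[i % lanes] != y[i / lanes].c[i % lanes]) {
            return false;
        }
    }
    return true;
}
```

Both variants only touch whole blocks, so for an n that is not a multiple of lanes the trailing elements are left unchanged by either loop.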

Full example: https://godbolt.org/z/qdG9aY

Why does clang/LLVM fail to vectorize the loop in aosoa1() which splits the
loop index into block and lane index? I think I do not sufficiently
understand the "cannot identify array bounds" remark.
Is vectorization theoretically possible for aosoa1()? That is, is there no
fundamental reason that forbids it?
Is there a workaround for clang, like a #pragma, that can be used to allow
clang to vectorize aosoa1()?
Would this use case be important enough that clang/LLVM could at some point
recognize such a pattern and successfully vectorize it?
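
To make the workaround question concrete: one library-side option, sketched here under assumptions, is to keep a per-element loop body but let a helper run it in the nested form of aosoa2(). The names for_each_element and aosoa_add are hypothetical and not part of any existing library, and I have not verified that clang vectorizes the inner loop once the lambda is inlined; the sketch only shows the shape of the idea.

```cpp
#include <cstddef>

constexpr std::size_t lanes = 8;

struct Block {
    float a[lanes];
    float b[lanes];
    float c[lanes];
};

// Hypothetical helper: calls f(block, lane) for every element that lies
// in a whole block, using the nested loop shape of aosoa2().
template <typename F>
void for_each_element(std::size_t n, F&& f) {
    for (std::size_t block = 0; block < n / lanes; block++) {
        for (std::size_t lane = 0; lane < lanes; lane++) {
            f(block, lane);
        }
    }
}

// The per-element body is written once, without any index arithmetic.
void aosoa_add(Block* buf, std::size_t n) {
    for_each_element(n, [buf](std::size_t block, std::size_t lane) {
        buf[block].c[lane] = buf[block].a[lane] + buf[block].b[lane];
    });
}
```

This avoids the division and modulo on the loop index entirely, at the cost of no longer exposing a single flat loop to the user.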

I really appreciate your input here! Thank you very much!

Bernhard