Anders Waldenborg via llvm-dev
2017-Apr-19 14:51 UTC
[llvm-dev] Help needed on vectorization strangeness and stuff
I happened to have a program that seemed to compile into quite non-optimal machine code, and some spare time, so I decided this was a good opportunity to learn more about optimization passes. I now think I have figured out what is going on - but I'm stuck and would appreciate some help on how to continue. Please check that my conclusions are correct and answer my questions towards the end - or tell me that I'm asking the wrong questions. Or just tell me what I can do to fix the bug.

Given this simple program:

// ---8<---------- [interesting.c]
unsigned char dst[DEFINEME] __attribute__((aligned (64)));
unsigned char src[DEFINEME] __attribute__((aligned (64)));

void copy_7bits(void)
{
        for (int i = 0; i < DEFINEME; i++)
                dst[i] = src[i] & 0x7f;
}
// ---8<----------------------------------

compiled with:

  clang -march=haswell -O3 -S -o - interesting.c -DDEFINEME=160

it generates some interesting stuff which basically amounts to:

        vmovaps .LCPI0_0(%rip), %ymm0   # ymm0 = [127,....,127]
        vandps  src(%rip), %ymm0, %ymm1
        vandps  src+32(%rip), %ymm0, %ymm2
        vandps  src+64(%rip), %ymm0, %ymm3
        vandps  src+96(%rip), %ymm0, %ymm0
        vmovaps %ymm1, dst(%rip)
        vmovaps %ymm2, dst+32(%rip)
        vmovaps %ymm3, dst+64(%rip)
        vmovaps %ymm0, dst+96(%rip)

This looks OK, and with -DDEFINEME=128 it is the complete result. But I compiled with -DDEFINEME=160, so there are another 32 bytes to be processed, and they come out like this:

        movb    src+128(%rip), %al
        andb    $127, %al
        movb    %al, dst+128(%rip)
        movb    src+129(%rip), %al
        andb    $127, %al
        movb    %al, dst+129(%rip)
        ....
        .... (guess I don't need to show 87 more instructions
        ....  here to get my point across)
        ....
        movb    src+158(%rip), %al
        andb    $127, %al
        movb    %al, dst+158(%rip)
        movb    src+159(%rip), %al
        andb    $127, %al
        movb    %al, dst+159(%rip)

From what I can tell, the loop vectorizer comes to the conclusion that it is a good idea to interleave the loop 4 times. The loop has a trip count of 160, which is 5 trips after vectorization (VF = 32), so the interleaved vector body covers 4 of those 5 trips and leaves a remainder of 32 scalar iterations that does not get vectorized. This remainder then gets fully unrolled by a later unrolling pass.

* Question on interleaving

What is the reasoning behind TinyTripCountInterleaveThreshold being 128? The comparison uses the number of trips before vectorization - shouldn't it use the number of trips after vectorization, as that is the trip count that is relevant once the loop is vectorized? E.g.:

--- a/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6382,7 +6382,7 @@ unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize,
   // Do not interleave loops with a relatively small trip count.
   unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
-  if (TC > 1 && TC < TinyTripCountInterleaveThreshold)
+  if (TC > 1 && (TC / VF) < TinyTripCountInterleaveThreshold)
     return 1;

   unsigned TargetNumRegisters = TTI.getNumberOfRegisters(VF > 1);
---

This patch makes the generated code for the example with DEFINEME=160 look good - fully vectorized and unrolled. However, it obviously doesn't help for cases like DEFINEME=4128, where it generates a vectorized loop with an interleave count of 4 covering the first 4096 iterations, and then leaves the remaining 32 bytes as scalars.

* Question on how to fix it

Where should the remainder be vectorized? Is the loop vectorizer supposed to leave the tail unvectorized even when it has a known trip count? Does it leave it there in the hope that the SLP or BB vectorizer will pick it up after it has been unrolled?
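To make the trip-count arithmetic concrete, here is a rough C sketch of the structure I believe the vectorizer produces for DEFINEME=160. This is my own reconstruction, not its actual output: the function name is made up, the AVX2 intrinsics stand in for generated IR, and it assumes the dst/src/DEFINEME definitions from interesting.c above:

#include <immintrin.h>

void copy_7bits_sketch(void)
{
        const __m256i mask = _mm256_set1_epi8(0x7f);
        int i = 0;

        /* Vector body: VF = 32 bytes per ymm, interleaved 4x, so it
           consumes 128 bytes per iteration. With a trip count of 160
           it therefore runs exactly once. */
        for (; i + 128 <= DEFINEME; i += 128)
                for (int j = 0; j < 4; j++) {
                        __m256i v = _mm256_load_si256(
                                (const __m256i *)&src[i + 32 * j]);
                        _mm256_store_si256(
                                (__m256i *)&dst[i + 32 * j],
                                _mm256_and_si256(v, mask));
                }

        /* Scalar remainder: the 32 leftover iterations, which the later
           unrolling pass expands into the movb/andb/movb sequence shown
           above. */
        for (; i < DEFINEME; i++)
                dst[i] = src[i] & 0x7f;
}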
I did an experiment and removed the 'Hints.setAlreadyVectorized' call on the remainder loop (= the original loop) and added an extra instance of the loop vectorizer just before the unroll pass. That does nicely vectorize the remainder loop. (It redundantly loads the mask from memory again into the register that already holds it - but I guess that would be cleaned up by some other pass.)

* Question on fishy codegen on memcpy

By removing the '& 0x7f' part - i.e. making it a memcpy - I get even more surprising effects in codegen. (I need to compile with -fno-builtin for this, as the loop idiom recognizer otherwise nicely detects it as a memcpy.

(Here could be a question 3b: what are the exact semantics of -fno-builtin? I would expect -fno-builtin to mean "don't mess with my memcpy" - don't add or remove any memcpy calls. But I would also expect it to still recognize the loop as a memcpy, as long as the llvm.memcpy intrinsic is not lowered to a call to the library memcpy.))

Again with an original trip count of 160, this is now generated for the tail:

        movzwl  src+144(%rip), %eax
        movw    %ax, dst+144(%rip)
        movl    src+146(%rip), %eax
        movl    %eax, dst+146(%rip)
        movzwl  src+150(%rip), %eax
        movw    %ax, dst+150(%rip)
        movzwl  src+152(%rip), %eax
        movw    %ax, dst+152(%rip)
        movzwl  src+154(%rip), %eax
        movw    %ax, dst+154(%rip)
        movzwl  src+156(%rip), %eax
        movw    %ax, dst+156(%rip)
        movb    src+158(%rip), %al
        movb    %al, dst+158(%rip)
        movb    src+159(%rip), %al
        movb    %al, dst+159(%rip)

So the backend does combine the movb's, but in a very strange way: it starts with a 16-bit move, then a 32-bit move (which leaves everything after it unaligned), and the two final byte moves are not combined at all.

Thanks for reading all the way here.
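PS. For reference, the memcpy-style variant from the last experiment is just the same program with the mask dropped. This is my reconstruction from the description above; the file and function names are made up:

// ---8<---------- [interesting_memcpy.c]
unsigned char dst[DEFINEME] __attribute__((aligned (64)));
unsigned char src[DEFINEME] __attribute__((aligned (64)));

void copy_bytes(void)
{
        for (int i = 0; i < DEFINEME; i++)
                dst[i] = src[i];
}
// ---8<----------------------------------

compiled with:

  clang -march=haswell -O3 -fno-builtin -S -o - interesting_memcpy.c -DDEFINEME=160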