Anders Waldenborg via llvm-dev
2017-Apr-19 14:51 UTC
[llvm-dev] Help needed on vectorization strangeness and stuff
I happened to have a program that seemed to compile into quite non-optimal machine code and some spare time, so I decided this was a good opportunity to learn more about optimization passes.

Now I think I figured out what is going on - but I'm stuck and would appreciate some help on how to continue. Please check that my conclusions are correct and answer my questions towards the end - or tell me that I'm asking the wrong questions. Or just tell me what I can do to fix the bug.

Given this simple program:

// ---8<---------- [interesting.c]
unsigned char dst[DEFINEME] __attribute__((aligned (64)));
unsigned char src[DEFINEME] __attribute__((aligned (64)));

void copy_7bits(void)
{
    for (int i = 0; i < DEFINEME; i++)
        dst[i] = src[i] & 0x7f;
}
// ---8<----------------------------------

compiled with:

    clang -march=haswell -O3 -S -o - interesting.c -DDEFINEME=160

it generates some interesting stuff which basically amounts to:

    vmovaps .LCPI0_0(%rip), %ymm0   # ymm0 = [127,....,127]
    vandps  src(%rip), %ymm0, %ymm1
    vandps  src+32(%rip), %ymm0, %ymm2
    vandps  src+64(%rip), %ymm0, %ymm3
    vandps  src+96(%rip), %ymm0, %ymm0
    vmovaps %ymm1, dst(%rip)
    vmovaps %ymm2, dst+32(%rip)
    vmovaps %ymm3, dst+64(%rip)
    vmovaps %ymm0, dst+96(%rip)

This looks ok, and with -DDEFINEME=128 this is the actual result. But now I compiled with -DDEFINEME=160, so there are another 32 bytes to be processed. The code for those looks like this:

    movb    src+128(%rip), %al
    andb    $127, %al
    movb    %al, dst+128(%rip)
    movb    src+129(%rip), %al
    andb    $127, %al
    movb    %al, dst+129(%rip)
    ....
    .... Guess I don't need to show 87 more instructions
    .... here to get my point across
    ....
    movb    src+158(%rip), %al
    andb    $127, %al
    movb    %al, dst+158(%rip)
    movb    src+159(%rip), %al
    andb    $127, %al
    movb    %al, dst+159(%rip)

From what I can tell, the loop vectorizer comes to the conclusion that it is a good idea to interleave the loop 4 times.
As the loop has a trip count of 160, which is 5 trips after vectorization, the 4x interleaved loop covers 4 of them and leaves a remainder of 32 scalar trips which does not get vectorized. This remainder then gets unrolled in a later unrolling pass.

* Question on interleaving

TinyTripCountInterleaveThreshold: what is the reasoning behind this 128? Is this measured in number of trips before vectorization? Shouldn't it be the number of trips after vectorization, as that is the trip count that is relevant after vectorization? E.g:

--- a/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -6382,7 +6382,7 @@ unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize,
   // Do not interleave loops with a relatively small trip count.
   unsigned TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
-  if (TC > 1 && TC < TinyTripCountInterleaveThreshold)
+  if (TC > 1 && (TC / VF) < TinyTripCountInterleaveThreshold)
     return 1;
   unsigned TargetNumRegisters = TTI.getNumberOfRegisters(VF > 1);
---

This patch makes the generated code for the example with DEFINEME=160 look good: fully vectorized and unrolled. However, it obviously doesn't help for cases like DEFINEME=4128, where it generates a vectorized loop with an interleaving of 4 which covers the first 4096 iterations and then leaves the remaining 32 as scalar byte operations.

* Question on how to fix it

Where should the remainder be vectorized? Is the loop vectorizer supposed to leave the tail unvectorized even if it has a known trip count? Does it leave it there in the hope that the slp or bb vectorizer will pick it up after it has been unrolled?

I did an experiment and removed the 'Hints.setAlreadyVectorized' call on the remainder loop (= the original loop) and added an extra instance of the loop vectorizer just before the unroll pass. That does nicely vectorize the remainder loop.
(It does redundantly load the mask again from memory into the register it already has it in - but I guess that would be cleaned up by some other pass.)

* Question on fishy codegen on memcpy

By removing the '& 0x7f' part - i.e. making it a memcpy - I get even more surprising effects in codegen. (I need to compile with -fno-builtin for this, as the loop idiom finder nicely detects it as a memcpy otherwise. Here could be a question 3b - what are the exact semantics of -fno-builtin? I would expect -fno-builtin to mean don't mess with my "memcpy" - don't add or remove any memcpy calls. But I would also expect it to still recognize the loop as a memcpy, as long as the llvm.memcpy intrinsic is not lowered to a call to library memcpy.)

Now (again with an original trip count of 160) this is generated for the tail:

    movzwl  src+144(%rip), %eax
    movw    %ax, dst+144(%rip)
    movl    src+146(%rip), %eax
    movl    %eax, dst+146(%rip)
    movzwl  src+150(%rip), %eax
    movw    %ax, dst+150(%rip)
    movzwl  src+152(%rip), %eax
    movw    %ax, dst+152(%rip)
    movzwl  src+154(%rip), %eax
    movw    %ax, dst+154(%rip)
    movzwl  src+156(%rip), %eax
    movw    %ax, dst+156(%rip)
    movb    src+158(%rip), %al
    movb    %al, dst+158(%rip)
    movb    src+159(%rip), %al
    movb    %al, dst+159(%rip)

So the backend combines the movb's, but in a very strange way. It starts with a 16-bit move, then a 32-bit move, which thereby ends up unaligned, and the two final ones are not combined.

Thanks for reading all the way here.