thr3ads.net - llvm dev - [llvm-dev] Document to understand vectorized code [Jun 2021]

If this information is useful, please help other people find it:
Share via:

Sudakshina Dutta via llvm-dev

2021-Jun-08 12:14 UTC

[llvm-dev] Document to understand vectorized code

Dear Stefanos,

I shall go through this thoroughly and let you know.

Thanks a lot.
Sudakshina

On Mon, Jun 7, 2021 at 10:57 PM Stefanos Baziotis <
stefanos.baziotis at gmail.com> wrote:
> I think that you're interested in a more general tutorial on how to
> vectorize code. I'm sorry I'm going to plug in my own articles [1]
[2], but
> I've just tried to write a short version here and I basically almost
> rewrote them so there was
> no point. For your case, I'd read at least the first half of
"Residual
> Iterations" from Part 1 and most of part 2. The concept of residual
> iterations and reductions is something you'll see over and over and it
> happens
> in your example too. Let me explain the use of shufflevector pretty quick
> and then I'll explain some specifics in your example. Also, before we
move
> any further, I recommend that we turn off loop unrolling in your
> example because it doesn't add educational value and just clutters the
> code, so here's your updated version [3].
>
> --- Shufflevector ---
>
> If you are not sure whether you understand what the instruction alone
> does, let me know. You can also play with this LLVM IR [4]. I recommend
> that you _run_ this code and change the mask in line 24.
> The reason it's used in your example (see [3]) is because the compiler
> wants to create a vector which has `arr[0]` (referring to the C++ entity)
> in all its lanes. In line 4 (see [2]) it's loaded into %0, in line 17
it
> inserts
> it only to the 0th (first) lane of a vector (and for the time being, we
> don't care about the other lanes which is why the first arg to
> `insertelement` is `poison`) and in line 18, it shuffles the value. The
> shuffle mask is the vector [0, 0, 0, 0]
> (in short, zeroinitializer) which means in the 0th lane of the _result_
> vector (%minmax.ident.splat) insert the value of the 0th lane of the
> _input_ vector (%minmax.ident.splatinsert). In the 1st lane of the _result_
> vector, insert the value of the 0th lane of the _input_ vector etc. So,
> all lanes of the result vector get the 0th lane of the input vector.
>
> --- Residual Iterations ---
>
> Assuming that you have understood the use of residual iterations, let's
> see where it happens in your example. First of all, notice that in `entry`,
> there is a check whether `n` is bigger than 1, because if it's not,
there
> are no
> iterations to do at all and in this case, the result, i.e., `max`, is
> `arr[0]`, which is why it's loaded so early. Check the phi at line 45.
> Then, if we make it to `for.body.preheader`, there is a check whether we
> have at least
> 4 iterations (the fact that this loop starts from one makes the
> computations somewhat more complicated but not a lot). Because if we
don't,
> then there's no point going into the vectorized version. Then, skipping
> what the vectorized
> code does exactly for now (but check how the iteration count is rounded
> down to the nearest multiple of 4 in line 15), check that _after_ the
> vectorized code, the flow moves to `middle.block`,
> which essentially checks whether there are any more iterations to do and
> if so, jumps to the serial version of the loop (this is where the residual
> iterations happen).
>
> --- Reduction ---
>
> In the article, I explain reductions using accumulation as an example, but
> in your example there's another reduction, namely, finding the maximum
> element. Notice that this is also a reduction: new_value >
max_of(old_value, data).
> The reduction variable (acting as both old and new value) is `max` in your
> example. The operation is max_of and the continuously new data is arr[i].
>
> Finding the maximum element is a commutative operation. For instance, to
> find the maximum element among 4, you can split them in two pairs _any_ way
> you want, find the maximum of each pair _independently_
> of the other pair (those two maximums are partial results and since their
> computation is independent of each other, it can be done in parallel), and
> finally find the maximum among these two maximums.
>
> What happens in the vectorized code is that each lane finds a partial
> maximum. In line 27 it loads 4 elements and in the next two lines, it
> "compares" them with the previous 4 elements (%vec.phi, which
keeps the 4
> partial maximums which are updated
> in every iteration). It computes a vector which has 4 maximums, one for
> each pair of lanes (between the new 4 elements in %wide.load, in
> %wide.load, and %vec.phi). You might want to run an example in paper to see
> the effect. Consider that you want to compute
> the maximum of 9 numbers, in this order: 2, 10, 7, 3, 1, 8, 9, 11, 4
>
> First, you get 2 (arr[0]) and broadcast it in a vector: <2, 2, 2, 2>.
This
> is the starting value of %vec.phi. Then, you load 4 values: %wide.load >
<10, 7, 3, 1>
> Then, you do an element-wise comparison of the two and for each pair of
> lanes, you keep the maximum. In this case, 10 > 2, 7 > 2, 3 > 2
but 1 < 2.
> So, %5 = <10, 7, 3, 2>, which becomes the new %vec.phi.
> Next iteration, you load another 4: %wide.load = <8, 9, 11, 4> and
you
> compare with (the new) %vec.phi: 8 < 10, 9 > 7, 11 > 3, 4 > 2.
So, %5 > <10, 9, 11, 4>. Now, you have computed 4 partial maximums
> and you only need to find the maximum among those 4 to find the total
> maximum. That happens in line 35 and you're done.
>
> Hope this helps! Let me know if there are any questions.
>
> Best,
> Stefanos
>
> [1]
>
http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-1.html
> [2]
>
http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-2.html
> [3] https://godbolt.org/z/nMe9a6YWa
> [4] https://godbolt.org/z/Wf9fTq53G
>
> Στις Δευ, 7 Ιουν 2021 στις 8:49 π.μ., ο/η Craig Topper <
> craig.topper at gmail.com> έγραψε:
>
>> Shufflevector is a powerful instruction that can be used to copy a
scalar
>> to every elements of a vector, concatenate smaller vectors into a
larger
>> vector, rearrange elements one or two vectors, etc. Without example
code I
>> can't say for sure what the use of shufflevector in your case is
for.
>>
>> ~Craig
>>
>>
>> On Sun, Jun 6, 2021 at 10:44 PM Sudakshina Dutta <sudakshina at
iitgoa.ac.in>
>> wrote:
>>
>>> Dear Craig,
>>>
>>> Thanks for your reply. It seems the control flow of the vectorized
>>> program is very different from that of the source code. Is there
any
>>> document describing difference in control flow, the reason for
using
>>> shufflevector, etc ?
>>>
>>> Sudakshina
>>>
>>> On Mon, 7 Jun 2021, 07:41 Craig Topper, <craig.topper at
gmail.com> wrote:
>>>
>>>> There's some notes here
https://llvm.org/docs/Vectorizers.html but I
>>>> didn't look too closely at it.  If you compile with
-fno-discard-values
>>>> names or use a debug build, some of the basic blocks will be
named similar
>>>> to the CFG diagram in this section
>>>> https://llvm.org/docs/Vectorizers.html#epilogue-vectorization
>>>>
>>>> ~Craig
>>>>
>>>>
>>>> On Sun, Jun 6, 2021 at 7:04 PM Sudakshina Dutta via llvm-dev
<
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> Dear Stefanos,
>>>>>
>>>>> I want to understand how the generated code works. For
example, the
>>>>> code that I generated using -Rpass=loop-vectorize
>>>>> -Rpass-analysis=loop-vectorize has many shufflevector and
other
>>>>> instructions. I wanted to have a document (if there exists
any) where an
>>>>> example on loop-vectorization is done with some explanation
on the workflow
>>>>> of the code.
>>>>>
>>>>> Thanks,
>>>>> Sudakshina
>>>>>
>>>>> On Mon, Jun 7, 2021 at 7:03 AM Stefanos Baziotis <
>>>>> stefanos.baziotis at gmail.com> wrote:
>>>>>
>>>>>> Hi Sudakshina,
>>>>>>
>>>>>> First, it helps if you can put your code in a godbolt
snippet, like
>>>>>> this [1]. It helps people in multiple ways (e.g., they
don't have to
>>>>>> download files, they can see exactly what cmd arguments
you used, they can
>>>>>> tweak the cmd arguments without having LLVM on their
machine etc.).
>>>>>>
>>>>>> Is there any comprehensive tutorial/document to
understand generated
>>>>>>> instructions or the semantics of the vectorized
code ?
>>>>>>
>>>>>>
>>>>>> This is quite generic, what is more specifically that
you want to
>>>>>> understand? Do you want to understand what each
individual instruction
>>>>>> does? Do you maybe understand that but
>>>>>> you don't know what is the general method to
generate, let's say by
>>>>>> hand, vectorized code (or more specifically, branching
vectorized code). Or
>>>>>> maybe, you want to understand
>>>>>> how _LLVM_ generates this code, i.e., the inner
workings of the
>>>>>> vectorization passes.
>>>>>>
>>>>>> Best,
>>>>>> Stefanos
>>>>>>
>>>>>> [1] https://godbolt.org/z/8eKqnrMPn
>>>>>>
>>>>>> Στις Κυρ, 6 Ιουν 2021 στις 6:17 π.μ., ο/η Sudakshina
Dutta via
>>>>>> llvm-dev <llvm-dev at lists.llvm.org> έγραψε:
>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> Greetings. I have generated a vectorized code from
a C source file
>>>>>>> (attached). Is there any comprehensive
tutorial/document to understand
>>>>>>> generated instructions or the semantics of the
vectorized code ?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sudakshina
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>
>>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210608/1ed92efe/attachment.html>

Sudakshina Dutta via llvm-dev

2021-Jun-14 13:16 UTC

head link

[llvm-dev] Document to understand vectorized code

Dear Stefanos,

Sorry for getting back to you so late. Can you help with some document
(other than language reference manual) the purpose of using shufflevector ?

Thanks,
Sudakshina

On Tue, Jun 8, 2021 at 5:44 PM Sudakshina Dutta <sudakshina at
iitgoa.ac.in>
wrote:
> Dear Stefanos,
>
> I shall go through this thoroughly and let you know.
>
> Thanks a lot.
> Sudakshina
>
> On Mon, Jun 7, 2021 at 10:57 PM Stefanos Baziotis <
> stefanos.baziotis at gmail.com> wrote:
>
>> I think that you're interested in a more general tutorial on how to
>> vectorize code. I'm sorry I'm going to plug in my own articles
[1] [2], but
>> I've just tried to write a short version here and I basically
almost
>> rewrote them so there was
>> no point. For your case, I'd read at least the first half of
"Residual
>> Iterations" from Part 1 and most of part 2. The concept of
residual
>> iterations and reductions is something you'll see over and over and
it
>> happens
>> in your example too. Let me explain the use of shufflevector pretty
quick
>> and then I'll explain some specifics in your example. Also, before
we move
>> any further, I recommend that we turn off loop unrolling in your
>> example because it doesn't add educational value and just clutters
the
>> code, so here's your updated version [3].
>>
>> --- Shufflevector ---
>>
>> If you are not sure whether you understand what the instruction alone
>> does, let me know. You can also play with this LLVM IR [4]. I recommend
>> that you _run_ this code and change the mask in line 24.
>> The reason it's used in your example (see [3]) is because the
compiler
>> wants to create a vector which has `arr[0]` (referring to the C++
entity)
>> in all its lanes. In line 4 (see [2]) it's loaded into %0, in line
17 it
>> inserts
>> it only to the 0th (first) lane of a vector (and for the time being, we
>> don't care about the other lanes which is why the first arg to
>> `insertelement` is `poison`) and in line 18, it shuffles the value. The
>> shuffle mask is the vector [0, 0, 0, 0]
>> (in short, zeroinitializer) which means in the 0th lane of the _result_
>> vector (%minmax.ident.splat) insert the value of the 0th lane of the
>> _input_ vector (%minmax.ident.splatinsert). In the 1st lane of the
_result_
>> vector, insert the value of the 0th lane of the _input_ vector etc. So,
>> all lanes of the result vector get the 0th lane of the input vector.
>>
>> --- Residual Iterations ---
>>
>> Assuming that you have understood the use of residual iterations,
let's
>> see where it happens in your example. First of all, notice that in
`entry`,
>> there is a check whether `n` is bigger than 1, because if it's not,
there
>> are no
>> iterations to do at all and in this case, the result, i.e., `max`, is
>> `arr[0]`, which is why it's loaded so early. Check the phi at line
45.
>> Then, if we make it to `for.body.preheader`, there is a check whether
we
>> have at least
>> 4 iterations (the fact that this loop starts from one makes the
>> computations somewhat more complicated but not a lot). Because if we
don't,
>> then there's no point going into the vectorized version. Then,
skipping
>> what the vectorized
>> code does exactly for now (but check how the iteration count is rounded
>> down to the nearest multiple of 4 in line 15), check that _after_ the
>> vectorized code, the flow moves to `middle.block`,
>> which essentially checks whether there are any more iterations to do
and
>> if so, jumps to the serial version of the loop (this is where the
residual
>> iterations happen).
>>
>> --- Reduction ---
>>
>> In the article, I explain reductions using accumulation as an example,
>> but in your example there's another reduction, namely, finding the
maximum
>> element. Notice that this is also a reduction: new_value >>
max_of(old_value, data).
>> The reduction variable (acting as both old and new value) is `max` in
>> your example. The operation is max_of and the continuously new data is
>> arr[i].
>>
>> Finding the maximum element is a commutative operation. For instance,
to
>> find the maximum element among 4, you can split them in two pairs _any_
way
>> you want, find the maximum of each pair _independently_
>> of the other pair (those two maximums are partial results and since
their
>> computation is independent of each other, it can be done in parallel),
and
>> finally find the maximum among these two maximums.
>>
>> What happens in the vectorized code is that each lane finds a partial
>> maximum. In line 27 it loads 4 elements and in the next two lines, it
>> "compares" them with the previous 4 elements (%vec.phi, which
keeps the 4
>> partial maximums which are updated
>> in every iteration). It computes a vector which has 4 maximums, one for
>> each pair of lanes (between the new 4 elements in %wide.load, in
>> %wide.load, and %vec.phi). You might want to run an example in paper to
see
>> the effect. Consider that you want to compute
>> the maximum of 9 numbers, in this order: 2, 10, 7, 3, 1, 8, 9, 11, 4
>>
>> First, you get 2 (arr[0]) and broadcast it in a vector: <2, 2, 2,
2>.
>> This is the starting value of %vec.phi. Then, you load 4 values:
%wide.load
>> = <10, 7, 3, 1>
>> Then, you do an element-wise comparison of the two and for each pair of
>> lanes, you keep the maximum. In this case, 10 > 2, 7 > 2, 3 >
2 but 1 < 2.
>> So, %5 = <10, 7, 3, 2>, which becomes the new %vec.phi.
>> Next iteration, you load another 4: %wide.load = <8, 9, 11, 4>
and you
>> compare with (the new) %vec.phi: 8 < 10, 9 > 7, 11 > 3, 4 >
2. So, %5 >> <10, 9, 11, 4>. Now, you have computed 4 partial
maximums
>> and you only need to find the maximum among those 4 to find the total
>> maximum. That happens in line 35 and you're done.
>>
>> Hope this helps! Let me know if there are any questions.
>>
>> Best,
>> Stefanos
>>
>> [1]
>>
http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-1.html
>> [2]
>>
http://users.uoa.gr/~sdi1600105/performance/a-beginners-guide-to-vectorization-by-hand-part-2.html
>> [3] https://godbolt.org/z/nMe9a6YWa
>> [4] https://godbolt.org/z/Wf9fTq53G
>>
>> Στις Δευ, 7 Ιουν 2021 στις 8:49 π.μ., ο/η Craig Topper <
>> craig.topper at gmail.com> έγραψε:
>>
>>> Shufflevector is a powerful instruction that can be used to copy a
>>> scalar to every elements of a vector, concatenate smaller vectors
into a
>>> larger vector, rearrange elements one or two vectors, etc. Without
example
>>> code I can't say for sure what the use of shufflevector in your
case is for.
>>>
>>> ~Craig
>>>
>>>
>>> On Sun, Jun 6, 2021 at 10:44 PM Sudakshina Dutta <
>>> sudakshina at iitgoa.ac.in> wrote:
>>>
>>>> Dear Craig,
>>>>
>>>> Thanks for your reply. It seems the control flow of the
vectorized
>>>> program is very different from that of the source code. Is
there any
>>>> document describing difference in control flow, the reason for
using
>>>> shufflevector, etc ?
>>>>
>>>> Sudakshina
>>>>
>>>> On Mon, 7 Jun 2021, 07:41 Craig Topper, <craig.topper at
gmail.com> wrote:
>>>>
>>>>> There's some notes here
https://llvm.org/docs/Vectorizers.html but I
>>>>> didn't look too closely at it.  If you compile with
-fno-discard-values
>>>>> names or use a debug build, some of the basic blocks will
be named similar
>>>>> to the CFG diagram in this section
>>>>>
https://llvm.org/docs/Vectorizers.html#epilogue-vectorization
>>>>>
>>>>> ~Craig
>>>>>
>>>>>
>>>>> On Sun, Jun 6, 2021 at 7:04 PM Sudakshina Dutta via
llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> Dear Stefanos,
>>>>>>
>>>>>> I want to understand how the generated code works. For
example, the
>>>>>> code that I generated using -Rpass=loop-vectorize
>>>>>> -Rpass-analysis=loop-vectorize has many shufflevector
and other
>>>>>> instructions. I wanted to have a document (if there
exists any) where an
>>>>>> example on loop-vectorization is done with some
explanation on the workflow
>>>>>> of the code.
>>>>>>
>>>>>> Thanks,
>>>>>> Sudakshina
>>>>>>
>>>>>> On Mon, Jun 7, 2021 at 7:03 AM Stefanos Baziotis <
>>>>>> stefanos.baziotis at gmail.com> wrote:
>>>>>>
>>>>>>> Hi Sudakshina,
>>>>>>>
>>>>>>> First, it helps if you can put your code in a
godbolt snippet, like
>>>>>>> this [1]. It helps people in multiple ways (e.g.,
they don't have to
>>>>>>> download files, they can see exactly what cmd
arguments you used, they can
>>>>>>> tweak the cmd arguments without having LLVM on
their machine etc.).
>>>>>>>
>>>>>>> Is there any comprehensive tutorial/document to
understand generated
>>>>>>>> instructions or the semantics of the vectorized
code ?
>>>>>>>
>>>>>>>
>>>>>>> This is quite generic, what is more specifically
that you want to
>>>>>>> understand? Do you want to understand what each
individual instruction
>>>>>>> does? Do you maybe understand that but
>>>>>>> you don't know what is the general method to
generate, let's say by
>>>>>>> hand, vectorized code (or more specifically,
branching vectorized code). Or
>>>>>>> maybe, you want to understand
>>>>>>> how _LLVM_ generates this code, i.e., the inner
workings of the
>>>>>>> vectorization passes.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefanos
>>>>>>>
>>>>>>> [1] https://godbolt.org/z/8eKqnrMPn
>>>>>>>
>>>>>>> Στις Κυρ, 6 Ιουν 2021 στις 6:17 π.μ., ο/η
Sudakshina Dutta via
>>>>>>> llvm-dev <llvm-dev at lists.llvm.org> έγραψε:
>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> Greetings. I have generated a vectorized code
from a C source file
>>>>>>>> (attached). Is there any comprehensive
tutorial/document to understand
>>>>>>>> generated instructions or the semantics of the
vectorized code ?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Sudakshina
>>>>>>>> _______________________________________________
>>>>>>>> LLVM Developers mailing list
>>>>>>>> llvm-dev at lists.llvm.org
>>>>>>>>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>>>
>>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>
>>>>>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20210614/e9011549/attachment.html>

llvm dev - Jun 2021 - Document to understand vectorized code

[llvm-dev] Document to understand vectorized code

[llvm-dev] Document to understand vectorized code