Roman Gareev via llvm-dev
2016-May-02  18:07 UTC
[llvm-dev] [GSoC 2016] Attaining 90% of the turbo boost peak with a C version of Matrix-Matrix Multiplication
Hi Tobias,
according to [1], we can expect 90% of the turbo boost peak of the
processor with a C version of Matrix-Matrix Multiplication that is
similar to the one presented in [1]. In case of Intel Core i7-3820
SandyBridge, the theoretical maximal performance of the machine is
28.8 gflops and hence the expected number is 25,92 gflops.
However, in case of, for example, n = m = 1056 and k = 1024 a code
based on BLIS framework takes 0.088919 seconds and hence 25,68 gflops.
I’m not sure whether a C implementation, which similar to one the
presented in [1], can outperform a code based on BLIS framework. What
do you think about it?  What performance should we necessarily reach?
In case of, for example,  n = m = 1056 and k = 1024, I’ve managed to
reduce the execution time from 0.107210 to 0,097794 seconds (a
corresponding implementation can be found attached) and hence get
23,35 gflops. To do so, the following things were implemented:
1. An algorithm presented in [1] and, probably, in [2] assumes that
all matrices are stored in column-major order. That’s why 'i' an
'j'
loops were interchanged and parameters kc, nc, mc were recomputed to
correspond to row-major order.
2. The algorithm presented in [1] differs from the one presented in
[2]. It performs only one level of tiling of a 'j' loop and gets
micro-panels of width nr. Furthermore, it always uses a static buffer
with the sizes 1000 x kc. Although it helps to avoid repacking of the
Ac for every iteration of the Loop 1 from [2], it limits the size of
the ’n’ dimension to 1000. I’ve also implemented it in the attached
program (however, it uses a buffer with the sizes 1056 x 1024).
3. I’ve found out that along with kernels written in assembly a
repository with BLIS framework also contains their C implementations
[3] (Sorry, I managed to do it only after they reorganized their
repository. I haven’t determined the purpose of C implementations
yet). The implementation, which targets Sandy Bridge architecture [4],
contains inline assembly with 'prefetch' instructions. Using it in the
attached program helps to reduce the execution time from 0.12 to
0,097794 seconds.
I have one more question. Could we create additional threads to
perform packing transformation?
If I’m not mistaken, BLIS framework applies multithreading in its
implementation [5]. That’s why we would probably get more than
0.088919 seconds mentioned above, if the multithreading were disabled
(I’ve been using export OMP_THREAD_LIMIT=1 to limit the number of OMP
threads. However, I haven’t found a way to avoid usual
multithreading).
Refs.
[1] - http://wiki.cs.utexas.edu/rvdg/HowToOptimizeGemm
[2] - http://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf
[3] - https://github.com/flame/blis/tree/master/kernels/x86_64/sandybridge/3
[4] -
https://github.com/flame/blis/blob/master/kernels/x86_64/sandybridge/3/bli_gemm_int_d8x4.c
[5] -
https://github.com/flame/blis/blob/master/frame/3/gemm/bli_gemm_blk_var3f.c
-- 
                                    Cheers, Roman Gareev.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gemm_SIMD.c
Type: text/x-csrc
Size: 5715 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160502/ae26ee5e/attachment.c>
Tobias Grosser via llvm-dev
2016-May-04  06:18 UTC
[llvm-dev] [GSoC 2016] Attaining 90% of the turbo boost peak with a C version of Matrix-Matrix Multiplication
On 05/02/2016 08:07 PM, Roman Gareev wrote:> Hi Tobias,Hi Roman, thank you for the updated.> according to [1], we can expect 90% of the turbo boost peak of the > processor with a C version of Matrix-Matrix Multiplication that is > similar to the one presented in [1]. In case of Intel Core i7-3820 > SandyBridge, the theoretical maximal performance of the machine is > 28.8 gflops and hence the expected number is 25,92 gflops.I see.> However, in case of, for example, n = m = 1056 and k = 1024 a code > based on BLIS framework takes 0.088919 seconds and hence 25,68 gflops.So you are saying BLIS itself only reaches 89% of peak? I would have expected for them to reach slightly more, but your measurements in your GSoC draft seem to give the same numbers, also for MKL. So let's assume this is true and not worry about this for now.> I’m not sure whether a C implementation, which similar to one the > presented in [1], can outperform a code based on BLIS framework. What > do you think about it? What performance should we necessarily reach?The first goal I would set is to get within 10-20% percent of what we see with BLIS, rather then being an order of magnitude off. The precise performance is probably not yet so relevant. What is important is to understand what transformations are needed and to ensure the most impactful building blocks are available through Polly.> In case of, for example, n = m = 1056 and k = 1024, I’ve managed to > reduce the execution time from 0.107210 to 0,097794 seconds (a > corresponding implementation can be found attached) and hence get > 23,35 gflops.This is 81% of peak. Even though I would hope we get even closer, I think this is already a great first step. Just for comparison it would be nice if you could list GFLOP performance we see with clang -O3, Polly and icc -O3 -ffast-math> To do so, the following things were implemented: > > 1. An algorithm presented in [1] and, probably, in [2] assumes that > all matrices are stored in column-major order. That’s why 'i' an 'j' > loops were interchanged and parameters kc, nc, mc were recomputed to > correspond to row-major order.OK.> 2. The algorithm presented in [1] differs from the one presented in > [2]. It performs only one level of tiling of a 'j' loop and gets > micro-panels of width nr. Furthermore, it always uses a static buffer > with the sizes 1000 x kc. Although it helps to avoid repacking of the > Ac for every iteration of the Loop 1 from [2], it limits the size of > the ’n’ dimension to 1000. I’ve also implemented it in the attached > program (however, it uses a buffer with the sizes 1056 x 1024).Did you now implement [1] or [2]? I assume it's [1]?> 3. I’ve found out that along with kernels written in assembly a > repository with BLIS framework also contains their C implementations > [3] (Sorry, I managed to do it only after they reorganized their > repository. I haven’t determined the purpose of C implementations > yet). The implementation, which targets Sandy Bridge architecture [4], > contains inline assembly with 'prefetch' instructions. Using it in the > attached program helps to reduce the execution time from 0.12 to > 0,097794 seconds.Nice.> > I have one more question. Could we create additional threads to > perform packing transformation?Why would we do this? I would believe that multi-threading at the outermost loops is likely more efficient?> If I’m not mistaken, BLIS framework applies multithreading in its > implementation [5]. That’s why we would probably get more than > 0.088919 seconds mentioned above, if the multithreading were disabled > (I’ve been using export OMP_THREAD_LIMIT=1 to limit the number of OMP > threads. However, I haven’t found a way to avoid usual > multithreading).So you are saying BLIS is using multithreading and still only reaches 90% of a single core? This sounds exceptionally inefficient. These numbers look a lot more like the single threaded performance of BLIS. I would suggest you cross check if this is really single-threaded or multi-threaded performance. Overall, these results already suggest that we can get with reasonable effort into the range or vendor BLAS libraries. Do I understand correctly that the only parts that can not yet be implemented with a Polly schedule transformation are prefetching as well as the on-the-fly transposition of the matrices? Best, Tobias
Roman Gareev via llvm-dev
2016-May-06  09:57 UTC
[llvm-dev] [GSoC 2016] Attaining 90% of the turbo boost peak with a C version of Matrix-Matrix Multiplication
> So you are saying BLIS itself only reaches 89% of peak? I would have > expected for them to reach slightly more, but your measurements in > your GSoC draft seem to give the same numbers, also for MKL. So let's > assume this is true and not worry about this for now.OK.>> I’m not sure whether a C implementation, which similar to one the >> presented in [1], can outperform a code based on BLIS framework. What >> do you think about it? What performance should we necessarily reach? > > The first goal I would set is to get within 10-20% percent of what we > see with BLIS, rather then being an order of magnitude off. The > precise performance is probably not yet so relevant. What is important > is to understand what transformations are needed and to ensure the > most impactful building blocks are available through Polly.Yes, I think that it's important. Thank you for the advice!>> In case of, for example, n = m = 1056 and k = 1024, I’ve managed to >> reduce the execution time from 0.107210 to 0,097794 seconds (a >> corresponding implementation can be found attached) and hence get >> 23,35 gflops. > > This is 81% of peak. Even though I would hope we get even closer, I > think this is already a great first step. Just for comparison it would > be nice if you could list GFLOP performance we see with clang -O3, > Polly and icc -O3 -ffast-mathIn case of n = m = 1056 and k = 1024, I get the following numbers: 1. clang -O3 1,09627398514907 2. Polly vect=stripmine 7,614133825873 3. Polly vect=none 3,04573476564844 4. OpenBLAS 26,39588687139538 5. BLIS 25,68403297383012 6. Intel MKL 26,75772431488793>> To do so, the following things were implemented: >> >> 1. An algorithm presented in [1] and, probably, in [2] assumes that >> all matrices are stored in column-major order. That’s why 'i' an 'j' >> loops were interchanged and parameters kc, nc, mc were recomputed to >> correspond to row-major order. > > OK. > >> 2. The algorithm presented in [1] differs from the one presented in >> [2]. It performs only one level of tiling of a 'j' loop and gets >> micro-panels of width nr. Furthermore, it always uses a static buffer >> with the sizes 1000 x kc. Although it helps to avoid repacking of the >> Ac for every iteration of the Loop 1 from [2], it limits the size of >> the ’n’ dimension to 1000. I’ve also implemented it in the attached >> program (however, it uses a buffer with the sizes 1056 x 1024). > > Did you now implement [1] or [2]? I assume it's [1]?Yes, I implemented [1]. It can be found in the attached file.>> 3. I’ve found out that along with kernels written in assembly a >> repository with BLIS framework also contains their C implementations >> [3] (Sorry, I managed to do it only after they reorganized their >> repository. I haven’t determined the purpose of C implementations >> yet). The implementation, which targets Sandy Bridge architecture [4], >> contains inline assembly with 'prefetch' instructions. Using it in the >> attached program helps to reduce the execution time from 0.12 to >> 0,097794 seconds. > > Nice. > >> >> I have one more question. Could we create additional threads to >> perform packing transformation? > > Why would we do this? I would believe that multi-threading at the > outermost loops is likely more efficient?Yes, it's possibly more efficient. Sorry, I thought that a multi-threaded version of the packing transformation is used, for example, in BLIS framework by default. However, we could probably use it, if we implement a multi-threaded version of matrix multiplication.>> If I’m not mistaken, BLIS framework applies multithreading in its >> implementation [5]. That’s why we would probably get more than >> 0.088919 seconds mentioned above, if the multithreading were disabled >> (I’ve been using export OMP_THREAD_LIMIT=1 to limit the number of OMP >> threads. However, I haven’t found a way to avoid usual >> multithreading). > > So you are saying BLIS is using multithreading and still only reaches > 90% of a single core? This sounds exceptionally inefficient. These > numbers look a lot more like the single threaded performance of BLIS. I > would suggest you cross check if this is really single-threaded or > multi-threaded performance.Yes, you’re right. This is single-threaded performance, because BLIS disables multithreading by default https://github.com/flame/blis/wiki/Multithreading. I confused it with Intel MKL. Thanks for the advice.> Overall, these results already suggest that we can get with reasonable > effort into the range or vendor BLAS libraries. > > Do I understand correctly that the only parts that can not yet be > implemented with a Polly schedule transformation are prefetching as > well as the on-the-fly transposition of the matrices?Yes, I think you're right. -- Cheers, Roman Gareev.