Hal Finkel
2011-Oct-29 17:30 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
Ralf, et al.,

Attached is the latest version of my autovectorization patch. llvmdev
has been CC'd (as had been suggested to me); this e-mail contains
additional benchmark results.

First, these are preliminary results, because I did not do the things
necessary to make them real (explicitly quiet the machine, bind the
processes to one cpu, etc.). But they should be good enough for
discussion.

I'm using LLVM head r143101, with the attached patch applied, and clang
head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc
comparison, I'm using the Ubuntu build 4.4.3-4ubuntu5. gcc was run with
-O3 and no other optimization flags. opt was run with -vectorize
-unroll-allow-partial -O3 and no other optimization flags (the patch
adds the -vectorize option). llc was just given -O3.

Below I've included results using the benchmark program by Maleki, et
al. See: An Evaluation of Vectorizing Compilers - PACT'11
(http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
their benchmark program was retrieved from:
http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz

Also, when using clang, I had to pass -Dinline= on the command line:
when using -emit-llvm, clang appears not to emit code for functions
declared inline. This is a bug, but I've not yet tracked it down. There
are two such small functions in the benchmark program, and the regular
inliner *should* catch them anyway.

Results:
0. Name of the loop
1. Time using LLVM with vectorization
2. Time using LLVM without vectorization
3. Time using gcc with vectorization
4. Time using gcc without vectorization

Loop llvm-v llvm gcc-v gcc
-------------------------------------------
S000 9.59 9.49 4.55 10.04
S111 7.67 7.37 7.68 7.83
S1111 13.98 14.48 16.14 16.30
S112 17.43 17.41 16.54 17.52
S1112 13.87 14.21 14.83 14.84
S113 22.97 22.88 22.05 22.05
S1113 11.46 11.42 11.03 11.01
S114 13.47 13.75 13.53 13.48
S115 33.06 33.24 49.98 49.99
S1115 13.91 14.18 13.65 13.66
S116 48.74 49.40 49.54 48.11
S118 11.04 11.26 10.79 10.50
S119 8.97 9.07 11.83 11.82
S1119 9.04 9.14 4.31 11.87
S121 18.06 18.78 14.84 17.31
S122 7.58 7.54 6.11 6.11
S123 7.02 7.38 7.42 7.41
S124 9.62 9.77 9.42 9.33
S125 7.14 7.22 4.67 7.81
S126 2.32 2.53 2.57 2.37
S127 12.87 12.97 7.06 14.50
S128 12.58 12.43 12.42 11.52
S131 29.91 29.91 25.17 28.94
S132 17.04 17.04 15.53 21.03
S141 12.59 12.26 12.38 12.05
S151 28.92 29.43 24.89 28.95
S152 15.68 16.03 11.19 15.63
S161 6.06 6.06 5.52 5.46
S1161 14.46 14.40 8.80 8.79
S162 8.31 9.05 5.36 8.18
S171 15.47 7.94 2.81 5.70
S172 5.92 5.89 2.75 5.70
S173 31.59 30.92 18.15 30.13
S174 31.16 31.66 18.51 30.16
S175 5.80 6.18 4.94 5.77
S176 5.69 5.83 4.41 7.65
S211 16.56 17.14 16.82 16.38
S212 13.46 14.28 13.34 13.18
S1213 13.12 13.46 12.80 12.43
S221 10.88 11.09 8.65 8.63
S1221 5.80 6.04 5.40 6.05
S222 6.01 6.26 5.70 5.72
S231 23.78 22.94 22.36 22.11
S232 6.88 6.88 6.89 6.89
S1232 16.00 15.34 15.05 15.10
S233 57.48 58.55 54.21 49.56
S2233 27.65 29.77 29.68 28.40
S235 46.40 44.92 46.94 43.93
S241 31.62 31.35 32.53 31.01
S242 7.20 7.20 7.20 7.20
S243 16.78 17.09 17.69 16.84
S244 14.64 14.83 16.91 16.82
S1244 14.98 14.83 14.77 14.40
S2244 10.47 10.62 10.40 10.06
S251 35.10 35.75 19.70 34.38
S1251 56.65 57.84 41.77 56.11
S2251 15.96 15.87 17.02 15.70
S3251 16.41 16.21 19.60 15.34
S252 7.24 6.32 7.72 7.26
S253 12.55 11.38 14.40 14.40
S254 19.08 18.70 28.23 28.06
S255 5.94 6.09 9.96 9.95
S256 3.14 3.42 3.10 3.09
S257 2.18 2.25 2.21 2.20
S258 1.80 1.82 1.84 1.84
S261 12.00 12.08 10.98 10.95
S271 32.93 33.04 33.25 33.01
S272 15.48 15.82 15.39 15.26
S273 13.99 14.04 16.86 16.80
S274 18.38 18.31 18.15 17.89
S275 3.02 3.02 3.36 2.98
S2275 33.71 33.50 8.97 33.60
S276 39.52 39.44 40.80 40.55
S277 4.81 4.80 4.81 4.80
S278 14.43 14.42 14.70 14.66
S279 8.10 8.29 7.25 7.27
S1279 9.77 10.06 9.34 9.25
S2710 7.85 8.04 7.86 7.56
S2711 35.54 35.55 36.56 36.00
S2712 33.16 33.17 34.24 33.47
S281 10.97 11.09 12.46 12.02
S1281 79.37 77.55 57.78 68.06
S291 11.94 11.78 14.03 14.03
S292 7.88 7.78 9.94 9.96
S293 15.90 15.87 19.32 19.33
S2101 2.59 2.58 2.59 2.60
S2102 17.63 17.53 16.68 16.75
S2111 5.63 5.60 5.85 5.85
S311 72.07 72.03 72.23 72.03
S31111 7.49 6.00 6.00 6.00
S312 96.06 96.04 96.05 96.03
S313 36.50 36.13 36.03 36.02
S314 36.10 36.07 74.67 72.42
S315 9.00 8.99 9.35 9.30
S316 36.11 36.06 72.08 74.87
S317 444.92 444.94 451.82 451.78
S318 9.04 9.07 7.30 7.30
S319 34.76 36.53 34.42 34.19
S3110 8.53 8.57 4.11 4.11
S13110 5.76 5.77 12.12 12.12
S3111 3.60 3.62 3.60 3.60
S3112 7.20 7.30 7.21 7.20
S3113 35.12 35.47 60.21 60.20
S321 16.81 16.81 16.80 16.80
S322 12.42 12.60 12.60 12.60
S323 10.93 11.02 8.48 8.51
S331 4.23 4.23 7.20 7.20
S332 7.21 7.21 5.21 5.31
S341 4.74 4.85 7.23 7.20
S342 6.02 6.09 7.25 7.20
S343 2.14 2.06 2.16 2.01
S351 49.26 47.34 21.82 46.46
S1351 50.85 50.35 33.68 49.06
S352 58.14 58.04 57.68 57.64
S353 8.35 8.38 8.34 8.19
S421 43.13 43.34 20.62 22.46
S1421 25.25 25.81 15.85 24.76
S422 88.36 87.53 79.22 78.99
S423 155.13 155.29 154.56 154.38
S424 37.11 37.51 11.42 22.36
S431 58.22 60.66 27.59 57.16
S441 14.05 13.29 12.88 12.81
S442 6.08 6.00 6.96 6.90
S443 17.60 17.77 17.15 16.95
S451 48.95 49.08 49.03 49.14
S452 42.98 39.32 14.64 96.03
S453 28.06 28.06 14.60 14.40
S471 8.53 8.65 8.39 8.43
S481 10.98 11.15 12.04 12.00
S482 9.31 9.31 9.19 9.17
S491 11.54 11.38 11.37 11.28
S4112 8.21 8.36 9.13 8.94
S4113 8.77 8.81 8.86 8.85
S4114 12.32 12.15 12.18 11.77
S4115 8.48 8.46 8.95 8.59
S4116 3.21 3.23 6.02 5.94
S4117 14.08 9.61 10.16 9.98
S4121 8.53 8.26 4.04 8.17
va 30.09 28.58 23.58 48.46
vag 12.35 12.36 13.58 13.20
vas 13.74 13.49 13.03 12.47
vif 4.49 4.57 5.06 4.92
vpv 58.59 57.22 28.28 57.24
vtv 59.15 57.83 28.40 57.63
vpvtv 33.18 32.84 16.35 32.73
vpvts 5.99 5.83 2.99 6.38
vpvpv 33.25 32.89 16.54 32.85
vtvtv 32.83 32.80 16.84 35.97
vsumr 72.03 72.03 72.20 72.04
vdotr 72.05 72.05 72.42 72.04
vbor 205.22 380.81 99.80 372.05

I've yet to go through these in detail (they just finished running 5
minutes ago). But for the curious (and I've had several requests for
benchmarks), here you go. There is obviously more work to do.

-Hal

On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote:
> Hi Hal,
>
> those numbers look very promising, great work! :)
>
> Best,
> Ralf
>
> ----- Original Message -----
> > From: "Hal Finkel" <hfinkel at anl.gov>
> > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com>
> > Cc: llvm-commits at cs.uiuc.edu
> > Sent: Friday, 28 October 2011 13:50:00
> > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
> >
> > Bruno, et al.,
> >
> > I've attached a new version of the patch that contains improvements
> > (and a critical bug fix [the code output is no more correct, but the
> > pass in the older patch would crash in certain cases and now does
> > not]) compared to previous versions that I've posted.
> >
> > First, these are preliminary results, because I did not do the things
> > necessary to make them real (explicitly quiet the machine, bind the
> > processes to one cpu, etc.). But they should be good enough for
> > discussion.
> >
> > I'm using LLVM head r143101, with the attached patch applied, and
> > clang head r143100 on an x86_64 machine (some kind of Intel Xeon).
> > For the gcc comparison, I'm using the Ubuntu build 4.4.3-4ubuntu5.
> > gcc was run with -O3 and no other optimization flags. opt was run
> > with -vectorize -unroll-allow-partial -O3 and no other optimization
> > flags (the patch adds the -vectorize option). llc was just given -O3.
> >
> > It is not difficult to construct an example in which vectorization
> > would be useful: take a loop that does more computation than
> > loads/stores, and (partially) unroll it. Here is a simple case:
> >
> > #define ITER 5000
> > #define NUM 200
> > double a[NUM][NUM];
> > double b[NUM][NUM];
> >
> > ...
> >
> > int main()
> > {
> >   ...
> >
> >   for (int i = 0; i < ITER; ++i) {
> >     for (int x = 0; x < NUM; ++x)
> >       for (int y = 0; y < NUM; ++y) {
> >         double v = a[x][y], w = b[x][y];
> >         double z1 = v*w;
> >         double z2 = v+w;
> >         double z3 = z1*z2;
> >         double z4 = z3+v;
> >         double z5 = z2+w;
> >         double z6 = z4*z5;
> >         double z7 = z4+z5;
> >         a[x][y] = v*v-z6;
> >         b[x][y] = w-z7;
> >       }
> >   }
> >
> >   ...
> >
> >   return 0;
> > }
> >
> > Results:
> > gcc -O3: 0m1.790s
> > llvm -vectorize: 0m2.360s
> > llvm: 0m2.780s
> > gcc -fno-tree-vectorize: 0m2.810s
> > (these are the user times after I've run enough for the times to
> > settle to three decimal places)
> >
> > So the vectorization gives a ~15% improvement in the running time.
> > gcc's vectorization still does a much better job, however (yielding
> > a ~36% improvement). So there is still work to do ;)
> >
> > Additionally, I've checked the autovectorization on some classic
> > numerical benchmarks from netlib. On these benchmarks, clang/llvm
> > already do a good job compared to gcc (gcc is only about 10% better,
> > and this is true regardless of whether gcc's vectorization is on or
> > off). For these cases, autovectorization provides an insignificant
> > speedup in most cases (but does not tend to make things worse, just
> > not really any better either). Because gcc's vectorization also did
> > not really help gcc in these cases, I'm not surprised. A good
> > collection of these is available here:
> > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz
> >
> > I've yet to run the test suite using the pass to validate it. That
> > is something that I plan to do.
> > Actually, the "Livermore Loops" test in the aforementioned archive
> > contains checksums to validate the results, and it looks like 1 or 2
> > of the loop results are wrong with vectorization turned on, so I'll
> > have to investigate that.
> >
> > -Hal
> >
> > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote:
> > > Hi Hal,
> > >
> > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov>
> > > wrote:
> > > > I've attached an initial version of a basic-block
> > > > autovectorization pass. It works by searching a basic block for
> > > > pairable (independent) instructions and, using a chain-seeking
> > > > heuristic, selecting pairings likely to provide an overall
> > > > speedup (if such pairings can be found). The selected pairs are
> > > > then fused and, if necessary, other instructions are moved in
> > > > order to maintain data-flow consistency. This works only within
> > > > one basic block, but can do loop vectorization in combination
> > > > with (partial) unrolling. The basic idea was inspired by the
> > > > Vienna MAP Vectorizer, which has been used to vectorize FFT
> > > > kernels, but the algorithm used here is different.
> > > >
> > > > To try it, use -bb-vectorize with opt. There are a few options:
> > > > -bb-vectorize-req-chain-depth: default: 3 -- The depth of the
> > > >   chain of instruction pairs necessary in order to consider the
> > > >   pairs that compose the chain worthy of vectorization.
> > > > -bb-vectorize-vector-bits: default: 128 -- The size of the
> > > >   target vector registers
> > > > -bb-vectorize-no-ints -- Don't consider integer instructions
> > > > -bb-vectorize-no-floats -- Don't consider floating-point
> > > >   instructions
> > > >
> > > > The vectorizer generates a lot of insert_element/extract_element
> > > > pairs; the assumption is that other passes will turn these into
> > > > shuffles when possible (it looks like some work is necessary
> > > > here). It will also vectorize vector instructions, and generates
> > > > shuffles in this case (again, other passes should combine these
> > > > as appropriate).
> > > >
> > > > Currently, it does not fuse load or store instructions, but that
> > > > is a feature that I'd like to add. Of course, alignment
> > > > information is an issue for load/store vectorization (or maybe I
> > > > should just fuse them anyway and let isel deal with unaligned
> > > > cases?).
> > > >
> > > > Also, support needs to be added for fusing known intrinsics
> > > > (fma, etc.), and, as has been discussed on llvmdev, we should
> > > > add some intrinsics to allow the generation of addsub-type
> > > > instructions.
> > > >
> > > > I've included a few tests, but it needs more. Please review
> > > > (I'll commit if and when everyone is happy).
> > > >
> > > > Thanks in advance,
> > > > Hal
> > > >
> > > > P.S. There is another option (not so useful right now, but could
> > > > be):
> > > > -bb-vectorize-fast-dep -- Don't do a full inter-instruction
> > > >   dependency analysis; instead, stop looking for instruction
> > > >   pairs after the first use of an instruction's value. [This
> > > >   makes the pass faster, but would require a
> > > >   data-dependence-based reordering pass in order to be
> > > >   effective.]
> > >
> > > Cool! :)
> > > Have you run this pass with any benchmark or the llvm testsuite?
> > > Does it present any regressions?
> > > Do you have any performance results?
> > > Cheers,
> >
> > --
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm_bb_vectorize-20111029.diff
Type: text/x-patch
Size: 71798 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111029/983e4044/attachment.bin>
Hal Finkel
2011-Oct-29 19:02 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Sat, 2011-10-29 at 12:30 -0500, Hal Finkel wrote:
> Ralf, et al.,
>
> Attached is the latest version of my autovectorization patch. llvmdev
> has been CC'd (as had been suggested to me); this e-mail contains
> additional benchmark results.
>
> First, these are preliminary results, because I did not do the things
> necessary to make them real (explicitly quiet the machine, bind the
> processes to one cpu, etc.). But they should be good enough for
> discussion.
>
> I'm using LLVM head r143101, with the attached patch applied, and clang
> head r143100 on an x86_64 machine (some kind of Intel Xeon). For the
> gcc comparison, I'm using the Ubuntu build 4.4.3-4ubuntu5. gcc was run
> with -O3 and no other optimization flags. opt was run with -vectorize
> -unroll-allow-partial -O3 and no other optimization flags (the patch
> adds the -vectorize option).

And opt had also been given the flag: -bb-vectorize-vector-bits=256

-Hal

> llc was just given -O3.
>
> Below I've included results using the benchmark program by Maleki, et
> al. See: An Evaluation of Vectorizing Compilers - PACT'11
> (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
> their benchmark program was retrieved from:
> http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz
>
> Also, when using clang, I had to pass -Dinline= on the command line:
> when using -emit-llvm, clang appears not to emit code for functions
> declared inline. This is a bug, but I've not yet tracked it down. There
> are two such small functions in the benchmark program, and the regular
> inliner *should* catch them anyway.
>
> Results:
> 0. Name of the loop
> 1. Time using LLVM with vectorization
> 2. Time using LLVM without vectorization
> 3. Time using gcc with vectorization
> 4.
Time using gcc without vectorization > > Loop llvm-v llvm gcc-v gcc > ------------------------------------------- > S000 9.59 9.49 4.55 10.04 > S111 7.67 7.37 7.68 7.83 > S1111 13.98 14.48 16.14 16.30 > S112 17.43 17.41 16.54 17.52 > S1112 13.87 14.21 14.83 14.84 > S113 22.97 22.88 22.05 22.05 > S1113 11.46 11.42 11.03 11.01 > S114 13.47 13.75 13.53 13.48 > S115 33.06 33.24 49.98 49.99 > S1115 13.91 14.18 13.65 13.66 > S116 48.74 49.40 49.54 48.11 > S118 11.04 11.26 10.79 10.50 > S119 8.97 9.07 11.83 11.82 > S1119 9.04 9.14 4.31 11.87 > S121 18.06 18.78 14.84 17.31 > S122 7.58 7.54 6.11 6.11 > S123 7.02 7.38 7.42 7.41 > S124 9.62 9.77 9.42 9.33 > S125 7.14 7.22 4.67 7.81 > S126 2.32 2.53 2.57 2.37 > S127 12.87 12.97 7.06 14.50 > S128 12.58 12.43 12.42 11.52 > S131 29.91 29.91 25.17 28.94 > S132 17.04 17.04 15.53 21.03 > S141 12.59 12.26 12.38 12.05 > S151 28.92 29.43 24.89 28.95 > S152 15.68 16.03 11.19 15.63 > S161 6.06 6.06 5.52 5.46 > S1161 14.46 14.40 8.80 8.79 > S162 8.31 9.05 5.36 8.18 > S171 15.47 7.94 2.81 5.70 > S172 5.92 5.89 2.75 5.70 > S173 31.59 30.92 18.15 30.13 > S174 31.16 31.66 18.51 30.16 > S175 5.80 6.18 4.94 5.77 > S176 5.69 5.83 4.41 7.65 > S211 16.56 17.14 16.82 16.38 > S212 13.46 14.28 13.34 13.18 > S1213 13.12 13.46 12.80 12.43 > S221 10.88 11.09 8.65 8.63 > S1221 5.80 6.04 5.40 6.05 > S222 6.01 6.26 5.70 5.72 > S231 23.78 22.94 22.36 22.11 > S232 6.88 6.88 6.89 6.89 > S1232 16.00 15.34 15.05 15.10 > S233 57.48 58.55 54.21 49.56 > S2233 27.65 29.77 29.68 28.40 > S235 46.40 44.92 46.94 43.93 > S241 31.62 31.35 32.53 31.01 > S242 7.20 7.20 7.20 7.20 > S243 16.78 17.09 17.69 16.84 > S244 14.64 14.83 16.91 16.82 > S1244 14.98 14.83 14.77 14.40 > S2244 10.47 10.62 10.40 10.06 > S251 35.10 35.75 19.70 34.38 > S1251 56.65 57.84 41.77 56.11 > S2251 15.96 15.87 17.02 15.70 > S3251 16.41 16.21 19.60 15.34 > S252 7.24 6.32 7.72 7.26 > S253 12.55 11.38 14.40 14.40 > S254 19.08 18.70 28.23 28.06 > S255 5.94 6.09 9.96 9.95 > S256 3.14 3.42 3.10 3.09 > 
S257 2.18 2.25 2.21 2.20 > S258 1.80 1.82 1.84 1.84 > S261 12.00 12.08 10.98 10.95 > S271 32.93 33.04 33.25 33.01 > S272 15.48 15.82 15.39 15.26 > S273 13.99 14.04 16.86 16.80 > S274 18.38 18.31 18.15 17.89 > S275 3.02 3.02 3.36 2.98 > S2275 33.71 33.50 8.97 33.60 > S276 39.52 39.44 40.80 40.55 > S277 4.81 4.80 4.81 4.80 > S278 14.43 14.42 14.70 14.66 > S279 8.10 8.29 7.25 7.27 > S1279 9.77 10.06 9.34 9.25 > S2710 7.85 8.04 7.86 7.56 > S2711 35.54 35.55 36.56 36.00 > S2712 33.16 33.17 34.24 33.47 > S281 10.97 11.09 12.46 12.02 > S1281 79.37 77.55 57.78 68.06 > S291 11.94 11.78 14.03 14.03 > S292 7.88 7.78 9.94 9.96 > S293 15.90 15.87 19.32 19.33 > S2101 2.59 2.58 2.59 2.60 > S2102 17.63 17.53 16.68 16.75 > S2111 5.63 5.60 5.85 5.85 > S311 72.07 72.03 72.23 72.03 > S31111 7.49 6.00 6.00 6.00 > S312 96.06 96.04 96.05 96.03 > S313 36.50 36.13 36.03 36.02 > S314 36.10 36.07 74.67 72.42 > S315 9.00 8.99 9.35 9.30 > S316 36.11 36.06 72.08 74.87 > S317 444.92 444.94 451.82 451.78 > S318 9.04 9.07 7.30 7.30 > S319 34.76 36.53 34.42 34.19 > S3110 8.53 8.57 4.11 4.11 > S13110 5.76 5.77 12.12 12.12 > S3111 3.60 3.62 3.60 3.60 > S3112 7.20 7.30 7.21 7.20 > S3113 35.12 35.47 60.21 60.20 > S321 16.81 16.81 16.80 16.80 > S322 12.42 12.60 12.60 12.60 > S323 10.93 11.02 8.48 8.51 > S331 4.23 4.23 7.20 7.20 > S332 7.21 7.21 5.21 5.31 > S341 4.74 4.85 7.23 7.20 > S342 6.02 6.09 7.25 7.20 > S343 2.14 2.06 2.16 2.01 > S351 49.26 47.34 21.82 46.46 > S1351 50.85 50.35 33.68 49.06 > S352 58.14 58.04 57.68 57.64 > S353 8.35 8.38 8.34 8.19 > S421 43.13 43.34 20.62 22.46 > S1421 25.25 25.81 15.85 24.76 > S422 88.36 87.53 79.22 78.99 > S423 155.13 155.29 154.56 154.38 > S424 37.11 37.51 11.42 22.36 > S431 58.22 60.66 27.59 57.16 > S441 14.05 13.29 12.88 12.81 > S442 6.08 6.00 6.96 6.90 > S443 17.60 17.77 17.15 16.95 > S451 48.95 49.08 49.03 49.14 > S452 42.98 39.32 14.64 96.03 > S453 28.06 28.06 14.60 14.40 > S471 8.53 8.65 8.39 8.43 > S481 10.98 11.15 12.04 12.00 > S482 9.31 9.31 9.19 9.17 > 
S491 11.54 11.38 11.37 11.28 > S4112 8.21 8.36 9.13 8.94 > S4113 8.77 8.81 8.86 8.85 > S4114 12.32 12.15 12.18 11.77 > S4115 8.48 8.46 8.95 8.59 > S4116 3.21 3.23 6.02 5.94 > S4117 14.08 9.61 10.16 9.98 > S4121 8.53 8.26 4.04 8.17 > va 30.09 28.58 23.58 48.46 > vag 12.35 12.36 13.58 13.20 > vas 13.74 13.49 13.03 12.47 > vif 4.49 4.57 5.06 4.92 > vpv 58.59 57.22 28.28 57.24 > vtv 59.15 57.83 28.40 57.63 > vpvtv 33.18 32.84 16.35 32.73 > vpvts 5.99 5.83 2.99 6.38 > vpvpv 33.25 32.89 16.54 32.85 > vtvtv 32.83 32.80 16.84 35.97 > vsumr 72.03 72.03 72.20 72.04 > vdotr 72.05 72.05 72.42 72.04 > vbor 205.22 380.81 99.80 372.05 > > I've yet to go through these in detail (they just finished running 5 > minutes ago). But for the curious (and I've had several requests for > benchmarks), here you go. There is obviously more work to do. > > -Hal > > On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote: > > Hi Hal, > > > > those numbers look very promising, great work! :) > > > > Best, > > Ralf > > > > ----- Original Message ----- > > > From: "Hal Finkel" <hfinkel at anl.gov> > > > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com> > > > Cc: llvm-commits at cs.uiuc.edu > > > Sent: Freitag, 28. Oktober 2011 13:50:00 > > > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass > > > > > > Bruno, et al., > > > > > > I've attached a new version of the patch that contains improvements > > > (and > > > a critical bug fix [the code output is not more right, but the pass > > > in > > > the older patch would crash in certain cases and now does not]) > > > compared > > > to previous versions that I've posted. > > > > > > First, these are preliminary results because I did not do the things > > > necessary to make them real (explicitly quiet the machine, bind the > > > processes to one cpu, etc.). But they should be good enough for > > > discussion. 
> > > > > > I'm using LLVM head r143101, with the attached patch applied, and > > > clang > > > head r143100 on an x86_64 machine (some kind of Intel Xeon). For the > > > gcc > > > comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc was run -O3 > > > without any other optimization flags. opt was run -vectorize > > > -unroll-allow-partial -O3 with no other optimization flags (the patch > > > adds the -vectorize option). llc was just given -O3. > > > > > > It is not difficult to construct an example in which vectorization > > > would > > > be useful: take a loop that does more computation than load/stores, > > > and > > > (partially) unroll it. Here is a simple case: > > > > > > #define ITER 5000 > > > #define NUM 200 > > > double a[NUM][NUM]; > > > double b[NUM][NUM]; > > > > > > ... > > > > > > int main() > > > { > > > ... > > > > > > for (int i = 0; i < ITER; ++i) { > > > for (int x = 0; x < NUM; ++x) > > > for (int y = 0; y < NUM; ++y) { > > > double v = a[x][y], w = b[x][y]; > > > double z1 = v*w; > > > double z2 = v+w; > > > double z3 = z1*z2; > > > double z4 = z3+v; > > > double z5 = z2+w; > > > double z6 = z4*z5; > > > double z7 = z4+z5; > > > a[x][y] = v*v-z6; > > > b[x][y] = w-z7; > > > } > > > } > > > > > > ... > > > > > > return 0; > > > } > > > > > > Results: > > > gcc -03: 0m1.790s > > > llvm -vectorize: 0m2.360s > > > llvm: 0m2.780s > > > gcc -fno-tree-vectorize: 0m2.810s > > > (these are the user times after I've run enough for the times to > > > settle > > > to three decimal places) > > > > > > So the vectorization gives a ~15% improvement in the running time. > > > gcc's > > > vectorization still does a much better job, however (yielding an ~36% > > > improvement). So there is still work to do ;) > > > > > > Additionally, I've checked the autovectorization on some classic > > > numerical benchmarks from netlib. 
On these benchmarks, clang/llvm > > > already do a good job compared to gcc (gcc is only about 10% better, > > > and > > > this is true regardless of whether gcc's vectorization is on or off). > > > For these cases, autovectorization provides an insignificant speedup > > > in > > > most cases (but does not tend to make things worse, just not really > > > any > > > better either). Because gcc's vectorization also did not really help > > > gcc > > > in these cases, I'm not surprised. A good collection of these is > > > available here: > > > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz > > > > > > I've yet to run the test suite using the pass to validate it. That is > > > something that I plan to do. Actually, the "Livermore Loops" test in > > > the > > > aforementioned archive contains checksums to validate the results, > > > and > > > it looks like 1 or 2 of the loop results are wrong with vectorization > > > turned on, so I'll have to investigate that. > > > > > > -Hal > > > > > > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote: > > > > Hi Hal, > > > > > > > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov> > > > > wrote: > > > > > I've attached an initial version of a basic-block > > > > > autovectorization > > > > > pass. It works by searching a basic block for pairable > > > > > (independent) > > > > > instructions, and, using a chain-seeking heuristic, selects > > > > > pairings > > > > > likely to provide an overall speedup (if such pairings can be > > > > > found). > > > > > The selected pairs are then fused and, if necessary, other > > > > > instructions > > > > > are moved in order to maintain data-flow consistency. This works > > > > > only > > > > > within one basic block, but can do loop vectorization in > > > > > combination > > > > > with (partial) unrolling. 
The basic idea was inspired by the > > > > > Vienna MAP > > > > > Vectorizor, which has been used to vectorize FFT kernels, but the > > > > > algorithm used here is different. > > > > > > > > > > To try it, use -bb-vectorize with opt. There are a few options: > > > > > -bb-vectorize-req-chain-depth: default: 3 -- The depth of the > > > > > chain of > > > > > instruction pairs necessary in order to consider the pairs that > > > > > compose > > > > > the chain worthy of vectorization. > > > > > -bb-vectorize-vector-bits: default: 128 -- The size of the target > > > > > vector > > > > > registers > > > > > -bb-vectorize-no-ints -- Don't consider integer instructions > > > > > -bb-vectorize-no-floats -- Don't consider floating-point > > > > > instructions > > > > > > > > > > The vectorizor generates a lot of insert_element/extract_element > > > > > pairs; > > > > > The assumption is that other passes will turn these into shuffles > > > > > when > > > > > possible (it looks like some work is necessary here). It will > > > > > also > > > > > vectorize vector instructions, and generates shuffles in this > > > > > case > > > > > (again, other passes should combine these as appropriate). > > > > > > > > > > Currently, it does not fuse load or store instructions, but that > > > > > is a > > > > > feature that I'd like to add. Of course, alignment information is > > > > > an > > > > > issue for load/store vectorization (or maybe I should just fuse > > > > > them > > > > > anyway and let isel deal with unaligned cases?). > > > > > > > > > > Also, support needs to be added for fusing known intrinsics (fma, > > > > > etc.), > > > > > and, as has been discussed on llvmdev, we should add some > > > > > intrinsics to > > > > > allow the generation of addsub-type instructions. > > > > > > > > > > I've included a few tests, but it needs more. Please review (I'll > > > > > commit > > > > > if and when everyone is happy). 
> > > > > > > > > > Thanks in advance, > > > > > Hal > > > > > > > > > > P.S. There is another option (not so useful right now, but could > > > > > be): > > > > > -bb-vectorize-fast-dep -- Don't do a full inter-instruction > > > > > dependency > > > > > analysis; instead stop looking for instruction pairs after the > > > > > first use > > > > > of an instruction's value. [This makes the pass faster, but would > > > > > require a data-dependence-based reordering pass in order to be > > > > > effective]. > > > > > > > > Cool! :) > > > > Have you run this pass with any benchmark or the llvm testsuite? > > > > Does > > > > it presents any regression? > > > > Do you have any performance results? > > > > Cheers, > > > > > > > > > > -- > > > Hal Finkel > > > Postdoctoral Appointee > > > Leadership Computing Facility > > > Argonne National Laboratory > > > > > > _______________________________________________ > > > llvm-commits mailing list > > > llvm-commits at cs.uiuc.edu > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits > > > > > _______________________________________________ > llvm-commits mailing list > llvm-commits at cs.uiuc.edu > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory 1-630-252-0023 hfinkel at anl.gov
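Putting together the flags scattered through this message, the per-file measurement pipeline is roughly the following. This is a hypothetical reconstruction from the flags quoted above, not a command line taken from the original e-mails, and the file names are placeholders.

```shell
# Hypothetical reconstruction of the benchmark pipeline; tsc.c and the
# intermediate file names are placeholders.
clang -O3 -Dinline= -emit-llvm -c tsc.c -o tsc.bc   # -Dinline= works around the inline/-emit-llvm bug
opt -vectorize -unroll-allow-partial -O3 tsc.bc -o tsc.opt.bc
llc -O3 tsc.opt.bc -o tsc.s                         # lower the optimized bitcode to assembly
clang tsc.s -o tsc -lm                              # assemble and link
time ./tsc                                          # the tables report user time
```

For the run in this message, opt was additionally given -bb-vectorize-vector-bits=256.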
Hal Finkel
2011-Oct-29 20:16 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Sat, 2011-10-29 at 14:02 -0500, Hal Finkel wrote:
> On Sat, 2011-10-29 at 12:30 -0500, Hal Finkel wrote:
> > Ralf, et al.,
> >
> > Attached is the latest version of my autovectorization patch. llvmdev
> > has been CC'd (as had been suggested to me); this e-mail contains
> > additional benchmark results.
> >
> > First, these are preliminary results, because I did not do the things
> > necessary to make them real (explicitly quiet the machine, bind the
> > processes to one cpu, etc.). But they should be good enough for
> > discussion.
> >
> > I'm using LLVM head r143101, with the attached patch applied, and
> > clang head r143100 on an x86_64 machine (some kind of Intel Xeon).
> > For the gcc comparison, I'm using the Ubuntu build 4.4.3-4ubuntu5.
> > gcc was run with -O3 and no other optimization flags. opt was run
> > with -vectorize -unroll-allow-partial -O3 and no other optimization
> > flags (the patch adds the -vectorize option).
>
> And opt had also been given the flag: -bb-vectorize-vector-bits=256

And this was a mistake (because the machine on which the benchmarks were
run does not have AVX). I've rerun; see better results below...

> -Hal
>
> > llc was just given -O3.
> >
> > Below I've included results using the benchmark program by Maleki, et
> > al. See: An Evaluation of Vectorizing Compilers - PACT'11
> > (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
> > their benchmark program was retrieved from:
> > http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz
> >
> > Also, when using clang, I had to pass -Dinline= on the command line:
> > when using -emit-llvm, clang appears not to emit code for functions
> > declared inline. This is a bug, but I've not yet tracked it down.
> > There are two such small functions in the benchmark program, and the
> > regular inliner *should* catch them anyway.
> >
> > Results:
> > 0. Name of the loop
> > 1. Time using LLVM with vectorization
> > 2. Time using LLVM without vectorization
> > 3. Time using gcc with vectorization
> > 4. Time using gcc without vectorization

Here are improved results where the correct (and default) vector-register
size was used.

Loop llvm-v llvm gcc-v gcc
-------------------------------------------
S000 9.09 9.49 4.55 10.04
S111 7.28 7.37 7.68 7.83
S1111 13.78 14.48 16.14 16.30
S112 16.67 17.41 16.54 17.52
S1112 13.12 14.21 14.83 14.84
S113 22.12 22.88 22.05 22.05
S1113 11.06 11.42 11.03 11.01
S114 13.23 13.75 13.53 13.48
S115 32.76 33.24 49.98 49.99
S1115 13.68 14.18 13.65 13.66
S116 47.42 49.40 49.54 48.11
S118 10.84 11.26 10.79 10.50
S119 8.74 9.07 11.83 11.82
S1119 8.81 9.14 4.31 11.87
S121 17.28 18.78 14.84 17.31
S122 7.53 7.54 6.11 6.11
S123 6.90 7.38 7.42 7.41
S124 9.60 9.77 9.42 9.33
S125 6.92 7.22 4.67 7.81
S126 2.34 2.53 2.57 2.37
S127 12.19 12.97 7.06 14.50
S128 11.74 12.43 12.42 11.52
S131 28.75 29.91 25.17 28.94
S132 17.04 17.04 15.53 21.03
S141 12.28 12.26 12.38 12.05
S151 28.80 29.43 24.89 28.95
S152 15.54 16.03 11.19 15.63
S161 6.00 6.06 5.52 5.46
S1161 14.39 14.40 8.80 8.79
S162 8.19 9.05 5.36 8.18
S171 15.41 7.94 2.81 5.70
S172 5.71 5.89 2.75 5.70
S173 30.31 30.92 18.15 30.13
S174 30.18 31.66 18.51 30.16
S175 5.78 6.18 4.94 5.77
S176 5.59 5.83 4.41 7.65
S211 16.27 17.14 16.82 16.38
S212 13.21 14.28 13.34 13.18
S1213 12.81 13.46 12.80 12.43
S221 10.86 11.09 8.65 8.63
S1221 5.72 6.04 5.40 6.05
S222 6.02 6.26 5.70 5.72
S231 22.33 22.94 22.36 22.11
S232 6.88 6.88 6.89 6.89
S1232 15.30 15.34 15.05 15.10
S233 55.38 58.55 54.21 49.56
S2233 27.08 29.77 29.68 28.40
S235 44.00 44.92 46.94 43.93
S241 31.09 31.35 32.53 31.01
S242 7.19 7.20 7.20 7.20
S243 16.52 17.09 17.69 16.84
S244 14.45 14.83 16.91 16.82
S1244 14.71 14.83 14.77 14.40
S2244 10.04 10.62 10.40 10.06
S251 34.15 35.75 19.70 34.38
S1251 55.23 57.84 41.77 56.11
S2251 15.73 15.87 17.02 15.70
S3251 15.66 16.21 19.60 15.34
S252 6.18 6.32 7.72 7.26
S253 11.14 11.38 14.40 14.40
S254 18.41 18.70 28.23 28.06
S255 5.93 6.09 9.96 9.95
S256 3.08 3.42 3.10 3.09
S257 2.13 2.25 2.21 2.20
S258 1.79 1.82 1.84 1.84
S261 12.00 12.08 10.98 10.95
S271 32.82 33.04 33.25 33.01
S272 14.98 15.82 15.39 15.26
S273 13.92 14.04 16.86 16.80
S274 17.83 18.31 18.15 17.89
S275 2.92 3.02 3.36 2.98
S2275 32.80 33.50 8.97 33.60
S276 39.43 39.44 40.80 40.55
S277 4.80 4.80 4.81 4.80
S278 14.41 14.42 14.70 14.66
S279 8.03 8.29 7.25 7.27
S1279 9.71 10.06 9.34 9.25
S2710 7.71 8.04 7.86 7.56
S2711 35.53 35.55 36.56 36.00
S2712 32.94 33.17 34.24 33.47
S281 10.79 11.09 12.46 12.02
S1281 79.13 77.55 57.78 68.06
S291 11.80 11.78 14.03 14.03
S292 7.77 7.78 9.94 9.96
S293 15.50 15.87 19.32 19.33
S2101 2.56 2.58 2.59 2.60
S2102 16.71 17.53 16.68 16.75
S2111 5.60 5.60 5.85 5.85
S311 72.03 72.03 72.23 72.03
S31111 7.49 6.00 6.00 6.00
S312 96.04 96.04 96.05 96.03
S313 36.02 36.13 36.03 36.02
S314 36.01 36.07 74.67 72.42
S315 8.96 8.99 9.35 9.30
S316 36.02 36.06 72.08 74.87
S317 444.93 444.94 451.82 451.78
S318 9.05 9.07 7.30 7.30
S319 34.54 36.53 34.42 34.19
S3110 8.51 8.57 4.11 4.11
S13110 5.75 5.77 12.12 12.12
S3111 3.60 3.62 3.60 3.60
S3112 7.19 7.30 7.21 7.20
S3113 35.13 35.47 60.21 60.20
S321 16.79 16.81 16.80 16.80
S322 12.42 12.60 12.60 12.60
S323 10.86 11.02 8.48 8.51
S331 4.23 4.23 7.20 7.20
S332 7.20 7.21 5.21 5.31
S341 4.79 4.85 7.23 7.20
S342 6.01 6.09 7.25 7.20
S343 2.04 2.06 2.16 2.01
S351 46.61 47.34 21.82 46.46
S1351 49.28 50.35 33.68 49.06
S352 57.65 58.04 57.68 57.64
S353 8.21 8.38 8.34 8.19
S421 42.94 43.34 20.62 22.46
S1421 25.15 25.81 15.85 24.76
S422 87.39 87.53 79.22 78.99
S423 155.01 155.29 154.56 154.38
S424 36.51 37.51 11.42 22.36
S431 57.10 60.66 27.59 57.16
S441 14.04 13.29 12.88 12.81
S442 6.00 6.00 6.96 6.90
S443 17.28 17.77 17.15 16.95
S451 48.92 49.08 49.03 49.14
S452 42.98 39.32 14.64 96.03
S453 28.05 28.06 14.60 14.40
S471 8.24 8.65 8.39 8.43
S481 10.88 11.15 12.04 12.00
S482 9.21 9.31 9.19 9.17
S491 11.26 11.38 11.37 11.28
S4112 8.21 8.36 9.13 8.94
S4113 8.65 8.81 8.86 8.85
S4114 11.82 12.15 12.18 11.77
S4115 8.28 8.46 8.95 8.59
S4116 3.22 3.23 6.02 5.94
S4117 13.95 9.61 10.16 9.98
S4121 8.21 8.26 4.04 8.17
va 28.46 28.58 23.58 48.46
vag 12.35 12.36 13.58 13.20
vas 13.45 13.49 13.03 12.47
vif 4.55 4.57 5.06 4.92
vpv 57.08 57.22 28.28 57.24
vtv 57.81 57.83 28.40 57.63
vpvtv 32.82 32.84 16.35 32.73
vpvts 5.82 5.83 2.99 6.38
vpvpv 32.87 32.89 16.54 32.85
vtvtv 32.82 32.80 16.84 35.97
vsumr 72.04 72.03 72.20 72.04
vdotr 72.06 72.05 72.42 72.04
vbor 205.24 380.81 99.80 372.05

-Hal

> >
> > Loop llvm-v llvm gcc-v gcc
> > -------------------------------------------
> > S000 9.59 9.49 4.55 10.04
> > S111 7.67 7.37 7.68 7.83
> > S1111 13.98 14.48 16.14 16.30
> > S112 17.43 17.41 16.54 17.52
> > S1112 13.87 14.21 14.83 14.84
> > S113 22.97 22.88 22.05 22.05
> > S1113 11.46 11.42 11.03 11.01
> > S114 13.47 13.75 13.53 13.48
> > S115 33.06 33.24 49.98 49.99
> > S1115 13.91 14.18 13.65 13.66
> > S116 48.74 49.40 49.54 48.11
> > S118 11.04 11.26 10.79 10.50
> > S119 8.97 9.07 11.83 11.82
> > S1119 9.04 9.14 4.31 11.87
> > S121 18.06 18.78 14.84 17.31
> > S122 7.58 7.54 6.11 6.11
> > S123 7.02 7.38 7.42 7.41
> > S124 9.62 9.77 9.42 9.33
> > S125 7.14 7.22 4.67 7.81
> > S126 2.32 2.53 2.57 2.37
> > S127 12.87 12.97 7.06 14.50
> > S128 12.58 12.43 12.42 11.52
> > S131 29.91 29.91 25.17 28.94
> > S132 17.04 17.04 15.53 21.03
> > S141 12.59 12.26 12.38 12.05
> > S151 28.92 29.43 24.89 28.95
> > S152 15.68 16.03 11.19 15.63
> > S161 6.06 6.06 5.52 5.46
> > S1161 14.46 14.40 8.80 8.79
> > S162 8.31 9.05 5.36 8.18
> > S171 15.47 7.94 2.81 5.70
> > S172 5.92 5.89 2.75 5.70
> > S173 31.59 30.92 18.15 30.13
> > S174 31.16 31.66 18.51 30.16
> > S175 5.80 6.18 4.94 5.77
> > S176 5.69 5.83 4.41 7.65
> > S211 16.56 17.14 16.82 16.38
> > S212 13.46 14.28 13.34 13.18
> > S1213 13.12 13.46 12.80 12.43
> > S221 10.88 11.09 8.65 8.63
> > S1221 5.80 6.04 5.40 6.05
> > S222 6.01 6.26 5.70 5.72
> > S231 23.78 22.94 22.36 22.11
> > S232 6.88 6.88 6.89 6.89
> > S1232 16.00 15.34 15.05 15.10
> > S233 57.48 58.55 54.21 49.56
> >
S2233 27.65 29.77 29.68 28.40 > > S235 46.40 44.92 46.94 43.93 > > S241 31.62 31.35 32.53 31.01 > > S242 7.20 7.20 7.20 7.20 > > S243 16.78 17.09 17.69 16.84 > > S244 14.64 14.83 16.91 16.82 > > S1244 14.98 14.83 14.77 14.40 > > S2244 10.47 10.62 10.40 10.06 > > S251 35.10 35.75 19.70 34.38 > > S1251 56.65 57.84 41.77 56.11 > > S2251 15.96 15.87 17.02 15.70 > > S3251 16.41 16.21 19.60 15.34 > > S252 7.24 6.32 7.72 7.26 > > S253 12.55 11.38 14.40 14.40 > > S254 19.08 18.70 28.23 28.06 > > S255 5.94 6.09 9.96 9.95 > > S256 3.14 3.42 3.10 3.09 > > S257 2.18 2.25 2.21 2.20 > > S258 1.80 1.82 1.84 1.84 > > S261 12.00 12.08 10.98 10.95 > > S271 32.93 33.04 33.25 33.01 > > S272 15.48 15.82 15.39 15.26 > > S273 13.99 14.04 16.86 16.80 > > S274 18.38 18.31 18.15 17.89 > > S275 3.02 3.02 3.36 2.98 > > S2275 33.71 33.50 8.97 33.60 > > S276 39.52 39.44 40.80 40.55 > > S277 4.81 4.80 4.81 4.80 > > S278 14.43 14.42 14.70 14.66 > > S279 8.10 8.29 7.25 7.27 > > S1279 9.77 10.06 9.34 9.25 > > S2710 7.85 8.04 7.86 7.56 > > S2711 35.54 35.55 36.56 36.00 > > S2712 33.16 33.17 34.24 33.47 > > S281 10.97 11.09 12.46 12.02 > > S1281 79.37 77.55 57.78 68.06 > > S291 11.94 11.78 14.03 14.03 > > S292 7.88 7.78 9.94 9.96 > > S293 15.90 15.87 19.32 19.33 > > S2101 2.59 2.58 2.59 2.60 > > S2102 17.63 17.53 16.68 16.75 > > S2111 5.63 5.60 5.85 5.85 > > S311 72.07 72.03 72.23 72.03 > > S31111 7.49 6.00 6.00 6.00 > > S312 96.06 96.04 96.05 96.03 > > S313 36.50 36.13 36.03 36.02 > > S314 36.10 36.07 74.67 72.42 > > S315 9.00 8.99 9.35 9.30 > > S316 36.11 36.06 72.08 74.87 > > S317 444.92 444.94 451.82 451.78 > > S318 9.04 9.07 7.30 7.30 > > S319 34.76 36.53 34.42 34.19 > > S3110 8.53 8.57 4.11 4.11 > > S13110 5.76 5.77 12.12 12.12 > > S3111 3.60 3.62 3.60 3.60 > > S3112 7.20 7.30 7.21 7.20 > > S3113 35.12 35.47 60.21 60.20 > > S321 16.81 16.81 16.80 16.80 > > S322 12.42 12.60 12.60 12.60 > > S323 10.93 11.02 8.48 8.51 > > S331 4.23 4.23 7.20 7.20 > > S332 7.21 7.21 5.21 5.31 > > S341 4.74 4.85 
7.23 7.20 > > S342 6.02 6.09 7.25 7.20 > > S343 2.14 2.06 2.16 2.01 > > S351 49.26 47.34 21.82 46.46 > > S1351 50.85 50.35 33.68 49.06 > > S352 58.14 58.04 57.68 57.64 > > S353 8.35 8.38 8.34 8.19 > > S421 43.13 43.34 20.62 22.46 > > S1421 25.25 25.81 15.85 24.76 > > S422 88.36 87.53 79.22 78.99 > > S423 155.13 155.29 154.56 154.38 > > S424 37.11 37.51 11.42 22.36 > > S431 58.22 60.66 27.59 57.16 > > S441 14.05 13.29 12.88 12.81 > > S442 6.08 6.00 6.96 6.90 > > S443 17.60 17.77 17.15 16.95 > > S451 48.95 49.08 49.03 49.14 > > S452 42.98 39.32 14.64 96.03 > > S453 28.06 28.06 14.60 14.40 > > S471 8.53 8.65 8.39 8.43 > > S481 10.98 11.15 12.04 12.00 > > S482 9.31 9.31 9.19 9.17 > > S491 11.54 11.38 11.37 11.28 > > S4112 8.21 8.36 9.13 8.94 > > S4113 8.77 8.81 8.86 8.85 > > S4114 12.32 12.15 12.18 11.77 > > S4115 8.48 8.46 8.95 8.59 > > S4116 3.21 3.23 6.02 5.94 > > S4117 14.08 9.61 10.16 9.98 > > S4121 8.53 8.26 4.04 8.17 > > va 30.09 28.58 23.58 48.46 > > vag 12.35 12.36 13.58 13.20 > > vas 13.74 13.49 13.03 12.47 > > vif 4.49 4.57 5.06 4.92 > > vpv 58.59 57.22 28.28 57.24 > > vtv 59.15 57.83 28.40 57.63 > > vpvtv 33.18 32.84 16.35 32.73 > > vpvts 5.99 5.83 2.99 6.38 > > vpvpv 33.25 32.89 16.54 32.85 > > vtvtv 32.83 32.80 16.84 35.97 > > vsumr 72.03 72.03 72.20 72.04 > > vdotr 72.05 72.05 72.42 72.04 > > vbor 205.22 380.81 99.80 372.05 > > > > I've yet to go through these in detail (they just finished running 5 > > minutes ago). But for the curious (and I've had several requests for > > benchmarks), here you go. There is obviously more work to do. > > > > -Hal > > > > On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote: > > > Hi Hal, > > > > > > those numbers look very promising, great work! :) > > > > > > Best, > > > Ralf > > > > > > ----- Original Message ----- > > > > From: "Hal Finkel" <hfinkel at anl.gov> > > > > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com> > > > > Cc: llvm-commits at cs.uiuc.edu > > > > Sent: Freitag, 28. 
Oktober 2011 13:50:00 > > > > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass > > > > > > > > Bruno, et al., > > > > > > > > I've attached a new version of the patch that contains improvements > > > > (and > > > > a critical bug fix [the code output is not more right, but the pass > > > > in > > > > the older patch would crash in certain cases and now does not]) > > > > compared > > > > to previous versions that I've posted. > > > > > > > > First, these are preliminary results because I did not do the things > > > > necessary to make them real (explicitly quiet the machine, bind the > > > > processes to one cpu, etc.). But they should be good enough for > > > > discussion. > > > > > > > > I'm using LLVM head r143101, with the attached patch applied, and > > > > clang > > > > head r143100 on an x86_64 machine (some kind of Intel Xeon). For the > > > > gcc > > > > comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc was run -O3 > > > > without any other optimization flags. opt was run -vectorize > > > > -unroll-allow-partial -O3 with no other optimization flags (the patch > > > > adds the -vectorize option). llc was just given -O3. > > > > > > > > It is not difficult to construct an example in which vectorization > > > > would > > > > be useful: take a loop that does more computation than load/stores, > > > > and > > > > (partially) unroll it. Here is a simple case: > > > > > > > > #define ITER 5000 > > > > #define NUM 200 > > > > double a[NUM][NUM]; > > > > double b[NUM][NUM]; > > > > > > > > ... > > > > > > > > int main() > > > > { > > > > ... 
> > > > > > > > for (int i = 0; i < ITER; ++i) { > > > > for (int x = 0; x < NUM; ++x) > > > > for (int y = 0; y < NUM; ++y) { > > > > double v = a[x][y], w = b[x][y]; > > > > double z1 = v*w; > > > > double z2 = v+w; > > > > double z3 = z1*z2; > > > > double z4 = z3+v; > > > > double z5 = z2+w; > > > > double z6 = z4*z5; > > > > double z7 = z4+z5; > > > > a[x][y] = v*v-z6; > > > > b[x][y] = w-z7; > > > > } > > > > } > > > > > > > > ... > > > > > > > > return 0; > > > > } > > > > > > > > Results: > > > > gcc -03: 0m1.790s > > > > llvm -vectorize: 0m2.360s > > > > llvm: 0m2.780s > > > > gcc -fno-tree-vectorize: 0m2.810s > > > > (these are the user times after I've run enough for the times to > > > > settle > > > > to three decimal places) > > > > > > > > So the vectorization gives a ~15% improvement in the running time. > > > > gcc's > > > > vectorization still does a much better job, however (yielding an ~36% > > > > improvement). So there is still work to do ;) > > > > > > > > Additionally, I've checked the autovectorization on some classic > > > > numerical benchmarks from netlib. On these benchmarks, clang/llvm > > > > already do a good job compared to gcc (gcc is only about 10% better, > > > > and > > > > this is true regardless of whether gcc's vectorization is on or off). > > > > For these cases, autovectorization provides an insignificant speedup > > > > in > > > > most cases (but does not tend to make things worse, just not really > > > > any > > > > better either). Because gcc's vectorization also did not really help > > > > gcc > > > > in these cases, I'm not surprised. A good collection of these is > > > > available here: > > > > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz > > > > > > > > I've yet to run the test suite using the pass to validate it. That is > > > > something that I plan to do. 
Actually, the "Livermore Loops" test in > > > > the > > > > aforementioned archive contains checksums to validate the results, > > > > and > > > > it looks like 1 or 2 of the loop results are wrong with vectorization > > > > turned on, so I'll have to investigate that. > > > > > > > > -Hal > > > > > > > > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote: > > > > > Hi Hal, > > > > > > > > > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov> > > > > > wrote: > > > > > > I've attached an initial version of a basic-block > > > > > > autovectorization > > > > > > pass. It works by searching a basic block for pairable > > > > > > (independent) > > > > > > instructions, and, using a chain-seeking heuristic, selects > > > > > > pairings > > > > > > likely to provide an overall speedup (if such pairings can be > > > > > > found). > > > > > > The selected pairs are then fused and, if necessary, other > > > > > > instructions > > > > > > are moved in order to maintain data-flow consistency. This works > > > > > > only > > > > > > within one basic block, but can do loop vectorization in > > > > > > combination > > > > > > with (partial) unrolling. The basic idea was inspired by the > > > > > > Vienna MAP > > > > > > Vectorizor, which has been used to vectorize FFT kernels, but the > > > > > > algorithm used here is different. > > > > > > > > > > > > To try it, use -bb-vectorize with opt. There are a few options: > > > > > > -bb-vectorize-req-chain-depth: default: 3 -- The depth of the > > > > > > chain of > > > > > > instruction pairs necessary in order to consider the pairs that > > > > > > compose > > > > > > the chain worthy of vectorization. 
> > > > > > -bb-vectorize-vector-bits: default: 128 -- The size of the target > > > > > > vector > > > > > > registers > > > > > > -bb-vectorize-no-ints -- Don't consider integer instructions > > > > > > -bb-vectorize-no-floats -- Don't consider floating-point > > > > > > instructions > > > > > > > > > > > > The vectorizor generates a lot of insert_element/extract_element > > > > > > pairs; > > > > > > The assumption is that other passes will turn these into shuffles > > > > > > when > > > > > > possible (it looks like some work is necessary here). It will > > > > > > also > > > > > > vectorize vector instructions, and generates shuffles in this > > > > > > case > > > > > > (again, other passes should combine these as appropriate). > > > > > > > > > > > > Currently, it does not fuse load or store instructions, but that > > > > > > is a > > > > > > feature that I'd like to add. Of course, alignment information is > > > > > > an > > > > > > issue for load/store vectorization (or maybe I should just fuse > > > > > > them > > > > > > anyway and let isel deal with unaligned cases?). > > > > > > > > > > > > Also, support needs to be added for fusing known intrinsics (fma, > > > > > > etc.), > > > > > > and, as has been discussed on llvmdev, we should add some > > > > > > intrinsics to > > > > > > allow the generation of addsub-type instructions. > > > > > > > > > > > > I've included a few tests, but it needs more. Please review (I'll > > > > > > commit > > > > > > if and when everyone is happy). > > > > > > > > > > > > Thanks in advance, > > > > > > Hal > > > > > > > > > > > > P.S. There is another option (not so useful right now, but could > > > > > > be): > > > > > > -bb-vectorize-fast-dep -- Don't do a full inter-instruction > > > > > > dependency > > > > > > analysis; instead stop looking for instruction pairs after the > > > > > > first use > > > > > > of an instruction's value. 
[This makes the pass faster, but would > > > > > > require a data-dependence-based reordering pass in order to be > > > > > > effective]. > > > > > > > > > > Cool! :) > > > > > Have you run this pass with any benchmark or the llvm testsuite? > > > > > Does > > > > > it presents any regression? > > > > > Do you have any performance results? > > > > > Cheers, > > > > > > > > > > > > > -- > > > > Hal Finkel > > > > Postdoctoral Appointee > > > > Leadership Computing Facility > > > > Argonne National Laboratory > > > > > > > > _______________________________________________ > > > > llvm-commits mailing list > > > > llvm-commits at cs.uiuc.edu > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits > > > > > > > > _______________________________________________ > > llvm-commits mailing list > > llvm-commits at cs.uiuc.edu > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits >-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory
Peter Collingbourne
2011-Oct-29 20:54 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Sat, Oct 29, 2011 at 12:30:12PM -0500, Hal Finkel wrote:
> Also, when using clang, I had to pass -Dinline= on the command line:
> when using -emit-llvm, clang appears not to emit code for functions
> declared inline. This is a bug, but I've not yet tracked it down.

http://clang.llvm.org/compatibility.html#inline

Thanks,
-- 
Peter
Hal Finkel
2011-Oct-29 21:24 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Sat, 2011-10-29 at 21:54 +0100, Peter Collingbourne wrote:
> On Sat, Oct 29, 2011 at 12:30:12PM -0500, Hal Finkel wrote:
> > Also, when using clang, I had to pass -Dinline= on the command line:
> > when using -emit-llvm, clang appears not to emit code for functions
> > declared inline. This is a bug, but I've not yet tracked it down.
>
> http://clang.llvm.org/compatibility.html#inline

Thanks! (Of course, the standard does not govern the relationship
between the compiler frontend and the backend, so it *could* work some
other way.)

-Hal

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory