thr3ads.net - llvm dev - [LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass [Nov 2011]


Hal Finkel

2011-Oct-29 20:16 UTC

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On Sat, 2011-10-29 at 14:02 -0500, Hal Finkel wrote:
> On Sat, 2011-10-29 at 12:30 -0500, Hal Finkel wrote:
> > Ralf, et al.,
> > 
> > Attached is the latest version of my autovectorization patch. llvmdev
> > has been CC'd (as had been suggested to me); this e-mail contains
> > additional benchmark results.
> > 
> > First, these are preliminary results because I did not do the things
> > necessary to make them real (explicitly quiet the machine, bind the
> > processes to one cpu, etc.). But they should be good enough for
> > discussion.
> > 
> > I'm using LLVM head r143101, with the attached patch applied, and clang
> > head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc
> > comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc was run -O3
> > without any other optimization flags. opt was run -vectorize
> > -unroll-allow-partial -O3 with no other optimization flags (the patch
> > adds the -vectorize option).
> 
> And opt had also been given the flag: -bb-vectorize-vector-bits=256
And this was a mistake (because the machine on which the benchmarks were
run does not have AVX). I've rerun, see better results below...
> 
>  -Hal
> 
> > llc was just given -O3.
> > 
> > Below I've included results using the benchmark program by Maleki, et
> > al. See:
> > An Evaluation of Vectorizing Compilers - PACT'11
> > (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
> > their benchmark program was retrieved from:
> > http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz
> > 
> > Also, when using clang, I had to pass -Dinline= on the command line:
> > when using -emit-llvm, clang appears not to emit code for functions
> > declared inline. This is a bug, but I've not yet tracked it down. There
> > are two such small functions in the benchmark program, and the regular
> > inliner *should* catch them anyway.
> > 
> > Results:
> > 0. Name of the loop
> > 1. Time using LLVM with vectorization
> > 2. Time using LLVM without vectorization
> > 3. Time using gcc with vectorization
> > 4. Time using gcc without vectorization

Here are improved results, obtained with the correct (and default)
vector-register size of 128 bits.

Loop       llvm-v   llvm     gcc-v    gcc
-------------------------------------------
S000        9.09     9.49     4.55    10.04
S111        7.28     7.37     7.68     7.83
S1111      13.78    14.48    16.14    16.30
S112       16.67    17.41    16.54    17.52
S1112      13.12    14.21    14.83    14.84
S113       22.12    22.88    22.05    22.05
S1113      11.06    11.42    11.03    11.01
S114       13.23    13.75    13.53    13.48
S115       32.76    33.24    49.98    49.99
S1115      13.68    14.18    13.65    13.66
S116       47.42    49.40    49.54    48.11
S118       10.84    11.26    10.79    10.50
S119        8.74     9.07    11.83    11.82
S1119       8.81     9.14     4.31    11.87
S121       17.28    18.78    14.84    17.31
S122        7.53     7.54     6.11     6.11
S123        6.90     7.38     7.42     7.41
S124        9.60     9.77     9.42     9.33
S125        6.92     7.22     4.67     7.81
S126        2.34     2.53     2.57     2.37
S127       12.19    12.97     7.06    14.50
S128       11.74    12.43    12.42    11.52
S131       28.75    29.91    25.17    28.94
S132       17.04    17.04    15.53    21.03
S141       12.28    12.26    12.38    12.05
S151       28.80    29.43    24.89    28.95
S152       15.54    16.03    11.19    15.63
S161        6.00     6.06     5.52     5.46
S1161      14.39    14.40     8.80     8.79
S162        8.19     9.05     5.36     8.18
S171       15.41     7.94     2.81     5.70
S172        5.71     5.89     2.75     5.70
S173       30.31    30.92    18.15    30.13
S174       30.18    31.66    18.51    30.16
S175        5.78     6.18     4.94     5.77
S176        5.59     5.83     4.41     7.65
S211       16.27    17.14    16.82    16.38
S212       13.21    14.28    13.34    13.18
S1213      12.81    13.46    12.80    12.43
S221       10.86    11.09     8.65     8.63
S1221       5.72     6.04     5.40     6.05
S222        6.02     6.26     5.70     5.72
S231       22.33    22.94    22.36    22.11
S232        6.88     6.88     6.89     6.89
S1232      15.30    15.34    15.05    15.10
S233       55.38    58.55    54.21    49.56
S2233      27.08    29.77    29.68    28.40
S235       44.00    44.92    46.94    43.93
S241       31.09    31.35    32.53    31.01
S242        7.19     7.20     7.20     7.20
S243       16.52    17.09    17.69    16.84
S244       14.45    14.83    16.91    16.82
S1244      14.71    14.83    14.77    14.40
S2244      10.04    10.62    10.40    10.06
S251       34.15    35.75    19.70    34.38
S1251      55.23    57.84    41.77    56.11
S2251      15.73    15.87    17.02    15.70
S3251      15.66    16.21    19.60    15.34
S252        6.18     6.32     7.72     7.26
S253       11.14    11.38    14.40    14.40
S254       18.41    18.70    28.23    28.06
S255        5.93     6.09     9.96     9.95
S256        3.08     3.42     3.10     3.09
S257        2.13     2.25     2.21     2.20
S258        1.79     1.82     1.84     1.84
S261       12.00    12.08    10.98    10.95
S271       32.82    33.04    33.25    33.01
S272       14.98    15.82    15.39    15.26
S273       13.92    14.04    16.86    16.80
S274       17.83    18.31    18.15    17.89
S275        2.92     3.02     3.36     2.98
S2275      32.80    33.50     8.97    33.60
S276       39.43    39.44    40.80    40.55
S277        4.80     4.80     4.81     4.80
S278       14.41    14.42    14.70    14.66
S279        8.03     8.29     7.25     7.27
S1279       9.71    10.06     9.34     9.25
S2710       7.71     8.04     7.86     7.56
S2711      35.53    35.55    36.56    36.00
S2712      32.94    33.17    34.24    33.47
S281       10.79    11.09    12.46    12.02
S1281      79.13    77.55    57.78    68.06
S291       11.80    11.78    14.03    14.03
S292        7.77     7.78     9.94     9.96
S293       15.50    15.87    19.32    19.33
S2101       2.56     2.58     2.59     2.60
S2102      16.71    17.53    16.68    16.75
S2111       5.60     5.60     5.85     5.85
S311       72.03    72.03    72.23    72.03
S31111      7.49     6.00     6.00     6.00
S312       96.04    96.04    96.05    96.03
S313       36.02    36.13    36.03    36.02
S314       36.01    36.07    74.67    72.42
S315        8.96     8.99     9.35     9.30
S316       36.02    36.06    72.08    74.87
S317      444.93   444.94   451.82   451.78
S318        9.05     9.07     7.30     7.30
S319       34.54    36.53    34.42    34.19
S3110       8.51     8.57     4.11     4.11
S13110      5.75     5.77    12.12    12.12
S3111       3.60     3.62     3.60     3.60
S3112       7.19     7.30     7.21     7.20
S3113      35.13    35.47    60.21    60.20
S321       16.79    16.81    16.80    16.80
S322       12.42    12.60    12.60    12.60
S323       10.86    11.02     8.48     8.51
S331        4.23     4.23     7.20     7.20
S332        7.20     7.21     5.21     5.31
S341        4.79     4.85     7.23     7.20
S342        6.01     6.09     7.25     7.20
S343        2.04     2.06     2.16     2.01
S351       46.61    47.34    21.82    46.46
S1351      49.28    50.35    33.68    49.06
S352       57.65    58.04    57.68    57.64
S353        8.21     8.38     8.34     8.19
S421       42.94    43.34    20.62    22.46
S1421      25.15    25.81    15.85    24.76
S422       87.39    87.53    79.22    78.99
S423      155.01   155.29   154.56   154.38
S424       36.51    37.51    11.42    22.36
S431       57.10    60.66    27.59    57.16
S441       14.04    13.29    12.88    12.81
S442        6.00     6.00     6.96     6.90
S443       17.28    17.77    17.15    16.95
S451       48.92    49.08    49.03    49.14
S452       42.98    39.32    14.64    96.03
S453       28.05    28.06    14.60    14.40
S471        8.24     8.65     8.39     8.43
S481       10.88    11.15    12.04    12.00
S482        9.21     9.31     9.19     9.17
S491       11.26    11.38    11.37    11.28
S4112       8.21     8.36     9.13     8.94
S4113       8.65     8.81     8.86     8.85
S4114      11.82    12.15    12.18    11.77
S4115       8.28     8.46     8.95     8.59
S4116       3.22     3.23     6.02     5.94
S4117      13.95     9.61    10.16     9.98
S4121       8.21     8.26     4.04     8.17
va         28.46    28.58    23.58    48.46
vag        12.35    12.36    13.58    13.20
vas        13.45    13.49    13.03    12.47
vif         4.55     4.57     5.06     4.92
vpv        57.08    57.22    28.28    57.24
vtv        57.81    57.83    28.40    57.63
vpvtv      32.82    32.84    16.35    32.73
vpvts       5.82     5.83     2.99     6.38
vpvpv      32.87    32.89    16.54    32.85
vtvtv      32.82    32.80    16.84    35.97
vsumr      72.04    72.03    72.20    72.04
vdotr      72.06    72.05    72.42    72.04
vbor      205.24   380.81    99.80   372.05

 -Hal
> > 
> > Loop       llvm-v   llvm     gcc-v    gcc
> > -------------------------------------------
> > S000        9.59     9.49     4.55    10.04
> > S111        7.67     7.37     7.68     7.83
> > S1111      13.98    14.48    16.14    16.30
> > S112       17.43    17.41    16.54    17.52
> > S1112      13.87    14.21    14.83    14.84
> > S113       22.97    22.88    22.05    22.05
> > S1113      11.46    11.42    11.03    11.01
> > S114       13.47    13.75    13.53    13.48
> > S115       33.06    33.24    49.98    49.99
> > S1115      13.91    14.18    13.65    13.66
> > S116       48.74    49.40    49.54    48.11
> > S118       11.04    11.26    10.79    10.50
> > S119        8.97     9.07    11.83    11.82
> > S1119       9.04     9.14     4.31    11.87
> > S121       18.06    18.78    14.84    17.31
> > S122        7.58     7.54     6.11     6.11
> > S123        7.02     7.38     7.42     7.41
> > S124        9.62     9.77     9.42     9.33
> > S125        7.14     7.22     4.67     7.81
> > S126        2.32     2.53     2.57     2.37
> > S127       12.87    12.97     7.06    14.50
> > S128       12.58    12.43    12.42    11.52
> > S131       29.91    29.91    25.17    28.94
> > S132       17.04    17.04    15.53    21.03
> > S141       12.59    12.26    12.38    12.05
> > S151       28.92    29.43    24.89    28.95
> > S152       15.68    16.03    11.19    15.63
> > S161        6.06     6.06     5.52     5.46
> > S1161      14.46    14.40     8.80     8.79
> > S162        8.31     9.05     5.36     8.18
> > S171       15.47     7.94     2.81     5.70
> > S172        5.92     5.89     2.75     5.70
> > S173       31.59    30.92    18.15    30.13
> > S174       31.16    31.66    18.51    30.16
> > S175        5.80     6.18     4.94     5.77
> > S176        5.69     5.83     4.41     7.65
> > S211       16.56    17.14    16.82    16.38
> > S212       13.46    14.28    13.34    13.18
> > S1213      13.12    13.46    12.80    12.43
> > S221       10.88    11.09     8.65     8.63
> > S1221       5.80     6.04     5.40     6.05
> > S222        6.01     6.26     5.70     5.72
> > S231       23.78    22.94    22.36    22.11
> > S232        6.88     6.88     6.89     6.89
> > S1232      16.00    15.34    15.05    15.10
> > S233       57.48    58.55    54.21    49.56
> > S2233      27.65    29.77    29.68    28.40
> > S235       46.40    44.92    46.94    43.93
> > S241       31.62    31.35    32.53    31.01
> > S242        7.20     7.20     7.20     7.20
> > S243       16.78    17.09    17.69    16.84
> > S244       14.64    14.83    16.91    16.82
> > S1244      14.98    14.83    14.77    14.40
> > S2244      10.47    10.62    10.40    10.06
> > S251       35.10    35.75    19.70    34.38
> > S1251      56.65    57.84    41.77    56.11
> > S2251      15.96    15.87    17.02    15.70
> > S3251      16.41    16.21    19.60    15.34
> > S252        7.24     6.32     7.72     7.26
> > S253       12.55    11.38    14.40    14.40
> > S254       19.08    18.70    28.23    28.06
> > S255        5.94     6.09     9.96     9.95
> > S256        3.14     3.42     3.10     3.09
> > S257        2.18     2.25     2.21     2.20
> > S258        1.80     1.82     1.84     1.84
> > S261       12.00    12.08    10.98    10.95
> > S271       32.93    33.04    33.25    33.01
> > S272       15.48    15.82    15.39    15.26
> > S273       13.99    14.04    16.86    16.80
> > S274       18.38    18.31    18.15    17.89
> > S275        3.02     3.02     3.36     2.98
> > S2275      33.71    33.50     8.97    33.60
> > S276       39.52    39.44    40.80    40.55
> > S277        4.81     4.80     4.81     4.80
> > S278       14.43    14.42    14.70    14.66
> > S279        8.10     8.29     7.25     7.27
> > S1279       9.77    10.06     9.34     9.25
> > S2710       7.85     8.04     7.86     7.56
> > S2711      35.54    35.55    36.56    36.00
> > S2712      33.16    33.17    34.24    33.47
> > S281       10.97    11.09    12.46    12.02
> > S1281      79.37    77.55    57.78    68.06
> > S291       11.94    11.78    14.03    14.03
> > S292        7.88     7.78     9.94     9.96
> > S293       15.90    15.87    19.32    19.33
> > S2101       2.59     2.58     2.59     2.60
> > S2102      17.63    17.53    16.68    16.75
> > S2111       5.63     5.60     5.85     5.85
> > S311       72.07    72.03    72.23    72.03
> > S31111      7.49     6.00     6.00     6.00
> > S312       96.06    96.04    96.05    96.03
> > S313       36.50    36.13    36.03    36.02
> > S314       36.10    36.07    74.67    72.42
> > S315        9.00     8.99     9.35     9.30
> > S316       36.11    36.06    72.08    74.87
> > S317      444.92   444.94   451.82   451.78
> > S318        9.04     9.07     7.30     7.30
> > S319       34.76    36.53    34.42    34.19
> > S3110       8.53     8.57     4.11     4.11
> > S13110      5.76     5.77    12.12    12.12
> > S3111       3.60     3.62     3.60     3.60
> > S3112       7.20     7.30     7.21     7.20
> > S3113      35.12    35.47    60.21    60.20
> > S321       16.81    16.81    16.80    16.80
> > S322       12.42    12.60    12.60    12.60
> > S323       10.93    11.02     8.48     8.51
> > S331        4.23     4.23     7.20     7.20
> > S332        7.21     7.21     5.21     5.31
> > S341        4.74     4.85     7.23     7.20
> > S342        6.02     6.09     7.25     7.20
> > S343        2.14     2.06     2.16     2.01
> > S351       49.26    47.34    21.82    46.46
> > S1351      50.85    50.35    33.68    49.06
> > S352       58.14    58.04    57.68    57.64
> > S353        8.35     8.38     8.34     8.19
> > S421       43.13    43.34    20.62    22.46
> > S1421      25.25    25.81    15.85    24.76
> > S422       88.36    87.53    79.22    78.99
> > S423      155.13   155.29   154.56   154.38
> > S424       37.11    37.51    11.42    22.36
> > S431       58.22    60.66    27.59    57.16
> > S441       14.05    13.29    12.88    12.81
> > S442        6.08     6.00     6.96     6.90
> > S443       17.60    17.77    17.15    16.95
> > S451       48.95    49.08    49.03    49.14
> > S452       42.98    39.32    14.64    96.03
> > S453       28.06    28.06    14.60    14.40
> > S471        8.53     8.65     8.39     8.43
> > S481       10.98    11.15    12.04    12.00
> > S482        9.31     9.31     9.19     9.17
> > S491       11.54    11.38    11.37    11.28
> > S4112       8.21     8.36     9.13     8.94
> > S4113       8.77     8.81     8.86     8.85
> > S4114      12.32    12.15    12.18    11.77
> > S4115       8.48     8.46     8.95     8.59
> > S4116       3.21     3.23     6.02     5.94
> > S4117      14.08     9.61    10.16     9.98
> > S4121       8.53     8.26     4.04     8.17
> > va         30.09    28.58    23.58    48.46
> > vag        12.35    12.36    13.58    13.20
> > vas        13.74    13.49    13.03    12.47
> > vif         4.49     4.57     5.06     4.92
> > vpv        58.59    57.22    28.28    57.24
> > vtv        59.15    57.83    28.40    57.63
> > vpvtv      33.18    32.84    16.35    32.73
> > vpvts       5.99     5.83     2.99     6.38
> > vpvpv      33.25    32.89    16.54    32.85
> > vtvtv      32.83    32.80    16.84    35.97
> > vsumr      72.03    72.03    72.20    72.04
> > vdotr      72.05    72.05    72.42    72.04
> > vbor      205.22   380.81    99.80   372.05
> > 
> > I've yet to go through these in detail (they just finished running 5
> > minutes ago). But for the curious (and I've had several requests for
> > benchmarks), here you go. There is obviously more work to do.
> > 
> >  -Hal
> > 
> > On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote:
> > > Hi Hal,
> > > 
> > > those numbers look very promising, great work! :)
> > > 
> > > Best,
> > > Ralf
> > > 
> > > ----- Original Message -----
> > > > From: "Hal Finkel" <hfinkel at anl.gov>
> > > > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com>
> > > > Cc: llvm-commits at cs.uiuc.edu
> > > > Sent: Friday, 28 October 2011 13:50:00
> > > > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
> > > > 
> > > > Bruno, et al.,
> > > > 
> > > > I've attached a new version of the patch that contains improvements
> > > > (and a critical bug fix [the code output is not more right, but the
> > > > pass in the older patch would crash in certain cases and now does
> > > > not]) compared to previous versions that I've posted.
> > > > 
> > > > First, these are preliminary results because I did not do the things
> > > > necessary to make them real (explicitly quiet the machine, bind the
> > > > processes to one cpu, etc.). But they should be good enough for
> > > > discussion.
> > > > 
> > > > I'm using LLVM head r143101, with the attached patch applied, and
> > > > clang head r143100 on an x86_64 machine (some kind of Intel Xeon).
> > > > For the gcc comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc
> > > > was run -O3 without any other optimization flags. opt was run
> > > > -vectorize -unroll-allow-partial -O3 with no other optimization flags
> > > > (the patch adds the -vectorize option). llc was just given -O3.
> > > > 
> > > > It is not difficult to construct an example in which vectorization
> > > > would be useful: take a loop that does more computation than
> > > > load/stores, and (partially) unroll it. Here is a simple case:
> > > > 
> > > > #define ITER 5000
> > > > #define NUM 200
> > > > double a[NUM][NUM];
> > > > double b[NUM][NUM];
> > > > 
> > > > ...
> > > > 
> > > > int main()
> > > > {
> > > > ...
> > > > 
> > > >   for (int i = 0; i < ITER; ++i) {
> > > >     for (int x = 0; x < NUM; ++x)
> > > >     for (int y = 0; y < NUM; ++y) {
> > > >       double v = a[x][y], w = b[x][y];
> > > >       double z1 = v*w;
> > > >       double z2 = v+w;
> > > >       double z3 = z1*z2;
> > > >       double z4 = z3+v;
> > > >       double z5 = z2+w;
> > > >       double z6 = z4*z5;
> > > >       double z7 = z4+z5;
> > > >       a[x][y] = v*v-z6;
> > > >       b[x][y] = w-z7;
> > > >     }
> > > >   }
> > > > 
> > > >  ...
> > > > 
> > > >   return 0;
> > > > }
> > > > 
> > > > Results:
> > > > gcc -O3: 0m1.790s
> > > > llvm -vectorize: 0m2.360s
> > > > llvm: 0m2.780s
> > > > gcc -fno-tree-vectorize: 0m2.810s
> > > > (these are the user times after I've run enough for the times to
> > > > settle to three decimal places)
> > > > 
> > > > So the vectorization gives a ~15% improvement in the running time.
> > > > gcc's vectorization still does a much better job, however (yielding
> > > > a ~36% improvement). So there is still work to do ;)
> > > > 
> > > > Additionally, I've checked the autovectorization on some classic
> > > > numerical benchmarks from netlib. On these benchmarks, clang/llvm
> > > > already do a good job compared to gcc (gcc is only about 10% better,
> > > > and this is true regardless of whether gcc's vectorization is on or
> > > > off). For these cases, autovectorization provides an insignificant
> > > > speedup in most cases (but does not tend to make things worse, just
> > > > not really any better either). Because gcc's vectorization also did
> > > > not really help gcc in these cases, I'm not surprised. A good
> > > > collection of these is available here:
> > > > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz
> > > > 
> > > > I've yet to run the test suite using the pass to validate it. That is
> > > > something that I plan to do. Actually, the "Livermore Loops" test in
> > > > the aforementioned archive contains checksums to validate the
> > > > results, and it looks like 1 or 2 of the loop results are wrong with
> > > > vectorization turned on, so I'll have to investigate that.
> > > > 
> > > >  -Hal
> > > > 
> > > > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote:
> > > > > Hi Hal,
> > > > > 
> > > > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > > > > > I've attached an initial version of a basic-block autovectorization
> > > > > > pass. It works by searching a basic block for pairable (independent)
> > > > > > instructions, and, using a chain-seeking heuristic, selects pairings
> > > > > > likely to provide an overall speedup (if such pairings can be found).
> > > > > > The selected pairs are then fused and, if necessary, other
> > > > > > instructions are moved in order to maintain data-flow consistency.
> > > > > > This works only within one basic block, but can do loop vectorization
> > > > > > in combination with (partial) unrolling. The basic idea was inspired
> > > > > > by the Vienna MAP Vectorizer, which has been used to vectorize FFT
> > > > > > kernels, but the algorithm used here is different.
> > > > > >
> > > > > > To try it, use -bb-vectorize with opt. There are a few options:
> > > > > > -bb-vectorize-req-chain-depth: default: 3 -- The depth of the chain
> > > > > > of instruction pairs necessary in order to consider the pairs that
> > > > > > compose the chain worthy of vectorization.
> > > > > > -bb-vectorize-vector-bits: default: 128 -- The size of the target
> > > > > > vector registers
> > > > > > -bb-vectorize-no-ints -- Don't consider integer instructions
> > > > > > -bb-vectorize-no-floats -- Don't consider floating-point instructions
> > > > > >
> > > > > > The vectorizer generates a lot of insert_element/extract_element
> > > > > > pairs; the assumption is that other passes will turn these into
> > > > > > shuffles when possible (it looks like some work is necessary here).
> > > > > > It will also vectorize vector instructions, and generates shuffles
> > > > > > in this case (again, other passes should combine these as
> > > > > > appropriate).
> > > > > >
> > > > > > Currently, it does not fuse load or store instructions, but that is
> > > > > > a feature that I'd like to add. Of course, alignment information is
> > > > > > an issue for load/store vectorization (or maybe I should just fuse
> > > > > > them anyway and let isel deal with unaligned cases?).
> > > > > >
> > > > > > Also, support needs to be added for fusing known intrinsics (fma,
> > > > > > etc.), and, as has been discussed on llvmdev, we should add some
> > > > > > intrinsics to allow the generation of addsub-type instructions.
> > > > > >
> > > > > > I've included a few tests, but it needs more. Please review (I'll
> > > > > > commit if and when everyone is happy).
> > > > > >
> > > > > > Thanks in advance,
> > > > > > Hal
> > > > > >
> > > > > > P.S. There is another option (not so useful right now, but could
> > > > > > be):
> > > > > > -bb-vectorize-fast-dep -- Don't do a full inter-instruction
> > > > > > dependency analysis; instead stop looking for instruction pairs
> > > > > > after the first use of an instruction's value. [This makes the pass
> > > > > > faster, but would require a data-dependence-based reordering pass in
> > > > > > order to be effective].
> > > > > 
> > > > > Cool! :)
> > > > > Have you run this pass with any benchmark or the llvm testsuite?
> > > > > Does it present any regressions?
> > > > > Do you have any performance results?
> > > > > Cheers,
> > > > > 
> > > > 
> > > > --
> > > > Hal Finkel
> > > > Postdoctoral Appointee
> > > > Leadership Computing Facility
> > > > Argonne National Laboratory
> > > > 
> > > > _______________________________________________
> > > > llvm-commits mailing list
> > > > llvm-commits at cs.uiuc.edu
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > > 
> > 
> 
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

Hal Finkel

2011-Oct-29 22:56 UTC

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On Sat, 2011-10-29 at 15:16 -0500, Hal Finkel wrote:
> On Sat, 2011-10-29 at 14:02 -0500, Hal Finkel wrote:
> > On Sat, 2011-10-29 at 12:30 -0500, Hal Finkel wrote:
> > > Ralf, et al.,
> > > 
> > > Attached is the latest version of my autovectorization patch. llvmdev
> > > has been CC'd (as had been suggested to me); this e-mail contains
> > > additional benchmark results.
> > > 
> > > First, these are preliminary results because I did not do the things
> > > necessary to make them real (explicitly quiet the machine, bind the
> > > processes to one cpu, etc.). But they should be good enough for
> > > discussion.
> > > 
> > > I'm using LLVM head r143101, with the attached patch applied, and clang
> > > head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc
> > > comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc was run -O3
> > > without any other optimization flags. opt was run -vectorize
> > > -unroll-allow-partial -O3 with no other optimization flags (the patch
> > > adds the -vectorize option).
> > 
> > And opt had also been given the flag: -bb-vectorize-vector-bits=256
> 
> And this was a mistake (because the machine on which the benchmarks were
> run does not have AVX). I've rerun, see better results below...
> 
> > 
> >  -Hal
> > 
> > > llc was just given -O3.
> > > 
> > > Below I've included results using the benchmark program by Maleki, et
> > > al. See:
> > > An Evaluation of Vectorizing Compilers - PACT'11
> > > (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
> > > their benchmark program was retrieved from:
> > > http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz
> > > 
> > > Also, when using clang, I had to pass -Dinline= on the command line:
> > > when using -emit-llvm, clang appears not to emit code for functions
> > > declared inline. This is a bug, but I've not yet tracked it down. There
> > > are two such small functions in the benchmark program, and the regular
> > > inliner *should* catch them anyway.
> > > 
> > > Results:
> > > 0. Name of the loop
> > > 1. Time using LLVM with vectorization
> > > 2. Time using LLVM without vectorization
> > > 3. Time using gcc with vectorization
> > > 4. Time using gcc without vectorization

As Peter Collingbourne indirectly pointed out to me, clang's
optimizations are still important (even when clang is only emitting LLVM
IR). I've rerun the llvm code generation steps, adding -O3 to clang. Here
are the results (they are significantly better):

Loop       llvm-v   llvm     gcc-v    gcc
-------------------------------------------
S000        9.10     9.59     4.55    10.04
S111        7.29     7.65     7.68     7.83
S1111      13.87    14.72    16.14    16.30
S112       16.67    17.45    16.54    17.52
S1112      13.16    13.87    14.83    14.84
S113       22.14    22.98    22.05    22.05
S1113      11.06    11.48    11.03    11.01
S114       13.21    13.81    13.53    13.48
S115       32.82    33.36    49.98    49.99
S1115      13.67    14.23    13.65    13.66
S116       47.37    49.43    49.54    48.11
S118       10.81    11.25    10.79    10.50
S119        8.73     9.09    11.83    11.82
S1119       8.82     9.15     4.31    11.87
S121       17.29    18.06    14.84    17.31
S122        7.53     7.70     6.11     6.11
S123        6.93     7.10     7.42     7.41
S124        9.63     9.84     9.42     9.33
S125        6.94     7.10     4.67     7.81
S126        2.34     2.55     2.57     2.37
S127       12.23    12.68     7.06    14.50
S128       11.78    12.41    12.42    11.52
S131       28.79    30.11    25.17    28.94
S132       17.04    17.04    15.53    21.03
S141       12.26    12.85    12.38    12.05
S151       28.79    30.11    24.89    28.95
S152       15.53    16.03    11.19    15.63
S161        6.00     6.12     5.52     5.46
S1161      14.40    14.50     8.80     8.79
S162        8.19     8.41     5.36     8.18
S171       15.41     7.96     2.81     5.70
S172        5.70     5.97     2.75     5.70
S173       30.32    31.69    18.15    30.13
S174       30.20    31.53    18.51    30.16
S175        5.79     6.04     4.94     5.77
S176        5.59     5.83     4.41     7.65
S211       16.31    16.89    16.82    16.38
S212       13.23    13.50    13.34    13.18
S1213      12.82    13.35    12.80    12.43
S221       10.87    11.09     8.65     8.63
S1221       5.72     6.03     5.40     6.05
S222        6.01     6.29     5.70     5.72
S231       22.38    24.22    22.36    22.11
S232        6.89     6.94     6.89     6.89
S1232      15.31    16.43    15.05    15.10
S233       55.47    59.98    54.21    49.56
S2233      27.23    29.71    29.68    28.40
S235       44.08    47.85    46.94    43.93
S241       31.14    31.72    32.53    31.01
S242        7.20     7.21     7.20     7.20
S243       16.54    16.99    17.69    16.84
S244       14.51    14.93    16.91    16.82
S1244      14.72    15.02    14.77    14.40
S2244      10.09    10.60    10.40    10.06
S251       34.42    35.55    19.70    34.38
S1251      55.39    57.11    41.77    56.11
S2251      15.69    16.26    17.02    15.70
S3251      15.69    16.52    19.60    15.34
S252        6.18     6.46     7.72     7.26
S253       11.19    11.52    14.40    14.40
S254       18.00    18.98    28.23    28.06
S255        5.94     6.14     9.96     9.95
S256        3.09     3.39     3.10     3.09
S257        2.13     2.31     2.21     2.20
S258        1.80     1.87     1.84     1.84
S261       12.00    12.22    10.98    10.95
S271       32.81    33.76    33.25    33.01
S272       15.04    15.52    15.39    15.26
S273       13.93    14.10    16.86    16.80
S274       17.83    18.53    18.15    17.89
S275        2.92     3.14     3.36     2.98
S2275      32.81    34.95     8.97    33.60
S276       41.26    41.97    40.80    40.55
S277        4.80     4.93     4.81     4.80
S278       14.43    14.76    14.70    14.66
S279        8.05     8.24     7.25     7.27
S1279       9.72     9.92     9.34     9.25
S2710       7.73     8.07     7.86     7.56
S2711      36.49    37.10    36.56    36.00
S2712      32.96    33.96    34.24    33.47
S281       10.80    11.32    12.46    12.02
S1281      79.10    78.11    57.78    68.06
S291       11.79    12.27    14.03    14.03
S292        6.70     6.91     9.94     9.96
S293       15.50    16.24    19.32    19.33
S2101       2.56     2.67     2.59     2.60
S2102      16.74    18.45    16.68    16.75
S2111       5.59     5.63     5.85     5.85
S311       72.04    72.27    72.23    72.03
S31111      7.50     6.01     6.00     6.00
S312       96.04    96.17    96.05    96.03
S313       36.02    36.61    36.03    36.02
S314       36.01    36.12    74.67    72.42
S315        9.11     9.21     9.35     9.30
S316       36.01    36.12    72.08    74.87
S317      444.91   444.94   451.82   451.78
S318        9.07     9.12     7.30     7.30
S319       34.57    36.46    34.42    34.19
S3110       8.52     8.61     4.11     4.11
S13110      5.75     5.78    12.12    12.12
S3111       3.60     3.64     3.60     3.60
S3112       7.20     7.30     7.21     7.20
S3113      33.68    34.18    60.21    60.20
S321       16.80    16.87    16.80    16.80
S322       12.42    12.64    12.60    12.60
S323       10.88    11.24     8.48     8.51
S331        4.23     4.36     7.20     7.20
S332        7.20     7.28     5.21     5.31
S341        4.80     5.04     7.23     7.20
S342        6.01     6.24     7.25     7.20
S343        2.04     2.16     2.16     2.01
S351       46.63    48.65    21.82    46.46
S1351      49.37    51.28    33.68    49.06
S352       57.64    58.44    57.68    57.64
S353        8.21     8.44     8.34     8.19
S421       24.26    25.29    20.62    22.46
S1421      25.18    26.16    15.85    24.76
S422       80.08    81.51    79.22    78.99
S423      155.02   155.21   154.56   154.38
S424       22.62    23.35    11.42    22.36
S431       57.22    59.82    27.59    57.16
S441       13.27    14.23    12.88    12.81
S442        5.99     6.13     6.96     6.90
S443       17.37    17.77    17.15    16.95
S451       48.92    48.99    49.03    49.14
S452       42.97    39.57    14.64    96.03
S453       28.06    28.07    14.60    14.40
S471        8.27     8.56     8.39     8.43
S481       10.93    11.23    12.04    12.00
S482        9.21     9.42     9.19     9.17
S491       11.31    11.60    11.37    11.28
S4112       8.21     8.45     9.13     8.94
S4113       8.65     8.95     8.86     8.85
S4114      11.87    12.35    12.18    11.77
S4115       8.28     8.51     8.95     8.59
S4116       3.23     3.22     6.02     5.94
S4117      13.97     9.69    10.16     9.98
S4121       8.20     8.44     4.04     8.17
va         28.50    29.33    23.58    48.46
vag        12.37    12.93    13.58    13.20
vas        13.46    14.15    13.03    12.47
vif         4.55     4.79     5.06     4.92
vpv        57.21    59.83    28.28    57.24
vtv        57.92    60.42    28.40    57.63
vpvtv      32.84    33.77    16.35    32.73
vpvts       5.82     6.07     2.99     6.38
vpvpv      32.87    33.84    16.54    32.85
vtvtv      32.82    33.75    16.84    35.97
vsumr      72.03    72.28    72.20    72.04
vdotr      72.05    73.22    72.42    72.04
vbor      205.24   381.18    99.80   372.05

I apologize for the multiple e-mails with a long list of numbers, but I
think that this was worth it (and I did not want to be unfair to the
clang developers).

 -Hal
> 
> Here are improved results where the correct (and default)
> vector-register size was used.
> 
> Loop       llvm-v   llvm     gcc-v    gcc
> -------------------------------------------
> S000        9.09     9.49     4.55    10.04
> S111        7.28     7.37     7.68     7.83
> S1111      13.78    14.48    16.14    16.30
> S112       16.67    17.41    16.54    17.52
> S1112      13.12    14.21    14.83    14.84
> S113       22.12    22.88    22.05    22.05
> S1113      11.06    11.42    11.03    11.01
> S114       13.23    13.75    13.53    13.48
> S115       32.76    33.24    49.98    49.99
> S1115      13.68    14.18    13.65    13.66
> S116       47.42    49.40    49.54    48.11
> S118       10.84    11.26    10.79    10.50
> S119        8.74     9.07    11.83    11.82
> S1119       8.81     9.14     4.31    11.87
> S121       17.28    18.78    14.84    17.31
> S122        7.53     7.54     6.11     6.11
> S123        6.90     7.38     7.42     7.41
> S124        9.60     9.77     9.42     9.33
> S125        6.92     7.22     4.67     7.81
> S126        2.34     2.53     2.57     2.37
> S127       12.19    12.97     7.06    14.50
> S128       11.74    12.43    12.42    11.52
> S131       28.75    29.91    25.17    28.94
> S132       17.04    17.04    15.53    21.03
> S141       12.28    12.26    12.38    12.05
> S151       28.80    29.43    24.89    28.95
> S152       15.54    16.03    11.19    15.63
> S161        6.00     6.06     5.52     5.46
> S1161      14.39    14.40     8.80     8.79
> S162        8.19     9.05     5.36     8.18
> S171       15.41     7.94     2.81     5.70
> S172        5.71     5.89     2.75     5.70
> S173       30.31    30.92    18.15    30.13
> S174       30.18    31.66    18.51    30.16
> S175        5.78     6.18     4.94     5.77
> S176        5.59     5.83     4.41     7.65
> S211       16.27    17.14    16.82    16.38
> S212       13.21    14.28    13.34    13.18
> S1213      12.81    13.46    12.80    12.43
> S221       10.86    11.09     8.65     8.63
> S1221       5.72     6.04     5.40     6.05
> S222        6.02     6.26     5.70     5.72
> S231       22.33    22.94    22.36    22.11
> S232        6.88     6.88     6.89     6.89
> S1232      15.30    15.34    15.05    15.10
> S233       55.38    58.55    54.21    49.56
> S2233      27.08    29.77    29.68    28.40
> S235       44.00    44.92    46.94    43.93
> S241       31.09    31.35    32.53    31.01
> S242        7.19     7.20     7.20     7.20
> S243       16.52    17.09    17.69    16.84
> S244       14.45    14.83    16.91    16.82
> S1244      14.71    14.83    14.77    14.40
> S2244      10.04    10.62    10.40    10.06
> S251       34.15    35.75    19.70    34.38
> S1251      55.23    57.84    41.77    56.11
> S2251      15.73    15.87    17.02    15.70
> S3251      15.66    16.21    19.60    15.34
> S252        6.18     6.32     7.72     7.26
> S253       11.14    11.38    14.40    14.40
> S254       18.41    18.70    28.23    28.06
> S255        5.93     6.09     9.96     9.95
> S256        3.08     3.42     3.10     3.09
> S257        2.13     2.25     2.21     2.20
> S258        1.79     1.82     1.84     1.84
> S261       12.00    12.08    10.98    10.95
> S271       32.82    33.04    33.25    33.01
> S272       14.98    15.82    15.39    15.26
> S273       13.92    14.04    16.86    16.80
> S274       17.83    18.31    18.15    17.89
> S275        2.92     3.02     3.36     2.98
> S2275      32.80    33.50     8.97    33.60
> S276       39.43    39.44    40.80    40.55
> S277        4.80     4.80     4.81     4.80
> S278       14.41    14.42    14.70    14.66
> S279        8.03     8.29     7.25     7.27
> S1279       9.71    10.06     9.34     9.25
> S2710       7.71     8.04     7.86     7.56
> S2711      35.53    35.55    36.56    36.00
> S2712      32.94    33.17    34.24    33.47
> S281       10.79    11.09    12.46    12.02
> S1281      79.13    77.55    57.78    68.06
> S291       11.80    11.78    14.03    14.03
> S292        7.77     7.78     9.94     9.96
> S293       15.50    15.87    19.32    19.33
> S2101       2.56     2.58     2.59     2.60
> S2102      16.71    17.53    16.68    16.75
> S2111       5.60     5.60     5.85     5.85
> S311       72.03    72.03    72.23    72.03
> S31111      7.49     6.00     6.00     6.00
> S312       96.04    96.04    96.05    96.03
> S313       36.02    36.13    36.03    36.02
> S314       36.01    36.07    74.67    72.42
> S315        8.96     8.99     9.35     9.30
> S316       36.02    36.06    72.08    74.87
> S317      444.93   444.94   451.82   451.78
> S318        9.05     9.07     7.30     7.30
> S319       34.54    36.53    34.42    34.19
> S3110       8.51     8.57     4.11     4.11
> S13110      5.75     5.77    12.12    12.12
> S3111       3.60     3.62     3.60     3.60
> S3112       7.19     7.30     7.21     7.20
> S3113      35.13    35.47    60.21    60.20
> S321       16.79    16.81    16.80    16.80
> S322       12.42    12.60    12.60    12.60
> S323       10.86    11.02     8.48     8.51
> S331        4.23     4.23     7.20     7.20
> S332        7.20     7.21     5.21     5.31
> S341        4.79     4.85     7.23     7.20
> S342        6.01     6.09     7.25     7.20
> S343        2.04     2.06     2.16     2.01
> S351       46.61    47.34    21.82    46.46
> S1351      49.28    50.35    33.68    49.06
> S352       57.65    58.04    57.68    57.64
> S353        8.21     8.38     8.34     8.19
> S421       42.94    43.34    20.62    22.46
> S1421      25.15    25.81    15.85    24.76
> S422       87.39    87.53    79.22    78.99
> S423      155.01   155.29   154.56   154.38
> S424       36.51    37.51    11.42    22.36
> S431       57.10    60.66    27.59    57.16
> S441       14.04    13.29    12.88    12.81
> S442        6.00     6.00     6.96     6.90
> S443       17.28    17.77    17.15    16.95
> S451       48.92    49.08    49.03    49.14
> S452       42.98    39.32    14.64    96.03
> S453       28.05    28.06    14.60    14.40
> S471        8.24     8.65     8.39     8.43
> S481       10.88    11.15    12.04    12.00
> S482        9.21     9.31     9.19     9.17
> S491       11.26    11.38    11.37    11.28
> S4112       8.21     8.36     9.13     8.94
> S4113       8.65     8.81     8.86     8.85
> S4114      11.82    12.15    12.18    11.77
> S4115       8.28     8.46     8.95     8.59
> S4116       3.22     3.23     6.02     5.94
> S4117      13.95     9.61    10.16     9.98
> S4121       8.21     8.26     4.04     8.17
> va         28.46    28.58    23.58    48.46
> vag        12.35    12.36    13.58    13.20
> vas        13.45    13.49    13.03    12.47
> vif         4.55     4.57     5.06     4.92
> vpv        57.08    57.22    28.28    57.24
> vtv        57.81    57.83    28.40    57.63
> vpvtv      32.82    32.84    16.35    32.73
> vpvts       5.82     5.83     2.99     6.38
> vpvpv      32.87    32.89    16.54    32.85
> vtvtv      32.82    32.80    16.84    35.97
> vsumr      72.04    72.03    72.20    72.04
> vdotr      72.06    72.05    72.42    72.04
> vbor      205.24   380.81    99.80   372.05
> 
>  -Hal
> 
> > > 
> > > Loop       llvm-v   llvm     gcc-v    gcc
> > > -------------------------------------------
> > > S000        9.59     9.49     4.55    10.04
> > > S111        7.67     7.37     7.68     7.83
> > > S1111      13.98    14.48    16.14    16.30
> > > S112       17.43    17.41    16.54    17.52
> > > S1112      13.87    14.21    14.83    14.84
> > > S113       22.97    22.88    22.05    22.05
> > > S1113      11.46    11.42    11.03    11.01
> > > S114       13.47    13.75    13.53    13.48
> > > S115       33.06    33.24    49.98    49.99
> > > S1115      13.91    14.18    13.65    13.66
> > > S116       48.74    49.40    49.54    48.11
> > > S118       11.04    11.26    10.79    10.50
> > > S119        8.97     9.07    11.83    11.82
> > > S1119       9.04     9.14     4.31    11.87
> > > S121       18.06    18.78    14.84    17.31
> > > S122        7.58     7.54     6.11     6.11
> > > S123        7.02     7.38     7.42     7.41
> > > S124        9.62     9.77     9.42     9.33
> > > S125        7.14     7.22     4.67     7.81
> > > S126        2.32     2.53     2.57     2.37
> > > S127       12.87    12.97     7.06    14.50
> > > S128       12.58    12.43    12.42    11.52
> > > S131       29.91    29.91    25.17    28.94
> > > S132       17.04    17.04    15.53    21.03
> > > S141       12.59    12.26    12.38    12.05
> > > S151       28.92    29.43    24.89    28.95
> > > S152       15.68    16.03    11.19    15.63
> > > S161        6.06     6.06     5.52     5.46
> > > S1161      14.46    14.40     8.80     8.79
> > > S162        8.31     9.05     5.36     8.18
> > > S171       15.47     7.94     2.81     5.70
> > > S172        5.92     5.89     2.75     5.70
> > > S173       31.59    30.92    18.15    30.13
> > > S174       31.16    31.66    18.51    30.16
> > > S175        5.80     6.18     4.94     5.77
> > > S176        5.69     5.83     4.41     7.65
> > > S211       16.56    17.14    16.82    16.38
> > > S212       13.46    14.28    13.34    13.18
> > > S1213      13.12    13.46    12.80    12.43
> > > S221       10.88    11.09     8.65     8.63
> > > S1221       5.80     6.04     5.40     6.05
> > > S222        6.01     6.26     5.70     5.72
> > > S231       23.78    22.94    22.36    22.11
> > > S232        6.88     6.88     6.89     6.89
> > > S1232      16.00    15.34    15.05    15.10
> > > S233       57.48    58.55    54.21    49.56
> > > S2233      27.65    29.77    29.68    28.40
> > > S235       46.40    44.92    46.94    43.93
> > > S241       31.62    31.35    32.53    31.01
> > > S242        7.20     7.20     7.20     7.20
> > > S243       16.78    17.09    17.69    16.84
> > > S244       14.64    14.83    16.91    16.82
> > > S1244      14.98    14.83    14.77    14.40
> > > S2244      10.47    10.62    10.40    10.06
> > > S251       35.10    35.75    19.70    34.38
> > > S1251      56.65    57.84    41.77    56.11
> > > S2251      15.96    15.87    17.02    15.70
> > > S3251      16.41    16.21    19.60    15.34
> > > S252        7.24     6.32     7.72     7.26
> > > S253       12.55    11.38    14.40    14.40
> > > S254       19.08    18.70    28.23    28.06
> > > S255        5.94     6.09     9.96     9.95
> > > S256        3.14     3.42     3.10     3.09
> > > S257        2.18     2.25     2.21     2.20
> > > S258        1.80     1.82     1.84     1.84
> > > S261       12.00    12.08    10.98    10.95
> > > S271       32.93    33.04    33.25    33.01
> > > S272       15.48    15.82    15.39    15.26
> > > S273       13.99    14.04    16.86    16.80
> > > S274       18.38    18.31    18.15    17.89
> > > S275        3.02     3.02     3.36     2.98
> > > S2275      33.71    33.50     8.97    33.60
> > > S276       39.52    39.44    40.80    40.55
> > > S277        4.81     4.80     4.81     4.80
> > > S278       14.43    14.42    14.70    14.66
> > > S279        8.10     8.29     7.25     7.27
> > > S1279       9.77    10.06     9.34     9.25
> > > S2710       7.85     8.04     7.86     7.56
> > > S2711      35.54    35.55    36.56    36.00
> > > S2712      33.16    33.17    34.24    33.47
> > > S281       10.97    11.09    12.46    12.02
> > > S1281      79.37    77.55    57.78    68.06
> > > S291       11.94    11.78    14.03    14.03
> > > S292        7.88     7.78     9.94     9.96
> > > S293       15.90    15.87    19.32    19.33
> > > S2101       2.59     2.58     2.59     2.60
> > > S2102      17.63    17.53    16.68    16.75
> > > S2111       5.63     5.60     5.85     5.85
> > > S311       72.07    72.03    72.23    72.03
> > > S31111      7.49     6.00     6.00     6.00
> > > S312       96.06    96.04    96.05    96.03
> > > S313       36.50    36.13    36.03    36.02
> > > S314       36.10    36.07    74.67    72.42
> > > S315        9.00     8.99     9.35     9.30
> > > S316       36.11    36.06    72.08    74.87
> > > S317      444.92   444.94   451.82   451.78
> > > S318        9.04     9.07     7.30     7.30
> > > S319       34.76    36.53    34.42    34.19
> > > S3110       8.53     8.57     4.11     4.11
> > > S13110      5.76     5.77    12.12    12.12
> > > S3111       3.60     3.62     3.60     3.60
> > > S3112       7.20     7.30     7.21     7.20
> > > S3113      35.12    35.47    60.21    60.20
> > > S321       16.81    16.81    16.80    16.80
> > > S322       12.42    12.60    12.60    12.60
> > > S323       10.93    11.02     8.48     8.51
> > > S331        4.23     4.23     7.20     7.20
> > > S332        7.21     7.21     5.21     5.31
> > > S341        4.74     4.85     7.23     7.20
> > > S342        6.02     6.09     7.25     7.20
> > > S343        2.14     2.06     2.16     2.01
> > > S351       49.26    47.34    21.82    46.46
> > > S1351      50.85    50.35    33.68    49.06
> > > S352       58.14    58.04    57.68    57.64
> > > S353        8.35     8.38     8.34     8.19
> > > S421       43.13    43.34    20.62    22.46
> > > S1421      25.25    25.81    15.85    24.76
> > > S422       88.36    87.53    79.22    78.99
> > > S423      155.13   155.29   154.56   154.38
> > > S424       37.11    37.51    11.42    22.36
> > > S431       58.22    60.66    27.59    57.16
> > > S441       14.05    13.29    12.88    12.81
> > > S442        6.08     6.00     6.96     6.90
> > > S443       17.60    17.77    17.15    16.95
> > > S451       48.95    49.08    49.03    49.14
> > > S452       42.98    39.32    14.64    96.03
> > > S453       28.06    28.06    14.60    14.40
> > > S471        8.53     8.65     8.39     8.43
> > > S481       10.98    11.15    12.04    12.00
> > > S482        9.31     9.31     9.19     9.17
> > > S491       11.54    11.38    11.37    11.28
> > > S4112       8.21     8.36     9.13     8.94
> > > S4113       8.77     8.81     8.86     8.85
> > > S4114      12.32    12.15    12.18    11.77
> > > S4115       8.48     8.46     8.95     8.59
> > > S4116       3.21     3.23     6.02     5.94
> > > S4117      14.08     9.61    10.16     9.98
> > > S4121       8.53     8.26     4.04     8.17
> > > va         30.09    28.58    23.58    48.46
> > > vag        12.35    12.36    13.58    13.20
> > > vas        13.74    13.49    13.03    12.47
> > > vif         4.49     4.57     5.06     4.92
> > > vpv        58.59    57.22    28.28    57.24
> > > vtv        59.15    57.83    28.40    57.63
> > > vpvtv      33.18    32.84    16.35    32.73
> > > vpvts       5.99     5.83     2.99     6.38
> > > vpvpv      33.25    32.89    16.54    32.85
> > > vtvtv      32.83    32.80    16.84    35.97
> > > vsumr      72.03    72.03    72.20    72.04
> > > vdotr      72.05    72.05    72.42    72.04
> > > vbor      205.22   380.81    99.80   372.05
> > > 
> > > I've yet to go through these in detail (they just finished running 5
> > > minutes ago). But for the curious (and I've had several requests for
> > > benchmarks), here you go. There is obviously more work to do.
> > > 
> > >  -Hal
> > > 
> > > On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote:
> > > > Hi Hal,
> > > > 
> > > > those numbers look very promising, great work! :)
> > > > 
> > > > Best,
> > > > Ralf
> > > > 
> > > > ----- Original Message -----
> > > > > From: "Hal Finkel" <hfinkel at anl.gov>
> > > > > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com>
> > > > > Cc: llvm-commits at cs.uiuc.edu
> > > > > Sent: Friday, 28 October 2011 13:50:00
> > > > > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
> > > > > 
> > > > > Bruno, et al.,
> > > > > 
> > > > > I've attached a new version of the patch that contains improvements
> > > > > (and a critical bug fix [the code output is no more correct than
> > > > > before, but the pass in the older patch would crash in certain cases
> > > > > and now does not]) compared to previous versions that I've posted.
> > > > > 
> > > > > First, these are preliminary results because I did not do the things
> > > > > necessary to make them real (explicitly quiet the machine, bind the
> > > > > processes to one cpu, etc.). But they should be good enough for
> > > > > discussion.
> > > > > 
> > > > > I'm using LLVM head r143101, with the attached patch applied, and
> > > > > clang head r143100 on an x86_64 machine (some kind of Intel Xeon).
> > > > > For the gcc comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc
> > > > > was run -O3 without any other optimization flags. opt was run
> > > > > -vectorize -unroll-allow-partial -O3 with no other optimization flags
> > > > > (the patch adds the -vectorize option). llc was just given -O3.
> > > > > 
> > > > > It is not difficult to construct an example in which vectorization
> > > > > would be useful: take a loop that does more computation than
> > > > > load/stores, and (partially) unroll it. Here is a simple case:
> > > > > 
> > > > > #define ITER 5000
> > > > > #define NUM 200
> > > > > double a[NUM][NUM];
> > > > > double b[NUM][NUM];
> > > > > 
> > > > > ...
> > > > > 
> > > > > int main()
> > > > > {
> > > > > ...
> > > > > 
> > > > >   for (int i = 0; i < ITER; ++i) {
> > > > >     for (int x = 0; x < NUM; ++x)
> > > > >     for (int y = 0; y < NUM; ++y) {
> > > > >       double v = a[x][y], w = b[x][y];
> > > > >       double z1 = v*w;
> > > > >       double z2 = v+w;
> > > > >       double z3 = z1*z2;
> > > > >       double z4 = z3+v;
> > > > >       double z5 = z2+w;
> > > > >       double z6 = z4*z5;
> > > > >       double z7 = z4+z5;
> > > > >       a[x][y] = v*v-z6;
> > > > >       b[x][y] = w-z7;
> > > > >     }
> > > > >   }
> > > > > 
> > > > >  ...
> > > > > 
> > > > >   return 0;
> > > > > }
> > > > > 
> > > > > Results:
> > > > > gcc -O3: 0m1.790s
> > > > > llvm -vectorize: 0m2.360s
> > > > > llvm: 0m2.780s
> > > > > gcc -fno-tree-vectorize: 0m2.810s
> > > > > (these are the user times after I've run enough for the times to
> > > > > settle to three decimal places)
> > > > > 
> > > > > So the vectorization gives a ~15% improvement in the running time.
> > > > > gcc's vectorization still does a much better job, however (yielding
> > > > > a ~36% improvement). So there is still work to do ;)
> > > > > 
> > > > > Additionally, I've checked the autovectorization on some classic
> > > > > numerical benchmarks from netlib. On these benchmarks, clang/llvm
> > > > > already do a good job compared to gcc (gcc is only about 10% better,
> > > > > and this is true regardless of whether gcc's vectorization is on or
> > > > > off). For these cases, autovectorization provides an insignificant
> > > > > speedup in most cases (but does not tend to make things worse, just
> > > > > not really any better either). Because gcc's vectorization also did
> > > > > not really help gcc in these cases, I'm not surprised. A good
> > > > > collection of these is available here:
> > > > > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz
> > > > > 
> > > > > I've yet to run the test suite using the pass to validate it. That is
> > > > > something that I plan to do. Actually, the "Livermore Loops" test in
> > > > > the aforementioned archive contains checksums to validate the
> > > > > results, and it looks like 1 or 2 of the loop results are wrong with
> > > > > vectorization turned on, so I'll have to investigate that.
> > > > > 
> > > > >  -Hal
> > > > > 
> > > > > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote:
> > > > > > Hi Hal,
> > > > > > 
> > > > > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > > > > > > I've attached an initial version of a basic-block autovectorization
> > > > > > > pass. It works by searching a basic block for pairable
> > > > > > > (independent) instructions, and, using a chain-seeking heuristic,
> > > > > > > selects pairings likely to provide an overall speedup (if such
> > > > > > > pairings can be found). The selected pairs are then fused and, if
> > > > > > > necessary, other instructions are moved in order to maintain
> > > > > > > data-flow consistency. This works only within one basic block, but
> > > > > > > can do loop vectorization in combination with (partial) unrolling.
> > > > > > > The basic idea was inspired by the Vienna MAP Vectorizer, which has
> > > > > > > been used to vectorize FFT kernels, but the algorithm used here is
> > > > > > > different.
> > > > > > >
> > > > > > > To try it, use -bb-vectorize with opt. There are a few options:
> > > > > > > -bb-vectorize-req-chain-depth: default: 3 -- The depth of the chain
> > > > > > > of instruction pairs necessary in order to consider the pairs that
> > > > > > > compose the chain worthy of vectorization.
> > > > > > > -bb-vectorize-vector-bits: default: 128 -- The size of the target
> > > > > > > vector registers
> > > > > > > -bb-vectorize-no-ints -- Don't consider integer instructions
> > > > > > > -bb-vectorize-no-floats -- Don't consider floating-point
> > > > > > > instructions
> > > > > > >
> > > > > > > The vectorizer generates a lot of insert_element/extract_element
> > > > > > > pairs; the assumption is that other passes will turn these into
> > > > > > > shuffles when possible (it looks like some work is necessary here).
> > > > > > > It will also vectorize vector instructions, and generates shuffles
> > > > > > > in this case (again, other passes should combine these as
> > > > > > > appropriate).
> > > > > > >
> > > > > > > Currently, it does not fuse load or store instructions, but that is
> > > > > > > a feature that I'd like to add. Of course, alignment information is
> > > > > > > an issue for load/store vectorization (or maybe I should just fuse
> > > > > > > them anyway and let isel deal with unaligned cases?).
> > > > > > >
> > > > > > > Also, support needs to be added for fusing known intrinsics (fma,
> > > > > > > etc.), and, as has been discussed on llvmdev, we should add some
> > > > > > > intrinsics to allow the generation of addsub-type instructions.
> > > > > > >
> > > > > > > I've included a few tests, but it needs more. Please review (I'll
> > > > > > > commit if and when everyone is happy).
> > > > > > >
> > > > > > > Thanks in advance,
> > > > > > > Hal
> > > > > > >
> > > > > > > P.S. There is another option (not so useful right now, but could
> > > > > > > be):
> > > > > > > -bb-vectorize-fast-dep -- Don't do a full inter-instruction
> > > > > > > dependency analysis; instead stop looking for instruction pairs
> > > > > > > after the first use of an instruction's value. [This makes the pass
> > > > > > > faster, but would require a data-dependence-based reordering pass in
> > > > > > > order to be effective].
> > > > > > 
> > > > > > Cool! :)
> > > > > > Have you run this pass with any benchmark or the llvm testsuite?
> > > > > > Does it present any regressions?
> > > > > > Do you have any performance results?
> > > > > > Cheers,
> > > > > > 
> > > > > 
> > > > > --
> > > > > Hal Finkel
> > > > > Postdoctoral Appointee
> > > > > Leadership Computing Facility
> > > > > Argonne National Laboratory
> > > > > 
> > > > > _______________________________________________
> > > > > llvm-commits mailing list
> > > > > llvm-commits at cs.uiuc.edu
> > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > > > 
> > > 
> > 
> 
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

Hal Finkel

2011-Nov-01 00:50 UTC

[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

I've attached the latest version of my autovectorization patch. This
version is significantly faster (in compile time) than the version I
posted a couple of days ago, and generally produces better output.

At this point, next steps in enhancing the vectorization include:
1. Add an add/sub and/or alternating-negation vector intrinsic to
provide for generating add-subtract, and more generally, asymmetric fma
instructions.
2. Make -vectorize imply -unroll-allow-partial [Is there an easy way to
do this?]
3. Add a -fvectorize flag to clang along the same lines.
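
For readers unfamiliar with the add-subtract pattern named in item 1, here is
a minimal scalar sketch in C. It is not taken from the patch, and the function
names are hypothetical. The first function subtracts in even lanes and adds in
odd lanes, which is the lane order used by the SSE3 ADDSUBPD instruction; the
second expresses the same result as an ordinary add with alternating negation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of the patch): add-subtract
   over pairs of doubles. Even lanes subtract, odd lanes add. */
void addsub_pairs(const double *a, const double *b, double *out, size_t n)
{
    assert(n % 2 == 0);                    /* lanes are handled two at a time */
    for (size_t i = 0; i < n; i += 2) {
        out[i]     = a[i]     - b[i];      /* even lane: subtract */
        out[i + 1] = a[i + 1] + b[i + 1];  /* odd lane: add */
    }
}

/* The same computation as a uniform add with alternating negation:
   out = a + sign * b, where sign = {-1, +1, -1, +1, ...}. */
void addsub_by_negation(const double *a, const double *b, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        double sign = (i % 2 == 0) ? -1.0 : 1.0;
        out[i] = a[i] + sign * b[i];
    }
}
```

In the second form every lane performs the same operation, so the two views
correspond to the "add/sub" and "alternating-negation" variants mentioned in
item 1.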

Updated vectorization benchmark:
Loop       llvm-v   llvm     gcc-v    gcc
-------------------------------------------
S000        9.00     9.59     4.55    10.04
S111        7.25     7.65     7.68     7.83
S1111      13.63    14.72    16.14    16.30
S112       16.60    17.45    16.54    17.52
S1112      12.99    13.87    14.83    14.84
S113       22.03    22.98    22.05    22.05
S1113      11.01    11.48    11.03    11.01
S114       13.14    13.81    13.53    13.48
S115       32.92    33.36    49.98    49.99
S1115      13.61    14.23    13.65    13.66
S116       46.90    49.43    49.54    48.11
S118       10.76    11.25    10.79    10.50
S119        8.68     9.09    11.83    11.82
S1119       8.75     9.15     4.31    11.87
S121       17.17    18.06    14.84    17.31
S122        7.53     7.70     6.11     6.11
S123        6.92     7.10     7.42     7.41
S124        9.60     9.84     9.42     9.33
S125        6.89     7.10     4.67     7.81
S126        2.33     2.55     2.57     2.37
S127       12.18    12.68     7.06    14.50
S128       11.66    12.41    12.42    11.52
S131       28.59    30.11    25.17    28.94
S132       17.04    17.04    15.53    21.03
S141       12.18    12.85    12.38    12.05
S151       28.61    30.11    24.89    28.95
S152       15.47    16.03    11.19    15.63
S161        6.00     6.12     5.52     5.46
S1161      14.40    14.50     8.80     8.79
S162        8.18     8.41     5.36     8.18
S171       14.05     7.96     2.81     5.70
S172        5.67     5.97     2.75     5.70
S173       30.17    31.69    18.15    30.13
S174       30.12    31.53    18.51    30.16
S175        5.75     6.04     4.94     5.77
S176        5.57     5.83     4.41     7.65
S211       16.23    16.89    16.82    16.38
S212       13.19    13.50    13.34    13.18
S1213      12.83    13.35    12.80    12.43
S221       10.86    11.09     8.65     8.63
S1221       5.71     6.03     5.40     6.05
S222        6.00     6.29     5.70     5.72
S231       22.23    24.22    22.36    22.11
S232        6.89     6.94     6.89     6.89
S1232      15.23    16.43    15.05    15.10
S233       55.17    59.98    54.21    49.56
S2233      27.07    29.71    29.68    28.40
S235       43.79    47.85    46.94    43.93
S241       31.00    31.72    32.53    31.01
S242        7.20     7.21     7.20     7.20
S243       16.48    16.99    17.69    16.84
S244       14.47    14.93    16.91    16.82
S1244      14.75    15.02    14.77    14.40
S2244       9.97    10.60    10.40    10.06
S251       34.20    35.55    19.70    34.38
S1251      55.09    57.11    41.77    56.11
S2251      15.64    16.26    17.02    15.70
S3251      15.55    16.52    19.60    15.34
S252        6.14     6.46     7.72     7.26
S253       11.18    11.52    14.40    14.40
S254       17.72    18.98    28.23    28.06
S255        5.93     6.14     9.96     9.95
S256        3.06     3.39     3.10     3.09
S257        2.12     2.31     2.21     2.20
S258        1.79     1.87     1.84     1.84
S261       12.01    12.22    10.98    10.95
S271       32.76    33.76    33.25    33.01
S272       14.93    15.52    15.39    15.26
S273       13.92    14.10    16.86    16.80
S274       17.77    18.53    18.15    17.89
S275        2.90     3.14     3.36     2.98
S2275      32.65    34.95     8.97    33.60
S276       41.38    41.97    40.80    40.55
S277        4.81     4.93     4.81     4.80
S278       14.41    14.76    14.70    14.66
S279        8.04     8.24     7.25     7.27
S1279       9.71     9.92     9.34     9.25
S2710       7.68     8.07     7.86     7.56
S2711      35.53    37.10    36.56    36.00
S2712      32.91    33.96    34.24    33.47
S281       10.75    11.32    12.46    12.02
S1281     104.13    78.11    57.78    68.06
S291       11.75    12.27    14.03    14.03
S292        6.70     6.91     9.94     9.96
S293       15.38    16.24    19.32    19.33
S2101       2.50     2.67     2.59     2.60
S2102      16.56    18.45    16.68    16.75
S2111       5.59     5.63     5.85     5.85
S311       72.04    72.27    72.23    72.03
S31111      6.37     6.01     6.00     6.00
S312       96.04    96.17    96.05    96.03
S313       36.03    36.61    36.03    36.02
S314       36.02    36.12    74.67    72.42
S315        9.11     9.21     9.35     9.30
S316       36.02    36.12    72.08    74.87
S317      444.96   444.94   451.82   451.78
S318        9.07     9.12     7.30     7.30
S319       34.49    36.46    34.42    34.19
S3110       8.53     8.61     4.11     4.11
S13110      5.75     5.78    12.12    12.12
S3111       3.60     3.64     3.60     3.60
S3112       7.21     7.30     7.21     7.20
S3113      33.68    34.18    60.21    60.20
S321       16.80    16.87    16.80    16.80
S322       12.42    12.64    12.60    12.60
S323       10.89    11.24     8.48     8.51
S331        4.23     4.36     7.20     7.20
S332        7.21     7.28     5.21     5.31
S341        4.76     5.04     7.23     7.20
S342        6.02     6.24     7.25     7.20
S343        2.02     2.16     2.16     2.01
S351       46.33    48.65    21.82    46.46
S1351      49.07    51.28    33.68    49.06
S352       57.65    58.44    57.68    57.64
S353        8.19     8.44     8.34     8.19
S421       24.17    25.29    20.62    22.46
S1421      25.09    26.16    15.85    24.76
S422       79.95    81.51    79.22    78.99
S423      154.93   155.21   154.56   154.38
S424       22.61    23.35    11.42    22.36
S431       56.88    59.82    27.59    57.16
S441       14.05    14.23    12.88    12.81
S442        5.99     6.13     6.96     6.90
S443       17.33    17.77    17.15    16.95
S451       48.94    48.99    49.03    49.14
S452       43.01    39.57    14.64    96.03
S453       28.07    28.07    14.60    14.40
S471        8.20     8.56     8.39     8.43
S481       10.89    11.23    12.04    12.00
S482        9.20     9.42     9.19     9.17
S491       11.25    11.60    11.37    11.28
S4112       8.20     8.45     9.13     8.94
S4113       8.64     8.95     8.86     8.85
S4114      11.82    12.35    12.18    11.77
S4115       8.27     8.51     8.95     8.59
S4116       3.22     3.22     6.02     5.94
S4117      13.96     9.69    10.16     9.98
S4121       8.19     8.44     4.04     8.17
va         28.39    29.33    23.58    48.46
vag        12.26    12.93    13.58    13.20
vas        13.36    14.15    13.03    12.47
vif         4.50     4.79     5.06     4.92
vpv        56.84    59.83    28.28    57.24
vtv        57.58    60.42    28.40    57.63
vpvtv      32.78    33.77    16.35    32.73
vpvts       5.78     6.07     2.99     6.38
vpvpv      32.78    33.84    16.54    32.85
vtvtv      32.76    33.75    16.84    35.97
vsumr      72.04    72.28    72.20    72.04
vdotr      72.05    73.22    72.42    72.04
vbor      227.55   381.18    99.80   372.05

 -Hal

On Sat, 2011-10-29 at 17:56 -0500, Hal Finkel wrote:
> On Sat, 2011-10-29 at 15:16 -0500, Hal Finkel wrote:
> > On Sat, 2011-10-29 at 14:02 -0500, Hal Finkel wrote:
> > > On Sat, 2011-10-29 at 12:30 -0500, Hal Finkel wrote:
> > > > Ralf, et al.,
> > > > 
> > > > Attached is the latest version of my autovectorization patch. llvmdev
> > > > has been CC'd (as had been suggested to me); this e-mail contains
> > > > additional benchmark results.
> > > >
> > > > First, these are preliminary results because I did not do the things
> > > > necessary to make them real (explicitly quiet the machine, bind the
> > > > processes to one cpu, etc.). But they should be good enough for
> > > > discussion.
> > > >
> > > > I'm using LLVM head r143101, with the attached patch applied, and clang
> > > > head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc
> > > > comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc was run -O3
> > > > without any other optimization flags. opt was run -vectorize
> > > > -unroll-allow-partial -O3 with no other optimization flags (the patch
> > > > adds the -vectorize option).
> > > 
> > > And opt had also been given the flag: -bb-vectorize-vector-bits=256
> >
> > And this was a mistake (because the machine on which the benchmarks were
> > run does not have AVX). I've rerun, see better results below...
> > 
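The vector-width mistake described above can be guarded against by checking the host for AVX before asking for 256-bit vectors. The following is a minimal sketch, not part of the patch: the helper names `host_has_avx` and `pick_vector_bits` are hypothetical, and the check assumes a GCC- or Clang-compatible compiler that provides `__builtin_cpu_supports` on x86.

```c
/* Returns nonzero if the host CPU reports AVX support.
 * Assumption: GCC/Clang on x86; elsewhere this conservatively returns 0. */
int host_has_avx(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  return __builtin_cpu_supports("avx") ? 1 : 0;
#else
  return 0;
#endif
}

/* Hypothetical helper: value to pass to -bb-vectorize-vector-bits.
 * 128 is the default named in the thread; 256 only makes sense with AVX. */
int pick_vector_bits(int has_avx)
{
  return has_avx ? 256 : 128;
}
```

Calling `pick_vector_bits(host_has_avx())` on the benchmark machine described in the thread (no AVX) would yield 128, the default.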
> > > 
> > >  -Hal
> > > 
> > > > llc was just given -O3.
> > > > 
> > > > Below I've included results using the benchmark program by Maleki, et
> > > > al. See:
> > > > An Evaluation of Vectorizing Compilers - PACT'11
> > > > (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
> > > > their benchmark program was retrieved from:
> > > > http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz
> > > >
> > > > Also, when using clang, I had to pass -Dinline= on the command line:
> > > > when using -emit-llvm, clang appears not to emit code for functions
> > > > declared inline. This is a bug, but I've not yet tracked it down. There
> > > > are two such small functions in the benchmark program, and the regular
> > > > inliner *should* catch them anyway.
> > > > 
> > > > Results:
> > > > 0. Name of the loop
> > > > 1. Time using LLVM with vectorization
> > > > 2. Time using LLVM without vectorization
> > > > 3. Time using gcc with vectorization
> > > > 4. Time using gcc without vectorization
> 
> As Peter Collingbourne indirectly pointed out to me, clang's
> optimizations are still important (even if it is emitting only llvm).
> I've rerun the llvm code generation steps, adding -O3 to clang. Here are
> the results (they are significantly better):
> 
> Loop       llvm-v   llvm     gcc-v    gcc
> -------------------------------------------
> S000        9.10     9.59     4.55    10.04
> S111        7.29     7.65     7.68     7.83
> S1111      13.87    14.72    16.14    16.30
> S112       16.67    17.45    16.54    17.52
> S1112      13.16    13.87    14.83    14.84
> S113       22.14    22.98    22.05    22.05
> S1113      11.06    11.48    11.03    11.01
> S114       13.21    13.81    13.53    13.48
> S115       32.82    33.36    49.98    49.99
> S1115      13.67    14.23    13.65    13.66
> S116       47.37    49.43    49.54    48.11
> S118       10.81    11.25    10.79    10.50
> S119        8.73     9.09    11.83    11.82
> S1119       8.82     9.15     4.31    11.87
> S121       17.29    18.06    14.84    17.31
> S122        7.53     7.70     6.11     6.11
> S123        6.93     7.10     7.42     7.41
> S124        9.63     9.84     9.42     9.33
> S125        6.94     7.10     4.67     7.81
> S126        2.34     2.55     2.57     2.37
> S127       12.23    12.68     7.06    14.50
> S128       11.78    12.41    12.42    11.52
> S131       28.79    30.11    25.17    28.94
> S132       17.04    17.04    15.53    21.03
> S141       12.26    12.85    12.38    12.05
> S151       28.79    30.11    24.89    28.95
> S152       15.53    16.03    11.19    15.63
> S161        6.00     6.12     5.52     5.46
> S1161      14.40    14.50     8.80     8.79
> S162        8.19     8.41     5.36     8.18
> S171       15.41     7.96     2.81     5.70
> S172        5.70     5.97     2.75     5.70
> S173       30.32    31.69    18.15    30.13
> S174       30.20    31.53    18.51    30.16
> S175        5.79     6.04     4.94     5.77
> S176        5.59     5.83     4.41     7.65
> S211       16.31    16.89    16.82    16.38
> S212       13.23    13.50    13.34    13.18
> S1213      12.82    13.35    12.80    12.43
> S221       10.87    11.09     8.65     8.63
> S1221       5.72     6.03     5.40     6.05
> S222        6.01     6.29     5.70     5.72
> S231       22.38    24.22    22.36    22.11
> S232        6.89     6.94     6.89     6.89
> S1232      15.31    16.43    15.05    15.10
> S233       55.47    59.98    54.21    49.56
> S2233      27.23    29.71    29.68    28.40
> S235       44.08    47.85    46.94    43.93
> S241       31.14    31.72    32.53    31.01
> S242        7.20     7.21     7.20     7.20
> S243       16.54    16.99    17.69    16.84
> S244       14.51    14.93    16.91    16.82
> S1244      14.72    15.02    14.77    14.40
> S2244      10.09    10.60    10.40    10.06
> S251       34.42    35.55    19.70    34.38
> S1251      55.39    57.11    41.77    56.11
> S2251      15.69    16.26    17.02    15.70
> S3251      15.69    16.52    19.60    15.34
> S252        6.18     6.46     7.72     7.26
> S253       11.19    11.52    14.40    14.40
> S254       18.00    18.98    28.23    28.06
> S255        5.94     6.14     9.96     9.95
> S256        3.09     3.39     3.10     3.09
> S257        2.13     2.31     2.21     2.20
> S258        1.80     1.87     1.84     1.84
> S261       12.00    12.22    10.98    10.95
> S271       32.81    33.76    33.25    33.01
> S272       15.04    15.52    15.39    15.26
> S273       13.93    14.10    16.86    16.80
> S274       17.83    18.53    18.15    17.89
> S275        2.92     3.14     3.36     2.98
> S2275      32.81    34.95     8.97    33.60
> S276       41.26    41.97    40.80    40.55
> S277        4.80     4.93     4.81     4.80
> S278       14.43    14.76    14.70    14.66
> S279        8.05     8.24     7.25     7.27
> S1279       9.72     9.92     9.34     9.25
> S2710       7.73     8.07     7.86     7.56
> S2711      36.49    37.10    36.56    36.00
> S2712      32.96    33.96    34.24    33.47
> S281       10.80    11.32    12.46    12.02
> S1281      79.10    78.11    57.78    68.06
> S291       11.79    12.27    14.03    14.03
> S292        6.70     6.91     9.94     9.96
> S293       15.50    16.24    19.32    19.33
> S2101       2.56     2.67     2.59     2.60
> S2102      16.74    18.45    16.68    16.75
> S2111       5.59     5.63     5.85     5.85
> S311       72.04    72.27    72.23    72.03
> S31111      7.50     6.01     6.00     6.00
> S312       96.04    96.17    96.05    96.03
> S313       36.02    36.61    36.03    36.02
> S314       36.01    36.12    74.67    72.42
> S315        9.11     9.21     9.35     9.30
> S316       36.01    36.12    72.08    74.87
> S317      444.91   444.94   451.82   451.78
> S318        9.07     9.12     7.30     7.30
> S319       34.57    36.46    34.42    34.19
> S3110       8.52     8.61     4.11     4.11
> S13110      5.75     5.78    12.12    12.12
> S3111       3.60     3.64     3.60     3.60
> S3112       7.20     7.30     7.21     7.20
> S3113      33.68    34.18    60.21    60.20
> S321       16.80    16.87    16.80    16.80
> S322       12.42    12.64    12.60    12.60
> S323       10.88    11.24     8.48     8.51
> S331        4.23     4.36     7.20     7.20
> S332        7.20     7.28     5.21     5.31
> S341        4.80     5.04     7.23     7.20
> S342        6.01     6.24     7.25     7.20
> S343        2.04     2.16     2.16     2.01
> S351       46.63    48.65    21.82    46.46
> S1351      49.37    51.28    33.68    49.06
> S352       57.64    58.44    57.68    57.64
> S353        8.21     8.44     8.34     8.19
> S421       24.26    25.29    20.62    22.46
> S1421      25.18    26.16    15.85    24.76
> S422       80.08    81.51    79.22    78.99
> S423      155.02   155.21   154.56   154.38
> S424       22.62    23.35    11.42    22.36
> S431       57.22    59.82    27.59    57.16
> S441       13.27    14.23    12.88    12.81
> S442        5.99     6.13     6.96     6.90
> S443       17.37    17.77    17.15    16.95
> S451       48.92    48.99    49.03    49.14
> S452       42.97    39.57    14.64    96.03
> S453       28.06    28.07    14.60    14.40
> S471        8.27     8.56     8.39     8.43
> S481       10.93    11.23    12.04    12.00
> S482        9.21     9.42     9.19     9.17
> S491       11.31    11.60    11.37    11.28
> S4112       8.21     8.45     9.13     8.94
> S4113       8.65     8.95     8.86     8.85
> S4114      11.87    12.35    12.18    11.77
> S4115       8.28     8.51     8.95     8.59
> S4116       3.23     3.22     6.02     5.94
> S4117      13.97     9.69    10.16     9.98
> S4121       8.20     8.44     4.04     8.17
> va         28.50    29.33    23.58    48.46
> vag        12.37    12.93    13.58    13.20
> vas        13.46    14.15    13.03    12.47
> vif         4.55     4.79     5.06     4.92
> vpv        57.21    59.83    28.28    57.24
> vtv        57.92    60.42    28.40    57.63
> vpvtv      32.84    33.77    16.35    32.73
> vpvts       5.82     6.07     2.99     6.38
> vpvpv      32.87    33.84    16.54    32.85
> vtvtv      32.82    33.75    16.84    35.97
> vsumr      72.03    72.28    72.20    72.04
> vdotr      72.05    73.22    72.42    72.04
> vbor      205.24   381.18    99.80   372.05
> 
> I apologize for the multiple e-mails with a long list of numbers, but I
> think that this was worth it (and I did not want to be unfair to the
> clang developers).
> 
>  -Hal
> 
> > 
> > Here are improved results where the correct (and default)
> > vector-register size was used.
> > 
> > Loop       llvm-v   llvm     gcc-v    gcc
> > -------------------------------------------
> > S000        9.09     9.49     4.55    10.04
> > S111        7.28     7.37     7.68     7.83
> > S1111      13.78    14.48    16.14    16.30
> > S112       16.67    17.41    16.54    17.52
> > S1112      13.12    14.21    14.83    14.84
> > S113       22.12    22.88    22.05    22.05
> > S1113      11.06    11.42    11.03    11.01
> > S114       13.23    13.75    13.53    13.48
> > S115       32.76    33.24    49.98    49.99
> > S1115      13.68    14.18    13.65    13.66
> > S116       47.42    49.40    49.54    48.11
> > S118       10.84    11.26    10.79    10.50
> > S119        8.74     9.07    11.83    11.82
> > S1119       8.81     9.14     4.31    11.87
> > S121       17.28    18.78    14.84    17.31
> > S122        7.53     7.54     6.11     6.11
> > S123        6.90     7.38     7.42     7.41
> > S124        9.60     9.77     9.42     9.33
> > S125        6.92     7.22     4.67     7.81
> > S126        2.34     2.53     2.57     2.37
> > S127       12.19    12.97     7.06    14.50
> > S128       11.74    12.43    12.42    11.52
> > S131       28.75    29.91    25.17    28.94
> > S132       17.04    17.04    15.53    21.03
> > S141       12.28    12.26    12.38    12.05
> > S151       28.80    29.43    24.89    28.95
> > S152       15.54    16.03    11.19    15.63
> > S161        6.00     6.06     5.52     5.46
> > S1161      14.39    14.40     8.80     8.79
> > S162        8.19     9.05     5.36     8.18
> > S171       15.41     7.94     2.81     5.70
> > S172        5.71     5.89     2.75     5.70
> > S173       30.31    30.92    18.15    30.13
> > S174       30.18    31.66    18.51    30.16
> > S175        5.78     6.18     4.94     5.77
> > S176        5.59     5.83     4.41     7.65
> > S211       16.27    17.14    16.82    16.38
> > S212       13.21    14.28    13.34    13.18
> > S1213      12.81    13.46    12.80    12.43
> > S221       10.86    11.09     8.65     8.63
> > S1221       5.72     6.04     5.40     6.05
> > S222        6.02     6.26     5.70     5.72
> > S231       22.33    22.94    22.36    22.11
> > S232        6.88     6.88     6.89     6.89
> > S1232      15.30    15.34    15.05    15.10
> > S233       55.38    58.55    54.21    49.56
> > S2233      27.08    29.77    29.68    28.40
> > S235       44.00    44.92    46.94    43.93
> > S241       31.09    31.35    32.53    31.01
> > S242        7.19     7.20     7.20     7.20
> > S243       16.52    17.09    17.69    16.84
> > S244       14.45    14.83    16.91    16.82
> > S1244      14.71    14.83    14.77    14.40
> > S2244      10.04    10.62    10.40    10.06
> > S251       34.15    35.75    19.70    34.38
> > S1251      55.23    57.84    41.77    56.11
> > S2251      15.73    15.87    17.02    15.70
> > S3251      15.66    16.21    19.60    15.34
> > S252        6.18     6.32     7.72     7.26
> > S253       11.14    11.38    14.40    14.40
> > S254       18.41    18.70    28.23    28.06
> > S255        5.93     6.09     9.96     9.95
> > S256        3.08     3.42     3.10     3.09
> > S257        2.13     2.25     2.21     2.20
> > S258        1.79     1.82     1.84     1.84
> > S261       12.00    12.08    10.98    10.95
> > S271       32.82    33.04    33.25    33.01
> > S272       14.98    15.82    15.39    15.26
> > S273       13.92    14.04    16.86    16.80
> > S274       17.83    18.31    18.15    17.89
> > S275        2.92     3.02     3.36     2.98
> > S2275      32.80    33.50     8.97    33.60
> > S276       39.43    39.44    40.80    40.55
> > S277        4.80     4.80     4.81     4.80
> > S278       14.41    14.42    14.70    14.66
> > S279        8.03     8.29     7.25     7.27
> > S1279       9.71    10.06     9.34     9.25
> > S2710       7.71     8.04     7.86     7.56
> > S2711      35.53    35.55    36.56    36.00
> > S2712      32.94    33.17    34.24    33.47
> > S281       10.79    11.09    12.46    12.02
> > S1281      79.13    77.55    57.78    68.06
> > S291       11.80    11.78    14.03    14.03
> > S292        7.77     7.78     9.94     9.96
> > S293       15.50    15.87    19.32    19.33
> > S2101       2.56     2.58     2.59     2.60
> > S2102      16.71    17.53    16.68    16.75
> > S2111       5.60     5.60     5.85     5.85
> > S311       72.03    72.03    72.23    72.03
> > S31111      7.49     6.00     6.00     6.00
> > S312       96.04    96.04    96.05    96.03
> > S313       36.02    36.13    36.03    36.02
> > S314       36.01    36.07    74.67    72.42
> > S315        8.96     8.99     9.35     9.30
> > S316       36.02    36.06    72.08    74.87
> > S317      444.93   444.94   451.82   451.78
> > S318        9.05     9.07     7.30     7.30
> > S319       34.54    36.53    34.42    34.19
> > S3110       8.51     8.57     4.11     4.11
> > S13110      5.75     5.77    12.12    12.12
> > S3111       3.60     3.62     3.60     3.60
> > S3112       7.19     7.30     7.21     7.20
> > S3113      35.13    35.47    60.21    60.20
> > S321       16.79    16.81    16.80    16.80
> > S322       12.42    12.60    12.60    12.60
> > S323       10.86    11.02     8.48     8.51
> > S331        4.23     4.23     7.20     7.20
> > S332        7.20     7.21     5.21     5.31
> > S341        4.79     4.85     7.23     7.20
> > S342        6.01     6.09     7.25     7.20
> > S343        2.04     2.06     2.16     2.01
> > S351       46.61    47.34    21.82    46.46
> > S1351      49.28    50.35    33.68    49.06
> > S352       57.65    58.04    57.68    57.64
> > S353        8.21     8.38     8.34     8.19
> > S421       42.94    43.34    20.62    22.46
> > S1421      25.15    25.81    15.85    24.76
> > S422       87.39    87.53    79.22    78.99
> > S423      155.01   155.29   154.56   154.38
> > S424       36.51    37.51    11.42    22.36
> > S431       57.10    60.66    27.59    57.16
> > S441       14.04    13.29    12.88    12.81
> > S442        6.00     6.00     6.96     6.90
> > S443       17.28    17.77    17.15    16.95
> > S451       48.92    49.08    49.03    49.14
> > S452       42.98    39.32    14.64    96.03
> > S453       28.05    28.06    14.60    14.40
> > S471        8.24     8.65     8.39     8.43
> > S481       10.88    11.15    12.04    12.00
> > S482        9.21     9.31     9.19     9.17
> > S491       11.26    11.38    11.37    11.28
> > S4112       8.21     8.36     9.13     8.94
> > S4113       8.65     8.81     8.86     8.85
> > S4114      11.82    12.15    12.18    11.77
> > S4115       8.28     8.46     8.95     8.59
> > S4116       3.22     3.23     6.02     5.94
> > S4117      13.95     9.61    10.16     9.98
> > S4121       8.21     8.26     4.04     8.17
> > va         28.46    28.58    23.58    48.46
> > vag        12.35    12.36    13.58    13.20
> > vas        13.45    13.49    13.03    12.47
> > vif         4.55     4.57     5.06     4.92
> > vpv        57.08    57.22    28.28    57.24
> > vtv        57.81    57.83    28.40    57.63
> > vpvtv      32.82    32.84    16.35    32.73
> > vpvts       5.82     5.83     2.99     6.38
> > vpvpv      32.87    32.89    16.54    32.85
> > vtvtv      32.82    32.80    16.84    35.97
> > vsumr      72.04    72.03    72.20    72.04
> > vdotr      72.06    72.05    72.42    72.04
> > vbor      205.24   380.81    99.80   372.05
> > 
> >  -Hal
> > 
> > > > 
> > > > Loop       llvm-v   llvm     gcc-v    gcc
> > > > -------------------------------------------
> > > > S000        9.59     9.49     4.55    10.04
> > > > S111        7.67     7.37     7.68     7.83
> > > > S1111      13.98    14.48    16.14    16.30
> > > > S112       17.43    17.41    16.54    17.52
> > > > S1112      13.87    14.21    14.83    14.84
> > > > S113       22.97    22.88    22.05    22.05
> > > > S1113      11.46    11.42    11.03    11.01
> > > > S114       13.47    13.75    13.53    13.48
> > > > S115       33.06    33.24    49.98    49.99
> > > > S1115      13.91    14.18    13.65    13.66
> > > > S116       48.74    49.40    49.54    48.11
> > > > S118       11.04    11.26    10.79    10.50
> > > > S119        8.97     9.07    11.83    11.82
> > > > S1119       9.04     9.14     4.31    11.87
> > > > S121       18.06    18.78    14.84    17.31
> > > > S122        7.58     7.54     6.11     6.11
> > > > S123        7.02     7.38     7.42     7.41
> > > > S124        9.62     9.77     9.42     9.33
> > > > S125        7.14     7.22     4.67     7.81
> > > > S126        2.32     2.53     2.57     2.37
> > > > S127       12.87    12.97     7.06    14.50
> > > > S128       12.58    12.43    12.42    11.52
> > > > S131       29.91    29.91    25.17    28.94
> > > > S132       17.04    17.04    15.53    21.03
> > > > S141       12.59    12.26    12.38    12.05
> > > > S151       28.92    29.43    24.89    28.95
> > > > S152       15.68    16.03    11.19    15.63
> > > > S161        6.06     6.06     5.52     5.46
> > > > S1161      14.46    14.40     8.80     8.79
> > > > S162        8.31     9.05     5.36     8.18
> > > > S171       15.47     7.94     2.81     5.70
> > > > S172        5.92     5.89     2.75     5.70
> > > > S173       31.59    30.92    18.15    30.13
> > > > S174       31.16    31.66    18.51    30.16
> > > > S175        5.80     6.18     4.94     5.77
> > > > S176        5.69     5.83     4.41     7.65
> > > > S211       16.56    17.14    16.82    16.38
> > > > S212       13.46    14.28    13.34    13.18
> > > > S1213      13.12    13.46    12.80    12.43
> > > > S221       10.88    11.09     8.65     8.63
> > > > S1221       5.80     6.04     5.40     6.05
> > > > S222        6.01     6.26     5.70     5.72
> > > > S231       23.78    22.94    22.36    22.11
> > > > S232        6.88     6.88     6.89     6.89
> > > > S1232      16.00    15.34    15.05    15.10
> > > > S233       57.48    58.55    54.21    49.56
> > > > S2233      27.65    29.77    29.68    28.40
> > > > S235       46.40    44.92    46.94    43.93
> > > > S241       31.62    31.35    32.53    31.01
> > > > S242        7.20     7.20     7.20     7.20
> > > > S243       16.78    17.09    17.69    16.84
> > > > S244       14.64    14.83    16.91    16.82
> > > > S1244      14.98    14.83    14.77    14.40
> > > > S2244      10.47    10.62    10.40    10.06
> > > > S251       35.10    35.75    19.70    34.38
> > > > S1251      56.65    57.84    41.77    56.11
> > > > S2251      15.96    15.87    17.02    15.70
> > > > S3251      16.41    16.21    19.60    15.34
> > > > S252        7.24     6.32     7.72     7.26
> > > > S253       12.55    11.38    14.40    14.40
> > > > S254       19.08    18.70    28.23    28.06
> > > > S255        5.94     6.09     9.96     9.95
> > > > S256        3.14     3.42     3.10     3.09
> > > > S257        2.18     2.25     2.21     2.20
> > > > S258        1.80     1.82     1.84     1.84
> > > > S261       12.00    12.08    10.98    10.95
> > > > S271       32.93    33.04    33.25    33.01
> > > > S272       15.48    15.82    15.39    15.26
> > > > S273       13.99    14.04    16.86    16.80
> > > > S274       18.38    18.31    18.15    17.89
> > > > S275        3.02     3.02     3.36     2.98
> > > > S2275      33.71    33.50     8.97    33.60
> > > > S276       39.52    39.44    40.80    40.55
> > > > S277        4.81     4.80     4.81     4.80
> > > > S278       14.43    14.42    14.70    14.66
> > > > S279        8.10     8.29     7.25     7.27
> > > > S1279       9.77    10.06     9.34     9.25
> > > > S2710       7.85     8.04     7.86     7.56
> > > > S2711      35.54    35.55    36.56    36.00
> > > > S2712      33.16    33.17    34.24    33.47
> > > > S281       10.97    11.09    12.46    12.02
> > > > S1281      79.37    77.55    57.78    68.06
> > > > S291       11.94    11.78    14.03    14.03
> > > > S292        7.88     7.78     9.94     9.96
> > > > S293       15.90    15.87    19.32    19.33
> > > > S2101       2.59     2.58     2.59     2.60
> > > > S2102      17.63    17.53    16.68    16.75
> > > > S2111       5.63     5.60     5.85     5.85
> > > > S311       72.07    72.03    72.23    72.03
> > > > S31111      7.49     6.00     6.00     6.00
> > > > S312       96.06    96.04    96.05    96.03
> > > > S313       36.50    36.13    36.03    36.02
> > > > S314       36.10    36.07    74.67    72.42
> > > > S315        9.00     8.99     9.35     9.30
> > > > S316       36.11    36.06    72.08    74.87
> > > > S317      444.92   444.94   451.82   451.78
> > > > S318        9.04     9.07     7.30     7.30
> > > > S319       34.76    36.53    34.42    34.19
> > > > S3110       8.53     8.57     4.11     4.11
> > > > S13110      5.76     5.77    12.12    12.12
> > > > S3111       3.60     3.62     3.60     3.60
> > > > S3112       7.20     7.30     7.21     7.20
> > > > S3113      35.12    35.47    60.21    60.20
> > > > S321       16.81    16.81    16.80    16.80
> > > > S322       12.42    12.60    12.60    12.60
> > > > S323       10.93    11.02     8.48     8.51
> > > > S331        4.23     4.23     7.20     7.20
> > > > S332        7.21     7.21     5.21     5.31
> > > > S341        4.74     4.85     7.23     7.20
> > > > S342        6.02     6.09     7.25     7.20
> > > > S343        2.14     2.06     2.16     2.01
> > > > S351       49.26    47.34    21.82    46.46
> > > > S1351      50.85    50.35    33.68    49.06
> > > > S352       58.14    58.04    57.68    57.64
> > > > S353        8.35     8.38     8.34     8.19
> > > > S421       43.13    43.34    20.62    22.46
> > > > S1421      25.25    25.81    15.85    24.76
> > > > S422       88.36    87.53    79.22    78.99
> > > > S423      155.13   155.29   154.56   154.38
> > > > S424       37.11    37.51    11.42    22.36
> > > > S431       58.22    60.66    27.59    57.16
> > > > S441       14.05    13.29    12.88    12.81
> > > > S442        6.08     6.00     6.96     6.90
> > > > S443       17.60    17.77    17.15    16.95
> > > > S451       48.95    49.08    49.03    49.14
> > > > S452       42.98    39.32    14.64    96.03
> > > > S453       28.06    28.06    14.60    14.40
> > > > S471        8.53     8.65     8.39     8.43
> > > > S481       10.98    11.15    12.04    12.00
> > > > S482        9.31     9.31     9.19     9.17
> > > > S491       11.54    11.38    11.37    11.28
> > > > S4112       8.21     8.36     9.13     8.94
> > > > S4113       8.77     8.81     8.86     8.85
> > > > S4114      12.32    12.15    12.18    11.77
> > > > S4115       8.48     8.46     8.95     8.59
> > > > S4116       3.21     3.23     6.02     5.94
> > > > S4117      14.08     9.61    10.16     9.98
> > > > S4121       8.53     8.26     4.04     8.17
> > > > va         30.09    28.58    23.58    48.46
> > > > vag        12.35    12.36    13.58    13.20
> > > > vas        13.74    13.49    13.03    12.47
> > > > vif         4.49     4.57     5.06     4.92
> > > > vpv        58.59    57.22    28.28    57.24
> > > > vtv        59.15    57.83    28.40    57.63
> > > > vpvtv      33.18    32.84    16.35    32.73
> > > > vpvts       5.99     5.83     2.99     6.38
> > > > vpvpv      33.25    32.89    16.54    32.85
> > > > vtvtv      32.83    32.80    16.84    35.97
> > > > vsumr      72.03    72.03    72.20    72.04
> > > > vdotr      72.05    72.05    72.42    72.04
> > > > vbor      205.22   380.81    99.80   372.05
> > > > 
> > > > I've yet to go through these in detail (they just finished running 5
> > > > minutes ago). But for the curious (and I've had several requests for
> > > > benchmarks), here you go. There is obviously more work to do.
> > > > 
> > > >  -Hal
> > > > 
> > > > On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote:
> > > > > Hi Hal,
> > > > > 
> > > > > those numbers look very promising, great work! :)
> > > > > 
> > > > > Best,
> > > > > Ralf
> > > > > 
> > > > > ----- Original Message -----
> > > > > > > From: "Hal Finkel" <hfinkel at anl.gov>
> > > > > > > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com>
> > > > > > > Cc: llvm-commits at cs.uiuc.edu
> > > > > > > Sent: Friday, 28 October 2011 13:50:00
> > > > > > > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
> > > > > > 
> > > > > > Bruno, et al.,
> > > > > > 
> > > > > > > I've attached a new version of the patch that contains improvements
> > > > > > > (and a critical bug fix [the code output is not more right, but the
> > > > > > > pass in the older patch would crash in certain cases and now does
> > > > > > > not]) compared to previous versions that I've posted.
> > > > > > 
> > > > > > > First, these are preliminary results because I did not do the things
> > > > > > > necessary to make them real (explicitly quiet the machine, bind the
> > > > > > > processes to one cpu, etc.). But they should be good enough for
> > > > > > > discussion.
> > > > > > 
> > > > > > > I'm using LLVM head r143101, with the attached patch applied, and
> > > > > > > clang head r143100 on an x86_64 machine (some kind of Intel Xeon).
> > > > > > > For the gcc comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc
> > > > > > > was run -O3 without any other optimization flags. opt was run
> > > > > > > -vectorize -unroll-allow-partial -O3 with no other optimization
> > > > > > > flags (the patch adds the -vectorize option). llc was just given
> > > > > > > -O3.
> > > > > > 
> > > > > > > It is not difficult to construct an example in which vectorization
> > > > > > > would be useful: take a loop that does more computation than
> > > > > > > load/stores, and (partially) unroll it. Here is a simple case:
> > > > > > 
> > > > > > #define ITER 5000
> > > > > > #define NUM 200
> > > > > > double a[NUM][NUM];
> > > > > > double b[NUM][NUM];
> > > > > > 
> > > > > > ...
> > > > > > 
> > > > > > int main()
> > > > > > {
> > > > > > ...
> > > > > > 
> > > > > >   for (int i = 0; i < ITER; ++i) {
> > > > > >     for (int x = 0; x < NUM; ++x)
> > > > > >     for (int y = 0; y < NUM; ++y) {
> > > > > >       double v = a[x][y], w = b[x][y];
> > > > > >       double z1 = v*w;
> > > > > >       double z2 = v+w;
> > > > > >       double z3 = z1*z2;
> > > > > >       double z4 = z3+v;
> > > > > >       double z5 = z2+w;
> > > > > >       double z6 = z4*z5;
> > > > > >       double z7 = z4+z5;
> > > > > >       a[x][y] = v*v-z6;
> > > > > >       b[x][y] = w-z7;
> > > > > >     }
> > > > > >   }
> > > > > > 
> > > > > >  ...
> > > > > > 
> > > > > >   return 0;
> > > > > > }
> > > > > > 
> > > > > > Results:
> > > > > > > gcc -O3: 0m1.790s
> > > > > > llvm -vectorize: 0m2.360s
> > > > > > llvm: 0m2.780s
> > > > > > gcc -fno-tree-vectorize: 0m2.810s
> > > > > > > (these are the user times after I've run enough for the times to
> > > > > > > settle to three decimal places)
> > > > > > 
> > > > > > > So the vectorization gives a ~15% improvement in the running time.
> > > > > > > gcc's vectorization still does a much better job, however (yielding
> > > > > > > a ~36% improvement). So there is still work to do ;)
> > > > > > 
> > > > > > > Additionally, I've checked the autovectorization on some classic
> > > > > > > numerical benchmarks from netlib. On these benchmarks, clang/llvm
> > > > > > > already do a good job compared to gcc (gcc is only about 10% better,
> > > > > > > and this is true regardless of whether gcc's vectorization is on or
> > > > > > > off). For these cases, autovectorization provides an insignificant
> > > > > > > speedup in most cases (but does not tend to make things worse, just
> > > > > > > not really any better either). Because gcc's vectorization also did
> > > > > > > not really help gcc in these cases, I'm not surprised. A good
> > > > > > > collection of these is available here:
> > > > > > > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz
> > > > > > 
> > > > > > > I've yet to run the test suite using the pass to validate it. That
> > > > > > > is something that I plan to do. Actually, the "Livermore Loops" test
> > > > > > > in the aforementioned archive contains checksums to validate the
> > > > > > > results, and it looks like 1 or 2 of the loop results are wrong with
> > > > > > > vectorization turned on, so I'll have to investigate that.
> > > > > > 
> > > > > >  -Hal
> > > > > > 
> > > > > > > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote:
> > > > > > > > Hi Hal,
> > > > > > > > 
> > > > > > > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > > > > > > > > I've attached an initial version of a basic-block autovectorization
> > > > > > > > > pass. It works by searching a basic block for pairable (independent)
> > > > > > > > > instructions, and, using a chain-seeking heuristic, selects pairings
> > > > > > > > > likely to provide an overall speedup (if such pairings can be found).
> > > > > > > > > The selected pairs are then fused and, if necessary, other
> > > > > > > > > instructions are moved in order to maintain data-flow consistency.
> > > > > > > > > This works only within one basic block, but can do loop vectorization
> > > > > > > > > in combination with (partial) unrolling. The basic idea was inspired
> > > > > > > > > by the Vienna MAP Vectorizer, which has been used to vectorize FFT
> > > > > > > > > kernels, but the algorithm used here is different.
> > > > > > > >
> > > > > > > > > To try it, use -bb-vectorize with opt. There are a few options:
> > > > > > > > > -bb-vectorize-req-chain-depth: default: 3 -- The depth of the chain
> > > > > > > > > of instruction pairs necessary in order to consider the pairs that
> > > > > > > > > compose the chain worthy of vectorization.
> > > > > > > > > -bb-vectorize-vector-bits: default: 128 -- The size of the target
> > > > > > > > > vector registers
> > > > > > > > > -bb-vectorize-no-ints -- Don't consider integer instructions
> > > > > > > > > -bb-vectorize-no-floats -- Don't consider floating-point instructions
> > > > > > > >
> > > > > > > > > The vectorizer generates a lot of insert_element/extract_element
> > > > > > > > > pairs; the assumption is that other passes will turn these into
> > > > > > > > > shuffles when possible (it looks like some work is necessary here).
> > > > > > > > > It will also vectorize vector instructions, and generates shuffles in
> > > > > > > > > this case (again, other passes should combine these as appropriate).
> > > > > > > >
> > > > > > > > > Currently, it does not fuse load or store instructions, but that is a
> > > > > > > > > feature that I'd like to add. Of course, alignment information is an
> > > > > > > > > issue for load/store vectorization (or maybe I should just fuse them
> > > > > > > > > anyway and let isel deal with unaligned cases?).
> > > > > > > >
> > > > > > > > > Also, support needs to be added for fusing known intrinsics (fma,
> > > > > > > > > etc.), and, as has been discussed on llvmdev, we should add some
> > > > > > > > > intrinsics to allow the generation of addsub-type instructions.
> > > > > > > >
> > > > > > > > > I've included a few tests, but it needs more. Please review (I'll
> > > > > > > > > commit if and when everyone is happy).
> > > > > > > >
> > > > > > > > Thanks in advance,
> > > > > > > > Hal
> > > > > > > >
> > > > > > > > > P.S. There is another option (not so useful right now, but could be):
> > > > > > > > > -bb-vectorize-fast-dep -- Don't do a full inter-instruction
> > > > > > > > > dependency analysis; instead stop looking for instruction pairs after
> > > > > > > > > the first use of an instruction's value. [This makes the pass faster,
> > > > > > > > > but would require a data-dependence-based reordering pass in order to
> > > > > > > > > be effective].
> > > > > > > 
> > > > > > > Cool! :)
> > > > > > > > Have you run this pass with any benchmark or the llvm testsuite?
> > > > > > > > Does it present any regressions?
> > > > > > > > Do you have any performance results?
> > > > > > > Cheers,
> > > > > > > 
> > > > > > 
> > > > > > --
> > > > > > Hal Finkel
> > > > > > Postdoctoral Appointee
> > > > > > Leadership Computing Facility
> > > > > > Argonne National Laboratory
> > > > > > 
> > > > > > _______________________________________________
> > > > > > llvm-commits mailing list
> > > > > > llvm-commits at cs.uiuc.edu
> > > > > >
http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > > > > 
> > > > 
> > > > _______________________________________________
> > > > llvm-commits mailing list
> > > > llvm-commits at cs.uiuc.edu
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > 
> > 
> 
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm_bb_vectorize-20111031-2.diff
Type: text/x-patch
Size: 77125 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111031/d020bfe3/attachment.bin>

Hal Finkel

2011-Nov-08 10:45 UTC


[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

I've attached the latest version of my autovectorization patch.

Working through the test suite has proved to be a productive
experience ;) -- And almost all of the bugs that it revealed have now
been fixed. There are still two programs that don't compile with
vectorization turned on, and I'm working on those now, but in case
anyone feels like playing with vectorization, this patch will probably
work for you.

The largest three performance speedups are:
SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup

The largest three performance slowdowns are:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
114% slowdown
MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
slowdown
SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown

(from these, I've excluded tests that took less than 0.1 seconds to
run).

Largest three compile-time slowdowns:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
1276% slowdown
SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown

Not everything slows down, MultiSource/Benchmarks/Prolangs-C++/city/city,
for example, compiles 10% faster with vectorization enabled; but, for
the most part, things certainly take longer to compile with
vectorization enabled. The average slowdown over all tests was 29%,
the median was 11%. On the other hand, the average speedup over all
tests was 5.2%, the median was 1.3%.

Compared to previous patches, which had a minimum required chain length
of 3 or 4, I've now made the default 6. While using a chain length of 4
worked well for targeted benchmarks, it caused an overall slowdown on
almost all test-suite programs. Using a minimum length of 6 causes, on
average, a speedup; so I think that is a better default choice.

 -Hal

On Tue, 2011-11-01 at 18:54 -0500, Hal Finkel wrote:
> On Tue, 2011-11-01 at 16:59 -0500, Hal Finkel wrote:
> > On Tue, 2011-11-01 at 19:19 +0000, Tobias Grosser wrote:
> > > On 11/01/2011 06:32 PM, Hal Finkel wrote:
> > > > Any objections to me committing this? [And some relevant docs
> > > > changes] I think that it is ready at this point.
> > > 
> > > First of all, I think it is great to see work starting on an
> > > autovectorizer for LLVM. Unfortunately I did not have time to test
> > > your vectorizer pass intensively, but here are my first comments:
> > > 
> > > 1. This patch breaks the --enable-shared/BUILD_SHARED_LIBS build. The
> > >     following patch fixes this for cmake:
> > >     0001-Add-vectorizer-to-libraries-used-by-Transforms-IPO.patch
> > > 
> > 
> > Thanks!
> > 
> > >     Can you check the autoconf build with --enable-shared?
> > 
> > I will check.
> 
> This appears to work as it should.
> 
> > 
> > > 
> > > 2. Did you run this pass on the llvm test-suite? Does your vectorizer
> > >     introduce any correctness regressions? What are the top 10 compile
> > >     time increases/decreases? How about run time?
> > > 
> > 
> > I'll try to get this setup and post the results.
> > 
> > > 3. I did not really test this intensively, but I had the feeling the
> > >     compile time increase for large basic blocks is quite a lot.
> > >     I still need to extract a test case. Any comments on the complexity
> > >     of your vectorizer?
> > 
> > This may very well be true. As is, I would not recommend activating
> > this pass by default (at -O3) because it is fairly slow and the
> > resulting performance increase, while significant in many cases, is
> > not large enough to, IMHO, justify the extra base compile-time
> > increase. Ideally, this kind of vectorization should be the
> > "vectorizer of last resort" -- the pass that tries really hard to
> > squeeze the last little bit of vectorization possible out of the
> > code. At the moment, it is all that we have, but I hope that will
> > change. I've not yet done any real profiling, so I'll hold off on
> > commenting about future performance improvements.
> > 
> > Base complexity is a bit difficult; there are certainly a few stages,
> > including that initial one, that are O(n^2), where n is the number of
> > instructions in the block. The "connection-finding" stage should also
> > be O(n^2) in practice, but is really iterating over instruction-user
> > pairs and so could be worse in pathological cases. Note, however, that
> > in the latter stages, that n^2 is not the number of instructions in
> > the block, but rather the number of (unordered) candidate instruction
> > pairs (which is going to be much less than the n^2 from just the
> > number of instructions in the block). It should be possible to
> > generate a compile-time scaling plot by taking a loop and compiling it
> > with partial unrolling, looking at how the compile time changes with
> > the unrolling limit; I'll try and do that.
> 
> So for this test, I ran:
> time opt -S -O3 -unroll-allow-partial -vectorize -o /dev/null q.ll
> where q.ll contains the output from clang -O3 of the vbor function from
> the benchmarks I've been posting recently. The first column is the value
> of -unroll-threshold, the second column is the time with vectorization,
> and the third column is the time without vectorization (time in seconds
> for a release build).
> 
> 100    0.030  0.000
> 200    0.130  0.030
> 300    0.770  0.030
> 400    1.240  0.040
> 500    1.280  0.050
> 600    9.450  0.060
> 700   29.300  0.060
> 
> I am not sure why the 400 and 500 times are so close. Obviously, it is
> not linear ;) I am not sure that enumerating the possible pairings can
> be done in a sub-quadratic way, but I will do some profiling and see if
> I can make things better. To be fair, this test creates a kind of
> worst-case scenario: an increasingly large block of instructions, almost
> all of which are potentially fusable.
> 
> It may also be possible to design additional heuristics to help the
> situation. For example, we might introduce a target chain length such
> that if the vectorizer finds a chain of a given length, it selects it,
> foregoing the remainder of the search for the selected starting
> instruction. This kind of thing will require further research and
> testing.
> 
>  -Hal
> 
> > 
> > I'm writing a paper on the vectorizer, so within a few weeks there
> > will be a very good description (complete with diagrams) :)
> > 
> > > 
> > > I plan to look into your vectorizer during the next couple of
> > > days/weeks, but will most probably not have the time to do this
> > > tonight. Sorry. :-(
> > 
> > Not a problem; it seems that I have some homework to do first ;)
> > 
> > Thanks,
> > Hal
> > 
> > > 
> > > Cheers
> > > Tobi
> > 
> 
-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm_bb_vectorize-20111107.diff
Type: text/x-patch
Size: 79455 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111108/3b39956c/attachment.bin>

Tobias Grosser

2011-Nov-08 11:12 UTC


[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass

On 11/08/2011 11:45 AM, Hal Finkel wrote:
> I've attached the latest version of my autovectorization patch.
>
> Working through the test suite has proved to be a productive
> experience ;) -- And almost all of the bugs that it revealed have now
> been fixed. There are still two programs that don't compile with
> vectorization turned on, and I'm working on those now, but in case
> anyone feels like playing with vectorization, this patch will probably
> work for you.
Hey Hal,

that is great news, especially as the numbers seem to show that
vectorization has a significant performance impact. What exactly did
you compare? 'clang -O3' against 'clang -O3 -mllvm -vectorize'?
> The largest three performance speedups are:
> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
>
> The largest three performance slowdowns are:
> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> 114% slowdown
> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
> slowdown
> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown
>
Interesting. Do you understand what causes these slowdowns? Can your
heuristic be improved?
> Largest three compile-time slowdowns:
> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> 1276% slowdown
> SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
> MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown
Yes, that is a lot. Do you understand if this time is invested well
(does it give significant speedups)?

If I understood correctly, it seems your vectorizer has quadratic
complexity, which may cause large slowdowns. Do you think it may be
useful/possible to make it linear by introducing a constant upper bound
somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and
most of the vectorization opportunities are close by (in some sense),
such that we get most of the speedup by looking at a subset of the problem.
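
To illustrate the bound suggested above, here is a minimal sketch in C.
It is not taken from the patch: the function names and the notion of a
distance "window" between two instructions are assumptions made only for
illustration. It counts how many candidate pairs (i, j) with i < j are
examined among n instructions, first without a limit and then when only
the next `window` instructions are considered.

```c
/* Hypothetical model: instructions are numbered 0..n-1 in block order. */

/* Every later instruction is a candidate partner: n*(n-1)/2 pairs. */
unsigned long count_pairs_all(unsigned long n) {
    unsigned long count = 0;
    for (unsigned long i = 0; i < n; ++i)
        for (unsigned long j = i + 1; j < n; ++j)
            ++count;
    return count;
}

/* Only partners at most `window` positions later are considered, so the
   inner loop runs at most `window` times and the total is at most
   n * window, i.e. linear in n for a constant window. */
unsigned long count_pairs_windowed(unsigned long n, unsigned long window) {
    unsigned long count = 0;
    for (unsigned long i = 0; i < n; ++i)
        for (unsigned long j = i + 1; j < n && j - i <= window; ++j)
            ++count;
    return count;
}
```

For 1000 instructions, count_pairs_all(1000) examines 499500 pairs, while
count_pairs_windowed(1000, 10) examines 9945. Whether such a limit would
lose profitable pairings is exactly the open question.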
> Not everything slows down, MultiSource/Benchmarks/Prolangs-C++/city/city,
> for example, compiles 10% faster with vectorization enabled; but, for
> the most part, things certainly take longer to compile with
> vectorization enabled. The average slowdown over all tests was 29%,
> the median was 11%. On the other hand, the average speedup over all
> tests was 5.2%, the median was 1.3%.
Nice. I think this is a great start.
> Compared to previous patches, which had a minimum required chain length
> of 3 or 4, I've now made the default 6. While using a chain length of 4
> worked well for targeted benchmarks, it caused an overall slowdown on
> almost all test-suite programs. Using a minimum length of 6 causes, on
> average, a speedup; so I think that is a better default choice.
I am also trying to understand whether it is possible to use your
vectorizer for Polly. My idea is to do some clever loop unrolling.

Starting from this loop:

for (int i = 0; i < 4; i++) {
    A[i] += 1;
    A[i] = B[i] + 3;
    C[i] = A[i];
}

The classical unroller would create this code:

    A[0] += 1;
    A[0] = B[0] + 3;
    C[0] = A[0];

    A[1] += 1;
    A[1] = B[1] + 3;
    C[1] = A[1];

    A[2] += 1;
    A[2] = B[2] + 3;
    C[2] = A[2];

    A[3] += 1;
    A[3] = B[3] + 3;
    C[3] = A[3];

However, in case I can prove this loop is parallel, I want to create 
this code:

    A[0] += 1;
    A[1] += 1;
    A[2] += 1;
    A[3] += 1;

    A[0] = B[0] + 3;
    A[1] = B[1] + 3;
    A[2] = B[2] + 3;
    A[3] = B[3] + 3;

    C[0] = A[0];
    C[1] = A[1];
    C[2] = A[2];
    C[3] = A[3];

I assume this will allow the vectorization of test cases that failed
because of possible aliasing. However, I am more interested in whether
the execution order change could also improve the vectorization outcome
or reduce the compile time overhead of your vectorizer.
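
As a quick sanity check of the reordering, the following C sketch runs
the loop body in both orders and compares the results. The function
names are hypothetical, and the sketch assumes that A, B and C are
distinct, non-overlapping arrays; if they could alias, the two orders
would not be guaranteed to agree.

```c
#include <string.h>

enum { N = 4 };

/* Iteration order: all three statements for i = 0, then i = 1, ...
   This is what the classical unroller produces. */
void run_iteration_order(int A[N], const int B[N], int C[N]) {
    for (int i = 0; i < N; ++i) {
        A[i] += 1;
        A[i] = B[i] + 3;
        C[i] = A[i];
    }
}

/* Statement order: each statement is executed for all i before the next
   statement starts, so identical operations sit next to each other. */
void run_statement_order(int A[N], const int B[N], int C[N]) {
    for (int i = 0; i < N; ++i) A[i] += 1;
    for (int i = 0; i < N; ++i) A[i] = B[i] + 3;
    for (int i = 0; i < N; ++i) C[i] = A[i];
}

/* Returns 1 if both orders leave A and C in the same final state. */
int orders_agree(const int A0[N], const int B[N]) {
    int A1[N], A2[N], C1[N], C2[N];
    memcpy(A1, A0, sizeof A1);
    memcpy(A2, A0, sizeof A2);
    run_iteration_order(A1, B, C1);
    run_statement_order(A2, B, C2);
    return memcmp(A1, A2, sizeof A1) == 0 &&
           memcmp(C1, C2, sizeof C1) == 0;
}
```

Both orders agree here because iteration i only touches element i of
each array, which is the property a parallelism proof would establish.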

Thanks for working on the vectorization
Cheers

Tobi
