Hal Finkel
2011-Oct-29 20:16 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Sat, 2011-10-29 at 14:02 -0500, Hal Finkel wrote:
> On Sat, 2011-10-29 at 12:30 -0500, Hal Finkel wrote:
> > Ralf, et al.,
> >
> > Attached is the latest version of my autovectorization patch. llvmdev
> > has been CC'd (as had been suggested to me); this e-mail contains
> > additional benchmark results.
> >
> > First, these are preliminary results because I did not do the things
> > necessary to make them real (explicitly quiet the machine, bind the
> > processes to one cpu, etc.). But they should be good enough for
> > discussion.
> >
> > I'm using LLVM head r143101, with the attached patch applied, and clang
> > head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc
> > comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc was run -O3
> > without any other optimization flags. opt was run -vectorize
> > -unroll-allow-partial -O3 with no other optimization flags (the patch
> > adds the -vectorize option).
>
> And opt had also been given the flag: -bb-vectorize-vector-bits=256

And this was a mistake (because the machine on which the benchmarks were
run does not have AVX). I've rerun; see the better results below...

>
> -Hal
>
> > llc was just given -O3.
> >
> > Below I've included results using the benchmark program by Maleki, et
> > al. See:
> > An Evaluation of Vectorizing Compilers - PACT'11
> > (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
> > their benchmark program was retrieved from:
> > http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz
> >
> > Also, when using clang, I had to pass -Dinline= on the command line:
> > when using -emit-llvm, clang appears not to emit code for functions
> > declared inline. This is a bug, but I've not yet tracked it down. There
> > are two such small functions in the benchmark program, and the regular
> > inliner *should* catch them anyway.
> >
> > Results:
> > 0. Name of the loop
> > 1. Time using LLVM with vectorization
> > 2. Time using LLVM without vectorization
> > 3.
> > Time using gcc with vectorization
> > 4. Time using gcc without vectorization

Here are improved results where the correct (and default)
vector-register size was used.

Loop llvm-v llvm gcc-v gcc
-------------------------------------------
S000 9.09 9.49 4.55 10.04
S111 7.28 7.37 7.68 7.83
S1111 13.78 14.48 16.14 16.30
S112 16.67 17.41 16.54 17.52
S1112 13.12 14.21 14.83 14.84
S113 22.12 22.88 22.05 22.05
S1113 11.06 11.42 11.03 11.01
S114 13.23 13.75 13.53 13.48
S115 32.76 33.24 49.98 49.99
S1115 13.68 14.18 13.65 13.66
S116 47.42 49.40 49.54 48.11
S118 10.84 11.26 10.79 10.50
S119 8.74 9.07 11.83 11.82
S1119 8.81 9.14 4.31 11.87
S121 17.28 18.78 14.84 17.31
S122 7.53 7.54 6.11 6.11
S123 6.90 7.38 7.42 7.41
S124 9.60 9.77 9.42 9.33
S125 6.92 7.22 4.67 7.81
S126 2.34 2.53 2.57 2.37
S127 12.19 12.97 7.06 14.50
S128 11.74 12.43 12.42 11.52
S131 28.75 29.91 25.17 28.94
S132 17.04 17.04 15.53 21.03
S141 12.28 12.26 12.38 12.05
S151 28.80 29.43 24.89 28.95
S152 15.54 16.03 11.19 15.63
S161 6.00 6.06 5.52 5.46
S1161 14.39 14.40 8.80 8.79
S162 8.19 9.05 5.36 8.18
S171 15.41 7.94 2.81 5.70
S172 5.71 5.89 2.75 5.70
S173 30.31 30.92 18.15 30.13
S174 30.18 31.66 18.51 30.16
S175 5.78 6.18 4.94 5.77
S176 5.59 5.83 4.41 7.65
S211 16.27 17.14 16.82 16.38
S212 13.21 14.28 13.34 13.18
S1213 12.81 13.46 12.80 12.43
S221 10.86 11.09 8.65 8.63
S1221 5.72 6.04 5.40 6.05
S222 6.02 6.26 5.70 5.72
S231 22.33 22.94 22.36 22.11
S232 6.88 6.88 6.89 6.89
S1232 15.30 15.34 15.05 15.10
S233 55.38 58.55 54.21 49.56
S2233 27.08 29.77 29.68 28.40
S235 44.00 44.92 46.94 43.93
S241 31.09 31.35 32.53 31.01
S242 7.19 7.20 7.20 7.20
S243 16.52 17.09 17.69 16.84
S244 14.45 14.83 16.91 16.82
S1244 14.71 14.83 14.77 14.40
S2244 10.04 10.62 10.40 10.06
S251 34.15 35.75 19.70 34.38
S1251 55.23 57.84 41.77 56.11
S2251 15.73 15.87 17.02 15.70
S3251 15.66 16.21 19.60 15.34
S252 6.18 6.32 7.72 7.26
S253 11.14 11.38 14.40 14.40
S254 18.41 18.70 28.23 28.06
S255 5.93 6.09 9.96 9.95
S256 3.08 3.42 3.10 3.09
S257 2.13 2.25 2.21 2.20
S258 1.79 1.82 1.84 1.84
S261 12.00 12.08 10.98 10.95
S271 32.82 33.04 33.25 33.01
S272 14.98 15.82 15.39 15.26
S273 13.92 14.04 16.86 16.80
S274 17.83 18.31 18.15 17.89
S275 2.92 3.02 3.36 2.98
S2275 32.80 33.50 8.97 33.60
S276 39.43 39.44 40.80 40.55
S277 4.80 4.80 4.81 4.80
S278 14.41 14.42 14.70 14.66
S279 8.03 8.29 7.25 7.27
S1279 9.71 10.06 9.34 9.25
S2710 7.71 8.04 7.86 7.56
S2711 35.53 35.55 36.56 36.00
S2712 32.94 33.17 34.24 33.47
S281 10.79 11.09 12.46 12.02
S1281 79.13 77.55 57.78 68.06
S291 11.80 11.78 14.03 14.03
S292 7.77 7.78 9.94 9.96
S293 15.50 15.87 19.32 19.33
S2101 2.56 2.58 2.59 2.60
S2102 16.71 17.53 16.68 16.75
S2111 5.60 5.60 5.85 5.85
S311 72.03 72.03 72.23 72.03
S31111 7.49 6.00 6.00 6.00
S312 96.04 96.04 96.05 96.03
S313 36.02 36.13 36.03 36.02
S314 36.01 36.07 74.67 72.42
S315 8.96 8.99 9.35 9.30
S316 36.02 36.06 72.08 74.87
S317 444.93 444.94 451.82 451.78
S318 9.05 9.07 7.30 7.30
S319 34.54 36.53 34.42 34.19
S3110 8.51 8.57 4.11 4.11
S13110 5.75 5.77 12.12 12.12
S3111 3.60 3.62 3.60 3.60
S3112 7.19 7.30 7.21 7.20
S3113 35.13 35.47 60.21 60.20
S321 16.79 16.81 16.80 16.80
S322 12.42 12.60 12.60 12.60
S323 10.86 11.02 8.48 8.51
S331 4.23 4.23 7.20 7.20
S332 7.20 7.21 5.21 5.31
S341 4.79 4.85 7.23 7.20
S342 6.01 6.09 7.25 7.20
S343 2.04 2.06 2.16 2.01
S351 46.61 47.34 21.82 46.46
S1351 49.28 50.35 33.68 49.06
S352 57.65 58.04 57.68 57.64
S353 8.21 8.38 8.34 8.19
S421 42.94 43.34 20.62 22.46
S1421 25.15 25.81 15.85 24.76
S422 87.39 87.53 79.22 78.99
S423 155.01 155.29 154.56 154.38
S424 36.51 37.51 11.42 22.36
S431 57.10 60.66 27.59 57.16
S441 14.04 13.29 12.88 12.81
S442 6.00 6.00 6.96 6.90
S443 17.28 17.77 17.15 16.95
S451 48.92 49.08 49.03 49.14
S452 42.98 39.32 14.64 96.03
S453 28.05 28.06 14.60 14.40
S471 8.24 8.65 8.39 8.43
S481 10.88 11.15 12.04 12.00
S482 9.21 9.31 9.19 9.17
S491 11.26 11.38 11.37 11.28
S4112 8.21 8.36 9.13 8.94
S4113 8.65 8.81 8.86 8.85
S4114 11.82 12.15 12.18 11.77
S4115 8.28 8.46 8.95 8.59
S4116 3.22 3.23 6.02 5.94
S4117 13.95 9.61 10.16 9.98
S4121 8.21 8.26 4.04 8.17
va 28.46 28.58 23.58 48.46
vag 12.35 12.36 13.58 13.20
vas 13.45 13.49 13.03 12.47
vif 4.55 4.57 5.06 4.92
vpv 57.08 57.22 28.28 57.24
vtv 57.81 57.83 28.40 57.63
vpvtv 32.82 32.84 16.35 32.73
vpvts 5.82 5.83 2.99 6.38
vpvpv 32.87 32.89 16.54 32.85
vtvtv 32.82 32.80 16.84 35.97
vsumr 72.04 72.03 72.20 72.04
vdotr 72.06 72.05 72.42 72.04
vbor 205.24 380.81 99.80 372.05

-Hal

> >
> > Loop llvm-v llvm gcc-v gcc
> > -------------------------------------------
> > S000 9.59 9.49 4.55 10.04
> > S111 7.67 7.37 7.68 7.83
> > S1111 13.98 14.48 16.14 16.30
> > S112 17.43 17.41 16.54 17.52
> > S1112 13.87 14.21 14.83 14.84
> > S113 22.97 22.88 22.05 22.05
> > S1113 11.46 11.42 11.03 11.01
> > S114 13.47 13.75 13.53 13.48
> > S115 33.06 33.24 49.98 49.99
> > S1115 13.91 14.18 13.65 13.66
> > S116 48.74 49.40 49.54 48.11
> > S118 11.04 11.26 10.79 10.50
> > S119 8.97 9.07 11.83 11.82
> > S1119 9.04 9.14 4.31 11.87
> > S121 18.06 18.78 14.84 17.31
> > S122 7.58 7.54 6.11 6.11
> > S123 7.02 7.38 7.42 7.41
> > S124 9.62 9.77 9.42 9.33
> > S125 7.14 7.22 4.67 7.81
> > S126 2.32 2.53 2.57 2.37
> > S127 12.87 12.97 7.06 14.50
> > S128 12.58 12.43 12.42 11.52
> > S131 29.91 29.91 25.17 28.94
> > S132 17.04 17.04 15.53 21.03
> > S141 12.59 12.26 12.38 12.05
> > S151 28.92 29.43 24.89 28.95
> > S152 15.68 16.03 11.19 15.63
> > S161 6.06 6.06 5.52 5.46
> > S1161 14.46 14.40 8.80 8.79
> > S162 8.31 9.05 5.36 8.18
> > S171 15.47 7.94 2.81 5.70
> > S172 5.92 5.89 2.75 5.70
> > S173 31.59 30.92 18.15 30.13
> > S174 31.16 31.66 18.51 30.16
> > S175 5.80 6.18 4.94 5.77
> > S176 5.69 5.83 4.41 7.65
> > S211 16.56 17.14 16.82 16.38
> > S212 13.46 14.28 13.34 13.18
> > S1213 13.12 13.46 12.80 12.43
> > S221 10.88 11.09 8.65 8.63
> > S1221 5.80 6.04 5.40 6.05
> > S222 6.01 6.26 5.70 5.72
> > S231 23.78 22.94 22.36 22.11
> > S232 6.88 6.88 6.89 6.89
> > S1232 16.00 15.34 15.05 15.10
> > S233 57.48 58.55 54.21 49.56
> > S2233 27.65 29.77 29.68 28.40
> > S235 46.40 44.92 46.94 43.93
> > S241 31.62 31.35 32.53 31.01
> > S242 7.20 7.20 7.20 7.20
> > S243 16.78 17.09 17.69 16.84
> > S244 14.64 14.83 16.91 16.82
> > S1244 14.98 14.83 14.77 14.40
> > S2244 10.47 10.62 10.40 10.06
> > S251 35.10 35.75 19.70 34.38
> > S1251 56.65 57.84 41.77 56.11
> > S2251 15.96 15.87 17.02 15.70
> > S3251 16.41 16.21 19.60 15.34
> > S252 7.24 6.32 7.72 7.26
> > S253 12.55 11.38 14.40 14.40
> > S254 19.08 18.70 28.23 28.06
> > S255 5.94 6.09 9.96 9.95
> > S256 3.14 3.42 3.10 3.09
> > S257 2.18 2.25 2.21 2.20
> > S258 1.80 1.82 1.84 1.84
> > S261 12.00 12.08 10.98 10.95
> > S271 32.93 33.04 33.25 33.01
> > S272 15.48 15.82 15.39 15.26
> > S273 13.99 14.04 16.86 16.80
> > S274 18.38 18.31 18.15 17.89
> > S275 3.02 3.02 3.36 2.98
> > S2275 33.71 33.50 8.97 33.60
> > S276 39.52 39.44 40.80 40.55
> > S277 4.81 4.80 4.81 4.80
> > S278 14.43 14.42 14.70 14.66
> > S279 8.10 8.29 7.25 7.27
> > S1279 9.77 10.06 9.34 9.25
> > S2710 7.85 8.04 7.86 7.56
> > S2711 35.54 35.55 36.56 36.00
> > S2712 33.16 33.17 34.24 33.47
> > S281 10.97 11.09 12.46 12.02
> > S1281 79.37 77.55 57.78 68.06
> > S291 11.94 11.78 14.03 14.03
> > S292 7.88 7.78 9.94 9.96
> > S293 15.90 15.87 19.32 19.33
> > S2101 2.59 2.58 2.59 2.60
> > S2102 17.63 17.53 16.68 16.75
> > S2111 5.63 5.60 5.85 5.85
> > S311 72.07 72.03 72.23 72.03
> > S31111 7.49 6.00 6.00 6.00
> > S312 96.06 96.04 96.05 96.03
> > S313 36.50 36.13 36.03 36.02
> > S314 36.10 36.07 74.67 72.42
> > S315 9.00 8.99 9.35 9.30
> > S316 36.11 36.06 72.08 74.87
> > S317 444.92 444.94 451.82 451.78
> > S318 9.04 9.07 7.30 7.30
> > S319 34.76 36.53 34.42 34.19
> > S3110 8.53 8.57 4.11 4.11
> > S13110 5.76 5.77 12.12 12.12
> > S3111 3.60 3.62 3.60 3.60
> > S3112 7.20 7.30 7.21 7.20
> > S3113 35.12 35.47 60.21 60.20
> > S321 16.81 16.81 16.80 16.80
> > S322 12.42 12.60 12.60 12.60
> > S323 10.93 11.02 8.48 8.51
> > S331 4.23 4.23 7.20 7.20
> > S332 7.21 7.21 5.21 5.31
> > S341 4.74 4.85 7.23 7.20
> > S342 6.02 6.09 7.25 7.20
> > S343 2.14 2.06 2.16 2.01
> > S351 49.26 47.34 21.82 46.46
> > S1351 50.85 50.35 33.68 49.06
> > S352 58.14 58.04 57.68 57.64
> > S353 8.35 8.38 8.34 8.19
> > S421 43.13 43.34 20.62 22.46
> > S1421 25.25 25.81 15.85 24.76
> > S422 88.36 87.53 79.22 78.99
> > S423 155.13 155.29 154.56 154.38
> > S424 37.11 37.51 11.42 22.36
> > S431 58.22 60.66 27.59 57.16
> > S441 14.05 13.29 12.88 12.81
> > S442 6.08 6.00 6.96 6.90
> > S443 17.60 17.77 17.15 16.95
> > S451 48.95 49.08 49.03 49.14
> > S452 42.98 39.32 14.64 96.03
> > S453 28.06 28.06 14.60 14.40
> > S471 8.53 8.65 8.39 8.43
> > S481 10.98 11.15 12.04 12.00
> > S482 9.31 9.31 9.19 9.17
> > S491 11.54 11.38 11.37 11.28
> > S4112 8.21 8.36 9.13 8.94
> > S4113 8.77 8.81 8.86 8.85
> > S4114 12.32 12.15 12.18 11.77
> > S4115 8.48 8.46 8.95 8.59
> > S4116 3.21 3.23 6.02 5.94
> > S4117 14.08 9.61 10.16 9.98
> > S4121 8.53 8.26 4.04 8.17
> > va 30.09 28.58 23.58 48.46
> > vag 12.35 12.36 13.58 13.20
> > vas 13.74 13.49 13.03 12.47
> > vif 4.49 4.57 5.06 4.92
> > vpv 58.59 57.22 28.28 57.24
> > vtv 59.15 57.83 28.40 57.63
> > vpvtv 33.18 32.84 16.35 32.73
> > vpvts 5.99 5.83 2.99 6.38
> > vpvpv 33.25 32.89 16.54 32.85
> > vtvtv 32.83 32.80 16.84 35.97
> > vsumr 72.03 72.03 72.20 72.04
> > vdotr 72.05 72.05 72.42 72.04
> > vbor 205.22 380.81 99.80 372.05
> >
> > I've yet to go through these in detail (they just finished running 5
> > minutes ago). But for the curious (and I've had several requests for
> > benchmarks), here you go. There is obviously more work to do.
> >
> > -Hal
> >
> > On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote:
> > > Hi Hal,
> > >
> > > those numbers look very promising, great work! :)
> > >
> > > Best,
> > > Ralf
> > >
> > > ----- Original Message -----
> > > > From: "Hal Finkel" <hfinkel at anl.gov>
> > > > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com>
> > > > Cc: llvm-commits at cs.uiuc.edu
> > > > Sent: Friday, 28 October 2011 13:50:00
> > > > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
> > > >
> > > > Bruno, et al.,
> > > >
> > > > I've attached a new version of the patch that contains improvements
> > > > (and a critical bug fix [the code output is no more correct, but
> > > > the pass in the older patch would crash in certain cases and now
> > > > does not]) compared to previous versions that I've posted.
> > > >
> > > > First, these are preliminary results because I did not do the
> > > > things necessary to make them real (explicitly quiet the machine,
> > > > bind the processes to one cpu, etc.). But they should be good
> > > > enough for discussion.
> > > >
> > > > I'm using LLVM head r143101, with the attached patch applied, and
> > > > clang head r143100 on an x86_64 machine (some kind of Intel Xeon).
> > > > For the gcc comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc
> > > > was run -O3 without any other optimization flags. opt was run
> > > > -vectorize -unroll-allow-partial -O3 with no other optimization
> > > > flags (the patch adds the -vectorize option). llc was just given
> > > > -O3.
> > > >
> > > > It is not difficult to construct an example in which vectorization
> > > > would be useful: take a loop that does more computation than
> > > > loads/stores, and (partially) unroll it. Here is a simple case:
> > > >
> > > > #define ITER 5000
> > > > #define NUM 200
> > > > double a[NUM][NUM];
> > > > double b[NUM][NUM];
> > > >
> > > > ...
> > > >
> > > > int main()
> > > > {
> > > >   ...
> > > >
> > > >   for (int i = 0; i < ITER; ++i) {
> > > >     for (int x = 0; x < NUM; ++x)
> > > >       for (int y = 0; y < NUM; ++y) {
> > > >         double v = a[x][y], w = b[x][y];
> > > >         double z1 = v*w;
> > > >         double z2 = v+w;
> > > >         double z3 = z1*z2;
> > > >         double z4 = z3+v;
> > > >         double z5 = z2+w;
> > > >         double z6 = z4*z5;
> > > >         double z7 = z4+z5;
> > > >         a[x][y] = v*v-z6;
> > > >         b[x][y] = w-z7;
> > > >       }
> > > >   }
> > > >
> > > >   ...
> > > >
> > > >   return 0;
> > > > }
> > > >
> > > > Results:
> > > > gcc -O3: 0m1.790s
> > > > llvm -vectorize: 0m2.360s
> > > > llvm: 0m2.780s
> > > > gcc -fno-tree-vectorize: 0m2.810s
> > > > (these are the user times after I've run enough for the times to
> > > > settle to three decimal places)
> > > >
> > > > So the vectorization gives a ~15% improvement in the running time.
> > > > gcc's vectorization still does a much better job, however
> > > > (yielding an ~36% improvement). So there is still work to do ;)
> > > >
> > > > Additionally, I've checked the autovectorization on some classic
> > > > numerical benchmarks from netlib. On these benchmarks, clang/llvm
> > > > already do a good job compared to gcc (gcc is only about 10%
> > > > better, and this is true regardless of whether gcc's vectorization
> > > > is on or off). For these cases, autovectorization provides an
> > > > insignificant speedup in most cases (but does not tend to make
> > > > things worse, just not really any better either). Because gcc's
> > > > vectorization also did not really help gcc in these cases, I'm not
> > > > surprised. A good collection of these is available here:
> > > > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz
> > > >
> > > > I've yet to run the test suite using the pass to validate it. That
> > > > is something that I plan to do.
> > > > Actually, the "Livermore Loops" test in the aforementioned archive
> > > > contains checksums to validate the results, and it looks like 1 or
> > > > 2 of the loop results are wrong with vectorization turned on, so
> > > > I'll have to investigate that.
> > > >
> > > > -Hal
> > > >
> > > > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote:
> > > > > Hi Hal,
> > > > >
> > > > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov>
> > > > > wrote:
> > > > > > I've attached an initial version of a basic-block
> > > > > > autovectorization pass. It works by searching a basic block
> > > > > > for pairable (independent) instructions, and, using a
> > > > > > chain-seeking heuristic, selects pairings likely to provide an
> > > > > > overall speedup (if such pairings can be found). The selected
> > > > > > pairs are then fused and, if necessary, other instructions are
> > > > > > moved in order to maintain data-flow consistency. This works
> > > > > > only within one basic block, but can do loop vectorization in
> > > > > > combination with (partial) unrolling. The basic idea was
> > > > > > inspired by the Vienna MAP Vectorizer, which has been used to
> > > > > > vectorize FFT kernels, but the algorithm used here is
> > > > > > different.
> > > > > >
> > > > > > To try it, use -bb-vectorize with opt. There are a few options:
> > > > > > -bb-vectorize-req-chain-depth: default: 3 -- The depth of the
> > > > > > chain of instruction pairs necessary in order to consider the
> > > > > > pairs that compose the chain worthy of vectorization.
> > > > > > -bb-vectorize-vector-bits: default: 128 -- The size of the
> > > > > > target vector registers.
> > > > > > -bb-vectorize-no-ints -- Don't consider integer instructions.
> > > > > > -bb-vectorize-no-floats -- Don't consider floating-point
> > > > > > instructions.
> > > > > >
> > > > > > The vectorizer generates a lot of
> > > > > > insert_element/extract_element pairs; the assumption is that
> > > > > > other passes will turn these into shuffles when possible (it
> > > > > > looks like some work is necessary here). It will also
> > > > > > vectorize vector instructions, and generates shuffles in this
> > > > > > case (again, other passes should combine these as appropriate).
> > > > > >
> > > > > > Currently, it does not fuse load or store instructions, but
> > > > > > that is a feature that I'd like to add. Of course, alignment
> > > > > > information is an issue for load/store vectorization (or maybe
> > > > > > I should just fuse them anyway and let isel deal with
> > > > > > unaligned cases?).
> > > > > >
> > > > > > Also, support needs to be added for fusing known intrinsics
> > > > > > (fma, etc.), and, as has been discussed on llvmdev, we should
> > > > > > add some intrinsics to allow the generation of addsub-type
> > > > > > instructions.
> > > > > >
> > > > > > I've included a few tests, but it needs more. Please review
> > > > > > (I'll commit if and when everyone is happy).
> > > > > >
> > > > > > Thanks in advance,
> > > > > > Hal
> > > > > >
> > > > > > P.S. There is another option (not so useful right now, but
> > > > > > could be): -bb-vectorize-fast-dep -- Don't do a full
> > > > > > inter-instruction dependency analysis; instead stop looking
> > > > > > for instruction pairs after the first use of an instruction's
> > > > > > value. [This makes the pass faster, but would require a
> > > > > > data-dependence-based reordering pass in order to be
> > > > > > effective.]
> > > > >
> > > > > Cool! :)
> > > > > Have you run this pass with any benchmark or the llvm test suite?
> > > > > Does it present any regressions?
> > > > > Do you have any performance results?
> > > > > Cheers,
> > > > >
> > > >
> > > > --
> > > > Hal Finkel
> > > > Postdoctoral Appointee
> > > > Leadership Computing Facility
> > > > Argonne National Laboratory
> > > >
> > > > _______________________________________________
> > > > llvm-commits mailing list
> > > > llvm-commits at cs.uiuc.edu
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> >
> > _______________________________________________
> > llvm-commits mailing list
> > llvm-commits at cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
Hal Finkel
2011-Oct-29 22:56 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On Sat, 2011-10-29 at 15:16 -0500, Hal Finkel wrote:
> On Sat, 2011-10-29 at 14:02 -0500, Hal Finkel wrote:
> > On Sat, 2011-10-29 at 12:30 -0500, Hal Finkel wrote:
> > > Ralf, et al.,
> > >
> > > Attached is the latest version of my autovectorization patch. llvmdev
> > > has been CC'd (as had been suggested to me); this e-mail contains
> > > additional benchmark results.
> > >
> > > First, these are preliminary results because I did not do the things
> > > necessary to make them real (explicitly quiet the machine, bind the
> > > processes to one cpu, etc.). But they should be good enough for
> > > discussion.
> > >
> > > I'm using LLVM head r143101, with the attached patch applied, and
> > > clang head r143100 on an x86_64 machine (some kind of Intel Xeon).
> > > For the gcc comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc
> > > was run -O3 without any other optimization flags. opt was run
> > > -vectorize -unroll-allow-partial -O3 with no other optimization
> > > flags (the patch adds the -vectorize option).
> >
> > And opt had also been given the flag: -bb-vectorize-vector-bits=256
>
> And this was a mistake (because the machine on which the benchmarks were
> run does not have AVX). I've rerun; see the better results below...
>
> >
> > -Hal
> >
> > > llc was just given -O3.
> > >
> > > Below I've included results using the benchmark program by Maleki,
> > > et al. See:
> > > An Evaluation of Vectorizing Compilers - PACT'11
> > > (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
> > > their benchmark program was retrieved from:
> > > http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz
> > >
> > > Also, when using clang, I had to pass -Dinline= on the command line:
> > > when using -emit-llvm, clang appears not to emit code for functions
> > > declared inline. This is a bug, but I've not yet tracked it down.
> > > There are two such small functions in the benchmark program, and the
> > > regular inliner *should* catch them anyway.
> > >
> > > Results:
> > > 0. Name of the loop
> > > 1. Time using LLVM with vectorization
> > > 2. Time using LLVM without vectorization
> > > 3. Time using gcc with vectorization
> > > 4. Time using gcc without vectorization

As Peter Collingbourne indirectly pointed out to me, clang's
optimizations are still important (even if it is only emitting LLVM IR).
I've rerun the llvm code-generation steps, adding -O3 to clang. Here are
the results (they are significantly better):

Loop llvm-v llvm gcc-v gcc
-------------------------------------------
S000 9.10 9.59 4.55 10.04
S111 7.29 7.65 7.68 7.83
S1111 13.87 14.72 16.14 16.30
S112 16.67 17.45 16.54 17.52
S1112 13.16 13.87 14.83 14.84
S113 22.14 22.98 22.05 22.05
S1113 11.06 11.48 11.03 11.01
S114 13.21 13.81 13.53 13.48
S115 32.82 33.36 49.98 49.99
S1115 13.67 14.23 13.65 13.66
S116 47.37 49.43 49.54 48.11
S118 10.81 11.25 10.79 10.50
S119 8.73 9.09 11.83 11.82
S1119 8.82 9.15 4.31 11.87
S121 17.29 18.06 14.84 17.31
S122 7.53 7.70 6.11 6.11
S123 6.93 7.10 7.42 7.41
S124 9.63 9.84 9.42 9.33
S125 6.94 7.10 4.67 7.81
S126 2.34 2.55 2.57 2.37
S127 12.23 12.68 7.06 14.50
S128 11.78 12.41 12.42 11.52
S131 28.79 30.11 25.17 28.94
S132 17.04 17.04 15.53 21.03
S141 12.26 12.85 12.38 12.05
S151 28.79 30.11 24.89 28.95
S152 15.53 16.03 11.19 15.63
S161 6.00 6.12 5.52 5.46
S1161 14.40 14.50 8.80 8.79
S162 8.19 8.41 5.36 8.18
S171 15.41 7.96 2.81 5.70
S172 5.70 5.97 2.75 5.70
S173 30.32 31.69 18.15 30.13
S174 30.20 31.53 18.51 30.16
S175 5.79 6.04 4.94 5.77
S176 5.59 5.83 4.41 7.65
S211 16.31 16.89 16.82 16.38
S212 13.23 13.50 13.34 13.18
S1213 12.82 13.35 12.80 12.43
S221 10.87 11.09 8.65 8.63
S1221 5.72 6.03 5.40 6.05
S222 6.01 6.29 5.70 5.72
S231 22.38 24.22 22.36 22.11
S232 6.89 6.94 6.89 6.89
S1232 15.31 16.43 15.05 15.10
S233 55.47 59.98 54.21 49.56
S2233 27.23 29.71 29.68 28.40
S235 44.08 47.85 46.94 43.93
S241 31.14 31.72 32.53 31.01
S242 7.20 7.21 7.20 7.20
S243 16.54 16.99 17.69 16.84
S244 14.51 14.93 16.91 16.82
S1244 14.72 15.02 14.77 14.40
S2244 10.09 10.60 10.40 10.06
S251 34.42 35.55 19.70 34.38
S1251 55.39 57.11 41.77 56.11
S2251 15.69 16.26 17.02 15.70
S3251 15.69 16.52 19.60 15.34
S252 6.18 6.46 7.72 7.26
S253 11.19 11.52 14.40 14.40
S254 18.00 18.98 28.23 28.06
S255 5.94 6.14 9.96 9.95
S256 3.09 3.39 3.10 3.09
S257 2.13 2.31 2.21 2.20
S258 1.80 1.87 1.84 1.84
S261 12.00 12.22 10.98 10.95
S271 32.81 33.76 33.25 33.01
S272 15.04 15.52 15.39 15.26
S273 13.93 14.10 16.86 16.80
S274 17.83 18.53 18.15 17.89
S275 2.92 3.14 3.36 2.98
S2275 32.81 34.95 8.97 33.60
S276 41.26 41.97 40.80 40.55
S277 4.80 4.93 4.81 4.80
S278 14.43 14.76 14.70 14.66
S279 8.05 8.24 7.25 7.27
S1279 9.72 9.92 9.34 9.25
S2710 7.73 8.07 7.86 7.56
S2711 36.49 37.10 36.56 36.00
S2712 32.96 33.96 34.24 33.47
S281 10.80 11.32 12.46 12.02
S1281 79.10 78.11 57.78 68.06
S291 11.79 12.27 14.03 14.03
S292 6.70 6.91 9.94 9.96
S293 15.50 16.24 19.32 19.33
S2101 2.56 2.67 2.59 2.60
S2102 16.74 18.45 16.68 16.75
S2111 5.59 5.63 5.85 5.85
S311 72.04 72.27 72.23 72.03
S31111 7.50 6.01 6.00 6.00
S312 96.04 96.17 96.05 96.03
S313 36.02 36.61 36.03 36.02
S314 36.01 36.12 74.67 72.42
S315 9.11 9.21 9.35 9.30
S316 36.01 36.12 72.08 74.87
S317 444.91 444.94 451.82 451.78
S318 9.07 9.12 7.30 7.30
S319 34.57 36.46 34.42 34.19
S3110 8.52 8.61 4.11 4.11
S13110 5.75 5.78 12.12 12.12
S3111 3.60 3.64 3.60 3.60
S3112 7.20 7.30 7.21 7.20
S3113 33.68 34.18 60.21 60.20
S321 16.80 16.87 16.80 16.80
S322 12.42 12.64 12.60 12.60
S323 10.88 11.24 8.48 8.51
S331 4.23 4.36 7.20 7.20
S332 7.20 7.28 5.21 5.31
S341 4.80 5.04 7.23 7.20
S342 6.01 6.24 7.25 7.20
S343 2.04 2.16 2.16 2.01
S351 46.63 48.65 21.82 46.46
S1351 49.37 51.28 33.68 49.06
S352 57.64 58.44 57.68 57.64
S353 8.21 8.44 8.34 8.19
S421 24.26 25.29 20.62 22.46
S1421 25.18 26.16 15.85 24.76
S422 80.08 81.51 79.22 78.99
S423 155.02 155.21 154.56 154.38
S424 22.62 23.35 11.42 22.36
S431 57.22 59.82 27.59 57.16
S441 13.27 14.23 12.88 12.81
S442 5.99 6.13 6.96 6.90
S443 17.37 17.77 17.15 16.95
S451 48.92 48.99 49.03 49.14
S452 42.97 39.57 14.64 96.03
S453 28.06 28.07 14.60 14.40
S471 8.27 8.56 8.39 8.43
S481 10.93 11.23 12.04 12.00
S482 9.21 9.42 9.19 9.17
S491 11.31 11.60 11.37 11.28
S4112 8.21 8.45 9.13 8.94
S4113 8.65 8.95 8.86 8.85
S4114 11.87 12.35 12.18 11.77
S4115 8.28 8.51 8.95 8.59
S4116 3.23 3.22 6.02 5.94
S4117 13.97 9.69 10.16 9.98
S4121 8.20 8.44 4.04 8.17
va 28.50 29.33 23.58 48.46
vag 12.37 12.93 13.58 13.20
vas 13.46 14.15 13.03 12.47
vif 4.55 4.79 5.06 4.92
vpv 57.21 59.83 28.28 57.24
vtv 57.92 60.42 28.40 57.63
vpvtv 32.84 33.77 16.35 32.73
vpvts 5.82 6.07 2.99 6.38
vpvpv 32.87 33.84 16.54 32.85
vtvtv 32.82 33.75 16.84 35.97
vsumr 72.03 72.28 72.20 72.04
vdotr 72.05 73.22 72.42 72.04
vbor 205.24 381.18 99.80 372.05

I apologize for the multiple e-mails with a long list of numbers, but I
think that this was worth it (and I did not want to be unfair to the
clang developers).

-Hal

>
> Here are improved results where the correct (and default)
> vector-register size was used.
>
> Loop llvm-v llvm gcc-v gcc
> -------------------------------------------
> [The corrected results table (default 128-bit vector registers) was
> quoted here in full; it is identical to the table in the quoted
> message above, so it has been trimmed.]
>
> -Hal
>
> > > [The original results table (from the mistaken
> > > -bb-vectorize-vector-bits=256 run) was quoted here; it is identical
> > > to the table quoted earlier in the thread, and the archived message
> > > is truncated partway through it.]
34.42 34.19 > > > S3110 8.53 8.57 4.11 4.11 > > > S13110 5.76 5.77 12.12 12.12 > > > S3111 3.60 3.62 3.60 3.60 > > > S3112 7.20 7.30 7.21 7.20 > > > S3113 35.12 35.47 60.21 60.20 > > > S321 16.81 16.81 16.80 16.80 > > > S322 12.42 12.60 12.60 12.60 > > > S323 10.93 11.02 8.48 8.51 > > > S331 4.23 4.23 7.20 7.20 > > > S332 7.21 7.21 5.21 5.31 > > > S341 4.74 4.85 7.23 7.20 > > > S342 6.02 6.09 7.25 7.20 > > > S343 2.14 2.06 2.16 2.01 > > > S351 49.26 47.34 21.82 46.46 > > > S1351 50.85 50.35 33.68 49.06 > > > S352 58.14 58.04 57.68 57.64 > > > S353 8.35 8.38 8.34 8.19 > > > S421 43.13 43.34 20.62 22.46 > > > S1421 25.25 25.81 15.85 24.76 > > > S422 88.36 87.53 79.22 78.99 > > > S423 155.13 155.29 154.56 154.38 > > > S424 37.11 37.51 11.42 22.36 > > > S431 58.22 60.66 27.59 57.16 > > > S441 14.05 13.29 12.88 12.81 > > > S442 6.08 6.00 6.96 6.90 > > > S443 17.60 17.77 17.15 16.95 > > > S451 48.95 49.08 49.03 49.14 > > > S452 42.98 39.32 14.64 96.03 > > > S453 28.06 28.06 14.60 14.40 > > > S471 8.53 8.65 8.39 8.43 > > > S481 10.98 11.15 12.04 12.00 > > > S482 9.31 9.31 9.19 9.17 > > > S491 11.54 11.38 11.37 11.28 > > > S4112 8.21 8.36 9.13 8.94 > > > S4113 8.77 8.81 8.86 8.85 > > > S4114 12.32 12.15 12.18 11.77 > > > S4115 8.48 8.46 8.95 8.59 > > > S4116 3.21 3.23 6.02 5.94 > > > S4117 14.08 9.61 10.16 9.98 > > > S4121 8.53 8.26 4.04 8.17 > > > va 30.09 28.58 23.58 48.46 > > > vag 12.35 12.36 13.58 13.20 > > > vas 13.74 13.49 13.03 12.47 > > > vif 4.49 4.57 5.06 4.92 > > > vpv 58.59 57.22 28.28 57.24 > > > vtv 59.15 57.83 28.40 57.63 > > > vpvtv 33.18 32.84 16.35 32.73 > > > vpvts 5.99 5.83 2.99 6.38 > > > vpvpv 33.25 32.89 16.54 32.85 > > > vtvtv 32.83 32.80 16.84 35.97 > > > vsumr 72.03 72.03 72.20 72.04 > > > vdotr 72.05 72.05 72.42 72.04 > > > vbor 205.22 380.81 99.80 372.05 > > > > > > I've yet to go through these in detail (they just finished running 5 > > > minutes ago). But for the curious (and I've had several requests for > > > benchmarks), here you go. 
There is obviously more work to do. > > > > > > -Hal > > > > > > On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote: > > > > Hi Hal, > > > > > > > > those numbers look very promising, great work! :) > > > > > > > > Best, > > > > Ralf > > > > > > > > ----- Original Message ----- > > > > > From: "Hal Finkel" <hfinkel at anl.gov> > > > > > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com> > > > > > Cc: llvm-commits at cs.uiuc.edu > > > > > Sent: Friday, 28 October 2011 13:50:00 > > > > > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass > > > > > > > > > > Bruno, et al., > > > > > > > > > > I've attached a new version of the patch that contains improvements > > > > > (and > > > > > a critical bug fix [the code output is no more correct, but the pass > > > > > in > > > > > the older patch would crash in certain cases and now does not]) > > > > > compared > > > > > to previous versions that I've posted. > > > > > > > > > > First, these are preliminary results because I did not do the things > > > > > necessary to make them real (explicitly quiet the machine, bind the > > > > > processes to one cpu, etc.). But they should be good enough for > > > > > discussion. > > > > > > > > > > I'm using LLVM head r143101, with the attached patch applied, and > > > > > clang > > > > > head r143100 on an x86_64 machine (some kind of Intel Xeon). For the > > > > > gcc > > > > > comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc was run -O3 > > > > > without any other optimization flags. opt was run -vectorize > > > > > -unroll-allow-partial -O3 with no other optimization flags (the patch > > > > > adds the -vectorize option). llc was just given -O3. > > > > > > > > > > It is not difficult to construct an example in which vectorization > > > > > would > > > > > be useful: take a loop that does more computation than load/stores, > > > > > and > > > > > (partially) unroll it.
Here is a simple case: > > > > > > > > > > #define ITER 5000 > > > > > #define NUM 200 > > > > > double a[NUM][NUM]; > > > > > double b[NUM][NUM]; > > > > > > > > > > ... > > > > > > > > > > int main() > > > > > { > > > > > ... > > > > > > > > > > for (int i = 0; i < ITER; ++i) { > > > > > for (int x = 0; x < NUM; ++x) > > > > > for (int y = 0; y < NUM; ++y) { > > > > > double v = a[x][y], w = b[x][y]; > > > > > double z1 = v*w; > > > > > double z2 = v+w; > > > > > double z3 = z1*z2; > > > > > double z4 = z3+v; > > > > > double z5 = z2+w; > > > > > double z6 = z4*z5; > > > > > double z7 = z4+z5; > > > > > a[x][y] = v*v-z6; > > > > > b[x][y] = w-z7; > > > > > } > > > > > } > > > > > > > > > > ... > > > > > > > > > > return 0; > > > > > } > > > > > > > > > > Results: > > > > > gcc -O3: 0m1.790s > > > > > llvm -vectorize: 0m2.360s > > > > > llvm: 0m2.780s > > > > > gcc -fno-tree-vectorize: 0m2.810s > > > > > (these are the user times after I've run enough for the times to > > > > > settle > > > > > to three decimal places) > > > > > > > > > > So the vectorization gives a ~15% improvement in the running time. > > > > > gcc's > > > > > vectorization still does a much better job, however (yielding an ~36% > > > > > improvement). So there is still work to do ;) > > > > > > > > > > Additionally, I've checked the autovectorization on some classic > > > > > numerical benchmarks from netlib. On these benchmarks, clang/llvm > > > > > already do a good job compared to gcc (gcc is only about 10% better, > > > > > and > > > > > this is true regardless of whether gcc's vectorization is on or off). > > > > > For these cases, autovectorization provides an insignificant speedup > > > > > in > > > > > most cases (but does not tend to make things worse, just not really > > > > > any > > > > > better either). Because gcc's vectorization also did not really help > > > > > gcc > > > > > in these cases, I'm not surprised.
A good collection of these is > > > > > available here: > > > > > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz > > > > > > > > > > I've yet to run the test suite using the pass to validate it. That is > > > > > something that I plan to do. Actually, the "Livermore Loops" test in > > > > > the > > > > > aforementioned archive contains checksums to validate the results, > > > > > and > > > > > it looks like 1 or 2 of the loop results are wrong with vectorization > > > > > turned on, so I'll have to investigate that. > > > > > > > > > > -Hal > > > > > > > > > > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote: > > > > > > Hi Hal, > > > > > > > > > > > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov> > > > > > > wrote: > > > > > > > I've attached an initial version of a basic-block > > > > > > > autovectorization > > > > > > > pass. It works by searching a basic block for pairable > > > > > > > (independent) > > > > > > > instructions, and, using a chain-seeking heuristic, selects > > > > > > > pairings > > > > > > > likely to provide an overall speedup (if such pairings can be > > > > > > > found). > > > > > > > The selected pairs are then fused and, if necessary, other > > > > > > > instructions > > > > > > > are moved in order to maintain data-flow consistency. This works > > > > > > > only > > > > > > > within one basic block, but can do loop vectorization in > > > > > > > combination > > > > > > > with (partial) unrolling. The basic idea was inspired by the > > > > > > > Vienna MAP > > > > > > > Vectorizor, which has been used to vectorize FFT kernels, but the > > > > > > > algorithm used here is different. > > > > > > > > > > > > > > To try it, use -bb-vectorize with opt. 
There are a few options: > > > > > > > -bb-vectorize-req-chain-depth: default: 3 -- The depth of the > > > > > > > chain of > > > > > > > instruction pairs necessary in order to consider the pairs that > > > > > > > compose > > > > > > > the chain worthy of vectorization. > > > > > > > -bb-vectorize-vector-bits: default: 128 -- The size of the target > > > > > > > vector > > > > > > > registers > > > > > > > -bb-vectorize-no-ints -- Don't consider integer instructions > > > > > > > -bb-vectorize-no-floats -- Don't consider floating-point > > > > > > > instructions > > > > > > > > > > > > > > The vectorizer generates a lot of insert_element/extract_element > > > > > > > pairs; > > > > > > > the assumption is that other passes will turn these into shuffles > > > > > > > when > > > > > > > possible (it looks like some work is necessary here). It will > > > > > > > also > > > > > > > vectorize vector instructions, and generate shuffles in this > > > > > > > case > > > > > > > (again, other passes should combine these as appropriate). > > > > > > > > > > > > > > Currently, it does not fuse load or store instructions, but that > > > > > > > is a > > > > > > > feature that I'd like to add. Of course, alignment information is > > > > > > > an > > > > > > > issue for load/store vectorization (or maybe I should just fuse > > > > > > > them > > > > > > > anyway and let isel deal with unaligned cases?). > > > > > > > > > > > > > > Also, support needs to be added for fusing known intrinsics (fma, > > > > > > > etc.), > > > > > > > and, as has been discussed on llvmdev, we should add some > > > > > > > intrinsics to > > > > > > > allow the generation of addsub-type instructions. > > > > > > > > > > > > > > I've included a few tests, but it needs more. Please review (I'll > > > > > > > commit > > > > > > > if and when everyone is happy). > > > > > > > > > > > > > > Thanks in advance, > > > > > > > Hal > > > > > > > > > > > > > > P.S.
There is another option (not so useful right now, but could > > > > > > > be): > > > > > > > -bb-vectorize-fast-dep -- Don't do a full inter-instruction > > > > > > > dependency > > > > > > > analysis; instead stop looking for instruction pairs after the > > > > > > > first use > > > > > > > of an instruction's value. [This makes the pass faster, but would > > > > > > > require a data-dependence-based reordering pass in order to be > > > > > > > effective]. > > > > > > > > > > > > Cool! :) > > > > > > Have you run this pass with any benchmark or the llvm testsuite? > > > > > > Does > > > > > > it present any regressions? > > > > > > Do you have any performance results? > > > > > > Cheers, > > > > > > > > > > > > > > > > -- > > > > > Hal Finkel > > > > > Postdoctoral Appointee > > > > > Leadership Computing Facility > > > > > Argonne National Laboratory > > > > > > > > > > _______________________________________________ > > > > > llvm-commits mailing list > > > > > llvm-commits at cs.uiuc.edu > > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits > > > > > > > > > > > _______________________________________________ > > > llvm-commits mailing list > > > llvm-commits at cs.uiuc.edu > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits > > >-- Hal Finkel Postdoctoral Appointee Leadership Computing Facility Argonne National Laboratory
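[Editor's note: the benchmark pipeline described in the message above can be sketched as the following invocation sequence. This is a rough outline, not the exact benchmark setup: it requires LLVM/clang head (r143100/r143101) with the autovectorization patch applied (which adds the -vectorize option), file names are placeholders, and -Dinline= is the TSVC-specific workaround mentioned in the thread. As noted in a follow-up below, passing -O3 to clang as well matters significantly.]

```sh
# Emit LLVM bitcode (clang -O3 matters; -Dinline= works around the
# -emit-llvm inline-function bug described above).
clang -O3 -Dinline= -emit-llvm -c kernel.c -o kernel.bc

# Run the new pass plus partial unrolling; -vectorize exists only
# with the patch applied.
opt -vectorize -unroll-allow-partial -O3 kernel.bc -o kernel.opt.bc

# Generate assembly.
llc -O3 kernel.opt.bc -o kernel.s
```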
Hal Finkel
2011-Nov-01 00:50 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
I've attached the latest version of my autovectorization patch. This version is significantly faster (in compile time) than the version I posted a couple of days ago, and generally produces better output. At this point, next steps in enhancing the vectorization include: 1. Add an add/sub and/or alternating-negation vector intrinsic to provide for generating add-subtract, and more generally, asymmetric fma instructions. 2. Make -vectorize imply -unroll-allow-partial [Is there an easy way to do this?] 3. Add a -fvectorize flag to clang along the same lines. Updated vectorization benchmark: Loop llvm-v llvm gcc-v gcc ------------------------------------------- S000 9.00 9.59 4.55 10.04 S111 7.25 7.65 7.68 7.83 S1111 13.63 14.72 16.14 16.30 S112 16.60 17.45 16.54 17.52 S1112 12.99 13.87 14.83 14.84 S113 22.03 22.98 22.05 22.05 S1113 11.01 11.48 11.03 11.01 S114 13.14 13.81 13.53 13.48 S115 32.92 33.36 49.98 49.99 S1115 13.61 14.23 13.65 13.66 S116 46.90 49.43 49.54 48.11 S118 10.76 11.25 10.79 10.50 S119 8.68 9.09 11.83 11.82 S1119 8.75 9.15 4.31 11.87 S121 17.17 18.06 14.84 17.31 S122 7.53 7.70 6.11 6.11 S123 6.92 7.10 7.42 7.41 S124 9.60 9.84 9.42 9.33 S125 6.89 7.10 4.67 7.81 S126 2.33 2.55 2.57 2.37 S127 12.18 12.68 7.06 14.50 S128 11.66 12.41 12.42 11.52 S131 28.59 30.11 25.17 28.94 S132 17.04 17.04 15.53 21.03 S141 12.18 12.85 12.38 12.05 S151 28.61 30.11 24.89 28.95 S152 15.47 16.03 11.19 15.63 S161 6.00 6.12 5.52 5.46 S1161 14.40 14.50 8.80 8.79 S162 8.18 8.41 5.36 8.18 S171 14.05 7.96 2.81 5.70 S172 5.67 5.97 2.75 5.70 S173 30.17 31.69 18.15 30.13 S174 30.12 31.53 18.51 30.16 S175 5.75 6.04 4.94 5.77 S176 5.57 5.83 4.41 7.65 S211 16.23 16.89 16.82 16.38 S212 13.19 13.50 13.34 13.18 S1213 12.83 13.35 12.80 12.43 S221 10.86 11.09 8.65 8.63 S1221 5.71 6.03 5.40 6.05 S222 6.00 6.29 5.70 5.72 S231 22.23 24.22 22.36 22.11 S232 6.89 6.94 6.89 6.89 S1232 15.23 16.43 15.05 15.10 S233 55.17 59.98 54.21 49.56 S2233 27.07 29.71 29.68 28.40 S235 43.79 47.85 46.94 43.93 
S241 31.00 31.72 32.53 31.01 S242 7.20 7.21 7.20 7.20 S243 16.48 16.99 17.69 16.84 S244 14.47 14.93 16.91 16.82 S1244 14.75 15.02 14.77 14.40 S2244 9.97 10.60 10.40 10.06 S251 34.20 35.55 19.70 34.38 S1251 55.09 57.11 41.77 56.11 S2251 15.64 16.26 17.02 15.70 S3251 15.55 16.52 19.60 15.34 S252 6.14 6.46 7.72 7.26 S253 11.18 11.52 14.40 14.40 S254 17.72 18.98 28.23 28.06 S255 5.93 6.14 9.96 9.95 S256 3.06 3.39 3.10 3.09 S257 2.12 2.31 2.21 2.20 S258 1.79 1.87 1.84 1.84 S261 12.01 12.22 10.98 10.95 S271 32.76 33.76 33.25 33.01 S272 14.93 15.52 15.39 15.26 S273 13.92 14.10 16.86 16.80 S274 17.77 18.53 18.15 17.89 S275 2.90 3.14 3.36 2.98 S2275 32.65 34.95 8.97 33.60 S276 41.38 41.97 40.80 40.55 S277 4.81 4.93 4.81 4.80 S278 14.41 14.76 14.70 14.66 S279 8.04 8.24 7.25 7.27 S1279 9.71 9.92 9.34 9.25 S2710 7.68 8.07 7.86 7.56 S2711 35.53 37.10 36.56 36.00 S2712 32.91 33.96 34.24 33.47 S281 10.75 11.32 12.46 12.02 S1281 104.13 78.11 57.78 68.06 S291 11.75 12.27 14.03 14.03 S292 6.70 6.91 9.94 9.96 S293 15.38 16.24 19.32 19.33 S2101 2.50 2.67 2.59 2.60 S2102 16.56 18.45 16.68 16.75 S2111 5.59 5.63 5.85 5.85 S311 72.04 72.27 72.23 72.03 S31111 6.37 6.01 6.00 6.00 S312 96.04 96.17 96.05 96.03 S313 36.03 36.61 36.03 36.02 S314 36.02 36.12 74.67 72.42 S315 9.11 9.21 9.35 9.30 S316 36.02 36.12 72.08 74.87 S317 444.96 444.94 451.82 451.78 S318 9.07 9.12 7.30 7.30 S319 34.49 36.46 34.42 34.19 S3110 8.53 8.61 4.11 4.11 S13110 5.75 5.78 12.12 12.12 S3111 3.60 3.64 3.60 3.60 S3112 7.21 7.30 7.21 7.20 S3113 33.68 34.18 60.21 60.20 S321 16.80 16.87 16.80 16.80 S322 12.42 12.64 12.60 12.60 S323 10.89 11.24 8.48 8.51 S331 4.23 4.36 7.20 7.20 S332 7.21 7.28 5.21 5.31 S341 4.76 5.04 7.23 7.20 S342 6.02 6.24 7.25 7.20 S343 2.02 2.16 2.16 2.01 S351 46.33 48.65 21.82 46.46 S1351 49.07 51.28 33.68 49.06 S352 57.65 58.44 57.68 57.64 S353 8.19 8.44 8.34 8.19 S421 24.17 25.29 20.62 22.46 S1421 25.09 26.16 15.85 24.76 S422 79.95 81.51 79.22 78.99 S423 154.93 155.21 154.56 154.38 S424 22.61 23.35 
11.42 22.36 S431 56.88 59.82 27.59 57.16 S441 14.05 14.23 12.88 12.81 S442 5.99 6.13 6.96 6.90 S443 17.33 17.77 17.15 16.95 S451 48.94 48.99 49.03 49.14 S452 43.01 39.57 14.64 96.03 S453 28.07 28.07 14.60 14.40 S471 8.20 8.56 8.39 8.43 S481 10.89 11.23 12.04 12.00 S482 9.20 9.42 9.19 9.17 S491 11.25 11.60 11.37 11.28 S4112 8.20 8.45 9.13 8.94 S4113 8.64 8.95 8.86 8.85 S4114 11.82 12.35 12.18 11.77 S4115 8.27 8.51 8.95 8.59 S4116 3.22 3.22 6.02 5.94 S4117 13.96 9.69 10.16 9.98 S4121 8.19 8.44 4.04 8.17 va 28.39 29.33 23.58 48.46 vag 12.26 12.93 13.58 13.20 vas 13.36 14.15 13.03 12.47 vif 4.50 4.79 5.06 4.92 vpv 56.84 59.83 28.28 57.24 vtv 57.58 60.42 28.40 57.63 vpvtv 32.78 33.77 16.35 32.73 vpvts 5.78 6.07 2.99 6.38 vpvpv 32.78 33.84 16.54 32.85 vtvtv 32.76 33.75 16.84 35.97 vsumr 72.04 72.28 72.20 72.04 vdotr 72.05 73.22 72.42 72.04 vbor 227.55 381.18 99.80 372.05 -Hal On Sat, 2011-10-29 at 17:56 -0500, Hal Finkel wrote:> On Sat, 2011-10-29 at 15:16 -0500, Hal Finkel wrote: > > On Sat, 2011-10-29 at 14:02 -0500, Hal Finkel wrote: > > > On Sat, 2011-10-29 at 12:30 -0500, Hal Finkel wrote: > > > > Ralf, et al., > > > > > > > > Attached is the latest version of my autovectorization patch. llvmdev > > > > has been CC'd (as had been suggested to me); this e-mail contains > > > > additional benchmark results. > > > > > > > > First, these are preliminary results because I did not do the things > > > > necessary to make them real (explicitly quiet the machine, bind the > > > > processes to one cpu, etc.). But they should be good enough for > > > > discussion. > > > > > > > > I'm using LLVM head r143101, with the attached patch applied, and clang > > > > head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc > > > > comparison, I'm using build Ubuntu 4.4.3-4ubuntu5. gcc was run -O3 > > > > without any other optimization flags. 
opt was run -vectorize > > > > -unroll-allow-partial -O3 with no other optimization flags (the patch > > > > adds the -vectorize option). > > > > > > And opt had also been given the flag: -bb-vectorize-vector-bits=256 > > > > And this was a mistake (because the machine on which the benchmarks were > > run does not have AVX). I've rerun, see better results below... > > > > > > > > -Hal > > > > > > > llc was just given -O3. > > > > > > > > Below I've included results using the benchmark program by Maleki, et > > > > al. See: > > > > An Evaluation of Vectorizing Compilers - PACT'11 > > > > (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of > > > > their benchmark program was retrieved from: > > > > http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz > > > > > > > > Also, when using clang, I had to pass -Dinline= on the command line: > > > > when using -emit-llvm, clang appears not to emit code for functions > > > > declared inline. This is a bug, but I've not yet tracked it down. There > > > > are two such small functions in the benchmark program, and the regular > > > > inliner *should* catch them anyway. > > > > > > > > Results: > > > > 0. Name of the loop > > > > 1. Time using LLVM with vectorization > > > > 2. Time using LLVM without vectorization > > > > 3. Time using gcc with vectorization > > > > 4. Time using gcc without vectorization > > As Peter Collingbourne indirectly pointed out to me, clang's > optimizations are still important (even if it is emitting only llvm). > I've rerun the llvm code generation steps, adding -O3 to clang. 
Here are > the results (they are significantly better): > > Loop llvm-v llvm gcc-v gcc > ------------------------------------------- > S000 9.10 9.59 4.55 10.04 > S111 7.29 7.65 7.68 7.83 > S1111 13.87 14.72 16.14 16.30 > S112 16.67 17.45 16.54 17.52 > S1112 13.16 13.87 14.83 14.84 > S113 22.14 22.98 22.05 22.05 > S1113 11.06 11.48 11.03 11.01 > S114 13.21 13.81 13.53 13.48 > S115 32.82 33.36 49.98 49.99 > S1115 13.67 14.23 13.65 13.66 > S116 47.37 49.43 49.54 48.11 > S118 10.81 11.25 10.79 10.50 > S119 8.73 9.09 11.83 11.82 > S1119 8.82 9.15 4.31 11.87 > S121 17.29 18.06 14.84 17.31 > S122 7.53 7.70 6.11 6.11 > S123 6.93 7.10 7.42 7.41 > S124 9.63 9.84 9.42 9.33 > S125 6.94 7.10 4.67 7.81 > S126 2.34 2.55 2.57 2.37 > S127 12.23 12.68 7.06 14.50 > S128 11.78 12.41 12.42 11.52 > S131 28.79 30.11 25.17 28.94 > S132 17.04 17.04 15.53 21.03 > S141 12.26 12.85 12.38 12.05 > S151 28.79 30.11 24.89 28.95 > S152 15.53 16.03 11.19 15.63 > S161 6.00 6.12 5.52 5.46 > S1161 14.40 14.50 8.80 8.79 > S162 8.19 8.41 5.36 8.18 > S171 15.41 7.96 2.81 5.70 > S172 5.70 5.97 2.75 5.70 > S173 30.32 31.69 18.15 30.13 > S174 30.20 31.53 18.51 30.16 > S175 5.79 6.04 4.94 5.77 > S176 5.59 5.83 4.41 7.65 > S211 16.31 16.89 16.82 16.38 > S212 13.23 13.50 13.34 13.18 > S1213 12.82 13.35 12.80 12.43 > S221 10.87 11.09 8.65 8.63 > S1221 5.72 6.03 5.40 6.05 > S222 6.01 6.29 5.70 5.72 > S231 22.38 24.22 22.36 22.11 > S232 6.89 6.94 6.89 6.89 > S1232 15.31 16.43 15.05 15.10 > S233 55.47 59.98 54.21 49.56 > S2233 27.23 29.71 29.68 28.40 > S235 44.08 47.85 46.94 43.93 > S241 31.14 31.72 32.53 31.01 > S242 7.20 7.21 7.20 7.20 > S243 16.54 16.99 17.69 16.84 > S244 14.51 14.93 16.91 16.82 > S1244 14.72 15.02 14.77 14.40 > S2244 10.09 10.60 10.40 10.06 > S251 34.42 35.55 19.70 34.38 > S1251 55.39 57.11 41.77 56.11 > S2251 15.69 16.26 17.02 15.70 > S3251 15.69 16.52 19.60 15.34 > S252 6.18 6.46 7.72 7.26 > S253 11.19 11.52 14.40 14.40 > S254 18.00 18.98 28.23 28.06 > S255 5.94 6.14 9.96 9.95 > S256 3.09 
3.39 3.10 3.09 > S257 2.13 2.31 2.21 2.20 > S258 1.80 1.87 1.84 1.84 > S261 12.00 12.22 10.98 10.95 > S271 32.81 33.76 33.25 33.01 > S272 15.04 15.52 15.39 15.26 > S273 13.93 14.10 16.86 16.80 > S274 17.83 18.53 18.15 17.89 > S275 2.92 3.14 3.36 2.98 > S2275 32.81 34.95 8.97 33.60 > S276 41.26 41.97 40.80 40.55 > S277 4.80 4.93 4.81 4.80 > S278 14.43 14.76 14.70 14.66 > S279 8.05 8.24 7.25 7.27 > S1279 9.72 9.92 9.34 9.25 > S2710 7.73 8.07 7.86 7.56 > S2711 36.49 37.10 36.56 36.00 > S2712 32.96 33.96 34.24 33.47 > S281 10.80 11.32 12.46 12.02 > S1281 79.10 78.11 57.78 68.06 > S291 11.79 12.27 14.03 14.03 > S292 6.70 6.91 9.94 9.96 > S293 15.50 16.24 19.32 19.33 > S2101 2.56 2.67 2.59 2.60 > S2102 16.74 18.45 16.68 16.75 > S2111 5.59 5.63 5.85 5.85 > S311 72.04 72.27 72.23 72.03 > S31111 7.50 6.01 6.00 6.00 > S312 96.04 96.17 96.05 96.03 > S313 36.02 36.61 36.03 36.02 > S314 36.01 36.12 74.67 72.42 > S315 9.11 9.21 9.35 9.30 > S316 36.01 36.12 72.08 74.87 > S317 444.91 444.94 451.82 451.78 > S318 9.07 9.12 7.30 7.30 > S319 34.57 36.46 34.42 34.19 > S3110 8.52 8.61 4.11 4.11 > S13110 5.75 5.78 12.12 12.12 > S3111 3.60 3.64 3.60 3.60 > S3112 7.20 7.30 7.21 7.20 > S3113 33.68 34.18 60.21 60.20 > S321 16.80 16.87 16.80 16.80 > S322 12.42 12.64 12.60 12.60 > S323 10.88 11.24 8.48 8.51 > S331 4.23 4.36 7.20 7.20 > S332 7.20 7.28 5.21 5.31 > S341 4.80 5.04 7.23 7.20 > S342 6.01 6.24 7.25 7.20 > S343 2.04 2.16 2.16 2.01 > S351 46.63 48.65 21.82 46.46 > S1351 49.37 51.28 33.68 49.06 > S352 57.64 58.44 57.68 57.64 > S353 8.21 8.44 8.34 8.19 > S421 24.26 25.29 20.62 22.46 > S1421 25.18 26.16 15.85 24.76 > S422 80.08 81.51 79.22 78.99 > S423 155.02 155.21 154.56 154.38 > S424 22.62 23.35 11.42 22.36 > S431 57.22 59.82 27.59 57.16 > S441 13.27 14.23 12.88 12.81 > S442 5.99 6.13 6.96 6.90 > S443 17.37 17.77 17.15 16.95 > S451 48.92 48.99 49.03 49.14 > S452 42.97 39.57 14.64 96.03 > S453 28.06 28.07 14.60 14.40 > S471 8.27 8.56 8.39 8.43 > S481 10.93 11.23 12.04 12.00 > S482 9.21 
9.42 9.19 9.17 > S491 11.31 11.60 11.37 11.28 > S4112 8.21 8.45 9.13 8.94 > S4113 8.65 8.95 8.86 8.85 > S4114 11.87 12.35 12.18 11.77 > S4115 8.28 8.51 8.95 8.59 > S4116 3.23 3.22 6.02 5.94 > S4117 13.97 9.69 10.16 9.98 > S4121 8.20 8.44 4.04 8.17 > va 28.50 29.33 23.58 48.46 > vag 12.37 12.93 13.58 13.20 > vas 13.46 14.15 13.03 12.47 > vif 4.55 4.79 5.06 4.92 > vpv 57.21 59.83 28.28 57.24 > vtv 57.92 60.42 28.40 57.63 > vpvtv 32.84 33.77 16.35 32.73 > vpvts 5.82 6.07 2.99 6.38 > vpvpv 32.87 33.84 16.54 32.85 > vtvtv 32.82 33.75 16.84 35.97 > vsumr 72.03 72.28 72.20 72.04 > vdotr 72.05 73.22 72.42 72.04 > vbor 205.24 381.18 99.80 372.05 > > I apologize for the multiple e-mails with a long list of numbers, but I > think that this was worth it (and I did not want to be unfair to the > clang developers). > > -Hal > > > > > Here are improved results where the correct (and default) > > vector-register size was used. > > > > Loop llvm-v llvm gcc-v gcc > > ------------------------------------------- > > S000 9.09 9.49 4.55 10.04 > > S111 7.28 7.37 7.68 7.83 > > S1111 13.78 14.48 16.14 16.30 > > S112 16.67 17.41 16.54 17.52 > > S1112 13.12 14.21 14.83 14.84 > > S113 22.12 22.88 22.05 22.05 > > S1113 11.06 11.42 11.03 11.01 > > S114 13.23 13.75 13.53 13.48 > > S115 32.76 33.24 49.98 49.99 > > S1115 13.68 14.18 13.65 13.66 > > S116 47.42 49.40 49.54 48.11 > > S118 10.84 11.26 10.79 10.50 > > S119 8.74 9.07 11.83 11.82 > > S1119 8.81 9.14 4.31 11.87 > > S121 17.28 18.78 14.84 17.31 > > S122 7.53 7.54 6.11 6.11 > > S123 6.90 7.38 7.42 7.41 > > S124 9.60 9.77 9.42 9.33 > > S125 6.92 7.22 4.67 7.81 > > S126 2.34 2.53 2.57 2.37 > > S127 12.19 12.97 7.06 14.50 > > S128 11.74 12.43 12.42 11.52 > > S131 28.75 29.91 25.17 28.94 > > S132 17.04 17.04 15.53 21.03 > > S141 12.28 12.26 12.38 12.05 > > S151 28.80 29.43 24.89 28.95 > > S152 15.54 16.03 11.19 15.63 > > S161 6.00 6.06 5.52 5.46 > > S1161 14.39 14.40 8.80 8.79 > > S162 8.19 9.05 5.36 8.18 > > S171 15.41 7.94 2.81 5.70 > > S172 
5.71 5.89 2.75 5.70 > > S173 30.31 30.92 18.15 30.13 > > S174 30.18 31.66 18.51 30.16 > > S175 5.78 6.18 4.94 5.77 > > S176 5.59 5.83 4.41 7.65 > > S211 16.27 17.14 16.82 16.38 > > S212 13.21 14.28 13.34 13.18 > > S1213 12.81 13.46 12.80 12.43 > > S221 10.86 11.09 8.65 8.63 > > S1221 5.72 6.04 5.40 6.05 > > S222 6.02 6.26 5.70 5.72 > > S231 22.33 22.94 22.36 22.11 > > S232 6.88 6.88 6.89 6.89 > > S1232 15.30 15.34 15.05 15.10 > > S233 55.38 58.55 54.21 49.56 > > S2233 27.08 29.77 29.68 28.40 > > S235 44.00 44.92 46.94 43.93 > > S241 31.09 31.35 32.53 31.01 > > S242 7.19 7.20 7.20 7.20 > > S243 16.52 17.09 17.69 16.84 > > S244 14.45 14.83 16.91 16.82 > > S1244 14.71 14.83 14.77 14.40 > > S2244 10.04 10.62 10.40 10.06 > > S251 34.15 35.75 19.70 34.38 > > S1251 55.23 57.84 41.77 56.11 > > S2251 15.73 15.87 17.02 15.70 > > S3251 15.66 16.21 19.60 15.34 > > S252 6.18 6.32 7.72 7.26 > > S253 11.14 11.38 14.40 14.40 > > S254 18.41 18.70 28.23 28.06 > > S255 5.93 6.09 9.96 9.95 > > S256 3.08 3.42 3.10 3.09 > > S257 2.13 2.25 2.21 2.20 > > S258 1.79 1.82 1.84 1.84 > > S261 12.00 12.08 10.98 10.95 > > S271 32.82 33.04 33.25 33.01 > > S272 14.98 15.82 15.39 15.26 > > S273 13.92 14.04 16.86 16.80 > > S274 17.83 18.31 18.15 17.89 > > S275 2.92 3.02 3.36 2.98 > > S2275 32.80 33.50 8.97 33.60 > > S276 39.43 39.44 40.80 40.55 > > S277 4.80 4.80 4.81 4.80 > > S278 14.41 14.42 14.70 14.66 > > S279 8.03 8.29 7.25 7.27 > > S1279 9.71 10.06 9.34 9.25 > > S2710 7.71 8.04 7.86 7.56 > > S2711 35.53 35.55 36.56 36.00 > > S2712 32.94 33.17 34.24 33.47 > > S281 10.79 11.09 12.46 12.02 > > S1281 79.13 77.55 57.78 68.06 > > S291 11.80 11.78 14.03 14.03 > > S292 7.77 7.78 9.94 9.96 > > S293 15.50 15.87 19.32 19.33 > > S2101 2.56 2.58 2.59 2.60 > > S2102 16.71 17.53 16.68 16.75 > > S2111 5.60 5.60 5.85 5.85 > > S311 72.03 72.03 72.23 72.03 > > S31111 7.49 6.00 6.00 6.00 > > S312 96.04 96.04 96.05 96.03 > > S313 36.02 36.13 36.03 36.02 > > S314 36.01 36.07 74.67 72.42 > > S315 8.96 8.99 9.35 9.30 
> > S316 36.02 36.06 72.08 74.87 > > S317 444.93 444.94 451.82 451.78 > > S318 9.05 9.07 7.30 7.30 > > S319 34.54 36.53 34.42 34.19 > > S3110 8.51 8.57 4.11 4.11 > > S13110 5.75 5.77 12.12 12.12 > > S3111 3.60 3.62 3.60 3.60 > > S3112 7.19 7.30 7.21 7.20 > > S3113 35.13 35.47 60.21 60.20 > > S321 16.79 16.81 16.80 16.80 > > S322 12.42 12.60 12.60 12.60 > > S323 10.86 11.02 8.48 8.51 > > S331 4.23 4.23 7.20 7.20 > > S332 7.20 7.21 5.21 5.31 > > S341 4.79 4.85 7.23 7.20 > > S342 6.01 6.09 7.25 7.20 > > S343 2.04 2.06 2.16 2.01 > > S351 46.61 47.34 21.82 46.46 > > S1351 49.28 50.35 33.68 49.06 > > S352 57.65 58.04 57.68 57.64 > > S353 8.21 8.38 8.34 8.19 > > S421 42.94 43.34 20.62 22.46 > > S1421 25.15 25.81 15.85 24.76 > > S422 87.39 87.53 79.22 78.99 > > S423 155.01 155.29 154.56 154.38 > > S424 36.51 37.51 11.42 22.36 > > S431 57.10 60.66 27.59 57.16 > > S441 14.04 13.29 12.88 12.81 > > S442 6.00 6.00 6.96 6.90 > > S443 17.28 17.77 17.15 16.95 > > S451 48.92 49.08 49.03 49.14 > > S452 42.98 39.32 14.64 96.03 > > S453 28.05 28.06 14.60 14.40 > > S471 8.24 8.65 8.39 8.43 > > S481 10.88 11.15 12.04 12.00 > > S482 9.21 9.31 9.19 9.17 > > S491 11.26 11.38 11.37 11.28 > > S4112 8.21 8.36 9.13 8.94 > > S4113 8.65 8.81 8.86 8.85 > > S4114 11.82 12.15 12.18 11.77 > > S4115 8.28 8.46 8.95 8.59 > > S4116 3.22 3.23 6.02 5.94 > > S4117 13.95 9.61 10.16 9.98 > > S4121 8.21 8.26 4.04 8.17 > > va 28.46 28.58 23.58 48.46 > > vag 12.35 12.36 13.58 13.20 > > vas 13.45 13.49 13.03 12.47 > > vif 4.55 4.57 5.06 4.92 > > vpv 57.08 57.22 28.28 57.24 > > vtv 57.81 57.83 28.40 57.63 > > vpvtv 32.82 32.84 16.35 32.73 > > vpvts 5.82 5.83 2.99 6.38 > > vpvpv 32.87 32.89 16.54 32.85 > > vtvtv 32.82 32.80 16.84 35.97 > > vsumr 72.04 72.03 72.20 72.04 > > vdotr 72.06 72.05 72.42 72.04 > > vbor 205.24 380.81 99.80 372.05 > > > > -Hal > > > > > > > > > > Loop llvm-v llvm gcc-v gcc > > > > ------------------------------------------- > > > > S000 9.59 9.49 4.55 10.04 > > > > S111 7.67 7.37 7.68 7.83 
> > > > S1111 13.98 14.48 16.14 16.30
> > > > S112 17.43 17.41 16.54 17.52
> > > > S1112 13.87 14.21 14.83 14.84
> > > > S113 22.97 22.88 22.05 22.05
> > > > S1113 11.46 11.42 11.03 11.01
> > > > S114 13.47 13.75 13.53 13.48
> > > > S115 33.06 33.24 49.98 49.99
> > > > S1115 13.91 14.18 13.65 13.66
> > > > S116 48.74 49.40 49.54 48.11
> > > > S118 11.04 11.26 10.79 10.50
> > > > S119 8.97 9.07 11.83 11.82
> > > > S1119 9.04 9.14 4.31 11.87
> > > > S121 18.06 18.78 14.84 17.31
> > > > S122 7.58 7.54 6.11 6.11
> > > > S123 7.02 7.38 7.42 7.41
> > > > S124 9.62 9.77 9.42 9.33
> > > > S125 7.14 7.22 4.67 7.81
> > > > S126 2.32 2.53 2.57 2.37
> > > > S127 12.87 12.97 7.06 14.50
> > > > S128 12.58 12.43 12.42 11.52
> > > > S131 29.91 29.91 25.17 28.94
> > > > S132 17.04 17.04 15.53 21.03
> > > > S141 12.59 12.26 12.38 12.05
> > > > S151 28.92 29.43 24.89 28.95
> > > > S152 15.68 16.03 11.19 15.63
> > > > S161 6.06 6.06 5.52 5.46
> > > > S1161 14.46 14.40 8.80 8.79
> > > > S162 8.31 9.05 5.36 8.18
> > > > S171 15.47 7.94 2.81 5.70
> > > > S172 5.92 5.89 2.75 5.70
> > > > S173 31.59 30.92 18.15 30.13
> > > > S174 31.16 31.66 18.51 30.16
> > > > S175 5.80 6.18 4.94 5.77
> > > > S176 5.69 5.83 4.41 7.65
> > > > S211 16.56 17.14 16.82 16.38
> > > > S212 13.46 14.28 13.34 13.18
> > > > S1213 13.12 13.46 12.80 12.43
> > > > S221 10.88 11.09 8.65 8.63
> > > > S1221 5.80 6.04 5.40 6.05
> > > > S222 6.01 6.26 5.70 5.72
> > > > S231 23.78 22.94 22.36 22.11
> > > > S232 6.88 6.88 6.89 6.89
> > > > S1232 16.00 15.34 15.05 15.10
> > > > S233 57.48 58.55 54.21 49.56
> > > > S2233 27.65 29.77 29.68 28.40
> > > > S235 46.40 44.92 46.94 43.93
> > > > S241 31.62 31.35 32.53 31.01
> > > > S242 7.20 7.20 7.20 7.20
> > > > S243 16.78 17.09 17.69 16.84
> > > > S244 14.64 14.83 16.91 16.82
> > > > S1244 14.98 14.83 14.77 14.40
> > > > S2244 10.47 10.62 10.40 10.06
> > > > S251 35.10 35.75 19.70 34.38
> > > > S1251 56.65 57.84 41.77 56.11
> > > > S2251 15.96 15.87 17.02 15.70
> > > > S3251 16.41 16.21 19.60 15.34
> > > > S252 7.24 6.32 7.72 7.26
> > > > S253 12.55 11.38 14.40 14.40
> > > > S254 19.08 18.70 28.23 28.06
> > > > S255 5.94 6.09 9.96 9.95
> > > > S256 3.14 3.42 3.10 3.09
> > > > S257 2.18 2.25 2.21 2.20
> > > > S258 1.80 1.82 1.84 1.84
> > > > S261 12.00 12.08 10.98 10.95
> > > > S271 32.93 33.04 33.25 33.01
> > > > S272 15.48 15.82 15.39 15.26
> > > > S273 13.99 14.04 16.86 16.80
> > > > S274 18.38 18.31 18.15 17.89
> > > > S275 3.02 3.02 3.36 2.98
> > > > S2275 33.71 33.50 8.97 33.60
> > > > S276 39.52 39.44 40.80 40.55
> > > > S277 4.81 4.80 4.81 4.80
> > > > S278 14.43 14.42 14.70 14.66
> > > > S279 8.10 8.29 7.25 7.27
> > > > S1279 9.77 10.06 9.34 9.25
> > > > S2710 7.85 8.04 7.86 7.56
> > > > S2711 35.54 35.55 36.56 36.00
> > > > S2712 33.16 33.17 34.24 33.47
> > > > S281 10.97 11.09 12.46 12.02
> > > > S1281 79.37 77.55 57.78 68.06
> > > > S291 11.94 11.78 14.03 14.03
> > > > S292 7.88 7.78 9.94 9.96
> > > > S293 15.90 15.87 19.32 19.33
> > > > S2101 2.59 2.58 2.59 2.60
> > > > S2102 17.63 17.53 16.68 16.75
> > > > S2111 5.63 5.60 5.85 5.85
> > > > S311 72.07 72.03 72.23 72.03
> > > > S31111 7.49 6.00 6.00 6.00
> > > > S312 96.06 96.04 96.05 96.03
> > > > S313 36.50 36.13 36.03 36.02
> > > > S314 36.10 36.07 74.67 72.42
> > > > S315 9.00 8.99 9.35 9.30
> > > > S316 36.11 36.06 72.08 74.87
> > > > S317 444.92 444.94 451.82 451.78
> > > > S318 9.04 9.07 7.30 7.30
> > > > S319 34.76 36.53 34.42 34.19
> > > > S3110 8.53 8.57 4.11 4.11
> > > > S13110 5.76 5.77 12.12 12.12
> > > > S3111 3.60 3.62 3.60 3.60
> > > > S3112 7.20 7.30 7.21 7.20
> > > > S3113 35.12 35.47 60.21 60.20
> > > > S321 16.81 16.81 16.80 16.80
> > > > S322 12.42 12.60 12.60 12.60
> > > > S323 10.93 11.02 8.48 8.51
> > > > S331 4.23 4.23 7.20 7.20
> > > > S332 7.21 7.21 5.21 5.31
> > > > S341 4.74 4.85 7.23 7.20
> > > > S342 6.02 6.09 7.25 7.20
> > > > S343 2.14 2.06 2.16 2.01
> > > > S351 49.26 47.34 21.82 46.46
> > > > S1351 50.85 50.35 33.68 49.06
> > > > S352 58.14 58.04 57.68 57.64
> > > > S353 8.35 8.38 8.34 8.19
> > > > S421 43.13 43.34 20.62 22.46
> > > > S1421 25.25 25.81 15.85 24.76
> > > > S422 88.36 87.53 79.22 78.99
> > > > S423 155.13 155.29 154.56 154.38
> > > > S424 37.11 37.51 11.42 22.36
> > > > S431 58.22 60.66 27.59 57.16
> > > > S441 14.05 13.29 12.88 12.81
> > > > S442 6.08 6.00 6.96 6.90
> > > > S443 17.60 17.77 17.15 16.95
> > > > S451 48.95 49.08 49.03 49.14
> > > > S452 42.98 39.32 14.64 96.03
> > > > S453 28.06 28.06 14.60 14.40
> > > > S471 8.53 8.65 8.39 8.43
> > > > S481 10.98 11.15 12.04 12.00
> > > > S482 9.31 9.31 9.19 9.17
> > > > S491 11.54 11.38 11.37 11.28
> > > > S4112 8.21 8.36 9.13 8.94
> > > > S4113 8.77 8.81 8.86 8.85
> > > > S4114 12.32 12.15 12.18 11.77
> > > > S4115 8.48 8.46 8.95 8.59
> > > > S4116 3.21 3.23 6.02 5.94
> > > > S4117 14.08 9.61 10.16 9.98
> > > > S4121 8.53 8.26 4.04 8.17
> > > > va 30.09 28.58 23.58 48.46
> > > > vag 12.35 12.36 13.58 13.20
> > > > vas 13.74 13.49 13.03 12.47
> > > > vif 4.49 4.57 5.06 4.92
> > > > vpv 58.59 57.22 28.28 57.24
> > > > vtv 59.15 57.83 28.40 57.63
> > > > vpvtv 33.18 32.84 16.35 32.73
> > > > vpvts 5.99 5.83 2.99 6.38
> > > > vpvpv 33.25 32.89 16.54 32.85
> > > > vtvtv 32.83 32.80 16.84 35.97
> > > > vsumr 72.03 72.03 72.20 72.04
> > > > vdotr 72.05 72.05 72.42 72.04
> > > > vbor 205.22 380.81 99.80 372.05
> > > >
> > > > I've yet to go through these in detail (they just finished running 5
> > > > minutes ago). But for the curious (and I've had several requests for
> > > > benchmarks), here you go. There is obviously more work to do.
> > > >
> > > > -Hal
> > > >
> > > > On Fri, 2011-10-28 at 14:30 +0200, Ralf Karrenberg wrote:
> > > > > Hi Hal,
> > > > >
> > > > > those numbers look very promising, great work!
> > > > > :)
> > > > >
> > > > > Best,
> > > > > Ralf
> > > > >
> > > > > ----- Original Message -----
> > > > > > From: "Hal Finkel" <hfinkel at anl.gov>
> > > > > > To: "Bruno Cardoso Lopes" <bruno.cardoso at gmail.com>
> > > > > > Cc: llvm-commits at cs.uiuc.edu
> > > > > > Sent: Friday, October 28, 2011 13:50:00
> > > > > > Subject: Re: [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
> > > > > >
> > > > > > Bruno, et al.,
> > > > > >
> > > > > > I've attached a new version of the patch that contains
> > > > > > improvements (and a critical bug fix [the code output is no more
> > > > > > correct, but the pass in the older patch would crash in certain
> > > > > > cases and now does not]) compared to previous versions that I've
> > > > > > posted.
> > > > > >
> > > > > > First, these are preliminary results because I did not do the
> > > > > > things necessary to make them real (explicitly quiet the machine,
> > > > > > bind the processes to one cpu, etc.). But they should be good
> > > > > > enough for discussion.
> > > > > >
> > > > > > I'm using LLVM head r143101, with the attached patch applied, and
> > > > > > clang head r143100 on an x86_64 machine (some kind of Intel
> > > > > > Xeon). For the gcc comparison, I'm using build Ubuntu
> > > > > > 4.4.3-4ubuntu5. gcc was run -O3 without any other optimization
> > > > > > flags. opt was run -vectorize -unroll-allow-partial -O3 with no
> > > > > > other optimization flags (the patch adds the -vectorize option).
> > > > > > llc was just given -O3.
> > > > > >
> > > > > > It is not difficult to construct an example in which
> > > > > > vectorization would be useful: take a loop that does more
> > > > > > computation than loads/stores, and (partially) unroll it. Here is
> > > > > > a simple case:
> > > > > >
> > > > > > #define ITER 5000
> > > > > > #define NUM 200
> > > > > > double a[NUM][NUM];
> > > > > > double b[NUM][NUM];
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > int main()
> > > > > > {
> > > > > >   ...
> > > > > >
> > > > > >   for (int i = 0; i < ITER; ++i) {
> > > > > >     for (int x = 0; x < NUM; ++x)
> > > > > >       for (int y = 0; y < NUM; ++y) {
> > > > > >         double v = a[x][y], w = b[x][y];
> > > > > >         double z1 = v*w;
> > > > > >         double z2 = v+w;
> > > > > >         double z3 = z1*z2;
> > > > > >         double z4 = z3+v;
> > > > > >         double z5 = z2+w;
> > > > > >         double z6 = z4*z5;
> > > > > >         double z7 = z4+z5;
> > > > > >         a[x][y] = v*v-z6;
> > > > > >         b[x][y] = w-z7;
> > > > > >       }
> > > > > >   }
> > > > > >
> > > > > >   ...
> > > > > >
> > > > > >   return 0;
> > > > > > }
> > > > > >
> > > > > > Results:
> > > > > > gcc -O3: 0m1.790s
> > > > > > llvm -vectorize: 0m2.360s
> > > > > > llvm: 0m2.780s
> > > > > > gcc -fno-tree-vectorize: 0m2.810s
> > > > > > (these are the user times after I've run enough for the times to
> > > > > > settle to three decimal places)
> > > > > >
> > > > > > So the vectorization gives a ~15% improvement in the running
> > > > > > time. gcc's vectorization still does a much better job, however
> > > > > > (yielding a ~36% improvement). So there is still work to do ;)
> > > > > >
> > > > > > Additionally, I've checked the autovectorization on some classic
> > > > > > numerical benchmarks from netlib. On these benchmarks, clang/llvm
> > > > > > already do a good job compared to gcc (gcc is only about 10%
> > > > > > better, and this is true regardless of whether gcc's
> > > > > > vectorization is on or off). For these cases, autovectorization
> > > > > > provides an insignificant speedup in most cases (but does not
> > > > > > tend to make things worse, just not really any better either).
> > > > > > Because gcc's vectorization also did not really help gcc in these
> > > > > > cases, I'm not surprised.
> > > > > > A good collection of these is available here:
> > > > > > http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz
> > > > > >
> > > > > > I've yet to run the test suite using the pass to validate it.
> > > > > > That is something that I plan to do. Actually, the "Livermore
> > > > > > Loops" test in the aforementioned archive contains checksums to
> > > > > > validate the results, and it looks like 1 or 2 of the loop
> > > > > > results are wrong with vectorization turned on, so I'll have to
> > > > > > investigate that.
> > > > > >
> > > > > > -Hal
> > > > > >
> > > > > > On Wed, 2011-10-26 at 18:49 -0200, Bruno Cardoso Lopes wrote:
> > > > > > > Hi Hal,
> > > > > > >
> > > > > > > On Fri, Oct 21, 2011 at 7:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > > > > > > > I've attached an initial version of a basic-block
> > > > > > > > autovectorization pass. It works by searching a basic block
> > > > > > > > for pairable (independent) instructions and, using a
> > > > > > > > chain-seeking heuristic, selecting pairings likely to provide
> > > > > > > > an overall speedup (if such pairings can be found). The
> > > > > > > > selected pairs are then fused and, if necessary, other
> > > > > > > > instructions are moved in order to maintain data-flow
> > > > > > > > consistency. This works only within one basic block, but can
> > > > > > > > do loop vectorization in combination with (partial)
> > > > > > > > unrolling. The basic idea was inspired by the Vienna MAP
> > > > > > > > Vectorizer, which has been used to vectorize FFT kernels, but
> > > > > > > > the algorithm used here is different.
> > > > > > > >
> > > > > > > > To try it, use -bb-vectorize with opt.
> > > > > > > > There are a few options:
> > > > > > > >
> > > > > > > > -bb-vectorize-req-chain-depth: default: 3 -- The depth of the
> > > > > > > > chain of instruction pairs necessary in order to consider the
> > > > > > > > pairs that compose the chain worthy of vectorization.
> > > > > > > > -bb-vectorize-vector-bits: default: 128 -- The size of the
> > > > > > > > target vector registers.
> > > > > > > > -bb-vectorize-no-ints -- Don't consider integer instructions.
> > > > > > > > -bb-vectorize-no-floats -- Don't consider floating-point
> > > > > > > > instructions.
> > > > > > > >
> > > > > > > > The vectorizer generates a lot of
> > > > > > > > insert_element/extract_element pairs; the assumption is that
> > > > > > > > other passes will turn these into shuffles when possible (it
> > > > > > > > looks like some work is necessary here). It will also
> > > > > > > > vectorize vector instructions, and generates shuffles in this
> > > > > > > > case (again, other passes should combine these as
> > > > > > > > appropriate).
> > > > > > > >
> > > > > > > > Currently, it does not fuse load or store instructions, but
> > > > > > > > that is a feature that I'd like to add. Of course, alignment
> > > > > > > > information is an issue for load/store vectorization (or
> > > > > > > > maybe I should just fuse them anyway and let isel deal with
> > > > > > > > unaligned cases?).
> > > > > > > >
> > > > > > > > Also, support needs to be added for fusing known intrinsics
> > > > > > > > (fma, etc.), and, as has been discussed on llvmdev, we should
> > > > > > > > add some intrinsics to allow the generation of addsub-type
> > > > > > > > instructions.
> > > > > > > >
> > > > > > > > I've included a few tests, but it needs more. Please review
> > > > > > > > (I'll commit if and when everyone is happy).
> > > > > > > > Thanks in advance,
> > > > > > > > Hal
> > > > > > > >
> > > > > > > > P.S. There is another option (not so useful right now, but
> > > > > > > > could be):
> > > > > > > > -bb-vectorize-fast-dep -- Don't do a full inter-instruction
> > > > > > > > dependency analysis; instead stop looking for instruction
> > > > > > > > pairs after the first use of an instruction's value. [This
> > > > > > > > makes the pass faster, but would require a
> > > > > > > > data-dependence-based reordering pass in order to be
> > > > > > > > effective.]
> > > > > > >
> > > > > > > Cool! :)
> > > > > > > Have you run this pass with any benchmark or the llvm testsuite?
> > > > > > > Does it present any regressions?
> > > > > > > Do you have any performance results?
> > > > > > > Cheers,
> > > > > >
> > > > > > --
> > > > > > Hal Finkel
> > > > > > Postdoctoral Appointee
> > > > > > Leadership Computing Facility
> > > > > > Argonne National Laboratory
> > > > > >
> > > > > > _______________________________________________
> > > > > > llvm-commits mailing list
> > > > > > llvm-commits at cs.uiuc.edu
> > > > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> > > >
> > > > _______________________________________________
> > > > llvm-commits mailing list
> > > > llvm-commits at cs.uiuc.edu
> > > > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm_bb_vectorize-20111031-2.diff
Type: text/x-patch
Size: 77125 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111031/d020bfe3/attachment.bin>
Hal Finkel
2011-Nov-08 10:45 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
I've attached the latest version of my autovectorization patch.

Working through the test suite has proved to be a productive experience ;) -- and almost all of the bugs that it revealed have now been fixed. There are still two programs that don't compile with vectorization turned on, and I'm working on those now; but in case anyone feels like playing with vectorization, this patch will probably work for you.

The largest three performance speedups are:
SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup

The largest three performance slowdowns are:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael - 114% slowdown
MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6% slowdown
SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown

(From these, I've excluded tests that took less than 0.1 seconds to run.)

The largest three compile-time slowdowns:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael - 1276% slowdown
SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown

Not everything slows down; MultiSource/Benchmarks/Prolangs-C++/city/city, for example, compiles 10% faster with vectorization enabled. But, for the most part, things certainly take longer to compile with vectorization enabled. The average compile-time slowdown over all tests was 29%; the median was 11%. On the other hand, the average runtime speedup over all tests was 5.2%; the median was 1.3%.

Compared to previous patches, which had a minimum required chain length of 3 or 4, I've now made the default 6. While using a chain length of 4 worked well for targeted benchmarks, it caused an overall slowdown on almost all test-suite programs. Using a minimum length of 6 causes, on average, a speedup; so I think that is a better default choice.
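As a quick aside on the arithmetic behind the percentages above: with a base runtime and a vectorized runtime per test, percent change and the mean/median summaries follow directly. This sketch uses made-up runtimes (not the actual test-suite data) and one common convention (change relative to the base time); the message does not say which denominator the test-suite harness uses.

```python
from statistics import mean, median

# Hypothetical runtimes in seconds, NOT the actual test-suite data:
# test name -> (time without vectorization, time with vectorization).
runtimes = {
    "puzzle":   (1.000, 0.408),  # would read as a 59.2% speedup
    "rijndael": (1.000, 2.140),  # would read as a 114% slowdown
}

def percent_change(base, vec):
    """Positive = slowdown, negative = speedup, relative to the base time."""
    return 100.0 * (vec - base) / base

changes = {name: percent_change(b, v) for name, (b, v) in runtimes.items()}
print(changes)
print("mean:", mean(changes.values()), "median:", median(changes.values()))
```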
-Hal

On Tue, 2011-11-01 at 18:54 -0500, Hal Finkel wrote:
> On Tue, 2011-11-01 at 16:59 -0500, Hal Finkel wrote:
> > On Tue, 2011-11-01 at 19:19 +0000, Tobias Grosser wrote:
> > > On 11/01/2011 06:32 PM, Hal Finkel wrote:
> > > > Any objections to me committing this? [And some relevant docs
> > > > changes] I think that it is ready at this point.
> > >
> > > First of all, I think it is great to see work starting on an
> > > autovectorizer for LLVM. Unfortunately I did not have time to test
> > > your vectorizer pass intensively, but here are my first comments:
> > >
> > > 1. This patch breaks the --enable-shared/BUILD_SHARED_LIBS build.
> > > The following patch fixes this for cmake:
> > > 0001-Add-vectorizer-to-libraries-used-by-Transforms-IPO.patch
> >
> > Thanks!
> >
> > > Can you check the autoconf build with --enable-shared?
> >
> > I will check.
>
> This appears to work as it should.
>
> > > 2. Did you run this pass on the llvm test-suite? Does your
> > > vectorizer introduce any correctness regressions? What are the top
> > > 10 compile time increases/decreases? How about run time?
> >
> > I'll try to get this set up and post the results.
> >
> > > 3. I did not really test this intensively, but I had the feeling
> > > the compile time increase for large basic blocks is quite a lot. I
> > > still need to extract a test case. Any comments on the complexity
> > > of your vectorizer?
> >
> > This may very well be true. As is, I would not recommend activating
> > this pass by default (at -O3) because it is fairly slow and the
> > resulting performance increase, while significant in many cases, is
> > not large enough to, IMHO, justify the extra base compile-time
> > increase. Ideally, this kind of vectorization should be the
> > "vectorizer of last resort" -- the pass that tries really hard to
> > squeeze the last little bit of vectorization possible out of the
> > code. At the moment, it is all that we have, but I hope that will
> > change.
> > I've not yet done any real profiling, so I'll hold off on commenting
> > about future performance improvements.
> >
> > Base complexity is a bit difficult; there are certainly a few
> > stages, including that initial one, that are O(n^2), where n is the
> > number of instructions in the block. The "connection-finding" stage
> > should also be O(n^2) in practice, but is really iterating over
> > instruction-user pairs and so could be worse in pathological cases.
> > Note, however, that in the latter stages, that n^2 is not the number
> > of instructions in the block, but rather the number of (unordered)
> > candidate instruction pairs (which is going to be much less than the
> > n^2 from just the number of instructions in the block). It should be
> > possible to generate a compile-time scaling plot by taking a loop
> > and compiling it with partial unrolling, looking at how the compile
> > time changes with the unrolling limit; I'll try and do that.
>
> So for this test, I ran:
> time opt -S -O3 -unroll-allow-partial -vectorize -o /dev/null q.ll
> where q.ll contains the output from clang -O3 of the vbor function
> from the benchmarks I've been posting recently. The first column is
> the value of -unroll-threshold, the second column is the time with
> vectorization, and the third column is the time without vectorization
> (time in seconds for a release build).
>
> 100    0.030   0.000
> 200    0.130   0.030
> 300    0.770   0.030
> 400    1.240   0.040
> 500    1.280   0.050
> 600    9.450   0.060
> 700   29.300   0.060
>
> I am not sure why the 400 and 500 times are so close. Obviously, it
> is not linear ;) I am not sure that enumerating the possible pairings
> can be done in a sub-quadratic way, but I will do some profiling and
> see if I can make things better. To be fair, this test creates a kind
> of a worst-case scenario: an increasingly large block of
> instructions, almost all of which are potentially fusable.
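[The quadratic pairing stage described in the quoted text can be sketched abstractly. This is an illustrative reconstruction with hypothetical names, not code from the patch: when almost every instruction is pairable with every other, the enumeration visits n*(n-1)/2 candidate pairs, which matches the superlinear growth in the timing table above.]

```python
def candidate_pairs(insts, fusable):
    """Enumerate unordered pairs of potentially fusable "instructions".

    `insts` is a list standing in for the instructions of a basic block;
    `fusable` is a predicate on pairs. In the worst case (everything
    fusable with everything else) this visits n*(n-1)/2 pairs, so the
    stage is quadratic in the block size.
    """
    pairs = []
    for i in range(len(insts)):
        for j in range(i + 1, len(insts)):  # unordered: j > i only
            if fusable(insts[i], insts[j]):
                pairs.append((insts[i], insts[j]))
    return pairs

# Worst case: every pair is a candidate.
n = 40
pairs = candidate_pairs(list(range(n)), lambda a, b: True)
print(len(pairs))  # n*(n-1)/2 = 780
```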
>
> It may also be possible to design additional heuristics to help the
> situation. For example, we might introduce a target chain length such
> that, if the vectorizer finds a chain of a given length, it selects
> it, foregoing the remainder of the search for the selected starting
> instruction. This kind of thing will require further research and
> testing.
>
> -Hal
>
> > I'm writing a paper on the vectorizer, so within a few weeks there
> > will be a very good description (complete with diagrams) :)
> >
> > > I plan to look into your vectorizer during the next couple of
> > > days/weeks, but will most probably not have the time to do this
> > > tonight. Sorry. :-(
> >
> > Not a problem; it seems that I have some homework to do first ;)
> >
> > Thanks,
> > Hal
> >
> > > Cheers
> > > Tobi

--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

-------------- next part --------------
A non-text attachment was scrubbed...
Name: llvm_bb_vectorize-20111107.diff
Type: text/x-patch
Size: 79455 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20111108/3b39956c/attachment.bin>
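[The "target chain length" cutoff suggested in the message above can be sketched as a greedy early-exit over the stream of candidate chains found for one starting instruction. This is a hypothetical sketch of the heuristic as described, not code from the patch.]

```python
def pick_chain(candidate_chains, target_len):
    """Keep the longest chain seen so far, but stop searching as soon as
    one of at least `target_len` pairs is found -- foregoing the
    remainder of the (potentially expensive) search."""
    best = None
    for chain in candidate_chains:
        if best is None or len(chain) > len(best):
            best = chain
        if len(chain) >= target_len:
            break  # "good enough": accept and stop early
    return best

examined = []

def chains():
    # Stand-in for a potentially huge stream of candidate chains
    # discovered for one starting instruction.
    for c in ([1, 2], [1, 2, 3, 4, 5, 6], list(range(50))):
        examined.append(c)
        yield c

best = pick_chain(chains(), target_len=6)
print(len(best), len(examined))
```

With the cutoff, the 50-element candidate is never even generated; without it, every candidate would be scored, which is where the quadratic blow-up discussed earlier comes from.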
Tobias Grosser
2011-Nov-08 11:12 UTC
[LLVMdev] [llvm-commits] [PATCH] BasicBlock Autovectorization Pass
On 11/08/2011 11:45 AM, Hal Finkel wrote:
> I've attached the latest version of my autovectorization patch.
>
> Working through the test suite has proved to be a productive
> experience ;) -- And almost all of the bugs that it revealed have now
> been fixed. There are still two programs that don't compile with
> vectorization turned on, and I'm working on those now, but in case
> anyone feels like playing with vectorization, this patch will
> probably work for you.

Hey Hal,

this is great news, especially as the numbers seem to show that vectorization has a significant performance impact. What exactly did you compare: 'clang -O3' against 'clang -O3 -mllvm -vectorize'?

> The largest three performance speedups are:
> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
>
> The largest three performance slowdowns are:
> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> 114% slowdown
> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia -
> 66.6% slowdown
> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown

Interesting. Do you understand what causes these slowdowns? Can your heuristic be improved?

> Largest three compile-time slowdowns:
> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> 1276% slowdown
> SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
> MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown

Yes, that is a lot. Do you understand whether this time is invested well (does it give significant speedups)? If I understood correctly, it seems your vectorizer has quadratic complexity, which may cause large slowdowns. Do you think it may be useful/possible to make it linear by introducing a constant upper bound somewhere, e.g. limiting it to 10/20/100 steps?
Maybe we are lucky and most of the vectorization opportunities are close by (in some sense), such that we get most of the speedup by looking at a subset of the problem.

> Not everything slows down;
> MultiSource/Benchmarks/Prolangs-C++/city/city, for example, compiles
> 10% faster with vectorization enabled. But, for the most part, things
> certainly take longer to compile with vectorization enabled. The
> average slowdown over all tests was 29%, the median was 11%. On the
> other hand, the average speedup over all tests was 5.2%, the median
> was 1.3%.

Nice. I think this is a great start.

> Compared to previous patches, which had a minimum required chain
> length of 3 or 4, I've now made the default 6. While using a chain
> length of 4 worked well for targeted benchmarks, it caused an overall
> slowdown on almost all test-suite programs. Using a minimum length of
> 6 causes, on average, a speedup; so I think that is a better default
> choice.

I am also trying to understand whether it is possible to use your vectorizer for Polly. My idea is to do some clever loop unrolling. Starting from this loop:

for (int i = 0; i < 4; i++) {
  A[i] += 1;
  A[i] = B[i] + 3;
  C[i] = A[i];
}

the classical unroller would create this code:

A[0] += 1; A[0] = B[0] + 3; C[0] = A[0];
A[1] += 1; A[1] = B[1] + 3; C[1] = A[1];
A[2] += 1; A[2] = B[2] + 3; C[2] = A[2];
A[3] += 1; A[3] = B[3] + 3; C[3] = A[3];

However, in case I can prove this loop is parallel, I want to create this code:

A[0] += 1;
A[1] += 1;
A[2] += 1;
A[3] += 1;
A[0] = B[0] + 3;
A[1] = B[1] + 3;
A[2] = B[2] + 3;
A[3] = B[3] + 3;
C[0] = A[0];
C[1] = A[1];
C[2] = A[2];
C[3] = A[3];

I assume this will allow the vectorization of test cases that failed because of possible aliasing. However, I am more interested in whether the execution order change could also improve the vectorization outcome or reduce the compile-time overhead of your vectorizer.

Thanks for working on the vectorization.

Cheers
Tobi
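[Tobias's premise — that when the iterations are independent, the interleaved (statement-grouped) order computes exactly the same values as the classical unroll — can be checked with a quick sketch. Python lists stand in for the C arrays; this only demonstrates the specific example above, not a general proof.]

```python
def classical(A, B, C):
    # One full iteration body at a time, as the classical unroller emits.
    for i in range(4):
        A[i] += 1
        A[i] = B[i] + 3
        C[i] = A[i]

def interleaved(A, B, C):
    # All copies of each statement grouped together, as proposed for
    # provably parallel loops; the isomorphic statements become adjacent,
    # which is exactly what a basic-block vectorizer wants to pair up.
    for i in range(4):
        A[i] += 1
    for i in range(4):
        A[i] = B[i] + 3
    for i in range(4):
        C[i] = A[i]

A1, B1, C1 = [0] * 4, [10, 20, 30, 40], [0] * 4
A2, B2, C2 = [0] * 4, [10, 20, 30, 40], [0] * 4
classical(A1, B1, C1)
interleaved(A2, B2, C2)
assert (A1, C1) == (A2, C2)  # same results when iterations are independent
```

For a loop with cross-iteration dependences the two orders would generally differ, which is why the transformation is gated on proving the loop parallel.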