Henrik Bengtsson
2010-Aug-23 21:33 UTC
[Rd] Internal state indicating if a data object has NAs/no NAs/not sure (Was: Re: Speeding up matrix multiplies)
Hi, I'm just following your messages the overhead that the code for dealing with possible NA/NaN values brings. When I was setting up part of the matrixStats package, I've also though about this. I was thinking of having an additional logical argument 'hasNA'/'has.na' where you as a user can specify whether there is NA/NaN:s or not, allowing the code to avoid it or not. Of course, if hasNA=FALSE but there really are NAs you'll get very funny results. Even better would be if there would be an internal state attached to the basic data types in R indicating one of three states: (i) Has NAs: we know there exist NA/NaN values, (ii) Has No NAs: we know they do *not* exist, (iii) Not Sure: we are uncertain, i.e. it may be either (i) or (ii). This would allow one to optimize the internal code further, but if ignored everything is backward compatible. The default state should be (iii), which would be backward compatible with all existing code. For instance y <- sum(x, na.rm=FALSE) would return NA immediately if the state of 'x' is "Has NAs". Then different computational algorithms are free to set the state, if they know it. For instance, matrixStats::anyMissing(x) would flag the "x" object (i) if returning TRUE and (ii) if returning FALSE. Next time it is called, the response could be instant/O(1). Similarly, y <- rowSums(x, na.rm=TRUE/FALSE) can set the state on y, and so on. Just some thoughts /Henrik On Mon, Aug 23, 2010 at 1:50 PM, Radford Neal <radford at cs.toronto.edu> wrote:> I've looked at the code for matrix multiplies in R, and found that it > can speeded up quite a bit, for multiplies that are really vector dot > products, and for other multiplies in which the result matrix is small. > > Here's my test program: > > u <- seq(0,1,length=1000) > v <- seq(0,2,length=1000) > > A2 <- matrix(2.1,2,1000) > A5 <- matrix(2.1,5,1000) > B3 <- matrix(3.2,1000,3) > > A4 <- matrix(2.1,4,1000) > B4 <- matrix(3.2,1000,4) > > print(system.time ({for (i in 1:100000) w <- u %*% v})) > print(system.time ({for (i in 1:100000) w <- A5 %*% v})) > print(system.time ({for (i in 1:100000) R <- A2 %*% B3})) > print(system.time ({for (i in 1:100000) R <- A5 %*% B3})) > print(system.time ({for (i in 1:100000) R <- A4 %*% B4})) > > Here are the five system.time results above using R 2.11.1: > > ? user ?system elapsed > ?1.680 ? 0.000 ? 1.683 > ? user ?system elapsed > ?4.350 ? 0.000 ? 4.349 > ? user ?system elapsed > ?4.820 ? 0.000 ? 4.818 > ? user ?system elapsed > ?7.370 ? 0.020 ? 7.383 > ? user ?system elapsed > ?7.990 ? 0.000 ? 7.991 > > and here are the results with my modified version of R 2.11.1: > > ? user ?system elapsed > ?0.270 ? 0.000 ? 0.271 > ? user ?system elapsed > ?1.290 ? 0.000 ? 1.287 > ? user ?system elapsed > ?1.080 ? 0.000 ? 1.086 > ? user ?system elapsed > ?3.640 ? 0.000 ? 3.642 > ? user ?system elapsed > ?7.950 ? 0.000 ? 7.953 > > The main issue here is again ISNAN. ?The R code doesn't trust the > Fortran routine it calls to handle NA and NaN correctly, so it checks > whether the arguments have any NAs or NaNs, and if so does the matrix > multiply itself, with simple nested loops. ?For multiplies in which > the number of elements in the result matrix is large, this check will > take a negligible amount of time. ?But otherwise it can drastically > slow things down. ?This is true in particular for dot products, of a > 1xN times an Nx1 matrix, giving a 1x1 result. > > I changed the code to handle dot product specially, with no need for > ISNAN checks, and otherwise to go straight to the nested loop code > (bypassing Fortran and the ISNAN checks) if the result matrix has no > more than 15 elements. ?One can see that the times are much improved > above, except of course for the multiply that produces a 4x4 result, > which takes the same amount of time as before, since nothing is > different for that. ?From the results, it seems that a threshold a bit > above 15 might be even better, though this will of course depend on > the architecture, compiler, etc. > > Simon Urbaneck points out in reply to my message about speeding up > "sum" and "prod" that the effects of changes like this depend on the > machine architecure, compiler, etc. ?This is of course true. ?The > magnitude of variation in his results (not the absolute times, but the > relative times of different versions) is rather larger than one would > expect, though. ?Perhaps his time measurements were affected by random > factors like system load, or perhaps they are due to non-random but > essentially arbitrary factors like the exact alignment of code > generated by the C compiler, which could change with even unrelated > modifications to the program. > > The changes that I made seem generally desirable for any architecture, > or will at least make things no worse (the code expansion is trivial), > apart from the possibility of such arbitrary factors. ?I'm not doing > anything like manually unrolling loops, or replacing array indexing > with pointer arithmetic, which might be seen as undesirable attempts > at out-smarting the optimizer. > > The times above are for an Intel Linux machine with gcc version below: > > Using built-in specs. > Target: i486-linux-gnu > Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.2 --program-suffix=-4.2 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-targets=all --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu > Thread model: posix > gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4) > > I didn't change any R configuration options regarding compilation. > > Below is my modified version of the matprod function in src/main/array.c. > > ? Radford Neal > > ------------------------------------------------------------------------ > > static void matprod(double *x, int nrx, int ncx, > ? ? ? ? ? ? ? ? ? ?double *y, int nry, int ncy, double *z) > { > ? ?char *transa = "N", *transb = "N"; > ? ?int i, ?j, k; > ? ?double one = 1.0, zero = 0.0; > ? ?LDOUBLE sum; > ? ?Rboolean do_it_here; > > ? ?/* NOTE: ncx must be equal to nry. */ > > ? ?if (nrx==1 && ncy==1) { > > ? ? ? ?/* Do dot product quickly. */ > > ? ? ? ?sum = 0.0; > ? ? ? ?for (i = 0; i<ncx; i++) > ? ? ? ? ? ?sum += x[i] * y[i]; > ? ? ? ?z[0] = sum; > > ? ?} else if (nrx > 0 && ncx > 0 && nry > 0 && ncy > 0) { > > ? ? ? ?/* Faster to just do it here if the result is small, especially > ? ? ? ? ? considering the NA/NaN check below. */ > > ? ? ? ?do_it_here = nrx*ncy <= 15; > > ? ? ? ?/* Don't trust the BLAS to handle NA/NaNs correctly: PR#4582 > ? ? ? ? * The test is only O(n) here > ? ? ? ? */ > > ? ? ? ?if (!do_it_here) > ? ? ? ? ? ?for (i = 0; i < nrx*ncx; i++) > ? ? ? ? ? ? ? ?if (ISNAN(x[i])) { do_it_here = TRUE; break; } > ? ? ? ?if (!do_it_here) > ? ? ? ? ? ?for (i = 0; i < nry*ncy; i++) > ? ? ? ? ? ? ? ?if (ISNAN(y[i])) { do_it_here = TRUE; break; } > > ? ? ? ?if (do_it_here) { > ? ? ? ? ? ?for (i = 0; i < nrx; i++) > ? ? ? ? ? ? ? ?for (k = 0; k < ncy; k++) { > ? ? ? ? ? ? ? ? ? ?sum = 0.0; > ? ? ? ? ? ? ? ? ? ?for (j = 0; j < ncx; j++) > ? ? ? ? ? ? ? ? ? ? ? ?sum += x[i + j * nrx] * y[j + k * nry]; > ? ? ? ? ? ? ? ? ? ?z[i + k * nrx] = sum; > ? ? ? ? ? ? ? ?} > ? ? ? ?} else > ? ? ? ? ? ?F77_CALL(dgemm)(transa, transb, &nrx, &ncy, &ncx, &one, > ? ? ? ? ? ? ? ? ? ? ? ? ? ?x, &nrx, y, &nry, &zero, z, &nrx); > > ? ?} else /* zero-extent operations should return zeroes */ > > ? ? ? ?for(i = 0; i < nrx*ncy; i++) z[i] = 0; > } > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >
Simon Urbanek
2010-Aug-24 01:53 UTC
[Rd] Internal state indicating if a data object has NAs/no NAs/not sure (Was: Re: Speeding up matrix multiplies)
Henrik, On Aug 23, 2010, at 5:33 PM, Henrik Bengtsson wrote:> Hi, I'm just following your messages the overhead that the code for > dealing with possible NA/NaN values brings. When I was setting up > part of the matrixStats package, I've also though about this. I was > thinking of having an additional logical argument 'hasNA'/'has.na' > where you as a user can specify whether there is NA/NaN:s or not, > allowing the code to avoid it or not. Of course, if hasNA=FALSE but > there really are NAs you'll get very funny results. > > Even better would be if there would be an internal state attached to > the basic data types in R indicating one of three states: > > (i) Has NAs: we know there exist NA/NaN values, > (ii) Has No NAs: we know they do *not* exist, > (iii) Not Sure: we are uncertain, i.e. it may be either (i) or (ii). > > This would allow one to optimize the internal code further, but if > ignored everything is backward compatible. The default state should > be (iii), which would be backward compatible with all existing code. > For instance y <- sum(x, na.rm=FALSE) would return NA immediately if > the state of 'x' is "Has NAs". >Those thoughts are noble [I was thinking along the same lines - simply using a bit in the SEXP], but there is the issue of the API and packages. Since they get unrestricted access to the contents of the vector (e.g. REAL(x)) there is no way of tracking the presence of NAs as they can be added at any time without R knowing. The API has no provision for write-barrier on real/integer/logical vectors so the only "safe" way would be to flag all vectors as "don't know" which sort of defeats the purpose (note that packages can use allocVector and others so simply restricting .Call to mark arguments is not sufficient). The best I can think of would be for some R-internal operations to flag the results accordingly but a) you'll incur the penalty of flagging at the exit of all such R functions since you have to assume that the input will be "don't know" and b) determining from the most common "don't know" input the actual state could be costly. Ideally, someone should try it and see if it brings any benefits, but as usual this is highly dependent on the tasks at hand. As I was already pointing out the issue is in general not clear-cut because good compilers will perform the necessary optimizations and bad ones will penalize for optimizing, but in addition to that all the tests are purely artificial. You would be surprised how much more time you can save in a lot of R code (internal or packages) in real-life scenarios -- unfortunately they rarely match the simple test cases. So my point is that we should be looking at real benchmarks and collect those to have a more realistic picture. (FWIW no one mentioned parallelization where you can gain much more ... Luke Tierney did some great work in that direction that could be extended and gives you speed up of orders of magnitude, not just a few percent ...) Cheers, Simon> Then different computational algorithms are free to set the state, if > they know it. For instance, matrixStats::anyMissing(x) would flag the > "x" object (i) if returning TRUE and (ii) if returning FALSE. Next > time it is called, the response could be instant/O(1). Similarly, y > <- rowSums(x, na.rm=TRUE/FALSE) can set the state on y, and so on. > > Just some thoughts > > /Henrik > > > On Mon, Aug 23, 2010 at 1:50 PM, Radford Neal <radford at cs.toronto.edu> wrote: >> I've looked at the code for matrix multiplies in R, and found that it >> can speeded up quite a bit, for multiplies that are really vector dot >> products, and for other multiplies in which the result matrix is small. >> >> Here's my test program: >> >> u <- seq(0,1,length=1000) >> v <- seq(0,2,length=1000) >> >> A2 <- matrix(2.1,2,1000) >> A5 <- matrix(2.1,5,1000) >> B3 <- matrix(3.2,1000,3) >> >> A4 <- matrix(2.1,4,1000) >> B4 <- matrix(3.2,1000,4) >> >> print(system.time ({for (i in 1:100000) w <- u %*% v})) >> print(system.time ({for (i in 1:100000) w <- A5 %*% v})) >> print(system.time ({for (i in 1:100000) R <- A2 %*% B3})) >> print(system.time ({for (i in 1:100000) R <- A5 %*% B3})) >> print(system.time ({for (i in 1:100000) R <- A4 %*% B4})) >> >> Here are the five system.time results above using R 2.11.1: >> >> user system elapsed >> 1.680 0.000 1.683 >> user system elapsed >> 4.350 0.000 4.349 >> user system elapsed >> 4.820 0.000 4.818 >> user system elapsed >> 7.370 0.020 7.383 >> user system elapsed >> 7.990 0.000 7.991 >> >> and here are the results with my modified version of R 2.11.1: >> >> user system elapsed >> 0.270 0.000 0.271 >> user system elapsed >> 1.290 0.000 1.287 >> user system elapsed >> 1.080 0.000 1.086 >> user system elapsed >> 3.640 0.000 3.642 >> user system elapsed >> 7.950 0.000 7.953 >> >> The main issue here is again ISNAN. The R code doesn't trust the >> Fortran routine it calls to handle NA and NaN correctly, so it checks >> whether the arguments have any NAs or NaNs, and if so does the matrix >> multiply itself, with simple nested loops. For multiplies in which >> the number of elements in the result matrix is large, this check will >> take a negligible amount of time. But otherwise it can drastically >> slow things down. This is true in particular for dot products, of a >> 1xN times an Nx1 matrix, giving a 1x1 result. >> >> I changed the code to handle dot product specially, with no need for >> ISNAN checks, and otherwise to go straight to the nested loop code >> (bypassing Fortran and the ISNAN checks) if the result matrix has no >> more than 15 elements. One can see that the times are much improved >> above, except of course for the multiply that produces a 4x4 result, >> which takes the same amount of time as before, since nothing is >> different for that. From the results, it seems that a threshold a bit >> above 15 might be even better, though this will of course depend on >> the architecture, compiler, etc. >> >> Simon Urbaneck points out in reply to my message about speeding up >> "sum" and "prod" that the effects of changes like this depend on the >> machine architecure, compiler, etc. This is of course true. The >> magnitude of variation in his results (not the absolute times, but the >> relative times of different versions) is rather larger than one would >> expect, though. Perhaps his time measurements were affected by random >> factors like system load, or perhaps they are due to non-random but >> essentially arbitrary factors like the exact alignment of code >> generated by the C compiler, which could change with even unrelated >> modifications to the program. >> >> The changes that I made seem generally desirable for any architecture, >> or will at least make things no worse (the code expansion is trivial), >> apart from the possibility of such arbitrary factors. I'm not doing >> anything like manually unrolling loops, or replacing array indexing >> with pointer arithmetic, which might be seen as undesirable attempts >> at out-smarting the optimizer. >> >> The times above are for an Intel Linux machine with gcc version below: >> >> Using built-in specs. >> Target: i486-linux-gnu >> Configured with: ../src/configure -v --enable-languages=c,c++,fortran,objc,obj-c++,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.2 --program-suffix=-4.2 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-targets=all --enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu --target=i486-linux-gnu >> Thread model: posix >> gcc version 4.2.4 (Ubuntu 4.2.4-1ubuntu4) >> >> I didn't change any R configuration options regarding compilation. >> >> Below is my modified version of the matprod function in src/main/array.c. >> >> Radford Neal >> >> ------------------------------------------------------------------------ >> >> static void matprod(double *x, int nrx, int ncx, >> double *y, int nry, int ncy, double *z) >> { >> char *transa = "N", *transb = "N"; >> int i, j, k; >> double one = 1.0, zero = 0.0; >> LDOUBLE sum; >> Rboolean do_it_here; >> >> /* NOTE: ncx must be equal to nry. */ >> >> if (nrx==1 && ncy==1) { >> >> /* Do dot product quickly. */ >> >> sum = 0.0; >> for (i = 0; i<ncx; i++) >> sum += x[i] * y[i]; >> z[0] = sum; >> >> } else if (nrx > 0 && ncx > 0 && nry > 0 && ncy > 0) { >> >> /* Faster to just do it here if the result is small, especially >> considering the NA/NaN check below. */ >> >> do_it_here = nrx*ncy <= 15; >> >> /* Don't trust the BLAS to handle NA/NaNs correctly: PR#4582 >> * The test is only O(n) here >> */ >> >> if (!do_it_here) >> for (i = 0; i < nrx*ncx; i++) >> if (ISNAN(x[i])) { do_it_here = TRUE; break; } >> if (!do_it_here) >> for (i = 0; i < nry*ncy; i++) >> if (ISNAN(y[i])) { do_it_here = TRUE; break; } >> >> if (do_it_here) { >> for (i = 0; i < nrx; i++) >> for (k = 0; k < ncy; k++) { >> sum = 0.0; >> for (j = 0; j < ncx; j++) >> sum += x[i + j * nrx] * y[j + k * nry]; >> z[i + k * nrx] = sum; >> } >> } else >> F77_CALL(dgemm)(transa, transb, &nrx, &ncy, &ncx, &one, >> x, &nrx, y, &nry, &zero, z, &nrx); >> >> } else /* zero-extent operations should return zeroes */ >> >> for(i = 0; i < nrx*ncy; i++) z[i] = 0; >> } >> >> ______________________________________________ >> R-devel at r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > >