I am using R to multiply some large (30k x 30k double) matrices on a
64-core machine (Xeon Phi). I added some timers to src/main/array.c
to see where the time is going. All of the time is being spent in the
matprod function; most of that time is spent in dgemm, but 15 seconds
is spent in matprod in some code that is checking if there are NaNs.

> system.time (C <- B %*% A)
nancheck: wall time 15.240282s
   dgemm: wall time 43.111064s
 matprod: wall time 58.351572s
    user   system  elapsed
2710.154   20.999   58.398

The NaN checking code is not being vectorized because of the early
exit when a NaN is detected:

    /* Don't trust the BLAS to handle NA/NaNs correctly: PR#4582
     * The test is only O(n) here.
     */
    for (R_xlen_t i = 0; i < NRX*ncx; i++)
        if (ISNAN(x[i])) {have_na = TRUE; break;}
    if (!have_na)
        for (R_xlen_t i = 0; i < NRY*ncy; i++)
            if (ISNAN(y[i])) {have_na = TRUE; break;}

I tried deleting the 'break'. By inspecting the asm code, I verified
that the loop was not being vectorized before, but now is vectorized.
Total time goes down:

> system.time (C <- B %*% A)
nancheck: wall time  1.898667s
   dgemm: wall time 43.913621s
 matprod: wall time 45.812468s
    user   system  elapsed
2727.877   20.723   45.859

The break accelerates the case when there is a NaN, at the expense of
the much more common case when there isn't one. If a NaN is detected,
matprod doesn't call dgemm and calls its own matrix multiply instead,
which makes the NaN check time insignificant, so I doubt the early
exit provides any benefit.

I was a little surprised that the O(n^2) NaN check is costly compared
to the O(n^3) dgemm that follows (for n x n matrices). I think the
reason is that the NaN check is single-threaded and not vectorized,
while my machine can do 2048 floating point ops/cycle (64 cores x
dual issue x 8-way SIMD x 2 flops per fused multiply-add = 2048), so
the check's constant factor is significant even for large matrices.

Would you consider deleting the breaks? I can submit a patch if that
will help. Thanks.

Robert
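[Editor's note: to illustrate the technique being proposed, here is a
minimal, self-contained sketch of a branch-free NaN scan of the kind
a compiler can auto-vectorize. It is not the actual R patch: the
function name any_nan is invented, and plain isnan/ptrdiff_t stand in
for R's ISNAN/R_xlen_t.]

    #include <math.h>      /* isnan */
    #include <stddef.h>    /* ptrdiff_t */

    /* Branch-free scan: no early exit, so the loop body has no
     * data-dependent control flow and the compiler can vectorize it.
     * It always reads all n elements, even if a NaN appears early. */
    static int any_nan(const double *x, ptrdiff_t n)
    {
        int have_na = 0;
        for (ptrdiff_t i = 0; i < n; i++)
            have_na |= isnan(x[i]);
        return have_na;
    }

Whether a given compiler actually vectorizes this still depends on
optimization flags and the target's SIMD width, so inspecting the
generated assembly, as done above, remains the reliable check.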
>>>>> Cohn, Robert S <robert.s.cohn at intel.com>
>>>>>     on Sat, 7 Jan 2017 16:41:42 +0000 writes:

    > I am using R to multiply some large (30k x 30k double)
    > matrices on a 64 core machine (xeon phi). [...]
    > Would you consider deleting the breaks? I can submit a patch
    > if that will help. Thanks.
    > Robert

Thank you Robert for bringing the issue up ("again", possibly).

Within R core, some have seen somewhat similar timings on some
platforms (gcc), but much less dramatic differences, e.g. on macOS
with clang.

As seen in the source code you cite above, the current implementation
was triggered by a nasty BLAS bug, which also showed up only on some
platforms, possibly depending on runtime libraries in addition to the
compilers used.

Do you have R code (including set.seed(.) if relevant) showing how to
generate the large square matrices you mentioned in the beginning, so
we get to some reproducible benchmarks?

With best regards,
Martin Maechler
> Do you have R code (including set.seed(.) if relevant) showing how
> to generate the large square matrices you mentioned in the
> beginning, so we get to some reproducible benchmarks?

Hi Martin,

Here is the program I used. I only generate 2 random numbers and let
matrix() recycle them to fill the matrices, to make the benchmark
set-up run faster. Let me know if there is something I can do to
help: alternate benchmarks, tests, or experiments with compilers
other than icc. MKL LAPACK behavior is undefined for NaNs, so I left
the check in and just made it more efficient on a CPU with SIMD.
Thanks for looking at this.

    set.seed(1)
    m <- 30000
    n <- 30000
    A <- matrix(runif(2), nrow=m, ncol=n)  # 2 values recycled over m*n
    B <- matrix(runif(2), nrow=m, ncol=n)
    print(typeof(A[1,2]))
    print(A[1,2])

    # Matrix multiply, repeated to expose warm-up effects
    system.time(C <- B %*% A)
    system.time(C <- B %*% A)
    system.time(C <- B %*% A)
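[Editor's note: for completeness, the timing instrumentation
described at the top of the thread might look roughly like the sketch
below. This is not the actual modification to src/main/array.c: the
helper name elapsed_seconds and the placement of the timers are
invented for illustration, and it assumes POSIX clock_gettime is
available.]

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical helper: monotonic wall-clock time in seconds. */
    static double elapsed_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    /* Usage pattern, bracketing the phases of interest in matprod:
     *
     *   double t0 = elapsed_seconds();
     *   ... NaN check ...
     *   double t1 = elapsed_seconds();
     *   ... dgemm call ...
     *   double t2 = elapsed_seconds();
     *   fprintf(stderr, "nancheck: wall time %fs\n", t1 - t0);
     *   fprintf(stderr, "   dgemm: wall time %fs\n", t2 - t1);
     *   fprintf(stderr, " matprod: wall time %fs\n", t2 - t0);
     */

Wall-clock rather than CPU time is the right measure here: dgemm runs
on all 64 cores, which is why the reports above show ~2710s of user
time against ~58s elapsed.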