thr3ads.net - R devel - [Rd] svd() of a 30000 by 30000 matrix segfaults: 32-bit length overflow in LAPACK? [Nov 2023]

If this information is useful, please help other people find it:
Share via:
Ivan Krylov
2023-Nov-03 14:54 UTC
[Rd] svd() of a 30000 by 30000 matrix segfaults: 32-bit length overflow in LAPACK?

Dear Dr. Robert M Flight, dear R-developers,

By an accident, I've noticed this problem reported on Mastodon [1].

On a computer with ?32G of RAM, running the following code (which may
take 5 CPU-hours and allocate 27G of RAM!) results in a segfault:

n_val <- 30000
tmp_matrix <- matrix(rnorm((n_val ^ 2) / 2), nrow = n_val, ncol = n_val)
svd_res <- svd(tmp_matrix)

Tim Taylor says [2] that this doesn't fail for a 20000 by 20000 matrix,
which suggests a 32-bit integer overflow: 30000*30000*8 is 7.2e9, which
is more than 4.3e9, which is more than 2^32, while 20000*20000*8 is
3.2e9, slightly below the threshold.

Here's the backtrace from the crash:

#0  0x00007ffff462ac71 in dlasd3 (nl=468, nr=467, sqre=1, k=921, d=..., q=...,
ldq=921, dsigma=..., u=...,
    ldu=30000, u2=..., ldu2=936, vt=..., ldvt=30000, vt2=..., ldvt2=937,
idxc=..., ctot=..., z=..., info=0)
    at ../../../../src/modules/lapack/dlapack.f:82312
#1  0x00007ffff464578b in dlasd1 (nl=468, nr=467, sqre=1, d=...,
alpha=-0.58048560936028271,
    beta=-0.40955467846435345, u=..., ldu=30000, vt=..., ldvt=30000, idxq=...,
iwork=..., work=..., info=0)
    at ../../../../src/modules/lapack/dlapack.f:81271
#2  0x00007ffff4681bf2 in dlasd0 (n=29964, sqre=0, d=..., e=..., u=...,
ldu=30000, vt=..., ldvt=30000, smlsiz=25,
    iwork=..., work=..., info=0) at
../../../../src/modules/lapack/dlapack.f:80956
#3  0x00007ffff468fd5e in dbdsdc (uplo=..., compq=..., n=30000, d=..., e=...,
u=..., ldu=30000, vt=...,
    ldvt=30000, q=..., iq=..., work=..., iwork=..., info=0, _uplo=1, _compq=1)
    at ../../../../src/modules/lapack/dlapack.f:1525
#4  0x00007ffff46c1f81 in dgesdd (jobz=..., m=30000, n=30000, a=..., lda=30000,
s=..., u=..., ldu=30000, vt=...,
    ldvt=30000, work=..., lwork=2010000, iwork=..., info=0, _jobz=1)
    at ../../../../src/modules/lapack/dlapack.f:23312
#5  0x00007ffff4c5550f in La_svd (jobu=<optimized out>, x=<optimized
out>, s=0x555557b88670, u=0x7ffbc379f010,
    vt=0x7ffa1652a010) at ../../../../src/include/Rinlinedfuns.h:120

The place of the crash and the state of the registers suggests some
kind of buffer overrun:

(gdb) frame 0
#0  0x00007ffff462ac71 in dlasd3 (nl=468, nr=467, sqre=1, k=921, d=..., q=...,
ldq=921, dsigma=..., u=...,
    ldu=30000, u2=..., ldu2=936, vt=..., ldvt=30000, vt2=..., ldvt2=937,
idxc=..., ctot=..., z=..., info=0)
    at ../../../../src/modules/lapack/dlapack.f:82312
82312               Q( J, I ) = U( JC, I ) / TEMP
(gdb) disas
...
   0x00007ffff462ac60 <+1824>:  movslq -0x4(%r10,%rdx,4),%rsi
   0x00007ffff462ac65 <+1829>:  add    %r12,%rsi
   0x00007ffff462ac68 <+1832>:  movsd  (%rcx,%rsi,8),%xmm1
   0x00007ffff462ac6d <+1837>:  divsd  %xmm0,%xmm1
=> 0x00007ffff462ac71 <+1841>:  movsd  %xmm1,(%rbx,%rdx,8)
...
(gdb) info reg
rbx            0x7ffff45c38d8      140737293072600
rcx            0x7ffbc379f040      140719293067328
rdx            0xe5                229
rsi            0x517d2c            5340460

0x7ffff45c38d8 looks precariously close to the end of the valid
userspace address range for processes typically running on my computer.
I was also able to reproduce this with OpenBLAS (0.3.5, LAPACK 3.8.0).

We seem to be overflowing the WORK array. The DGESDD documentation says
that for JOBZ = 'S', WORK must be of length >= 4*mn*mn + 7*mn, where
mn = min(M, N). For M = N = 30000, this comes out to 3600210000, while
the actual length of the WORK array is only 2010000 (as returned from a
previous call to DGESDD with LWORK = -1).

I think this is caused by an overflow in the value of the BDSPAC
variable. At the point where MINWRK is assigned (MAX(MINWRK,MAXWRK) to
become LWORK later), the following can be observed:

22721                           MINWRK = 3*N + MAX( M, BDSPAC )
0x7ffff7e9a2e0 <dgesdd_+7632>   cmp    %ecx,%esi
(gdb) p M
$48 = 30000
(gdb) p BDSPAC
$49 = <optimized out>
(gdb) p $ecx
$50 = -1594847296
(gdb) p $esi
$51 = 30000

The debugger insists that the BDSPAC variable is optimised out for most
of the duration of its existence, but it can be proven that it
overflows:

22613                  BDSPAC = 3*N*N + 4*N
(gdb) ptype BDSPAC
type = integer(kind=4)
(gdb) p BDSPAC
$60 = <optimized out>
(gdb) p (int32_t)(3*M*M + 4*M)
$61 = -1594847296

Is there anything R can do at this point, given the required length of
LWORK being impossible to represent using a signed 32-bit integer?

-- 
Best regards,
Ivan

[1] https://mastodon.social/@rmflight/111341279750995911

[2] https://mastodon.social/@_TimTaylor at fosstodon.org/111341425112070441
R devel - Nov 2023 - svd() of a 30000 by 30000 matrix segfaults: 32-bit length overflow in LAPACK?

[Rd] svd() of a 30000 by 30000 matrix segfaults: 32-bit length overflow in LAPACK?