Hi Stefan,
Thank you very much for pointing me to the wordspace package. It does the job a
bit faster than my C code but is 100 times more convenient.
By the way, since the tcrossprod function in the Matrix package is so fast, the
Euclidean distance can be computed very fast:
euc_dist <- function(m) {mtm <- Matrix::tcrossprod(m); sq <- rowSums(m*m); sqrt(outer(sq,sq,"+") - 2*mtm)}
It takes less than 50 seconds for my (dense) matrix of 5,054 rows and 12,803
columns, while dist.matrix with method="euclidean" takes almost 10
minutes (which is still orders of magnitude faster than dist).
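In case it is easier to read, here is the same one-liner written out with
comments (m is assumed to be an ordinary dense numeric matrix with one point
per row):

    euc_dist <- function(m) {
        mtm <- Matrix::tcrossprod(m)   # all pairwise dot products of the rows of m
        sq  <- rowSums(m * m)          # squared Euclidean norms of the rows
        ## uses the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)
        sqrt(outer(sq, sq, "+") - 2 * mtm)
    }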
From: Stefan Evert <stefanML at collocations.de>
To: Moshe Olshansky <m_olshansky at yahoo.com>
Cc: R-devel Mailing List <r-devel at r-project.org>
Sent: Sunday, 18 June 2017, 2:33
Subject: Re: [Rd] dist function in R is very slow
> On 17 Jun 2017, at 08:47, Moshe Olshansky via R-devel <r-devel at
r-project.org> wrote:
>
> I am visualising high-dimensional genomic data, and for this purpose I need
to compute pairwise distances between many points in a high-dimensional space
(say I have a matrix of 5,000 rows and 20,000 columns, so the result is a
5,000 x 5,000 matrix, or its upper triangle). Computing such a thing in R takes
many hours (I am doing this on a Linux server with more than 100 GB of RAM, so
this is not the problem). When I write the matrix to disk, read it and compute
the distances in C, write them to disk and read them back into R, it takes 10-15
minutes (and I did not spend much time on optimising my C code). The question is
why the R function is so slow. I understand that it calls C (or C++) to compute
the distance. My suspicion is that the transposed matrix is passed to C, so
each time a distance between two columns of a matrix is computed, and since C
stores matrices by rows this is very inefficient and causes many cache misses (my
first C implementation was like this and I had to stop the run after an hour
when it failed to complete).
There are two main reasons for the relatively low speed of the built-in dist()
function: (i) it operates on row vectors, which leads to many cache misses
because matrices are stored by column in R (as you guessed); (ii) the function
takes care to handle missing values correctly, which adds a relatively expensive
test and conditional branch to each iteration of the inner loop.
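As a quick illustration of point (ii), dist() quietly copes with missing values,
and that per-element NA check is part of what costs time (a toy example, not
from my benchmarks):

    dist(rbind(c(1, NA, 3), c(4, 5, 6)))
    ## still returns a finite distance: the NA coordinate is dropped and the
    ## sum of squares is rescaled, so every element needs an NA test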
A faster implementation, which omits the NA test and can compute distances
between column vectors, is available as dist.matrix() in the "wordspace"
package. However, it cannot be used with matrices that might contain NAs (and
doesn't warn about such arguments).
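For reference, the call is simply (this assumes the "wordspace" package is
installed and m is a dense matrix without NAs):

    library(wordspace)
    d <- dist.matrix(m, method = "euclidean")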
If you want the best possible speed, use cosine similarity (or equivalently,
angular distance). The underlying cross product is very efficient with a
suitable BLAS implementation.
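For example, along these lines (just a sketch, not the wordspace implementation;
m is assumed to hold one vector per row):

    cos_sim <- function(m) {
        m <- m / sqrt(rowSums(m * m))   # scale each row to unit Euclidean length
        tcrossprod(m)                   # dot products of unit vectors = cosines
    }
    ## angular distance, if needed (clamp to [-1, 1] to guard against rounding):
    ## acos(pmin(pmax(cos_sim(m), -1), 1))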
Best,
Stefan
Hi Stefan,
You are right about the possible loss of accuracy when computing the Euclidean
distance the way I did. In some cases you can probably even get a negative value
under the square root (so I am setting all negative numbers to 0; see the sketch
below). To do what I did, one must know that it is all right in their particular
case.

I tried wordspace.openmp with 8 threads and it reduces the time to just over
2.5 minutes. This is more than enough for me.

I am not sure whether you have any chance to beat the speed of (t)crossprod,
since they may be using a (complexity-wise) faster algorithm for matrix
multiplication (maybe with FFT - I am not sure).
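Concretely, the clamping I mentioned is something like this (assuming m is an
ordinary dense matrix, as in my case; pmax() pushes the small negative values,
which arise purely from rounding, up to zero before taking the square root):

    euc_dist <- function(m) {
        mtm <- Matrix::tcrossprod(m)
        sq  <- rowSums(m * m)
        sqrt(pmax(outer(sq, sq, "+") - 2 * mtm, 0))
    }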
Once again, thank you very much for your comments and help.
From: Stefan Evert <stefanML at collocations.de>
To: Moshe Olshansky <m_olshansky at yahoo.com>
Cc: R-devel Mailing List <r-devel at r-project.org>
Sent: Monday, 19 June 2017, 2:23
Subject: Re: [Rd] dist function in R is very slow
> By the way, since the tcrossprod function in the Matrix package is so fast,
the Euclidean distance can be computed very fast:
Indeed.
> euc_dist <- function(m) {mtm <- Matrix::tcrossprod(m); sq <- rowSums(m*m); sqrt(outer(sq,sq,"+") - 2*mtm)}
There are two reasons why I didn't use this optimization in
"wordspace":
1) It can be inaccurate for small distances between vectors of large Euclidean
length, because of loss of significance in the subtraction step. This is not
just a theoretical concern: I've seen data sets where this became a real problem
(a tiny example follows below).
2) It incurs substantial memory overhead for a large distance matrix. Your code
allocates at least five matrices of this size: outer(...), mtm, 2 * mtm,
outer(...) - 2*mtm, and the final result obtained by taking the square root.
[Actually, there is additional overhead for m*m (an even larger matrix) when
computing the Euclidean norms, but this could be avoided with sq <- rowNorms(m,
method="euclidean").]
I am usually more concerned about RAM than raw processing speed, so the package
was designed to keep memory overhead as low as possible and allow users to work
with realistic data sets on ordinary laptop computers.
> It takes less than 50 seconds for my (dense) matrix of 5,054 rows and
12,803 columns, while dist.matrix with method="euclidean" takes almost
10 minutes (which is still orders of magnitude faster than dist).
It's a little disappointing that dist.matrix() is still relatively slow
despite all simplifications and better cache locality (the function
automatically transposes the input matrix and computes distances by columns
rather than rows). I'm a little surprised about your timing, though.
Testing with a random 5000 x 20000 matrix, my MacBook computes the full
Euclidean distance matrix in about 5 minutes.

If your machine (and version of R) supports OpenMP, you can improve performance
by allowing multithreading with wordspace.openmp(threads=<n>). In my test
case, I get a 2.2x speed-up with 4 threads (2m 15s instead of 5m).
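In code, that is simply (4 threads here just as an example; this assumes an
OpenMP-enabled build of R and the package):

    library(wordspace)
    wordspace.openmp(threads = 4)              # enable multi-threading
    d <- dist.matrix(m, method = "euclidean")  # now computed with 4 threads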
Best wishes,
Stefan