thr3ads.net - R help - [R] Efficient distance calculation on big matrix [Jun 2012]

If this information is useful, please help other people find it:
Share via:

Boel Brynedal

2012-Jun-16 17:04 UTC

[R] Efficient distance calculation on big matrix

Hi All,

I'm working on analyzing a large data set, lets asume that
dim(Data)=c(1000,8700). I want to calculate the canberra distance
between the columns of this matrix, and using a toy example ('test' is
a matrix filled with random numbers 0-1):
> system.time(d<-as.matrix(dist(t(test), method = "canberra",
diag = FALSE, upper = FALSE, p = 2)))    user   system  elapsed
1417.713    3.219 1421.144

Is there any way to calculate the distance which would take less time?
I am already parallelizing this to a great deal (the real data has
many more rows), but I cant go below 1000 rows in order to get
reliable results. And I will calculate the distances repeatedly (about
100 times if 1000 rows) while removing small parts of the matrix.

The system.time results also confuse me a bit, since 99% of the time
is not system time but user time. What does that mean?

I'm on a Linux server and should have about 48GB RAM here.

Any suggestions appreciated,

Bo

> sessionInfo()R version 2.12.1 (2010-12-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
 [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
 [5] LC_MONETARY=C                  LC_MESSAGES=en_US.iso885915
 [7] LC_PAPER=en_US.iso885915       LC_NAME=C
 [9] LC_ADDRESS=C                   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hash_2.1.0 mmap_0.6-9

loaded via a namespace (and not attached):
[1] tools_2.12.1

$ uname -a
Linux compute-13-2.local 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36
EST 2009 x86_64 x86_64 x86_64 GNU/Linux

Stefan Evert

2012-Jun-17 17:43 UTC

head link

[R] Efficient distance calculation on big matrix

> I'm working on analyzing a large data set, lets asume that
> dim(Data)=c(1000,8700). I want to calculate the canberra distance
> between the columns of this matrix, and using a toy example ('test'
is
> a matrix filled with random numbers 0-1):
> 
>> system.time(d<-as.matrix(dist(t(test), method =
"canberra", diag = FALSE, upper = FALSE, p = 2)))
>    user   system  elapsed
> 1417.713    3.219 1421.144
> The system.time results also confuse me a bit, since 99% of the time
> is not system time but user time. What does that mean?
User time is the time that R spends working on your problem; system time refers
to tasks done by the operating system, e.g. disk access, managing locks and,
most importantly, swapping when you run out of RAM.  With multi-threading,
system time can be much larger than the time that has actually elapsed.
> 
> Is there any way to calculate the distance which would take less time?
Well, one thing you can do is to get a faster computer. :-)  The command above
takes only 670 seconds on my MacBook Pro (without multi-threading).

Calculating a distance matrix is an expensive computation.  In your example, R
has to carry out (8700 * 8700 * 1000) / 2 = 37.8 billion floating point
divisions.  With approx. 27 clock cycles per division (according to tables
I've found on the Web), this takes at least 340 seconds even on a 3GHz CPU
(and ignoring memory access, addition/subtraction, loops, etc.).

You can shave off some of the time if you implement the distance calculation in
C, inline the code to avoid callback functions in loops, operate on columns of
the matrix directly (which should be more cache-friendly than rows) and
don't check for NA's, NaN's and other degenerate cases.

I've done just that in my experimental R package "wordspace",
which isn't on CRAN yet:
> library(wordspace)
> A <- matrix(runif(8.7e6), 1000, 8700)
> system.time(d1 <- as.matrix(dist(t(A), method="canberra")))   user  system elapsed 
669.207   2.724 669.305 
> system.time(d2 <- dist.matrix(A, method="canberra",
byrow=FALSE))   user  system elapsed 
250.534   0.784 250.301 
> all(d1 == d2)[1] TRUE

If you aren't tied to Canberra distance, you can use a less expensive metric
such as the Manhattan distance for an additional, more substantial speed boost:
> system.time(d3 <- dist.matrix(A, method="manhattan",
byrow=FALSE))   user  system elapsed 
 42.488   0.999  43.569 

This is still single-threaded, so you can run multiple of these calculations in
parallel depending on how many cores your server has.

Hope this helps,
Stefan


PS: In case you'd like to give it a try yourself and aren't daunted by a
complete lack of documentation:

	svn checkout svn://scm.r-forge.r-project.org/svnroot/wordspace/pkg


[ evert at linglit.tu-darmstadt.de | http://purl.org/stefan.evert ]

Apparently Analagous Threads

Search for more maybe matching threads

R help - Jun 2012 - Efficient distance calculation on big matrix

[R] Efficient distance calculation on big matrix

[R] Efficient distance calculation on big matrix

Apparently Analagous Threads