Dear all,
Given vectors "x" and "y", I would like to compute the
proportion of
entries that are equal, that is, mean(x == y).
Now, suppose I have the following matrix:
n <- 1e2
m <- 1e4
X <- matrix(sample(0:2, m*n, replace = TRUE), ncol = m)
I am interested in calculating the above proportion for every pairwise
combination of rows. I came up with the following:
myd <- function(X, p = NROW(X)){
D <- matrix(NA, p, p)
for(i in 1:p) for(j in 1:p) if(i > j) D[i, j] <- mean(X[i, ] == X[j,])
D
}
system.time(d <- myd(X))
However, in my application n and m are much more larger than in this
example and the computational time might be an issue. I would very much
appreciate any suggestions on how to speed the "myd" function.
Note: I have done some experiments with the dist() function and despite
being much, much, much faster than "myd", none of the default
distances
fits my needs. I would also appreciate any suggestions on how to include
"my own" distance function in dist().
Thank you very much for your time.
Best regards,
Jorge Velez.-
[[alternative HTML version deleted]]
Jorge, I have not used it myself, but you might find the dist() function in the proxy package to be useful. http://cran.r-project.org/web/packages/proxy/index.html Jean On Mon, Jan 19, 2015 at 7:38 AM, Jorge I Velez <jorgeivanvelez at gmail.com> wrote:> Dear all, > > Given vectors "x" and "y", I would like to compute the proportion of > entries that are equal, that is, mean(x == y). > > Now, suppose I have the following matrix: > > n <- 1e2 > m <- 1e4 > X <- matrix(sample(0:2, m*n, replace = TRUE), ncol = m) > > I am interested in calculating the above proportion for every pairwise > combination of rows. I came up with the following: > > myd <- function(X, p = NROW(X)){ > D <- matrix(NA, p, p) > for(i in 1:p) for(j in 1:p) if(i > j) D[i, j] <- mean(X[i, ] == X[j,]) > D > } > > system.time(d <- myd(X)) > > However, in my application n and m are much more larger than in this > example and the computational time might be an issue. I would very much > appreciate any suggestions on how to speed the "myd" function. > > Note: I have done some experiments with the dist() function and despite > being much, much, much faster than "myd", none of the default distances > fits my needs. I would also appreciate any suggestions on how to include > "my own" distance function in dist(). > > Thank you very much for your time. > > Best regards, > Jorge Velez.- > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >[[alternative HTML version deleted]]
... (just a comment) and since this appears to be O(m^2 x n), where m,n are the number of rows and columns (correction requested if I got this wrong), it would appear that some basically C level functionality -- perhaps the one Jean suggested? -- would be required for even moderately "large" matrices. Cheers, Bert Bert Gunter Genentech Nonclinical Biostatistics (650) 467-7374 "Data is not information. Information is not knowledge. And knowledge is certainly not wisdom." Clifford Stoll On Tue, Jan 20, 2015 at 10:10 AM, Adams, Jean <jvadams at usgs.gov> wrote:> Jorge, > > I have not used it myself, but you might find the dist() function in the > proxy package to be useful. > > http://cran.r-project.org/web/packages/proxy/index.html > > Jean > > On Mon, Jan 19, 2015 at 7:38 AM, Jorge I Velez <jorgeivanvelez at gmail.com> > wrote: > >> Dear all, >> >> Given vectors "x" and "y", I would like to compute the proportion of >> entries that are equal, that is, mean(x == y). >> >> Now, suppose I have the following matrix: >> >> n <- 1e2 >> m <- 1e4 >> X <- matrix(sample(0:2, m*n, replace = TRUE), ncol = m) >> >> I am interested in calculating the above proportion for every pairwise >> combination of rows. I came up with the following: >> >> myd <- function(X, p = NROW(X)){ >> D <- matrix(NA, p, p) >> for(i in 1:p) for(j in 1:p) if(i > j) D[i, j] <- mean(X[i, ] == X[j,]) >> D >> } >> >> system.time(d <- myd(X)) >> >> However, in my application n and m are much more larger than in this >> example and the computational time might be an issue. I would very much >> appreciate any suggestions on how to speed the "myd" function. >> >> Note: I have done some experiments with the dist() function and despite >> being much, much, much faster than "myd", none of the default distances >> fits my needs. I would also appreciate any suggestions on how to include >> "my own" distance function in dist(). >> >> Thank you very much for your time. >> >> Best regards, >> Jorge Velez.- >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
On Mon, Jan 19, 2015 at 5:38 AM, Jorge I Velez <jorgeivanvelez at gmail.com> wrote:> Dear all, > > Given vectors "x" and "y", I would like to compute the proportion of > entries that are equal, that is, mean(x == y). > > Now, suppose I have the following matrix: > > n <- 1e2 > m <- 1e4 > X <- matrix(sample(0:2, m*n, replace = TRUE), ncol = m) > > I am interested in calculating the above proportion for every pairwise > combination of rows. I came up with the following: > > myd <- function(X, p = NROW(X)){ > D <- matrix(NA, p, p) > for(i in 1:p) for(j in 1:p) if(i > j) D[i, j] <- mean(X[i, ] == X[j,]) > D > } > > system.time(d <- myd(X))An obvious speed up is to only subset X[i,] onces and not j times. Also, mean() is a generic function meaning it dispatches on class in each call, which has some overhead; it's a bit faster to use sum(). Also, beware of the classical matrix(NA, ...) mistake, which does *not* allocate a numeric matrix and will just results in an extra copy and coercion, cf. http://www.jottr.org/2014/06/matrixNA-wrong-way.html. myd2 <- function(X, p = NROW(X)) { D <- matrix(NA_real_, nrow=p, ncol=p) for (i in 2:p) { Xi <- X[i, ] for (j in 1:(i-1)) D[i, j] <- sum(Xi == X[j,]) } D / ncol(X) } That's > 1.5 times faster. But as others already mentioned, this is something you'll do best in C/C++, because you can avoid lots of overhead from subsetting/copying and garbage collection. /Henrik> > However, in my application n and m are much more larger than in this > example and the computational time might be an issue. I would very much > appreciate any suggestions on how to speed the "myd" function. > > Note: I have done some experiments with the dist() function and despite > being much, much, much faster than "myd", none of the default distances > fits my needs. I would also appreciate any suggestions on how to include > "my own" distance function in dist(). > > Thank you very much for your time. > > Best regards, > Jorge Velez.- > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.