thr3ads.net - R help - [R] Proportion of equal entries in dist()? [Jan 2015]


Jorge I Velez

2015-Jan-19 13:38 UTC

[R] Proportion of equal entries in dist()?

Dear all,

Given vectors "x" and "y", I would like to compute the proportion of
entries that are equal, that is, mean(x == y).

Now, suppose I have the following matrix:

n <- 1e2
m <- 1e4
X <- matrix(sample(0:2, m*n, replace = TRUE), ncol = m)

I am interested in calculating the above proportion for every pairwise
combination of rows.  I came up with the following:

myd <- function(X, p = NROW(X)) {
  D <- matrix(NA, p, p)
  for (i in 1:p) for (j in 1:p) if (i > j) D[i, j] <- mean(X[i, ] == X[j, ])
  D
}

system.time(d <- myd(X))

However, in my application n and m are much larger than in this
example, and the computational time might be an issue.  I would very much
appreciate any suggestions on how to speed up the "myd" function.

Note:  I have done some experiments with the dist() function and despite
being much, much, much faster than "myd", none of the default distances
fits my needs.  I would also appreciate any suggestions on how to include
"my own" distance function in dist().

Thank you very much for your time.

Best regards,
Jorge Velez.-


Adams, Jean

2015-Jan-20 18:10 UTC

Jorge,

I have not used it myself, but you might find the dist() function in the
proxy package to be useful.

http://cran.r-project.org/web/packages/proxy/index.html

Jean
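
A minimal sketch of what this suggestion might look like, assuming the CRAN package proxy is installed. proxy::dist() accepts a user-supplied function as its method argument. The helper name prop_equal and the small matrix are made up for illustration. Since an R function is still called for every pair of rows, this may not be much faster than an explicit loop on large inputs.

```r
# Sketch only: runs when the 'proxy' package is available, skipped otherwise.
if (requireNamespace("proxy", quietly = TRUE)) {
  set.seed(1)
  X <- matrix(sample(0:2, 5 * 20, replace = TRUE), nrow = 5)

  # Hypothetical helper: proportion of equal entries between two rows.
  prop_equal <- function(x, y) mean(x == y)

  # Rows are compared pairwise (by_rows = TRUE is the default).
  d <- proxy::dist(X, method = prop_equal)

  # Full square matrix; only the off-diagonal entries are meaningful here.
  D <- as.matrix(d)
}
```

The quantity is really a similarity rather than a distance, so proxy::simil() may be the more natural entry point. The same custom function can be passed to it.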


Bert Gunter

2015-Jan-20 18:57 UTC

...

(just a comment)

and since this appears to be O(m^2 x n), where m, n are the number of
rows and columns (correction requested if I got this wrong), it would
appear that some essentially C-level functionality (perhaps the one
Jean suggested?) would be required for even moderately "large"
matrices.

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll
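
One way to get compiled-level speed without leaving R is to push the work into matrix products. This is a sketch, not something proposed in the thread. It assumes X holds a small number of distinct values and no NAs, and the function name prop_equal_tcp is made up. For each value k, tcrossprod() of the 0/1 indicator matrix counts the columns in which two rows both equal k, and summing over k gives the number of matching entries.

```r
# Sketch: pairwise proportion of equal entries via indicator cross-products.
prop_equal_tcp <- function(X, levels = sort(unique(as.vector(X)))) {
  counts <- matrix(0, nrow(X), nrow(X))
  for (k in levels) {
    Z <- (X == k) * 1                 # numeric 0/1 indicator of "equals k"
    counts <- counts + tcrossprod(Z)  # Z %*% t(Z): joint occurrences of k
  }
  counts / ncol(X)                    # matches -> proportion
}

set.seed(1)
X <- matrix(sample(0:2, 6 * 50, replace = TRUE), nrow = 6)
P <- prop_equal_tcp(X)
```

The work is still proportional to (rows^2 x columns) per distinct value, but it happens inside the matrix-multiply routine rather than in an R-level double loop. The result is the full symmetric matrix with ones on the diagonal.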





Henrik Bengtsson

2015-Jan-21 04:18 UTC

An obvious speed-up is to subset X[i, ] only once rather than once per j.
Also, mean() is a generic function, meaning it dispatches on class in
each call, which has some overhead; it's a bit faster to use sum().
Also, beware of the classic matrix(NA, ...) mistake, which does
*not* allocate a numeric matrix and will just result in an extra copy
and coercion, cf.
http://www.jottr.org/2014/06/matrixNA-wrong-way.html.
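
The matrix(NA, ...) point can be checked directly. This is a small illustrative sketch (the variable names are arbitrary): a bare NA is logical, so the matrix starts out logical and is coerced to double on the first numeric assignment.

```r
A <- matrix(NA, 2, 2)        # NA is a logical constant
B <- matrix(NA_real_, 2, 2)  # NA_real_ is a double constant

type_A_before <- typeof(A)   # "logical"
type_B        <- typeof(B)   # "double"

A[1, 1] <- 0.5               # forces coercion of the whole matrix
type_A_after  <- typeof(A)   # "double"
```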

myd2 <- function(X, p = NROW(X)) {
  D <- matrix(NA_real_, nrow=p, ncol=p)
  for (i in 2:p) {
    Xi <- X[i, ]
    for (j in 1:(i-1)) D[i, j] <- sum(Xi == X[j,])
  }
  D / ncol(X)
}

That's > 1.5 times faster.  But as others already mentioned, this is
something you'll do best in C/C++, because you can avoid lots of
overhead from subsetting/copying and garbage collection.

/Henrik
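
Short of writing C/C++, the built-in stats::dist() (which is implemented in C) can be made to produce this quantity with a recoding trick. This is a sketch that is not from the thread. It assumes X has few distinct values and no NAs, and the name prop_equal_dist is made up. If each value is expanded into 0/1 indicator columns, every mismatching entry contributes exactly 2 to the Manhattan distance between two rows, so the proportion of equal entries is 1 - d / (2 * ncol(X)).

```r
# Sketch: proportion of equal entries from a Manhattan distance on indicators.
prop_equal_dist <- function(X, levels = sort(unique(as.vector(X)))) {
  # One block of 0/1 columns per distinct value, bound side by side.
  Z <- do.call(cbind, lapply(levels, function(k) (X == k) * 1))
  # A mismatch flips two indicators, a match flips none.
  1 - as.matrix(dist(Z, method = "manhattan")) / (2 * ncol(X))
}

set.seed(1)
X <- matrix(sample(0:2, 6 * 50, replace = TRUE), nrow = 6)
P2 <- prop_equal_dist(X)
```

The trade-off is memory: Z has as many columns as ncol(X) times the number of distinct values, which may matter for very wide matrices.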
