thr3ads.net - R help - [R] Calculate Closest 5 Cases? [Feb 2004]

If this information is useful, please help other people find it:
Share via:

dsheuman@rogers.com

2004-Feb-13 16:34 UTC

[R] Calculate Closest 5 Cases?

I've only begun investigating R as a substitute for SPSS.

I have a need to identify for each CASE the closest (or most similar) 5 
other CASES (not including itself as it is automatically the closest).  I 
have a fairly large matrix (50000 cases by 50 vars).  In SPSS, I can use
Correlate > Distances to generate a matrix of similarity, but only on a small
sample.  The entire matrix can not be processed at once due to memory
limitations.

The data are all percents, so they are easy comparable.  

Is there any way to do this in R?

Below is a small sample of the data (from SPSS) and the desired output.

Thanks,

Danny




*Sample Data.
DATA LIST LIST /id(F8) var1(F8.2) var2(F8.2) var3(F8.2) var4(F8.2) var5
(F8.2) var6(F8.2) var7(F8.2) var8(F8.2) var9(F8.2) var10(F8.2) var11(F8.2).
BEGIN DATA.
10170069	3.51	4.02	6.53	11.05	6.53	8.04	13.57	20.10	11.05	8.55
	7.04
10190229	1.89	5.66	4.61	7.62	8.45	13.21	9.50	20.82	16.07	9.36
	3.77
10540023	3.40	5.08	3.39	4.52	10.18	14.71	13.56	16.38	9.60	7.89
	11.85
10650413	6.64	6.64	3.73	4.70	3.78	13.23	19.82	15.98	12.26	8.48
	3.78
10662074	5.11	5.81	4.37	5.11	6.55	14.60	18.97	11.68	10.25	8.75
	8.79
10770041	6.43	4.17	6.34	4.26	6.34	4.26	19.11	19.20	14.95	12.77
	4.35
11010422	3.14	4.71	6.81	7.85	5.75	6.81	15.18	15.18	13.61	11.00
	9.44
11060762	7.03	5.03	6.95	5.99	5.92	12.94	15.01	12.06	11.98	8.06
	9.02
11070078	4.61	9.22	4.61	7.94	6.27	12.75	14.02	20.49	7.75	7.75
	4.61
11180646	4.48	5.35	6.29	5.42	4.55	11.71	20.74	15.32	14.45	8.09
	3.61
11460001	5.71	7.34	6.48	5.68	4.07	10.55	13.83	18.69	12.15	9.76
	4.87
11650133	6.00	3.72	6.72	6.00	7.50	17.94	13.44	16.37	13.51	5.15
	3.65
11650275	4.02	8.06	6.06	8.10	5.06	8.10	17.16	14.12	12.14	14.12
	4.02
11780034	4.25	4.28	5.30	5.33	6.38	14.88	15.96	18.08	14.85	7.48
	3.20
11790016	4.40	4.40	5.54	4.40	4.40	10.93	17.67	19.72	13.20	12.13
	4.33
12660338	6.60	7.54	5.66	8.49	10.38	11.31	16.06	12.26	8.49	8.49
	4.73
12660644	5.51	3.14	3.95	7.09	7.11	14.98	15.72	18.90	9.44	5.50
	8.65
12661667	5.44	4.50	5.44	4.50	5.44	12.69	13.63	11.81	9.07	13.68
	13.79
END DATA.

*Output should be:.
*.
*	ID1	CLOSEID1	CLOSEID2	CLOSEID3	CLOSEID4	CLOSEID5.
*	ID2	CLOSEID1	CLOSEID2	CLOSEID3	CLOSEID4	CLOSEID5.
*	ID3	CLOSEID1	CLOSEID2	CLOSEID3	CLOSEID4	CLOSEID5.
*	ID4	CLOSEID1	CLOSEID2	CLOSEID3	CLOSEID4	CLOSEID5.
*	ID5	CLOSEID1	CLOSEID2	CLOSEID3	CLOSEID4	CLOSEID5.

Tom Blackwell

2004-Feb-13 18:40 UTC

head link

[R] Calculate Closest 5 Cases?

Danny  -

The flip answer is, it depends on the size of your computer.
One can readily calculate the number of entries in the pairwise
distance matrix that you would like to calculate, and ask whether
it will fit inside the physical memory installed in your computer.
It is  50,000 x 50,000 x 8 bytes per floating point number, for
a total of 20,000,000,000 bytes or 20 gigabytes.  The critical
information that's still missing is that R needs enough space
for 10 or 20 copies of the largest object in its workspace, in
order to turn around and assign that object to a new name, or
do any summaries on it, etc.  So,  . . .  if you have a computer
with between 200 and 400 gigabytes of random access memory, yes,
you can calculate and summarize the matrix of pairwise distances.
But that requires more memory slots than any ordinary motherboard
provides.  (It would be a mother of a motherboard !)

So, failing that, you could always use Adrian Raftery and Chris
Fraley's 'mclust' package to cluster your data into 50 or more
clusters of very similar cases (instructions for running mclust()
on large data sets are found in the manual which comes with the
package), then calculate all pairwise distances only between the
cases within each cluster.  That's a bit of work to code up.
You wouldn't want to work interactively for each of 50 clusters.
But it certainly can be done in R.  Depends how much effort you
want to put into it.

-  tom blackwell  -  u mighigan medical school  -  ann arbor  -

On Fri, 13 Feb 2004 dsheuman at rogers.com wrote:
> I've only begun investigating R as a substitute for SPSS.
>
> I have a need to identify for each CASE the closest (or most similar) 5
> other CASES (not including itself as it is automatically the closest).  I
> have a fairly large matrix (50000 cases by 50 vars).  In SPSS, I can use
Correlate > Distances to generate a matrix of similarity, but only on a small
sample.  The entire matrix can not be processed at once due to memory
limitations.
>
> The data are all percents, so they are easy comparable.
>
> Is there any way to do this in R?
>
> Below is a small sample of the data (from SPSS) and the desired output.
>
> Thanks,
>
> Danny
>
>
>
>
> *Sample Data.
> DATA LIST LIST /id(F8) var1(F8.2) var2(F8.2) var3(F8.2) var4(F8.2) var5
> (F8.2) var6(F8.2) var7(F8.2) var8(F8.2) var9(F8.2) var10(F8.2) var11(F8.2).
> BEGIN DATA.
> 10170069	3.51	4.02	6.53	11.05	6.53	8.04	13.57	20.10	11.05	8.55
> 	7.04
> 10190229	1.89	5.66	4.61	7.62	8.45	13.21	9.50	20.82	16.07	9.36
> 	3.77
> 10540023	3.40	5.08	3.39	4.52	10.18	14.71	13.56	16.38	9.60	7.89
> 	11.85
> 10650413	6.64	6.64	3.73	4.70	3.78	13.23	19.82	15.98	12.26	8.48
> 	3.78
> 10662074	5.11	5.81	4.37	5.11	6.55	14.60	18.97	11.68	10.25	8.75
> 	8.79
> 10770041	6.43	4.17	6.34	4.26	6.34	4.26	19.11	19.20	14.95	12.77
> 	4.35
> 11010422	3.14	4.71	6.81	7.85	5.75	6.81	15.18	15.18	13.61	11.00
> 	9.44
> 11060762	7.03	5.03	6.95	5.99	5.92	12.94	15.01	12.06	11.98	8.06
> 	9.02
> 11070078	4.61	9.22	4.61	7.94	6.27	12.75	14.02	20.49	7.75	7.75
> 	4.61
> 11180646	4.48	5.35	6.29	5.42	4.55	11.71	20.74	15.32	14.45	8.09
> 	3.61
> 11460001	5.71	7.34	6.48	5.68	4.07	10.55	13.83	18.69	12.15	9.76
> 	4.87
> 11650133	6.00	3.72	6.72	6.00	7.50	17.94	13.44	16.37	13.51	5.15
> 	3.65
> 11650275	4.02	8.06	6.06	8.10	5.06	8.10	17.16	14.12	12.14	14.12
> 	4.02
> 11780034	4.25	4.28	5.30	5.33	6.38	14.88	15.96	18.08	14.85	7.48
> 	3.20
> 11790016	4.40	4.40	5.54	4.40	4.40	10.93	17.67	19.72	13.20	12.13
> 	4.33
> 12660338	6.60	7.54	5.66	8.49	10.38	11.31	16.06	12.26	8.49	8.49
> 	4.73
> 12660644	5.51	3.14	3.95	7.09	7.11	14.98	15.72	18.90	9.44	5.50
> 	8.65
> 12661667	5.44	4.50	5.44	4.50	5.44	12.69	13.63	11.81	9.07	13.68
> 	13.79
> END DATA.
>
> *Output should be:.
> *.
> *	ID1	CLOSEID1	CLOSEID2	CLOSEID3	CLOSEID4	CLOSEID5.
> *	ID2	CLOSEID1	CLOSEID2	CLOSEID3	CLOSEID4	CLOSEID5.
> *	ID3	CLOSEID1	CLOSEID2	CLOSEID3	CLOSEID4	CLOSEID5.
> *	ID4	CLOSEID1	CLOSEID2	CLOSEID3	CLOSEID4	CLOSEID5.
> *	ID5	CLOSEID1	CLOSEID2	CLOSEID3	CLOSEID4	CLOSEID5.
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html
>

Peter Dalgaard

2004-Feb-13 20:37 UTC

head link

[R] Calculate Closest 5 Cases?

<dsheuman at rogers.com> writes:
> I've only begun investigating R as a substitute for SPSS.
> 
> I have a need to identify for each CASE the closest (or most similar) 5 
> other CASES (not including itself as it is automatically the closest).  I 
> have a fairly large matrix (50000 cases by 50 vars).  In SPSS, I can use
Correlate > Distances to generate a matrix of similarity, but only on a small
sample.  The entire matrix can not be processed at once due to memory
limitations.
> 
> The data are all percents, so they are easy comparable.  
> 
> Is there any way to do this in R?
> 
> Below is a small sample of the data (from SPSS) and the desired output.
> 
> Thanks,
> 
> Danny
This seems to be close:

d <- read.table("tempfile") # needed to edit to get 12 items per
line.
close6 <- function(r)
  d$V1[order(apply(d[-1],1,
                   function(r2)dist(rbind(r,r2))))][1:6]
t(apply(d[-1],1,close6))

       [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
1  10170069 11010422 11460001 11070078 12660644 11790016
2  10190229 11780034 11460001 10170069 11650133 11070078
3  10540023 12660644 10662074 11060762 12661667 11070078
4  10650413 11180646 11780034 11790016 10662074 11460001
5  10662074 11060762 10650413 12660338 11180646 10540023
6  10770041 11790016 11650275 11010422 11460001 11180646
7  11010422 10170069 11650275 11460001 11060762 11790016
8  11060762 10662074 12660338 12661667 11010422 11460001
9  11070078 11460001 12660644 10170069 11780034 12660338
10 11180646 10650413 11780034 11790016 11460001 10662074
11 11460001 11790016 11070078 11780034 10650413 10170069
12 11650133 11780034 12660644 11060762 10650413 11460001
13 11650275 11010422 11460001 11790016 11180646 10770041
14 11780034 11650133 11180646 10650413 11790016 11460001
15 11790016 11460001 11180646 11780034 10650413 10770041
16 12660338 11060762 10662074 11650275 11070078 10650413
17 12660644 10540023 11650133 11780034 11070078 11060762
18 12661667 11060762 10662074 10540023 11010422 12660644


Notice that I use a function to get the closest *6* ID's because the
method will include the row itself. If multiple rows have distance
zero, this might be a problem since you're not guaranteed to get the
ID of the "self" row sorted first.

Here's another try:

close5 <- function(i)
  d$V1[-i][order(apply(d[-i,-1],1,function(r)dist(rbind(d[i,-1],r))))[1:5]]

do.call("rbind",lapply(1,nrow(d),close5))

However, for some reason this is much slower. Getting rid of the more
obvious inefficiencies (some of which would really kill you on a large
data set since they involve copying the entire data frame!) doesn't
really help:

dd <- d[-1]
close5 <- function(i) {r1 <- dd[i,];
d$V1[-i][order(apply(dd,1,function(r)dist(rbind(r1,r)))[-i])[1:5]]}
>system.time(do.call("rbind",lapply(1:nrow(d),close5)))[1] 1.67 0.00 1.67 0.00 0.00

whereas
> system.time(t(apply(d[-1],1,close6)))[1] 0.23 0.00 0.23 0.00 0.00

Anyone have a better idea, or just an explanation of the slowness? 

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907

Jim_Garrett@bd.com

2004-Feb-16 19:22 UTC

head link

[R] Calculate Closest 5 Cases?

Danny,

Here's another approach that doesn't use sorting.  Instead, after
calculating distances it considers a threshold on distance and counts how
many cases are within the threshold.  Then a search over thresholds is
conducted to find a threshold yielding the desired number of cases.  Then
case numbers satisfying the found threshold are returned.

Why go to this bother to avoid sorting?  Since the computation required to
sort N distances grows more quickly than the computation required to make a
pass through a vector of distances and count values less than a threshold,
for some N this approach ought to be more efficient than one based on
sorting.  Well, maybe, if the number of iterations grows slowly enough with
N.  Call it a conjecture.  On the other hand, the N for which this occurs
depends on the efficiency of the sort algorithm (pretty darned efficient
and using compiled code, I imagine) and the efficiency of the search
algorithm.  Here I use uniroot, which isn't really designed with step
functions in mind, so perhaps there are possible improvements.  So I
couldn't say whether this will work faster than the other suggestions
you've gotten with your data, or even for data one is likely to ever see.

If there are ties for the 6th-nearest case (counting the point itself),
this routine should either include or exclude all the ties, whichever comes
closest to yielding 5 points.

Here's the code:

## Args:
## CaseNo:     Row number of case for which to find nearest neighbors.
## x:          A numeric matrix.
## k:          The desired number of neighbors, not counting the point of
interest.
## TempIndex:  Index of row numbers.  If you use this function many many
times you
##             might gain an iota of efficiency by creating this vector
once and
##             passing it in as an argument with each call.  Probably not
worth the
##             trouble, I couldn't help myself....
## verbose:    Tells you how many iterations uniroot used, if you're
curious.
## Value:
## A vector of (hopefully) k row numbers indicating the k nearest
neighbors.
nearestKNeighbors <- function(CaseNo, x, k, TempIndex = 1:nrow(x), verbose
= F) {
  ## make sure x is a matrix
  if(!is.matrix(x))
    stop("x must be a matrix.")

  TempVect <- x[CaseNo, ]
  SquaredDistances <- apply(x, 1, function(s) sum((s - TempVect)^2) )

  tempFun <- function(h) sum(SquaredDistances < h^2) - (k + 1)
  TempSolution <- uniroot(tempFun, interval = range(SquaredDistances))
  if(verbose)
    cat("Required", TempSolution$iter, "iterations.\n")

  sort(setdiff(TempIndex[SquaredDistances < TempSolution$root^2], CaseNo))
}

You would apply this to each row of your dataset using some variety of
"apply" or a loop.  I don't know if memory usage would differ.  In
the
event of ties, you may not have 5 neighbors, so I might put the results in
a list, which can accommodate different lengths in each component (indeed,
it can accommodate completely different data structures):

## data in matrix temp.matrix
ListOutput <-
  lapply(as.list(1:nrow(temp.matrix)),
         function(s) nearestKNeighbors(s, temp.matrix, 5))

or

ListOutput <-
  vector(nrow(temp.matrix), mode = "list")
for(i in 1:nrow(temp.matrix))
  ListOutput[[i]] <- nearestKNeighbors(i, temp.matrix, 5)

and then you can manipulate the list however you like.  For instance, to
see if all components have length 5,

all(unlist(lapply(ListOutput, length)) == 5)

or, if there are such components, find out which one(s):

(1:length(ListOutput))[unlist(lapply(ListOutput, length)) != 5]

Welcome to R!

Best,

-Jim Garrett

***
> I've only begun investigating R as a substitute for SPSS.
> I have a need to identify for each CASE the closest (or most similar) 5
> other CASES (not including itself as it is automatically the closest).  I
> have a fairly large matrix (50000 cases by 50 vars).  In SPSS, I can use
> Correlate > Distances to generate a matrix of similarity, but only on a
small> sample.  The entire matrix can not be processed at once due to memorylimitations.
> The data are all percents, so they are easy comparable.
> Is there any way to do this in R?
> Below is a small sample of the data (from SPSS) and the desired output.
> Thanks,
> Danny

**********************************************************************
This message is intended only for the designated recipient(s...{{dropped}}

R help - Feb 2004 - Calculate Closest 5 Cases?

[R] Calculate Closest 5 Cases?

[R] Calculate Closest 5 Cases?

[R] Calculate Closest 5 Cases?

[R] Calculate Closest 5 Cases?