In the first line, use the dist function, found in library mva, to get the distance between each pair of rows. From this, calculate an incidence matrix in which element i,j is TRUE if the covariates in row i of dat equal those in row j (and FALSE otherwise).

In the second line, for each row, find the indices of the matching rows and take the minimum of those as the key.

  incid <- as.matrix(dist(dat[,-1],method="max"))==0
  keys <- unlist(lapply(apply(incid,1,which),min))

--- Göran Broström <gb@stat.umu.se> wrote:
>I have a dataframe 'dat' with one response and some covariates. Many
>observations (rows), but only a few unique combinations of
>the covariates. Let's say that the response is in column 1, and
>the covariates in columns 2:k.
>
>I want to do
>
>> covar <- unique.data.frame(dat[, 2:k])
>> y <- dat[, 1]
>> keys <- ??????
>
>where 'keys' should be a vector of length length(y) and contain the
>row numbers in 'covar', where the response will find its covariates.
>
>Example:
>
>> dat
>  y x1 x2
>1 1  1  0
>2 2  0  1
>3 3  1  0
>
>> unique.data.frame(dat[, 2:3])
>  x1 x2
>1  1  0
>2  0  1
>
>> keys
>1 1
>2 2
>3 1
>
>But how do I get 'keys'?
>--
> Göran Broström                      tel: +46 90 786 5223
> professor                           fax: +46 90 786 6614
> Department of Statistics            http://www.stat.umu.se/egna/gb/
> Umeå University
> SE-90187 Umeå, Sweden               e-mail: gb@stat.umu.se

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
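For readers who want to try the dist-based approach, here is a minimal sketch on the three-row example from the question. Two things in it are assumptions and not part of the posted code: in current R, dist is available from the stats package without loading mva, and the second step is written as a single apply with min(which(.)), because apply(incid,1,which) returns a matrix instead of a list whenever every row has the same number of matches.

```r
# Example data from the original question.
dat <- data.frame(y = 1:3, x1 = c(1, 0, 1), x2 = c(0, 1, 0))

# n x n incidence matrix: TRUE where two rows have identical covariates.
# The maximum (Chebyshev) distance is 0 only if every coordinate agrees.
incid <- as.matrix(dist(dat[, -1], method = "maximum")) == 0

# For each row, the index of its first matching row in dat.
# (Variant of the posted second line; see the note above.)
keys <- apply(incid, 1, function(r) min(which(r)))

print(unname(keys))
```

Note that these keys index the first occurrence in dat itself, and that incid has nrow(dat)^2 entries.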
On Wed, 20 Feb 2002, Gabor Grothendieck wrote:

> In the first line, use the dist function, found in library mva,
> to get the distance between each pair of rows. From this
> calculate an incidence matrix for which element i,j is true if
> row i in dat equals row j in dat (and false elsewhere).
>
> In the second line, for each row calculate the indices of
> the matching rows and take the minimum of those as the key.
>
> incid <- as.matrix(dist(dat[,-1],method="max"))==0
> keys <- unlist(lapply(apply(incid,1,which),min))

Thank you very much! This is very fast, much faster than my attempts
so far, but it has two drawbacks:

1. It gives pointers to first occurrences in the _original_ data frame,
not the 'unique' version.

2. The first step results in a _huge_ matrix 'incid', too huge for my
applications.

However, this is a promising first attempt, and I will try to refine
the idea. Again, thanks!

Göran
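Both drawbacks can be illustrated with a short sketch. It is not taken from the thread: the four-row data set and the names first, keys_a, key_str and keys_b are made up for illustration, and the second variant swaps in a different technique (pasting each row into a string and using match), which assumes the covariates convert to character without loss.

```r
# Hypothetical data in which first-occurrence indices and row numbers
# of the unique frame differ.
dat <- data.frame(y = 1:4, x1 = c(1, 1, 0, 1), x2 = c(0, 0, 1, 0))
covar <- unique(dat[, -1])

# Drawback 1: renumber first-occurrence indices into row numbers of covar.
incid <- as.matrix(dist(dat[, -1], method = "maximum")) == 0
first <- apply(incid, 1, function(r) min(which(r)))   # 1 1 3 1
keys_a <- match(first, unique(first))

# Drawback 2: avoid the n x n matrix altogether by building one string
# per row and matching it against the distinct strings.
key_str <- do.call(paste, c(dat[, -1], sep = "\r"))
keys_b <- match(key_str, unique(key_str))

print(keys_a)
print(keys_b)
```

Since unique keeps rows in order of first appearance, unique(key_str) lines up with the rows of covar, so covar[keys_b, ] reproduces the covariates of dat.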
Here's another idea. It assumes that dat[,-1] contains only zeros and ones, since this is true in your example. Some comments on lifting this restriction are at the end.

  dat0 <- 2*matrix(unlist(dat[,-1]),nrow=nrow(dat))-1
  u0 <- 2*matrix(unlist(unique(dat[,-1])),ncol=ncol(dat[,-1]))-1
  keys <- apply(dat0 %*% t(u0) == ncol(dat0),1,which)

The first line creates a matrix of 1's and -1's from the x variables such that 1 is mapped to 1 and 0 is mapped to -1. The second line extracts the unique rows and performs the same transformation. The last line does a matrix multiplication, creating an incidence matrix using the fact that the inner product of a row of dat0 and a row of u0 equals the number of columns of dat0 iff the two rows are equal. We then apply the which function to get the indices. We don't have to take the minimum like we did last time, since u0 has unique rows. Because u0 has only one row per unique covariate combination, the product has nrow(dat) rows but only that many columns, so it is much smaller than incid when there are few unique combinations.

To generalize this to x matrices which contain more than just zeros and ones, we would have to define a generalized matrix multiplication which uses "and" instead of plus and == instead of times but is otherwise the same. The ∧.= (and-dot-equals) inner product in the APL language did this. Creating such an operator would even have a benefit in the 0/1 case, since it would make the mapping to +1/-1 unnecessary.
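The generalized "and / equals" product mentioned above could be sketched in R roughly as follows. The function name and_eq and the example data are hypothetical, and this is only one way to write it: a loop over columns that combines outer(..., "==") comparisons with &, assuming numeric covariates with no missing values.

```r
# Hypothetical helper: result[i, j] is TRUE iff row i of a equals row j of b.
# It plays the role of a matrix product with "&" in place of "+"
# and "==" in place of "*".
and_eq <- function(a, b) {
  stopifnot(ncol(a) == ncol(b))
  out <- matrix(TRUE, nrow(a), nrow(b))
  for (k in seq_len(ncol(a))) {
    out <- out & outer(a[, k], b[, k], "==")
  }
  out
}

# Made-up data whose covariates are not restricted to 0/1.
dat <- data.frame(y = 1:4, x1 = c(2, 5, 2, 7), x2 = c(3, 3, 3, 0))
x <- as.matrix(dat[, -1])
u <- unique(x)                       # unique rows of the covariate matrix

# Each row of x matches exactly one row of u, so which() returns one index.
keys <- apply(and_eq(x, u), 1, which)
print(unname(keys))
```

No mapping to +1/-1 is needed here, and the intermediate matrix is nrow(x) by nrow(u).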
Another way to do this is to realize that what is wanted is essentially a join (in the relational database sense) of dat and the unique rows of dat. We use merge to perform the join in the following. The first few lines set up the data for merge and the last one unscrambles the result, since merge does not preserve ordering. Note that apply is nowhere used, suggesting that this solution may have adequate speed.

  u <- unique(dat)
  dat0 <- cbind( dat, seq(nrow(dat)) )
  u0 <- cbind( u, seq(nrow(u)) )
  by.arg <- c( rep(TRUE,ncol(dat)), FALSE )
  dat.mrg <- merge( dat0,u0, by.x=by.arg, by.y=by.arg, sort=FALSE )
  keys <- dat.mrg[,ncol(dat.mrg)][order(dat.mrg[,ncol(dat.mrg)-1])]

First, u becomes the unique rows of dat. The next two lines append a column of sequence numbers to dat and to u. The 4th and 5th lines merge dat0 and u0 on all columns except the sequence numbers. At this point the last two columns of dat.mrg contain the sequence number in the original data frame, dat, and the corresponding sequence number in u. However, the rows may be scrambled relative to the original ordering in dat, since merge does not preserve order, so we re-sort by the original sequence number to get keys.
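A small worked version of the merge idea, on the example data from the question, might look like this. It is a sketch under assumptions: the join is taken over the covariate columns only (dat[,-1]), and the helper columns row_id and key are hypothetical names introduced here so that the merge can be written with named by columns.

```r
# Example data from the original question.
dat <- data.frame(y = 1:3, x1 = c(1, 0, 1), x2 = c(0, 1, 0))

x <- dat[, -1]                    # covariates only
u <- unique(x)                    # one row per covariate combination

x$row_id <- seq_len(nrow(x))      # position in the original data
u$key    <- seq_len(nrow(u))      # row number in the unique frame

# Join on the covariate columns; merge does not promise to keep row order.
mrg <- merge(x, u, by = c("x1", "x2"), sort = FALSE)

# Restore the original ordering to obtain one key per observation.
keys <- mrg$key[order(mrg$row_id)]
print(keys)
```

Every row of x has exactly one partner in u, so mrg has nrow(dat) rows and no observation is lost or duplicated by the join.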