Giuseppe Paleologo
2008-Aug-27 15:34 UTC
[R] A manipulation problem for a large data set in R
I have two questions for the group. One is very concrete, and is dangerously close to a "please do my homework" posting. The second follows from the first one but is more general. I would welcome the advice of experienced R users.

As for the first one: I have a data frame with two variables

X, Y
A, chris
D, chris
B, chris
B, chris
C, andrew
E, andrew
C, andrew
B, beth
D, chris
D, beth
C, beth
D, beth
D, beth
A, andrew
A, andrew
A, andrew
C, chris
B, beth
D, chris
E, andrew
D, chris
D, beth
D, chris
A, andrew
A, chris
C, chris
A, chris
B, chris
C, beth
A, chris

I would like to produce a table with one row for every level of the factor X, and multiple columns filled with the levels of the factor Y that are observed jointly with X. Hence:

X, Z1, Z2, Z3
A, andrew, chris
B, beth, chris
C, andrew, beth, chris
D, chris, beth
E, andrew

A solution would be to do something like

    temp = tapply(Y, X, function(a) levels(a[,drop=TRUE]))

and then put the output in an appropriately sized data frame. The issue I have with this is that it is inelegant and rather slow for my typical data set (~200K rows). So I was wondering if a more efficient, nicer solution exists.

This leads me to a second question. Maybe out of laziness, maybe because R is good enough, I tend to do all my local data manipulations in R. This includes de-duping records, joining tables, and grouping observations. I do this also for larger data sets (say, dense tables with 100M+ elements). Is this current practice among R users? If so, is there a tutorial, or an R view on it? If not, what do you use?

Thanks in advance,

-gappy
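The three operations named in the second question (de-duping records, joining tables, grouping observations) each have a base R counterpart. The following is a minimal sketch, assuming two small hypothetical tables, `obs` and `info`, that are not part of the original post; the `age` values are made up purely for illustration.

```r
# Hypothetical toy tables; the names `obs` and `info` are illustrative only.
obs <- data.frame(
  X = c("A", "D", "B", "B", "C"),
  Y = c("chris", "chris", "chris", "chris", "andrew"),
  stringsAsFactors = FALSE
)
info <- data.frame(
  Y   = c("andrew", "beth", "chris"),
  age = c(31, 27, 45),              # made-up values
  stringsAsFactors = FALSE
)

# De-duping: keep one copy of each distinct (X, Y) row.
dedup <- unique(obs)                # same rows as obs[!duplicated(obs), ]

# Joining: left join of the de-duplicated rows onto `info` by the column Y.
joined <- merge(dedup, info, by = "Y", all.x = TRUE)

# Grouping: number of distinct X values seen with each Y.
counts <- tapply(dedup$X, dedup$Y, length)
```

Whether this style remains comfortable at 100M+ elements depends on available memory and is not something this sketch demonstrates.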
Charles C. Berry
2008-Aug-27 15:58 UTC
[R] A manipulation problem for a large data set in R
On Wed, 27 Aug 2008, Giuseppe Paleologo wrote:

> I would like to produce a table, with one row for every level of the factor
> X, and multiple columns, filled with the observed levels of the factor Y
> that are observed jointly with X.
[...]
> A solution would be to do something like
>
> temp = tapply(Y, X, function(a) levels(a[,drop=TRUE]))

    lapply( split(Y,X), unique )

or

    lapply( split(Y,X), function(x) as.character(unique(x)))

HTH,

Chuck
Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu             UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
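The split/unique suggestion returns a list with one character vector per level of X, so one step remains to reach the wide table the question asks for. The following is a sketch under stated assumptions: the vectors `X` and `Y` are transcribed from the 30 rows in the original post, and the padding with NA, the sorting within each group, and the names `groups`, `padded`, and `result` are illustrative choices, not part of either message.

```r
# Data transcribed from the original post (30 rows).
X <- factor(c("A","D","B","B","C","E","C","B","D","D","C","D","D","A","A",
              "A","C","B","D","E","D","D","D","A","A","C","A","B","C","A"))
Y <- factor(c("chris","chris","chris","chris","andrew","andrew","andrew","beth",
              "chris","beth","beth","beth","beth","andrew","andrew","andrew",
              "chris","beth","chris","andrew","chris","beth","chris","andrew",
              "chris","chris","chris","chris","beth","chris"))

# Split Y by X and keep the distinct values in each group, as in the reply;
# sort() is added here only to make the column order deterministic.
groups <- lapply(split(Y, X), function(y) sort(as.character(unique(y))))

# Pad every group with NA up to the length of the longest group,
# then stack the padded vectors into a character matrix (one row per X level).
width  <- max(lengths(groups))
padded <- do.call(rbind, lapply(groups, function(g) {
  c(g, rep(NA_character_, width - length(g)))
}))
colnames(padded) <- paste0("Z", seq_len(width))

# One row per level of X, columns Z1..Zk holding the co-occurring Y values.
result <- data.frame(X = rownames(padded), padded,
                     row.names = NULL, stringsAsFactors = FALSE)
```

For this data `result` has five rows and three Z columns, with NA filling the unused cells (for example, E is seen only with andrew). On a large input, de-duplicating the (X, Y) pairs first with `unique(data.frame(X, Y))` before splitting may reduce the work, though that is not benchmarked here.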