Florian Jansen
2006-Oct-02 15:30 UTC
[R] separation depending on equal contents in more than one field
Hi,

I have a data frame:

    (obs <- data.frame(a=c(1,2,2,3,3,3), b=c(1,2,3,4,4,5), c=1:2))
    attach(obs)

In reality it has about 1 million rows.

Some of the rows have the same contents in both column a and column b, like rows 4 and 5. I want to do some calculations on column c within the duplicated rows and merge them afterwards:

    layer <- function(x) round((1-prod(1-x/100))*100,0)
    (covnew <- aggregate(c, list(a=a, b=b), layer))

This works fine, but not with 1 million rows because of memory limitations. So I thought I would split the data frame into the majority of unique rows on the one hand and all duplicated rows on the other.

With

    subset(obs, a %in% a[duplicated(a)])

and its negation (!a ...) respectively, this works fine for a single-column comparison. This must also be possible for a two-column comparison, but I can't get it to work.

Thanks
Florian

--
Dr. Florian Jansen
Geobotany & Nature Conservation
Institute for Botany and Landscape Ecology
Ernst-Moritz-Arndt-University
Grimmer Str. 88
17487 Greifswald - Germany
+49 (0)3834 86 4147
jim holtman
2006-Oct-02 17:13 UTC
[R] separation depending on equal contents in more than one field
One way is to 'split' the indices of the rows to determine which ones to use. For example, from the data given, I got the following:

    > split(seq(nrow(obs)), list(obs$a, obs$b), drop=TRUE)
    $`1.1`
    [1] 1

    $`2.2`
    [1] 2

    $`2.3`
    [1] 3

    $`3.4`
    [1] 4 5

    $`3.5`
    [1] 6

You can then take the resulting list, find all entries with more than one value, and use those to do your calculations.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
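The suggestion above could be carried through roughly as follows. This is only a sketch under assumptions, not code from either poster: the names idx, dup_rows, key, is_dup, uniq_part and dup_part are made up for illustration. It also uses duplicated() on both columns at once as an alternative to split(). It has only been reasoned through on the six-row example, and whether it avoids the memory problem at 1 million rows is untested.

```r
# Example data from the original post (column c is recycled to 6 rows).
obs <- data.frame(a = c(1, 2, 2, 3, 3, 3),
                  b = c(1, 2, 3, 4, 4, 5),
                  c = 1:2)

# Aggregation function from the original post.
layer <- function(x) round((1 - prod(1 - x / 100)) * 100, 0)

# Approach 1: group row indices by the (a, b) combination,
# then keep only the groups that contain more than one row.
idx <- split(seq_len(nrow(obs)), list(obs$a, obs$b), drop = TRUE)
dup_rows <- unlist(idx[lengths(idx) > 1], use.names = FALSE)

# Approach 2 (assumed alternative): flag every row whose (a, b) pair
# occurs more than once. duplicated() alone marks only the second and
# later occurrences, so it is combined with a scan from the end.
key <- obs[c("a", "b")]
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Split into the unique majority and the duplicated remainder.
uniq_part <- obs[!is_dup, ]
dup_part  <- obs[is_dup, ]

# Aggregate only the duplicated rows, then merge with the unique rows.
agg <- aggregate(c ~ a + b, data = dup_part, FUN = layer)
covnew <- rbind(uniq_part, agg)
covnew <- covnew[order(covnew$a, covnew$b), ]
rownames(covnew) <- NULL
```

Both approaches should pick out rows 4 and 5 of the example. The point of the split is that aggregate() only sees the duplicated rows, which are presumably a small fraction of the full data; the unique rows are passed through unchanged.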