Hello, I am trying to do some fairly straightforward data summarization, i.e., the kind you would do with a pivot table in excel or by using SQL queires. I have a moderately sized data set of ~70,000 records and I am trying to compute some group averages and sum values within groups. the code example below shows how I am trying to go about doing this pti <-rnorm(70000,10) fid <- rnorm(70000,100) finc <- rnorm(70000,1000) ### compute the sums of pti within fid groups sum_pinc <-aggregate(cbind(fid,pti),list(fid),FUN=sum) #### compute mean finc within fid groups tot_finc <- aggregate(cbind(fid,finc),list(fid),FUN=mean) when I try to do it this way I get an error message telling me that enough memory cannot be allocated ( I am using R 2.7.1 on Windows XP with 2 GB of Memory). I figure that there must be a more efficent way to go about doing this. Please suggest. I would typically do this kind of task in a database and use SQL to push the data around. I know RODBC allows you to write SQL to query external DBs. Is there any mechanisim that allows you to write SQL queies against datasets internal to R e.g. in the case above I could do something like set <- cbind(fid,pti,finc) select fid, sum(pti) from set group by fid that would be handy! Thanks, Spencer [[alternative HTML version deleted]]
Hello, Have you tried using the GUI Rattle from www.rattle.togaware.com . It works pretty well for summarization. Regards, Ajay www.decisionstats.com On Sat, Jul 12, 2008 at 4:14 AM, sj <ssj1364 at gmail.com> wrote:> > Hello, > > I am trying to do some fairly straightforward data summarization, i.e., the > kind you would do with a pivot table in excel or by using SQL queires. I > have a moderately sized data set of ~70,000 records and I am trying to > compute some group averages and sum values within groups. the code example > below shows how I am trying to go about doing this > > pti <-rnorm(70000,10) > fid <- rnorm(70000,100) > finc <- rnorm(70000,1000) > > > ### compute the sums of pti within fid groups > sum_pinc <-aggregate(cbind(fid,pti),list(fid),FUN=sum) > > #### compute mean finc within fid groups > tot_finc <- aggregate(cbind(fid,finc),list(fid),FUN=mean) > > when I try to do it this way I get an error message telling me that enough > memory cannot be allocated ( I am using R 2.7.1 on Windows XP with 2 GB of > Memory). I figure that there must be a more efficent way to go about doing > this. Please suggest. > > I would typically do this kind of task in a database and use SQL to push the > data around. I know RODBC allows you to write SQL to query external DBs. Is > there any mechanisim that allows you to write SQL queies against datasets > internal to R e.g. in the case above > > > I could do something like > > set <- cbind(fid,pti,finc) > > select fid, sum(pti) > from set > group by fid > > that would be handy! > > Thanks, > > Spencer > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
On Sat, Jul 12, 2008 at 7:44 AM, sj <ssj1364 at gmail.com> wrote:> Hello, > > I am trying to do some fairly straightforward data summarization, i.e., the > kind you would do with a pivot table in excel or by using SQL queires. I > have a moderately sized data set of ~70,000 records and I am trying to > compute some group averages and sum values within groups. the code example > below shows how I am trying to go about doing this >You might want to have a look at the reshape package - http://had.co.nz/reshape - it's design was much inspired by pivot tables and sql crosstab queries. Hadley -- http://had.co.nz/
See sqldf home page: http://sqldf.googlecode.com e.g. library(sqldf) set.seed(1) pti <-rnorm(70000,10) fid <- rnorm(70000,100) finc <- rnorm(70000,1000) # set is a reserved word in SQL so use sset sset <- data.frame(fid,pti,finc) system.time(out <- sqldf("select fid, sum(pti) from sset group by fid")) # 3 seconds On Fri, Jul 11, 2008 at 6:44 PM, sj <ssj1364 at gmail.com> wrote:> Hello, > > I am trying to do some fairly straightforward data summarization, i.e., the > kind you would do with a pivot table in excel or by using SQL queires. I > have a moderately sized data set of ~70,000 records and I am trying to > compute some group averages and sum values within groups. the code example > below shows how I am trying to go about doing this > > pti <-rnorm(70000,10) > fid <- rnorm(70000,100) > finc <- rnorm(70000,1000) > > > ### compute the sums of pti within fid groups > sum_pinc <-aggregate(cbind(fid,pti),list(fid),FUN=sum) > > #### compute mean finc within fid groups > tot_finc <- aggregate(cbind(fid,finc),list(fid),FUN=mean) > > when I try to do it this way I get an error message telling me that enough > memory cannot be allocated ( I am using R 2.7.1 on Windows XP with 2 GB of > Memory). I figure that there must be a more efficent way to go about doing > this. Please suggest. > > I would typically do this kind of task in a database and use SQL to push the > data around. I know RODBC allows you to write SQL to query external DBs. Is > there any mechanisim that allows you to write SQL queies against datasets > internal to R e.g. in the case above > > > I could do something like > > set <- cbind(fid,pti,finc) > > select fid, sum(pti) > from set > group by fid > > that would be handy! > > Thanks, > > Spencer > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >