Leo Alekseyev
2009-Sep-02 22:39 UTC
[R] Grouping data in a data frame: is there an efficient way to do it?
I have a data frame with about 10^6 rows; I want to group the data according to entries in one of the columns and do something with it. For instance, suppose I want to count up the number of elements in each group. I tried something like

aggregate(my.df$my.field, list(my.df$my.field), length)

but it seems to be very slow. Likewise, the split() function was slow (I killed it before it completed). Is there a way to efficiently accomplish this in R? I am almost tempted to write an external Perl/Python script entering every row into a hashtable keyed by my.field and iterating over the keys... Might this be faster?
David Winsemius
2009-Sep-02 22:59 UTC
[R] Grouping data in a data frame: is there an efficient way to do it?
table() is reasonably fast. I have more than 4 x 10^6 records and a 2D table takes very little time:

nUA <- with(TRdta, table(URwbc, URrbc))  # both URwbc and URrbc are factors
nUA

This does the same thing and took about 5 seconds just now:

xtabs( ~ URwbc + URrbc, data=TRdta)

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
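David's examples use his own dataset (TRdta), which isn't shown in the thread. A minimal self-contained sketch of the same idea, on synthetic data matching the original poster's setup (my.df and my.field are the names from the question; the letter groups are made up for illustration):

```r
# Synthetic stand-in for a large data frame grouped by one column
set.seed(1)
my.df <- data.frame(my.field = sample(letters[1:5], 1e6, replace = TRUE))

# table() counts the occurrences of each level in a single pass
counts <- table(my.df$my.field)
counts

# xtabs() produces the same counts through a formula interface
counts2 <- xtabs(~ my.field, data = my.df)
```

Both calls should be far faster than aggregate() with length, since they do the counting in compiled code.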
Bert Gunter
2009-Sep-02 23:12 UTC
[R] Grouping data in a data frame: is there an efficient way to do it?
table() and xtabs() are fast only because they are just doing counts. If you want the general case, you need ?tapply. aggregate() is basically a wrapper for lapply, so you may see the same performance issues with tapply; try it and see. These functions are essentially doing the sort of hash-table lookup you describe, in R code. Whether Perl or Python would be faster I cannot say -- but are you including the time required to develop and debug the script in your assessment?

Bert Gunter
Genentech Nonclinical Biostatistics
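To illustrate Bert's point that table()/xtabs() only count while tapply() handles arbitrary per-group summaries, a hedged sketch on made-up data (my.field comes from the original question; my.value is an illustrative measurement column, not from the thread):

```r
set.seed(1)
my.df <- data.frame(my.field = sample(c("a", "b", "c"), 1e6, replace = TRUE),
                    my.value = runif(1e6))  # my.value is a made-up column

# Counting is the special case that table() covers:
group.counts <- tapply(my.df$my.value, my.df$my.field, length)

# ...but tapply() accepts any summary function, e.g. a per-group mean,
# which table() and xtabs() cannot do:
group.means <- tapply(my.df$my.value, my.df$my.field, mean)
```

The third argument is where the generality lives: length, mean, sum, or any function of one vector argument.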
jim holtman
2009-Sep-02 23:13 UTC
[R] Grouping data in a data frame: is there an efficient way to do it?
Takes about 0.6 seconds on my slow laptop:

> n <- 1e6
> x <- data.frame(a=sample(LETTERS, n, TRUE))
> system.time(print(tapply(x$a, x$a, length)))
    A     B     C     D     E     F     G     H     I     J     K     L     M     N     O     P     Q
38555 38349 38647 38271 38456 38352 38644 38679 38575 38730 38471 38379 38540 38413 38365 38501 38555
    R     S     T     U     V     W     X     Y     Z
38379 38417 38326 38509 38238 38395 38625 38175 38454
   user  system elapsed
   0.59    0.02    0.63

Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
David M Smith
2009-Sep-02 23:28 UTC
[R] Grouping data in a data frame: is there an efficient way to do it?
You may want to try using isplit (from the iterators package). Combined with foreach, it's an efficient way of iterating through a data frame by groups of rows defined by common values of a column (which I think is what you're after). You can speed things up further on a multiprocessor system by using the doMC package to run the iterations in parallel. There's an example here:

http://blog.revolution-computing.com/2009/08/blockprocessing-a-data-frame-with-isplit.html

Hope this helps,
# David Smith

David M Smith <david@revolution-computing.com>
Director of Community, REvolution Computing
www.revolution-computing.com
Tel: +1 (206) 577-4778 x3203 (San Francisco, USA)
Check out our upcoming events schedule at www.revolution-computing.com/events
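A minimal sketch of the isplit/foreach pattern, assuming the iterators and foreach packages are installed (the data frame and column names echo the original question; my.value is an illustrative column, and the group sizes shown are whatever sample() produces):

```r
library(iterators)
library(foreach)

set.seed(1)
my.df <- data.frame(my.field = sample(c("a", "b", "c"), 1000, replace = TRUE),
                    my.value = runif(1000))  # my.value is a made-up column

# isplit() yields one chunk per level of the split variable; each chunk
# is a list with $value (the data) and $key (the group label).
# %do% runs serially; registering doMC and switching to %dopar% would
# run the iterations in parallel.
group.sizes <- foreach(chunk = isplit(my.df$my.value, my.df$my.field)) %do% {
  length(chunk$value)
}
```

Unlike split(), which materializes every group at once, isplit() hands you one group at a time, which keeps memory use flat on large data frames.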