Leo Alekseyev
2009-Sep-02 22:39 UTC
[R] Grouping data in a data frame: is there an efficient way to do it?
I have a data frame with about 10^6 rows; I want to group the data according to entries in one of the columns and do something with it. For instance, suppose I want to count up the number of elements in each group. I tried something like

aggregate(my.df$my.field, list(my.df$my.field), length)

but it seems to be very slow. Likewise, the split() function was slow (I killed it before it completed). Is there a way to efficiently accomplish this in R? I am almost tempted to write an external Perl/Python script entering every row into a hashtable keyed by my.field and iterating over the keys... Might this be faster?
David Winsemius
2009-Sep-02 22:59 UTC
[R] Grouping data in a data frame: is there an efficient way to do it?
table() is reasonably fast. I have more than 4 x 10^6 records and a 2D table takes very little time:

nUA <- with(TRdta, table(URwbc, URrbc))  # both URwbc and URrbc are factors
nUA

This does the same thing and took about 5 seconds just now:

xtabs( ~ URwbc + URrbc, data=TRdta)

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
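David's examples use his own dataset (TRdta), which isn't shown in the thread. A minimal self-contained sketch of the same idea, on synthetic data matching the original poster's setup (my.df and my.field are the names from the question; the letter groups are made up for illustration):

```r
# Synthetic stand-in for a large data frame grouped by one column
set.seed(1)
my.df <- data.frame(my.field = sample(letters[1:5], 1e6, replace = TRUE))

# table() counts the occurrences of each level in a single pass
counts <- table(my.df$my.field)
counts

# xtabs() produces the same counts through a formula interface
counts2 <- xtabs(~ my.field, data = my.df)
```

Both calls should be far faster than aggregate() with length, since they do the counting in compiled code.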
Bert Gunter
2009-Sep-02 23:12 UTC
[R] Grouping data in a data frame: is there an efficient way to do it?
table() and xtabs() are fast only because they are just doing counts. If you want the general case, you need ?tapply. aggregate() is basically a wrapper for lapply, so you may see the same performance issues with tapply; try it and see. These functions are essentially doing the sort of hash-table lookup you describe, in R code. Whether Perl or Python would be faster I cannot say -- but are you including the time required to develop and debug the script in your assessment?

Bert Gunter
Genentech Nonclinical Biostatistics
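To illustrate Bert's point that table()/xtabs() only count while tapply() handles arbitrary per-group summaries, a hedged sketch on made-up data (my.field comes from the original question; my.value is an illustrative measurement column, not from the thread):

```r
set.seed(1)
my.df <- data.frame(my.field = sample(c("a", "b", "c"), 1e6, replace = TRUE),
                    my.value = runif(1e6))  # my.value is a made-up column

# Counting is the special case that table() covers:
group.counts <- tapply(my.df$my.value, my.df$my.field, length)

# ...but tapply() accepts any summary function, e.g. a per-group mean,
# which table() and xtabs() cannot do:
group.means <- tapply(my.df$my.value, my.df$my.field, mean)
```

The third argument is where the generality lives: length, mean, sum, or any function of one vector argument.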
jim holtman
2009-Sep-02 23:13 UTC
[R] Grouping data in a data frame: is there an efficient way to do it?
Takes about 0.6 seconds on my slow laptop:

> n <- 1e6
> x <- data.frame(a=sample(LETTERS, n, TRUE))
> system.time(print(tapply(x$a, x$a, length)))
    A     B     C     D     E     F     G     H     I     J     K     L     M     N     O     P     Q
38555 38349 38647 38271 38456 38352 38644 38679 38575 38730 38471 38379 38540 38413 38365 38501 38555
    R     S     T     U     V     W     X     Y     Z
38379 38417 38326 38509 38238 38395 38625 38175 38454
   user  system elapsed
   0.59    0.02    0.63

Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
David M Smith
2009-Sep-02 23:28 UTC
[R] Grouping data in a data frame: is there an efficient way to do it?
You may want to try using isplit (from the iterators package). Combined with foreach, it's an efficient way of iterating through a data frame by groups of rows defined by common values of a column (which I think is what you're after). You can speed things up further on a multiprocessor system by using the doMC package to run the iterations in parallel. There's an example here:

http://blog.revolution-computing.com/2009/08/blockprocessing-a-data-frame-with-isplit.html

Hope this helps,
# David Smith

David M Smith <david@revolution-computing.com>
Director of Community, REvolution Computing
www.revolution-computing.com
Tel: +1 (206) 577-4778 x3203 (San Francisco, USA)
Check out our upcoming events schedule at www.revolution-computing.com/events
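A minimal sketch of the isplit/foreach pattern, assuming the iterators and foreach packages are installed (the data frame and column names echo the original question; my.value is an illustrative column, and the group sizes shown are whatever sample() produces):

```r
library(iterators)
library(foreach)

set.seed(1)
my.df <- data.frame(my.field = sample(c("a", "b", "c"), 1000, replace = TRUE),
                    my.value = runif(1000))  # my.value is a made-up column

# isplit() yields one chunk per level of the split variable; each chunk
# is a list with $value (the data) and $key (the group label).
# %do% runs serially; registering doMC and switching to %dopar% would
# run the iterations in parallel.
group.sizes <- foreach(chunk = isplit(my.df$my.value, my.df$my.field)) %do% {
  length(chunk$value)
}
```

Unlike split(), which materializes every group at once, isplit() hands you one group at a time, which keeps memory use flat on large data frames.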