Jun Shen
2016-Sep-02 17:02 UTC
[R] Improve code efficiency with do.call, rbind and split construction
Dear list,

I have the following line of code to extract the last row of each split data frame and put them back together:

do.call(rbind,
        lapply(split(simout.s1, simout.s1[c('SID', 'DOSENO')]),
               function(x) x[nrow(x), ]))

The problem is that when I have a huge dataset, it takes too long to run (actually it's been > 3 hours and it's still running). The dataset is pretty big: I have 200,000 unique SID and 4 DOSENO, so 800,000 split data frames in total. Is there any way to speed it up? Thanks.

Jun
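For readers without the original data, here is a minimal reproducible stand-in for the pattern above (a made-up simout.s1 with hypothetical SID/DOSENO/CONC columns, since the real data were never posted):

```r
## Made-up stand-in for simout.s1 -- the real data were not posted
set.seed(42)
simout.s1 <- data.frame(SID    = rep(1:3, each = 4),
                        DOSENO = rep(1:2, times = 6),
                        CONC   = runif(12))

## The approach from the question: split by (SID, DOSENO), keep each
## group's last row, then rbind the pieces back together
last.rows <- do.call(rbind,
                     lapply(split(simout.s1, simout.s1[c("SID", "DOSENO")]),
                            function(x) x[nrow(x), ]))
nrow(last.rows)   # 6: one row per (SID, DOSENO) group (3 SID x 2 DOSENO)
```

This works, but the per-group data-frame subsetting and the final rbind() are what make it slow at scale.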
Bert Gunter
2016-Sep-02 17:51 UTC
[R] Improve code efficiency with do.call, rbind and split construction
This is the sort of thing that the dplyr or data.table packages can probably do elegantly and efficiently, so you might consider looking at them. But as I use neither, let me suggest a base R solution. As you supplied no data for a reproducible example, I'll make up my own and hopefully I have understood you correctly. If not, maybe someone else will get it straight. Anyway...

The "trick" is to use tapply() to select the necessary row indices of your data frame and forget about all the do.call and rbind stuff, e.g.

> set.seed(1001)
> df <- data.frame(f = factor(sample(LETTERS[1:4], 100, rep = TRUE)),
+                  g = factor(sample(letters[1:6], 100, rep = TRUE)),
+                  y = runif(100))
> ix <- seq_len(nrow(df))
> ix <- with(df, tapply(ix, list(f, g), function(x) x[length(x)]))
> ix
   a  b   c  d  e  f
A 94 69 100 59 80 87
B 89 57  65 90 75 88
C 85 92  86 95 97 62
D 47 73  72 74 99 96

## ix can now be used as an index into df as:
df[ix, ]

This should help somewhat, but you still have to contend with the tapply() loop at the interpreted level. I'll leave speed comparisons to you.

Cheers,
Bert

## Note: if, in fact, your data frame is arranged in a regular way, with, e.g., your (SID, DOSENO) groups all of the same size and together, then you can calculate the indices you want directly and skip the tapply business. I'm assuming this is not the case... Again, no data...

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Fri, Sep 2, 2016 at 10:02 AM, Jun Shen <jun.shen.ut at gmail.com> wrote:
[snip]
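Bert's closing note -- that regular, contiguous groups let you compute the indices directly -- can be sketched as follows. This is a hypothetical layout (sorted data, a made-up group size k of 4 rows per (SID, DOSENO) group), not the poster's actual data:

```r
## Hypothetical regular layout: data sorted by group, all groups of size k
set.seed(1001)
k  <- 4                                    # assumed rows per (SID, DOSENO) group
df <- data.frame(SID    = rep(1:5, each = 2 * k),
                 DOSENO = rep(rep(1:2, each = k), times = 5),
                 y      = runif(40))

## The last row of each group is then simply every k-th row --
## no grouping machinery needed at all
ix <- seq(k, nrow(df), by = k)
df[ix, ]   # 10 rows: one per (SID, DOSENO) group
```

If the data are not sorted this way, a one-time ordering by the grouping columns first would restore the regularity, at the cost of an order() call.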
ruipbarradas at sapo.pt
2016-Sep-02 17:57 UTC
[R] Improve code efficiency with do.call, rbind and split construction
Hello,

Try ?aggregate, it's probably faster. With a made-up data.frame, since you haven't provided us with a dataset:

simout.s1 <- data.frame(SID = rep(LETTERS[1:2], 10),
                        DOSENO = rep(letters[1:4], each = 5),
                        value = rnorm(20))

res2 <- aggregate(simout.s1$value,
                  list(simout.s1$SID, simout.s1$DOSENO),
                  function(x) x[NROW(x)])
names(res2) <- names(simout.s1)

Use dput to post a data example. Something like the following:

dput(head(simout.s1, 50))  # paste the output of this in your next mail

Hope this helps,

Rui Barradas

Citando Jun Shen <jun.shen.ut at gmail.com>:
[snip]
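One caveat with the aggregate() call above: it returns only the aggregated value column, not the whole last row. A sketch of one way around this (untested against the real data) is to aggregate row numbers instead of values, then subset the original data frame with them:

```r
## Same made-up stand-in as above, since the real simout.s1 was not posted
set.seed(7)
simout.s1 <- data.frame(SID    = rep(LETTERS[1:2], 10),
                        DOSENO = rep(letters[1:4], each = 5),
                        value  = rnorm(20))

## Aggregate row numbers rather than values: the result's x column then
## holds the row index of each group's last row
ix <- aggregate(seq_len(nrow(simout.s1)),
                by  = list(SID = simout.s1$SID, DOSENO = simout.s1$DOSENO),
                FUN = function(i) i[length(i)])

res <- simout.s1[ix$x, ]   # full last rows, all columns retained
```

This keeps every column of the original data frame, whereas aggregating the values directly would require one aggregate() call per column.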
Jun Shen
2016-Sep-02 18:37 UTC
[R] Improve code efficiency with do.call, rbind and split construction
Hi Bert,

This is the best method I have seen this year! do.call/rbind has just gone to the museum :) It took ~30 seconds to get the results. You deserve a medal!!!!

Jun

On Fri, Sep 2, 2016 at 1:51 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
[snip]
Charles C. Berry
2016-Sep-02 18:50 UTC
[R] Improve code efficiency with do.call, rbind and split construction
On Fri, 2 Sep 2016, Bert Gunter wrote:

[snip]

> The "trick" is to use tapply() to select the necessary row indices of
> your data frame and forget about all the do.call and rbind stuff. e.g.

I agree the way to go is "select the necessary row indices" but I get there a different way. See below.

[Bert's example snipped]

jx <- which(!duplicated(df[, c("f", "g")], fromLast = TRUE))

xtabs(jx ~ f + g, df[jx, ])  ## Show equivalence to Bert's `ix'

   g
f    a  b   c  d  e  f
  A 94 69 100 59 80 87
  B 89 57  65 90 75 88
  C 85 92  86 95 97 62
  D 47 73  72 74 99 96

Chuck
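Since Bert left speed comparisons to the reader, here is a rough benchmarking sketch of the three approaches on synthetic data (made-up sizes: 500 SID x 4 DOSENO, 10 rows per group; actual timings will vary by machine and data shape):

```r
## Synthetic data with every (SID, DOSENO) combination present
set.seed(1)
d <- data.frame(SID    = rep(1:500, each = 40),
                DOSENO = rep(1:4, times = 5000),
                y      = runif(20000))

## 1. The original split / do.call / rbind approach
t1 <- system.time(
  r1 <- do.call(rbind, lapply(split(d, d[c("SID", "DOSENO")]),
                              function(x) x[nrow(x), ]))
)

## 2. Bert's tapply on row indices
t2 <- system.time({
  i2 <- as.vector(with(d, tapply(seq_len(nrow(d)), list(SID, DOSENO),
                                 function(x) x[length(x)])))
  r2 <- d[i2, ]
})

## 3. Chuck's duplicated() approach
t3 <- system.time({
  i3 <- which(!duplicated(d[c("SID", "DOSENO")], fromLast = TRUE))
  r3 <- d[i3, ]
})

## All three select the same set of rows (only the row order differs)
setequal(i2, i3)                 # TRUE
nrow(r1) == length(i3)           # TRUE: 2000 groups either way
rbind(split = t1["elapsed"], tapply = t2["elapsed"], dup = t3["elapsed"])
```

Note that, unlike the direct-indexing shortcut, the duplicated() approach needs no assumption that groups are contiguous or equally sized: fromLast = TRUE picks the last occurrence of each (SID, DOSENO) pair wherever it sits.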