thr3ads.net - R help - [R] Thoughts for faster indexing [Nov 2013]

Noah Silverman

2013-Nov-20 20:16 UTC

[R] Thoughts for faster indexing

Hello,

I have a fairly large data.frame.  (About 150,000 rows of 100
variables.) There are case IDs, and multiple entries for each ID, with a
date stamp.  (i.e., records of people's activity.)


I need to iterate over each person (record ID) in the data set, and then
process their data for each date.  The processing part is fast, the date
part is fast.  Locating the records is slow.  I've even tried using
data.table, with ID set as the index, and it is still slow.

The line with the slow process (According to Rprof) is:


j <- which( d$id == person )

(I then process all the records indexed by j, which seems fast enough.)

where d is my data.frame or data.table

I thought that using the data.table indexing would speed things up, but
not in this case.

Any ideas on how to speed this up?


Thanks!

-- 
Noah Silverman, M.S., C.Phil
UCLA Department of Statistics
8117 Math Sciences Building
Los Angeles, CA 90095

Jim Holtman

2013-Nov-21 11:34 UTC

You need to show the statement in context with the rest of the script.  You need
to tell us what you want to do, not how you want to do it.

Carl Witthoft

2013-Nov-21 13:23 UTC

What the Data Munger Guru said.
Plus: this is almost certainly a job for ddply or data.table.
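
As a rough illustration of the data.table route, the sketch below groups by person so that no per-ID lookup is needed at all. The toy table, its column names, and the summaries computed are assumptions made up for the example; the real processing step from the original post is not known.

```r
library(data.table)

# Toy data: 1,000 people with 5 dated records each (hypothetical stand-in)
set.seed(1)
dt <- data.table(id   = rep(1:1000, each = 5),
                 date = rep(1:5, times = 1000),
                 x    = rnorm(5000))

# Group by person: the j expression is evaluated once per id,
# with no explicit which() or subsetting step
per_person <- dt[, .(n = .N, mean_x = mean(x)), by = id]

# Group by person and date when the processing is done per date
per_date <- dt[, .(total_x = sum(x)), by = .(id, date)]
```

If the per-person work can be written as an expression over the columns, this form usually removes the explicit loop entirely.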



MacQueen, Don

2013-Nov-21 15:42 UTC

I have some processes where I do the same thing, iterate over subsets of a
data frame.
My data frame has ~250,000 rows, 30 variables, and the subsets are such
that there are about 6000 of them.

Performing a which() statement like yours seems quite fast.

For example, wrapping unix.time() around the which() expression, I get

   user  system elapsed
  0.008   0.000   0.008

It's hard for me to imagine the single task of getting the indexes is slow
enough to be a bottleneck.

On the other hand, if the variable being used to identify subsets is a
factor with many levels (~6000 in my case), it is noticeably slower.

   user  system elapsed
  0.024   0.002   0.026

I haven't tested it, and have no real expectation that it will make a
difference, but perhaps sorting by the index variable before iterating
will help (if you haven't already). Since these are not true indexes in
the sense used by relational database systems, maybe it will make a
difference.
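
One way sorting could pay off, sketched here as an untested-at-scale assumption rather than a measured result: once the rows are ordered by id, each person's records form a contiguous block, so the block boundaries can be computed once with rle() and each subset becomes a simple range. The toy data frame and the sum() placeholder are hypothetical.

```r
# Toy data with ids in random order (hypothetical stand-in for d)
set.seed(1)
d <- data.frame(id = sample(rep(1:1000, each = 5)), x = rnorm(5000))

# Sort once so that each id occupies a contiguous block of rows
d <- d[order(d$id), ]

# Run lengths of the sorted ids give the start and end row of each block
runs   <- rle(d$id)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1L

# Each person's rows are now a simple range; no comparison over the full vector
res <- numeric(length(starts))
for (k in seq_along(starts)) {
  j <- starts[k]:ends[k]
  res[k] <- sum(d$x[j])         # placeholder for the real processing
}
names(res) <- runs$values
```

If the id column is a factor, rle() will not accept it directly; converting it with as.character() or as.integer() first would be needed.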

-- 
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062

William Dunlap

2013-Nov-21 16:48 UTC

> The line with the slow process (According to Rprof) is:
> j <- which( d$id == person )
> (I then process all the records indexed by j, which seems fast enough.)
Using split() once (and using its output in a loop), instead of applying == to
a long vector many times, as in
   for(j in split(seq_along(d$id), d$id)) {
       # newdata[j,] <- process(d[j,])
   }
is typically faster.  But this is the sort of thing that tapply() and the
functions in package:plyr do for you.
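
To make the comparison concrete, here is a minimal sketch on made-up data. The data frame `d`, its columns, and the mean() placeholder are assumptions for illustration, not the original poster's code. It builds the row indices for every id once with split(), loops over them, and then shows a tapply() call that gives the same result.

```r
# Toy data: 1,000 people with 5 records each (hypothetical stand-in for d)
set.seed(1)
d <- data.frame(id = rep(1:1000, each = 5), x = rnorm(5000))

# Build the row indices for every id in a single pass
idx <- split(seq_len(nrow(d)), d$id)

# Loop over the precomputed index vectors instead of calling which() each time
res_loop <- numeric(length(idx))
for (k in seq_along(idx)) {
  j <- idx[[k]]                 # rows belonging to the k-th id
  res_loop[k] <- mean(d$x[j])   # placeholder for the real processing
}
names(res_loop) <- names(idx)

# tapply() does the split-and-apply in one call
res_tapply <- tapply(d$x, d$id, mean)
```

The split() call scans the id vector once, whereas which(d$id == person) inside a loop scans it once per person.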

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

jlh.membership

2013-Nov-21 20:41 UTC

Not sure this helps but...
 
######
library(data.table)

# data frame with 30,000 IDs, each with 5 "dates", plus some random data...
df <- data.frame(id=rep(1:30000, each=5),
                 date=rep(1:5, times=30000),
                 x=rnorm(150000), y=rnorm(150000, mean=1),
                 z=rnorm(150000, mean=3))
dt <- data.table(df, key="id")      # note you have to set the key...

# No difference when using which
system.time(for (i in 1:300) {j <- which(df$id==i)})
#   user  system elapsed
#   0.73    0.06    0.79

system.time(for (i in 1:300) {j <- which(dt$id==i)})
#   user  system elapsed
#   0.69    0.04    0.76

# About 20x faster using a keyed join
system.time(for (i in 1:300) {select <- df[df$id==i,]})
#   user  system elapsed
#  19.25    0.36   19.64
system.time(for (i in 1:300) {select <- dt[id==i,]})
#   user  system elapsed
#   4.32    0.11    4.45
system.time(for (i in 1:300) {select <- dt[J(i)]})
#   user  system elapsed
#   0.88    0.00    0.88
######

Note that extracting select with a data.table join still took longer than
generating an "index" using which(), but getting all the columns in one step,
instead of just the index, might speed up later operations.

