Hello,

I have a fairly large data.frame. (About 150,000 rows of 100 variables.) There are case IDs, and multiple entries for each ID, with a date stamp. (i.e. records of people's activity.)

I need to iterate over each person (record ID) in the data set, and then process their data for each date. The processing part is fast, the date part is fast. Locating the records is slow. I've even tried using data.table, with ID set as the index, and it is still slow.

The line with the slow process (according to Rprof) is:

    j <- which( d$id == person )

(I then process all the records indexed by j, which seems fast enough.)

where d is my data.frame or data.table.

I thought that using the data.table indexing would speed things up, but not in this case.

Any ideas on how to speed this up?

Thanks!

--
Noah Silverman, M.S., C.Phil
UCLA Department of Statistics
8117 Math Sciences Building
Los Angeles, CA 90095
You need to show the statement in context with the rest of the script. You need to tell us what you want to do, not how you want to do it.

On Nov 20, 2013, at 15:16, Noah Silverman <noahsilverman at g.ucla.edu> wrote:

> The line with the slow process (According to Rprof) is:
>
>     j <- which( d$id == person )
>
> (I then process all the records indexed by j, which seems fast enough.)
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
What the Data Munger Guru said. Plus: this is almost certainly a job for ddply or data.table.

Noah Silverman-2 wrote
> I need to iterate over each person (record ID) in the data set, and then
> process their data for each date.
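To illustrate the data.table half of that suggestion, here is a minimal sketch of grouped processing with `by=`, so that no per-person lookup is written by hand. The toy data, the column names `id`, `date` and `x`, and the helper `summarise_person()` are all assumptions made up for the example; the original poster's actual processing step is not shown in the thread.

```r
library(data.table)

set.seed(1)
# Toy stand-in for the real data: 4 hypothetical people, 3 dates each
dt <- data.table(
  id   = rep(1:4, each = 3),
  date = rep(1:3, times = 4),
  x    = rnorm(12)
)

# One result row per person and date; data.table does the grouping itself
per_date <- dt[, .(n = .N, mean_x = mean(x)), by = .(id, date)]

# Hypothetical per-person step that receives all of one person's rows (.SD)
summarise_person <- function(rows) {
  list(n_dates = nrow(rows), total_x = sum(rows$x))
}
per_person <- dt[, summarise_person(.SD), by = id]

print(per_date)
print(per_person)
```

Whether this is faster than an explicit loop on the real data depends on what the processing step does, so it is worth timing both.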
I have some processes where I do the same thing, iterate over subsets of a data frame. My data frame has ~250,000 rows, 30 variables, and the subsets are such that there are about 6000 of them.

Performing a which() statement like yours seems quite fast. For example, wrapping unix.time() around the which() expression, I get

    user  system elapsed
   0.008   0.000   0.008

It's hard for me to imagine the single task of getting the indexes is slow enough to be a bottleneck.

On the other hand, if the variable being used to identify subsets is a factor with many levels (~6000 in my case), it is noticeably slower.

    user  system elapsed
   0.024   0.002   0.026

I haven't tested it, and have no real expectation that it will make a difference, but perhaps sorting by the index variable before iterating will help (if you haven't already). Since these are not true indexes in the sense used by relational database systems, maybe it will make a difference.

--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062

On 11/20/13 12:16 PM, "Noah Silverman" <noahsilverman at g.ucla.edu> wrote:

>The line with the slow process (According to Rprof) is:
>
>j <- which( d$id == person )
>
>(I then process all the records indexed by j, which seems fast enough.)
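For anyone who wants to repeat that comparison, here is a rough sketch of how the two timings could be taken. The data are simulated to match the sizes mentioned above (about 250,000 rows and 6000 distinct IDs); the names `d_int`, `d_fac` and `target` are invented for the example, and `system.time()` is used in place of its older alias `unix.time()`.

```r
set.seed(1)
n_ids <- 6000
# About 250,000 rows spread over 6000 hypothetical IDs, in random order
ids <- sample(rep(seq_len(n_ids), length.out = 250000))

d_int <- data.frame(id = ids, x = rnorm(length(ids)))   # integer ID column
d_fac <- data.frame(id = factor(ids), x = d_int$x)      # same IDs as a factor

target <- 42

# Time a single lookup against each representation
t_int <- system.time(j_int <- which(d_int$id == target))
t_fac <- system.time(j_fac <- which(d_fac$id == as.character(target)))

print(t_int)
print(t_fac)

# Both lookups should pick out exactly the same rows
print(identical(j_int, j_fac))
```

A single call is usually too quick to time reliably, so wrapping each lookup in a loop of a few hundred iterations gives steadier numbers.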
> The line with the slow process (According to Rprof) is:
>     j <- which( d$id == person )
> (I then process all the records indexed by j, which seems fast enough.)

Using split() once (and using its output in a loop) instead of == applied to a long vector many times, as in

    for(j in split(seq_along(d$id), d$id)) {
        # newdata[j,] <- process(d[j,])
    }

is typically faster. But this is the sort of thing that tapply() and the functions in package:plyr do for you.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Noah Silverman
> Sent: Wednesday, November 20, 2013 12:17 PM
> To: 'R-help at r-project.org'
> Subject: [R] Thoughts for faster indexing
>
> Any ideas on how to speed this up?
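The split() idea above can be sketched as a small runnable example. The toy data frame and the `process()` function (which centres `x` within each person) are hypothetical stand-ins, since the real processing step is not shown in the thread.

```r
# Toy data: 3 hypothetical people with a few records each
d <- data.frame(
  id   = c(1, 2, 1, 3, 2, 1),
  date = c(1, 1, 2, 1, 2, 3),
  x    = c(10, 20, 30, 40, 50, 60)
)

# Hypothetical per-person processing: centre x within the person
process <- function(rows) {
  rows$x_centred <- rows$x - mean(rows$x)
  rows
}

# Result container with the extra column the processing step adds
newdata <- d
newdata$x_centred <- NA_real_

# Build all row-index groups in one pass, rather than scanning
# d$id once per person with which(d$id == person)
groups <- split(seq_len(nrow(d)), d$id)
for (j in groups) {
  newdata[j, ] <- process(d[j, ])
}

# For a simple per-person summary, tapply() avoids the explicit loop
means <- tapply(d$x, d$id, mean)

print(newdata)
print(means)
```

The loop body still sees one person's rows at a time, so existing per-person code can usually be dropped in with little change.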
Not sure this helps but...

######
library(data.table)

# data frame with 30,000 ID's, each with 5 "dates", plus some random data...
df <- data.frame(id=rep(1:30000, each=5),
                 date=rep(1:5, times=30000),
                 x=rnorm(150000),
                 y=rnorm(150000, mean=1),
                 z=rnorm(150000, mean=3))

dt <- data.table(df, key="id")    # note you have to set the key...

# No difference when using which
system.time(for (i in 1:300) {j <- which(df$id==i)})
   user  system elapsed
   0.73    0.06    0.79
system.time(for (i in 1:300) {j <- which(dt$id==i)})
   user  system elapsed
   0.69    0.04    0.76

# 20 X faster using joins
system.time(for (i in 1:300) {select <- df[df$id==i,]})
   user  system elapsed
  19.25    0.36   19.64
system.time(for (i in 1:300) {select <- dt[id==i,]})
   user  system elapsed
   4.32    0.11    4.45
system.time(for (i in 1:300) {select <- dt[J(i)]})
   user  system elapsed
   0.88    0.00    0.88
######

Note that extracting select with a data.table join still took longer than generating an "index" using which(), but having all the columns in one step, instead of just the index, might speed up later operations.
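As a small check on the keyed-join idea above, the sketch below confirms that `dt[J(i)]` returns the same rows as the vector scan `dt[id == i]`, and shows that several IDs can be looked up in one join. The data are a scaled-down, made-up version of the example; newer data.table releases may also optimise the `id == i` form on their own, so the timing gap reported in the thread may not carry over.

```r
library(data.table)

set.seed(1)
# Scaled-down version of the example: 1000 hypothetical IDs, 5 dates each
df <- data.frame(id   = rep(1:1000, each = 5),
                 date = rep(1:5, times = 1000),
                 x    = rnorm(5000))

dt <- as.data.table(df)
setkey(dt, id)            # sort by id and mark it as the key

i <- 42L
by_scan <- dt[id == i]    # logical comparison on the id column
by_join <- dt[J(i)]       # lookup on the key

# Both approaches should return the same five rows
print(identical(by_scan$x, by_join$x))

# Several people can be fetched in a single keyed join
several <- dt[J(c(1L, 2L, 3L))]
print(nrow(several))
```

Fetching many IDs in one join, or grouping with `by = id`, avoids calling the lookup once per person at all.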