Jack Arnestad
2018-Apr-14 00:31 UTC
[R] Efficient way to subset rows in R for dataset with 10^7 columns
I have a data.table with dimensions 100 by 10^7.

When I do

    trainIndex <- caret::createDataPartition(
        df$status,
        p = .9,
        list = FALSE,
        times = 1
    )
    outerTrain <- df[trainIndex]
    outerTest  <- df[-trainIndex]

subsetting the rows of df takes over 20 minutes.

What is the best way to efficiently subset this?

Thanks!
Jeff Newmiller
2018-Apr-14 01:08 UTC
[R] Efficient way to subset rows in R for dataset with 10^7 columns
You have 10^7 columns? That process is bound to be slow.
Jeff Newmiller
2018-Apr-14 02:07 UTC
[R] Efficient way to subset rows in R for dataset with 10^7 columns
Oh, there are ways, but the constraining issue here is moving data (memory bandwidth), and data.table is probably already the fastest mechanism for doing that. If you have a computer with four or more real cores, you can try setting up a subset of the columns in each task and cbind the results afterward, but it will be hard to accomplish without making extra copies of the data. You are probably already using virtual memory, which is saved to and from hard disk storage as needed. Working in Spark with a distributed file system like Hadoop might solve some of these problems... but I haven't done real work with such tools.

On April 13, 2018 6:31:32 PM PDT, Jack Arnestad <jackarnestad at gmail.com> wrote:
> Yes, unfortunately. The goal of the "outer" is to do feature selection
> before fitting it to a model.
>
> Is there a way it could be parallelized?
>
> Thanks!
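Below is a minimal sketch of the column-splitting idea suggested above. It assumes the data.table df and the trainIndex from the original post, a Unix-alike machine where parallel::mclapply can fork, and four workers; the chunk count and the helper names (idx, n_workers, col_chunks, pieces) are illustrative, not from the thread, and forking still copies any pages of df that each worker touches.

    ## Sketch only: split the columns into contiguous blocks, row-subset
    ## each block in a separate forked process, then cbind the pieces.
    library(data.table)
    library(parallel)

    ## createDataPartition(..., list = FALSE) returns a one-column matrix,
    ## so flatten it to a plain integer vector first.
    idx <- as.vector(trainIndex)

    n_workers  <- 4
    col_chunks <- split(seq_len(ncol(df)),
                        cut(seq_len(ncol(df)), n_workers, labels = FALSE))

    pieces <- mclapply(col_chunks, function(cols) {
        df[idx, cols, with = FALSE]          # training rows of this column block
    }, mc.cores = n_workers)
    outerTrain <- do.call(cbind, pieces)     # reassemble in original column order

    ## The test split works the same way with the complementary rows.
    outerTest <- do.call(cbind, mclapply(col_chunks, function(cols) {
        df[-idx, cols, with = FALSE]
    }, mc.cores = n_workers))

Whether this beats a single df[idx] depends on how much of the time is pure memory traffic versus per-column overhead; with only 100 rows and 10^7 columns, the copies made when the forked results come back to the parent process may eat most of the gain.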