Mike Williamson
2010-Jun-30 23:53 UTC
[R] anyone know why package "RandomForest" na.roughfix is so slow??
Hi all,

I am using the "randomForest" package for random forest predictions. I like the package. However, I have fairly large data sets, and it can often take *hours* just to get through the na.roughfix() call, which simply replaces any NA values with either the median (numeric data) or the most frequent level (factors).

I am going to start doing some comparisons between na.roughfix() and some apply() functions which, it seems, can do the same job more quickly. But I hesitate to duplicate a function that is already in the package, since I presume na.roughfix() should be as quick as possible and should also be well "tailored" to the requirements of random forest.

Has anyone else found this to be really slow? (I haven't noticed rfImpute() to be nearly as slow, but I cannot say for sure: my "predict" data sets are MUCH larger than my model data sets, so cleaning the prediction data set simply takes much longer.) If so, any ideas on how to speed this up?

Thanks!
Mike
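For readers comparing approaches, the kind of column-wise replacement described above (median for numeric columns, most frequent level for factors) might look roughly like the sketch below in base R. The helper name rough_fix_sketch and the toy data are hypothetical; this is not the package's own implementation and has not been benchmarked against na.roughfix().

```r
# Hypothetical helper (not part of randomForest): fill NAs column by column.
rough_fix_sketch <- function(df) {
  df[] <- lapply(df, function(col) {
    miss <- is.na(col)
    if (!any(miss)) return(col)            # nothing to fill in this column
    if (is.numeric(col)) {
      # Numeric column: replace NAs with the median of the observed values.
      col[miss] <- median(col, na.rm = TRUE)
    } else if (is.factor(col)) {
      # Factor column: replace NAs with the most frequent level.
      counts <- table(col)                 # table() ignores NA by default
      col[miss] <- names(counts)[which.max(counts)]
    }
    col                                    # other column types pass through unchanged
  })
  df
}

# Toy data, made up for illustration only.
toy <- data.frame(
  x = c(1, NA, 3, 10),
  f = factor(c("a", "b", NA, "b"))
)
fixed <- rough_fix_sketch(toy)
```

Because lapply() visits each column once, the cost should grow roughly with the size of the data; whether it actually beats na.roughfix() on a given data set would need to be timed.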
jim holtman
2010-Jul-01 00:26 UTC
[R] anyone know why package "RandomForest" na.roughfix is so slow??
Use Rprof() to determine where the time is being spent. This might point out some problems in the code.

--
Jim Holtman
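As a rough illustration of the profiling suggestion, the sketch below wraps Rprof() and summaryRprof() (both from base R's utils package) around an arbitrary expression. The wrapper name profile_call and the stand-in workload are hypothetical; in the poster's case the expression would be the na.roughfix() or randomForest() call.

```r
# Hypothetical wrapper: profile one expression and return the summary.
profile_call <- function(expr, interval = 0.01) {
  prof_file <- tempfile(fileext = ".out")
  on.exit({
    Rprof(NULL)                  # make sure profiling is off even on error
    unlink(prof_file)
  })
  Rprof(prof_file, interval = interval)
  force(expr)                    # lazy evaluation: expr runs here, under the profiler
  Rprof(NULL)
  # summaryRprof() errors if no samples were recorded; return NULL in that case.
  tryCatch(summaryRprof(prof_file), error = function(e) NULL)
}

# Stand-in workload so the sketch runs without any extra packages.
res <- profile_call(for (i in 1:10) sort(runif(5e5)))

# Functions ranked by time spent in their own code, if any samples were taken.
if (!is.null(res)) print(head(res$by.self))

# With real data one might instead try (names taken from the thread):
#   profile_call(na.roughfix(mybigdata))
```

The by.self table shows where time is spent directly, while by.total includes time spent in callees; comparing the two usually indicates whether the imputation itself or something around it is the bottleneck.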
Liaw, Andy
2010-Jul-01 15:58 UTC
[R] anyone know why package "RandomForest" na.roughfix is so slow??
You have not shown any code on exactly how you use na.roughfix(), so I can only guess. If you are doing something like:

    randomForest(y ~ ., mybigdata, na.action=na.roughfix, ...)

I would not be surprised that it's taking very long on large datasets. Most likely it's caused by the formula interface, not na.roughfix() itself. If that is your case, try doing the imputation beforehand and run randomForest() afterward; e.g.,

    myroughfixed <- na.roughfix(mybigdata)
    randomForest(myroughfixed[list.of.predictor.columns],
                 myroughfixed[[myresponse]], ...)

HTH,
Andy
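A small end-to-end version of the "impute first, then fit without the formula interface" suggestion could look like the sketch below. The toy data frame, its column names, and ntree = 50 are assumptions made for illustration; the randomForest package is assumed to be installed, and the fitting step is skipped if it is not.

```r
# Toy data standing in for "mybigdata" (hypothetical values).
set.seed(1)
n <- 200
toy <- data.frame(
  y  = rnorm(n),
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$x1[sample(n, 20)] <- NA      # knock out some numeric values
toy$x2[sample(n, 20)] <- NA      # and some factor values

predictors <- c("x1", "x2")

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Impute once, up front, instead of passing na.action = na.roughfix.
  fixed <- randomForest::na.roughfix(toy)

  # x/y interface: predictors and response are passed separately, no formula.
  fit <- randomForest::randomForest(fixed[predictors], fixed[["y"]], ntree = 50)

  # New data would be imputed the same way before calling predict().
  preds <- predict(fit, fixed[predictors])
}
```

Wrapping each step in system.time() is one way to check how much of the original run time came from the formula interface rather than from the imputation.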