Dear All,

I am far from being a guru about parallel programming. Most of the time, I rely on randomForest for data mining large datasets. I would also like to try the gradient boosted methods in GBM, but I need parallelization. I normally rely on gbm.fit for speed reasons, and I usually call it this way:

    gbm_model <- gbm.fit(trainRF, prices_train,
                         offset = NULL,
                         misc = NULL,
                         distribution = "multinomial",
                         w = NULL,
                         var.monotone = NULL,
                         n.trees = 50,
                         interaction.depth = 5,
                         n.minobsinnode = 10,
                         shrinkage = 0.001,
                         bag.fraction = 0.5,
                         nTrain = (n_train/2),
                         keep.data = FALSE,
                         verbose = TRUE,
                         var.names = NULL,
                         response.name = NULL)

Does anybody know an easy way to parallelize the model (in this case it means simply having 4 cores on the same machine working on the problem)? Any suggestion is welcome.

Cheers

Lorenzo
See this:

https://code.google.com/p/gradientboostedmodels/issues/detail?id=3

and this:

https://code.google.com/p/gradientboostedmodels/source/browse/?name=parallel

Max

On Sun, Mar 24, 2013 at 7:31 AM, Lorenzo Isella <lorenzo.isella@gmail.com> wrote:
> Does anybody know an easy way to parallelize the model (in this case it
> means simply having 4 cores on the same machine working on the problem)?
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
Thanks a lot for the quick answer. However, from what I see, the parallelization affects only the cross-validation part in the gbm interface (but it changes nothing when you call gbm.fit). Am I missing anything here? Is there any fundamental reason why gbm.fit cannot be parallelized?

Lorenzo

On Sun, 24 Mar 2013 12:45:39 +0100, Max Kuhn <mxkuhn at gmail.com> wrote:
> See this:
> https://code.google.com/p/gradientboostedmodels/issues/detail?id=3
> and this:
> https://code.google.com/p/gradientboostedmodels/source/browse/?name=parallel
Yes, I think the second link is a test build of a parallelized cv loop within gbm().

On Mar 24, 2013, at 9:28 AM, "Lorenzo Isella" <lorenzo.isella at gmail.com> wrote:
> However, from what I see, the parallelization affects only the cross-validation part in the gbm interface (but it changes nothing when you call gbm.fit).
> Am I missing anything here?
> Is there any fundamental reason why gbm.fit cannot be parallelized?
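
For readers who land on this thread: boosting fits its trees one after another, each on the errors left by the previous ones, so a single gbm.fit call does not split naturally across cores. What can run in parallel is a set of independent fits, such as the folds of a cross-validation loop. The following is only a rough sketch of that idea using the parallel package that ships with R; it is not the code from the linked branch. The toy data, the fold assignment, the helper name fit_fold and the "gaussian" distribution are all assumptions made for illustration, standing in for trainRF, prices_train and the "multinomial" setting of the original call.

```r
library(parallel)  # ships with base R
library(gbm)       # assumed to be installed

# Hypothetical toy data standing in for trainRF / prices_train.
set.seed(1)
n <- 400
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- x$a - 2 * x$b + rnorm(n, sd = 0.1)

# Assign each row to one of k folds (one fold per core here).
k <- 4
fold_id <- sample(rep(seq_len(k), length.out = n))

# Hypothetical helper: fit on all folds but one and return the
# mean squared error on the held-out fold.
fit_fold <- function(i) {
  held_out <- fold_id == i
  model <- gbm.fit(x[!held_out, , drop = FALSE], y[!held_out],
                   distribution = "gaussian",  # assumption, not the poster's setting
                   n.trees = 50,
                   interaction.depth = 5,
                   n.minobsinnode = 10,
                   shrinkage = 0.1,
                   bag.fraction = 0.5,
                   verbose = FALSE)
  pred <- predict(model, x[held_out, , drop = FALSE], n.trees = 50)
  mean((y[held_out] - pred)^2)
}

# mclapply relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 4L
cv_error <- unlist(mclapply(seq_len(k), fit_fold, mc.cores = n_cores))
mean(cv_error)
```

Each fold is an ordinary serial gbm.fit run, so this shortens the wall-clock time of the cross-validation as a whole but leaves the time of any single fit unchanged.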