Mary Kindall
2013-Oct-04 19:16 UTC
[R] Why is 'gbm' giving me an error when I change the response from numeric to categorical?
This reproducible example is from the help page of 'gbm' in R. I ran the
following code, and it works fine as long as the response is numeric. The
problem starts when I convert the response from numeric to binary (0/1):
then it gives me an error. My question is: can converting the response from
numeric to binary really have this much effect?

Help page code:

N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# fit initial model
gbm1 <- gbm(Y~X1+X2+X3+X4+X5+X6,   # formula
    data=data,                     # dataset
    var.monotone=c(0,0,0,0,0,0),   # -1: monotone decrease,
                                   # +1: monotone increase,
                                   #  0: no monotone restrictions
    distribution="gaussian",       # see the help for other choices
    n.trees=1000,                  # number of trees
    shrinkage=0.05,                # shrinkage or learning rate,
                                   # 0.001 to 0.1 usually work
    interaction.depth=3,           # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,            # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,          # fraction of data for training,
                                   # first train.fraction*N used for training
    n.minobsinnode = 10,           # minimum total weight needed in each node
    cv.folds = 3,                  # do 3-fold cross-validation
    keep.data=TRUE,                # keep a copy of the dataset with the object
    verbose=FALSE)                 # don't print out progress

gbm1
summary(gbm1)

Now I slightly change the response variable to make it binary:

Y[Y < mean(Y)] = 0   # My edit
Y[Y >= mean(Y)] = 1  # My edit
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6)  # My edit

gbm2 <- gbm(fmla,                  # formula
    data=data,                     # dataset
    distribution="bernoulli",      # My edit
    n.trees=1000,                  # number of trees
    shrinkage=0.05,                # shrinkage or learning rate,
                                   # 0.001 to 0.1 usually work
    interaction.depth=3,           # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,            # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,          # fraction of data for training,
                                   # first train.fraction*N used for training
    n.minobsinnode = 10,           # minimum total weight needed in each node
    cv.folds = 3,                  # do 3-fold cross-validation
    keep.data=TRUE,                # keep a copy of the dataset with the object
    verbose=FALSE)                 # don't print out progress

gbm2

> gbm2
gbm(formula = fmla, distribution = "bernoulli", data = data,
    n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10,
    shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5,
    cv.folds = 3, keep.data = TRUE, verbose = FALSE)
A gradient boosted model with bernoulli loss function.
1000 iterations were performed.
The best cross-validation iteration was .
The best test-set iteration was .
Error in 1:n.trees : argument of length 0

My question is: will binarizing the response have so much effect that the
model does not find anything useful in the predictors?

Thanks

--
-------------
Mary Kindall
Yorktown Heights, NY
USA
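For reference, a minimal sketch of a call that should avoid this error,
assuming (as documented in ?gbm) that distribution = "bernoulli" expects the
response to be a plain numeric 0/1 vector rather than a factor. It restarts
from the continuous Y of the help-page code and is illustrative only, not
the fix given in this thread:

# Sketch: keep the response numeric 0/1 and drop factor() from the formula.
# gbm's "bernoulli" distribution is documented to require a 0/1 response;
# passing factor(Y) may be what leaves the best-iteration estimate empty and
# triggers the length-0 error when the fit is printed.
library(gbm)
Y <- as.integer(Y >= mean(Y))    # binarize once, against the original mean
data <- data.frame(Y=Y, X1=X1, X2=X2, X3=X3, X4=X4, X5=X5, X6=X6)
gbm2 <- gbm(Y ~ X1 + X2 + X3 + X4 + X5 + X6,
            data = data,
            distribution = "bernoulli",
            n.trees = 1000,
            shrinkage = 0.05,
            interaction.depth = 3,
            bag.fraction = 0.5,
            train.fraction = 0.5,
            n.minobsinnode = 10,
            cv.folds = 3,
            keep.data = TRUE,
            verbose = FALSE)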
Bert Gunter
2013-Oct-04 19:26 UTC
[R] Why is 'gbm' giving me an error when I change the response from numeric to categorical?
"My question is, Is binarizing the response will have so much effect that it does not find anythin useful in the predictors?" Yes. Dichotomizing throws away most of the information in the data. Which is why you shouldn't do it. This is a statistics, not an R question, so any follow-up should be posted on a statistical list like stats.stackexchange.com, not here. -- Bert On Fri, Oct 4, 2013 at 12:16 PM, Mary Kindall <mary.kindall at gmail.com> wrote:> This reproducible example is from the help of 'gbm' in R. > > I ran the following code in R, and works fine as long as the response is > numeric. The problem starts when I convert the response from numeric to > binary (0/1). It gives me an error. > > My question is, is converting the response from numeric to binary will have > this much effect. > > Help page code: > > N <- 1000 > X1 <- runif(N) > X2 <- 2*runif(N) > X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1]) > X4 <- factor(sample(letters[1:6],N,replace=TRUE)) > X5 <- factor(sample(letters[1:3],N,replace=TRUE)) > X6 <- 3*runif(N) > mu <- c(-1,0,1,2)[as.numeric(X3)] > > SNR <- 10 # signal-to-noise ratio > Y <- X1**1.5 + 2 * (X2**.5) + mu > sigma <- sqrt(var(Y)/SNR) > Y <- Y + rnorm(N,0,sigma) > > # introduce some missing values > X1[sample(1:N,size=500)] <- NA > X4[sample(1:N,size=300)] <- NA > > data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6) > > # fit initial model > gbm1 <- > gbm(Y~X1+X2+X3+X4+X5+X6, # formula > data=data, # dataset > var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease, > # +1: monotone increase, > # 0: no monotone restrictions > distribution="gaussian", # see the help for other choices > n.trees=1000, # number of trees > shrinkage=0.05, # shrinkage or learning rate, > # 0.001 to 0.1 usually work > interaction.depth=3, # 1: additive model, 2: two-way > interactions, etc. > bag.fraction = 0.5, # subsampling fraction, 0.5 is probably > best > train.fraction = 0.5, # fraction of data for training, > # first train.fraction*N used for training > n.minobsinnode = 10, # minimum total weight needed in each > node > cv.folds = 3, # do 3-fold cross-validation > keep.data=TRUE, # keep a copy of the dataset with the > object > verbose=FALSE) # don't print out progress > > gbm1 > summary(gbm1) > > > Now I slightly change the response variable to make it binary. > > Y[Y < mean(Y)] = 0 #My edit > Y[Y >= mean(Y)] = 1 #My edit > data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6) > fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit > > gbm2 <- > gbm(fmla, # formula > data=data, # dataset > distribution="bernoulli", # My edit > n.trees=1000, # number of trees > shrinkage=0.05, # shrinkage or learning rate, > # 0.001 to 0.1 usually work > interaction.depth=3, # 1: additive model, 2: two-way > interactions, etc. > bag.fraction = 0.5, # subsampling fraction, 0.5 is probably > best > train.fraction = 0.5, # fraction of data for training, > # first train.fraction*N used for training > n.minobsinnode = 10, # minimum total weight needed in each > node > cv.folds = 3, # do 3-fold cross-validation > keep.data=TRUE, # keep a copy of the dataset with the > object > verbose=FALSE) # don't print out progress > > gbm2 > > >> gbm2 > gbm(formula = fmla, distribution = "bernoulli", data = data, > n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10, > shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5, > cv.folds = 3, keep.data = TRUE, verbose = FALSE) > A gradient boosted model with bernoulli loss function. > 1000 iterations were performed. 
> The best cross-validation iteration was . > The best test-set iteration was . > Error in 1:n.trees : argument of length 0 > > > My question is, Is binarizing the response will have so much effect that it > does not find anythin useful in the predictors? > > Thanks > > -- > ------------- > Mary Kindall > Yorktown Heights, NY > USA > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Bert Gunter Genentech Nonclinical Biostatistics (650) 467-7374
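To make the point concrete, a small sketch (illustrative, not from the
thread): the squared correlation between a predictor and the response drops
once the response is split at its mean.

# Sketch: dichotomizing a continuous response discards information.
set.seed(1)
x <- runif(1000)
y <- x + rnorm(1000, 0, 0.3)     # continuous response driven by x
cor(x, y)^2                      # R^2 with the continuous response
yb <- as.integer(y >= mean(y))   # response split at its mean
cor(x, yb)^2                     # noticeably smaller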
Marc Schwartz
2013-Oct-04 19:34 UTC
[R] Why is 'gbm' giving me an error when I change the response from numeric to categorical?
On Oct 4, 2013, at 2:16 PM, Mary Kindall <mary.kindall at gmail.com> wrote:

> [quoted text snipped]
>
> My question is: will binarizing the response have so much effect that it
> does not find anything useful in the predictors?
>
> Thanks

Sure, it's possible.

See this page for a good overview of why you should not dichotomize
continuous data:

http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous

Regards,

Marc Schwartz
peter dalgaard
2013-Oct-04 19:35 UTC
[R] Why is 'gbm' giving me an error when I change the response from numeric to categorical?
On Oct 4, 2013, at 21:16 , Mary Kindall wrote:

> Y[Y < mean(Y)] = 0   # My edit
> Y[Y >= mean(Y)] = 1  # My edit

I have no clue about gbm, but I don't think the above does what I think you
think it does: the first assignment modifies Y, so the second line compares
against the mean of the already-modified vector.

Y <- as.integer(Y >= mean(Y))

might be closer to the mark.

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk   Priv: PDalgd at gmail.com
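A small sketch (illustrative, not from the thread) of the failure mode Peter
is pointing at: because the first assignment changes Y, the second line
thresholds against a different mean, and with negative values the recode
goes visibly wrong.

# Sketch: the two-step recode compares against two different means.
Y <- c(-2, -1, -4, -5)
Y[Y < mean(Y)] <- 0     # mean(Y) is -3;    Y becomes -2 -1  0  0
Y[Y >= mean(Y)] <- 1    # mean(Y) is -0.75; Y becomes -2 -1  1  1
Y                       # -2 -1  1  1 -- not a 0/1 vector at all
as.integer(c(-2, -1, -4, -5) >= -3)   # 1 1 0 0, the intended recode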