Mary Kindall
2013-Oct-04 19:16 UTC
[R] Why does 'gbm' give me an error when I change the response from numeric to categorical?
This reproducible example is from the help page of 'gbm' in R.

I ran the following code in R, and it works fine as long as the response is
numeric. The problem starts when I convert the response from numeric to
binary (0/1): it then gives me an error.

My question is: can converting the response from numeric to binary really
have this much effect?
Help page code:
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
# fit initial model
gbm1 <-
gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
    data=data,                   # dataset
    var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                 # +1: monotone increase,
                                 # 0: no monotone restrictions
    distribution="gaussian",     # see the help for other choices
    n.trees=1000,                # number of trees
    shrinkage=0.05,              # shrinkage or learning rate,
                                 # 0.001 to 0.1 usually work
    interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,        # fraction of data for training,
                                 # first train.fraction*N used for training
    n.minobsinnode = 10,         # minimum total weight needed in each node
    cv.folds = 3,                # do 3-fold cross-validation
    keep.data=TRUE,              # keep a copy of the dataset with the object
    verbose=FALSE)               # don't print out progress
gbm1
summary(gbm1)
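As a side note, the iteration that the print method reports as best can be
queried directly with the gbm package's gbm.perf(); a minimal sketch,
assuming the gbm1 fit above succeeded:

# estimate the optimal number of trees from the 3-fold CV error
best.iter <- gbm.perf(gbm1, method = "cv")
print(best.iter)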
Now I slightly change the response variable to make it binary.
Y[Y < mean(Y)] = 0 #My edit
Y[Y >= mean(Y)] = 1 #My edit
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit
gbm2 <-
gbm(fmla,                        # formula
    data=data,                   # dataset
    distribution="bernoulli",    # My edit
    n.trees=1000,                # number of trees
    shrinkage=0.05,              # shrinkage or learning rate,
                                 # 0.001 to 0.1 usually work
    interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,        # fraction of data for training,
                                 # first train.fraction*N used for training
    n.minobsinnode = 10,         # minimum total weight needed in each node
    cv.folds = 3,                # do 3-fold cross-validation
    keep.data=TRUE,              # keep a copy of the dataset with the object
    verbose=FALSE)               # don't print out progress
gbm2
> gbm2
gbm(formula = fmla, distribution = "bernoulli", data = data,
n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10,
shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5,
cv.folds = 3, keep.data = TRUE, verbose = FALSE)
A gradient boosted model with bernoulli loss function.
1000 iterations were performed.
The best cross-validation iteration was .
The best test-set iteration was .
Error in 1:n.trees : argument of length 0
My question is: can binarizing the response have so much effect that the
model does not find anything useful in the predictors?
Thanks
--
-------------
Mary Kindall
Yorktown Heights, NY
USA
Bert Gunter
2013-Oct-04 19:26 UTC
[R] Why does 'gbm' give me an error when I change the response from numeric to categorical?
"My question is, Is binarizing the response will have so much effect that it does not find anythin useful in the predictors?" Yes. Dichotomizing throws away most of the information in the data. Which is why you shouldn't do it. This is a statistics, not an R question, so any follow-up should be posted on a statistical list like stats.stackexchange.com, not here. -- Bert On Fri, Oct 4, 2013 at 12:16 PM, Mary Kindall <mary.kindall at gmail.com> wrote:> This reproducible example is from the help of 'gbm' in R. > > I ran the following code in R, and works fine as long as the response is > numeric. The problem starts when I convert the response from numeric to > binary (0/1). It gives me an error. > > My question is, is converting the response from numeric to binary will have > this much effect. > > Help page code: > > N <- 1000 > X1 <- runif(N) > X2 <- 2*runif(N) > X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1]) > X4 <- factor(sample(letters[1:6],N,replace=TRUE)) > X5 <- factor(sample(letters[1:3],N,replace=TRUE)) > X6 <- 3*runif(N) > mu <- c(-1,0,1,2)[as.numeric(X3)] > > SNR <- 10 # signal-to-noise ratio > Y <- X1**1.5 + 2 * (X2**.5) + mu > sigma <- sqrt(var(Y)/SNR) > Y <- Y + rnorm(N,0,sigma) > > # introduce some missing values > X1[sample(1:N,size=500)] <- NA > X4[sample(1:N,size=300)] <- NA > > data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6) > > # fit initial model > gbm1 <- > gbm(Y~X1+X2+X3+X4+X5+X6, # formula > data=data, # dataset > var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease, > # +1: monotone increase, > # 0: no monotone restrictions > distribution="gaussian", # see the help for other choices > n.trees=1000, # number of trees > shrinkage=0.05, # shrinkage or learning rate, > # 0.001 to 0.1 usually work > interaction.depth=3, # 1: additive model, 2: two-way > interactions, etc. > bag.fraction = 0.5, # subsampling fraction, 0.5 is probably > best > train.fraction = 0.5, # fraction of data for training, > # first train.fraction*N used for training > n.minobsinnode = 10, # minimum total weight needed in each > node > cv.folds = 3, # do 3-fold cross-validation > keep.data=TRUE, # keep a copy of the dataset with the > object > verbose=FALSE) # don't print out progress > > gbm1 > summary(gbm1) > > > Now I slightly change the response variable to make it binary. > > Y[Y < mean(Y)] = 0 #My edit > Y[Y >= mean(Y)] = 1 #My edit > data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6) > fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit > > gbm2 <- > gbm(fmla, # formula > data=data, # dataset > distribution="bernoulli", # My edit > n.trees=1000, # number of trees > shrinkage=0.05, # shrinkage or learning rate, > # 0.001 to 0.1 usually work > interaction.depth=3, # 1: additive model, 2: two-way > interactions, etc. > bag.fraction = 0.5, # subsampling fraction, 0.5 is probably > best > train.fraction = 0.5, # fraction of data for training, > # first train.fraction*N used for training > n.minobsinnode = 10, # minimum total weight needed in each > node > cv.folds = 3, # do 3-fold cross-validation > keep.data=TRUE, # keep a copy of the dataset with the > object > verbose=FALSE) # don't print out progress > > gbm2 > > >> gbm2 > gbm(formula = fmla, distribution = "bernoulli", data = data, > n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10, > shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5, > cv.folds = 3, keep.data = TRUE, verbose = FALSE) > A gradient boosted model with bernoulli loss function. > 1000 iterations were performed. 
> The best cross-validation iteration was . > The best test-set iteration was . > Error in 1:n.trees : argument of length 0 > > > My question is, Is binarizing the response will have so much effect that it > does not find anythin useful in the predictors? > > Thanks > > -- > ------------- > Mary Kindall > Yorktown Heights, NY > USA > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Bert Gunter Genentech Nonclinical Biostatistics (650) 467-7374
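To illustrate how much a median split can cost, here is a small sketch with
made-up numbers (the predictor x, response y, and noise level are purely
hypothetical):

set.seed(1)
x <- runif(1000)
y <- x + rnorm(1000, sd = 0.3)        # continuous response with a clear signal
cor(x, y)                             # around 0.7
cor(x, as.integer(y >= mean(y)))      # noticeably smaller after dichotomizing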
Marc Schwartz
2013-Oct-04 19:34 UTC
[R] Why does 'gbm' give me an error when I change the response from numeric to categorical?
On Oct 4, 2013, at 2:16 PM, Mary Kindall <mary.kindall at gmail.com> wrote:

> Can binarizing the response have so much effect that the model does not
> find anything useful in the predictors?

Sure, it's possible.

See this page for a good overview of why you should not dichotomize
continuous data:

http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous

Regards,

Marc Schwartz
peter dalgaard
2013-Oct-04 19:35 UTC
[R] Why does 'gbm' give me an error when I change the response from numeric to categorical?
On Oct 4, 2013, at 21:16, Mary Kindall wrote:

> Y[Y < mean(Y)] = 0 #My edit
> Y[Y >= mean(Y)] = 1 #My edit

I have no clue about gbm, but I don't think the above does what I think you
think it does.

Y <- as.integer(Y >= mean(Y))

might be closer to the mark.

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
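A quick demonstration of the pitfall Peter describes, using a hypothetical
vector chosen to expose it: mean(Y) is re-evaluated after the first
assignment, so the two lines apply different thresholds.

y <- c(-10, 0.5, 2, 3)
y[y < mean(y)] <- 0     # mean(y) is -1.125 here, so only -10 is zeroed
y[y >= mean(y)] <- 1    # mean(y) has shifted to 1.375, so 0.5 survives
y                       # 0.0 0.5 1.0 1.0 -- not binary

# Evaluating mean() once, as suggested above, always yields 0/1:
y <- c(-10, 0.5, 2, 3)
as.integer(y >= mean(y))   # 0 1 1 1

Note also that gbm's "bernoulli" distribution is documented for 0-1
outcomes, so (an educated guess, not verified here) passing the 0/1 vector Y
directly in the formula, rather than wrapping it in factor(), may be what
the bernoulli fit expects.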