Mary Kindall
2013-Oct-04 19:16 UTC
[R] Why does 'gbm' give me an error when I change the response from numeric to categorical?
This reproducible example is from the help page of 'gbm' in R.

I ran the following code in R, and it works fine as long as the response is
numeric. The problem starts when I convert the response from numeric to
binary (0/1): it then gives me an error.

My question is: can converting the response from numeric to binary really
have this much effect?
Help page code:
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
# fit initial model
gbm1 <-
gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
    data=data,                   # dataset
    var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                 # +1: monotone increase,
                                 # 0: no monotone restrictions
    distribution="gaussian",     # see the help for other choices
    n.trees=1000,                # number of trees
    shrinkage=0.05,              # shrinkage or learning rate,
                                 # 0.001 to 0.1 usually work
    interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,        # fraction of data for training,
                                 # first train.fraction*N used for training
    n.minobsinnode = 10,         # minimum total weight needed in each node
    cv.folds = 3,                # do 3-fold cross-validation
    keep.data=TRUE,              # keep a copy of the dataset with the object
    verbose=FALSE)               # don't print out progress
gbm1
summary(gbm1)
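As a side note, the iteration that the print method reports as best can be
queried directly with the gbm package's gbm.perf(); a minimal sketch,
assuming the gbm1 fit above succeeded:

# estimate the optimal number of trees from the 3-fold CV error
best.iter <- gbm.perf(gbm1, method = "cv")
print(best.iter)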
Now I slightly change the response variable to make it binary.
Y[Y < mean(Y)] = 0 #My edit
Y[Y >= mean(Y)] = 1 #My edit
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit
gbm2 <-
gbm(fmla,                        # formula
    data=data,                   # dataset
    distribution="bernoulli",    # My edit
    n.trees=1000,                # number of trees
    shrinkage=0.05,              # shrinkage or learning rate,
                                 # 0.001 to 0.1 usually work
    interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,        # fraction of data for training,
                                 # first train.fraction*N used for training
    n.minobsinnode = 10,         # minimum total weight needed in each node
    cv.folds = 3,                # do 3-fold cross-validation
    keep.data=TRUE,              # keep a copy of the dataset with the object
    verbose=FALSE)               # don't print out progress
gbm2
> gbm2
gbm(formula = fmla, distribution = "bernoulli", data = data,
n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10,
shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5,
cv.folds = 3, keep.data = TRUE, verbose = FALSE)
A gradient boosted model with bernoulli loss function.
1000 iterations were performed.
The best cross-validation iteration was .
The best test-set iteration was .
Error in 1:n.trees : argument of length 0
My question is: can binarizing the response have so much effect that the
model does not find anything useful in the predictors?
Thanks
--
-------------
Mary Kindall
Yorktown Heights, NY
USA
Bert Gunter
2013-Oct-04 19:26 UTC
[R] Why does 'gbm' give me an error when I change the response from numeric to categorical?
"My question is, Is binarizing the response will have so much effect that it does not find anythin useful in the predictors?" Yes. Dichotomizing throws away most of the information in the data. Which is why you shouldn't do it. This is a statistics, not an R question, so any follow-up should be posted on a statistical list like stats.stackexchange.com, not here. -- Bert On Fri, Oct 4, 2013 at 12:16 PM, Mary Kindall <mary.kindall at gmail.com> wrote:> This reproducible example is from the help of 'gbm' in R. > > I ran the following code in R, and works fine as long as the response is > numeric. The problem starts when I convert the response from numeric to > binary (0/1). It gives me an error. > > My question is, is converting the response from numeric to binary will have > this much effect. > > Help page code: > > N <- 1000 > X1 <- runif(N) > X2 <- 2*runif(N) > X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1]) > X4 <- factor(sample(letters[1:6],N,replace=TRUE)) > X5 <- factor(sample(letters[1:3],N,replace=TRUE)) > X6 <- 3*runif(N) > mu <- c(-1,0,1,2)[as.numeric(X3)] > > SNR <- 10 # signal-to-noise ratio > Y <- X1**1.5 + 2 * (X2**.5) + mu > sigma <- sqrt(var(Y)/SNR) > Y <- Y + rnorm(N,0,sigma) > > # introduce some missing values > X1[sample(1:N,size=500)] <- NA > X4[sample(1:N,size=300)] <- NA > > data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6) > > # fit initial model > gbm1 <- > gbm(Y~X1+X2+X3+X4+X5+X6, # formula > data=data, # dataset > var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease, > # +1: monotone increase, > # 0: no monotone restrictions > distribution="gaussian", # see the help for other choices > n.trees=1000, # number of trees > shrinkage=0.05, # shrinkage or learning rate, > # 0.001 to 0.1 usually work > interaction.depth=3, # 1: additive model, 2: two-way > interactions, etc. > bag.fraction = 0.5, # subsampling fraction, 0.5 is probably > best > train.fraction = 0.5, # fraction of data for training, > # first train.fraction*N used for training > n.minobsinnode = 10, # minimum total weight needed in each > node > cv.folds = 3, # do 3-fold cross-validation > keep.data=TRUE, # keep a copy of the dataset with the > object > verbose=FALSE) # don't print out progress > > gbm1 > summary(gbm1) > > > Now I slightly change the response variable to make it binary. > > Y[Y < mean(Y)] = 0 #My edit > Y[Y >= mean(Y)] = 1 #My edit > data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6) > fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit > > gbm2 <- > gbm(fmla, # formula > data=data, # dataset > distribution="bernoulli", # My edit > n.trees=1000, # number of trees > shrinkage=0.05, # shrinkage or learning rate, > # 0.001 to 0.1 usually work > interaction.depth=3, # 1: additive model, 2: two-way > interactions, etc. > bag.fraction = 0.5, # subsampling fraction, 0.5 is probably > best > train.fraction = 0.5, # fraction of data for training, > # first train.fraction*N used for training > n.minobsinnode = 10, # minimum total weight needed in each > node > cv.folds = 3, # do 3-fold cross-validation > keep.data=TRUE, # keep a copy of the dataset with the > object > verbose=FALSE) # don't print out progress > > gbm2 > > >> gbm2 > gbm(formula = fmla, distribution = "bernoulli", data = data, > n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10, > shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5, > cv.folds = 3, keep.data = TRUE, verbose = FALSE) > A gradient boosted model with bernoulli loss function. > 1000 iterations were performed. 
> The best cross-validation iteration was . > The best test-set iteration was . > Error in 1:n.trees : argument of length 0 > > > My question is, Is binarizing the response will have so much effect that it > does not find anythin useful in the predictors? > > Thanks > > -- > ------------- > Mary Kindall > Yorktown Heights, NY > USA > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Bert Gunter Genentech Nonclinical Biostatistics (650) 467-7374
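To illustrate how much a median split can cost, here is a small sketch with
made-up numbers (the predictor x, response y, and noise level are purely
hypothetical):

set.seed(1)
x <- runif(1000)
y <- x + rnorm(1000, sd = 0.3)        # continuous response with a clear signal
cor(x, y)                             # around 0.7
cor(x, as.integer(y >= mean(y)))      # noticeably smaller after dichotomizing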
Marc Schwartz
2013-Oct-04 19:34 UTC
[R] Why does 'gbm' give me an error when I change the response from numeric to categorical?
On Oct 4, 2013, at 2:16 PM, Mary Kindall <mary.kindall at gmail.com> wrote:

> Can binarizing the response have so much effect that the model does not
> find anything useful in the predictors?

Sure, it's possible.

See this page for a good overview of why you should not dichotomize
continuous data:

http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous

Regards,

Marc Schwartz
peter dalgaard
2013-Oct-04 19:35 UTC
[R] Why does 'gbm' give me an error when I change the response from numeric to categorical?
On Oct 4, 2013, at 21:16, Mary Kindall wrote:

> Y[Y < mean(Y)] = 0 #My edit
> Y[Y >= mean(Y)] = 1 #My edit

I have no clue about gbm, but I don't think the above does what I think you
think it does.

Y <- as.integer(Y >= mean(Y))

might be closer to the mark.

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
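A quick demonstration of the pitfall Peter describes, using a hypothetical
vector chosen to expose it: mean(Y) is re-evaluated after the first
assignment, so the two lines apply different thresholds.

y <- c(-10, 0.5, 2, 3)
y[y < mean(y)] <- 0     # mean(y) is -1.125 here, so only -10 is zeroed
y[y >= mean(y)] <- 1    # mean(y) has shifted to 1.375, so 0.5 survives
y                       # 0.0 0.5 1.0 1.0 -- not binary

# Evaluating mean() once, as suggested above, always yields 0/1:
y <- c(-10, 0.5, 2, 3)
as.integer(y >= mean(y))   # 0 1 1 1

Note also that gbm's "bernoulli" distribution is documented for 0-1
outcomes, so (an educated guess, not verified here) passing the 0/1 vector Y
directly in the formula, rather than wrapping it in factor(), may be what
the bernoulli fit expects.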