I recently used gbm for a binary classification problem. As expected, it gives
very good results, as measured by the area under the ROC curve with 7-fold
cross validation. However, the application (malware detection) is
cost-sensitive: a false positive (FP, classifying a clean sample as dirty) is
much worse than a false negative (FN, missing a dirty sample). I would like to
tune the gbm model so that it is biased toward a very low FP rate.

For this purpose, I tried both a weighting and a sampling strategy, but neither
works as I expected. I noticed that gbm accepts a weight vector, so I tried to
overweight the clean side (a weight of 10 for each clean sample and 1 for each
dirty sample), but I don't see a big difference from the gbm model fitted
without weights. I also tried feeding an imbalanced dataset into gbm (one in
which clean samples are 10 times more numerous than dirty samples), but that
did not work either.

The metric I use is the area under the ROC curve, cut at a 1% FP rate. The
higher the better.
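
To make the metric concrete, here is a minimal base-R sketch of the area under the ROC curve restricted to FP rates at or below a cutoff. It is not the script used for the results above. The function name partial_auc, its arguments, and the label coding (0 = clean, 1 = dirty, higher score meaning "more likely dirty") are assumptions made for illustration.

```r
# Partial AUC: area under the ROC curve for FPR in [0, max.fpr].
# Assumes y is coded 0 = clean (negative) and 1 = dirty (positive),
# and that a higher score means "more likely dirty".
partial_auc <- function(score, y, max.fpr = 0.01) {
  ord <- order(score, decreasing = TRUE)
  s <- score[ord]
  y <- y[ord]
  # Keep one ROC point per distinct score (the last of each tied run).
  keep <- !duplicated(s, fromLast = TRUE)
  tpr <- c(0, cumsum(y == 1)[keep] / sum(y == 1))
  fpr <- c(0, cumsum(y == 0)[keep] / sum(y == 0))
  # Split the curve into segments and clip each one at max.fpr.
  x0 <- head(fpr, -1); x1 <- tail(fpr, -1)
  y0 <- head(tpr, -1); y1 <- tail(tpr, -1)
  x0c <- pmin(x0, max.fpr)
  x1c <- pmin(x1, max.fpr)
  slope <- ifelse(x1 > x0, (y1 - y0) / (x1 - x0), 0)
  y1c <- y0 + slope * (x1c - x0)  # TPR at the clipped right end
  # Trapezoid rule; segments entirely beyond the cutoff have zero width.
  sum((x1c - x0c) * (y0 + y1c) / 2)
}

# Hypothetical toy data: a perfect ranking reaches the maximum value, max.fpr.
partial_auc(c(0.9, 0.8, 0.2, 0.1), c(1, 1, 0, 0), max.fpr = 0.01)
```

The value is not normalized, so with a 1% cutoff it ranges from 0 to 0.01.
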
I think I am missing something here. Does anyone have similar experience, and
can you advise me on how to implement cost-sensitive classification with gbm?
The gbm modeling script I used is one of the following:
model.gbm <- gbm.fit(tr[, 1:DIM], tr.y,
                     offset = NULL, misc = NULL, distribution = "bernoulli",
                     w = tr.w, var.monotone = NULL, n.trees = NTREE,
                     interaction.depth = TREEDEPTH, n.minobsinnode = 10,
                     shrinkage = 0.05, bag.fraction = BAGRATIO,
                     train.fraction = 1.0, keep.data = TRUE, verbose = TRUE,
                     var.names = NULL, response.name = NULL)
or 
model.gbm <- gbm(tr.y ~ ., distribution = "bernoulli",
                 data = data.frame(cbind(tr[, 1:DIM], tr.y)),
                 weights = tr.w, var.monotone = NULL, n.trees = NTREE,
                 interaction.depth = TREEDEPTH, n.minobsinnode = 10,
                 shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 1.0,
                 cv.folds = 5, keep.data = TRUE, verbose = TRUE)
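
The construction of tr.w is not shown in the calls above. A minimal sketch of the 10:1 weighting described in the text might look like this, assuming tr.y is coded 0 = clean and 1 = dirty (the post does not state the coding) and using a hypothetical toy label vector:

```r
# Hypothetical toy labels; assumes 0 = clean, 1 = dirty.
tr.y <- c(0, 0, 0, 1, 1)

# Weight of 10 for each clean sample and 1 for each dirty sample.
tr.w <- ifelse(tr.y == 0, 10, 1)

# Sanity check: total weight carried by each class.
tapply(tr.w, tr.y, sum)
```
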
 
------------------------------------
Yuchun Tang, Ph.D.
Principal Engineer, Lead
 
McAfee, Inc.
4800 North Point Parkway
Suite 300
Alpharetta,
GA  30022
 
Main:     678.904.9153
www.mcafee.com www.trustedsource.org
      