I recently used gbm for a binary classification problem. As expected, it gets very good results, as measured by area under the ROC curve with 7-fold cross-validation. However, the application (malware detection) is cost-sensitive: a false positive (classifying a clean sample as dirty) is much worse than a false negative (missing a dirty sample). I would like to tune the gbm model toward a very low FP rate. The metric I use is the area under the ROC curve, cut at a 1% FP rate; the higher the better.

For this purpose I tried both a weighting and a sampling strategy, but neither works as I expect yet. I noticed that there is a weight vector, so I overweighted the clean side (10 for each clean sample, 1 for each dirty sample), but I don't see a big difference from gbm modeling without weighting. I also tried feeding imbalanced data into gbm (in the dataset, clean samples outnumber dirty samples 10 to 1), but that does not work either. I think I am missing something here. I would very much appreciate it if anyone with similar experience could advise me on how to implement cost-sensitive classification with gbm.

The following is the gbm modeling script I used:

model.gbm <- gbm.fit(tr[, 1:DIM], tr.y,
                     offset = NULL,
                     misc = NULL,
                     distribution = "bernoulli",
                     w = tr.w,
                     var.monotone = NULL,
                     n.trees = NTREE,
                     interaction.depth = TREEDEPTH,
                     n.minobsinnode = 10,
                     shrinkage = 0.05,
                     bag.fraction = BAGRATIO,
                     train.fraction = 1.0,
                     keep.data = TRUE,
                     verbose = TRUE,
                     var.names = NULL,
                     response.name = NULL)

or

model.gbm <- gbm(tr.y ~ .,
                 distribution = "bernoulli",
                 data = data.frame(cbind(tr[, 1:DIM], tr.y)),
                 weights = tr.w,
                 var.monotone = NULL,
                 n.trees = NTREE,
                 interaction.depth = TREEDEPTH,
                 n.minobsinnode = 10,
                 shrinkage = 0.05,
                 bag.fraction = 0.5,
                 train.fraction = 1.0,
                 cv.folds = 5,
                 keep.data = TRUE,
                 verbose = TRUE)
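For reference, here is a minimal sketch of how the weight vector and the truncated-AUC metric can be set up. Note that te and te.y (a held-out test set) are hypothetical names, and scoring with ROCR's fpr.stop is just one way to compute the cut-off AUC:

library(gbm)
library(ROCR)

## tr.y is assumed to be coded 0/1, with 0 = clean and 1 = dirty (malware);
## overweight the clean class 10:1 so that false positives cost more
tr.w <- ifelse(tr.y == 0, 10, 1)

## score a held-out set (te, te.y) with the fitted model
p <- predict(model.gbm, te, n.trees = NTREE, type = "response")

## area under the ROC curve, truncated at a 1% false-positive rate
pred <- prediction(p, te.y)
pauc <- performance(pred, measure = "auc", fpr.stop = 0.01)@y.values[[1]]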
------------------------------------
Yuchun Tang, Ph.D.
Principal Engineer, Lead
McAfee, Inc.
4800 North Point Parkway Suite 300
Alpharetta, GA 30022
Main: 678.904.9153
www.mcafee.com
www.trustedsource.org