I recently used gbm for a binary classification problem. As expected, it gets very good results, as measured by area under the ROC curve with 7-fold cross-validation. However, the application (malware detection) is cost-sensitive: a false positive (classifying a clean sample as dirty) is much worse than a false negative (missing a dirty sample). I would like to tune the gbm model toward a very low FP rate. The metric I use is the area under the ROC curve, cut at a 1% FP rate; the higher the better.

For this purpose I tried both a weighting and a sampling strategy, but neither works as I expect yet. I noticed that there is a weight vector, so I overweighted the clean side (10 for each clean sample, 1 for each dirty sample), but I don't see a big difference from gbm modeling without weighting. I also tried feeding imbalanced data into gbm (in the dataset, clean samples outnumber dirty samples 10 to 1), but that does not work either. I think I am missing something here. I would very much appreciate it if anyone with similar experience could advise me on how to implement cost-sensitive classification with gbm.

The following is the gbm modeling script I used:

model.gbm <- gbm.fit(tr[, 1:DIM], tr.y,
                     offset = NULL,
                     misc = NULL,
                     distribution = "bernoulli",
                     w = tr.w,
                     var.monotone = NULL,
                     n.trees = NTREE,
                     interaction.depth = TREEDEPTH,
                     n.minobsinnode = 10,
                     shrinkage = 0.05,
                     bag.fraction = BAGRATIO,
                     train.fraction = 1.0,
                     keep.data = TRUE,
                     verbose = TRUE,
                     var.names = NULL,
                     response.name = NULL)

or

model.gbm <- gbm(tr.y ~ .,
                 distribution = "bernoulli",
                 data = data.frame(cbind(tr[, 1:DIM], tr.y)),
                 weights = tr.w,
                 var.monotone = NULL,
                 n.trees = NTREE,
                 interaction.depth = TREEDEPTH,
                 n.minobsinnode = 10,
                 shrinkage = 0.05,
                 bag.fraction = 0.5,
                 train.fraction = 1.0,
                 cv.folds = 5,
                 keep.data = TRUE,
                 verbose = TRUE)
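For reference, here is a minimal sketch of how the weight vector and the truncated-AUC metric can be set up. Note that te and te.y (a held-out test set) are hypothetical names, and scoring with ROCR's fpr.stop is just one way to compute the cut-off AUC:

library(gbm)
library(ROCR)

## tr.y is assumed to be coded 0/1, with 0 = clean and 1 = dirty (malware);
## overweight the clean class 10:1 so that false positives cost more
tr.w <- ifelse(tr.y == 0, 10, 1)

## score a held-out set (te, te.y) with the fitted model
p <- predict(model.gbm, te, n.trees = NTREE, type = "response")

## area under the ROC curve, truncated at a 1% false-positive rate
pred <- prediction(p, te.y)
pauc <- performance(pred, measure = "auc", fpr.stop = 0.01)@y.values[[1]]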
------------------------------------
Yuchun Tang, Ph.D.
Principal Engineer, Lead
McAfee, Inc.
4800 North Point Parkway Suite 300
Alpharetta, GA 30022
Main: 678.904.9153
www.mcafee.com
www.trustedsource.org