My data is 50,000 instances of about 200 predictor values, and for all 50,000 examples I have the actual class labels (binary). The data is quite unbalanced, with about 10% or less of the examples having a positive outcome and the remainder, of course, negative. Nothing suggests the data has any order, and it doesn't appear to have any, so I've pulled the first 30,000 examples to use as training data, reserving the remainder for test data.

There are actually 3 distinct sets of class labels associated with the predictor data, and I've built 3 distinct models. When each model is used in predict() with the training data and true class labels, I get AUC values of 0.95, 0.98 and 0.98 for the 3 classifier problems.

When I run these models against the 'unknown' inputs that I held out--the 20,000 instances--I get AUC values of about 0.55 or so for each of the three problems, give or take. I reran the entire experiment, this time using 40,000 instances for the model building and the remaining 10,000 for testing. The AUC values showed a modest improvement, but were still under 0.60.

I've looked at a) the number of unique values that each predictor takes on, and b) the number of values, for a given predictor, that appear in the test data but not in the training data. I can eliminate variables that have very few non-null values, and those that have very few unique values (the two groups are largely the same), but I wouldn't expect this to have any influence on the model.

I've already eliminated variables that are null in every instance, as well as duplicate variables having identical values for all instances. I have not done anything further to check for dependent variables, and I don't know how to.

Besides getting a clue, what might be my next best step?
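A minimal, untested sketch of the setup described above, assuming naiveBayes() is the one from the e1071 package (the post doesn't say which package it comes from), the data sit in a data frame dat with a factor class column label, and AUC is computed with the ROCR package; the object and column names are placeholders, not anything from the post:

library(e1071)    # naiveBayes(); assumed implementation
library(ROCR)     # prediction(), performance() for the AUC

train <- dat[1:30000, ]        # first 30,000 rows used for training
test  <- dat[30001:50000, ]    # remaining 20,000 rows held out for testing

fit <- naiveBayes(label ~ ., data = train)

auc <- function(model, newdata) {
  ## posterior probability of the positive class (assumed to be the
  ## second factor level of 'label')
  p <- predict(model, newdata, type = "raw")[, 2]
  performance(prediction(p, newdata$label), "auc")@y.values[[1]]
}

auc(fit, train)    # roughly 0.95-0.98 in the post
auc(fit, test)     # roughly 0.55 in the post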
I think you have been hit by the problem of high variance (overfitting). Maybe you should consider doing feature selection, perhaps using the chi-squared ranking from FSelector, and then training the naive Bayes classifier on the top n features (n = 1 to 200) as ranked by chi-squared. Plot the AUC or F1 score on both the training set and a cross-validation set against n; from that graph you can select the optimal number of features n.

On Fri, Aug 10, 2012 at 6:40 AM, Kirk Fleming <kirkrfleming at hotmail.com> wrote:
> My data is 50,000 instances of about 200 predictor values, and for all
> 50,000 examples I have the actual class labels (binary). [...]
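A rough, untested sketch of the procedure suggested above: rank the predictors with FSelector's chi.squared(), refit the naive Bayes model on the top n features for a range of n, and plot the resulting AUC curves. It assumes the FSelector, e1071 and ROCR packages and a data frame train holding the training rows with a factor class column label; the positive class is assumed to be the second factor level, and all names are placeholders. The second curve is computed on a validation slice carved out of the training data (the cross-validation set mentioned above) so the final test set stays untouched.

library(FSelector)   # chi.squared(), cutoff.k(), as.simple.formula()
library(e1071)       # naiveBayes()
library(ROCR)        # prediction(), performance()

## hold back part of the training rows so the curve isn't tuned on the test set
set.seed(1)
val_idx <- sample(nrow(train), floor(0.2 * nrow(train)))
fit_set <- train[-val_idx, ]
val_set <- train[ val_idx, ]

## rank every predictor by its chi-squared association with the class
weights <- chi.squared(label ~ ., data = fit_set)

auc_top_n <- function(n, eval_data) {
  feats <- cutoff.k(weights, n)                            # top-n predictors
  model <- naiveBayes(as.simple.formula(feats, "label"), data = fit_set)
  p     <- predict(model, eval_data, type = "raw")[, 2]    # P(positive class)
  performance(prediction(p, eval_data$label), "auc")@y.values[[1]]
}

ns      <- seq(5, 200, by = 5)
auc_fit <- sapply(ns, auc_top_n, eval_data = fit_set)
auc_val <- sapply(ns, auc_top_n, eval_data = val_set)

## plot both curves and pick the n where the validation curve peaks
plot(ns, auc_fit, type = "l", ylim = c(0.5, 1),
     xlab = "number of features n", ylab = "AUC")
lines(ns, auc_val, lty = 2)
legend("bottomright", legend = c("training", "validation"), lty = 1:2)

If the training curve stays high while the validation curve flattens out early, that is the overfitting showing up directly, and the point where the validation curve levels off is a reasonable choice of n.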
On Thu, 09-Aug-2012 at 03:40PM -0700, Kirk Fleming wrote:

|> My data is 50,000 instances of about 200 predictor values, and for all
|> 50,000 examples I have the actual class labels (binary). [...]

I don't know where you got naiveBayes from, so I can't check it, but my experience with boosted regression trees might be useful. I had AUC values fairly similar to yours with only one tenth of the number of instances you have.

If naiveBayes has the ability to use a validation set, I think you'll find it makes a huge difference. In my case it brought the training AUC down to something like 0.85, but the test AUC was only slightly less, say 0.81. Try reserving about 20-25% of your training data for a validation set, then calculate your AUC on the combined training and validation data. It will probably go down somewhat, but your test AUC will look much better.

I'd be interested to know what you discover.

Patrick Connolly
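naiveBayes() (at least the e1071 implementation) has no built-in validation-set mechanism the way boosted regression tree packages do, but the same idea can be applied by hand: hold back 20-25% of the training rows, use them only to choose between candidate models, and only afterwards look at the test set. Below is a hedged, untested sketch in which the laplace smoothing argument merely stands in for whatever is actually being tuned (it only matters for categorical predictors); all object and column names are placeholders.

library(e1071)   # naiveBayes()
library(ROCR)    # prediction(), performance()

set.seed(1)
val_idx <- sample(nrow(train), floor(0.25 * nrow(train)))
fit_set <- train[-val_idx, ]   # ~75-80% of the training rows: fit candidates
val_set <- train[ val_idx, ]   # the rest: used only to compare the candidates

auc <- function(model, newdata) {
  ## posterior probability of the positive class (assumed second level)
  p <- predict(model, newdata, type = "raw")[, 2]
  performance(prediction(p, newdata$label), "auc")@y.values[[1]]
}

## one candidate model per smoothing value; keep whichever scores best on the
## validation rows (laplace smoothing affects categorical predictors only)
cands  <- c(0, 0.5, 1, 2, 5)
models <- lapply(cands, function(l) naiveBayes(label ~ ., data = fit_set, laplace = l))
best   <- models[[which.max(sapply(models, auc, newdata = val_set))]]

auc(best, train)   # AUC on the combined training + validation data
auc(best, test)    # AUC on the untouched test set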