Noah Silverman
2009-Sep-07 19:33 UTC
[R] Confused - better empirical results with error in data
Hi,

I have a strange one for the group.

We have a system that predicts probabilities using a fairly standard SVM (e1071). We are looking at probabilities of a binary outcome. The input data is generated by a Perl script that calculates a bunch of things, fetches data from a database, etc. We train the system on 30,000 examples and then test it on an unseen set of 5,000 records.

The "real world" results on the test set looked VERY good. We were really happy with our model. Then we noticed that there was a big error in our data generation script and one of the values (an average of sorts) was being calculated incorrectly. (The Perl script failed to clear two iterators, so they both grew with every record.)

As a quick experiment, we removed that item from our data set and re-ran the process. The results were not very good: perhaps 75% as good as training with the "wrong" factor included.

So, this is really a philosophical question. Do we:
    1) Shrug and say "who cares": the SVM figured it out and likes that bad data item for some inexplicable reason, or
    2) Tear into the math and try to figure out WHY the SVM is predicting more accurately?

Any opinions?

Thanks!
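For concreteness, below is a minimal sketch in R of the experiment described above. Everything in it is invented for illustration: the simulated data, the feature names, the accuracy helper acc(), and the sample sizes (scaled down from the 30,000/5,000 in the post so svm() runs quickly). The bad_avg column mimics the uncleared-accumulator bug by giving each record a cumulative average instead of its own value.

library(e1071)

set.seed(1)
n <- 3500                        # scaled down from 30,000 train + 5,000 test
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(n) > 0, "pos", "neg"))

## Mimic the Perl bug: the running sum and count are never cleared,
## so each record carries a cumulative average rather than its own.
d$bad_avg <- cumsum(abs(d$x1)) / seq_len(n)

train <- d[1:3000, ]
test  <- d[3001:n, ]

fit_with    <- svm(y ~ x1 + x2 + bad_avg, data = train, probability = TRUE)
fit_without <- svm(y ~ x1 + x2,           data = train, probability = TRUE)

## Illustrative accuracy helper: fraction of test records classified correctly
acc <- function(fit, newdata) mean(predict(fit, newdata) == newdata$y)
acc(fit_with, test)              # with the "wrong" factor included
acc(fit_without, test)           # with it removed

Note that in this sketch bad_avg mostly encodes a record's position in the file, which is one way such a bug can smuggle ordering information into a model; whether the real "wrong" factor did something similar is exactly the sort of thing worth checking before trusting the results.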
S Ellison
2009-Sep-07 19:41 UTC
[R] Confused - better empirical results with error in data
Predicting whilst confused is unlikely to produce sound predictions... my vote is for finding out why before believing anything.
Mark Knecht
2009-Sep-07 20:05 UTC
[R] Confused - better empirical results with error in data
On Mon, Sep 7, 2009 at 12:33 PM, Noah Silverman <noah at smartmediacorp.com> wrote:
<SNIP>
> So, this is really a philosophical question. Do we:
>     1) Shrug and say "who cares": the SVM figured it out and likes that bad data item for some inexplicable reason, or
>     2) Tear into the math and try to figure out WHY the SVM is predicting more accurately?
>
> Any opinions?
>
> Thanks!

Boy, I'd sure think you'd want to know why it worked with the 'wrong' calculations. It's not that the math is wrong, really, but rather that it wasn't what you thought it was. I cannot see why you wouldn't want to know why this mistake helped. Won't future projects benefit?

Just my 2 cents,
Mark