Mederos, Vicente (Santander)
2010-Dec-02 10:34 UTC
[R] rpart results - problem after oversampling
Hi all,

I am trying to predict a target variable that takes values 0 or 1 using the rpart command. In my initial dataset I have few positive observations of the target variable; therefore I have oversampled the rare event by a multiple of 6 (i.e. from 762 to 4572).

However, in my results, I end up with a number of positives in one of the terminal nodes that is not divisible by 6. As I have the same observation repeated 6 times, shouldn't all of them follow the same branch of the tree and go to the same terminal node?

Thanks for your help,
Vicente
Short answer: Not really.

Slightly longer answer: from what I remember of partitioning methods, a given split is made either at a single observation or between two consecutive observations. In your case, I suspect rpart would have split at that particular point, except that there are now 6 copies of it, so the choice of which of the 6 fall on each side of the split is arbitrary. (Someone with more knowledge of rpart's guts, feel free to correct me.)

Jonathan P. Daily
Technician - USGS Leetown Science Center
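One way to examine the question empirically is to tally, for each original observation, which terminal node each of its copies lands in. The following is only a minimal sketch in Python and does not call rpart: the one-split toy tree `leaf_for`, the `id` field, the threshold, and the data are all hypothetical stand-ins. In R, the terminal node of each training row is stored in the `where` component of the fitted rpart object, so a similar tally could presumably be built from that.

```python
from collections import defaultdict


def leaf_for(row, threshold=0.5):
    """Toy one-split 'tree' (hypothetical): route a row by its predictor x."""
    return "left" if row["x"] < threshold else "right"


def copies_per_leaf(rows, assign):
    """Map each original id to {leaf: number of its copies routed there}."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        counts[row["id"]][assign(row)] += 1
    return {i: dict(leaves) for i, leaves in counts.items()}


# Hypothetical positives; each one is replicated exactly 6 times.
positives = [{"id": 1, "x": 0.2}, {"id": 2, "x": 0.5}, {"id": 3, "x": 0.9}]
oversampled = [dict(p) for p in positives for _ in range(6)]

per_id = copies_per_leaf(oversampled, leaf_for)

# Ids whose copies ended up in more than one leaf.
split_ids = [i for i, leaves in per_id.items() if len(leaves) > 1]

# Total number of positive rows per leaf.
leaf_totals = defaultdict(int)
for leaves in per_id.values():
    for leaf, n in leaves.items():
        leaf_totals[leaf] += n

print(per_id)
print(split_ids)
print(dict(leaf_totals))
```

In this toy tree the routing rule depends only on the predictor value, so the six copies of a row always travel together and every leaf total is a multiple of 6. Running the same kind of tally on the real fit would show which original observations, if any, have copies in more than one terminal node.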