Currently, I am working on a data mining project and plan to divide the data table into 2 parts, one for modeling and the other for validation to compare several models.

But I am not sure about the percentage of data I should use to build the model and the one I should keep to validate the model.

Is there any literature reference about this topic?

Thank you so much!
Wensui Liu wrote:
> Currently, I am working on a data mining project and plan to divide
> the data table into 2 parts, one for modeling and the other for
> validation to compare several models.
>
> But I am not sure about the percentage of data I should use to build
> the model and the one I should keep to validate the model.
>
> Is there any literature reference about this topic?
>
> Thank you so much!

Data splitting is very inefficient for model validation unless the sample size is extremely large. Consider using Efron's "optimism" bootstrap as is used in the validate function in the Design package. validate will also do data splitting and cross-validation though.

--
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University
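For readers who want to see what the optimism bootstrap mentioned above does, here is a minimal hand-rolled sketch in base R. It is not the code behind the validate function; it only illustrates the idea under stated assumptions: the data are simulated, the model is an ordinary lm fit, the accuracy index is R^2, and the names dat, rsq, B, optimism and corrected are made up for this example.

```r
# Sketch of the optimism bootstrap for R^2 of a linear model.
# Everything here (data, names, B = 200) is illustrative only.
set.seed(1)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 0.5 * dat$x1 + rnorm(n)

# R^2 of a fitted model evaluated on an arbitrary data set
rsq <- function(fit, data) {
  pred <- predict(fit, newdata = data)
  1 - sum((data$y - pred)^2) / sum((data$y - mean(data$y))^2)
}

# Apparent performance: fit and evaluate on the same (full) data
fit_all  <- lm(y ~ x1 + x2, data = dat)
apparent <- rsq(fit_all, dat)

# For each bootstrap sample: refit, then compare performance on the
# bootstrap sample itself with performance on the original data.
B <- 200
optimism <- replicate(B, {
  boot  <- dat[sample(n, replace = TRUE), ]
  fit_b <- lm(y ~ x1 + x2, data = boot)
  rsq(fit_b, boot) - rsq(fit_b, dat)
})

# Optimism-corrected estimate of R^2
corrected <- apparent - mean(optimism)
c(apparent = apparent, corrected = corrected)
```

Every observation is used both to fit the final model and to estimate how over-optimistic its apparent accuracy is, which is why this approach wastes less data than holding part of the sample out.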
Assuming you have enough data, usually 1/4 to 1/2 is used for validation. One reference would be Picard, R.R. and Berk, K.N. (1990) "Data Splitting," The American Statistician, 44, 140-147.

hth,
b.

-----Original Message-----
From: Wensui Liu [mailto:liuwensui at gmail.com]
Sent: Thursday, November 11, 2004 10:20 PM
To: r-help at stat.math.ethz.ch
Subject: [R] an off-topic question -> model validation

Currently, I am working on a data mining project and plan to divide the data table into 2 parts, one for modeling and the other for validation to compare several models.

But I am not sure about the percentage of data I should use to build the model and the one I should keep to validate the model.

Is there any literature reference about this topic?

Thank you so much!
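The simple split suggested in the reply above (holding out somewhere between 1/4 and 1/2 of the rows) might look like the following in base R. This is only a sketch under assumptions: the data are simulated, the 30% validation fraction is an arbitrary pick inside that range, and the two candidate models and the names valid_frac, train, valid, rmse and res are hypothetical.

```r
# Sketch of a random train/validation split for comparing models.
# Data, the 30% fraction, and the candidate models are illustrative only.
set.seed(42)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- 2 + 3 * dat$x + rnorm(n)

valid_frac <- 0.3                      # anywhere in 1/4 to 1/2
n_valid    <- round(valid_frac * n)
valid_idx  <- sample(n, n_valid)       # row numbers held out for validation

train <- dat[-valid_idx, ]
valid <- dat[valid_idx, ]

# Fit each candidate on the modeling part only
m1 <- lm(y ~ x, data = train)
m2 <- lm(y ~ poly(x, 5), data = train)

# Compare the candidates on the held-out part
rmse <- function(fit, data) {
  sqrt(mean((data$y - predict(fit, newdata = data))^2))
}
res <- c(linear = rmse(m1, valid), poly5 = rmse(m2, valid))
res
```

The validation rows must not be touched during model building, including variable selection, otherwise the held-out error is no longer an honest estimate.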