Norman Polozka
2016-Apr-02 21:40 UTC
[R] R model developing & validating - Open to Discussion
Throughout my R journey I have noticed the ways we can use given data to develop and validate a model. Assume that you are given data for a problem:

1. train.csv
2. test.csv

*Method A*

Combine the *train and test data* and develop a model using the combined data. Then use the test data to validate the model based on prediction error analysis.

*Method B*

Use the *train data* to develop the model and then use the *test data* to validate the model based on prediction error analysis.

*Method C*

Subdivide *train.csv* into 75% training data and 25% test data, and use the new training data to develop the model. Then use the new test data to validate the model. After that, use the originally given test data to double-check the performance of the model. (A sketch of this split follows below.)

I have identified three methods, so it is a bit confusing which one to use.

*Are there any other methods besides these?*

I need opinions from R experts on:

1. What is the best practice?

2. Does it depend on the scale of the problem (small data or big data)?

3. a) Is the confusion matrix the only way we can check the performance of a model?

   b) Are there any other metrics to check the performance?

   c) Does it depend on the type of the model (lm(), glm(), tree(), svm(), etc.)?

   d) Do we have different metrics for different models to evaluate a model?

PS: I asked this question on Stack but got no response, so I thought to ask you guys.

Many thanks
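For concreteness, here is a minimal R sketch of the Method C split (not part of the original post). It assumes train.csv holds a binary outcome in a column named y; the column name, the formula, and the 0.5 cutoff are all illustrative assumptions.

## Method C sketch: 75/25 split of train.csv (assumes a binary
## outcome column 'y'; adapt names and formula to your data)
dat <- read.csv("train.csv")

set.seed(123)                                  # reproducible split
idx <- sample(seq_len(nrow(dat)), size = floor(0.75 * nrow(dat)))
new_train <- dat[idx, ]
new_test  <- dat[-idx, ]

fit  <- glm(y ~ ., data = new_train, family = binomial)
prob <- predict(fit, newdata = new_test, type = "response")
pred <- as.integer(prob > 0.5)                 # illustrative cutoff

table(predicted = pred, actual = new_test$y)   # confusion matrix
mean(pred == new_test$y)                       # accuracy

On question 3: the confusion matrix is one check among several. Classifiers are also commonly assessed with accuracy, ROC/AUC, or log loss, while regression fits such as lm() models use metrics like RMSE or MAE, so the appropriate metric does depend on the type of model.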
Bert Gunter
2016-Apr-03 17:24 UTC
[R] R model developing & validating - Open to Discussion
This is way OT for this list, and really has nothing to do with R. Post on a statistical list like stats.stackexchange.com if you want to repeat a discussion that has gone on for decades and has no resolution.

You really should be spending time with the literature, though. Have you? "Cross validation" and "penalized regression" might be a couple of terms to start you off, although they are far from sufficient, and others might suggest better ones.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Sat, Apr 2, 2016 at 2:40 PM, Norman Polozka <normanmath at gmail.com> wrote:
> [original post quoted in full; see above]
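As a concrete starting point on the two terms Bert mentions, here is a minimal sketch (not part of his reply) using the glmnet package, where cv.glmnet() combines 10-fold cross validation with penalized (lasso) regression. The simulated x and y are stand-ins for real data.

## Cross validation + penalized regression via glmnet (simulated data)
library(glmnet)

set.seed(1)
x <- matrix(rnorm(200 * 10), nrow = 200)       # 200 obs, 10 predictors
y <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))   # simulated binary outcome

## cv.glmnet fits a lasso path and chooses the penalty by 10-fold CV
cvfit <- cv.glmnet(x, y, family = "binomial",
                   nfolds = 10, type.measure = "class")
cvfit$lambda.min                 # penalty minimizing CV misclassification
coef(cvfit, s = "lambda.min")    # sparse coefficients at that penalty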