Tyrell Deweber
2012-May-30 11:55 UTC
[R] caret() train based on cross validation - split dataset to keep sites together?
Hello all,

I have searched and have not yet identified a solution, so now I am sending this message. In short, I need to split my data into training, validation, and testing subsets that keep all observations from the same sites together, preferably as part of a cross-validation procedure. Now for the longer version. And I must confess that although my R skills are improving, they are not so highly developed.

I am using 10-fold cross-validation with 3 repeats in the train function of the caret package to identify an optimal nnet (neural network) model to predict daily river water temperature at unsampled sites. I am also withholding data from 10% of sites to have a better understanding of generalization error. However, the focus on predictions at other sites is turning out to be not easily facilitated, as far as I can see. My data structure (example at the bottom of this email) consists of columns identifying the site, the date, the water temperature on that day for the site (the response variable), and many predictors. There are over 220,000 individual observations at ~1,000 sites, and each site has a minimum of 30 observations. It is important to keep sites separate because selecting a model based on predictions at an already sampled site is likely overly optimistic.

Is there a way to split the data for (or preferably during) the cross-validation procedure that:

1.) Selects a separate validation dataset from 10% of sites
2.) Splits the remaining training data into cross-validation subsets while, most importantly, keeping all observations from a site together
3.) Secondarily, constrains the partitions to be similar, ideally based on the distributions of all variables

It seems that some combination of the sample.split function of the caTools package and the createDataPartition function of caret might do this, but I am at a loss for how to code that.

If this is not possible, I would be content to skip the cross-validation procedure and create three similar splits of my data that keep all observations from a site together: one for training, one for testing, and one for validation. The alternative goal here would be to split the data so that 80% of sites are for training, 10% of sites are for testing (model selection), and 10% of sites are for validation (a sketch of this split follows the example script below).

Thank you, and please let me know if there are any remaining questions. This is my first post as well, so if I left anything out, that would be good to know too.

Tyrell Deweber

R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-redhat-linux-gnu (64-bit)

Comid   tempymd      watmntemp   airtemp   predictorb
15433   1980-05-01   11.4        22.1
15433   1980-05-02   11.6        23.6
15433   1980-05-03   11.2        28.5
15687   1980-06-01   13.5        26.5
15687   1980-06-02   14.2        26.9
15687   1980-06-03   13.8        28.9
18994   1980-04-05   8.4         16.4
18994   1980-04-06   8.3         12.6
90342   1980-07-13   18.9        22.3
90342   1980-07-14   19.3        28.4

EXAMPLE SCRIPT FOR MODEL FITTING

library(caret)   # train, trainControl
library(doMC)    # registerDoMC, for parallel model fitting

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

tuning <- read.table("temptunegrid.txt", head = TRUE, sep = ",")
tuning

# Model with 100 iterations
registerDoMC(4)
tempmod100its <- train(watmntemp ~ tempa + tempb + tempc + tempd + tempe +
        netarea + netbuffor + strmslope +
        netsoilprm + netslope + gwndx + mnaspect + urb + ag + forest +
        buffor + tempa7day + tempb7day +
        tempc7day + tempd7day + tempe7day + tempa30day + tempb30day +
        tempc30day + tempd30day +
        tempe30day, data = temp.train, method = "nnet", linout = TRUE,
        maxit = 100,
        MaxNWts = 100000, metric = "RMSE", trControl = fitControl,
        tuneGrid = tuning, trace = TRUE)
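For concreteness, here is a minimal sketch of that fallback 80/10/10 site-level split. The data frame name temp and the seed are assumptions (only the Comid column appears in the example data above), and the sketch makes no attempt at the secondary goal of balancing variable distributions:

# Sketch: partition sites (not rows) 80/10/10 into train/test/validation.
# Assumes the full data frame is `temp` with a site ID column `Comid`.
set.seed(42)                               # any fixed seed, for a reproducible split
sites <- sample(unique(temp$Comid))        # site IDs in random order
n     <- length(sites)
cut1  <- floor(0.80 * n)                   # first 80% of sites for training
cut2  <- floor(0.90 * n)                   # next 10% for testing

train.sites <- sites[seq_len(cut1)]
test.sites  <- sites[(cut1 + 1):cut2]
valid.sites <- sites[(cut2 + 1):n]         # remaining ~10% for validation

# Every observation from a given site lands in exactly one subset.
temp.train <- temp[temp$Comid %in% train.sites, ]
temp.test  <- temp[temp$Comid %in% test.sites,  ]
temp.valid <- temp[temp$Comid %in% valid.sites, ]

caret's createDataPartition could stand in for sample() here if a stratified draw on some site-level summary (say, mean temperature per site) were wanted for the similarity constraint.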
Max Kuhn
2012-May-30 16:40 UTC
[R] caret() train based on cross validation - split dataset to keep sites together?
Tyrell,

If you want to have the folds contain data from only one site at a time, you can develop a set of row indices and pass these to the index argument in trainControl. For example:

index = list(site1 = c(1, 6, 8, 12),
             site2 = c(120, 152, 176, 178),
             site3 = c(754, 789, 981))

The first fold would fit a model on the site 1 data in the first list element and predict everything else, and so on.

I'm not sure if this is what you need, but there you go.

Max

On Wed, May 30, 2012 at 7:55 AM, Tyrell Deweber <jtdeweber at gmail.com> wrote:
> [original message quoted in full; quote trimmed]
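Building that list by hand is impractical for ~1,000 sites; a sketch of constructing it programmatically, assuming the training data frame temp.train keeps the Comid site column from the example data:

# One list element of row numbers per site; each resample then fits on
# a single site's rows and predicts the remainder, as Max describes.
siteRows   <- split(seq_len(nrow(temp.train)), temp.train$Comid)
fitControl <- trainControl(method = "cv", index = siteRows)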
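If what is wanted is instead the reverse, training on all sites except one and predicting the held-out site (closer to conventional grouped cross-validation than the scheme above), the complements of those row sets can be passed as index; again a sketch on the assumed temp.train/Comid names:

allRows <- seq_len(nrow(temp.train))
bySite  <- split(allRows, temp.train$Comid)
# Each resample trains on every row outside one site; train() then
# predicts the rows left out, i.e. the held-out site.
leaveSiteOut <- lapply(bySite, function(i) setdiff(allRows, i))
fitControl   <- trainControl(method = "cv", index = leaveSiteOut)

With ~1,000 sites that is ~1,000 resamples; grouping sites into, say, 10 folds before taking complements would keep the cost close to the original 10-fold setup.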