Tyrell Deweber
2012-May-30 11:55 UTC
[R] caret() train based on cross validation - split dataset to keep sites together?
Hello all,

I have searched and have not yet identified a solution, so now I am sending this message. In short, I need to split my data into training, validation, and testing subsets that keep all observations from the same sites together, preferably as part of a cross-validation procedure. Now for the longer version. And I must confess that although my R skills are improving, they are not so highly developed.

I am using 10-fold cross-validation with 3 repeats in the train function of the caret package to identify an optimal nnet (neural network) model to predict daily river water temperature at unsampled sites. I am also withholding data from 10% of sites to have a better understanding of generalization error. However, the focus on predictions at other sites is turning out to be not easily facilitated, as far as I can see. My data structure (example at the bottom of this email) consists of columns identifying the site, the date, the water temperature on that day for the site (the response variable), and many predictors. There are over 220,000 individual observations at ~1,000 sites, and each site has a minimum of 30 observations. It is important to keep sites separate because selecting a model based on predictions at an already sampled site is likely overly optimistic.

Is there a way to split the data for (or preferably during) the cross-validation procedure that:

1.) Selects a separate validation dataset from 10% of sites
2.) Splits the remaining training data into cross-validation subsets while, most importantly, keeping all observations from a site together
3.) Secondarily, constrains the partitions to be similar, ideally based on the distributions of all variables

It seems that some combination of the sample.split function of the caTools package and the createDataPartition function of caret might do this, but I am at a loss for how to code that.

If this is not possible, I would be content to skip the cross-validation procedure and create three similar splits of my data that keep all observations from a site together: one for training, one for testing, and one for validation. The alternative goal here would be to split the data so that 80% of sites are for training, 10% of sites are for testing (model selection), and 10% of sites are for validation (a sketch of this split follows the example script below).

Thank you, and please let me know if there are any remaining questions. This is my first post as well, so if I left anything out, that would be good to know too.

Tyrell Deweber

R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-redhat-linux-gnu (64-bit)

Comid   tempymd      watmntemp   airtemp   predictorb
15433   1980-05-01   11.4        22.1
15433   1980-05-02   11.6        23.6
15433   1980-05-03   11.2        28.5
15687   1980-06-01   13.5        26.5
15687   1980-06-02   14.2        26.9
15687   1980-06-03   13.8        28.9
18994   1980-04-05   8.4         16.4
18994   1980-04-06   8.3         12.6
90342   1980-07-13   18.9        22.3
90342   1980-07-14   19.3        28.4

EXAMPLE SCRIPT FOR MODEL FITTING

library(caret)   # train, trainControl
library(doMC)    # registerDoMC, for parallel model fitting

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

tuning <- read.table("temptunegrid.txt", head = TRUE, sep = ",")
tuning

# Model with 100 iterations
registerDoMC(4)
tempmod100its <- train(watmntemp ~ tempa + tempb + tempc + tempd + tempe +
        netarea + netbuffor + strmslope +
        netsoilprm + netslope + gwndx + mnaspect + urb + ag + forest +
        buffor + tempa7day + tempb7day +
        tempc7day + tempd7day + tempe7day + tempa30day + tempb30day +
        tempc30day + tempd30day +
        tempe30day, data = temp.train, method = "nnet", linout = TRUE,
        maxit = 100,
        MaxNWts = 100000, metric = "RMSE", trControl = fitControl,
        tuneGrid = tuning, trace = TRUE)
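For concreteness, here is a minimal sketch of that fallback 80/10/10 site-level split. The data frame name temp and the seed are assumptions (only the Comid column appears in the example data above), and the sketch makes no attempt at the secondary goal of balancing variable distributions:

# Sketch: partition sites (not rows) 80/10/10 into train/test/validation.
# Assumes the full data frame is `temp` with a site ID column `Comid`.
set.seed(42)                               # any fixed seed, for a reproducible split
sites <- sample(unique(temp$Comid))        # site IDs in random order
n     <- length(sites)
cut1  <- floor(0.80 * n)                   # first 80% of sites for training
cut2  <- floor(0.90 * n)                   # next 10% for testing

train.sites <- sites[seq_len(cut1)]
test.sites  <- sites[(cut1 + 1):cut2]
valid.sites <- sites[(cut2 + 1):n]         # remaining ~10% for validation

# Every observation from a given site lands in exactly one subset.
temp.train <- temp[temp$Comid %in% train.sites, ]
temp.test  <- temp[temp$Comid %in% test.sites,  ]
temp.valid <- temp[temp$Comid %in% valid.sites, ]

caret's createDataPartition could stand in for sample() here if a stratified draw on some site-level summary (say, mean temperature per site) were wanted for the similarity constraint.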
Max Kuhn
2012-May-30 16:40 UTC
[R] caret() train based on cross validation - split dataset to keep sites together?
Tyrell,

If you want to have the folds contain data from only one site at a time, you can develop a set of row indices and pass these to the index argument in trainControl. For example:

index = list(site1 = c(1, 6, 8, 12),
             site2 = c(120, 152, 176, 178),
             site3 = c(754, 789, 981))

The first fold would fit a model on the site 1 data in the first list element and predict everything else, and so on.

I'm not sure if this is what you need, but there you go.

Max

On Wed, May 30, 2012 at 7:55 AM, Tyrell Deweber <jtdeweber at gmail.com> wrote:
> [original message quoted in full; quote trimmed]
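Building that list by hand is impractical for ~1,000 sites; a sketch of constructing it programmatically, assuming the training data frame temp.train keeps the Comid site column from the example data:

# One list element of row numbers per site; each resample then fits on
# a single site's rows and predicts the remainder, as Max describes.
siteRows   <- split(seq_len(nrow(temp.train)), temp.train$Comid)
fitControl <- trainControl(method = "cv", index = siteRows)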
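If what is wanted is instead the reverse, training on all sites except one and predicting the held-out site (closer to conventional grouped cross-validation than the scheme above), the complements of those row sets can be passed as index; again a sketch on the assumed temp.train/Comid names:

allRows <- seq_len(nrow(temp.train))
bySite  <- split(allRows, temp.train$Comid)
# Each resample trains on every row outside one site; train() then
# predicts the rows left out, i.e. the held-out site.
leaveSiteOut <- lapply(bySite, function(i) setdiff(allRows, i))
fitControl   <- trainControl(method = "cv", index = leaveSiteOut)

With ~1,000 sites that is ~1,000 resamples; grouping sites into, say, 10 folds before taking complements would keep the cost close to the original 10-fold setup.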