Angela Boag
2014-Aug-21 21:45 UTC
[R] Subsetting data for split-sample validation, then repeating 1000x
Hi all,

I'm doing some within-dataset model validation. I would like to split a dataset 70/30, fit a model to 70% of the data (the training data), and validate it by predicting the remaining 30% (the testing data). I would like to repeat this split-sample validation 1000 times and average the correlation coefficient and r2 between the predicted and observed testing data.

I have the following working for a single iteration, and would like to know how to use either replicate() or a for loop to average the 1000 'r2' and 'cor' outputs.

--
# create 70% training sample
A.samp <- sample(1:nrow(A), floor(0.7*nrow(A)), replace = TRUE)

# Fit model (I'm modeling native plant richness, 'nat.r')
A.model <- glmmadmb(nat.r ~ isl.sz + nr.mead, random = ~ 1 | site, family = "poisson", data = A[A.samp,])

# Use the model to predict the remaining 30% of the data
A.pred <- predict(A.model, newdata = A[-A.samp,], type = "response")

# Correlation between predicted 30% and actual 30%
cor <- cor(A[-A.samp,]$nat.r, A.pred, method = "pearson")

# r2 between predicted and observed
lm.A <- lm(A.pred ~ A[-A.samp,]$nat.r)
r2 <- summary(lm.A)$r.squared

# print values
r2
cor
--

Thanks for your time!

Cheers,
Angela

--
Angela E. Boag
Ph.D. Student, Environmental Studies
CAFOR Project Researcher
University of Colorado, Boulder
Mobile: 720-212-6505
David L Carlson
2014-Aug-22 20:46 UTC
[R] Subsetting data for split-sample validation, then repeating 1000x
You can use replicate() or a for (i in 1:1000) {} loop to do your replications, but you have other issues first.

1. You are sampling with replacement, which makes no sense for a split sample. Your 70% sample will contain some observations multiple times, so it will almost always use less than 70% of the distinct rows, and A[-A.samp,] will hold correspondingly more than 30%.

2. You compute r using cor() and r.squared using summary.lm(). Why? Once you have computed r, r*r or r^2 is equal to r.squared for the simple linear model you are using.

# To split your data, you need to sample without replacement, e.g.
train <- sample.int(nrow(A), floor(nrow(A)*.7))
test <- (1:nrow(A))[-train]

# Now run your analysis on A[train,] and test it on A[test,]
# Fit model (I'm modeling native plant richness, 'nat.r')
A.model <- glmmadmb(nat.r ~ isl.sz + nr.mead, random = ~ 1 | site, family = "poisson", data = A[train,])

# Correlation between predicted 30% and actual 30%
cor <- cor(A[test,]$nat.r, predict(A.model, newdata = A[test,], type = "response"))

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352
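The replication step asked about in the original question could be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame A is simulated here, a plain Poisson glm() stands in for the glmmadmb() mixed model (so there is no random effect for site), and the helper name one.split is made up for the example.

```r
# Sketch only: simulated data and a plain Poisson glm() stand in for the
# poster's data frame A and glmmadmb() model, which are not available here.
set.seed(1)
n <- 200
A <- data.frame(isl.sz = runif(n, 1, 10), nr.mead = runif(n, 0, 5))
A$nat.r <- rpois(n, lambda = exp(0.5 + 0.15 * A$isl.sz + 0.1 * A$nr.mead))

# One split-sample validation: fit on 70%, predict the held-out 30%.
# one.split is a hypothetical helper name, not from the thread.
one.split <- function(dat) {
  train <- sample.int(nrow(dat), floor(0.7 * nrow(dat)))  # without replacement
  fit <- glm(nat.r ~ isl.sz + nr.mead, family = poisson, data = dat[train, ])
  pred <- predict(fit, newdata = dat[-train, ], type = "response")
  r <- cor(dat[-train, "nat.r"], pred)  # observed vs. predicted on test rows
  c(cor = r, r2 = r^2)                  # r^2 equals the lm() r.squared here
}

# replicate() collects the named vectors into a 2 x 1000 matrix,
# one column per random split
res <- replicate(1000, one.split(A))

# Average cor and r2 over the 1000 splits
rowMeans(res)
```

Replacing the glm() line with the glmmadmb() call from the thread should leave the rest of the structure unchanged, although each fit will likely be much slower, so it may be worth trying a small number of replicates first.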