thr3ads.net - R help - [R] loops & sampling [Nov 2007]


Garth.Warren at csiro.au

2007-Nov-01 06:33 UTC

[R] loops & sampling

Hi,

 

I'm new to R (and statistics) and my boss has thrown me in the deep-end with
the following task:

 

We want to evaluate the impact that sample size has on our ability to build a
robust model, that is, how robust the model is to sample size, for the purpose
of cross-validation. In our current project we have collected a series of
independent data at 250 locations, from which we have built a predictive model.
We want to know whether we could get away with collecting fewer samples and
still build a decent model, for the obvious operational reasons of cost, time
spent in the field, etc.

 

Our thinking was that we could apply a bootstrap type procedure:

 

We would remove 10 records (samples) from the total n=250 and then replace
those 10 with copies drawn from the remaining 240. With this new data frame
we would fit our model and calculate an r². We would then repeat this in a
loop 1000 times and take the mean of the 1000 r² values. After that we would
start the process again, removing 20 samples and replacing them with copies
from the remaining 230 records, and so on...

 

Below is a simplified version of the real code which contains most of the basic
elements. My main problem is that I'm not sure what the 'for(i in 1:nboot)'
line is doing. Originally I thought it removed 1 sample or record from the
data and replaced it with a copy of one of the records from the remaining n,
so that 'for(i in 10:nboot)', used in the context of the code below, would
remove 10 samples with replacement as described above. I'm almost positive
that this isn't happening. If not, how can I make the code below do what we
want?

 

library(utils)

#data
a <- c(5.5, 2.3, 8.5, 9.1, 8.6, 5.1)
b <- c(5.2, 2.2, 8.6, 9.1, 8.8, 5.7)
c <- c(5.0,14.6, 8.9, 9.0, 9.1, 5.5)

#join
abc <- data.frame(a,b,c)

#set column names
names(abc)[1]<-"y"
names(abc)[2]<-"x1"
names(abc)[3]<-"x2"
abc2 <- abc

#sample
abc3 <- as.data.frame(t(as.matrix(data.frame(abc2))))
n <- length(abc2)

npboot.function <- function(nboot)
{
  boot.cor <- vector(length=nboot)
  for(i in 1:nboot){
    rdata <- sample(abc3,n,replace=T)
    abc4 <- as.data.frame(t(as.matrix(data.frame(rdata))))
    model <- lm(asin(sqrt(abc4$y/100)) ~ I(abc4$x1^2) + abc4$x2)
    boot.cor[i] <- cor(abc4$y, model$fit)}
  boot.cor
}

bt.cor <- npboot.function(nboot=10)
bootmean <- mean(bt.cor)

 

 

Any assistance would be greatly appreciated, also the sooner the better as we
are under pressure to reach a conclusion.

 

Cheers,

 

Garth



Julian Burgos

2007-Nov-01 19:26 UTC


[R] loops & sampling

Hi Garth,

Your code is really confusing! You should start by reading the help file 
on the for() function and understanding what it does:

?"for"

Your line
for(i in 1:nboot){

}

is simply starting a loop around the variable 'i', which will change 
values following the sequence 1:nboot.
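
To make this concrete, here is a minimal sketch (the names 'visited', 'n.iter.from.one' and 'n.iter.from.ten' are made up for illustration). It shows that the loop only controls how many times the body runs and does nothing to the data by itself. It also shows why 'for(i in 10:nboot)' with nboot=10 would run the body just once rather than remove 10 records.

```r
nboot <- 10
visited <- integer(0)
for (i in 1:nboot) {
  visited <- c(visited, i)   # the body runs once per value of i
}
n.iter.from.one <- length(visited)   # number of iterations for 1:nboot
n.iter.from.ten <- length(10:nboot)  # 10:10 is a single value
print(n.iter.from.one)
print(n.iter.from.ten)
```

Any removing or resampling of records has to be written explicitly inside the loop body.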

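It seems that the problem (or part of it) is that you are calling the 
sample() function with a variable 'n' that is set outside the function as 
length(abc2). For a data frame, length() returns the number of columns 
(3 here), not the number of rows.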
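
For reference, the posted code assigns n <- length(abc2) at the top level. The sketch below checks what length() and nrow() return for the toy data frame, and shows one common way to resample whole rows with replacement without transposing. It assumes the goal is to resample observations (rows); the names 'n.cols', 'n.rows', 'rows' and 'resampled' are illustrative only.

```r
abc <- data.frame(y  = c(5.5, 2.3, 8.5, 9.1, 8.6, 5.1),
                  x1 = c(5.2, 2.2, 8.6, 9.1, 8.8, 5.7),
                  x2 = c(5.0, 14.6, 8.9, 9.0, 9.1, 5.5))

n.cols <- length(abc)   # a data frame is a list of columns
n.rows <- nrow(abc)     # number of observations

# Resample whole rows with replacement by indexing with sampled row numbers
set.seed(42)            # only so the illustration is reproducible
rows <- sample(n.rows, n.rows, replace = TRUE)
resampled <- abc[rows, ]
print(c(n.cols, n.rows))
print(resampled)
```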

Also, what is nboot supposed to be?  The number of samples to be removed 
(10, 20, etc.) or the number of iterations (1000)?  In your example, you 
are calling your function as

bt.cor <- npboot.function(nboot=10)

so in this case your function will loop around 10 times.

Here is a function that will do what you want:

npboot.function <- function(data, nboot) {
  n <- nrow(data)                 # 250 for the real dataset
  boot.cor <- numeric(1000)
  for (i in 1:1000) {
    abc2 <- data[-(1:nboot), ]    # Remove the first 'nboot' rows
    my.sample <- sample(1:(n - nboot), nboot, replace = TRUE)  # Sample rows
    # Add the sampled rows to the truncated dataset
    abc2 <- rbind(abc2, abc2[my.sample, ])
    model <- lm(asin(sqrt(y/100)) ~ x1 + x2, data = abc2)  # Fit the model
    boot.cor[i] <- cor(abc2$y, fitted(model))  # Get correlation
  }
  return(boot.cor)
}

# 'abc' here must be the full dataset (250 rows), not the 6-row toy example
bt.cor <- npboot.function(abc, nboot = 120)
bootmean <- mean(bt.cor)
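
The "and so on" part of the original plan (10 removed, then 20, then 30, ...) could be sketched with an outer loop over the number of removed records. This is only an illustration under several assumptions: 'sim' is simulated stand-in data rather than the real 250 locations, 'mean.r2' is a hypothetical helper, the rows to drop are chosen at random instead of taking the first ones, r² is read from summary() of the fitted model, and 200 repetitions are used instead of 1000 to keep it quick.

```r
set.seed(1)  # reproducible illustration only

# Hypothetical stand-in for the real 250-location dataset
n <- 250
sim <- data.frame(x1 = runif(n, 2, 10), x2 = runif(n, 2, 10))
sim$y <- pmin(pmax(0.6 * sim$x1 + 0.3 * sim$x2 + rnorm(n), 0.1), 99)

# Mean r-squared after dropping k random rows and refilling with
# copies drawn (with replacement) from the remaining rows
mean.r2 <- function(data, k, nrep = 200) {
  n <- nrow(data)
  r2 <- numeric(nrep)
  for (i in seq_len(nrep)) {
    kept <- data[-sample(n, k), ]                        # drop k rows at random
    refill <- kept[sample(nrow(kept), k, replace = TRUE), ]
    fit <- lm(asin(sqrt(y / 100)) ~ x1 + x2, data = rbind(kept, refill))
    r2[i] <- summary(fit)$r.squared
  }
  mean(r2)
}

sizes <- seq(10, 50, by = 10)
result <- sapply(sizes, function(k) mean.r2(sim, k))
names(result) <- sizes
print(result)
```

Plotting 'result' against 'sizes' would then show how the mean r² changes as more records are replaced by copies.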




