Dear List:

A few weeks ago I posted some questions regarding data simulation and
received some very helpful comments, thank you. I have modified my code
accordingly and have made some progress.

However, I am now facing a new challenge along similar lines. I am
attempting to simulate 250 datasets and then run each through a linear
model. I use rm() and gc() as I move along to clean up the workspace and
preserve memory. However, my aim is to use sample sizes of 5,000 and
10,000, and by any measure this is a huge task.

My machine has 2 GB of RAM and a Pentium 4 2.8 GHz processor running
Windows XP. I have the following in the "target" field of the Windows
shortcut:

--max-mem-size=1812M

With such large samples, R is unable to complete the analysis, at least
with the code I have developed: it halts when it runs out of memory. A
colleague subsequently constructed the simulation in another software
package on a similar computer and, while it took overnight (and then
some), the program produced the desired results.

I am curious whether such large simulations are simply beyond the reach
of R, or whether my code is not adequately organized to perform the
simulation.

I would appreciate any thoughts or advice.

Harold


library(MASS)
library(nlme)

mu <- c(100, 150, 200, 250)
Sigma <- matrix(c(400,  80,  80,  80,
                   80, 400,  80,  80,
                   80,  80, 400,  80,
                   80,  80,  80, 400), 4, 4)
mu2 <- c(0, 0, 0)
Sigma2 <- diag(64, 3)
sample.size <- 5000
N <- 250   # number of datasets

# Take a single draw from the VL distribution
vl.error <- mvrnorm(n = N, mu2, Sigma2)

# Step 1: Create data
Data <- lapply(seq(N), function(x)
  as.data.frame(cbind(1:10, mvrnorm(n = sample.size, mu, Sigma))))

# Step 2: Add vertical linking error
for (i in seq(along = Data)) {
  Data[[i]]$V6 <- Data[[i]]$V2
  Data[[i]]$V7 <- Data[[i]]$V3 + vl.error[i, 1]
  Data[[i]]$V8 <- Data[[i]]$V4 + vl.error[i, 2]
  Data[[i]]$V9 <- Data[[i]]$V5 + vl.error[i, 3]
}

# Step 3: Restructure for longitudinal analysis
long <- lapply(Data, function(x)
  reshape(x, idvar = "id",
          varying = list(names(x)[2:5], names(x)[6:9]),
          v.names = c("score.1", "score.2"), direction = "long"))

#####################
# Clean up workspace
rm(Data, vl.error)
gc()
#####################

# Step 4: Run GLS
glsrun1 <- lapply(long, function(x)
  gls(score.1 ~ I(time - 1), data = x,
      correlation = corAR1(form = ~ 1 | V1), method = "ML"))

# Extract intercepts and slopes
int1 <- sapply(glsrun1, function(x) coef(x)[1])
slo1 <- sapply(glsrun1, function(x) coef(x)[2])

################
# Clean up workspace
rm(glsrun1)
gc()

glsrun2 <- lapply(long, function(x)
  gls(score.2 ~ I(time - 1), data = x,
      correlation = corAR1(form = ~ 1 | V1), method = "ML"))

# Extract intercepts and slopes
int2 <- sapply(glsrun2, function(x) coef(x)[1])
slo2 <- sapply(glsrun2, function(x) coef(x)[2])

# Clean up workspace
rm(glsrun2)
gc()

# Print results
cat("Original Standard Errors", "\n", "Intercept", "\t",
    sd(int1), "\n", "Slope", "\t", "\t", sd(slo1), "\n")
cat("Modified Standard Errors", "\n", "Intercept", "\t",
    sd(int2), "\n", "Slope", "\t", "\t", sd(slo2), "\n")

rm(list = ls())
gc()
Doran, Harold wrote:

> I am curious whether such large simulations are simply beyond the
> reach of R, or whether my code is not adequately organized to perform
> the simulation.

Don't hold all datasets (and results, if they are big) in memory at the
same time! Either generate them when you use them and delete them
afterwards, or save them to disc and load them back one by one for
further analyses.

Also, you might want to call gc() after you have removed large objects...

Uwe Ligges
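[Editorial note: below is a minimal sketch of the generate-analyse-discard
pattern described above. It reuses the distributional setup from the
original post, but the explicit "id" column, the corAR1() grouping by that
id, the variable names, and the omission of the vertical-linking-error
step are simplifications added for illustration, not the poster's exact
design.]

library(MASS)
library(nlme)

mu    <- c(100, 150, 200, 250)
Sigma <- matrix(80, 4, 4); diag(Sigma) <- 400
sample.size <- 5000
N <- 250

# keep only the two fitted coefficients per replicate; at most one
# simulated dataset is held in memory at any time
int1 <- slo1 <- numeric(N)
for (i in seq_len(N)) {
  dat  <- data.frame(id = seq_len(sample.size),
                     mvrnorm(sample.size, mu, Sigma))
  long <- reshape(dat, idvar = "id",
                  varying = list(names(dat)[2:5]),
                  v.names = "score.1", direction = "long")
  fit  <- gls(score.1 ~ I(time - 1), data = long,
              correlation = corAR1(form = ~ 1 | id), method = "ML")
  int1[i] <- coef(fit)[1]
  slo1[i] <- coef(fit)[2]
  rm(dat, long, fit)    # discard the large objects before the next draw
}

cat("Empirical SE of intercept:", sd(int1), "\n")
cat("Empirical SE of slope:    ", sd(slo1), "\n")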
Thanks. But I think I am doing that: I use rm() and gc() as the code
moves along. The datasets are stored as a list. Is there a way I can
save the list object and then call each dataset within the list one at
a time, or must the entire list be in memory at once?

Harold

-----Original Message-----
From: Uwe Ligges [mailto:ligges at statistik.uni-dortmund.de]
Sent: Wednesday, January 19, 2005 5:51 AM
To: Doran, Harold
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Data Simulation in R

Don't hold all datasets (and results, if they are big) in memory at the
same time! Either generate them when you use them and delete them
afterwards, or save them to disc and load them back one by one for
further analyses.

Also, you might want to call gc() after you have removed large objects...
Uwe Ligges
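[Editorial note: load() always restores a saved object in full, so a
single saved list cannot be read back one element at a time. One common
workaround, sketched below under the same simplified setup as the sketch
above (the file names and the "id"/grouping structure are illustrative
assumptions), is to save() each data frame to its own file and then
load() the files one per iteration.]

library(MASS)
library(nlme)

mu    <- c(100, 150, 200, 250)
Sigma <- matrix(80, 4, 4); diag(Sigma) <- 400
sample.size <- 5000
N <- 250

# write each simulated dataset to its own file ...
for (i in seq_len(N)) {
  dat <- data.frame(id = seq_len(sample.size),
                    mvrnorm(sample.size, mu, Sigma))
  save(dat, file = sprintf("simdata_%03d.RData", i))  # hypothetical file names
  rm(dat)
}

# ... then fit the model one file at a time, so the full list of N
# data frames is never held in memory
int1 <- numeric(N)
for (i in seq_len(N)) {
  load(sprintf("simdata_%03d.RData", i))    # restores 'dat'
  long <- reshape(dat, idvar = "id",
                  varying = list(names(dat)[2:5]),
                  v.names = "score.1", direction = "long")
  fit  <- gls(score.1 ~ I(time - 1), data = long,
              correlation = corAR1(form = ~ 1 | id), method = "ML")
  int1[i] <- coef(fit)[1]
  rm(dat, long, fit)
}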
Hi there:

I think I need a book on data mining using R. I know that Modern Applied
Statistics with S-PLUS (2nd ed.) or Modern Applied Statistics with S
(4th ed.) might be a good choice, but I am not sure whether there is a
better suggestion, or which of the two is better.

Thanks,

Ed
Weiwei Shi wrote:

> I think I need a book on data mining using R. I know that Modern
> Applied Statistics with S-PLUS (2nd ed.) or Modern Applied Statistics
> with S (4th ed.) might be a good choice, but I am not sure whether
> there is a better suggestion, or which of the two is better.

Well, the authors seldom debase later editions ...

Uwe