Thomas Pujol
2007-Jun-20 18:58 UTC
[R] should I use apply, sapply, etc. instead of a "for loop"?
I have been trying to learn the various "apply" functions but am still
learning their appropriate use. I appreciate any help the R community can offer
me. Sorry for the length of this post.
Background:
I have data on my hard drive organized in the following manner:
The data pertains to many different "samples" of data (e.g. sample
001, sample 002, sample 003, etc.).
Each "sample" contains many different "data frames" for a
large number of different data-items.
(e.g. sat score, median income of zip-code, gender, GPA, etc)
The data frames and files are each named with the data-item name as the
"prefix" of the name and the "sample number" as the suffix
of the name.
e.g. sat.001, income.001, sat.002, income.002
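To make the naming scheme concrete, here is a minimal sketch of how such names could be generated. The helper make_name() is a hypothetical name used only for illustration, and it assumes the sample numbers are always zero-padded to three digits, as in the examples above.

```r
# Hypothetical helper (not part of my actual code): builds "<item>.<sample>"
# names, assuming sample numbers are zero-padded to three digits.
make_name <- function(item, sample_no) sprintf("%s.%03d", item, sample_no)

sat_names    <- make_name("sat", 1:3)     # "sat.001" "sat.002" "sat.003"
income_names <- make_name("income", 1:3)  # "income.001" "income.002" "income.003"
```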
Each data frame has approximately 5,000 rows, 1 for each "person".
Note: The files are somewhat large, and most of my analysis will be completed
within each "sample". (Thus, I think that I should probably keep the
files stored as separate files, and not combine them into a larger list or data
frame. I also do not think I want to load all the files for multiple samples at
once, as this may take up too much memory.) Also, I have simplified the
description of the files; many contain multiple columns of data.
###############
I have written a "for" loop that does the following:
a. For each "sample period" I load two files.
b. I perform a function on the data contained in these two files.
c. I take the results and save them as a new file.
I proceed to the next sample.
Is there a "better" (i.e. more elegant and/or efficient) way to do
this, perhaps with one of the "apply" functions? (e.g. apply, sapply,
lapply, tapply?)
#e.g. my simplified code
#this creates example data:
sat.001=c(500,400,750)
sat.002=c(245,455,767)
income.001=c(5020,4200,7250)
income.002=c(2425,4525,7627)
filenames=c('sat.001', 'sat.002', 'income.001', 'income.002')
sapply(filenames,function(x) { save( list=x , file = paste(x ,'.r', sep ='') ) })
rm(sat.001,sat.002,income.001,income.002,filenames)
ls() #
##############
#my for loop
divide = function(x,y) {x/y}
#creates a custom function
#inputs to my loop:
samplenames=c('001','002')
x.name='sat'
y.name='income'
fun='divide'
for (i in seq_along(samplenames) ) {
x.name.suf = paste(x.name,samplenames[i],sep='.')
#name of x file on hard drive
y.name.suf = paste(y.name,samplenames[i],sep='.')
#name of y file on hard drive
x=get(load(file = paste(x.name.suf ,'r', sep ='.') , envir = .GlobalEnv) )
#loads and gets the x file
y=get(load(file = paste(y.name.suf ,'r', sep ='.') , envir = .GlobalEnv) )
#loads and gets the y file
temp=get(fun)(x,y)
#applies custom function specified in arguments above
# to data contained in x and y files
save( list='temp' , file = paste(fun,x.name,y.name,samplenames[i],sep='.') )
#save the results in files with name that specifies
#name of function, name of x, name of y, and sample number
#files will be used for later analysis
rm(list=x.name.suf)
rm(list=y.name.suf)
#removes the objects that load() placed in the global environment
rm(x.name.suf,y.name.suf,x,y,temp)
}
rm(divide,samplenames,x.name,y.name,fun,i)
ls()
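For comparison, here is a rough sketch of what I imagine an "apply"-style version of the same loop might look like. It is only a sketch under assumptions: load_one(), process_sample() and out_dir are hypothetical names that do not appear in my real code, the example files are written to a temporary directory rather than my working directory, and each file is assumed to hold exactly one object.

```r
# Sketch only: load_one(), process_sample() and out_dir are hypothetical names.
divide <- function(x, y) x / y

# Recreate the example files, but inside a temporary directory (assumption).
out_dir <- tempdir()
sat.001 <- c(500, 400, 750);  income.001 <- c(5020, 4200, 7250)
sat.002 <- c(245, 455, 767);  income.002 <- c(2425, 4525, 7627)
for (nm in c("sat.001", "sat.002", "income.001", "income.002")) {
  save(list = nm, file = file.path(out_dir, paste(nm, "r", sep = ".")))
}

# Load a file into a private environment and return the first object in it,
# so nothing has to be rm()'d from the global environment afterwards.
load_one <- function(path) {
  env <- new.env()
  obj_name <- load(path, envir = env)
  get(obj_name[1], envir = env)
}

# One iteration of the original loop: load x and y, apply fun, save the result.
process_sample <- function(s, x.name, y.name, fun, dir) {
  x <- load_one(file.path(dir, paste(x.name, s, "r", sep = ".")))
  y <- load_one(file.path(dir, paste(y.name, s, "r", sep = ".")))
  temp <- get(fun, mode = "function")(x, y)
  out_file <- file.path(dir, paste(fun, x.name, y.name, s, sep = "."))
  save(temp, file = out_file)
  out_file  # return the path of the result file
}

samplenames <- c("001", "002")
out_files <- sapply(samplenames, process_sample,
                    x.name = "sat", y.name = "income",
                    fun = "divide", dir = out_dir)
```

Only one sample's data is in memory at a time here as well, so I would not expect this to be faster than the loop; the main difference is that the per-sample work lives in a function and the cleanup with rm() is no longer needed.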
