Thomas Pujol
2007-Jun-20 18:58 UTC
[R] should I use apply, sapply, etc instead of a "for loop"?
I have been trying to learn the various "apply" functions but am still unsure of their appropriate use. I appreciate any help the R community can offer me. Sorry for the length of this post.

Background: I have data on my hard drive organized in the following manner:

The data pertains to many different "samples" of data (e.g. sample 001, sample 002, sample 003, etc.).

Each "sample" contains many different "data frames" for a large number of different data-items (e.g. sat score, median income of zip-code, gender, GPA, etc.).

The data frames and files are each named with the data-item name as the "prefix" of the name and the "sample number" as the suffix of the name, e.g. sat.001, income.001, sat.002, income.002.

Each data frame has approximately 5,000 rows, 1 for each "person".

Note: The files are somewhat large, and most of my analysis will be completed within each "sample". (Thus, I think that I should probably keep the files stored as separate files, and not combine them into a larger list or data frame. I also do not think I want to load all the files for multiple samples at once, as this may take up too much memory.) Also, I have simplified the description of the files; many contain multiple columns of data.

###############

I have written a "for" loop that does the following:
a. For each "sample period" I load two files.
b. I perform a function on the data contained in these two files.
c. I take the results and save them as a new file.
I then proceed to the next sample.

Is there a "better" (i.e. more elegant and/or efficient) way to do this, perhaps with one of the "apply" functions (e.g. apply, sapply, lapply, tapply)?
# my simplified code

# this creates example data:
sat.001 = c(500, 400, 750)
sat.002 = c(245, 455, 767)
income.001 = c(5020, 4200, 7250)
income.002 = c(2425, 4525, 7627)
filenames = c('sat.001', 'sat.002', 'income.001', 'income.002')
sapply(filenames, function(x) {
  save(list = x, file = paste(x, '.r', sep = ''))
})
rm(sat.001, sat.002, income.001, income.002, filenames)
ls()

###############
# my for loop

divide = function(x, y) {x/y}  # creates a custom function

# inputs to my loop:
samplenames = c('001', '002')
x.name = 'sat'
y.name = 'income'
fun = 'divide'

for (i in seq_along(samplenames)) {
  x.name.suf = paste(x.name, samplenames[i], sep = '.')  # name of x file on hard drive
  y.name.suf = paste(y.name, samplenames[i], sep = '.')  # name of y file on hard drive
  # load() puts the object in the global environment and returns its name; get() fetches it
  x = get(load(file = paste(x.name.suf, 'r', sep = '.'), envir = .GlobalEnv))
  y = get(load(file = paste(y.name.suf, 'r', sep = '.'), envir = .GlobalEnv))
  # applies the custom function specified in the inputs above
  # to the data contained in the x and y files
  temp = get(fun)(x, y)
  # save the results in files with a name that specifies the name of the function,
  # name of x, name of y, and sample number; files will be used for later analysis
  save(list = 'temp', file = paste(fun, x.name, y.name, samplenames[i], sep = '.'))
  rm(list = x.name.suf)
  rm(list = y.name.suf)
  rm(x.name.suf, y.name.suf, x, y, temp)
}
rm(divide, samplenames, x.name, y.name, fun, i)
ls()
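
For comparison, the same three steps (load two files, apply a function, save the result) could be wrapped in a function and mapped over the sample names with one of the "apply" functions. The following is only a sketch under stated assumptions, not a definitive answer: load_one() and process_sample() are hypothetical helper names that do not appear in the post, each file is assumed to hold exactly one object, and a temporary directory stands in for the real data location.

```r
# Sketch only: load_one() and process_sample() are hypothetical helpers.

# Work in a temporary directory so the example leaves no files behind.
data_dir <- tempfile("samples")
dir.create(data_dir)

# Recreate the example data from the post and save one object per file.
sat.001 <- c(500, 400, 750)
sat.002 <- c(245, 455, 767)
income.001 <- c(5020, 4200, 7250)
income.002 <- c(2425, 4525, 7627)
for (nm in c("sat.001", "sat.002", "income.001", "income.002")) {
  save(list = nm, file = file.path(data_dir, paste(nm, "r", sep = ".")))
}

divide <- function(x, y) x / y

# Load a single saved object into a private environment and return its value,
# so nothing is added to the global workspace.
# Assumes the file contains exactly one object.
load_one <- function(name, dir) {
  env <- new.env()
  obj <- load(file.path(dir, paste(name, "r", sep = ".")), envir = env)
  get(obj, envir = env)
}

# Handle one sample: load both inputs, apply the function, save the result.
process_sample <- function(sample, x.name, y.name, fun, dir) {
  x <- load_one(paste(x.name, sample, sep = "."), dir)
  y <- load_one(paste(y.name, sample, sep = "."), dir)
  temp <- get(fun, mode = "function")(x, y)
  out <- file.path(dir, paste(fun, x.name, y.name, sample, sep = "."))
  save(temp, file = out)
  out  # return the path of the file that was written
}

samplenames <- c("001", "002")

# vapply() is like sapply() but checks that each call returns one string.
outfiles <- vapply(samplenames, process_sample, character(1),
                   x.name = "sat", y.name = "income", fun = "divide",
                   dir = data_dir)
```

Because each file is loaded into a private environment inside a function, only one sample's data is held at a time and no rm() cleanup is needed afterwards. The run time here is probably dominated by reading and writing files, so the gain from an "apply" function over the "for" loop is more likely to be in readability than in speed.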