thr3ads.net - R help - [R] r: LOOPING [Jul 2005]

Clark Allan

2005-Jul-07 07:59 UTC

[R] r: LOOPING

hi all

i know that one should try to limit the amount of looping in R
programs. i have supplied some code below. i am interested in seeing how
the code could be rewritten if we don't use the loops.


a brief overview of what is done in the code.
=======================================================================================================================================
1. the input file contains 120*500*61 cells. 120*500 rows and 61
columns.

2. we need to import the cells in 500 at a time and perform the same
operations on each sub group

3. the file contains numeric values. there are quite a lot of missing
values. these have been coded as NA in the text file (the file that is
imported)

4. for each variable we check for outliers. this is done by setting all
values that lie more than 3 standard deviations (sd) from the mean of
a variable, in either direction, to the corresponding limit
(mean + 3 sd or mean - 3 sd).

5. the data set has one response variable , the first column, and 60
explanatory variables.

6. we regress each of the explanatory variables against the response and
record the slope of the explanatory variable. (i.e. simple linear
regression is performed)

7. nsize = 500 since we import 500 rows at a time

8. nruns = how many groups you want to run the analysis on

=======================================================================================================================================
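
Step 4 above (capping values at 3 sd) is one place where the explicit loop over columns can be avoided. The following is only a sketch of one way to do it, not the poster's code: the function name cap_outliers and the argument k are made up for illustration, and pmin/pmax are used in place of the subscripted assignment in the original.

```r
# Hypothetical helper: cap every column of a numeric matrix at
# mean +/- k standard deviations. NA entries are ignored when the
# limits are computed and stay NA in the result.
cap_outliers <- function(m, k = 3) {
  mu <- colMeans(m, na.rm = TRUE)
  s  <- apply(m, 2, sd, na.rm = TRUE)
  hi <- mu + k * s
  lo <- mu - k * s
  # Columns with fewer than 2 data points have no usable limits:
  # leave their values untouched, as the original loop does.
  hi[is.na(hi)] <- Inf
  lo[is.na(lo)] <- -Inf
  # Expand the per-column limits to the shape of m, then clip elementwise.
  upper <- matrix(hi, nrow(m), ncol(m), byrow = TRUE)
  lower <- matrix(lo, nrow(m), ncol(m), byrow = TRUE)
  pmax(pmin(m, upper), lower)
}

# Small demonstration: the value 100 in column 1 lies more than 3 sd
# from the column mean and is pulled back to the upper limit.
demo <- cbind(c(rep(0, 19), 100), c(1, NA, 3, rep(2, 17)))
capped <- cap_outliers(demo)
```

The two limit matrices are full-size copies of the data, so this trades some memory for the removal of the loop. For a 500 x 61 batch that cost is small.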

TRY<-function(nsize=500,filename="C:/A.txt",nvar=61,nruns=1)
{

#the matrix with the payoff weights
fit.reg<-matrix(nrow=nruns,ncol=nvar-1)

for (ii in 1:nruns)
{
	skip=1+(ii-1)*nsize

	#import the data in batches of "nsize*nvar"
	#save as a matrix and then delete "dscan" to save memory space
	dscan<-scan(file=filename,sep="\t",skip=skip,nlines=nsize,fill=T,quiet=T)
	dm<-matrix(dscan,nrow=nsize,byrow=T)
	rm(dscan)

	#count the NA entries in each column
	#col.points = the number of entries in the column that are NA
	#(a column with col.points equal to nsize has no data at all)
	#note: the test further down only skips columns that are entirely NA;
	#a meaningful regression needs far more than 2 data points

	col.points<-colSums(is.na(dm))

	#adjust for outliers
	dm.new<-dm
	mean.dm.new<-apply(dm.new,2,function(x) mean(x,na.rm=T))
	sd.dm.new<-apply(dm.new,2,function(x) sd(x,na.rm=T))
	top.dm.new<-mean.dm.new+3*sd.dm.new
	bottom.dm.new<-mean.dm.new-3*sd.dm.new

	for (i in 1:nvar)
	{
		dm.new[,i][dm.new[,i]>top.dm.new[i]]<-top.dm.new[i]
		dm.new[,i][dm.new[,i]<bottom.dm.new[i]]<-bottom.dm.new[i]
	}

	#standardize the variables
	#we dont have to change the variable names here but i did!
	means.dm.new<-apply(dm.new,2,function(x) mean(x,na.rm=T))
	std.dm.new<-apply(dm.new,2,function(x) sd(x,na.rm=T))

	dm.new<-sweep(sweep(dm.new,2,means.dm.new,"-"),2,std.dm.new,"/")

	for (j in 2:nvar)
	{
		#WE DO NOT PERFORM THE REGRESSION IF ALL VALUES IN THE COLUMN ARE "NA"
		if (col.points[j]!=nsize)
		{
			#fit the regression equations
			fit.reg[ii,j-1]<-summary(lm(dm.new[,1]~dm.new[,j]))$coef[2,1]
		}
		else fit.reg[ii,j-1]<-NA	#NA rather than "L", which would coerce the whole matrix to character
	}
}

dm.names<-scan(file=filename,sep="\t",skip=0,nlines=1,fill=T,quiet=T,what="character")
dm.names<-matrix(dm.names,nrow=1,ncol=nvar,byrow=T)
colnames(fit.reg)<-dm.names[-1]

output<-c("$fit.reg")

list(fit.reg=fit.reg,output=output)

}

a=TRY(nsize=500,filename="C:/A.txt",nvar=61,nruns=1)


=======================================================================================================================================



thanking you in advance
/
allan
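
For the regression loop in the post above, note that the slope of a simple linear regression is just sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2), computed over the rows where both x and y are present. That can be evaluated for all columns at once. What follows is a sketch under that assumption, not a drop-in replacement: slopes_no_loop is a made-up name, and the function only returns slopes, none of the other output of lm.

```r
# Hypothetical helper: slopes of the simple regressions y ~ X[, j]
# for every column j of X, using only the rows where both y and
# X[, j] are not NA (the same rows lm keeps by default).
slopes_no_loop <- function(y, X) {
  ok <- !is.na(y) & !is.na(X)          # n x p matrix of complete pairs
  Xm <- X
  Xm[!ok] <- NA
  Ym <- matrix(y, nrow(X), ncol(X))    # y repeated in every column
  Ym[!ok] <- NA
  Xc <- sweep(Xm, 2, colMeans(Xm, na.rm = TRUE))   # centre each column
  Yc <- sweep(Ym, 2, colMeans(Ym, na.rm = TRUE))
  b <- colSums(Xc * Yc, na.rm = TRUE) / colSums(Xc^2, na.rm = TRUE)
  # No data, a single data point or a constant column gives 0/0:
  # report NA, as lm would.
  b[!is.finite(b)] <- NA
  b
}

# Usage with simulated data containing missing values.
set.seed(42)
X <- matrix(rnorm(500 * 5), 500, 5)
y <- rnorm(500)
X[sample(length(X), 200)] <- NA
b <- slopes_no_loop(y, X)
```

In the posted function this would be called as slopes_no_loop(dm.new[,1], dm.new[,-1]) for each batch. It builds a few extra matrices of the same size as the batch, which is the memory penalty to weigh against the 60 calls to lm.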

Uwe Ligges

2005-Jul-07 10:17 UTC

[R] r: LOOPING

Clark Allan wrote:
> hi all
> 
> i know that one should try to limit the amount of looping in R
> programs. i have supplied some code below. i am interested in seeing how
> the code could be rewritten if we don't use the loops.

Removing loops is not always a good thing (I say this without having
looked at every detail of the code below).
The case described below seems to be one where it is not: you probably
already run into memory problems and hence cannot "vectorize" any
further, at least not without a memory penalty.

"Compromise" is the keyword.

Best,
Uwe Ligges
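
One possible reading of "compromise" here is to keep the outer loop over batches, so that only one batch is in memory at a time, and vectorize only the work inside each batch. The sketch below illustrates that idea. It is an assumption about how this could be arranged, not code from the thread. The name process_in_chunks and its arguments are invented, and it reads from an open connection instead of calling scan with a growing skip value, so earlier lines are not re-read on every pass.

```r
# Hypothetical helper: read a tab-separated numeric file with one
# header line in batches of nsize rows and apply fun to each batch.
process_in_chunks <- function(filename, nsize, nruns, fun) {
  con <- file(filename, open = "r")
  on.exit(close(con))
  # The header line gives the number of columns.
  header <- strsplit(readLines(con, n = 1), "\t", fixed = TRUE)[[1]]
  nvar <- length(header)
  out <- vector("list", nruns)
  for (ii in seq_len(nruns)) {
    # readLines continues from the current position of the connection.
    lines <- readLines(con, n = nsize)
    if (length(lines) == 0) break      # file exhausted
    vals <- scan(text = lines, sep = "\t", quiet = TRUE)
    out[[ii]] <- fun(matrix(vals, ncol = nvar, byrow = TRUE))
  }
  out
}

# Usage with a small temporary file: column means of each batch of 2 rows.
tmp <- tempfile(fileext = ".txt")
d <- matrix(c(1:17, NA), nrow = 6, byrow = TRUE)
write.table(d, tmp, sep = "\t", quote = FALSE, row.names = FALSE,
            col.names = c("y", "x1", "x2"))
res <- process_in_chunks(tmp, nsize = 2, nruns = 3,
                         fun = function(m) colMeans(m, na.rm = TRUE))
```

The loop over batches stays, since its cost is negligible next to the file reading and the per-batch computation.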

Vehbi Sinan Tunalioglu

2005-Jul-08 07:30 UTC

[R] r: LOOPING

Uwe Ligges wrote:
> Clark Allan wrote:
>>
>>i know that one should try to limit the amount of looping in R
>>programs. i have supplied some code below. i am interested in seeing how
>>the code could be rewritten if we don't use the loops.
> 
> It is not always a good thing to remove loops (without having looked at 
> the details of the code below).
> "Compromise" is the keyword.

If each iteration runs a big routine, and especially if the data
structures you are manipulating contain back references, it becomes
really _hard to translate_ loop statements into the filter-map-accumulator
routines provided by R (i.e. the *apply functions). In some situations I
really cannot find my way. What I have done so far is to write those
routines in C. Dirty hack :(
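
As an illustration of the "back reference" problem: in exponential smoothing each value depends on the previous result, s[t] = a*x[t] + (1-a)*s[t-1], so sapply cannot express it directly. The sketch below shows the plain loop next to an accumulator version built on Reduce with accumulate = TRUE. The function names and the smoothing example are chosen here for illustration only, and whether the Reduce form is any clearer or faster than the loop is open to debate.

```r
# Plain loop: each step refers back to the previous result.
ewma_loop <- function(x, a) {
  s <- numeric(length(x))
  s[1] <- x[1]
  for (t in seq_along(x)[-1]) {
    s[t] <- a * x[t] + (1 - a) * s[t - 1]
  }
  s
}

# Accumulator version: Reduce passes the previous result ("prev")
# into the next step, and accumulate = TRUE keeps every intermediate value.
ewma_reduce <- function(x, a) {
  Reduce(function(prev, cur) a * cur + (1 - a) * prev, x, accumulate = TRUE)
}

x <- c(1, 2, 3, 4)
s1 <- ewma_loop(x, 0.5)
s2 <- ewma_reduce(x, 0.5)
```

Both versions still visit the elements one by one. Reduce only hides the loop, so for recursions like this the explicit loop (or C code) remains a reasonable choice.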

Maybe someone should write a tutorial, "How to avoid loops in R", built
around possible scenarios.

--vst
