Hi All, I have millions of regression lines to fit. So I am looking for the most efficient approach in R. Details: I have a large desing matrix X. The dimension is n by p. Each time when fitting the model, select rows from this matrix X and form a new design matrix, called X_current. There is another binary matrix M, with dim m by n, and each row is a 1*n vector. It helps to determin X_current. For instance, if the first row of M is (1,0,1,0,1,..,1), then I remove the 2nd and 4th row from X, and use this submatrix as the design matrix to fit the model. if the second row is (0,0,0,1,...,1), then I use X[4:n,] as input matrix to do the regression, etc. I would like to extract a single element of estimated parameter beta, and the SSE from the results. So far what I know is to use these functions which performs better than lm(): 1. lsfit() # this works faster than lm(), but have to extract the results without using ls.print(print=F), maybe I/O slows down. And not better than the other two approaches 2. directly use solve(X'X) and the formula from score equation to get the results # not sure how well for almost singular or wired matrices 3. directly use qr() with LAPACK or qr.solve() and the same formula # qr decomposion is reliable but takes a little more time than solve() Below is a piece of code for demonstration and time comparision. I do not know the best way to generate random non-singular design matrix, so I add try() to ignore errors. I did not include apply/sapply/lapply, seems they can save the memory but in my case do not improve. Thank you. #---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- M = matrix( NA, nrow = 10^4, ncol = 200) for (i in 1:10^4) { M[i,] = sample(c(0,1), size=200, replace = T) } X = cbind( rnorm(200,1,10) ,1:200, sample(1:10^5,200,T)) Y = rnorm(200) # loop version resu1=matrix(NA,nrow=10^4, ncol=2) resu2=matrix(NA,nrow=10^4, ncol=2) resu3=matrix(NA,nrow=10^4, ncol=2) resu4=matrix(NA,nrow=10^4, ncol=2) t1 = proc.time() for (i in 1:10^4) { fit.ls = lsfit( x = X[M[i,],], y = Y[M[i,]], intercept=F) resu1[i,] = c( fit.ls$coeff[2], sum((fit.ls$res)^2)) } t2 = proc.time() for (i in 1:10^4) { hatM = t(X[M[i,],])%*%X[M[i,],] try( beta <- ( solve( hatM ) %*% t( X[M[i,],] )%*%Y[M[i,]] ), silent = T) try( varr <- sum((Y[M[i,]]-X[M[i,],]%*%beta)^2), silent = T) try( resu2[i, ] <- c( beta[2], varr ), silent = T) } t3 = proc.time() for (i in 1:10^4) { hatM = t(X[M[i,],])%*%X[M[i,],]] try( beta <- ( qr.solve( hatM ) %*% t( X[M[i,],] )%*%Y[M[i,]] ), silent = T) try( varr <- sum(( Y[M[i,]]-X[M[i,],]%*%beta)^2 ), silent = T) try( resu2[i, ] <- c( beta[2], varr )), silent = T) } t4 = proc.time() (t2-t1)[3] (t3-t2)[3] (t4-t3)[3] #----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
This page will tell you how to delete rows: stat.ethz.ch/pipermail/r-help/2007-November/146220.html I have used lm() with matrices, however, there maybe a more efficient method. I don't have the need for matrix functions, often. Hopefully, someone can direct you, if there is a more efficient method. ~Nicole Ford Ph.D. Student Graduate Assistant/ Instructor Department of Government and International Affairs University of South Florida office: SOC 012M Sent from my iPhone On Feb 26, 2013, at 11:41 AM, Jie <jimmycloud@gmail.com> wrote:> Hi All, > > I have millions of regression lines to fit. So I am looking for the > most efficient approach in R. > > Details: > I have a large desing matrix X. The dimension is n by p. > Each time when fitting the model, select rows from this matrix X and > form a new design matrix, called X_current. > There is another binary matrix M, with dim m by n, and each row is a > 1*n vector. It helps to determin X_current. > For instance, if the first row of M is (1,0,1,0,1,..,1), then I remove > the 2nd and 4th row from X, and use this submatrix as the design > matrix to fit the model. > if the second row is (0,0,0,1,...,1), then I use X[4:n,] as input > matrix to do the regression, > etc. > > I would like to extract a single element of estimated parameter beta, > and the SSE from the results. > > So far what I know is to use these functions which performs better than lm(): > 1. lsfit() # this works faster than lm(), but have to extract the > results without using ls.print(print=F), maybe I/O slows down. And not > better than the other two approaches > 2. directly use solve(X'X) and the formula from score equation to get > the results # not sure how well for almost singular or wired matrices > 3. directly use qr() with LAPACK or qr.solve() and the same formula > # qr decomposion is reliable but takes a little more time than solve() > > Below is a piece of code for demonstration and time comparision. I do > not know the best way to generate random non-singular design matrix, > so I add try() to ignore errors. > I did not include apply/sapply/lapply, seems they can save the memory > but in my case do not improve. Thank you. > > #---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > M = matrix( NA, nrow = 10^4, ncol = 200) > for (i in 1:10^4) > { > M[i,] = sample(c(0,1), size=200, replace = T) > } > > X = cbind( rnorm(200,1,10) ,1:200, sample(1:10^5,200,T)) > Y = rnorm(200) > > > > # loop version > resu1=matrix(NA,nrow=10^4, ncol=2) > resu2=matrix(NA,nrow=10^4, ncol=2) > resu3=matrix(NA,nrow=10^4, ncol=2) > resu4=matrix(NA,nrow=10^4, ncol=2) > > > t1 = proc.time() > for (i in 1:10^4) > { > fit.ls = lsfit( x = X[M[i,],], y = Y[M[i,]], intercept=F) > resu1[i,] = c( fit.ls$coeff[2], sum((fit.ls$res)^2)) > } > t2 = proc.time() > for (i in 1:10^4) > { > hatM = t(X[M[i,],])%*%X[M[i,],] > try( beta <- ( solve( hatM ) %*% t( X[M[i,],] )%*%Y[M[i,]] ), silent = T) > try( varr <- sum((Y[M[i,]]-X[M[i,],]%*%beta)^2), silent = T) > try( resu2[i, ] <- c( beta[2], varr ), silent = T) > } > t3 = proc.time() > > for (i in 1:10^4) > { > hatM = t(X[M[i,],])%*%X[M[i,],]] > try( beta <- ( qr.solve( hatM ) %*% t( X[M[i,],] )%*%Y[M[i,]] ), silent = T) > try( varr <- sum(( Y[M[i,]]-X[M[i,],]%*%beta)^2 ), silent = T) > try( resu2[i, ] <- c( beta[2], varr )), silent = T) > } > t4 = proc.time() > > (t2-t1)[3] > (t3-t2)[3] > (t4-t3)[3] > > > #---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > ______________________________________________ > R-help@r-project.org mailing list > stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.[[alternative HTML version deleted]]
I am sorry I realized your asking how to extract elements.... I hope this will -at least- put you in the right direction until someone else can help. data.princeton.edu/R/linearModels.html ~Nicole Ford Ph.D. Student Graduate Assistant/ Instructor Department of Government and International Affairs University of South Florida office: SOC 012M Sent from my iPhone On Feb 26, 2013, at 11:41 AM, Jie <jimmycloud@gmail.com> wrote:> Hi All, > > I have millions of regression lines to fit. So I am looking for the > most efficient approach in R. > > Details: > I have a large desing matrix X. The dimension is n by p. > Each time when fitting the model, select rows from this matrix X and > form a new design matrix, called X_current. > There is another binary matrix M, with dim m by n, and each row is a > 1*n vector. It helps to determin X_current. > For instance, if the first row of M is (1,0,1,0,1,..,1), then I remove > the 2nd and 4th row from X, and use this submatrix as the design > matrix to fit the model. > if the second row is (0,0,0,1,...,1), then I use X[4:n,] as input > matrix to do the regression, > etc. > > I would like to extract a single element of estimated parameter beta, > and the SSE from the results. > > So far what I know is to use these functions which performs better than lm(): > 1. lsfit() # this works faster than lm(), but have to extract the > results without using ls.print(print=F), maybe I/O slows down. And not > better than the other two approaches > 2. directly use solve(X'X) and the formula from score equation to get > the results # not sure how well for almost singular or wired matrices > 3. directly use qr() with LAPACK or qr.solve() and the same formula > # qr decomposion is reliable but takes a little more time than solve() > > Below is a piece of code for demonstration and time comparision. I do > not know the best way to generate random non-singular design matrix, > so I add try() to ignore errors. > I did not include apply/sapply/lapply, seems they can save the memory > but in my case do not improve. Thank you. > > #---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > M = matrix( NA, nrow = 10^4, ncol = 200) > for (i in 1:10^4) > { > M[i,] = sample(c(0,1), size=200, replace = T) > } > > X = cbind( rnorm(200,1,10) ,1:200, sample(1:10^5,200,T)) > Y = rnorm(200) > > > > # loop version > resu1=matrix(NA,nrow=10^4, ncol=2) > resu2=matrix(NA,nrow=10^4, ncol=2) > resu3=matrix(NA,nrow=10^4, ncol=2) > resu4=matrix(NA,nrow=10^4, ncol=2) > > > t1 = proc.time() > for (i in 1:10^4) > { > fit.ls = lsfit( x = X[M[i,],], y = Y[M[i,]], intercept=F) > resu1[i,] = c( fit.ls$coeff[2], sum((fit.ls$res)^2)) > } > t2 = proc.time() > for (i in 1:10^4) > { > hatM = t(X[M[i,],])%*%X[M[i,],] > try( beta <- ( solve( hatM ) %*% t( X[M[i,],] )%*%Y[M[i,]] ), silent = T) > try( varr <- sum((Y[M[i,]]-X[M[i,],]%*%beta)^2), silent = T) > try( resu2[i, ] <- c( beta[2], varr ), silent = T) > } > t3 = proc.time() > > for (i in 1:10^4) > { > hatM = t(X[M[i,],])%*%X[M[i,],]] > try( beta <- ( qr.solve( hatM ) %*% t( X[M[i,],] )%*%Y[M[i,]] ), silent = T) > try( varr <- sum(( Y[M[i,]]-X[M[i,],]%*%beta)^2 ), silent = T) > try( resu2[i, ] <- c( beta[2], varr )), silent = T) > } > t4 = proc.time() > > (t2-t1)[3] > (t3-t2)[3] > (t4-t3)[3] > > > #---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- > > ______________________________________________ > R-help@r-project.org mailing list > stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.[[alternative HTML version deleted]]