Hello all,

I am working with a very large data set in R, and I have no interest in reviving my SAS skills. Given the size of the data file, I will need to drop unwanted variables. The most common strategy seems to be subsetting the data after it is read into R. Unfortunately, given the size of the data set, I can't get the file read in and then subsequently do the subset procedure. I would be appreciative of help on the following:

1. What are the possibilities of reading in just a small set of variables in the read.table() call (or another 'read' function)? That is, is it possible to specify just the variables that I want to keep?

2. Can I randomly select a set of observations during the 'read' step?

I have searched various R resources for this information, so if I am simply overlooking a key resource on this issue, pointing that out to me would be greatly appreciated.

Thanks in advance.

Brian
On Jan 3, 2008 9:00 AM, BEP <perronbe at gmail.com> wrote:

> 1. What are the possibilities of reading in just a small set of variables
> in the read.table() call (or another 'read' function)? That is, is it
> possible to specify just the variables that I want to keep?

read.table() can skip columns. Specify the relevant components of colClasses as "NULL" (the character string, not the NULL object).

> 2. Can I randomly select a set of observations during the 'read' step?

The development version of sqldf can do all of the above (i.e. read in a subset of columns, a subset of rows, or a random subset of rows), subject to certain limitations on the input format. See Example 6 on the home page:

http://sqldf.googlecode.com

readTable() in the R.utils package can also read in a subset of rows and columns.
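For example, a minimal colClasses sketch (the file name "big.txt", the five-column layout, and the classes are made up for illustration); it keeps only columns 2 and 5:

# "NULL" entries are skipped entirely; the kept columns are given their
# real classes so read.table() does not have to guess them.
dat <- read.table("big.txt", header = TRUE,
                  colClasses = c("NULL", "numeric", "NULL", "NULL", "character"))

And a sketch of the sqldf route, assuming a csv input, a sqldf version that provides read.csv.sql(), and made-up column names a and b; SQLite's random() makes the WHERE clause keep roughly 1% of the rows:

library(sqldf)
# 'file' is the table name sqldf gives the data read from big.csv
samp <- read.csv.sql("big.csv",
                     sql = "select a, b from file where random() % 100 = 0")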
BEP wrote:

> 1. What are the possibilities of reading in just a small set of variables
> in the read.table() call (or another 'read' function)? That is, is it
> possible to specify just the variables that I want to keep?

Check this for input of specific columns from a large data matrix:

mysubsetdata <- do.call("cbind",
    scan(file = "location and name of your file",
         what = list(NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
                     0, 0, NULL, NULL),
         flush = TRUE))

This will input only columns 10 and 11 into 'mysubsetdata': scan() skips every column whose entry in 'what' is NULL and reads each column given a template value (here 0, i.e. numeric). With this method you can work out the way to select any set of columns; for the random-rows part of the question, see the sketch below.

HTH
Rubén
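One plain-R sketch for random rows along the same lines (the chunk size, the 1% sampling fraction, and the 13-column file layout are all made up, and the file is assumed to have no header line): read the file in chunks with scan() on an open connection and keep a random fraction of each chunk.

con <- file("location and name of your file", open = "r")
template <- list(NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL,
                 0, 0, NULL, NULL)
keep <- list()
repeat {
    # read up to 10000 records, continuing where the previous call stopped
    chunk <- scan(con, what = template, flush = TRUE,
                  nmax = 10000, quiet = TRUE)
    m <- do.call("cbind", chunk)
    if (is.null(m) || nrow(m) == 0) break
    # retain roughly 1% of the rows in this chunk
    keep[[length(keep) + 1]] <- m[runif(nrow(m)) < 0.01, , drop = FALSE]
    if (nrow(m) < 10000) break   # a short chunk means end of file
}
close(con)
mysample <- do.call("rbind", keep)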