I've been trying to figure out how to read in a large file for a few days
now, and after extensive research I'm still not sure what to do.

I have a large comma-delimited text file that contains 59 fields in each
record.  There is also a header every 121 records.

This function works well for smallish files:

getcsv <- function(fname){
    ff <- file(description = fname)
    x <- readLines(ff)
    closeAllConnections()
    x <- x[x != ""]             # REMOVE BLANKS
    x <- x[grep("^[-0-9]", x)]  # REMOVE ALL TEXT

    spl <- strsplit(x, ',')     # THIS PART IS SLOW, BUT MANAGEABLE

    xx <- t(sapply(1:length(spl),
                   function(temp) as.vector(na.omit(as.numeric(spl[[temp]])))))
    return(xx)
}

It's not elegant, but it works.
For 121,000 records it completes in 2.3 seconds.
For 121,000*5 records it completes in 63 seconds.
For 121,000*10 records it doesn't complete.

When I try other methods to read the file in chunks (using scan), the
process breaks down because I have to start at the beginning of the file
on every iteration.  For example:

fnn <- function(n, col){
    a <- 122*(n-1) + 2
    xx <- scan(fname, skip = a-1, nlines = 121, sep = ',',
               quiet = TRUE, what = character(0))
    xx <- xx[xx != '']
    xx <- matrix(xx, ncol = 49, byrow = TRUE)
    xx[, col]
}
system.time(sapply(1:10,    fnn, col = 26))   # 0.31 seconds
system.time(sapply(91:100,  fnn, col = 26))   # 1.09 seconds
system.time(sapply(901:910, fnn, col = 26))   # 5.78 seconds

Even though I'm only getting the 26th column for 10 sets of records, it
takes a lot longer the further into the file I go.

How can I tell scan to pick up where it left off, without it starting at
the beginning?  There must be a good example somewhere.

I have done a lot of research (in fact, thank you to Michael J. Crawley and
others for your help thus far).

Thanks,

Gene
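For anyone wanting to reproduce the timings above, a small hypothetical
generator along these lines produces the layout Gene describes: one text
header line followed by 121 comma-delimited records of 59 numeric fields
per block.  The file name, header text and random values are all made up.

make_test_file <- function(fname = "test.csv", nblocks = 10) {
    con <- file(fname, open = "wt")
    on.exit(close(con))
    for (i in seq_len(nblocks)) {
        # one text header line per block, then 121 records of 59 numbers
        writeLines(sprintf("Block %d header", i), con)
        block <- matrix(round(rnorm(121 * 59), 3), nrow = 121)
        write.table(block, con, sep = ",",
                    row.names = FALSE, col.names = FALSE)
    }
    invisible(fname)
}
make_test_file("test.csv", nblocks = 1000)   # roughly 121,000 records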
On 11/2/2009 2:03 PM, Gene Leynes wrote:
> I've been trying to figure out how to read in a large file for a few days
> now, and after extensive research I'm still not sure what to do.
>
> I have a large comma delimited text file that contains 59 fields in each
> record.
> There is also a header every 121 records

You can open the connection before reading, then read in blocks of lines
and process those.  You don't need to reopen it every time.  For example,

ff <- file(fname, open = "rt")   # "rt" is read text
for (block in 1:nblocks) {
    x <- readLines(ff, n = 121)
    # process this block
}
close(ff)

Duncan Murdoch
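The same idea answers the scan question directly: when scan() is given an
already-open connection, it reads from the connection's current position
and leaves the connection open, so it picks up where it left off.  A
minimal sketch along the lines of Gene's fnn() -- fname, nblocks and the
ncol = 49 reshaping are taken from his post, and this is untested against
his actual file:

con <- file(fname, open = "rt")
col26 <- vector("list", nblocks)
for (n in seq_len(nblocks)) {
    readLines(con, n = 1)    # consume the block's header line
    xx <- scan(con, nlines = 121, sep = ",", quiet = TRUE,
               what = character(0))
    xx <- matrix(xx[xx != ""], ncol = 49, byrow = TRUE)
    col26[[n]] <- as.numeric(xx[, 26])
}
close(con)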
Hi Gene,

Rather than using R to parse this file, have you considered using either
grep or sed to pre-process the file and then read it in?  It looks like
you just want lines starting with numbers, so something like

grep '^[0-9]\+' thefile.csv > otherfile.csv

should be much faster, and then you can just read in otherfile.csv using
read.csv().

Best,

Jim

Gene Leynes wrote:
> I've been trying to figure out how to read in a large file for a few days
> now, and after extensive research I'm still not sure what to do.
> [...]

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
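A sketch of this pre-processing route that skips the intermediate file by
piping grep straight into read.csv().  It assumes a Unix-alike with grep
on the path; 'thefile.csv' is a placeholder, and the pattern includes a
minus sign because Gene's own filter was "^[-0-9]", i.e. records may start
with a negative number.

# keep only lines starting with a digit or minus sign, then parse as CSV
dat <- read.csv(pipe("grep '^[-0-9]' thefile.csv"), header = FALSE)
dim(dat)
x26 <- dat[, 26]   # e.g. the 26th column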
Gene,

You might want to look at the function read.csv.ffdf from package ff,
which can read large csv files into an ffdf object.  That's a kind of
data.frame which is stored on disk (or rather in the file-system cache)
instead of in RAM.  Once you subscript part of it, you get a regular
data.frame.

Jens Oehlschlägel
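A rough, untested sketch of this route.  The chunk sizes are arbitrary,
'otherfile.csv' is a placeholder, and it assumes the interleaved text
headers have already been stripped out (e.g. with the grep step above) so
that every remaining line is a data record.

library(ff)
dat <- read.csv.ffdf(file = "otherfile.csv", header = FALSE,
                     first.rows = 10000, next.rows = 50000)
dim(dat)               # the ffdf lives on disk, not in RAM
chunk <- dat[1:121, ]  # subscripting returns an ordinary data.frame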
If the headers all start with the same letter, "A" say, and the data only
contain numbers on their lines, then just use

read.table(..., comment.char = "A")

On Mon, Nov 2, 2009 at 2:03 PM, Gene Leynes <gleynes+r at gmail.com> wrote:
> I've been trying to figure out how to read in a large file for a few days
> now, and after extensive research I'm still not sure what to do.
> [...]
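A minimal sketch of this, assuming the header lines really do all begin
with "A" and the data lines begin with a digit or minus sign; the file
name is a placeholder.  Lines starting with the comment character are
discarded, and blank lines are skipped by default.

dat <- read.table("thefile.csv", sep = ",", header = FALSE,
                  comment.char = "A", blank.lines.skip = TRUE)
x26 <- dat[, 26]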