I have very large CSV files (up to 1GB each of ASCII text). I'd like to
be able to read them directly into R. The problem I am having is with
the variable length of the data in each record.

Here's a (simplified) example:

$ cat foo.csv
Name,Start Month,Data
Foo,10,-0.5615,2.3065,0.1589,-0.3649,1.5955
Bar,21,0.0880,0.5733,0.0081,2.0253,-0.7602,0.7765,0.2810,1.8546,0.2696,0.3316,0.1565,-0.4847,-0.1325,0.0454,-1.2114

The records consist of rows with some set comma-separated fields
(e.g. the "Name" & "Start Month" fields in the above) and then the data
follow as a variable-length list of comma-separated values until a new
line is encountered.

Now I can use e.g.

fileName <- "foo.csv"
ta <- read.csv(fileName, header=FALSE, skip=1, sep=",", dec=".", fill=TRUE)

which does the job nicely:

   V1 V2      V3     V4     V5      V6      V7     V8    V9    V10    V11    V12    V13     V14     V15    V16     V17
1 Foo 10 -0.5615 2.3065 0.1589 -0.3649  1.5955     NA    NA     NA     NA     NA     NA      NA      NA     NA      NA
2 Bar 21  0.0880 0.5733 0.0081  2.0253 -0.7602 0.7765 0.281 1.8546 0.2696 0.3316 0.1565 -0.4847 -0.1325 0.0454 -1.2114

but the problem is that with files on the order of 1GB this either
crunches forever or runs out of memory trying ... plus having all those
NAs isn't too pretty to look at.

(I have a MATLAB version that can read this stuff into an array of
cells in about 3 minutes.)

I really want a fast way to read the data part into a list; that way I
can access the records in the list by doing something like
ta[[i]]$data.

Ideas?

Thanks,

Jack.
Using a file() connection in conjunction with readLines() and
strsplit() should do it. I would try to count the number of lines in
the file first, and create a list with that many components, then fill
it in. I believe the "array of cells" in Matlab is sort of equivalent
to a list in R, but that's beyond my knowledge of Matlab...

Andy

From: John McHenry
> I have very large CSV files (up to 1GB each of ASCII text). I'd like
> to be able to read them directly into R. The problem I am having is
> with the variable length of the data in each record.
> [...]
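A minimal sketch of the approach Andy describes (not from the thread;
the foo.csv layout and the name/start/data field labels are assumed
from the original post):

## Read all lines, then split each record and pre-allocate the list.
fileName <- "foo.csv"
con <- file(fileName, open = "r")
lines <- readLines(con)
close(con)
lines <- lines[-1]                       # drop the header row

ta <- vector("list", length(lines))      # one component per record
for (i in seq_along(lines)) {
  fields <- strsplit(lines[i], ",")[[1]]
  ta[[i]] <- list(name  = fields[1],
                  start = as.integer(fields[2]),
                  data  = as.numeric(fields[-(1:2)]))
}

ta[[2]]$data   # the variable-length data for record 2

This gives exactly the ta[[i]]$data access pattern Jack asked for, at
the cost of an R-level loop over the records.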
On 06-Dec-05 John McHenry wrote:
> I have very large CSV files (up to 1GB each of ASCII text). I'd like
> to be able to read them directly into R. The problem I am having is
> with the variable length of the data in each record.
> [...]
> The records consist of rows with some set comma-separated fields
> (e.g. the "Name" & "Start Month" fields in the above) and then the
> data follow as a variable-length list of comma-separated values
> until a new line is encountered.

While you may well get a good R solution from the experts, in such a
situation (as in so many) I would be tempted to pre-process the file
with 'awk' (installed by default on Unix/Linux systems, available also
for Windows). The following will give you a CSV file with a constant
number of fields per line. While this does not eliminate the NAs which
you apparently find unsightly, it should be a fast and clean way of
doing the basic job, since it is a line-by-line operation in two
passes, so there should be no question of choking the system (unless
you run out of HD space as a result of creating the second file).

Two passes, on the lines of:

Pass 1:

  cat foo.csv | awk '
    BEGIN{FS=","; n=0}
    {m=NF; if(m>n){n=m}}
    END{print n}
  '

which gives you the maximum number of fields in any line. Suppose (for
example) that this number is 37. Then

Pass 2:

  cat foo.csv | awk -v maxF=37 '
    BEGIN{FS=","; OFS=","}
    {if(NF<maxF){$maxF=""}}
    {print $0}
  ' > newfoo.csv

Tiny example:

1) See foo.csv:

  cat foo.csv
  1
  1,2
  1,2,3
  1,2,3,4
  1,2

2) Pass 1:

  cat foo.csv | awk '
    BEGIN{FS=","; n=0}
    {m=NF; if(m>n){n=m}}
    END{print n}
  '
  4

3) So we need 4 fields per line. With maxF=4, Pass 2:

  cat foo.csv | awk -v maxF=4 '
    BEGIN{FS=","; OFS=","}
    {if(NF<maxF){$maxF=""}}
    {print $0}
  ' > newfoo.csv

4) See newfoo.csv:

  cat newfoo.csv
  1,,,
  1,2,,
  1,2,3,
  1,2,3,4
  1,2,,

So you now have a CSV file with a constant number of fields per line.
This doesn't make it into lists, though.

Hoping this helps,
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Dec-05   Time: 18:08:54
------------------------------ XFMail ------------------------------
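Ted notes the padded file still isn't in list form. A hedged sketch
(not from the thread) of the R side, assuming the awk pass also padded
foo.csv's header line and that Pass 1 reported 17 fields (the width of
the example read.csv output earlier):

## Read the padded newfoo.csv and rebuild per-record lists, dropping
## the padding NAs again. nFields = 17 is an assumption from the
## example; in practice it comes from Pass 1.
nFields <- 17
tab <- read.csv("newfoo.csv", header = FALSE, skip = 1,
                colClasses = c("character", "integer",
                               rep("numeric", nFields - 2)))

ta <- lapply(seq_len(nrow(tab)), function(i) {
  vals <- unlist(tab[i, -(1:2)], use.names = FALSE)
  list(name  = tab[i, 1],
       start = tab[i, 2],
       data  = vals[!is.na(vals)])   # keep only the real data values
})

Passing colClasses explicitly also spares read.csv the type-guessing
that slows it down on large files.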
A slight variation on one of Gabor's ideas might work:

## simulate a data file:
n <- 2e5
minF <- 20
maxF <- 30
f <- file("test.csv", open="w")
invisible(replicate(n, writeLines(paste(runif(sample(minF:maxF, 1)),
                                        collapse=","), f)))
close(f)

f <- file("test.csv", open="r")
system.time(dat <- replicate(n, scan(f, nlines=1, sep=",")))
close(f)

The above code creates a file around 270MB. It took around 46 seconds
on my 1GB laptop to read the data into "dat". The corresponding
strsplit(readLines()) solution took over a minute, and another 23
seconds to run lapply(dat, as.numeric).

Andy

-----Original Message-----
From: John McHenry [mailto:john_d_mchenry@yahoo.com]
Sent: Tuesday, December 06, 2005 3:05 PM
To: Gabor Grothendieck
Cc: Liaw, Andy; r-help@stat.math.ethz.ch
Subject: Re: [R] reading in data with variable length

Everything has slowed down with #1 and #3 by about 50%. Can't do #2 &
#4:

> ta.num <- lapply(ta0, scan, sep = ",")
Error in file(file, "r") : unable to open connection

scan seems to want a file or a connection ...

Gabor Grothendieck <ggrothendieck@gmail.com> wrote:

Could you time these and see how each of them does:

# 1
ta.split <- strsplit(ta, split = ",")
ta.num <- lapply(ta.split, function(x) as.numeric(x[-(1:2)]))

# 2
ta0 <- sub("^[^,]*,[^,]*,", "", ta)
ta.num <- lapply(ta0, scan, sep = ",")

# 3 - loop version of #1
n <- length(ta)
ta.split <- strsplit(ta, split = ",")
ta.num <- vector("list", n)
for(i in 1:n) ta.num[[i]] <- as.numeric(ta.split[[i]][-(1:2)])

# 4 - loop version of #2
n <- length(ta)
ta0 <- sub("^[^,]*,[^,]*,", "", ta)
ta.num <- vector("list", n)
for(i in 1:n) ta.num[[i]] <- scan(ta0[[i]])

On 12/6/05, John McHenry wrote:

> I should have mentioned that I already tried the readLines()
> approach:
>
> ta <- readLines("foo.csv")
> ptm <- proc.time()
> f <- character(length(ta))
> for (k in 2:length(ta)) { f[k-1] <- (strsplit(ta[k], ",")[[1]])[3] }
> # <- PARSING EACH LINE AT THIS LEVEL IS WHERE THE REAL INEFFICIENCY IS
> (proc.time()-ptm)[3]
> [1] 102.75
>
> on a 62M file, so I'm guessing that on my 1GB files this will be
> about
>
> > (102.75*(1000/61))/60
> [1] 28.07377
>
> minutes ... which is way, way too long.
>
> I'm new to R, but I'm kind of surprised that this problem isn't well
> known (couldn't find anything after a long hunt).
>
> As I mentioned, MATLAB does it using textread, which makes a call to
> its DLL dataread. The data are read using something like:
>
> [name, startMonth, data]=textread(fileName, '%s%n%[^\n]',
>     'delimiter', ',', 'bufsize', 1000000, 'headerlines', 1);
>
> which is kind of fscanf-like. data in the above is then a cell array
> with each cell being the variable-length data.
>
> "Liaw, Andy" wrote:
> [...]
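For completeness: Gabor's #2 and #4 fail for John because scan()'s
first argument is a file name or connection, not a character string of
data. A hedged sketch of a workaround (not posted in the thread) that
feeds each stripped record to scan() through a textConnection():

## Hypothetical fix for approaches #2/#4: wrap each stripped line in a
## textConnection() so scan() gets the connection it expects.
ta  <- readLines("foo.csv")[-1]         # drop the header line
ta0 <- sub("^[^,]*,[^,]*,", "", ta)     # strip the two fixed fields

ta.num <- lapply(ta0, function(s) {
  con <- textConnection(s)
  on.exit(close(con))
  scan(con, sep = ",", quiet = TRUE)
})

Opening and closing a connection per record carries overhead of its
own, though, so on 1GB files Andy's single-connection
scan(f, nlines=1) loop above is likely the faster route.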