hi all

If I wanna get the total number of lines in a big file without reading
the file's content into R as matrix or data frame, any methods or
functions?

thanks in advance.

Regards
On Mon, 2004-12-06 at 22:12 +0800, Hu Chen wrote:
> hi all
> If I wanna get the total number of lines in a big file without reading
> the file's content into R as matrix or data frame, any methods or
> functions?
> thanks in advance.
> Regards

See ?readLines

You can use:

length(readLines("FileName"))

to get the number of lines read.

HTH,

Marc Schwartz
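To see concretely why length(readLines()) counts lines, here is a minimal
self-contained sketch; the temporary file and its three lines are invented
purely for this demo:

    ## readLines() returns a character vector with one element per line,
    ## so length() of that vector is the line count.
    tmp <- tempfile()
    writeLines(c("line 1", "line 2", "line 3"), tmp)
    length(readLines(tmp))   # 3
    unlink(tmp)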
Hu Chen wrote:
> hi all
> If I wanna get the total number of lines in a big file without reading
> the file's content into R as matrix or data frame, any methods or
> functions?

You must read it in R; how else should one determine the number of lines
in a file (if you don't want to use another program)?
I'd suggest length(readLines(...)).

Uwe Ligges

> thanks in advance.
> Regards
> From: Marc Schwartz
>
> On Mon, 2004-12-06 at 22:12 +0800, Hu Chen wrote:
> > hi all
> > If I wanna get the total number of lines in a big file without reading
> > the file's content into R as matrix or data frame, any methods or
> > functions?
> > thanks in advance.
> > Regards
>
> See ?readLines
>
> You can use:
>
> length(readLines("FileName"))
>
> to get the number of lines read.
>
> HTH,
>
> Marc Schwartz

On a system equipped with `wc' (*nix or Windows with such utilities
installed and on PATH) I would use that. Otherwise
length(count.fields()) might be a good choice.

Cheers,
Andy
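For concreteness, a hedged sketch of the two alternatives Andy mentions;
"FileName" is a placeholder, and the second line assumes a `wc' binary is
on the PATH:

    ## Pure R: count.fields() scans the file but never builds a matrix
    ## or data frame; its result has one element per (non-blank) line.
    n1 <- length(count.fields("FileName"))

    ## Shell: delegate the counting to wc and capture its output.
    n2 <- as.numeric(system("wc -l < FileName", intern = TRUE))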
> From: Liaw, Andy
>
> > From: Marc Schwartz
> >
> > On Mon, 2004-12-06 at 22:12 +0800, Hu Chen wrote:
> > > hi all
> > > If I wanna get the total number of lines in a big file without reading
> > > the file's content into R as matrix or data frame, any methods or
> > > functions?
> > > thanks in advance.
> > > Regards
> >
> > See ?readLines
> >
> > You can use:
> >
> > length(readLines("FileName"))
> >
> > to get the number of lines read.
> >
> > HTH,
> >
> > Marc Schwartz
>
> On a system equipped with `wc' (*nix or Windows with such utilities
> installed and on PATH) I would use that. Otherwise
> length(count.fields()) might be a good choice.
>
> Cheers,
> Andy

Marc alerted me off-list that count.fields() might spend time delimiting
fields, which is not needed for the purpose of counting lines, and
suggested using sep="\n" as a possible way to make it more efficient.
(Thanks, Marc!)

Here are some tests on a file with 14337 lines and 8900 fields (space
delimited).

> system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
[1] 48.86  0.24 49.30  0.00  0.00
> system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
[1] 42.19  0.26 42.60  0.00  0.00
> n
[1] 14337
> system.time(n2 <- length(readLines("hcv.ap")), gcFirst=TRUE)
[1] 37.77  0.56 38.35  0.00  0.00
> n2
[1] 14337
> system.time(n3 <- scan(pipe("wc -l hcv.ap"), what=list(0, NULL))[[1]], gcFirst=T)
Read 1 records
[1] 0.00 0.00 0.33 0.08 0.25
> n3
[1] 14337

My only concern with the readLines() approach is that it still needs to
read the entire file into memory (if I'm not mistaken), which may not be
desirable:

> system.time(obj <- readLines("hcv.ap"), gcFirst=TRUE)
[1] 36.72  0.48 37.24  0.00  0.00
> object.size(obj)/1024^2
[1] 244.6308

So it took 244+ MB just to store the text read in. I would use a loop
and read the file in small chunks, if I really wanted to do it in R.

Cheers,
Andy
> From: Marc Schwartz
>
> On Mon, 2004-12-06 at 12:26 -0500, Liaw, Andy wrote:
>
> > Marc alerted me off-list that count.fields() might spend time
> > delimiting fields, which is not needed for the purpose of counting
> > lines, and suggested using sep="\n" as a possible way to make it
> > more efficient. (Thanks, Marc!)
> >
> > Here are some tests on a file with 14337 lines and 8900 fields
> > (space delimited).
> >
> > > system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
> > [1] 48.86  0.24 49.30  0.00  0.00
> > > system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
> > [1] 42.19  0.26 42.60  0.00  0.00
>
> Andy,
>
> I suspect that the relatively modest gain to be had here is the result
> of count.fields() still scanning the input buffer for the delimiting
> character, even though it would occur only once per line using the
> newline character. Thus, the overhead is not reduced substantially.
>
> A scan of the source code for the .Internal function would validate
> that.
>
> Thanks for testing this.
>
> As both you and Thomas mention, 'wc' is clearly the fastest way to go
> based upon your additional figures.
>
> Best regards,
>
> Marc

Marc,

I wrote the following function to read the file in chunks:

countLines <- function(file, chunk=1e3) {
    f <- file(file, "r")
    on.exit(close(f))
    nLines <- 0
    while ((n <- length(readLines(f, chunk))) > 0)
        nLines <- nLines + n
    nLines
}

To my surprise:

> system.time(n4 <- countLines("hcv.ap"), gcFirst=TRUE)
[1] 35.24  0.26 35.53  0.00  0.00
> system.time(n4 <- countLines("hcv.ap", 1), gcFirst=TRUE)
[1] 36.10  0.32 36.43  0.00  0.00

There's almost no penalty (in time) in reading one line at a time. One
does save quite a bit of memory, though.

Cheers,
Andy
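A usage sketch of the chunked counter above; the file name is the one
from Andy's tests, and the chunk sizes are arbitrary:

    ## Only `chunk` lines are held in memory at any one time, because
    ## repeated readLines() calls advance the open connection.
    countLines("hcv.ap")               # default: 1000 lines per chunk
    countLines("hcv.ap", chunk = 1e4)  # larger chunks, fewer iterations

The key design point is passing an open connection to readLines():
given a file name instead, readLines() would reopen the file and start
from the beginning on every call.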
Richard A. O'Keefe
2004-Dec-07 02:00 UTC
[R] how to get how many lines there are in a file.
Hu Chen asked

> If I wanna get the total number of lines in a big file without reading
> the file's content into R as matrix or data frame, any methods or
> functions?

_Something_ must read it, but it doesn't have to be R.
On a UNIX system, you can simply do

    number.of.lines <- as.numeric(system(paste("wc -l <", file.name), TRUE))

Suppose file.name is "massive.csv". Then

    paste("wc -l <", file.name)

is "wc -l < massive.csv", which is a UNIX command to write the number of
lines in massive.csv to stdout, and system(cmd, TRUE) executes the UNIX
command and returns everything it writes to stdout as an R character
vector, one element per line of output. In this case, there's one line
of output, so one element. Don't forget the TRUE; without it the
command's standard output is not captured, just displayed. Finally,
as.numeric turns that string into a number.

For example, on my machine,

    > as.numeric(system("wc -l <$HOME/.cshrc", TRUE))
    [1] 32

This will work in MacOS X, and you can get 'wc' for Windows, so it can
be made to work there too. If the file is large, this is likely to be a
lot faster than reading it in R.

But the obvious question is "what happens next"? If you want to decide
whether the amount of data is too big, then

- false positives: data files may contain comments, which will be
  counted by wc but don't affect the amount of memory you need

- false negatives: the amount of memory you need depends on the number
  (and type) of columns as well as the number of lines; just counting
  the lines may leave you thinking there is room when there isn't.
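A hedged sketch addressing the first caveat, combining the chunked loop
from earlier in the thread with a filter for comment and blank lines.
The helper name, the assumption that "#" marks a comment, and the
regular expression are all invented for illustration:

    ## countDataLines() is a hypothetical helper: it counts only lines
    ## that are not blank and whose first non-space character is not
    ## "#", reading in chunks so the whole file never sits in memory.
    countDataLines <- function(file, chunk = 1e4) {
        con <- file(file, "r")
        on.exit(close(con))
        n <- 0
        while (length(lines <- readLines(con, chunk)) > 0) {
            ## drop blank lines and assumed "#" comments; adjust the
            ## pattern to match the file's actual comment convention
            n <- n + sum(!grepl("^[[:space:]]*(#|$)", lines))
        }
        n
    }

This removes the false positives from comments; the false negatives
remain, since no line count can tell you how wide or how heavy each
row's columns are.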