hi all

If I wanna get the total number of lines in a big file without reading
the file's content into R as matrix or data frame, any methods or
functions?

thanks in advance.

Regards
On Mon, 2004-12-06 at 22:12 +0800, Hu Chen wrote:
> hi all
> If I wanna get the total number of lines in a big file without reading
> the file's content into R as matrix or data frame, any methods or
> functions?
> thanks in advance.
> Regards

See ?readLines

You can use:

length(readLines("FileName"))

to get the number of lines read.

HTH,

Marc Schwartz
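To see concretely why length(readLines()) counts lines, here is a minimal
self-contained sketch; the temporary file and its three lines are invented
purely for this demo:

    ## readLines() returns a character vector with one element per line,
    ## so length() of that vector is the line count.
    tmp <- tempfile()
    writeLines(c("line 1", "line 2", "line 3"), tmp)
    length(readLines(tmp))   # 3
    unlink(tmp)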
Hu Chen wrote:
> hi all
> If I wanna get the total number of lines in a big file without reading
> the file's content into R as matrix or data frame, any methods or
> functions?

You must read it in R; how else should one determine the number of lines
in a file (if you don't want to use another program)?
I'd suggest length(readLines(...)).

Uwe Ligges

> thanks in advance.
> Regards
> From: Marc Schwartz
>
> On Mon, 2004-12-06 at 22:12 +0800, Hu Chen wrote:
> > hi all
> > If I wanna get the total number of lines in a big file without reading
> > the file's content into R as matrix or data frame, any methods or
> > functions?
> > thanks in advance.
> > Regards
>
> See ?readLines
>
> You can use:
>
> length(readLines("FileName"))
>
> to get the number of lines read.
>
> HTH,
>
> Marc Schwartz

On a system equipped with `wc' (*nix or Windows with such utilities
installed and on PATH) I would use that. Otherwise
length(count.fields()) might be a good choice.

Cheers,
Andy
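For concreteness, a hedged sketch of the two alternatives Andy mentions;
"FileName" is a placeholder, and the second line assumes a `wc' binary is
on the PATH:

    ## Pure R: count.fields() scans the file but never builds a matrix
    ## or data frame; its result has one element per (non-blank) line.
    n1 <- length(count.fields("FileName"))

    ## Shell: delegate the counting to wc and capture its output.
    n2 <- as.numeric(system("wc -l < FileName", intern = TRUE))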
> From: Liaw, Andy
>
> > From: Marc Schwartz
> >
> > On Mon, 2004-12-06 at 22:12 +0800, Hu Chen wrote:
> > > hi all
> > > If I wanna get the total number of lines in a big file without reading
> > > the file's content into R as matrix or data frame, any methods or
> > > functions?
> > > thanks in advance.
> > > Regards
> >
> > See ?readLines
> >
> > You can use:
> >
> > length(readLines("FileName"))
> >
> > to get the number of lines read.
> >
> > HTH,
> >
> > Marc Schwartz
>
> On a system equipped with `wc' (*nix or Windows with such utilities
> installed and on PATH) I would use that. Otherwise
> length(count.fields()) might be a good choice.
>
> Cheers,
> Andy

Marc alerted me off-list that count.fields() might spend time delimiting
fields, which is not needed for the purpose of counting lines, and
suggested using sep="\n" as a possible way to make it more efficient.
(Thanks, Marc!)

Here are some tests on a file with 14337 lines and 8900 fields (space
delimited).

> system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
[1] 48.86  0.24 49.30  0.00  0.00
> system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
[1] 42.19  0.26 42.60  0.00  0.00
> n
[1] 14337
> system.time(n2 <- length(readLines("hcv.ap")), gcFirst=TRUE)
[1] 37.77  0.56 38.35  0.00  0.00
> n2
[1] 14337
> system.time(n3 <- scan(pipe("wc -l hcv.ap"), what=list(0, NULL))[[1]], gcFirst=T)
Read 1 records
[1] 0.00 0.00 0.33 0.08 0.25
> n3
[1] 14337

My only concern with the readLines() approach is that it still needs to
read the entire file into memory (if I'm not mistaken), which may not be
desirable:

> system.time(obj <- readLines("hcv.ap"), gcFirst=TRUE)
[1] 36.72  0.48 37.24  0.00  0.00
> object.size(obj)/1024^2
[1] 244.6308

So it took 244+ MB just to store the text read in. I would use a loop
and read the file in small chunks, if I really wanted to do it in R.

Cheers,
Andy
> From: Marc Schwartz
>
> On Mon, 2004-12-06 at 12:26 -0500, Liaw, Andy wrote:
>
> > Marc alerted me off-list that count.fields() might spend time
> > delimiting fields, which is not needed for the purpose of counting
> > lines, and suggested using sep="\n" as a possible way to make it
> > more efficient. (Thanks, Marc!)
> >
> > Here are some tests on a file with 14337 lines and 8900 fields
> > (space delimited).
> >
> > > system.time(n <- length(count.fields("hcv.ap")), gcFirst=TRUE)
> > [1] 48.86  0.24 49.30  0.00  0.00
> > > system.time(n <- length(count.fields("hcv.ap", sep="\n")), gcFirst=TRUE)
> > [1] 42.19  0.26 42.60  0.00  0.00
>
> Andy,
>
> I suspect that the relatively modest gain to be had here is the result
> of count.fields() still scanning the input buffer for the delimiting
> character, even though it would occur only once per line using the
> newline character. Thus, the overhead is not reduced substantially.
>
> A scan of the source code for the .Internal function would validate
> that.
>
> Thanks for testing this.
>
> As both you and Thomas mention, 'wc' is clearly the fastest way to go
> based upon your additional figures.
>
> Best regards,
>
> Marc

Marc,

I wrote the following function to read the file in chunks:

countLines <- function(file, chunk=1e3) {
    f <- file(file, "r")
    on.exit(close(f))
    nLines <- 0
    while ((n <- length(readLines(f, chunk))) > 0)
        nLines <- nLines + n
    nLines
}

To my surprise:

> system.time(n4 <- countLines("hcv.ap"), gcFirst=TRUE)
[1] 35.24  0.26 35.53  0.00  0.00
> system.time(n4 <- countLines("hcv.ap", 1), gcFirst=TRUE)
[1] 36.10  0.32 36.43  0.00  0.00

There's almost no penalty (in time) in reading one line at a time. One
does save quite a bit of memory, though.

Cheers,
Andy
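A usage sketch of the chunked counter above; the file name is the one
from Andy's tests, and the chunk sizes are arbitrary:

    ## Only `chunk` lines are held in memory at any one time, because
    ## repeated readLines() calls advance the open connection.
    countLines("hcv.ap")               # default: 1000 lines per chunk
    countLines("hcv.ap", chunk = 1e4)  # larger chunks, fewer iterations

The key design point is passing an open connection to readLines():
given a file name instead, readLines() would reopen the file and start
from the beginning on every call.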
Richard A. O'Keefe
2004-Dec-07 02:00 UTC
[R] how to get how many lines there are in a file.
Hu Chen asked

> If I wanna get the total number of lines in a big file without reading
> the file's content into R as matrix or data frame, any methods or
> functions?

_Something_ must read it, but it doesn't have to be R.
On a UNIX system, you can simply do

    number.of.lines <- as.numeric(system(paste("wc -l <", file.name), TRUE))

Suppose file.name is "massive.csv". Then

    paste("wc -l <", file.name)

is "wc -l < massive.csv", which is a UNIX command to write the number of
lines in massive.csv to stdout, and system(cmd, TRUE) executes the UNIX
command and returns everything it writes to stdout as an R character
vector, one element per line of output. In this case, there's one line
of output, so one element. Don't forget the TRUE; without it the
command's standard output is not captured, just displayed. Finally,
as.numeric turns that string into a number.

For example, on my machine,

    > as.numeric(system("wc -l <$HOME/.cshrc", TRUE))
    [1] 32

This will work in MacOS X, and you can get 'wc' for Windows, so it can
be made to work there too. If the file is large, this is likely to be a
lot faster than reading it in R.

But the obvious question is "what happens next"? If you want to decide
whether the amount of data is too big, then

- false positives: data files may contain comments, which will be
  counted by wc but don't affect the amount of memory you need

- false negatives: the amount of memory you need depends on the number
  (and type) of columns as well as the number of lines; just counting
  the lines may leave you thinking there is room when there isn't.
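A hedged sketch addressing the first caveat, combining the chunked loop
from earlier in the thread with a filter for comment and blank lines.
The helper name, the assumption that "#" marks a comment, and the
regular expression are all invented for illustration:

    ## countDataLines() is a hypothetical helper: it counts only lines
    ## that are not blank and whose first non-space character is not
    ## "#", reading in chunks so the whole file never sits in memory.
    countDataLines <- function(file, chunk = 1e4) {
        con <- file(file, "r")
        on.exit(close(con))
        n <- 0
        while (length(lines <- readLines(con, chunk)) > 0) {
            ## drop blank lines and assumed "#" comments; adjust the
            ## pattern to match the file's actual comment convention
            n <- n + sum(!grepl("^[[:space:]]*(#|$)", lines))
        }
        n
    }

This removes the false positives from comments; the false negatives
remain, since no line count can tell you how wide or how heavy each
row's columns are.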