I'm wondering if anyone has written some functions or code for handling very large files in R. I am working with a data file that is 41 variables times who knows how many observations, making up 27MB altogether.

The sort of thing that I am thinking of having R do is:

- count the number of lines in a file

- form a data frame by selecting all cases whose line numbers are in a supplied vector (which could be used to extract random subfiles of particular sizes)

Does anyone know of a package that might be useful for this?

Murray

--
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz                               Fax 7 838 4155
Phone +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862
Hi,

Have you looked at "R Data Import/Export"?

On Mon, 25 Aug 2003, Murray Jorgensen wrote:
> [...]

Cheers,
Kevin

------------------------------------------------------------------------------
"On two occasions, I have been asked [by members of Parliament], 'Pray,
Mr. Babbage, if you put into the machine wrong figures, will the right
answers come out?' I am not able to rightly apprehend the kind of confusion
of ideas that could provoke such a question."
-- Charles Babbage (1791-1871)
     ---- From Computer Stupidities: http://rinkworks.com/stupid/

--
Ko-Kang Kevin Wang
Master of Science (MSc) Student
SLC Tutor and Lab Demonstrator
Department of Statistics
University of Auckland
New Zealand
Homepage: http://www.stat.auckland.ac.nz/~kwan022
Ph: 373-7599 x88475 (City) x88480 (Tamaki)
Could you be more specific? Do you mean the chapter on connections?

Ko-Kang Kevin Wang wrote:
> Hi,
>
> Have you looked at "R Data Import/Export"?
Dear Murray,

One way that works very well for many people (including me) is to store the data in an external database, such as MySQL, and read in just the bits you want using the excellent package RODBC. Getting a database to do all the selecting is very fast and efficient, leaving R to concentrate on the analysis and visualisation. This is all described in the R Import/Export Manual.

Regards,

Andrew C. Ward
CAPE Centre
Department of Chemical Engineering
The University of Queensland
Brisbane Qld 4072 Australia
andreww at cheque.uq.edu.au

Quoting Murray Jorgensen <maj at stats.waikato.ac.nz>:
> [...]
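For concreteness, a minimal sketch of the workflow Andrew describes, not taken from his post: it assumes an ODBC data source called "trafficDB" with a table "flows" that has an integer key column "row_id"; all of these names are illustrative only.

    ## Pull a random subset of rows out of the database rather than into R.
    library(RODBC)
    ch  <- odbcConnect("trafficDB")            # open the ODBC channel (DSN name is hypothetical)
    ids <- sample(1:230175, 3000)              # row keys we want this time
    qry <- paste("SELECT * FROM flows WHERE row_id IN (",
                 paste(ids, collapse = ","), ")")
    sub <- sqlQuery(ch, qry)                   # comes back as a data frame
    odbcClose(ch)

The point of the design is that the filtering happens in the database engine, so R only ever allocates memory for the 3000 selected cases.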
Murray Jorgensen <maj at stats.waikato.ac.nz> wrote:
	I'm wondering if anyone has written some functions or code for handling
	very large files in R. I am working with a data file that is 41
	variables times who knows how many observations making up 27MB altogether.

Does that really count as "very large"? I tried making a file where each line was "1 2 3 .... 39 40 41". With 240,000 lines it came to 27.36 million bytes.

You can *hold* that amount of data in R quite easily. The problem is the time it takes to read it using scan() or read.table().

	The sort of thing that I am thinking of having R do is

	- count the number of lines in a file

	- form a data frame by selecting all cases whose line numbers are in a
	supplied vector (which could be used to extract random subfiles of
	particular sizes)

	Does anyone know of a package that might be useful for this?

There's a Unix program I posted to comp.sources years ago called "sample":

    sample -(how many) <(where from)

selects the given number of lines without replacement from its standard input and writes them in random order to its standard output. Hook it up to a decent random number generator and you're pretty much done: read.table() and scan() can read from a pipe.
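To illustrate that last point, a small sketch (not part of the original post): it assumes a Unix-like system, that the "sample" filter described above (or any equivalent) is on the PATH, and the file name is made up.

    ## read.table() reads happily from a pipe() connection, so the random
    ## subsetting can be delegated to the external filter.
    sub <- read.table(pipe("sample -3000 < bigfile.txt"), header = FALSE)
    dim(sub)   # should be 3000 x 41 for the file described in this thread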
I think that is only a medium-sized file.

On Mon, 25 Aug 2003, Murray Jorgensen wrote:

> I'm wondering if anyone has written some functions or code for handling
> very large files in R. I am working with a data file that is 41
> variables times who knows how many observations making up 27MB altogether.
>
> The sort of thing that I am thinking of having R do is
>
> - count the number of lines in a file

You can do that without reading the file into memory: use

    system(paste("wc -l", filename))

or read in blocks of lines via a connection.

> - form a data frame by selecting all cases whose line numbers are in a
> supplied vector (which could be used to extract random subfiles of
> particular sizes)

R should handle that easily in today's memory sizes. Buy some more RAM if you don't already have 1/2Gb. As others have said, for a really large file, use an RDBMS to do the selection for you.

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
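For Windows users without wc, the second suggestion (reading blocks of lines through a connection) might be sketched as follows; this is not from the original post, and the function name and block size are arbitrary.

    ## Count lines without ever holding the whole file in memory: an open
    ## connection is read sequentially, so each readLines() call picks up
    ## where the previous one stopped.
    countLines <- function(filename, block = 10000) {
      con <- file(filename, open = "r")
      on.exit(close(con))
      n <- 0
      repeat {
        chunk <- readLines(con, n = block)
        if (length(chunk) == 0) break   # end of file
        n <- n + length(chunk)
      }
      n
    }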
At 08:12 25/08/2003 +0100, Prof Brian Ripley wrote:
>I think that is only a medium-sized file.

"Large" for my purposes means "more than I really want to read into memory", which in turn means "takes more than 30s". I'm at home now and the file isn't, so I'm not sure if the file is large or not.

More responses interspersed below. BTW, I forgot to mention that I'm using Windows and so do not have nice unix tools readily available.

>On Mon, 25 Aug 2003, Murray Jorgensen wrote:
>
>> I'm wondering if anyone has written some functions or code for handling
>> very large files in R. I am working with a data file that is 41
>> variables times who knows how many observations making up 27MB altogether.
>>
>> The sort of thing that I am thinking of having R do is
>>
>> - count the number of lines in a file
>
>You can do that without reading the file into memory: use
>system(paste("wc -l", filename))

Don't think that I can do that in Windows XP.

>or read in blocks of lines via a connection

But that does sound promising!

>> - form a data frame by selecting all cases whose line numbers are in a
>> supplied vector (which could be used to extract random subfiles of
>> particular sizes)
>
>R should handle that easily in today's memory sizes. Buy some more RAM if
>you don't already have 1/2Gb. As others have said, for a real large file,
>use a RDBMS to do the selection for you.

It's just that R is so good at reading in initial segments of a file that I can't believe that it can't be effective in reading more general (pre-specified) subsets.

Murray

--
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz                               Fax 7 838 4155
Phone +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862
>>> I'm wondering if anyone has written some functions or code for handling
>>> very large files in R. I am working with a data file that is 41
>>> variables times who knows how many observations making up 27MB altogether.
>>>
>>> The sort of thing that I am thinking of having R do is
>>>
>>> - count the number of lines in a file
>>
>>You can do that without reading the file into memory: use
>>system(paste("wc -l", filename))
>
>Don't think that I can do that in Windows XP.

There are many ports of unix tools for windows; a recommended collection for R is kindly provided here:

http://www.stats.ox.ac.uk/pub/Rtools/tools.zip

This includes "wc".

Cheers,
Jim

James A. Rogers, Ph.D. <rogers at cantatapharm.com>
Statistical Scientist
Cantata Pharmaceuticals
300 Technology Square, 5th floor
Cambridge, MA 02139
617.225.9009 x312
Fax 617.225.9010
Murray Jorgensen <maj at stats.waikato.ac.nz> wrote:
	"Large" for my purposes means "more than I really want to read
	into memory" which in turn means "takes more than 30s". I'm at
	home now and the file isn't so I'm not sure if the file is large
	or not.

I repeat my earlier observation. The AMOUNT OF DATA is easily handled by a typical desktop machine these days. The problem is not the amount of data; the problem is HOW LONG IT TAKES TO READ. I made several attempts to read the test file I created yesterday, and each time gave up impatiently after 5+ minutes elapsed time. I tried again today (see below) and went away to have a cup of tea &c; it took nearly 10 minutes that time and still hadn't finished. 'mawk' read _and processed_ the same file happily in under 30 seconds.

One quite serious alternative would be to write a little C function to read the file into an array, and call that from R.

> system.time(m <- matrix(1:(41*250000), nrow=250000, ncol=41))
[1] 3.28 0.79 4.28 0.00 0.00
> system.time(save(m, file="m.bin"))
[1] 8.44 0.54 9.08 0.00 0.00
> m <- NULL
> system.time(load("m.bin"))
[1] 11.25 0.19 11.51 0.00 0.00
> length(m)
[1] 10250000

The binary file m.bin is 41 million bytes.

This little transcript shows that a data set of this size can be comfortably read from disc in under 12 seconds, on the same machine where scan() took about 50 times as long before I killed it.

So yet another alternative is to write a little program that converts the data file to R binary format, and then just read the whole thing in. I think readers will agree that 12 seconds on a 500MHz machine counts as "takes less than 30s".

	It's just that R is so good in reading in initial segments of a file
	that I can't believe that it can't be effective in reading more general
	(pre-specified) subsets.

R is *good* at it, it's just not *quick*. Trying to select a subset in scan() or read.table() wouldn't help all that much, because it would still have to *scan* the data to determine what to skip.

Two more times. An unoptimised C program writing 0:(41*250000-1) as a file of 41-number lines:

    f% time a.out >m.txt
    13.0u 1.0s 0:14 94% 0+0k 0+0io 0pf+0w

> system.time(m <- read.table("m.txt", header=FALSE))
^C
Timing stopped at: 552.01 15.48 584.51 0 0

To my eyes, src/main/scan.c shows no signs of having been tuned for speed. The goals appear to have been power (the R scan() function has LOTS of options) and correctness, which are perfectly good goals, and the speed of scan() and read.table() with modest data sizes is quite good enough.
> From: Richard A. O'Keefe [mailto:ok at cs.otago.ac.nz]
>
> [...]
>
> > system.time(save(m, file="m.bin"))
> [1] 8.44 0.54 9.08 0.00 0.00
> > m <- NULL
> > system.time(load("m.bin"))
> [1] 11.25 0.19 11.51 0.00 0.00
> > length(m)
> [1] 10250000

I tried the following on my IBM T22 Thinkpad (P3-933 w/ 512MB):

> system.time(x <- matrix(runif(41*250000), 250000, 41))
[1] 6.02 0.40 6.52 NA NA
> object.size(x)
[1] 82000120
> system.time(write(t(x), file="try.dat", ncol=41))
[1] 192.12 81.60 279.64 NA NA
> system.time(xx <- matrix(scan("try.dat"), byrow=TRUE, ncol=41))
Read 10250000 items
[1] 110.90 1.09 126.89 NA NA
> system.time(xx <- read.table("try.dat", header=FALSE,
+                              colClasses=rep("numeric", 41)))
[1] 106.61 0.48 110.66 NA NA
> system.time(save(x, file="try.rda"))
[1] 9.15 1.05 19.12 NA NA
> rm(x)
> system.time(load("try.rda"))
[1] 10.22 0.33 10.69 NA NA

The last few lines show that the timing I get is approximately the same as yours, so the other timings shouldn't be too different. I don't think I can make coffee that fast. (No, I don't drink it black!)

Andy

> [...]
> The huge ratio (>552)/(<30) for R/mawk does suggest that there may be
> room for some serious improvement in scan(), possibly by means of some
> extra hints about total size, possibly by creating a fast path through
> the code.
>
> Of course the big point is that however long scan() takes to read the
> data set, it only has to be done once. Leave R running overnight and in
> the morning save the dataset out as an R binary file using save(). Then
> you'll be able to load it again quickly.
As some of the conversation has treated the 30 second mark as an arbitrary benchmark, I would also chime in that there is an assumption that any non-R related issues that impact upon being able to usefully use R should be ignored. In the real world we can't always control everything about our environment. So if there are improvements that can be made that help mitigate the reality of the world, I would welcome them.

As a little test I broke the rules of my organisation and actually put a dataset on my C: drive. Not unexpectedly, the performance vastly improved. What would normally (at home) be a 10 second load becomes a 40 second load in a corporate environment.

I have found the conversation helpful and it would appear that there are opportunities for improvement that I would find helpful in my production environment. The other aside is that I have no UNIX-like tools, not because they don't exist, but because the environment I work in does not allow me to use them. This is not sufficient reason for me to bleat about it. It just is. By and large, I just get on with it.

My point is that while I accept that these issues are peripheral to R, they do impact upon the useability of R. I'm sure that there are people working with large databases in R. (The SPSS datasets that I regularly interact with vary between 97MB and 200MB.) It could be finger trouble on my part, but I find I have to subset them before I can read them into R. If I thought I could usefully convert these datasets into something that R could pick and choose from without reaching the out-of-memory problem, I would be very happy. In the meantime my lack of expertise has left me with a workable albeit clumsy process.

I will continue to champion R in my organisation, but the present score is SPSS-50, SAS-149, R-1. But all the really creative charts only come from one engine in this place.

> system.time(load("P:/.../0203Mapdata.rdata"))
[1]  9.79  0.97 37.45    NA    NA
> system.time(load("C:/TEMP/0203Mapdata.rdata"))
[1] 10.07  0.18 10.49    NA    NA
> version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    1
minor    7.1
year     2003
month    06
day      16
language R

_________________________________________________
Tom Mulholland
Senior Policy Officer
WA Country Health Service
Tel: (08) 9222 4062

-----Original Message-----
From: Murray Jorgensen [mailto:maj at stats.waikato.ac.nz]
Sent: Monday, 25 August 2003 5:16 PM
To: Prof Brian Ripley
Cc: R-help
Subject: Re: [R] R tools for large files

> [...]
Hi Martin,

I don't know much about the concept of "connection" but I had supposed it to at least include the concept of "file" and perhaps also "input device" and "output device". I guess the important point that you are making is that it is sequential in the sense that you describe. I suppose at the time that I wrote my emails I didn't *know* that this was the case but rather assumed that this must be so, since it would be tedious in the extreme to have to work with the access functions if they kept going back to the beginning of the connection.

It may help to explain the application. The large files that I am working with are themselves statistical summaries of internet traffic flows (you will appreciate why they can be almost arbitrarily large!). I am interested in clustering these flows into different classes of traffic. I am using a model-based approach, so that the end-point will be statistical models for each cluster. Once these have been estimated they may be used in the classification of future traffic [including a residual class of traffic that does not fit any cluster well].

Based on experience with my clustering software (Multimix) I believe that it should work well on data sets of, say, 3000 observations. I plan to select a small number of random subsets of this size. The replication of these subsets should help me with model selection questions (How many clusters? How complex should each cluster model be?)

Tom Mulholland makes a good point when he notes that many R users (and other users) have very little control over their computing environment owing to somewhat arbitrary IT management decisions. For this reason it will be advantageous to have several solutions to large file problems.

I'm pleased that you think that efficient R functions for manipulating numbered lines from files may be written. I'm going to have a go at it just as soon as I finish a big item of paperwork!

BTW, I will be out of town and with much reduced email access over the next week or so, so if I don't reply to the list or individuals this should not be put down to laziness or rudeness!

Cheers,  Murray Jorgensen

PS Give my regards to Chris Hennig.

Martin Maechler wrote:
> Hi Murray,
>
> from reading your summarizing reply, I wonder if you missed the
> most important point about "connection"s (connection := generalization of file):
>
> Once you open() one, you can read it **sequentially**, e.g., in
> bunches of a "few" lines, i.e., you don't re-start from the
> beginning each time.
> I think this will allow to devise a pretty efficient R function
> for reading (and returning as a vector of strings) line numbers
> (n1, n2, ..., nm).
>
> Did you know this? If not, maybe you forward this answer (and
> your reaction to it) to R-help as well.
>
> Regards,
> Martin Maechler <maechler at stat.math.ethz.ch>  http://stat.ethz.ch/~maechler/
> Seminar fuer Statistik, ETH-Zentrum  LEO C16  Leonhardstr. 27
> ETH (Federal Inst. Technology)  8092 Zurich  SWITZERLAND
> phone: x-41-1-632-3408  fax: ...-1228  <><

--
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz                               Fax 7 838 4155
Phone +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862
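A rough sketch of the kind of function Martin describes, not taken from the thread: it reads the connection sequentially in blocks and keeps only the requested line numbers. The function name, the block size, and the assumption that the wanted line numbers fit comfortably in memory are all mine.

    ## Return the lines of `filename` whose line numbers are in `wanted`,
    ## reading the file once, block by block, through an open connection.
    readLinesAt <- function(filename, wanted, block = 10000) {
      wanted <- sort(wanted)                    # ascending line numbers
      con <- file(filename, open = "r")
      on.exit(close(con))
      out  <- character(length(wanted))
      seen <- 0                                 # lines read so far
      got  <- 0                                 # wanted lines found so far
      while (got < length(wanted)) {
        chunk <- readLines(con, n = block)
        if (length(chunk) == 0) break           # end of file
        idx <- wanted[wanted > seen & wanted <= seen + length(chunk)]
        if (length(idx) > 0) {
          out[got + seq_along(idx)] <- chunk[idx - seen]
          got <- got + length(idx)
        }
        seen <- seen + length(chunk)
      }
      out[seq_len(got)]
    }

The selected lines can then be parsed with, say, read.table(textConnection(readLinesAt("big.txt", sort(sample(230175, 3000))))).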
A starting point might be the string splitting function strsplit(). For example:

> X <- c("1,4,5", "1,2,5", "5,1,2")
> strsplit(X, ",")
[[1]]
[1] "1" "4" "5"

[[2]]
[1] "1" "2" "5"

[[3]]
[1] "5" "1" "2"

This returns a list of the parsed vectors. Next you can do something like:

> Z <- data.frame(matrix(as.numeric(unlist(strsplit(X, ","))), nrow = 3, byrow = TRUE))
> Z
  X1 X2 X3
1  1  4  5
2  1  2  5
3  5  1  2

-----Original Message-----
From: Ted.Harding at nessie.mcc.ac.uk [mailto:Ted.Harding at nessie.mcc.ac.uk]
Sent: 26 August 2003 09:00
To: R-help
Subject: Re: [R] R tools for large files

This has been an interesting thread! My first reaction to Murray's query was to think "use standard Unix tools, especially awk", 'awk' being a compact, fast, efficient program with great powers for processing lines of data files (and in particular extracting, subsetting and transforming database-like files, e.g. CSV-type). Of course, that became a sub-thread in its own right.

But -- and here I know I'm missing a trick, which is why I'm responding now so that someone who knows the trick can tell me -- while I normally use 'awk' "externally" (i.e. I filter a data file through an 'awk' program outside of R and then read the resulting file into R), I began to think about doing it from within R. Something on the lines of

    X <- system("cat raw_data | awk '...' ", intern=TRUE)

would create an object X which is a character vector, each element of which is one line from the output of the command "cat ...". E.g. if "raw_data" starts out as

    1,2,3,4,5
    1,3,4,2,5
    5,4,3,2,1
    5,3,4,1,2

then

    X <- system("cat raw_data.csv | awk 'BEGIN{FS=\",\"}{if($3>$2){print $1 \",\" $4 \",\" $5}}'", intern=TRUE)

gives

    > X
    [1] "1,4,5" "1,2,5" "5,1,2"

Now my Question: How do I convert X into the dataframe I would have got if I had read this output from a file instead of into the character vector X? In other words, how to convert a vector of character strings, each of which is in comma-separated format as above, into the rows of a data frame (or matrix, come to that)?

With thanks,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 26-Aug-03   Time: 08:59:48
------------------------------ XFMail ------------------------------
On 26-Aug-03 Prof Brian Ripley wrote:
> On Tue, 26 Aug 2003 Ted.Harding at nessie.mcc.ac.uk wrote:
> [...]
>> > X
>> [1] "1,4,5" "1,2,5" "5,1,2"
>>
>> Now my Question:
>> [...]
>> In other words, how to convert a vector of character strings, each
>> of which is in comma-separated format as above, into the rows of
>> a data-frame (or matrix, come to that)?
>
> read.table() on a text connection.
>
>> X <- c("1,4,5", "1,2,5", "5,1,2")
>> read.table(textConnection(X), header=FALSE, sep=",")
>   V1 V2 V3
> 1  1  4  5
> 2  1  2  5
> 3  5  1  2

Thanks, Brian! Just the job.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 167 1972
Date: 26-Aug-03   Time: 10:05:14
------------------------------ XFMail ------------------------------
Duncan Murdoch <dmurdoch at pair.com> wrote:
	For example, if you want to read lines 1000 through 1100, you'd do it
	like this:

	lines <- readLines("foo.txt", 1100)[1000:1100]

I created a dataset thus:

    # file foo.awk:
    BEGIN {
        s = "01"
        for (i = 2; i <= 41; i++) s = sprintf("%s %02d", s, i)
        n = (27 * 1024 * 1024) / (length(s) + 1)
        for (i = 1; i <= n; i++) print s
        exit 0
    }
    # shell command:
    mawk -f foo.awk /dev/null >BIG

That is, each record contains 41 2-digit integers, and the number of records was chosen so that the total size was approximately 27 megabytes. The number of records turns out to be 230,175.

> system.time(v <- readLines("BIG"))
[1] 7.75 0.17 8.13 0.00 0.00
# With BIG already in the file system cache...
> system.time(v <- readLines("BIG", 200000)[199001:200000])
[1] 11.73 0.16 12.27 0.00 0.00

What's the importance of this? First, experiments I shall not weary you with showed that the time to read N lines grows faster than N. Second, if you want to select the _last_ thousand lines, you have to read _all_ of them into memory.

For real efficiency here, what's wanted is a variant of readLines where n is an index vector (a vector of non-negative integers, a vector of non-positive integers, or a vector of logicals) saying which lines should be kept. The function that would need changing is do_readLines() in src/main/connections.c; unfortunately I don't understand R internals well enough to do it myself (yet).

As a matter of fact, that _still_ wouldn't yield real efficiency, because every character would still have to be read by the modified readLines(), and it reads characters using Rconn_fgetc(), which is what gives readLines() its power and utility, but certainly doesn't give it wings. (One of the fundamental laws of efficient I/O library design is to base it on block- or line-at-a-time transfers, not character-at-a-time.)

The AWK program

    NR <= 199000 { next }
    {print}
    NR == 200000 { exit }

extracts lines 199001:200000 in just 0.76 seconds, about 15 times faster. A C program to the same effect, using fgets(), took 0.39 seconds, or about 30 times faster than R.

There are two fairly clear sources of overhead in the R code:

(1) the overhead of reading characters one at a time through Rconn_fgetc() instead of a block or line at a time. mawk doesn't use fgets() for reading, and _does_ have the overhead of repeatedly checking a regular expression to determine where the end of the line is, which it is sensible enough to fast-path.

(2) the overhead of allocating, filling in, and keeping, a whole lot of memory which is of no use whatever in computing the final result. mawk is actually fairly careful here, and only keeps one line at a time in the program shown above. Let's change it:

    NR <= 199000 {next}
    {a[NR] = $0}
    NR == 200000 {exit}
    END {for (i in a) print a[i]}

That takes the time from 0.76 seconds to 0.80 seconds.

The simplest thing that could possibly work would be to add a function skipLines(con, n) which simply read and discarded n lines.

	result <- scan(textConnection(lines), list( .... ))

> system.time(m <- scan(textConnection(v), integer(41)))
Read 41000 items
[1] 0.99 0.00 1.01 0.00 0.00

One whole second to read 41,000 numbers on a 500 MHz machine?

> vv <- rep(v, 240)

Is there any possibility of storing the data in (platform) binary form? Binary connections (R-data.pdf, section 6.5 "Binary connections") can be used to read binary-encoded data. I wrote a little C program to save out the 230175 records of 41 integers each in native binary form.
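skipLines() does not exist in R; purely to illustrate the interface being proposed, an R-level stand-in (necessarily much slower than a C implementation inside connections.c would be) could be sketched like this:

    ## Read and discard n lines from an already-open connection, in blocks.
    skipLines <- function(con, n, block = 10000) {
      while (n > 0) {
        k <- length(readLines(con, n = min(n, block)))
        if (k == 0) break                  # hit end of file early
        n <- n - k
      }
      invisible(n)                         # > 0 if the file ran out first
    }

    ## e.g., using the BIG file from the example above:
    ## con <- file("BIG", open = "r")
    ## skipLines(con, 199000)
    ## v <- readLines(con, 1000)
    ## close(con)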
Then in R I did

> system.time(m <- readBin("BIN", integer(), n=230175*41, size=4))
[1] 0.57 0.52 1.11 0.00 0.00
> system.time(m <- matrix(data=m, ncol=41, byrow=TRUE))
[1] 2.55 0.34 2.95 0.00 0.00

Remember, this doesn't read a *sample* of the data, it reads *all* the data. It is so much faster than the alternatives in R that it just isn't funny. Trying scan() on the file took nearly 10 minutes before I killed it the other day; using readBin() is a thousand times faster than a simple scan() call on this particular data set.

There has *got* to be a way of either generating or saving the data in binary form, using only "approved" Windows tools. Heck, it can probably be done using VBA.

By the way, I've read most of the .pdf files I could find on the CRAN site, but haven't noticed any description of the R save-file format. Where should I have looked? (Yes, I know about src/main/saveload.c; I was hoping for some documentation, with maybe some diagrams.)
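One way to get that binary form without leaving R or writing C — a sketch, not from the original post; the file names are illustrative and the slow scan() only has to be paid once:

    ## One-off conversion: read the text file the slow way, write platform binary.
    m <- scan("BIG", what = integer(0))            # slow, but done only once
    writeBin(as.integer(m), "BIG.bin", size = 4)

    ## Every later session: reload in seconds with readBin().
    n <- file.info("BIG.bin")$size / 4
    m <- matrix(readBin("BIG.bin", integer(), n = n, size = 4),
                ncol = 41, byrow = TRUE)

The same end is served by save()/load() on the parsed object, as noted earlier in the thread; readBin() has the advantage that the binary file can also be produced by any other program that writes native 4-byte integers.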
If we are going to use unix tools to create a new dataset before calling into R, why not simply use

    cat my_big_bad_file | tail +1001 | head -100

to read lines 1000-1100 (assuming one header row). Or if you have the shortlisted rownames in one file, you can use join after sort. A working example follows.

#################################################################################
#!/bin/bash
# match.sh last modified 10/07/03
# Does the same thing as egrep 'a|b|c|...' file but in batch mode
# A script that matches all occurrences of <shortlist> in <data> using the
# first column as common key

if [ $# -ne 2 ]; then
    echo "Usage: ${0/*\/} <shortlist> <data>"
    exit
fi

TEMP1=/tmp/temp1.`date "+%y%m%d-%H%M%S"`
TEMP2=/tmp/temp2.`date "+%y%m%d-%H%M%S"`
TEMP3=/tmp/temp3.`date "+%y%m%d-%H%M%S"`
TEMP4=/tmp/temp4.`date "+%y%m%d-%H%M%S"`
TEMP5=/tmp/temp5.`date "+%y%m%d-%H%M%S"`

grep -n . $1 | cut -f1 -d: | paste - $1 > $TEMP1
sort -k 2 $TEMP1 > $TEMP2
tail +2 $2 | sort -k 1 > $TEMP3
# Assume data file has header
headerRow=`head -1 $2`
join -j1 2 -j2 1 -a 1 -t\  $TEMP2 $TEMP3 > $TEMP4
sort -n -k 2 $TEMP4 > $TEMP5
/bin/echo "$headerRow"
cut -f1,3- $TEMP5    # column 2 contains orderings
rm $TEMP1 $TEMP2 $TEMP3 $TEMP4
#################################################################################

-----Original Message-----
From: Richard A. O'Keefe [mailto:ok at cs.otago.ac.nz]
Sent: Wednesday, August 27, 2003 9:04 AM
To: r-help at stat.math.ethz.ch
Subject: Re: [R] R tools for large files

[...]
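The same tail/head idea can be driven from inside R on a Unix-like system, since read.table() accepts a pipe() connection. A sketch only: the file name is invented and the exact header/off-by-one bookkeeping is left to the reader.

    ## Parse roughly lines 1001-1100 of the file without reading the rest into R.
    sub <- read.table(pipe("tail +1001 my_big_bad_file | head -100"),
                      header = FALSE)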
Duncan Murdoch <dmurdoch at pair.com> wrote:
	One complication with reading a block at a time is what to do when you
	read too far.

It's called "buffering".

	Not all connections can use seek() to reposition to the beginning, so
	you'd need to read them one character at a time, (or attach a buffer
	somehow, but then what about rw connections?)

You don't need seek() to do buffered block-at-a-time reading. For example, you can't lseek() on a UNIX terminal, but UNIX C stdio *does* read a block at a time from a terminal.

I don't see what the problem with read-write connections is supposed to be. When you want to read from such a connection, you first force out any buffered output, and then you read a buffer's worth (if available) of input. Of course the read buffer and the write buffer are separate. (C stdio has traditionally got this wrong, with the perverse consequence that you have to fseek() when switching from reading to writing or vice versa, but that doesn't mean it can't be got right.)

To put all this in context though, remember that S was designed in a UNIX environment to work in a UNIX environment, and it was always intended to exploit UNIX tools. Even on a Windows box, if you get R, you get a bunch of the usual UNIX tools with it. Amongst other things, Perl is freely available for Windows; a Perl program to read a couple of hundred thousand records and spit them out in platform binary would only be a few lines long, and R _is_ pretty good at reading binary data.

It really is important that R users should be allowed to use it the way that the language was designed to be used.