R-devel now has some improved versions of read.table and write.table.

For a million-row data frame containing one number, one factor with few
levels and one logical column, a 56Mb object.

generating it takes 4.5 secs.

calling summary() on it takes 2.2 secs.

writing it takes 8 secs and an additional 10Mb.

saving it in .rda format takes 4 secs.

reading it naively takes 28 secs and an additional 240Mb

reading it carefully (using nrows, colClasses and comment.char) takes 16
secs and an additional 150Mb (56Mb of which is for the object read in).
(The overhead of read.table over scan was about 2 secs, mainly in the
conversion back to a factor.)

loading from .rda format takes 3.4 secs.

[R 2.0.1 read in 23 secs using an additional 210Mb, and wrote in 50 secs
using an additional 450Mb.]

Will Frank Harrell or someone else please explain to me a real
application in which this is not fast enough?

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
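[Editorial note: a minimal sketch of the "naive" versus "careful" read.table
calls being compared above. This is not Brian's actual test script; the file
name "big.txt" and the data-generation step are illustrative assumptions.]

    ## Generate a data frame like the one described above and write it out.
    n <- 1e6
    d <- data.frame(x = rnorm(n),
                    f = factor(sample(letters[1:4], n, replace = TRUE)),
                    l = sample(c(TRUE, FALSE), n, replace = TRUE))
    write.table(d, "big.txt", row.names = FALSE)

    ## Naive read: column types are guessed and the result cannot be
    ## pre-allocated, so extra scanning, copying and conversion happen.
    system.time(d1 <- read.table("big.txt", header = TRUE))

    ## Careful read: declare the row count and column classes, and turn
    ## off comment processing.
    system.time(d2 <- read.table("big.txt", header = TRUE,
                                 nrows = n,
                                 colClasses = c("numeric", "factor", "logical"),
                                 comment.char = ""))

    ## Binary .rda image for comparison.
    save(d, file = "big.rda"); system.time(load("big.rda"))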
>>>>> "BDR" == Prof Brian Ripley <ripley@stats.ox.ac.uk>
>>>>>     on Sun, 26 Dec 2004 10:03:30 +0000 (GMT) writes:

    BDR> R-devel now has some improved versions of read.table
    BDR> and write.table.  For a million-row data frame
    BDR> containing one number, one factor with few levels and
    BDR> one logical column, a 56Mb object.

    BDR> generating it takes 4.5 secs.
    BDR> calling summary() on it takes 2.2 secs.
    BDR> writing it takes 8 secs and an additional 10Mb.
    BDR> saving it in .rda format takes 4 secs.
    BDR> reading it naively takes 28 secs and an additional 240Mb
    BDR> reading it carefully (using nrows, colClasses and
    BDR> comment.char) takes 16 secs and an additional 150Mb
    BDR> (56Mb of which is for the object read in).  (The
    BDR> overhead of read.table over scan was about 2 secs,
    BDR> mainly in the conversion back to a factor.)
    BDR> loading from .rda format takes 3.4 secs.

    BDR> [R 2.0.1 read in 23 secs using an additional 210Mb, and
    BDR> wrote in 50 secs using an additional 450Mb.]

Excellent!  Thanks a lot Brian (for this and much more)!
I wish you continued merry holidays!

Martin
Brian Ripley wrote:

R-devel now has some improved versions of read.table and write.table.

For a million-row data frame containing one number, one factor with few
levels and one logical column, a 56Mb object.

generating it takes 4.5 secs.

calling summary() on it takes 2.2 secs.

writing it takes 8 secs and an additional 10Mb.

saving it in .rda format takes 4 secs.

reading it naively takes 28 secs and an additional 240Mb

reading it carefully (using nrows, colClasses and comment.char) takes 16
secs and an additional 150Mb (56Mb of which is for the object read in).
(The overhead of read.table over scan was about 2 secs, mainly in the
conversion back to a factor.)

loading from .rda format takes 3.4 secs.

[R 2.0.1 read in 23 secs using an additional 210Mb, and wrote in 50 secs
using an additional 450Mb.]

Will Frank Harrell or someone else please explain to me a real
application in which this is not fast enough?

---------------------------------------------------------------------------

Brian - I really appreciate your work on this, and the data.  The wise
use of read.table that you mentioned should be fine for almost everything
I do.  There may be other users who need to read larger datasets for
which memory usage is an issue.  They can speak for themselves though.

Sincerely,
Frank

-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
On a ~1.45 million row x 122 column data frame (one "character", one
"factor", and the rest "numeric" columns) I can read it into R 2.0.1
using read.csv() in about 150 seconds; memory usage is ~1.5 GB.  This is
read in using the `nrows', `comment.char = ""', and `colClasses'
arguments.

On R-devel (2004-12-31), it takes about 120 seconds; memory usage is the
same.

Not too shabby!

-roger

Prof Brian Ripley wrote:
> R-devel now has some improved versions of read.table and write.table.
> 
> For a million-row data frame containing one number, one factor with few 
> levels and one logical column, a 56Mb object.
> 
> generating it takes 4.5 secs.
> 
> calling summary() on it takes 2.2 secs.
> 
> writing it takes 8 secs and an additional 10Mb.
> 
> saving it in .rda format takes 4 secs.
> 
> reading it naively takes 28 secs and an additional 240Mb
> 
> reading it carefully (using nrows, colClasses and comment.char) takes 16 
> secs and an additional 150Mb (56Mb of which is for the object read in). 
> (The overhead of read.table over scan was about 2 secs, mainly in the 
> conversion back to a factor.)
> 
> loading from .rda format takes 3.4 secs.
> 
> [R 2.0.1 read in 23 secs using an additional 210Mb, and wrote in 50 secs 
> using an additional 450Mb.]
> 
> 
> Will Frank Harrell or someone else please explain to me a real 
> application in which this is not fast enough?
> 
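[Editorial note: a sketch of the kind of call Roger describes.  The file
name "big.csv", the exact row count, and the column order (one character
column, one factor column, then 120 numeric columns) are assumptions
standing in for his actual data.]

    n_rows  <- 1450000
    classes <- c("character", "factor", rep("numeric", 120))

    system.time(
      dat <- read.csv("big.csv",
                      nrows        = n_rows,
                      colClasses   = classes,
                      comment.char = "")
    )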
A technical question here: how does one measure the memory overhead
mentioned below?  I have a set of functions of my own and would like to
profile them.

Thanks,
Vadim

> -----Original Message-----
> From: r-devel-bounces@stat.math.ethz.ch 
> [mailto:r-devel-bounces@stat.math.ethz.ch] On Behalf Of Prof 
> Brian Ripley
> Sent: Sunday, December 26, 2004 2:04 AM
> To: R-devel@r-project.org
> Subject: [Rd] R's IO speed
> 
> R-devel now has some improved versions of read.table and write.table.
> 
> For a million-row data frame containing one number, one 
> factor with few levels and one logical column, a 56Mb object.
> 
> generating it takes 4.5 secs.
> 
> calling summary() on it takes 2.2 secs.
> 
> writing it takes 8 secs and an additional 10Mb.
> 
> saving it in .rda format takes 4 secs.
> 
> reading it naively takes 28 secs and an additional 240Mb
> 
> reading it carefully (using nrows, colClasses and 
> comment.char) takes 16 secs and an additional 150Mb (56Mb of 
> which is for the object read in).
> (The overhead of read.table over scan was about 2 secs, 
> mainly in the conversion back to a factor.)
> 
> loading from .rda format takes 3.4 secs.
> 
> [R 2.0.1 read in 23 secs using an additional 210Mb, and wrote 
> in 50 secs using an additional 450Mb.]
> 
> 
> Will Frank Harrell or someone else please explain to me a 
> real application in which this is not fast enough?
> 
> -- 
> Brian D. Ripley,                  ripley@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> 
> ______________________________________________
> R-devel@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
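[Editorial note: one possible way to gauge the extra memory a call uses,
sketched under assumptions; the thread does not say how the figures above
were obtained.  gc(reset = TRUE) clears the garbage collector's "max used"
statistics, so a gc() report after the call shows the peak reached during
it, while object.size() gives the size of the resulting object itself.
The file name and read.table arguments repeat the hypothetical example
from the earlier note.]

    gc(reset = TRUE)                   # clear the "max used" columns
    d <- read.table("big.txt",         # hypothetical file, as in the sketch above
                    header = TRUE, nrows = 1e6,
                    colClasses = c("numeric", "factor", "logical"),
                    comment.char = "")
    print(gc())                        # "max used" (Mb) columns show the peak
    round(as.numeric(object.size(d)) / 1024^2)   # object's own size in Mb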