Martin Morgan
2008-Mar-10 18:01 UTC
[Rd] write.table with row.names=FALSE unnecessarily slow?
write.table with large data frames takes quite a long time

> system.time({
+     write.table(df, '/tmp/dftest.txt', row.names=FALSE)
+ }, gcFirst=TRUE)
   user  system elapsed
 97.302   1.532  98.837

A reason is because dimnames is always called, causing 'anonymous' row
names to be created as character vectors. Avoiding this in
src/library/utils, along the lines of

Index: write.table.R
===================================================================
--- write.table.R       (revision 44717)
+++ write.table.R       (working copy)
@@ -27,13 +27,18 @@
 
     if(!is.data.frame(x) && !is.matrix(x)) x <- data.frame(x)
 
+    makeRownames <- is.logical(row.names) && !is.na(row.names) &&
+        row.names==TRUE
+    makeColnames <- is.logical(col.names) && !is.na(col.names) &&
+        col.names==TRUE
     if(is.matrix(x)) {
         ## fix up dimnames as as.data.frame would
         p <- ncol(x)
         d <- dimnames(x)
         if(is.null(d)) d <- list(NULL, NULL)
-        if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
-        if(is.null(d[[2]]) && p > 0) d[[2]] <- paste("V", 1:p, sep="")
+        if (is.null(d[[1]]) && makeRownames) d[[1]] <- seq_len(nrow(x))
+        if(is.null(d[[2]]) && p > 0 && makeColnames)
+            d[[2]] <- paste("V", 1:p, sep="")
         if(is.logical(quote) && quote)
             quote <- if(is.character(x)) seq_len(p) else numeric(0)
     } else {
@@ -53,8 +58,8 @@
                 quote <- ord[quote]; quote <- quote[quote > 0]
             }
         }
-        d <- dimnames(x)
-        if(is.null(d[[1]])) d[[1]] <- seq_len(nrow(x))
+        d <- list(if (makeRownames==TRUE) row.names(x) else NULL,
+                  if (makeColnames==TRUE) names(x) else NULL)
         p <- ncol(x)
     }
     nocols <- p==0

improves performance at least in proportion to nrow(x):

> system.time({
+     write.table(df, '/tmp/dftest1.txt', row.names=FALSE)
+ }, gcFirst=TRUE)
   user  system elapsed
  8.132   0.608   8.899

Martin
-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
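The point about 'anonymous' row names can be checked directly. The following is a minimal sketch, not part of the original message: it assumes a reasonably recent R in which automatic row names are stored compactly, and it uses a much smaller data frame than the 10^7-row test case so that it runs quickly. The variable names (n, df, auto_info, rn) are illustrative only.

```r
# Sketch: automatic row names are stored compactly, but dimnames()
# materialises them as a full character vector of length nrow(df).
n  <- 1e5                       # smaller than the thread's 10^7 rows
df <- data.frame(x = seq_len(n))

# .row_names_info() with its default type reports the number of rows,
# with a negative sign when the row names are 'automatic'.
auto_info <- .row_names_info(df)

# dimnames() on a data frame returns list(row.names(x), names(x));
# the first element is a character vector with one string per row.
rn <- dimnames(df)[[1]]

cat("automatic row names:", auto_info < 0, "\n")
cat("class of dimnames(df)[[1]]:", class(rn), "\n")
cat("length of dimnames(df)[[1]]:", length(rn), "\n")
print(object.size(rn), units = "Kb")
```

Building that character vector costs time and memory in proportion to the number of rows, which is the work the patch avoids when row.names=FALSE.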
Martin Morgan
2008-Mar-10 18:07 UTC
[Rd] write.table with row.names=FALSE unnecessarily slow?
I neglected to include my test case,

> df <- data.frame(x=1:(10^7))

Martin
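For anyone wanting to repeat the measurement, here is a hedged sketch of the same timing on a smaller data frame. It is not from the thread: the row count, the use of tempfile() in place of '/tmp/dftest.txt', and the variable names are assumptions made so the example runs quickly and portably. Absolute timings will differ from the figures quoted in the thread and depend on the R version in use.

```r
# Sketch: time write.table() with row.names=FALSE on a scaled-down
# version of the test case (1e5 rows instead of 10^7).
n   <- 1e5
df  <- data.frame(x = seq_len(n))
out <- tempfile(fileext = ".txt")   # stand-in for '/tmp/dftest.txt'

timing <- system.time({
    write.table(df, out, row.names = FALSE)
}, gcFirst = TRUE)
print(timing)

# The file holds one header line plus one line per row.
n_lines <- length(readLines(out))
cat("lines written:", n_lines, "\n")
unlink(out)
```

Raising n towards 10^7 should show how the elapsed time scales with nrow(x).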
Martin Maechler
2008-Mar-11 16:21 UTC
[Rd] write.table with row.names=FALSE unnecessarily slow?
MartinMo> write.table with large data frames takes quite a long time

MartinMo> system.time({
MartinMo> +     write.table(df, '/tmp/dftest.txt', row.names=FALSE)
MartinMo> + }, gcFirst=TRUE)
MartinMo>    user  system elapsed
MartinMo>  97.302   1.532  98.837

MartinMo> A reason is because dimnames is always called, causing 'anonymous' row
MartinMo> names to be created as character vectors. Avoiding this in
MartinMo> src/library/utils, along the lines of

Thank you, Martin.

Note that we needed to fix your patch (for the case where the data frame
has a 'matrix column'), and I'd like to further remark that I consider
'.... == TRUE' to be quite ugly (or inefficient) in all circumstances.

Martin Maechler, ETH Zurich
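The remark about '.... == TRUE' can be illustrated with a small comparison. This is only a sketch of one common alternative, isTRUE(), and is not claimed to be the change that was actually committed to R. The helper names make_names_verbose and make_names_concise are hypothetical.

```r
# Sketch: the patch's three-part test versus isTRUE() for deciding
# whether row.names/col.names was given as a plain TRUE.
make_names_verbose <- function(arg)
    is.logical(arg) && !is.na(arg) && arg == TRUE

make_names_concise <- function(arg) isTRUE(arg)

# Values write.table() may receive for row.names or col.names:
# a logical flag, NA (for col.names), or a character vector of names.
inputs <- list(TRUE, FALSE, NA, "id", c("a", "b"))

verbose <- vapply(inputs, make_names_verbose, logical(1))
concise <- vapply(inputs, make_names_concise, logical(1))

print(data.frame(input   = vapply(inputs, function(v)
                                      paste(v, collapse = ","), character(1)),
                 verbose = verbose,
                 concise = concise))
```

Both forms agree on these inputs. The trailing == TRUE adds nothing once is.logical() and !is.na() have passed, and isTRUE() also returns FALSE for logical vectors longer than one, where && may fail in newer versions of R.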