On my Windows 10 laptop I see evidence of the operating system caching
information about recently accessed files. This makes it hard to say how
the speed might be improved. Is there a way to clear this cache?

> system.time(L1 <- size.f.pkg(R.home("library")))
   user  system elapsed
   0.48    2.81   30.42
> system.time(L2 <- size.f.pkg(R.home("library")))
   user  system elapsed
   0.35    1.10    1.43
> identical(L1,L2)
[1] TRUE
> length(L1)
[1] 30
> length(dir(R.home("library"),recursive=TRUE))
[1] 12949

On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help <r-help at r-project.org> wrote:

> Dear List Members,
>
> I tried to compute the file sizes of each installed package and the
> process is terribly slow.
>
> It took ~ 10 minutes for 512 packages / 1.6 GB total size of files.
>
> 1.) Package Sizes
>
> system.time({
>     x = size.pkg(file=NULL);
> })
> # elapsed time: 509 s !!!
> # 512 Packages; 1.64 GB;
> # R 4.1.1 on MS Windows 10
>
> The code for the size.pkg() function is below and the latest version is
> on Github:
>
> https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R
>
> Questions:
> Is there a way to get the file size faster?
> It takes long on Windows as well, but of the order of 10-20 s, not 10
> minutes.
> Do I miss something?
>
> 1.b.) Alternative
>
> It came to my mind to read first all file sizes and then use tapply or
> aggregate - but I do not see why it should be faster.
>
> Would it be meaningful to benchmark each individual package?
>
> Although I am not very inclined to wait 10 minutes for each new try out.
>
> 2.) Big Packages
>
> Just as a note: there are a few very large packages (in my list of 512
> packages):
>
> 1 123,566,287               BH
> 2 113,578,391               sf
> 3 112,252,652            rgdal
> 4  81,144,868           magick
> 5  77,791,374 openNLPmodels.en
>
> I suspect that sf & rgdal have a lot of duplicated data structures
> and/or duplicate code and/or duplicated libraries - although I am not an
> expert in the field and did not check the sources.
>
> Sincerely,
>
> Leonard
>
> ======
>
> # Package Size:
> size.f.pkg = function(path=NULL) {
>     if(is.null(path)) path = R.home("library");
>     xd = list.dirs(path = path, full.names = FALSE, recursive = FALSE);
>     size.f = function(p) {
>         p = paste0(path, "/", p);
>         sum(file.info(list.files(path=p, pattern=".",
>             full.names = TRUE, all.files = TRUE, recursive = TRUE))$size);
>     }
>     sapply(xd, size.f);
> }
>
> size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
>     x = size.f.pkg(path=path);
>     x = as.data.frame(x);
>     names(x) = "Size"
>     x$Name = rownames(x);
>     # Order
>     if(sort) {
>         id = order(x$Size, decreasing=TRUE)
>         x = x[id,];
>     }
>     if( ! is.null(file)) {
>         if( ! is.character(file)) {
>             print("Error: Size NOT written to file!");
>         } else write.csv(x, file=file, row.names=FALSE);
>     }
>     return(x);
> }
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

[[alternative HTML version deleted]]
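For concreteness, the alternative raised under 1.b in the original post (read all file sizes first, then aggregate with tapply) could look roughly like the sketch below. The name size.f.pkg2 is hypothetical and not part of the posted code. The sketch makes no claim about speed; it only shows the shape of the single-walk approach. Two differences from size.f.pkg() are assumptions of this sketch: files lying directly in the library directory are skipped, and a package directory that contains no files does not appear in the result at all.

```r
# Hypothetical variant of size.f.pkg(): walk the library once, then
# aggregate the file sizes per top-level (package) directory.
size.f.pkg2 <- function(path = NULL) {
  if (is.null(path)) path <- R.home("library")
  # Relative paths look like "pkg/sub/file"
  f <- list.files(path, all.files = TRUE, recursive = TRUE, full.names = FALSE)
  # Assumption: ignore files that sit directly in `path` (no "/" in the name)
  f <- f[grepl("/", f, fixed = TRUE)]
  pkg <- sub("/.*$", "", f)               # first path component = package
  sz <- file.size(file.path(path, f))     # one vectorised call for all files
  res <- tapply(sz, pkg, sum)
  # Return a plain named numeric vector, like sapply() does in size.f.pkg()
  setNames(as.vector(res), names(res))
}

# Small self-contained check on a temporary directory
lib <- tempfile("lib")
dir.create(file.path(lib, "pkgA", "sub"), recursive = TRUE)
dir.create(file.path(lib, "pkgB"))
writeBin(raw(3), file.path(lib, "pkgA", "x"))
writeBin(raw(5), file.path(lib, "pkgA", "sub", "y"))
writeBin(raw(2), file.path(lib, "pkgB", "z"))
res <- size.f.pkg2(lib)
print(res)
```

Timing it with system.time() against size.f.pkg() on the same library would show whether the single walk helps, keeping in mind the caching effect noted above: the second of any two runs is likely to look faster regardless of the code.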
On a $150 second-hand laptop with 0.9GB of library, and a single-user
installation of R so only one place to look

LIBRARY=$HOME/R/x86_64-pc-linux-gnu-library/4.0
cd $LIBRARY
echo "kbytes package"
du -sk * | sort -k1n

took 150 msec to report the disc space needed for every package.

That'

On Sun, 26 Sept 2021 at 06:14, Bill Dunlap <williamwdunlap at gmail.com> wrote:

> On my Windows 10 laptop I see evidence of the operating system caching
> information about recently accessed files. This makes it hard to say how
> the speed might be improved. Is there a way to clear this cache?
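If the du approach above is wanted from inside R, one possible sketch is to call du through system2() and parse its output. Both function names below are hypothetical, and the wrapper assumes a du executable is on the PATH, which is normally the case on Linux and macOS but not on a stock Windows installation. The parsing step assumes the usual du -sk output of kilobytes, a tab, and the path on each line.

```r
# Hypothetical helper: parse lines of `du -sk` output ("<kbytes>\t<path>")
# into a data frame sorted by size, smallest first (like `sort -k1n`).
parse_du <- function(lines) {
  parts <- strsplit(lines, "\t", fixed = TRUE)
  kb <- as.numeric(vapply(parts, `[`, character(1), 1))
  name <- basename(vapply(parts, `[`, character(1), 2))
  out <- data.frame(kbytes = kb, package = name, stringsAsFactors = FALSE)
  out[order(out$kbytes), , drop = FALSE]
}

# Hypothetical wrapper: assumes `du` is available on the PATH.
du_pkg_sizes <- function(lib = .libPaths()[1]) {
  dirs <- list.dirs(lib, full.names = TRUE, recursive = FALSE)
  parse_du(system2("du", c("-sk", shQuote(dirs)), stdout = TRUE))
}

# The parser can be tried without du, on made-up example lines:
res <- parse_du(c("3400\t/lib/bar", "12\t/lib/foo"))
print(res)
```

Note that du reports disc blocks used, in kilobytes, while size.f.pkg() sums file sizes in bytes, so the two sets of numbers are not expected to agree exactly.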
Dear Bill,

Does list.files() always sort the results? It seems so. The option
full.names = FALSE does not have any effect: the results always seem
to be sorted.

Maybe it is better to process the files in an unsorted order, as stored
on the disk?

Sincerely,

Leonard

On 9/25/2021 8:13 PM, Bill Dunlap wrote:

> On my Windows 10 laptop I see evidence of the operating system caching
> information about recently accessed files. This makes it hard to say
> how the speed might be improved. Is there a way to clear this cache?
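On the sorting question: the help page for list.files() states that the files are sorted in alphabetical order, on the full path if full.names = TRUE, so the observation matches the documented behaviour. The small check below illustrates it on a temporary directory; the file names are made up for the example, and as far as I know list.files() has no argument that switches the sorting off.

```r
# Create three files in a deliberately non-alphabetical order
d <- tempfile("sortdemo")
dir.create(d)
file.create(file.path(d, c("c.txt", "a.txt", "b.txt")))

short <- list.files(d, full.names = FALSE)
full  <- list.files(d, full.names = TRUE)

# The names come back sorted, not in creation order
print(short)
# full.names only prepends the directory; the order is the same
print(identical(basename(full), short))
```

Whether reading the files in on-disk order would be any faster is a separate question that this check does not address.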