On 9/27/2021 1:06 AM, Leonard Mada wrote:
> Dear Bill,
>
>
> Does list.files() always sort the results?
>
> It seems so. The option full.names = FALSE has no effect on this:
> the results always seem to be sorted.
>
>
> Maybe it is better to process the files in an unsorted order: as
> stored on the disk?
>
After some more investigation:

This took only a few seconds:

sapply(list.dirs(path = path, full.names = FALSE, recursive = FALSE),
    function(f) length(list.files(path = paste0(path, "/", f),
        full.names = FALSE, recursive = TRUE)))
# maybe with caching, but the difference is enormous

BH seems to contain *by far* the most files: 11701 files.
But excluding it from processing had only a linear effect: still 377 s.
I had a look at src/main/platform.c, but do not fully understand it.
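
To narrow down which step dominates, the directory listing and the size lookup can be timed separately. The following is only a minimal sketch: time.stages() is a hypothetical helper (not part of the code in this thread), and using file.size() instead of file.info()$size is an assumption to test, not a confirmed fix. Timings will also depend on the operating system's file cache.

```r
# Hypothetical helper (not from the original code): time the two stages
# separately for all files below 'path'.
time.stages = function(path = R.home("library")) {
    # Stage 1: recursive listing only
    t.list = system.time(
        files <- list.files(path, full.names = TRUE, all.files = TRUE,
            recursive = TRUE))
    # Stage 2: size lookup only; file.size() is the base R wrapper
    # around file.info(extra_cols = FALSE)$size
    t.size = system.time(sizes <- file.size(files))
    # Elapsed seconds per stage, plus the number of files seen
    c(list.files = t.list[["elapsed"]],
      file.size = t.size[["elapsed"]],
      n.files = length(files))
}
```

Calling time.stages(path) on a single large package directory (for example BH) should show whether the listing or the size lookup accounts for most of the elapsed time.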
Sincerely,
Leonard
>
> Sincerely,
>
>
> Leonard
>
>
> On 9/25/2021 8:13 PM, Bill Dunlap wrote:
>> On my Windows 10 laptop I see evidence of the operating system
>> caching information about recently accessed files. This makes it
>> hard to say how the speed might be improved. Is there a way to clear
>> this cache?
>>
>> > system.time(L1 <- size.f.pkg(R.home("library")))
>>    user  system elapsed
>>    0.48    2.81   30.42
>> > system.time(L2 <- size.f.pkg(R.home("library")))
>>    user  system elapsed
>>    0.35    1.10    1.43
>> > identical(L1,L2)
>> [1] TRUE
>> > length(L1)
>> [1] 30
>> > length(dir(R.home("library"),recursive=TRUE))
>> [1] 12949
>>
>> On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help
>> <r-help at r-project.org> wrote:
>>
>> Dear List Members,
>>
>>
>> I tried to compute the file sizes of each installed package and the
>> process is terribly slow.
>>
>> It took ~ 10 minutes for 512 packages / 1.6 GB total size of files.
>>
>>
>> 1.) Package Sizes
>>
>>
>> system.time({
>>     x = size.pkg(file=NULL);
>> })
>> # elapsed time: 509 s !!!
>> # 512 Packages; 1.64 GB;
>> # R 4.1.1 on MS Windows 10
>>
>>
>> The code for the size.pkg() function is below, and the latest version is
>> on GitHub:
>>
>> https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R
>>
>>
>> Questions:
>> Is there a way to get the file size faster?
>> It takes long on Windows as well, but on the order of 10-20 s, not 10
>> minutes.
>> Am I missing something?
>>
>>
>> 1.b.) Alternative
>>
>> It occurred to me to first read all file sizes and then use tapply or
>> aggregate - but I do not see why it should be faster.
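
The alternative described above (read all file sizes first, then aggregate) could be sketched as follows. This is only a sketch under assumptions: size.by.pkg() is a hypothetical name, file.size() replaces file.info()$size, and whether a single pass is actually faster on Windows has not been measured here.

```r
# Hypothetical single-pass variant (not the original size.f.pkg):
# list every file once, look up all sizes in one call, then sum
# the sizes per top-level directory (= package name).
size.by.pkg = function(path = R.home("library")) {
    # Relative paths of the form "pkg/sub/file"
    files = list.files(path, all.files = TRUE, recursive = TRUE)
    sizes = file.size(file.path(path, files))
    # Top-level directory: everything before the first "/"
    pkg = sub("/.*$", "", files)
    tapply(sizes, pkg, sum)
}
```

Note that a file stored directly in the library directory would be counted under its own name, and a package directory without any files would not appear in the result.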
>>
>> Would it be meaningful to benchmark each individual package?
>>
>> However, I am not very inclined to wait 10 minutes for each new
>> trial.
>>
>>
>> 2.) Big Packages
>>
>> Just as a note: there are a few very large packages (in my list of 512
>> packages):
>>
>>    Size (bytes)  Name
>> 1   123,566,287  BH
>> 2   113,578,391  sf
>> 3   112,252,652  rgdal
>> 4    81,144,868  magick
>> 5    77,791,374  openNLPmodels.en
>>
>> I suspect that sf & rgdal have a lot of duplicated data structures
>> and/or duplicated code and/or duplicated libraries - although I am not an
>> expert in the field and did not check the sources.
>>
>>
>> Sincerely,
>>
>>
>> Leonard
>>
>> =======
>>
>> # Package Size:
>> size.f.pkg = function(path=NULL) {
>>     if(is.null(path)) path = R.home("library");
>>     # top-level directories, one per installed package
>>     xd = list.dirs(path = path, full.names = FALSE, recursive = FALSE);
>>     # total size (in bytes) of all files below package directory p
>>     size.f = function(p) {
>>         p = paste0(path, "/", p);
>>         sum(file.info(list.files(path=p, pattern=".",
>>             full.names = TRUE, all.files = TRUE, recursive = TRUE))$size);
>>     }
>>     sapply(xd, size.f);
>> }
>>
>> size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
>>     x = size.f.pkg(path=path);
>>     x = as.data.frame(x);
>>     names(x) = "Size"
>>     x$Name = rownames(x);
>>     # Order
>>     if(sort) {
>>         id = order(x$Size, decreasing=TRUE)
>>         x = x[id,];
>>     }
>>     if( ! is.null(file)) {
>>         if( ! is.character(file)) {
>>             print("Error: Size NOT written to file!");
>>         } else write.csv(x, file=file, row.names=FALSE);
>>     }
>>     return(x);
>> }