james.holtman@convergys.com
2004-Dec-12 22:03 UTC
[R] 'object.size' takes a long time to return a value
I was using 'object.size' to see how much memory a list was taking up. After executing the command, I thought that my computer had locked up. After further testing, I determined that it was taking 241 seconds for object.size to return a value. I did notice in the release notes that 'object.size' takes longer when the list contains character vectors. Is the time 'object.size' takes to return a value to be expected for such a list? Much better results were obtained when the character vectors were converted to factors.

###### Results from the testing ######

> str(x.1)
List of 10
 $ : chr [1:227299] "sadc" "sar" "date" "ksh" ...
 $ : chr [1:227299] "aprperf" "aprperf" "aprperf" "aprperf" ...
 $ : num [1:227299] 23 23 0 23 23 0 0 0 0 23 ...
 $ : num [1:227299] 0 0 0 0 0 0 0 0 0 0 ...
 $ : num [1:227299] 3600 3600 0.01 3600 3600 0.01 0.01 0.01 0.01 3600 ...
 $ : num [1:227299] 0.01 0 0.01 0 0.01 0 0.01 0 0 0.01 ...
 $ : num [1:227299] 0 0 0 0 0 0 0 0 0 0 ...
 $ : num [1:227299] 0.01 0 0.01 0 0.01 0 0.01 0 0 0.01 ...
 $ : num [1:227299] 62608 67968 29 10208 13128 ...
 $ : num [1:227299] 0 1 0 0 1 0 0 0 0 0 ...
# takes a long time (241 seconds) to report the size
> gc(); system.time(print(object.size(x.1)))
          used (Mb) gc trigger  (Mb)
Ncells  711007 19.0    2235810  59.8
Vcells 5191294 39.7   14409257 110.0
[1] 34154972
[1] 241.07 0.00 241.08 NA NA

# trying list of 1000
> x.2 <- list.subset(x.1, 1:1000); gc(); system.time(print(object.size(x.2)))
          used (Mb) gc trigger  (Mb)
Ncells  711006 19.0    2235810  59.8
Vcells 4300288 32.9   14409257 110.0
[1] 145860
[1] 0.01 0.00 0.01 NA NA

# trying list of 10,000
> x.2 <- list.subset(x.1, 1:10000); gc(); system.time(print(object.size(x.2)))
          used (Mb) gc trigger  (Mb)
Ncells  711006 19.0    2235810  59.8
Vcells 4381288 33.5   14409257 110.0
[1] 1491948
[1] 0.28 0.00 0.28 NA NA

# list of 40,000
> x.2 <- list.subset(x.1, 1:40000); gc(); system.time(print(object.size(x.2)))
          used (Mb) gc trigger  (Mb)
Ncells  711006 19.0    2235810  59.8
Vcells 4651288 35.5   14409257 110.0
[1] 5988460
[1] 7.15 0.00 7.15 NA NA

# list of 60,000
> x.2 <- list.subset(x.1, 1:60000); gc(); system.time(print(object.size(x.2)))
          used (Mb) gc trigger  (Mb)
Ncells  711006 19.0    2235810  59.8
Vcells 4831288 36.9   14409257 110.0
[1] 9001556
[1] 17.33 0.00 17.32 NA NA

# list of 100,000
> x.2 <- list.subset(x.1, 1:100000); gc(); system.time(print(object.size(x.2)))
          used (Mb) gc trigger  (Mb)
Ncells  711006 19.0    2235810  59.8
Vcells 5191288 39.7   14409257 110.0
[1] 15044780
[1] 51.85 0.00 51.86 NA NA

# list structure of the last object
> str(x.2)
List of 10
 $ : chr [1:100000] "sadc" "sar" "date" "ksh" ...
 $ : chr [1:100000] "aprperf" "aprperf" "aprperf" "aprperf" ...
 $ : num [1:100000] 23 23 0 23 23 0 0 0 0 23 ...
 $ : num [1:100000] 0 0 0 0 0 0 0 0 0 0 ...
 $ : num [1:100000] 3600 3600 0.01 3600 3600 0.01 0.01 0.01 0.01 3600 ...
 $ : num [1:100000] 0.01 0 0.01 0 0.01 0 0.01 0 0 0.01 ...
 $ : num [1:100000] 0 0 0 0 0 0 0 0 0 0 ...
 $ : num [1:100000] 0.01 0 0.01 0 0.01 0 0.01 0 0 0.01 ...
 $ : num [1:100000] 62608 67968 29 10208 13128 ...
 $ : num [1:100000] 0 1 0 0 1 0 0 0 0 0 ...
# with the first two items on the list converted to factors,
# 'object.size' performs a lot better
> str(x.1)
List of 10
 $ : Factor w/ 175 levels "#bpbkar","#bpcd",..: 132 133 60 93 13 160 60 84 60 132 ...
 $ : Factor w/ 8 levels "apra3g","aprperf",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ : num [1:227299] 23 23 0 23 23 0 0 0 0 23 ...
 $ : num [1:227299] 0 0 0 0 0 0 0 0 0 0 ...
 $ : num [1:227299] 3600 3600 0.01 3600 3600 0.01 0.01 0.01 0.01 3600 ...
 $ : num [1:227299] 0.01 0 0.01 0 0.01 0 0.01 0 0 0.01 ...
 $ : num [1:227299] 0 0 0 0 0 0 0 0 0 0 ...
 $ : num [1:227299] 0.01 0 0.01 0 0.01 0 0.01 0 0 0.01 ...
 $ : num [1:227299] 62608 67968 29 10208 13128 ...
 $ : num [1:227299] 0 1 0 0 1 0 0 0 0 0 ...

> system.time(print(object.size(x.1)))  # now it is fast
[1] 16374176
[1] 0 0 0 NA NA

> version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    2
minor    0.1
year     2004
month    11
day      15
language R

--
James Holtman            "What is the problem you are trying to solve?"
Executive Technical Consultant  --  Office of Technology, Convergys
james.holtman at convergys.com
+1 (513) 723-2929
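The `list.subset` call in the transcript above is not a base R function; it is presumably a small helper from James's own workspace. A minimal sketch, assuming it simply subsets every component of the list by the same index vector (the name and behaviour are inferred from the transcript, not confirmed by the source):

```r
## Hypothetical helper matching the 'list.subset' calls in the transcript:
## apply the same index vector to each component of a list.
list.subset <- function(x, idx) lapply(x, function(el) el[idx])

## e.g. keep the first 1000 entries of every component:
## x.2 <- list.subset(x.1, 1:1000)
```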
Martin Maechler
2004-Dec-13 11:20 UTC
[R] 'object.size' takes a long time to return a value
>>>>> "james" == james holtman <james.holtman at convergys.com>
>>>>>     on Sun, 12 Dec 2004 17:03:31 -0500 writes:

    james> I was using 'object.size' to see how much memory a
    james> list was taking up. After executing the command, I
    james> had thought that my computer had locked up. After
    james> further testing, I determined that it was taking 241
    james> seconds for object.size to return a value.

    james> I did notice in the release notes that 'object.size'
    james> did take longer when the list contained character
    james> vectors. Is the time that it is taking 'object.size'
    james> to return a value to be expected for such a list?

Yes, partly it's expected to take longer than for other types, but it actually takes longer than I would have expected, even after starting to think about it: every element of your character vector is a string, which is coded ``as a vector of bytes with a string terminator'' (a simplification). To find a string's length, i.e., what the R function nchar() also does, "one" has to read all characters up to the string terminator. That's much slower than just using the hard-coded fact that an integer is 4 bytes or a double is 8.

    james> Much better results were obtained when the character
    james> vectors were converted to factors.

Yes; since your factors only had a dozen or at most 175 levels, and only the levels are character, the factor *data* are integers. However, what I say above does not explain everything about the slowness of object.size( <character> ). We would have to go into the C code and the exact implementation of object.size() to see the reason -- and think about possible improvements.
BTW: Note that R saves memory when character elements are "shared"; e.g., for me (on 64-bit Linux, 2.0.1 patched):

> object.size(rep("abcedfghijklmn", 3))
[1] 152
> object.size(c("abcedfghijklmn", "ABCEDFGHIJKLMN", "ABCEDFGHijklmn"))
[1] 296

Here is some code to experiment further, which slowly constructs character vectors where (I think) no "sharing" takes place:

rChar <- function(n, m, ch.set = c(LETTERS, letters))
{
  ## Purpose: create random character vector
  ## ----------------------------------------------------------------------
  ## Arguments: n: length of vector
  ##            m: "average" string length
  ## ----------------------------------------------------------------------
  ## Author: Martin Maechler, Date: 13 Dec 2004, 11:35
  sapply(rpois(n, lambda = m),
         function(m) paste(sample(ch.set, size = m), collapse = ""))
}

lc <- rChar(1e5, 4)  # already takes several seconds on a fast machine

## This is on 64-bit [AMD Athlon(tm) 64 Processor 2800+] "lynne":
system.time(print(object.size(lc)))
## [1] 7240464
## [1] 2.11 0.00 2.14 0.00 0.00

system.time(print(sum(nchar(lc))))  # which is **MUCH** faster
## [1] 399461
## [1] 0.02 0.00 0.02 0.00 0.00

## but still quite slower
system.time(print(for(i in 1:10) sn <- sum(nchar(lc))))   ## 0.10
## than
lx <- rnorm(1e5)
system.time(print(for(i in 1:10) os <- object.size(lx)))  ## 0.01

##------------

Note that if we continue this topic, it should probably be moved to R-devel, since it's getting technical and about R internals (coded in C).

--
Martin Maechler, ETH Zurich
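The factor effect described in this thread can be reproduced on synthetic data. This is a sketch, not code from the thread: the variable names are illustrative, and absolute timings will vary by machine and R version (on recent R, object.size() on character data is far faster than the 2004 figures above).

```r
## A character vector with few distinct values stores a string per element;
## its factor form stores integer codes plus a short table of levels, so
## object.size() has far fewer strings to scan.
vals  <- c("sadc", "sar", "date", "ksh")
x.chr <- sample(vals, 1e5, replace = TRUE)
x.fac <- factor(x.chr)

system.time(os.chr <- object.size(x.chr))  # scans every element's string
system.time(os.fac <- object.size(x.fac))  # mostly fixed-size integer data
```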