Jay Emerson
2008-Jan-30 15:20 UTC
[Rd] Understanding an R improvement that already occurred.
I was surprised to observe the following difference between 2.4.1 and 2.6.0 after a long overdue upgrade a few months ago of our departmental server. It wasn't a bug fix, but a subtle improvement. Here's the simplest example I could create. The size is excessive, on the order of the Netflix Competition data.

The integer matrix is about 1.12 GB, and if coerced to numeric it is 2.24 GB. The peak memory consumption of the first (old) operation was 1.12 + 2.24 + 2.24 = 5.6 GB. The peak memory consumption of the second (new) operation is 1.12 + 2.24 = 3.36 GB. (See below.)

In contrast, if a numeric matrix is used, there are no differences between the versions (so the improvement seems related to the integer type or the decision when/how to do the coercion). And of course I realize that x <- x + as.integer(1) is an option, but that isn't the point of this exercise.

I'm curious, but also spending time on memory-related work. Someone deserves a 'thank you' and a pat on the back for making this sort of improvement. Surely someone can step forward and take a bow, and perhaps explain the nature of the improvement?

On a related note, a new package bigmemoRy will be available soon, handling massive matrices of double, integer, short, or char in RAM. On Unix (sorry, Windows), these matrices can also be used with shared memory (with mutexes implemented) for parallel processing. It's a niche market, obviously, ideal for data larger than 1 GB (roughly) but still within the bounds of RAM. It may be a useful developer tool for big-data problems.

------------------------
R version 2.4.1 (linux):

> x <- matrix(as.integer(0), 1e+08, 3)
> x <- x + 1
> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    233754   12.5     467875     25    350000   18.7
Vcells 300119431 2289.8  787870506   6011 750119944 5723.0

------------------------
R version 2.6.0 (linux):

> x <- matrix(as.integer(0), 1e+08, 3)
> x <- x + 1
> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    137931    7.4     350000   18.7    350000   18.7
Vcells 300126402 2289.8  472877829 3607.8 450126789 3434.2

--
John W. Emerson (Jay)
Assistant Professor of Statistics
Director of Graduate Studies (on leave 07-08)
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
Statistical Consultant, REvolution Computing
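A minimal sketch of the integer-preserving alternative mentioned above, run at a much smaller size so it fits on any machine (the dimensions here are illustrative, not the 1e+08-row example from the post); gc(reset = TRUE) clears the "max used" statistics so the peak of each step can be read off separately:

> x <- matrix(as.integer(0), 1e+06, 3)  # small integer matrix, ~12 MB
> gc(reset = TRUE)                      # clear the "max used" statistics
> y <- x + 1                            # 1 is double, so x is coerced and y is numeric
> gc()                                  # "max used" includes the extra numeric copies
> rm(y); gc(reset = TRUE)
> z <- x + as.integer(1)                # integer + integer: no coercion, z stays integer
> gc()                                  # peak is correspondingly smaller

Afterwards, typeof(y) reports "double" and typeof(z) reports "integer", confirming which form of the addition coerces.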
Henrik Bengtsson
2008-Jan-30 15:53 UTC
[Rd] Understanding an R improvement that already occurred.
On Jan 30, 2008 7:20 AM, Jay Emerson <jayemerson at gmail.com> wrote:
> [snip]

That's interesting - I never noticed that change.

On the same topic, in R 2.7.0 devel, the (re-)assignment in the following example no longer creates an extra copy:

> x <- matrix(1, nrow=5000, ncol=5000)
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   132056   7.1     350000  18.7   350000  18.7
Vcells 25136968 191.8   28050871 214.1 25137357 191.8
> x[1,1] <- 2
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   132060   7.1     350000  18.7   350000  18.7
Vcells 25136969 191.8   29533414 225.4 25137357 191.8

In R 2.6.1 that 2nd assignment would result in:

> x[1,1] <- 2
> gc()
           used  (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   138119   7.4     350000  18.7   350000  18.7
Vcells 25126464 191.7   52877950 403.5 50126482 382.5

See https://stat.ethz.ch/pipermail/r-devel/2007-September/047008.html for background.

Thanks a lot whoever (Luke?) took the time to update matrix().

/Henrik
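A minimal sketch for checking the copying behaviour described above on a given build, using tracemem() (available when R is compiled with memory profiling); whether the subassignment triggers a duplication report depends on the R version:

> x <- matrix(1, nrow=5000, ncol=5000)
> tracemem(x)   # mark x so any duplication of it is reported
> x[1,1] <- 2   # before the 2.7.0 change this printed a copy message;
>               # afterwards it modifies x in place, silently
> untracemem(x)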