Douglas Bates
1997-Sep-10 22:03 UTC
R-alpha: model.matrix misbehaves on two-sided formulas
I am using R for a graduate class on linear models that I teach. I sent the class a sample transcript of explicit construction of the model matrix and use of the QR decomposition. One of the members of the class picked up on the fact that the model matrix was not what I advertised that it should be when a two-sided formula is used. A one-sided formula works properly. For example, > library(st849) # loads data sets from the text > e1.1 density speed [1,] 20.4 38.8 [2,] 27.4 31.5 [3,] 106.2 10.6 [4,] 80.4 16.1 [5,] 141.3 7.7 [6,] 130.9 8.3 [7,] 121.7 8.5 [8,] 106.5 11.1 [9,] 130.5 8.6 [10,] 101.1 11.1 [11,] 123.9 9.8 [12,] 144.2 7.8 [13,] 29.5 31.8 [14,] 30.8 31.6 [15,] 26.5 34.0 [16,] 35.7 28.9 [17,] 30.0 28.8 [18,] 106.2 10.5 [19,] 97.0 12.3 [20,] 90.1 13.2 [21,] 106.7 11.4 [22,] 99.3 11.2 [23,] 107.2 10.3 [24,] 109.1 11.4 > class(e1.1) [1] "data.frame" > model.matrix(~density, data = e1.1) # what we would expect (Intercept) density [1,] 1 20.4 [2,] 1 27.4 [3,] 1 106.2 [4,] 1 80.4 [5,] 1 141.3 [6,] 1 130.9 [7,] 1 121.7 [8,] 1 106.5 [9,] 1 130.5 [10,] 1 101.1 [11,] 1 123.9 [12,] 1 144.2 [13,] 1 29.5 [14,] 1 30.8 [15,] 1 26.5 [16,] 1 35.7 [17,] 1 30.0 [18,] 1 106.2 [19,] 1 97.0 [20,] 1 90.1 [21,] 1 106.7 [22,] 1 99.3 [23,] 1 107.2 [24,] 1 109.1 attr(,"assign") [1] 0 1 > model.matrix(sqrt(speed) ~ density, data = e1.1) # not what we want (Intercept) density [1,] 1 38.8 [2,] 1 31.5 [3,] 1 10.6 [4,] 1 16.1 [5,] 1 7.7 [6,] 1 8.3 [7,] 1 8.5 [8,] 1 11.1 [9,] 1 8.6 [10,] 1 11.1 [11,] 1 9.8 [12,] 1 7.8 [13,] 1 31.8 [14,] 1 31.6 [15,] 1 34.0 [16,] 1 28.9 [17,] 1 28.8 [18,] 1 10.5 [19,] 1 12.3 [20,] 1 13.2 [21,] 1 11.4 [22,] 1 11.2 [23,] 1 10.3 [24,] 1 11.4 attr(,"assign") [1] 0 1 I think that somewhere along the line an expression like eval(form[[2]], data) is being used when it would be better to use eval(form[[length(form)]], data) -- Douglas Bates bates@stat.wisc.edu Statistics Department 608/262-2598 University of Wisconsin - Madison http://www.stat.wisc.edu/~bates/ =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Thomas Lumley
1997-Sep-10 22:15 UTC
R-alpha: model.matrix misbehaves on two-sided formulas
On 10 Sep 1997, Douglas Bates wrote:> I am using R for a graduate class on linear models that I teach. I > sent the class a sample transcript of explicit construction of the > model matrix and use of the QR decomposition. One of the members of > the class picked up on the fact that the model matrix was not what I > advertised that it should be when a two-sided formula is used. A > one-sided formula works properly. For example, > > library(st849) # loads data sets from the text > > e1.1 > density speed > [1,] 20.4 38.8 > > model.matrix(~density, data = e1.1) # what we would expect > (Intercept) density > [1,] 1 20.4 > [2,] 1 27.4 > > model.matrix(sqrt(speed) ~ density, data = e1.1) # not what we want > (Intercept) density > [1,] 1 38.8 > [2,] 1 31.5 > > I think that somewhere along the line an expression like > eval(form[[2]], data) > is being used when it would be better to use > eval(form[[length(form)]], data)No, it's worse than that. For versions of R up to 0.50-a4, the new alpha release, model.matrix(formula,dataframe) assumes that its data argument was generated by dataframe<-model.frame(formula,data) It does *not* check the names or order of the columns. This means that model.matrix(formula, some.data.frame) will usually give the wrong answer. In R0.50-a4 this is fixed. Thomas Lumley ------------------------------------------------------+------ Biostatistics : "Never attribute to malice what : Uni of Washington : can be adequately explained by : Box 357232 : incompetence" - Hanlon's Razor : Seattle WA 98195-7232 : : ------------------------------------------------------------ =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
I have two difficult but not impossible suggestions for long-term improvements to memory handling in R. The first is in foreign function calling. It would sometimes save a lot of time if you could designate arguments to to .C/.Fortran as input or output only (like Fortran 90 INTENT IN/OUT descriptions). At the moment every argument to foreign code is copied into a non-relocatable area, then copied back into a new R variable. If a variable was INTENT IN, it would not be copied back, and that element of the return value would be NULL; if a variable was INTENT OUT it would not be copied in and R would just malloc() the correct amount of space in the non-relocatable area. I've been writing quite a bit of C code that turns huge matrices into small ones, involving a lot of memory movement that could easily be removed. The default would have to still be copying everything, for backwards compatibility and idiotproofing. The second, more difficult suggestion, is to do something other than stop when memory is exhausted. It doesn't matter if what you do is slow, as the alternative is complete failure. It should be possible for R to try and grab a bigger chunk of memory (perhaps 50-100% bigger) and move into it. The relocatable part of the heap wouldn't be a big problem but the non-relocatable part would need to be kept where it was and a new one started. On the other hand, this probably requires rewriting a whole bunch of stuff and might be very painful. -thomas =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-