Mark.Bravington at csiro.au
2007-Oct-31 23:56 UTC
[R] model.matrix consistency across repeated calls
I am using R to construct model matrices that I then pass into C for subsequent fitting. Suppose I have a data.frame so big that, if I called 'model.matrix' directly on the whole thing, the results would be too big to handle (because factors expand to multiple columns, etc.). Instead, I really want to sequentially call 'model.matrix' on subsets of rows, and then 'rbind' the [compressed] results.

However, this is not guaranteed to give the same result as just calling 'model.matrix' on the whole thing. Certain terms used in formulae, such as 'poly', are sensitive to the range of their argument; and I'm also worried about things like columns sometimes disappearing when particular levels of a factor don't appear in one of the subsets (I don't think that one actually happens, but I'm not *sure*).

Can anyone suggest how to achieve consistency-of-interpretation across calls to 'model.matrix'? For example: are there certain types of term in formulae that I just have to avoid? Or can I benefit somehow from 'model.frame' (which I have never understood...)?

In case you are wondering: I'm not going to directly rbind the results together, of course. After each call to model.matrix, I pass the result of that call into C where I compress it massively, and the compressed version of the whole thing squeezes into memory OK.

Thanks for any help (preferably replying to me as well as to the list -- ta)

Mark

-- 
Mark Bravington
CSIRO Mathematical & Information Sciences
Marine Laboratory
Castray Esplanade
Hobart 7001
TAS

ph (+61) 3 6232 5118
fax (+61) 3 6232 5012
mob (+61) 438 315 623
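
For illustration only, here is a minimal sketch of one way the chunk-by-chunk consistency asked about above might be obtained. It follows the same pattern 'predict.lm' uses for new data: call 'model.frame' once to get a terms object that stores the data-dependent parts of the formula (the 'predvars' attribute, e.g. the 'poly' coefficients), record the full factor levels with '.getXlevels', and then reuse both for every chunk. The data frame 'big', the formula, the chunking and the helper 'chunk_mm' are all hypothetical. The sketch assumes the model frame of the full data fits in memory even when the model matrix does not (factors are not expanded in the model frame, though 'poly' columns are), which may not hold in every case.

```r
# Hypothetical toy data; 'big' stands in for the large data.frame.
set.seed(1)
big <- data.frame(
  x = runif(20),
  f = factor(rep(c("a", "b", "c", "d"), each = 5))
)
form <- ~ poly(x, 2) + f

# One pass over the full data to fix the data-dependent parts:
# the terms object carries 'predvars' (poly() with its stored coefficients),
# and .getXlevels() records the complete set of levels for each factor.
mf_all <- model.frame(form, data = big)
tt     <- terms(mf_all)
xlev   <- .getXlevels(tt, mf_all)

# Hypothetical helper: model matrix for one chunk of rows.
# droplevels() simulates a chunk that arrives without its unused levels;
# 'xlev' restores the full level set before the matrix is built.
chunk_mm <- function(rows) {
  chunk <- droplevels(big[rows, , drop = FALSE])
  mf <- model.frame(tt, data = chunk, xlev = xlev)
  model.matrix(tt, mf)
}

# Process the rows in four chunks of five. Here the pieces are rbind-ed only
# to compare against the all-at-once result; in the real application each
# piece would be handed to the C code instead.
chunks     <- split(seq_len(nrow(big)), rep(1:4, each = 5))
pieces     <- lapply(chunks, chunk_mm)
mm_chunked <- do.call(rbind, unname(pieces))
mm_full    <- model.matrix(form, big)

# Compare values only: rbind() drops the 'assign' and 'contrasts' attributes.
print(all.equal(mm_chunked, mm_full, check.attributes = FALSE))
```

Terms that do not supply a 'makepredictcall' method (so nothing is stored in 'predvars') would still be re-evaluated on each chunk separately, so under this approach they would either have to be avoided or precomputed as ordinary columns.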