Hello list.

For some reason, the makers of glmnet do not accept a dataframe as input. They expect the input to be a matrix, where the dummies are already precoded.

Now I have created a sample dataset with
. 11 factor columns with two levels
. 4 factor columns with three levels
. 135 continuous columns (from a standard normal)
. 100 observations (rows)

Say this dataframe is in dfrPredictors. What I do now, is use the following code:

    form<-paste("~",paste(colnames(dfrPredictors), collapse="+"), sep="")
    dfrTmp<-model.frame(dfrPredictors, na.action=na.pass)
    result<- as.matrix(model.matrix(as.formula(form), data=dfrTmp))[,-1]

This works (although admittedly, I don't understand everything of it). However, I notice that for this rather limited dataset, this conversion takes around 0.1 seconds user/elapsed time (on a relatively speedy laptop).

For my current work, I need to do this a lot of times on very similar dataframes (in fact, they are multiply imputed from the same 'original' dataframe), so I need all the speed I can get.

Does anybody know of a way that is quicker than the above? Note: because of other uses of the dataframe, I don't have the option to do this conversion before the imputation, so I really need the conversion itself to work quickly.

Thanks,

Nick Sabbe
--
ping: nick.sabbe at ugent.be
link: http://biomath.ugent.be
wink: A1.056, Coupure Links 653, 9000 Gent
ring: 09/264.59.36
--
Do Not Disapprove
Martin Maechler
2011-Feb-01 11:33 UTC
[R] Preparing dataset for glmnet: factors to dummies
>>>>> "NS" == Nick Sabbe <nick.sabbe at ugent.be>
>>>>>     on Tue, 1 Feb 2011 10:46:01 +0100 writes:

    NS> Hello list.
    NS> For some reason, the makers of glmnet do not accept a dataframe as input.
    NS> They expect the input to be a matrix, where the dummies are already
    NS> precoded.
    NS> Now I have created a sample dataset with
    NS> . 11 factor columns with two levels
    NS> . 4 factor columns with three levels
    NS> . 135 continuous columns (from a standard normal)
    NS> . 100 observations (rows)
    NS> Say this dataframe is in dfrPredictors.

please do provide your R code next time, so we'll have a fully reproducible example ....

    NS> What I do now, is use the following code:
    NS> form<-paste("~",paste(colnames(dfrPredictors), collapse="+"), sep="")
    NS> dfrTmp<-model.frame(dfrPredictors, na.action=na.pass)
    NS> result<- as.matrix(model.matrix(as.formula(form), data=dfrTmp))[,-1]

    NS> This works (although admittedly, I don't understand everything of it).
    NS> However, I notice that for this rather limited dataset, this conversion
    NS> takes around 0.1 seconds user/elapsed time (on a relatively speedy laptop).

    NS> For my current work, I need to do this a lot of times on very similar
    NS> dataframes (in fact, they are multiply imputed from the same 'original'
    NS> dataframe), so I need all the speed I can get.

    NS> Does anybody know of a way that is quicker than the above? Note: because of
    NS> other uses of the dataframe, I don't have the option to do this conversion
    NS> before the imputation, so I really need the conversion itself to work
    NS> quickly.

The glmnet package fortunately also works with sparse matrices (as from the 'Matrix' package).

In Matrix, there's the function sparse.model.matrix() which should work like model.matrix() but produce a sparse matrix. This is typically considerably faster when the resulting matrix is large and sparse, notably because the memory footprint is so much smaller.

We (Matrix authors) have gone a step further, and written a model.Matrix() function with argument 'sparse = FALSE / TRUE' which should even more closely mirror the functionality of R's model.matrix() (as that produces only standard, i.e., dense matrices).

The functionality of model.Matrix() has been moved out of the Matrix package into the package 'MatrixModels', and that package also provides -- somewhat experimental -- functionality for fitting GLMs with sparse model matrices.

We'd be glad to get feedback on your uses and observations with these sparse model matrices.

Martin Maechler, ETH Zurich
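
A minimal, reproducible sketch of the suggestion above. The data frame is simulated to match the description in the original post; the column names (f2_*, f3_*, x*), the factor labels and the seed are made up for illustration. It assumes there are no missing values, so the na.pass step from the original code is left out. Since 135 of the 154 resulting columns are continuous, the matrix here is mostly dense, and any gain from the sparse route may be small on data of this shape; timings will differ by machine.

```r
library(Matrix)  # recommended package, shipped with standard R installations

set.seed(1)  # arbitrary seed, only for reproducibility
n <- 100

# Hypothetical stand-in for dfrPredictors:
# 11 two-level factors, 4 three-level factors, 135 standard normal columns.
two_level <- replicate(
  11, factor(sample(c("a", "b"), n, replace = TRUE), levels = c("a", "b")),
  simplify = FALSE
)
names(two_level) <- paste0("f2_", seq_along(two_level))

three_level <- replicate(
  4, factor(sample(c("a", "b", "c"), n, replace = TRUE), levels = c("a", "b", "c")),
  simplify = FALSE
)
names(three_level) <- paste0("f3_", seq_along(three_level))

continuous <- as.data.frame(matrix(rnorm(n * 135), nrow = n))
names(continuous) <- paste0("x", seq_len(135))

dfrPredictors <- data.frame(two_level, three_level, continuous)

# Dense route: "~ ." uses every column, so no formula string has to be pasted.
# [, -1] drops the intercept column, as in the original code.
time_dense <- system.time(
  dense <- model.matrix(~ ., data = dfrPredictors)[, -1]
)

# Sparse route: same formula interface, result is a sparse matrix from Matrix.
time_sparse <- system.time(
  sparse <- sparse.model.matrix(~ ., data = dfrPredictors)[, -1]
)

# Both routes should encode the same numbers.
max_abs_diff <- max(abs(as.matrix(sparse) - dense))

print(rbind(dense = time_dense, sparse = time_sparse)[, c("user.self", "elapsed")])
print(dim(sparse))
print(max_abs_diff)
```

Either `dense` or `sparse` can then be passed as the `x` argument of glmnet(). A single call is too quick to time reliably, so repeating each conversion many times gives a fairer comparison.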
I believe that glmnet scales variables by their standard deviations. This would not be appropriate for categorical predictors.

Frank

-----
Frank Harrell
Department of Biostatistics, Vanderbilt University
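
A small sketch related to the point about scaling, assuming the glmnet package is installed from CRAN. glmnet() has a `standardize` argument that defaults to TRUE; the toy data, variable names and the value of `s` below are made up for illustration. Whether switching it off is the right choice for dummy columns is a modelling decision that this sketch does not settle.

```r
library(glmnet)  # assumption: installed from CRAN

set.seed(1)  # arbitrary seed, only for reproducibility
n <- 100

# Toy predictors: one 0/1 dummy and one continuous column on a larger scale.
x <- cbind(dummy = rbinom(n, 1, 0.3), cont = rnorm(n, sd = 5))
y <- 1 + 2 * x[, "dummy"] + 0.5 * x[, "cont"] + rnorm(n)

# Default: columns of x are standardized before the penalty is applied.
fit_std <- glmnet(x, y)

# Alternative: the penalty acts on the columns as supplied.
fit_raw <- glmnet(x, y, standardize = FALSE)

# Coefficients are reported on the original scale in both cases,
# but the two regularization paths generally differ.
print(coef(fit_std, s = 0.1))
print(coef(fit_raw, s = 0.1))
```

With `standardize = FALSE`, any rescaling of the continuous columns is left to the user, which makes it possible to leave the dummy columns untouched.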