thr3ads.net - R help - [R] Preparing dataset for glmnet: factors to dummies [Feb 2011]


Nick Sabbe

2011-Feb-01 09:46 UTC

[R] Preparing dataset for glmnet: factors to dummies

Hello list.

For some reason, glmnet does not accept a data frame as input. It expects
a matrix in which the factors have already been coded as dummy variables.
Now I have created a sample dataset with
. 11 factor columns with two levels
. 4 factor columns with three levels
. 135 continuous columns (from a standard normal)
. 100 observations (rows)
Say this dataframe is in dfrPredictors.
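
A dataset of this shape could be generated roughly as follows. This is only a sketch: the code that built the sample was not posted, so the seed, the level labels, and the column names below are assumptions.

```r
set.seed(1)  # arbitrary seed (assumption), only for reproducibility
n <- 100     # number of observations

# Helper (hypothetical name): one random factor column with the given levels
randomFactor <- function(lev) factor(sample(lev, n, replace = TRUE), levels = lev)

twoLevel   <- replicate(11, randomFactor(c("a", "b")), simplify = FALSE)
threeLevel <- replicate(4, randomFactor(c("a", "b", "c")), simplify = FALSE)
continuous <- replicate(135, rnorm(n), simplify = FALSE)

# Column names are illustrative only
cols <- c(twoLevel, threeLevel, continuous)
names(cols) <- c(paste0("f2_", 1:11), paste0("f3_", 1:4), paste0("x", 1:135))

dfrPredictors <- as.data.frame(cols)
```

With treatment contrasts and the intercept dropped, such a data frame expands to 11 + 4 * 2 + 135 = 154 matrix columns.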

What I do now is use the following code:

form <- paste("~", paste(colnames(dfrPredictors), collapse="+"), sep="")
dfrTmp <- model.frame(dfrPredictors, na.action=na.pass)
result <- as.matrix(model.matrix(as.formula(form), data=dfrTmp))[,-1]

This works (although admittedly, I don't understand all of it).
However, I notice that even for this rather small dataset, the conversion
takes around 0.1 seconds of user/elapsed time (on a relatively fast laptop).

For my current work, I need to do this many times on very similar
dataframes (in fact, they are multiply imputed from the same 'original'
dataframe), so I need all the speed I can get.
Does anybody know of a way that is quicker than the above? Note: because of
other uses of the dataframe, I cannot do this conversion before the
imputation, so I really need the conversion itself to be fast.

Thanks,


Nick Sabbe
--
ping: nick.sabbe at ugent.be
link: http://biomath.ugent.be
wink: A1.056, Coupure Links 653, 9000 Gent
ring: 09/264.59.36


Martin Maechler

2011-Feb-01 11:33 UTC



>>>>> "NS" == Nick Sabbe <nick.sabbe at ugent.be>
>>>>>     on Tue, 1 Feb 2011 10:46:01 +0100 writes:
    NS> Hello list.
    NS> For some reason, the makers of glmnet do not accept a dataframe as
    NS> input.
    NS> They expect the input to be a matrix, where the dummies are already
    NS> precoded.
    NS> Now I have created a sample dataset with
    NS> . 11 factor columns with two levels
    NS> . 4 factor columns with three levels
    NS> . 135 continuous columns (from a standard normal)
    NS> . 100 observations (rows)
    NS> Say this dataframe is in dfrPredictors.

Please do provide your R code next time, so that we have a fully
reproducible example.

    NS> What I do now, is use the following code:

    NS> form<-paste("~",paste(colnames(dfrPredictors), collapse="+"), sep="")
    NS> dfrTmp<-model.frame(dfrPredictors, na.action=na.pass)
    NS> result<- as.matrix(model.matrix(as.formula(form), data=dfrTmp))[,-1]

    NS> This works (although admittedly, I don't understand everything of
    NS> it).
    NS> However, I notice that for this rather limited dataset, this
    NS> conversion takes around 0.1 seconds user/elapsed time (on a
    NS> relatively speedy laptop).

    NS> For my current work, I need to do this a lot of times on very similar
    NS> dataframes (in fact, they are multiply imputed from the same
    NS> 'original' dataframe), so I need all the speed I can get.

    NS> Does anybody know of a way that is quicker than the above? Note:
    NS> because of other uses of the dataframe, I don't have the option to
    NS> do this conversion before the imputation, so I really need the
    NS> conversion itself to work quickly.

The glmnet package fortunately also works with sparse matrices
(as from the 'Matrix' package). In Matrix, there's the function
sparse.model.matrix(), which should work like model.matrix()
but produce a sparse matrix.
This is typically considerably faster when the resulting matrix
is large and sparse, notably because the memory footprint is so
much smaller.
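
As a minimal sketch of the sparse route, the following builds the same dummy coding once with model.matrix() and once with sparse.model.matrix(). The toy data frame `dfr` and its column names are assumptions made for illustration; the formula `~ .` stands in for pasting the column names together.

```r
library(Matrix)  # recommended package, distributed with R

set.seed(2)  # arbitrary seed (assumption)
n <- 100

# Small illustrative data frame (hypothetical columns)
dfr <- data.frame(
  f2 = factor(sample(c("a", "b"), n, replace = TRUE), levels = c("a", "b")),
  f3 = factor(sample(c("a", "b", "c"), n, replace = TRUE), levels = c("a", "b", "c")),
  x1 = rnorm(n),
  x2 = rnorm(n)
)

# Dense dummy coding; [, -1] drops the intercept column
dense <- model.matrix(~ ., data = dfr)[, -1]

# Same coding, stored as a sparse matrix from the Matrix package
sparse <- sparse.model.matrix(~ ., data = dfr)[, -1]
```

Whether the sparse version is actually quicker depends on how sparse the result is. A matrix dominated by continuous columns, as in the sample described in the question, contains few zeros, so it is worth timing both variants with system.time() on the real data.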

We (Matrix authors) have gone a step further, and written
a model.Matrix() function with an argument 'sparse = FALSE / TRUE',
which should mirror the functionality of R's model.matrix() even
more closely (model.matrix() itself produces only standard, i.e.,
dense, matrices).

The functionality of model.Matrix() has been moved out of the
Matrix package into the package 'MatrixModels',
and that package also provides -- somewhat experimental --
functionality for fitting GLMs with sparse model matrices.
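
A call to model.Matrix() might look like the sketch below. It assumes the 'MatrixModels' package is installed (the code is skipped otherwise), and the toy data frame `dfr` is hypothetical.

```r
set.seed(4)  # arbitrary seed (assumption)
n <- 50

# Small illustrative data frame (hypothetical columns)
dfr <- data.frame(
  f2 = factor(sample(c("a", "b"), n, replace = TRUE), levels = c("a", "b")),
  f3 = factor(sample(c("a", "b", "c"), n, replace = TRUE), levels = c("a", "b", "c")),
  x1 = rnorm(n)
)

mm <- NULL
if (requireNamespace("MatrixModels", quietly = TRUE)) {
  # sparse = TRUE requests a sparse model matrix; sparse = FALSE a dense one
  mm <- MatrixModels::model.Matrix(~ ., data = dfr, sparse = TRUE)
}
```

The result keeps the intercept column, so it has 1 + 1 + 2 + 1 = 5 columns here; drop the first column before passing it on if no intercept column is wanted.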

We'd be glad to get feedback on your uses and observations with
these sparse model matrices.

Martin Maechler, ETH Zurich

Frank Harrell

2011-Feb-02 00:02 UTC



I believe that glmnet scales variables by their standard deviations.  This
would not be appropriate for categorical predictors.
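
To illustrate the concern, the base R sketch below shows how dividing a 0/1 dummy by its standard deviation rescales it differently depending on how common the level is. The prevalences and the seed are assumptions chosen for illustration. In glmnet this behaviour is controlled by the `standardize` argument, which defaults to TRUE; the fitted coefficients are reported on the original scale either way.

```r
set.seed(3)  # arbitrary seed (assumption)
n <- 1000

balanced <- rbinom(n, 1, 0.5)   # dummy for a level present in about half the rows
rare     <- rbinom(n, 1, 0.05)  # dummy for a level present in about 5% of rows

sdBalanced <- sd(balanced)
sdRare     <- sd(rare)

# After dividing by the SD, the step from 0 to 1 becomes a step of 1 / sd,
# so the rare dummy is stretched more than the balanced one
stepBalanced <- 1 / sdBalanced
stepRare     <- 1 / sdRare

# To switch the scaling off in glmnet (not run here; x and y are placeholders):
# fit <- glmnet::glmnet(x, y, standardize = FALSE)
```

With standardization switched off, any scaling of the continuous columns has to be done by hand before the call.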

Frank


-----
Frank Harrell
Department of Biostatistics, Vanderbilt University
