Dario Strbenac
2018-Feb-08 04:00 UTC
[Rd] sparse.model.matrix Generates Non-Existent Factor Levels if Ord.factor Columns Present
Good day,

Sometimes, sparse.model.matrix outputs a dgCMatrix which has column names consisting of factor levels that were not in the original dataset. The first factor appears to be correctly transformed, but the following factors don't. For example:

> diamonds <- as.data.frame(ggplot2::diamonds)
> colnames(sparse.model.matrix(~ . -1, diamonds))
 [1] "carat" "cutFair" "cutGood" "cutVery Good" "cutPremium" "cutIdeal" "color.L" "color.Q" "color.C" "color^4" "color^5"
[12] "color^6" "clarity.L" "clarity.Q" "clarity.C" "clarity^4" "clarity^5" "clarity^6" "clarity^7" "depth" "table" "price"
[23] "x" "y" "z"

The variables color and clarity don't have factor levels which have been suffixed to them in the transformed matrix. The values in those columns are also wrong. Changing the Ord.factor columns into simply being factors fixes the problem.

> diamonds[, "cut"] <- factor(as.character(diamonds[, "cut"]))
> diamonds[, "color"] <- factor(as.character(diamonds[, "color"]))
> diamonds[, "clarity"] <- factor(as.character(diamonds[, "clarity"]))
> colnames(sparse.model.matrix(~ . -1, diamonds)) # No more invented factor levels.
 [1] "carat" "cutFair" "cutGood" "cutIdeal" "cutPremium" "cutVery Good" "colorE" "colorF" "colorG" "colorH"
[11] "colorI" "colorJ" "clarityIF" "claritySI1" "claritySI2" "clarityVS1" "clarityVS2" "clarityVVS1" "clarityVVS2" "depth"
[21] "table" "price" "x" "y" "z"

Can it be made to work correctly for both plain and ordered factors?

> sessionInfo()
R Under development (unstable) (2018-02-06 r74231)
Platform: i386-w64-mingw32/i386 (32-bit)

other attached packages:
[1] Matrix_1.2-12

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2 scales_0.5.0 compiler_3.5.0 lazyeval_0.2.1
 [5] plyr_1.8.4 pillar_1.1.0 gtable_0.2.0 tibble_1.4.2
 [9] Rcpp_0.12.15 ggplot2_2.2.1 grid_3.5.0 rlang_0.1.6
[13] munsell_0.4.3 lattice_0.20-35

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
Ben Bolker
2018-Feb-08 12:51 UTC
[Rd] sparse.model.matrix Generates Non-Existent Factor Levels if Ord.factor Columns Present
color and clarity are ordered factors, so sparse.model.matrix is generating orthogonal-polynomial contrasts (see ?contr.poly). This is by design ... what are you trying to do? Are you interested in fac2sparse?
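The suffixes .L, .Q and .C in the reported column names come from contr.poly, which R applies to ordered factors by default. A minimal sketch with base R only; the factor f and its levels are made up for illustration and are not taken from the diamonds data:

```r
# Hypothetical ordered factor with four levels (illustration only).
f <- factor(c("D", "E", "F", "G"),
            levels = c("D", "E", "F", "G"), ordered = TRUE)

# Default contrasts: treatment coding for plain factors,
# polynomial coding for ordered factors.
getOption("contrasts")

# The polynomial contrast matrix for 4 levels has 3 columns:
# linear, quadratic and cubic.
cp <- contr.poly(4)
colnames(cp)   # ".L" ".Q" ".C"

# The dense model matrix shows the same naming scheme, so the
# suffixes are contrast labels rather than factor levels.
mm <- model.matrix(~ f)
colnames(mm)   # "(Intercept)" "f.L" "f.Q" "f.C"
```

With more than four levels, the higher-order columns are labelled ^4, ^5 and so on, which matches the color^4 ... color^6 names in the original report.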
Dario Strbenac
2018-Feb-09 00:00 UTC
[Rd] sparse.model.matrix Generates Non-Existent Factor Levels if Ord.factor Columns Present
Good day,

The intention is to convert the dataset into a format suitable for the random forest classifier implemented by the CRAN package xgboost. The input data must be transformed into one-hot format using the sparse.model.matrix function, as described in the package's vignette at https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html

I did not know to read the help page for contr.poly after reading the sparse.model.matrix help page. Perhaps a helpful mention could be added to it?

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
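One way to get indicator columns for the use case described above is to drop the ordering before building the matrix, as in the workaround from the first message. The sketch below assumes a small made-up data frame (toy is hypothetical, standing in for diamonds) and uses factor(x, ordered = FALSE), which keeps the existing level order instead of re-sorting the levels alphabetically as factor(as.character(x)) does:

```r
library(Matrix)

# Hypothetical stand-in for the diamonds data (illustration only).
toy <- data.frame(
  carat = c(0.23, 0.31, 0.90, 1.20),
  cut   = factor(c("Fair", "Good", "Ideal", "Good"),
                 levels = c("Fair", "Good", "Ideal"), ordered = TRUE),
  color = factor(c("D", "E", "F", "D"),
                 levels = c("D", "E", "F"), ordered = TRUE)
)

# Convert every ordered factor to a plain factor, keeping level order.
is_ord <- vapply(toy, is.ordered, logical(1))
toy[is_ord] <- lapply(toy[is_ord], factor, ordered = FALSE)

# Plain factors get treatment (indicator) coding, not contr.poly.
X <- sparse.model.matrix(~ . - 1, data = toy)
colnames(X)
```

Note that with `- 1` only the first factor receives a column for every level; later factors still omit their first level, as the second output in the original report shows (there is no colorD column). Whether that is acceptable depends on the downstream model.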