khosoda at med.kobe-u.ac.jp
2011-Aug-17 15:12 UTC
[R] How to use PC1 of PCA and dim1 of MCA as a predictor in logistic regression model for data reduction
Hi all,

I'm trying to do model reduction for logistic regression. I have 13
predictors (4 continuous variables and 9 binary variables). Using subject
matter knowledge, I selected 4 important variables. For the remaining 9
variables, I tried to perform data reduction by principal component
analysis (PCA). However, 8 of the 9 variables are binary and only one is
continuous. I transformed the data with transcan() (in the Hmisc package)
and ran PCA with princomp(). PC1 explained only 20% of the variance.
Still, I used PC1 as a predictor in the logistic model and obtained some
results.

Then I tried multiple correspondence analysis (MCA). The only continuous
variable was age, so I converted the "age" variable to a factor variable
"age_Q" as follows.

> quantile(mydata.df$age)
    0%    25%    50%    75%   100%
 53.00  66.75  72.00  76.25  85.00
> age_Q <- cut(x17.df$age, right=TRUE, breaks=c(-Inf, 66, 72, 76, Inf),
               labels=c("53-66", "67-72", "73-76", "77-85"))
> table(age_Q)
age_Q
53-66 67-72 73-76 77-85
   26    27    25    26

Then I used mjca() of the ca package for MCA.

> mjca1 <- mjca(mydata.df[, c("age_Q", "sex", "symptom", "HT", "DM",
                              "IHD", "smoking", "DL", "Statin")])
> summary(mjca1)

Principal inertias (eigenvalues):

 dim    value      %    cum%   scree plot
  1     0.009592  43.4  43.4   *************************
  2     0.003983  18.0  61.4   **********
  3     0.001047   4.7  66.1   **
  4     0.000367   1.7  67.8
        --------  -----
 Total: 0.022111

Dimension 1 explained 43% of the variance. I then wondered which values I
could use in the same way as PC1 in PCA. I explored mjca1 and found
"rowcoord".

> mjca1$rowcoord
              [,1]          [,2]        [,3]         [,4]
  [1,]  0.07403748  0.8963482181  0.10828273  1.581381849
  [2,]  0.92433996 -1.1497911361  1.28872517  0.304065865
  [3,]  0.49833354  0.6482940556 -2.11114314  0.365023261
  [4,]  0.18998290 -1.4028117048 -1.70962159  0.451951744
  [5,] -0.13008173  0.2557656854  1.16561601 -1.012992485
  .........................................................
  .........................................................
[101,] -1.86940216  0.5918128751  0.87352987 -1.118865117
[102,] -2.19096615  1.2845448725  0.25227354 -0.938612155
[103,]  0.77981265 -1.1931087587  0.23934034  0.627601413
[104,] -2.37058237 -1.4014005013 -0.73578248 -1.455055095

Then I used mjca1$rowcoord[, 1] as follows.

> mydata.df$NewScore <- mjca1$rowcoord[, 1]

I used this "NewScore" as one of the predictors in the model instead of
the original 9 variables. The final logistic model obtained with MCA was
similar to the one obtained with PCA.

My questions are:

1. Is it O.K. to perform PCA on data consisting of 1 continuous variable
and 8 binary variables?

2. Is it O.K. to convert age from a continuous variable to a factor
variable for MCA?

3. Are the values in mjca1$rowcoord[, 1] the right ones to use as a
predictor in the logistic regression model, in the same way as PC1 of PCA?

I would appreciate your help in advance.

--
Kohkichi Hosoda
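[Editor's note: a minimal sketch of the workflow described above, showing the
two routes side by side. The outcome variable and the four clinically selected
predictors are not named in the post, so "outcome" and var1-var4 below are
hypothetical placeholders; transcan() is taken from the Hmisc package.]

library(Hmisc)   # transcan()
library(ca)      # mjca()

## PCA route: transform the 9 mixed predictors, keep the first component
tc <- transcan(~ age + sex + symptom + HT + DM + IHD + smoking + DL + Statin,
               data = mydata.df, transformed = TRUE, pl = FALSE)
pc <- princomp(tc$transformed, cor = TRUE)
mydata.df$PC1 <- pc$scores[, 1]

## MCA route: all-categorical coding (age_Q instead of age), first dimension
mjca1 <- mjca(mydata.df[, c("age_Q", "sex", "symptom", "HT", "DM",
                            "IHD", "smoking", "DL", "Statin")])
mydata.df$NewScore <- mjca1$rowcoord[, 1]

## Either score then enters the logistic model with the 4 retained predictors
## ("outcome" and var1..var4 are hypothetical names)
fit.pca <- glm(outcome ~ var1 + var2 + var3 + var4 + PC1,
               data = mydata.df, family = binomial)
fit.mca <- glm(outcome ~ var1 + var2 + var3 + var4 + NewScore,
               data = mydata.df, family = binomial)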
Daniel Malter
2011-Aug-18 08:39 UTC
[R] How to use PC1 of PCA and dim1 of MCA as a predictor in logistic regression model for data reduction
Pooling nominal and numeric variables and running PCA on them sounds like
conceptual nonsense to me. You use PCA to reduce the dimensionality of
numeric data; for categorical data you should use latent class analysis or
something along those lines.

The fact that your first PC captures only 20 percent of the variance
indicates either that you are applying the wrong technique or that
dimensionality reduction is of little use for these data more generally.
The first step should generally be to check the correlations/associations
between the variables, to see whether what you intend to do makes sense.

HTH,
Daniel

khosoda wrote:
>
> [...]
>
> My questions are:
>
> 1. Is it O.K. to perform PCA on data consisting of 1 continuous variable
> and 8 binary variables?
>
> 2. Is it O.K. to convert age from a continuous variable to a factor
> variable for MCA?
>
> 3. Are the values in mjca1$rowcoord[, 1] the right ones to use as a
> predictor in the logistic regression model, in the same way as PC1 of PCA?
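[Editor's note: a small sketch of that first inspection step, assuming the
nine candidate predictors are the columns used in the original post. The
data-frame name and the choice of similarity measure are assumptions, not
part of Daniel's reply.]

library(Hmisc)   # varclus()

## Cluster the candidate predictors by pairwise Spearman similarity;
## variables that cluster tightly are natural candidates for combining.
v <- varclus(~ age + sex + symptom + HT + DM + IHD + smoking + DL + Statin,
             data = mydata.df, similarity = "spearman")
plot(v)

## Association between any single pair of binary factors, e.g. smoking and IHD
with(mydata.df, chisq.test(table(smoking, IHD)))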
Mark Difford
2011-Aug-18 09:33 UTC
[R] How to use PC1 of PCA and dim1 of MCA as a predictor in logistic regression model for data reduction
On Aug 17, 2011 khosoda wrote:

> 1. Is it O.K. to perform PCA on data consisting of 1 continuous variable
> and 8 binary variables?
> 2. Is it O.K. to convert age from a continuous variable to a factor
> variable for MCA?
> 3. Are the values in mjca1$rowcoord[, 1] the right ones to use as a
> predictor in the logistic regression model, in the same way as PC1 of PCA?

Hi Kohkichi,

If you want to do this, i.e. a PCA-type analysis with different variable
types, then look at dudi.mix() in package ade4 and homals() in package
homals.

Regards, Mark.

-----
Mark Difford (Ph.D.)
Research Associate
Botany Department
Nelson Mandela Metropolitan University
Port Elizabeth, South Africa
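[Editor's note: a minimal sketch of the two suggested calls, assuming the
predictors sit in mydata.df with age numeric and the other eight coded as
factors. Only the two function names come from Mark's reply; everything
else is an assumption.]

library(ade4)    # dudi.mix(): duality-diagram analysis of mixed variable types
library(homals)  # homals(): Gifi-style homogeneity analysis

mixvars <- mydata.df[, c("age", "sex", "symptom", "HT", "DM",
                         "IHD", "smoking", "DL", "Statin")]

## dudi.mix: keep two axes without the interactive scree-plot prompt
mix1 <- dudi.mix(mixvars, scannf = FALSE, nf = 2)
mix1$eig / sum(mix1$eig)     # proportion of variance per axis

## homals on the all-categorical coding (age_Q instead of age); the default
## treats every variable as nominal
hom1 <- homals(mydata.df[, c("age_Q", "sex", "symptom", "HT", "DM",
                             "IHD", "smoking", "DL", "Statin")], ndim = 2)
head(hom1$objscores)         # first column plays the role of PC1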
khosoda at med.kobe-u.ac.jp
2011-Aug-18 16:28 UTC
[R] How to use PC1 of PCA and dim1 of MCA as a predictor in logistic regression model for data reduction
Dear Mark,

Thank you very much for your mail. This is what I really wanted!

I tried dudi.mix() in the ade4 package.

> ade4plaque.df <- x18.df[c("age", "sex", "symptom", "HT", "DM", "IHD",
                            "smoking", "DL", "Statin")]
> head(ade4plaque.df)
  age sex      symptom       HT       DM      IHD  smoking hyperlipidemia   Statin
1  62   M asymptomatic positive negative negative positive       positive positive
2  82   M  symptomatic positive negative negative negative       positive positive
3  64   M asymptomatic negative positive negative negative       positive positive
4  55   M  symptomatic positive positive positive negative       positive positive
5  67   M  symptomatic positive negative negative negative       negative positive
6  79   M asymptomatic positive positive negative negative       positive positive
> x18.dudi.mix <- dudi.mix(ade4plaque.df)
> x18.dudi.mix$eig
[1] 1.7750557 1.4504641 1.2178640 1.0344946 0.8496640 0.8248379 0.7011151 0.6367328 0.5097718
> x18.dudi.mix$eig[1:9]/sum(x18.dudi.mix$eig)
[1] 0.19722841 0.16116268 0.13531822 0.11494385 0.09440711 0.09164866 0.07790168 0.07074809 0.05664131

So the first component still explains only 19.8% of the variance, right?

Then I looked for the values in the dudi.mix result that correspond to PC1
of PCA. The help file says:

  l1   principal components, data frame with n rows and nf columns
  li   row coordinates, data frame with n rows and nf columns

So I guess I should use x18.dudi.mix$l1[, 1]. Am I right? Or should I use
multiple correspondence analysis instead, because its first dimension
explained 43% of the variance?

Thank you for your help in advance.

Kohkichi

(11/08/18 18:33), Mark Difford wrote:
> Hi Kohkichi,
>
> If you want to do this, i.e. a PCA-type analysis with different variable
> types, then look at dudi.mix() in package ade4 and homals() in package
> homals.
>
> Regards, Mark.

--
Kohkichi Hosoda
Phone: 078-382-5966   Fax: 078-382-5979
E-mail (office): khosoda at med.kobe-u.ac.jp
E-mail (home):   khosoda at venus.dti.ne.jp
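[Editor's note on the l1/li question, as a hedged sketch: in ade4's dudi
objects the row coordinates ($li) and the normed row scores ($l1) differ only
by a constant rescaling of each axis, so whichever first column is chosen, the
fitted logistic model is the same apart from the scale of that one
coefficient. The outcome and the four retained covariates are again
hypothetical names.]

## first-axis row coordinates as a single summary predictor
x18.df$MixScore <- x18.dudi.mix$li[, 1]

## "outcome" and var1..var4 are hypothetical placeholders for the response
## and the four predictors chosen from subject-matter knowledge
fit.mix <- glm(outcome ~ var1 + var2 + var3 + var4 + MixScore,
               data = x18.df, family = binomial)
summary(fit.mix)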
khosoda at med.kobe-u.ac.jp
2011-Aug-20 13:15 UTC
[R] How to use PC1 of PCA and dim1 of MCA as a predictor in logistic regression model for data reduction
Dear Mark,

Thank you very much for your advice. I will try it. I really appreciate
all your kind advice. Thanks a lot again.

Best regards,
Kohkichi

(11/08/19 22:28), Mark Difford wrote:
> On Aug 19, 2011 khosoda wrote:
>
>> I used x10.homals4$objscores[, 1] as a predictor for logistic regression
>> in the same way as PC1 in PCA.
>> Am I going the right way?
>
> Hi Kohkichi,
>
> Yes, but maybe explore the "sets=" argument (set "Response" as the target
> variable and the others as the predictor variables). Then use the Dim1
> scores. Also think about fitting a rank-1 restricted model, combined with
> the sets= option.
>
> See the vignette to the package and look at
>
> @ARTICLE{MIC98,
>   author  = {Michailidis, G. and de Leeuw, J.},
>   title   = {The {G}ifi system of descriptive multivariate analysis},
>   journal = {Statistical Science},
>   year    = {1998},
>   volume  = {13},
>   pages   = {307--336},
> }
>
> Regards, Mark.
>
> -----
> Mark Difford (Ph.D.)
> Research Associate
> Botany Department
> Nelson Mandela Metropolitan University
> Port Elizabeth, South Africa
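[Editor's note: a hedged sketch of the quoted suggestion, i.e. homals() with a
sets= partition and a rank-1 restriction. The column layout, the name of the
target variable ("Response"), and the other covariate names are assumptions,
not stated anywhere in the thread.]

library(homals)

dat <- mydata.df[, c("Response", "age_Q", "sex", "symptom", "HT", "DM",
                     "IHD", "smoking", "DL", "Statin")]

## sets: the target variable in its own set, the 9 predictors in a second set;
## rank = 1 gives the rank-1 restricted model mentioned above
hom.sets <- homals(dat, ndim = 2, rank = 1, sets = list(1, 2:10))

## Dim1 object scores, used in place of the 9 original variables;
## var1..var4 are the 4 clinically chosen predictors (hypothetical names)
mydata.df$Dim1 <- hom.sets$objscores[, 1]
fit <- glm(Response ~ var1 + var2 + var3 + var4 + Dim1,
           data = mydata.df, family = binomial)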