Dear all,

here is my data, called "rglp":

structure(list(vzorek = structure(1:17, .Label = c("179/1/1",
"179/2/1", "180/1", "181/1", "182/1", "183/1", "184/1", "185/1",
"186/1", "187/1", "188/1", "189/1", "190/1", "191/1", "192/1",
"R310", "R610L"), class = "factor"), iep = c(7.51, 7.79, 5.14,
6.35, 5.82, 7.13, 5.95, 7.27, 6.29, 7.5, 7.3, 7.27, 6.46, 6.95,
6.32, 6.32, 6.34), skupina = c(7.34, 7.34, 5.14, 6.23, 6.23,
7.34, 6.23, 7.34, 6.23, 7.34, 7.34, 7.34, 6.23, 7.34, 6.23, 6.23,
6.23), sio2 = c(0.023, 0.011, 0.88, 0.028, 0.031, 0.029, 0.863,
0.898, 0.95, 0.913, 0.933, 0.888, 0.922, 0.882, 0.923, 1, 1),
p2o5 = c(0.78, 0.784, 1.834, 1.906, 1.915, 0.806, 1.863,
0.775, 0.817, 0.742, 0.783, 0.759, 0.787, 0.758, 0.783, 3,
2), al2o3 = c(5.812, 5.819, 3.938, 5.621, 3.928, 3.901, 5.621,
5.828, 4.038, 5.657, 3.993, 5.735, 4.002, 5.728, 4.042, 6,
5), dus = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("ano", "ne"), class =
"factor")), .Names = c("vzorek",
"iep", "skupina", "sio2", "p2o5", "al2o3", "dus"), class = "data.frame",
row.names = c(NA,
-17L))

I am trying to do a principal component analysis. Here is one without scaling:

fit<-prcomp(~iep+sio2+al2o3+p2o5+as.numeric(dus), data=rglp, factors=2)
biplot(fit, choices=2:3,xlabs=rglp$vzorek, cex=.8)

You can see that the data form 3 groups according to the variables sio2 and dus, which seems reasonable, as the lowest group has a different value of dus ("ano") while the highest group has a low value of sio2.

But when I do the same with scale=T

fit<-prcomp(~iep+sio2+al2o3+p2o5+as.numeric(dus), data=rglp, factors=2, scale=T)
biplot(fit, choices=2:3,xlabs=rglp$vzorek, cex=.8)

I get a completely different picture, which cannot be interpreted in such an easy way.

So can anybody advise me whether I should follow the recommendation from the help page (which says "The default is FALSE for consistency with S, but in general scaling is advisable.") or whether I should stay with scale = FALSE and the simply interpretable result?

Thank you. Petr Pikal petr.pikal at precheza.cz
Duncan Murdoch
2009-Aug-19 12:49 UTC
[R] scale or not to scale that is the question - prcomp
On 19/08/2009 8:31 AM, Petr PIKAL wrote:
> <snip>
>
> So can anybody advise me whether I should follow the recommendation from
> the help page (which says
> The default is FALSE for consistency with S, but in general scaling is
> advisable.
> or whether I should stay with scale = FALSE and the simply interpretable
> result?

I would say the answer depends on the meaning of the variables. In the unusual case that they are measured in dimensionless units, it might make sense not to scale. But if you are using arbitrary units of measurement, do you want your answer to depend on them? For example, if you change from kg to mg, the numbers will become much larger, the variable will contribute much more variance, and it will become a more important part of the largest principal component. Is that sensible?

Duncan Murdoch
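Duncan's point about units can be illustrated with a minimal sketch. The data below are simulated toy values (a hypothetical mass and length, not the rglp data), and the variable names are assumptions made for the illustration only.

```r
# Toy data (hypothetical, not the rglp data): a mass in kg and a length in m.
set.seed(1)
d_kg <- data.frame(mass = rnorm(50, mean = 2, sd = 0.5),
                   len  = rnorm(50, mean = 1, sd = 0.5))

# The same measurements with the mass expressed in mg (1 kg = 1e6 mg).
d_mg <- transform(d_kg, mass = mass * 1e6)

# Without scaling, the variable with the larger numbers dominates PC1.
raw_kg <- prcomp(d_kg)
raw_mg <- prcomp(d_mg)
print(raw_kg$rotation)
print(raw_mg$rotation)   # PC1 is now essentially the mass axis

# With scaling, the choice of unit no longer affects the result.
sc_kg <- prcomp(d_kg, scale. = TRUE)
sc_mg <- prcomp(d_mg, scale. = TRUE)
print(all.equal(sc_kg$sdev, sc_mg$sdev))
```

Comparing the two unscaled rotation matrices shows how a pure change of unit reorients the components, while the scaled fits agree.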
Thank you.

Duncan Murdoch <murdoch at stats.uwo.ca> wrote on 19.08.2009 14:49:52:

> On 19/08/2009 8:31 AM, Petr PIKAL wrote:
> > Dear all
>
> <snip>
>
> For example, if you change from kg to mg, the numbers will become much
> larger, the variable will contribute much more variance, and it will
> become a more important part of the largest principal component. Is that
> sensible?

Basically, the variables are percentages (all between 0 and 6%) except dus, which is present or not present (for the purpose of prcomp transformed to a 1/2 code by as.numeric :). The only variable that is not a percentage is iep, which is basically in the range 5-8. So the ranges of all variables are quite similar.

What surprises me is that I can interpret the biplot without scaling in terms of the variables used, while the biplot with scaling is totally different and the two pictures do not match at all. I would have expected just a small difference between the results from the two settings, as all the numbers are quite comparable and do not differ much.

Best regards
Petr
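Whether "similar ranges" also means "similar variances" can be checked directly. The following is a minimal sketch using the data from the original post; the columns vzorek and skupina are left out because they do not enter the PCA, and the names `X` and `sds` are introduced here only for illustration. Note that as.numeric() on a factor returns the level codes, so dus becomes 1 ("ano") and 2 ("ne"); since prcomp centres the data, this is equivalent to a 0/1 coding.

```r
# Data from the original post (vzorek and skupina omitted; not used in the PCA).
rglp <- data.frame(
  iep   = c(7.51, 7.79, 5.14, 6.35, 5.82, 7.13, 5.95, 7.27, 6.29,
            7.5, 7.3, 7.27, 6.46, 6.95, 6.32, 6.32, 6.34),
  sio2  = c(0.023, 0.011, 0.88, 0.028, 0.031, 0.029, 0.863, 0.898, 0.95,
            0.913, 0.933, 0.888, 0.922, 0.882, 0.923, 1, 1),
  p2o5  = c(0.78, 0.784, 1.834, 1.906, 1.915, 0.806, 1.863, 0.775, 0.817,
            0.742, 0.783, 0.759, 0.787, 0.758, 0.783, 3, 2),
  al2o3 = c(5.812, 5.819, 3.938, 5.621, 3.928, 3.901, 5.621, 5.828, 4.038,
            5.657, 3.993, 5.735, 4.002, 5.728, 4.042, 6, 5),
  dus   = factor(rep(c("ne", "ano", "ne", "ano"), times = c(9, 4, 2, 2)))
)

# Numeric matrix with the same columns as in the prcomp() formula.
X <- with(rglp, cbind(iep, sio2, al2o3, p2o5, dus = as.numeric(dus)))

print(apply(X, 2, range))            # observed range of each variable
sds <- apply(X, 2, sd)               # per-variable standard deviations
print(round(sds, 3))
print(round(sds^2 / sum(sds^2), 3))  # share of the total variance per variable
```

If the variance shares turn out to be unequal, the unscaled PCA weights the variables unequally even though their ranges look comparable.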
Duncan Murdoch
2009-Aug-19 13:25 UTC
[R] scale or not to scale that is the question - prcomp
On 19/08/2009 9:02 AM, Petr PIKAL wrote:
> <snip>
>
> What surprises me is that I can interpret the biplot without scaling in
> terms of the variables used, while the biplot with scaling is totally
> different and the two pictures do not match at all.

If you look at the standard deviations in the two cases, I think you can see why this happens:

Scaled:

Standard deviations:
[1] 1.3335175 1.2311551 1.0583667 0.7258295 0.2429397

Not Scaled:

Standard deviations:
[1] 1.0030048 0.8400923 0.5679976 0.3845088 0.1531582

The first two sds are close, so small changes to the data will affect their direction a lot. Your biplots look at the 2nd and 3rd components.

Duncan Murdoch
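The standard deviations Duncan quotes can presumably be reproduced along the following lines. This is a sketch under two assumptions: the numeric matrix `X` built here should be equivalent to the formula interface used in the original post, and the `factors = 2` argument is dropped because it is not a documented argument of prcomp. The names `fit_raw` and `fit_sc` are introduced only for this illustration.

```r
# Data from the original post (vzorek and skupina omitted; not used in the PCA).
rglp <- data.frame(
  iep   = c(7.51, 7.79, 5.14, 6.35, 5.82, 7.13, 5.95, 7.27, 6.29,
            7.5, 7.3, 7.27, 6.46, 6.95, 6.32, 6.32, 6.34),
  sio2  = c(0.023, 0.011, 0.88, 0.028, 0.031, 0.029, 0.863, 0.898, 0.95,
            0.913, 0.933, 0.888, 0.922, 0.882, 0.923, 1, 1),
  p2o5  = c(0.78, 0.784, 1.834, 1.906, 1.915, 0.806, 1.863, 0.775, 0.817,
            0.742, 0.783, 0.759, 0.787, 0.758, 0.783, 3, 2),
  al2o3 = c(5.812, 5.819, 3.938, 5.621, 3.928, 3.901, 5.621, 5.828, 4.038,
            5.657, 3.993, 5.735, 4.002, 5.728, 4.042, 6, 5),
  dus   = factor(rep(c("ne", "ano", "ne", "ano"), times = c(9, 4, 2, 2)))
)
X <- with(rglp, cbind(iep, sio2, al2o3, p2o5, dus = as.numeric(dus)))

fit_raw <- prcomp(X)                 # centred only
fit_sc  <- prcomp(X, scale. = TRUE)  # centred and scaled to unit variance

print(fit_sc$sdev)
print(fit_raw$sdev)

# Ratio of each standard deviation to the next one. Values near 1 mean the
# two neighbouring components are poorly separated, so their directions
# can change a lot under small changes to the data.
print(round(fit_sc$sdev[-5] / fit_sc$sdev[-1], 2))
print(round(fit_raw$sdev[-5] / fit_raw$sdev[-1], 2))
```

In the scaled fit the squared standard deviations sum to the number of variables (5), which gives a quick sanity check on the output.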
Duncan Murdoch <murdoch at stats.uwo.ca> wrote on 19.08.2009 15:25:00:

> <snip>
>
> The first two sds are close, so small changes to the data will affect
> their direction a lot.

I see. But I would expect that the changes made to the data by scaling would not alter it so much that the unscaled and scaled results are completely different.

> Your biplots look at the 2nd and 3rd components.

Yes, because the grouping in the biplot of the 2nd and 3rd components can easily be explained by the values of some variables (without scaling).

I must admit that I do not use prcomp very often, and usually scaling gives me an "explainable" result, especially if I use it for "variable reduction". Therefore I am reluctant to use it in this case.

When I try a "more standard" way

> fit<-lm(iep~sio2+al2o3+p2o5+as.numeric(dus), data=rglp)
> summary(fit)

Call:
lm(formula = iep ~ sio2 + al2o3 + p2o5 + as.numeric(dus), data = rglp)

Residuals:
     Min       1Q   Median       3Q      Max
-0.41751 -0.15568 -0.03613  0.20124  0.43046

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      7.12085    0.62257  11.438 8.24e-08 ***
sio2            -0.67250    0.20953  -3.210 0.007498 **
al2o3            0.40534    0.08641   4.691 0.000522 ***
p2o5            -0.76909    0.11103  -6.927 1.59e-05 ***
as.numeric(dus) -0.64020    0.18101  -3.537 0.004094 **

I get a quite plausible result which can be interpreted without problems.

My data are the result of a designed experiment (more or less :) and therefore all variables are significant. Is that the reason why scaling may be inappropriate in this case?

Regards
Petr Pikal
Scaling changes the metric, i.e. which things are close to each other. There is no reason to expect the picture to look the same when you change the metric.

On the other hand, your two pictures don't look so different to me. It appears that the scaled plot is similar to the unscaled plot, with the roles of the second and third PC reversed, i.e. the scaled plot is similar but rotated and distorted. For example, the observations forming the strip across the bottom of the first plot form a vertical strip on the right-hand side of the second plot.

albyn

On Wed, Aug 19, 2009 at 02:31:23PM +0200, Petr PIKAL wrote:
> <snip>
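One way to check albyn's reading without relying on the plots is to correlate the component scores from the two fits. The sketch below is only one possible check, not something proposed in the thread; it assumes the numeric matrix `X` is equivalent to the formula interface of the original post, and the names `fit_raw`, `fit_sc` and `cc` are hypothetical.

```r
# Data from the original post (vzorek and skupina omitted; not used in the PCA).
rglp <- data.frame(
  iep   = c(7.51, 7.79, 5.14, 6.35, 5.82, 7.13, 5.95, 7.27, 6.29,
            7.5, 7.3, 7.27, 6.46, 6.95, 6.32, 6.32, 6.34),
  sio2  = c(0.023, 0.011, 0.88, 0.028, 0.031, 0.029, 0.863, 0.898, 0.95,
            0.913, 0.933, 0.888, 0.922, 0.882, 0.923, 1, 1),
  p2o5  = c(0.78, 0.784, 1.834, 1.906, 1.915, 0.806, 1.863, 0.775, 0.817,
            0.742, 0.783, 0.759, 0.787, 0.758, 0.783, 3, 2),
  al2o3 = c(5.812, 5.819, 3.938, 5.621, 3.928, 3.901, 5.621, 5.828, 4.038,
            5.657, 3.993, 5.735, 4.002, 5.728, 4.042, 6, 5),
  dus   = factor(rep(c("ne", "ano", "ne", "ano"), times = c(9, 4, 2, 2)))
)
X <- with(rglp, cbind(iep, sio2, al2o3, p2o5, dus = as.numeric(dus)))

fit_raw <- prcomp(X)                 # unscaled fit
fit_sc  <- prcomp(X, scale. = TRUE)  # scaled fit

# Correlations between the unscaled scores (rows) and the scaled scores
# (columns) for components 2 and 3. The sign of a principal component is
# arbitrary, so only the magnitudes are informative.
cc <- cor(fit_raw$x[, 2:3], fit_sc$x[, 2:3])
dimnames(cc) <- list(unscaled = c("PC2", "PC3"), scaled = c("PC2", "PC3"))
print(round(cc, 2))
```

Large off-diagonal magnitudes would support the idea that the second and third components have swapped roles; large diagonal magnitudes would argue against it.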