Jari Oksanen
2012-May-23 10:49 UTC
[Rd] prcomp with previously scaled data: predict with 'newdata' wrong
Hello folks,

It may be regarded as a user error to scale() your data prior to prcomp() instead of using its 'scale.' argument. However, it is something a user may well do, and it sounds like a legitimate thing to do, but in that case predict() with 'newdata' can give wrong results:

    x <- scale(USArrests)
    sol <- prcomp(x)
    all.equal(predict(sol), predict(sol, newdata=x))
    ## [1] "Mean relative difference: 0.9033485"

Predicting with the same data gives results that differ from the original PCA of the data.

The reason for this behaviour seems to be in these first lines of stats:::prcomp.default():

    x <- scale(x, center = center, scale = scale.)
    cen <- attr(x, "scaled:center")
    sc <- attr(x, "scaled:scale")

If the input data 'x' have a 'scaled:scale' attribute, it will be retained if scale() is called with argument "scale = FALSE", as is the case with the default options in prcomp(). So scale(scale(x, scale = TRUE), scale = FALSE) will have the 'scaled:center' of the outer scale() (i.e., numerical zero), but the 'scaled:scale' of the inner scale().

Function princomp finds the 'scale' directly instead of looking at the attributes of the input data, and works as expected:

    sol <- princomp(x)
    all.equal(predict(sol), predict(sol, newdata=x))
    ## [1] TRUE

I don't have any nifty solution to this -- only checking the 'scale.' argument and acting accordingly:

    sc <- if (scale.) attr(x, "scaled:scale") else FALSE

Cheers, Jari Oksanen
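
The attribute retention described in the message can be inspected directly, without involving prcomp() at all. The following is only a minimal sketch of that check; the variable names 'inner_scale' and 'y' are illustrative and do not come from the original post, and what is printed may depend on the R version in use.

```r
# Standardise the data; scale() records what it subtracted and divided by.
x <- scale(USArrests)
inner_scale <- attr(x, "scaled:scale")

# Centre again without rescaling, which is the call prcomp() makes
# with its default scale. = FALSE.
y <- scale(x, center = TRUE, scale = FALSE)

# 'scaled:center' is recomputed from the already centred data.
print(attr(y, "scaled:center"))

# 'scaled:scale' is whatever attribute 'y' still carries; compare it
# with the value stored by the first scale() call.
print(attr(y, "scaled:scale"))
print(identical(attr(y, "scaled:scale"), inner_scale))
```

If the last line prints TRUE, the scaling recorded on 'y' is the one from the first call, which is the situation the message describes.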
Jari Oksanen
2012-May-23 11:02 UTC
[Rd] prcomp with previously scaled data: predict with 'newdata' wrong
To correct myself: the stupid solution I suggested won't work, as 'scale.' need not be TRUE or FALSE but can also be a vector of scales. The following looks able to handle this, but is neither transparent nor elegant:

    sc <- if (isTRUE(scale.)) attr(x, "scaled:scale") else scale.

I trust you will find an elegant solution (if you think this is worth fixing).

Cheers, Jari Oksanen

PS. Sorry for the top posting: I cannot help it with the email system I have on my work desktop.
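
As a rough illustration of the revised suggestion, the one-line rule can be wrapped in a small function and tried on the three kinds of 'scale.' value that were mentioned. This is only a sketch: 'choose_scale', 'raw', 'pre' and the weights 'w' are hypothetical names made up for the example, and nothing here is the code actually used in stats.

```r
# Hypothetical helper (not part of stats): decide which scaling to store,
# following the rule suggested in the follow-up message.
choose_scale <- function(x_scaled, scale.) {
  if (isTRUE(scale.)) attr(x_scaled, "scaled:scale") else scale.
}

raw <- as.matrix(USArrests)
pre <- scale(raw)

# Case 1: pre-scaled input, scale. = FALSE. Any attribute left on the
# input is ignored and FALSE is stored.
z1 <- scale(pre, center = TRUE, scale = FALSE)
sc1 <- choose_scale(z1, FALSE)

# Case 2: scale. = TRUE. The attribute set by this scale() call is used.
z2 <- scale(raw, center = TRUE, scale = TRUE)
sc2 <- choose_scale(z2, TRUE)

# Case 3: scale. is a numeric vector with one (arbitrary) value per column.
w <- c(2, 4, 6, 8)
z3 <- scale(raw, center = TRUE, scale = w)
sc3 <- choose_scale(z3, w)

print(sc1)
print(sc2)
print(sc3)
```

The point of using isTRUE() is that it returns a single logical for any input, so a numeric 'scale.' falls through to the else branch instead of being tested as a condition.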