Leonardo Ferreira Fontenelle
2016-Apr-30 02:40 UTC
[R] Unexpected scores from weighted PCA with svyprcomp()
Hello! I'd like to create an assets-based economic indicator using data from a national household survey. The economic indicator is to be the first principal component from a principal components analysis, which (given the source of the data) I believe should take in consideration the sampling weights of the observations. After running the PCA with svyprcomp(), from the survey package, I wanted to list the loading (with regard to the first principal component) and the scale of the variables, so that I can tell people how to "reconstitute" the economic indicator from the variables without any knowledge of PCA. This reconstituted indicator wouldn't be centered, but that's OK because the important thing for the application is the relative position of the observations. The unexpected (at least for me) behavior was that the principal component returned by svyprcomp() was very different from from the reconstituted indicator as well as from the indicator returned by predict(). "Different" here means weak correlation and different distributions. I hope the following code illustrates what I mean: ==== svycor <- function(formula, design) { # https://stat.ethz.ch/pipermail/r-help/2003-July/036645.html stopifnot(require(survey)) covariance.matrix <- svyvar(formula, design) variables <- diag(covariance.matrix) correlation.matrix <- covariance.matrix / sqrt(variables %*% t(variables)) return(correlation.matrix) } library(survey) data(api) dclus2 <- svydesign(ids = ~ dnum + snum, fpc = ~ fpc1 + fpc2, data apiclus2) pc <- svyprcomp( ~ api99 + api00, design = dclus2, scale = TRUE, scores = TRUE) dclus2$variables$pc1 <- pc$x[, "PC1"] dclus2$variables$pc2 <- predict(pc, apiclus2)[, "PC1"] mycoef <- pc$rotation[, "PC1"] / pc$scale dclus2$variables$pc3 <- with(apiclus2, api99 * mycoef["api99"] + api00 * mycoef["api00"]) svycor(~ pc1 + pc2 + pc3, dclus2)[, ] # pc1 pc2 pc3 # pc1 1.0000000 0.2078789 0.2078789 # pc2 0.2078789 1.0000000 1.0000000 # pc3 0.2078789 1.0000000 1.0000000 plot(svysmooth(~ pc1, dclus2), xlim = c(-2.5, 5), ylim = 0:1) lines(svysmooth(~ pc2, dclus2), col = 2) lines(svysmooth(~ pc3, dclus2), col = 3) legend("topright", legend = c('pc$x[, "PC1"]', 'predict(pc, apiclus2)[, "PC1"]', 'Reconstituted indicator'), col = 1:3, lty = 1) sessionInfo() # R version 3.2.4 Revised (2016-03-16 r70336) # Platform: x86_64-pc-linux-gnu (64-bit) # Running under: Arch Linux # # locale: # [1] LC_CTYPE=pt_BR.UTF-8 LC_NUMERIC=C # [3] LC_TIME=pt_BR.UTF-8 LC_COLLATE=pt_BR.UTF-8 # [5] LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=pt_BR.UTF-8 # [7] LC_PAPER=pt_BR.UTF-8 LC_NAME=C # [9] LC_ADDRESS=C LC_TELEPHONE=C # [11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C # # attached base packages: # [1] grid stats graphics utils datasets grDevices # [7] methods base # # other attached packages: # [1] KernSmooth_2.23-15 survey_3.30-3 # # loaded via a namespace (and not attached): # [1] tools_3.2.4 ==== This lack of correlation doesn't happen if the survey design object has uniform sampling weights or if the the data is analyzed with prcomp(). Why does the returned principal component is so different from the predicted and the reconstituted ones? Are predict() and my "reconstitution" missing something? Are the three methods equally valid but with different interpretations? Is there a bug in svyprcomp() ?? Thanks in advance, Leonardo Ferreira Fontenelle http://lattes.cnpq.br/9234772336296638