Hello,
I have been toying with the survey package's withReplicates function, which
lets users easily extend the survey package to support any weighted statistic.
There are a number of ML algorithms in various packages that accept weights, and
it is fairly easy to use them with withReplicates. Below is a naive example:
library(survey)
library(rpart)
library(gbm)
data(api)
# create survey object and convert it to a replicate-weights design
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
rstrat <- as.svrepdesign(dstrat)
# try rpart: refit on each set of replicate weights, return fitted values
predr <- as.data.frame(withReplicates(rstrat, function(w, data) {
  predict(rpart(api00 ~ ell + meals + mobility, data = data, weights = w))
}))
# try gbm
predg <- as.data.frame(withReplicates(rstrat, function(w, data) {
  predict(gbm(api00 ~ ell + meals + mobility, data = data, weights = w,
              n.trees = 100))
}))
# try regular svyglm
preds <- as.data.frame(predict(svyglm(api00 ~ ell + meals + mobility, rstrat)))
# compare predictions and standard errors side by side
head(data.frame(predr, predg, preds))
With rpart, the standard errors are absurdly large, and clearly incorrect. With
gbm, the results seem reasonable.
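To look at where the large rpart standard errors come from, one option is to
keep the per-replicate predictions rather than only the summary. The following
is a minimal sketch, assuming the return.replicates argument of withReplicates
and the same design as above; the names res and reps are only illustrative.

```r
library(survey)
library(rpart)
data(api)

# same design as in the example above
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)
rstrat <- as.svrepdesign(dstrat)

# keep the prediction vector from every replicate, not just the SE summary
res <- withReplicates(rstrat, function(w, data) {
  predict(rpart(api00 ~ ell + meals + mobility, data = data, weights = w))
}, return.replicates = TRUE)

# res$replicates: one row per replicate, one column per observation
reps <- res$replicates
dim(reps)

# range of replicate predictions for the first few observations
apply(reps[, 1:5], 2, range)
```

If the ranges are wide for only some observations, that would suggest that a
few replicates produce a different tree, rather than every replicate shifting
the predictions slightly.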
I see in this extremely old post that you can't use quantile regression with
withReplicates for some survey designs and expect to get reasonable results:
https://stat.ethz.ch/pipermail/r-help/2008-August/171620.html
Quantiles and survey stats are messy business, so that issue may be unique to
quantile regression, but based on that post it would seem that both the function
and the survey design need to have certain properties for withReplicates to
generate valid SEs. This is not covered in the withReplicates documentation, though.
So my question is: what properties do an ML algorithm and a survey design need
for withReplicates to generate valid SEs?
Kind Regards,
Carl Ganz