On Mon, 5 Jun 2006, Mark Hempelmann wrote:
> Dear WizaRds,
>
> I am struggling with the use of twophase in package survey. My goal
> is to compute a simple example in two phase sampling:
>
> phase 1: I sample n1=1000 circuit boards and find 80 non functional
> phase 2: Given the n1=1000 sample I sample n2=100 and find 15 non
> functional. Let's say, phase 2 shows this result together with phase 1:
>                      phase 1
>                      ok   defunct   sum
> phase 2  ok          85      0       85
>          defunct      5     10       15
>          sum         90     10      100
>
> That is in R:
> fail <- data.frame(id=1:1000 , x=c(rep(0,920), rep(1,80)),
> y=c(rep(0,985), rep(1,15)), n1=rep(1000,1000), n2=rep(100,1000),
> N=rep(5000,1000))
>
> des.fail <- twophase(id=list(~id,~id), data=fail, subset=~I(x==1))
> # fpc=list(~n1,~n2)
The second-phase sample is described by subset=~I(x==1), so you have
sampled only 80 in phase two, not 100.
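
As a rough sketch of one way to describe a phase two of 100 boards: mark the
sampled boards with a logical column and pass that column to subset=. The
names fail2 and in2 are made up for illustration. The cell counts come from
your table, the row positions are arbitrary, and y is left NA for boards that
were never in phase two. The twophase() call is skipped if the survey package
is not installed.

```r
# Phase 1: 1000 boards, 80 defective on the quick check (x).
x <- c(rep(0, 920), rep(1, 80))

# Phase 2: 100 of those boards, 90 with x = 0 and 10 with x = 1 (from the table).
in2 <- rep(FALSE, 1000)
in2[1:90] <- TRUE
in2[921:930] <- TRUE

# y is only observed in phase 2: 5 defective among x = 0, 10 among x = 1.
y <- rep(NA_real_, 1000)
y[in2] <- 0
y[1:5] <- 1
y[921:930] <- 1

fail2 <- data.frame(id = 1:1000, x = x, y = y, in2 = in2)

sum(fail2$in2)             # 100 boards in phase two
mean(fail2$y[fail2$in2])   # 0.15, the unadjusted phase-two proportion

if (requireNamespace("survey", quietly = TRUE)) {
  des2 <- survey::twophase(id = list(~id, ~id), data = fail2, subset = ~in2)
  print(survey::svymean(~y, des2))
}
```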
> svymean(~y, des.fail)
>
> gives mean y 0.1875, SE 0.0196, but theoretically,
> we have x.bar1 (phase1)=0.08 and y.bar2 (phase2)=0.15 defect boards.
That is 15/80 = 0.1875: all 15 boards with y=1 are among the 80 boards with
x=1 that your subset selects.
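
The arithmetic can be checked in base R on the data frame as you posted it
(only the id, x and y columns are needed; phase2 is just a name for this
sketch):

```r
# The data frame from the question, without the n1, n2 and N columns.
fail <- data.frame(id = 1:1000,
                   x = c(rep(0, 920), rep(1, 80)),
                   y = c(rep(0, 985), rep(1, 15)))

# subset = ~I(x == 1) keeps exactly these rows as "phase two".
phase2 <- fail[fail$x == 1, ]

nrow(phase2)     # 80
sum(phase2$y)    # 15
mean(phase2$y)   # 0.1875
```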
> Two phase sampling assumes some relation between the easily/ fast
> received x-information and the elaborate/ time-consuming y-information,
> say a ratio r=sum y (phase2)/ sum x (phase2)=15/10=1.5 (out of the above
> table)
Not quite. Two-phase sampling is *useful* only where there is a
relationship. No relationship is *assumed*.
There are two ways you can take advantage of a relationship. The first is
to stratify the phase-two sampling based on phase one information. In
this case you need a strata= argument to twophase().
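
A minimal sketch of the stratified version, assuming (purely for illustration)
that the 90/10 split in your table had been fixed by design, i.e. phase two
was sampled within the strata x=0 and x=1. The names fail2 and in2 are
invented here, and the survey call is skipped if the package is not installed.
The hand computation shows what a stratified estimate of the mean of y looks
like for these counts.

```r
# Phase-one sample of 1000 boards; phase two marked by in2 (counts from the table).
x <- c(rep(0, 920), rep(1, 80))
in2 <- rep(FALSE, 1000)
in2[c(1:90, 921:930)] <- TRUE
y <- rep(NA_real_, 1000)
y[in2] <- 0
y[c(1:5, 921:930)] <- 1
fail2 <- data.frame(id = 1:1000, x = x, y = y, in2 = in2)

# By hand: weight each stratum mean of y by the phase-one stratum size.
N_h <- as.numeric(table(fail2$x))                                  # 920, 80
ybar_h <- as.numeric(tapply(fail2$y[fail2$in2],
                            fail2$x[fail2$in2], mean))             # 5/90, 1
est_strat <- sum(N_h * ybar_h) / sum(N_h)
est_strat                                                          # about 0.131

if (requireNamespace("survey", quietly = TRUE)) {
  # No strata at phase one (NULL), strata defined by x at phase two.
  des_str <- survey::twophase(id = list(~id, ~id),
                              strata = list(NULL, ~x),
                              data = fail2, subset = ~in2)
  print(survey::svymean(~y, des_str))
}
```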
The second way to use a relationship is to calibrate phase two to phase
one, using the calibrate() function. This is analogous to the regression
estimator you describe.
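
A sketch of the calibration route, under the assumption that phase two was a
simple random subsample of 100 boards with the counts in your table (fail2 and
in2 are invented names). Calibrating on ~x forces the phase-two weights to
reproduce the phase-one count of x; with a binary x this amounts to
post-stratifying on x, which the hand computation below mimics. The survey
calls are skipped if the package is not installed.

```r
# Phase-one sample of 1000 boards; phase two marked by in2 (counts from the table).
x <- c(rep(0, 920), rep(1, 80))
in2 <- rep(FALSE, 1000)
in2[c(1:90, 921:930)] <- TRUE
y <- rep(NA_real_, 1000)
y[in2] <- 0
y[c(1:5, 921:930)] <- 1
fail2 <- data.frame(id = 1:1000, x = x, y = y, in2 = in2)

# By hand: rescale phase-two weights so each x group matches its phase-one count.
s <- fail2[fail2$in2, ]
w <- ifelse(s$x == 1, 80 / 10, 920 / 90)
sum(w)                             # 1000, the phase-one sample size
sum(w * s$x)                       # 80, the phase-one count of x = 1
est_cal <- sum(w * s$y) / sum(w)
est_cal                            # about 0.131

if (requireNamespace("survey", quietly = TRUE)) {
  des2 <- survey::twophase(id = list(~id, ~id), data = fail2, subset = ~in2)
  # Calibrate the phase-two sample to phase-one totals of the model matrix for ~x.
  des_cal <- survey::calibrate(des2, phase = 2, formula = ~x)
  print(survey::svymean(~y, des_cal))
}
```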
A good example to look at is in vignette("epi"). This describes a
two-phase sample where about 4000 people are in the first phase (a cancer
clinical trial) and then the second phase is sampled based on relapse and
on disease type ("histology") determined at the local hospital.
Disease type is determined more accurately at a central lab for everyone
who relapses, everyone whose locally-determined disease type is bad, and
20% of the rest.
There is also an example of calibration, post-stratifying the second phase
to the first phase on disease stage, for the same data.
Finally, note that twophase() does not use the unbiased estimator of
variance. It uses a modification that is easier to compute for cluster
samples, as described in vignette("phase1"). There is no difference if
the first phase is sampled from an infinite population (or with
replacement), which is the case in vignette("epi").
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle