budget {twostage} | R Documentation |
Optimal design for a two-stage study with a budget constraint,
using the Mean Score method
BACKGROUND
This function calculates the total number of study observations and the second-stage sampling fractions that maximise precision subject to an available budget. The user must supply the unit cost of observations at the first and second stage, and the vector of prevalences in each of the strata defined by the levels of the dependent variable and the first-stage covariates.
Before running the budget function, you should run the coding function to see the order in which you must supply the vector of prevalences (see help(coding) for details).
budget(x=x, y=y, z=z, factor=NULL, prev=prev, var="var", b=b, c1=c1, c2=c2)
REQUIRED ARGUMENTS
x |
matrix of predictor variables |
y |
response variable (binary 0-1) |
z |
matrix of the first stage variables which must be categorical (can be more than one column) |
prev |
vector of estimated prevalences for each (y,z) stratum |
var |
The name of the predictor variable whose coefficient is to be optimised. See DETAILS if this is a factor variable |
b |
the total budget available |
c1 |
the cost per first stage observation |
c2 |
the cost per second stage observation |
OPTIONAL ARGUMENTS
factor |
the names of any factor variables in the predictor matrix |
DETAILS
The response, predictor and first-stage variables must be numeric.
If z has multiple columns, say (z1, z2, ..., zn), these are recoded
into a single vector new.z, as in this example with three binary columns:
z1 | z2 | z3 | new.z |
0 | 0 | 0 | 1 |
1 | 0 | 0 | 2 |
0 | 1 | 0 | 3 |
1 | 1 | 0 | 4 |
0 | 0 | 1 | 5 |
1 | 0 | 1 | 6 |
0 | 1 | 1 | 7 |
1 | 1 | 1 | 8 |
If some of the value combinations do not exist
in your data, the function will adjust accordingly.
For example if the combination (0,1,1) is absent,
then (1,1,1) will be coded as 7.
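The recoding described above can be sketched in base R as follows. This is an illustrative sketch only, not the package's own code: the helper name recode_z is hypothetical, and the ordering rule (first column varies fastest, consecutive codes assigned only to combinations present in the data) is inferred from the table and the (0,1,1) example.

```r
# Hypothetical helper (not part of twostage): collapse the columns of z
# into a single code, numbering only the combinations that occur.
recode_z <- function(z) {
  z <- as.matrix(z)
  u <- unique(z)                                   # observed combinations
  # Sort so that the first column varies fastest, as in the table above
  ord <- do.call(order, lapply(ncol(u):1, function(j) u[, j]))
  u <- u[ord, , drop = FALSE]
  key <- function(m) apply(m, 1, paste, collapse = "_")
  match(key(z), key(u))                            # position = new.z code
}

full <- as.matrix(expand.grid(z1 = 0:1, z2 = 0:1, z3 = 0:1))
part <- full[-7, ]           # drop the (0,1,1) combination
recode_z(full)               # codes 1 to 8, matching the table
recode_z(part)               # (1,1,1) is now coded as 7
```

Run coding on your own data to confirm the actual order used by the package before supplying prevalences.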
If you wish to optimise the coefficient of a factor variable,
you need to specify which level of the variable to optimise.
For example, if "weight" is a factor variable with 3 categories
1,2 and 3 then var="weight2" will optimise the estimation of the
coefficient which measures the difference between weight=2 and
the baseline (weight=1). By default the baseline is always
the category with the smallest value.
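The naming convention var="weight2" matches the way base R labels dummy columns for a factor, which the following sketch illustrates. The data frame d is made up for illustration, and whether budget builds its design matrix through model.matrix internally is an assumption.

```r
# Made-up data: a three-level factor called weight
d <- data.frame(weight = factor(c(1, 2, 3, 2, 1)))

# Treatment contrasts: the smallest level (1) is the baseline, and the
# remaining columns are named by pasting the variable name and the level
mm <- model.matrix(~ weight, data = d)
colnames(mm)
```

The column weight2 is the indicator for weight=2 versus the baseline weight=1, which is the coefficient selected by var="weight2".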
The following lists will be returned:
n |
the optimal number of observations (first stage sample size) |
se |
the standard error of estimates achieved by the optimal design |
and a list called design consisting of the following items:
ylevel |
the different levels of the response variable |
zlevel |
the different levels of first stage covariates z. |
prev |
the prevalence of each (ylevel, zlevel) stratum |
n2 |
the sample size of pilot observations for each (ylevel, zlevel) stratum |
prop |
optimal 2nd stage sampling proportion for each (ylevel, zlevel) stratum |
samp.2nd |
optimal 2nd stage sample size for each (ylevel, zlevel) stratum |
Reilly, M. and Pepe, M.S. 1995. A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82:299-314
Reilly, M. 1996. Optimal sampling strategies for two-stage studies. Amer. J. Epidemiol. 143:92-100
ms.nprev, fixed.n, precision, cass1, cass2, coding
## Not run: We give an example using the pilot subsample from the CASS data
# discussed in Reilly (1996). The data are in the cass2 matrix, which can
# be loaded using
## End(Not run)

data(cass2)

## Not run: and a description of the dataset can be seen using

help(cass2)

## Not run: In our examples below, we use sex and weight as auxiliary
# variables. Given an available budget of £10,000, a first-stage cost of
# £1/unit and a second-stage cost of £0.5/unit, the code below will
# calculate the sampling strategy that optimises the precision of the
# coefficient for SEX: see output below.
## End(Not run)

data(cass2)
y=cass2[,1]          #response variable
z=cass2[,10]         #auxiliary variable
x=cass2[,c(2,4:9)]   #predictor variables

# run CODING function to see in which order we should enter prevalences
coding(x=x,y=y,z=z)

# supplying the prevalence (from Table 5, Reilly 1996)
prev=c(0.0197823937,0.1339020772,0.6698813056,0.0544015826,
       0.0503214639,0.0467359050,0.0009891197,0.0040801187,0.0127349159,
       0.0022255193,0.0032146390,0.0017309594)

# optimise sex coefficient
budget(x=x,y=y,z=z,var="sex",prev=prev,b=10000,c1=1,c2=0.5)

## Not run: OUTPUT
# [1] "please run coding function to see the order in which you"
# [1] "must supply the first-stage sample sizes or prevalences"
# [1] " Type ?coding for details!"
# [1] "For calls requiring n1 or prev as input, use the following order"
#       ylevel z new.z n2
#  [1,]      0 1     1 10
#  [2,]      0 2     2 10
#  [3,]      0 3     3 10
#  [4,]      0 4     4 10
#  [5,]      0 5     5 10
#  [6,]      0 6     6 10
#  [7,]      1 1     1  8
#  [8,]      1 2     2 10
#  [9,]      1 3     3 10
# [10,]      1 4     4 10
# [11,]      1 5     5 10
# [12,]      1 6     6 10
# [1] "Check sample sizes/prevalences"
# $n
# [1] 9166
#
# $design
#       ylevel zlevel         prev n2   prop samp.2nd
#  [1,]      0      1 0.0197823937 10 0.5230       95
#  [2,]      0      2 0.1339020772 10 0.2841      349
#  [3,]      0      3 0.6698813056 10 0.0726      446
#  [4,]      0      4 0.0544015826 10 0.4488      224
#  [5,]      0      5 0.0503214639 10 0.2480      114
#  [6,]      0      6 0.0467359050 10 0.4922      211
#  [7,]      1      1 0.0009891197  8 1.0000        9
#  [8,]      1      2 0.0040801187 10 1.0000       37
#  [9,]      1      3 0.0127349159 10 1.0000      117
# [10,]      1      4 0.0022255193 10 1.0000       20
# [11,]      1      5 0.0032146390 10 1.0000       29
# [12,]      1      6 0.0017309594 10 1.0000       16
#
# $se
#                    [,1]
# (Intercept) 1.193504705
# sex         0.217235702
# weight      0.006718422
# age         0.014588813
# angina      0.245831383
# chf         0.077039239
# lve         0.010071151
# surg        0.179887419
## End(Not run)
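As a sanity check on the example output, the sketch below recomputes the second-stage sample sizes and the total cost from the printed values. The two relationships it uses (samp.2nd equal to n * prev * prop rounded to the nearest integer, and total cost equal to c1 * n + c2 * sum(samp.2nd)) are assumptions inferred from the printed numbers, not a documented description of how budget works internally. The variable names samp2, implied and total_cost are illustrative.

```r
# Values copied from the example output
n     <- 9166
prev  <- c(0.0197823937, 0.1339020772, 0.6698813056, 0.0544015826,
           0.0503214639, 0.0467359050, 0.0009891197, 0.0040801187,
           0.0127349159, 0.0022255193, 0.0032146390, 0.0017309594)
prop  <- c(0.5230, 0.2841, 0.0726, 0.4488, 0.2480, 0.4922, rep(1, 6))
samp2 <- c(95, 349, 446, 224, 114, 211, 9, 37, 117, 20, 29, 16)
c1 <- 1; c2 <- 0.5; b <- 10000

# Assumption: stratum second-stage size = n * prev * prop, rounded
implied <- round(n * prev * prop)
all(implied == samp2)        # TRUE

# Assumption: total cost = first-stage cost + second-stage cost
total_cost <- c1 * n + c2 * sum(samp2)
total_cost                   # 9999.5, just under the budget b
```

Under these assumptions the design spends essentially the whole budget, with all cases (ylevel = 1) sampled at the second stage and the largest control stratum sampled most sparsely.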