fixed.n {twostage} | R Documentation |
Optimal second stage sampling fractions (and sample sizes) using mean score method in logistic regression setting, based on first-stage sample sizes and pilot second-stage data as input.
Optimality is with respect to the standard error of a coefficient
of interest, specified in the call to the function.
BACKGROUND
This function gives the optimal second stage sampling fractions
(and sample sizes) for applications where a first-stage sample
(of size n) has already been gathered and the size of sample to
be gathered at the second stage is also fixed. Such a situation
might arise where outcome data (Y) and some covariates (Z) are
available on a database, and it is decided to pursue additional
variables on a subsample of subjects, where the size of the
subsample is determined by time/cost considerations (an example
would be the testing of stored bloods for a new marker which has
been discovered since an initial case-control study was done).
Since the first-stage data is available, the count (or proportion)
of first-stage observations in each (Z,Y) stratum can be computed,
and one of these vectors must be provided in the call to the "fixed.n"
function.
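As a minimal sketch in base R (not part of the twostage package), the stratum counts or proportions could be tabulated from first-stage vectors as shown here. The toy vectors y and z are invented for illustration. The ordering (z varying fastest within each level of y) is an assumption chosen to match the example output on this page; the coding function remains the authoritative check on the order.

```r
# Toy first-stage data (hypothetical values, for illustration only)
y <- rep(c(0, 0, 1, 1), times = c(5, 3, 2, 1))  # binary outcome
z <- rep(c(0, 1, 0, 1), times = c(5, 3, 2, 1))  # categorical first-stage covariate

# Cross-tabulate with z as rows so that flattening (column-major) lists the
# strata as (y=0,z=0), (y=0,z=1), (y=1,z=0), (y=1,z=1)
counts <- table(z = z, y = y)
n1 <- as.vector(counts)   # first-stage count in each (Z,Y) stratum
prev <- n1 / sum(n1)      # the same information expressed as proportions

print(n1)
print(round(prev, 3))
```

Either vector carries the same information about the first stage; the counts also fix the total first-stage size.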
The optimal second-stage sampling fractions can also be found
for the situation where the first-stage data is NOT available
provided we specify the ratio of second stage sample size
to first stage sample size (i.e., the overall sampling fraction
at the second stage), and estimates of prevalences of the
(Z,Y) strata in the population. However, this situation is
likely to be rare compared to the first scenario above.
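To make the two quantities concrete, the following base R sketch derives them from the CASS stratum sizes and the second-stage total of 1000 used in the example on this page. It assumes that prevalences are expressed as proportions summing to one; the variable name n2_total is hypothetical and used only for illustration.

```r
# First-stage stratum sizes taken from the CASS example on this page
n1 <- c(6666, 1228, 144, 58)
n2_total <- 1000             # planned second-stage sample size (hypothetical name)

prev <- n1 / sum(n1)         # stratum prevalences, assumed to sum to 1
frac <- n2_total / sum(n1)   # overall second-stage sampling fraction

print(round(prev, 4))
print(round(frac, 4))
```

With real first-stage counts available, n1 would normally be supplied directly; prev and frac are the substitutes when only population estimates are known.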
Before running the fixed.n function you should run the coding
function to see in which order you must supply the vector of
first-stage sample sizes or prevalences (see help(coding) for details).
USAGE
fixed.n(x=x, y=y, z=z, factor=NULL, n2=n2, var="var", n1="option", prev="option", frac="option")
REQUIRED ARGUMENTS
x | matrix of predictor variables |
y | response variable (binary 0-1) |
z | matrix of the first stage variables, which must be categorical (can be more than one column) |
n2 | size of second stage sample |
var | the name of the predictor variable whose coefficient is to be optimised. See DETAILS if this is a factor variable |
and one of the following:
n1 | vector of the first stage sample sizes for each (y,z) stratum |
OR
prev | vector of estimated prevalences for each (y,z) stratum, AND |
frac | the second stage sampling fraction, i.e., the ratio of second stage sample size to first stage sample size (NOTE: if prev is given, frac will also be required) |
OPTIONAL ARGUMENTS
factor | the names of any factor variables in the predictor matrix |
DETAILS
The response, predictor and first stage variables
have to be numeric. If you have multiple columns of
z, say (z1, z2, ..., zn), these will be recoded into
a single vector new.z. These new.z values are
reported as zlevel in the output (see VALUE).
z1 | z2 | z3 | new.z |
0 | 0 | 0 | 1 |
1 | 0 | 0 | 2 |
0 | 1 | 0 | 3 |
1 | 1 | 0 | 4 |
0 | 0 | 1 | 5 |
1 | 0 | 1 | 6 |
0 | 1 | 1 | 7 |
1 | 1 | 1 | 8 |
If some of the value combinations do not exist
in your data, the function will adjust accordingly.
For example if the combination (0,1,1) is absent,
then (1,1,1) will be coded as 7.
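The recoding rule in the table can be imitated in base R with interaction(), which orders levels with the first column varying fastest and can drop absent combinations. This is a sketch under that assumption, not the package's own code; the helper name recode_z is hypothetical.

```r
# All eight combinations of three binary columns, with z1 varying fastest
z <- expand.grid(z1 = 0:1, z2 = 0:1, z3 = 0:1)

# Hypothetical helper: collapse several categorical columns into one code
recode_z <- function(z) {
  # interaction() orders levels with the first factor varying fastest;
  # drop = TRUE removes combinations that do not occur in the data
  as.integer(interaction(z, drop = TRUE))
}

# Full grid: reproduces the new.z column of the table (1 to 8)
print(cbind(z, new.z = recode_z(z)))

# Remove the combination (0,1,1): the codes close up, so (1,1,1) becomes 7
z_missing <- z[!(z$z1 == 0 & z$z2 == 1 & z$z3 == 1), ]
print(cbind(z_missing, new.z = recode_z(z_missing)))
```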
If you wish to optimise the coefficient of a factor variable,
you need to specify which level of the variable to optimise.
For example, if "weight" is a factor variable with 3 categories
1,2 and 3 then var="weight2" will optimise the estimation of the
coefficient which measures the difference between weight=2 and
the baseline (weight=1). By default the baseline is always
the category with the smallest value.
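The naming pattern "weight2" is the same one that R's default treatment contrasts produce. The base R sketch below illustrates that pattern on invented data; whether fixed.n builds its design matrix in exactly this way is an assumption.

```r
# Hypothetical data: "weight" is a factor with categories 1, 2 and 3
d <- data.frame(weight = factor(c(1, 2, 3, 2, 1, 3)))

# Default treatment contrasts: the smallest level (1) is the baseline,
# and each remaining level gets its own indicator column
X <- model.matrix(~ weight, data = d)

print(colnames(X))
```

The column named weight2 is the indicator for weight=2 versus the baseline, which is the coefficient selected by var="weight2".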
VALUE
A list called design consisting of the following items:
ylevel | the different levels of the response variable |
zlevel | the different levels of the first stage variables z |
n1 | the first stage sample size for each (ylevel, zlevel) stratum |
n2 | the sample size of pilot observations for each (ylevel, zlevel) stratum |
prop | optimal 2nd stage sampling proportion for each (ylevel, zlevel) stratum |
samp.2nd | optimal 2nd stage sample size for each (ylevel, zlevel) stratum |
and a list called se containing:
se | the standard errors of estimates achieved by the optimal design |
REFERENCES
Reilly, M. and Pepe, M.S. 1995. A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82:299-314.
Reilly, M. 1996. Optimal sampling strategies for two-stage studies. Amer. J. Epidemiol. 143:92-100.
SEE ALSO
ms.nprev, budget, precision, cass1, cass2, coding
EXAMPLES
## Not run:
This example of computing second stage sampling fractions subject to
a fixed total second-stage sample size uses the CASS data (Reilly, 1996).
Once the TWOSTAGE library has been attached, this data can be made
available by:
## End(Not run)

data(cass1)

## Not run:
and a detailed description of the data can be obtained by
help(cass1)

## Not run:
In this example, we suppose that the CASS registry only has available
the mortality (Y) and sex (Z) for the 8096 "first-stage" subjects.
The pilot data consists of 25 observations from each (Y,Z) stratum,
where the sizes of the strata are (see Reilly 1996):

Y  Z  N
0  0  6666
0  1  1228
1  0  144
1  1  58

We wish to use this pilot information to compute the optimal design
to minimise the variance of the sex coefficient in a logistic model
with Sex and Age as predictors. Assume that we wish to sample a total
of 1000 subjects at the second stage.
The following commands give the output below:
## End(Not run)

data(cass1)
y=cass1[,1]    #--- the response variable is mortality
z=cass1[,3]    #--- the auxiliary variable is sex
x=cass1[,2:3]  #--- the variables in the model are sex and age

# run CODING function to see in which order we should enter n1
coding(x=x,y=y,z=z)

# supplying the first stage sample sizes
n1=c(6666, 1228, 144, 58)

# variable to be optimised (in our case sex)
fixed.n(x=x,y=y,z=z,n1=n1,var="sex",n2=1000)

## Not run:
will give us the following output

[1] "please run coding function to see the order in which you"
[1] "must supply the first-stage sample sizes or prevalences"
[1] " Type ?coding for details!"
[1] "For calls requiring n1 or prev as input, use the following order"
     ylevel z new.z n2
[1,]      0 0     0 25
[2,]      0 1     1 25
[3,]      1 0     0 25
[4,]      1 1     1 25
[1] "Check sample sizes/prevalences"
$design
     ylevel zlevel   n1 n2   prop samp.2nd
[1,]      0      0 6666 25 0.1128      752
[2,]      0      1 1228 25 0.0375       46
[3,]      1      0  144 25 1.0000      144
[4,]      1      1   58 25 1.0000       58

$se
                  [,1]
(Intercept) 0.55496070
age         0.00956422
sex         0.16472156
## End(Not run)
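
The figures in the $design output above can be cross-checked with a few lines of base R. The sketch assumes that samp.2nd is the rounded product of prop and n1; this is an inference from the printed numbers rather than documented behaviour, and the name samp_2nd is hypothetical.

```r
# Values copied from the $design output in the example above
n1   <- c(6666, 1228, 144, 58)
prop <- c(0.1128, 0.0375, 1.0000, 1.0000)

# Assumed relationship: second-stage size = rounded (proportion x first-stage size)
samp_2nd <- round(prop * n1)

print(samp_2nd)       # compare with the samp.2nd column
print(sum(samp_2nd))  # compare with the requested n2
```

Under this assumption the computed sizes agree with the samp.2nd column and add up to the requested n2 of 1000. The two small strata (deaths) are sampled in full, and the remainder is allocated to the survivors.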