mixed.mtc {StatMatch} | R Documentation |
This function implements some mixed methods to perform statistical matching between two data sources such that no units are in common and one or more continuous variables are shared.
mixed.mtc(data.rec, data.don, match.vars, y.rec, z.don, method="ML", rho.yz=0, micro=FALSE, constr.alg="lpSolve")
data.rec |
A matrix or data frame that plays the role of recipient in the statistical matching application. This data set must contain all the variables (columns) to be used in statistical matching, i.e. the variables named in the arguments match.vars and y.rec. All variables must be continuous. Missing values (NA) are not allowed.
|
data.don |
A matrix or data frame that plays the role of donor in the statistical matching application. This data set must contain all the numeric variables (columns) to be used in statistical matching, i.e. the variables named in the arguments match.vars and z.don. All variables must be continuous. Missing values (NA) are not allowed.
|
match.vars |
A character vector with the names of the common variables (the columns in both the data frames) to be used as matching variables (X). |
y.rec |
A character vector with the name of the target variable Y that is observed only for units in data.rec. Only one continuous variable is allowed.
|
z.don |
A character vector with the name of the target variable Z that is observed only for units in data.don. Only one continuous variable is allowed.
|
method |
A character vector that identifies the method used to estimate the parameters of the regression models Y vs. X and Z vs. X. The Maximum Likelihood method is used when method="ML" (default); when method="MS" the parameters are instead estimated according to Moriarity and Scheuren (2001 and 2003). See Details for further information.
|
rho.yz |
A numeric value representing the guess for the correlation between Y (y.rec) and Z (z.don), which are not jointly observed. Note that when method="MS", rho.yz must specify the value of the correlation coefficient rho_YZ; when method="ML", it must instead specify the partial correlation coefficient between Y and Z given X (rho_YZ|X).
By default (rho.yz=0), in the absence of auxiliary information concerning the correlation coefficient or the partial correlation coefficient, statistical matching is carried out under the assumption of independence between Y and Z given X (Conditional Independence Assumption, CIA), i.e. rho_YZ|X = 0.
|
micro |
Logical. When micro=FALSE (default) only the parameter estimates are returned. When micro=TRUE, data.rec filled in with the values for the variable Z is returned too. The donors for filling in Z in data.rec are identified using a constrained distance hot deck method. In this case, the number of units (rows) in data.don must be greater than or equal to the number of units (rows) in data.rec. See the next argument and Details for further information.
|
constr.alg |
A string that has to be specified when micro=TRUE, in order to solve the transportation problem involved in the constrained distance hot deck method. Two choices are available: "lpSolve" and "relax". With constr.alg="lpSolve", the transportation problem is solved by means of the function lp.transport available in the package lpSolve. With constr.alg="relax", the transportation problem is solved using the RELAX–IV algorithm from Bertsekas and Tseng (1994), implemented in the function pairmatch available in the package optmatch. Note that constr.alg="relax" is faster and requires less computational effort, but the usage of this algorithm is allowed only for research purposes (for details see the function relaxinfo() in the package optmatch).
|
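The transportation problem behind the constrained distance hot deck can be illustrated on a toy distance matrix. The following is only a sketch, not the code used inside mixed.mtc: the matrix d and the object names are hypothetical, and the constraints assumed here are that every recipient receives exactly one donor and every donor is used at most once.

```r
# Minimal sketch of a constrained distance hot deck via lpSolve::lp.transport.
# All object names and the toy distances are illustrative assumptions.
if (requireNamespace("lpSolve", quietly = TRUE)) {
  # toy distances: 3 recipients (rows) x 4 donors (columns)
  d <- matrix(c(1, 2, 9, 9,
                1, 3, 9, 9,
                9, 9, 1, 2), nrow = 3, byrow = TRUE)
  nr <- nrow(d)
  nd <- ncol(d)

  # each recipient gets exactly one donor; each donor is used at most once
  sol <- lpSolve::lp.transport(cost.mat = d, direction = "min",
                               row.signs = rep("=", nr), row.rhs = rep(1, nr),
                               col.signs = rep("<=", nd), col.rhs = rep(1, nd))

  # donor chosen for each recipient under the constraints
  constr.donor <- apply(sol$solution, 1, which.max)

  # unconstrained nearest neighbour, for comparison (donor 1 is used twice)
  nn.donor <- apply(d, 1, which.min)

  # total matching distance of the constrained solution
  total.dist <- sol$objval
}
```

In this toy case the unconstrained choice assigns the first donor to two recipients, while the constrained solution gives up a little distance on one recipient so that no donor is reused.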
This function implements some mixed methods to perform statistical matching. A mixed method consists of two steps:
(i) adoption of a parametric model for the joint distribution of (X,Y,Z) and estimation of its parameters;
(ii) derivation of a complete “synthetic” data set (recipient data set filled in with values for the Z variable) using a nonparametric approach.
In this case, as far as (i) is concerned, it is assumed that (X,Y,Z) follows a multivariate normal distribution. In particular, dealing with continuous variables, a version of the imputation method known as predictive mean matching is used. This method consists of three steps:
step 1) – Regression step: The two linear regression models Y vs. X and Z vs. X are considered and their parameters are estimated.
step 2) – Computation of intermediate values. For the units in data.rec the following intermediate values are derived:
z^_a = alpha_Z + beta_ZX * x_a + e_a
for each a=1,...,n_A, where n_A is the number of units (rows) in data.rec. Here e_a is a random draw from the normal distribution with zero mean and estimated residual variance sigma_ZX.
Similarly, for the units in data.don the following intermediate values are derived:
y^_b = alpha_Y + beta_YX * x_b + e_b
for each b=1,...,n_B, where n_B is the number of units (rows) in data.don. Here e_b is a random draw from the normal distribution with zero mean and estimated residual variance sigma_YX.
step 3) – Matching step. For each observation (row) in data.rec a donor is chosen in data.don through a nearest neighbor constrained distance hot deck procedure. The distances are computed between (y_a, z^_a) and (y^_b, z_b) using the Mahalanobis distance.
For further details see Sections 2.5.1 and 3.6.1 in D'Orazio et al. (2006).
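The three steps can be sketched in base R as follows. This is an illustration of the formulas as written above, not the package's implementation: the simulated data, the object names, the use of lm() for the regression step and the pooled covariance matrix in the Mahalanobis distance are all assumptions. For brevity the last line picks the nearest donor without constraints, so a donor may be reused, whereas mixed.mtc uses the constrained version.

```r
# Sketch of steps 1-3 with base R only; names and data are hypothetical.
set.seed(1)
n.A <- 30
n.B <- 40
rec <- data.frame(x = rnorm(n.A))                    # recipient: X and Y
rec$y <- 1 + 0.8 * rec$x + rnorm(n.A, sd = 0.5)
don <- data.frame(x = rnorm(n.B))                    # donor: X and Z
don$z <- -1 + 0.6 * don$x + rnorm(n.B, sd = 0.5)

# step 1: regression of Y on X (recipient) and of Z on X (donor)
fit.y <- lm(y ~ x, data = rec)
fit.z <- lm(z ~ x, data = don)

# step 2: intermediate values = prediction + random normal residual
z.hat.A <- predict(fit.z, newdata = rec) + rnorm(n.A, sd = sigma(fit.z))
y.hat.B <- predict(fit.y, newdata = don) + rnorm(n.B, sd = sigma(fit.y))

# step 3: Mahalanobis distances between (y_a, z^_a) and (y^_b, z_b)
pts.A <- cbind(y = rec$y, z = z.hat.A)
pts.B <- cbind(y = y.hat.B, z = don$z)
S <- cov(rbind(pts.A, pts.B))      # assumption: pooled covariance matrix
# mahalanobis() returns squared distances, hence the sqrt()
dist.AB <- t(apply(pts.A, 1, function(a)
  sqrt(mahalanobis(pts.B, center = a, cov = S))))

# unconstrained nearest donor for each recipient (simplification)
donor <- apply(dist.AB, 1, which.min)
filled <- cbind(rec, z = don$z[donor])
```

The rows of dist.AB correspond to recipients and the columns to donors, which is the layout a transportation solver would need for the constrained variant.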
Note that in step 1) the parameters of the regression model can be estimated by means of the Maximum Likelihood method (method="ML") (see D'Orazio et al., 2006, pp. 19–23, 73–75) or using the Moriarity and Scheuren (2001 and 2003) approach (method="MS") (see also D'Orazio et al., 2006, pp. 75–76). The two estimation methods are compared in D'Orazio et al. (2005).
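Because rho.yz is read as a partial correlation when method="ML" and as a plain correlation when method="MS", it can help to see how the two quantities relate. The sketch below uses the standard identity linking rho_YZ and rho_YZ|X through the correlations with X, applied to the correlation matrix of the Examples; the helper names prho.to.rho and rho.to.prho are hypothetical and are not part of StatMatch.

```r
# Correlation matrix of (x1, x2, y, z) as in the Examples section
mat.cor <- matrix(0, 4, 4)
mat.cor[lower.tri(mat.cor)] <- c(0.3, 0.5, 0.7, 0.8, 0.4, 0.8)
mat.cor <- mat.cor + t(mat.cor)
diag(mat.cor) <- 1
dimnames(mat.cor) <- list(c("x1", "x2", "y", "z"), c("x1", "x2", "y", "z"))

x.vars <- c("x1", "x2")
R.xx.inv <- solve(mat.cor[x.vars, x.vars])
r.yx <- mat.cor["y", x.vars]
r.zx <- mat.cor["z", x.vars]

# part of rho_YZ explained by X, and the scale of the unexplained part
explained <- drop(r.yx %*% R.xx.inv %*% r.zx)
resid.scale <- sqrt((1 - drop(r.yx %*% R.xx.inv %*% r.yx)) *
                    (1 - drop(r.zx %*% R.xx.inv %*% r.zx)))

# hypothetical helpers: convert between rho_YZ|X and rho_YZ
prho.to.rho <- function(prho) explained + prho * resid.scale
rho.to.prho <- function(rho) (rho - explained) / resid.scale

rho.cia <- prho.to.rho(0)      # rho_YZ implied by the CIA (rho_YZ|X = 0)
prho.true <- rho.to.prho(0.8)  # rho_YZ|X implied by the true rho_YZ = 0.8
```

Under the CIA the implied rho_YZ is generally not zero: it equals the part of the correlation that is explained by the matching variables.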
When method="MS", if the value specified for the argument rho.yz is not compatible with the other correlation coefficients estimated from the data, then it is substituted with the closest value compatible with the other estimated coefficients.
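The compatibility check can be sketched as follows, assuming that "compatible" means the full correlation matrix of (X, Y, Z) stays positive semidefinite, which bounds rho_YZ to an interval determined by the correlations with X. The helpers rho.yz.bounds and closest.admissible are hypothetical names, and the package may treat the boundary values differently.

```r
# hypothetical helper: admissible interval for rho_YZ given the
# correlation matrix of X (R.xx) and the correlations of Y and Z with X
rho.yz.bounds <- function(R.xx, r.yx, r.zx) {
  R.inv <- solve(R.xx)
  centre <- drop(t(r.yx) %*% R.inv %*% r.zx)
  half <- sqrt((1 - drop(t(r.yx) %*% R.inv %*% r.yx)) *
               (1 - drop(t(r.zx) %*% R.inv %*% r.zx)))
  c(low = centre - half, up = centre + half)
}

# hypothetical helper: move a guess to the closest admissible value
closest.admissible <- function(rho, bounds) {
  min(max(rho, bounds[["low"]]), bounds[["up"]])
}

# correlations taken from the matrix used in the Examples section
R.xx <- matrix(c(1, 0.3, 0.3, 1), 2, 2)
r.yx <- c(0.5, 0.8)
r.zx <- c(0.7, 0.4)
b <- rho.yz.bounds(R.xx, r.yx, r.zx)

rho.kept <- closest.admissible(0.2, b)    # inside the interval: unchanged
rho.moved <- closest.admissible(0, b)     # below the interval: moved up
```

With these correlations a guess of 0.2 is kept as it is, while a guess of 0 lies below the lower bound and is replaced by that bound.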
When micro=FALSE only the estimation of the parameters is performed (step 1). Otherwise (micro=TRUE) the whole procedure is carried out.
A list with a varying number of components depending on the values of the arguments method and rho.yz.
mu |
The estimated mean vector. |
vc |
The estimated variance–covariance matrix. |
cor |
The estimated correlation matrix. |
res.var |
A vector with estimates of the residual variances sigma_Y|ZX and sigma_Z|YX. |
start.prho.yz |
Returned only when method="ML". It is the initial guess for the partial correlation coefficient rho_YZ|X passed as input via the rho.yz argument.
|
rho.yz |
Returned only when method="MS". It is a vector with four values: the initial guess for rho_YZ; the lower and upper bounds for rho_YZ in the statistical matching framework, given the correlation coefficients between Y and the Xs and between Z and the Xs estimated from the available data; and, finally, the closest admissible value, used in computations in place of the initial rho.yz when the latter is not coherent with the other correlation coefficients estimated from the available data.
|
phi |
Returned only when method="MS". Estimates of the phi terms introduced by Moriarity and Scheuren (2001 and 2003).
|
filled.rec |
The data.rec filled in with the values of Z. It is returned only when micro=TRUE.
|
mtc.ids |
Returned only when micro=TRUE. This is a matrix with the same number of rows as data.rec and two columns. The first column contains the row names of data.rec and the second column contains the row names of the corresponding donors selected from data.don. When the input matrices do not contain row names, a numeric matrix with the row indexes is provided.
|
dist.rd |
A vector with the distances between each recipient unit and the corresponding donor, returned only when micro=TRUE.
|
call |
How the function has been called. |
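The relation between mtc.ids and filled.rec can be illustrated with toy objects. The data frames and the mtc.ids matrix below are made up to mimic the structure described above (recipient row names in the first column, donor row names in the second); they are not output produced by mixed.mtc.

```r
# hypothetical toy inputs with explicit row names
data.rec <- data.frame(x = c(0.1, 0.5, 0.9), y = c(1.0, 1.4, 2.1),
                       row.names = c("a1", "a2", "a3"))
data.don <- data.frame(x = c(0.2, 0.4, 0.8, 1.0), z = c(10, 20, 30, 40),
                       row.names = c("b1", "b2", "b3", "b4"))

# hypothetical matching result: recipient ids, then donor ids
mtc.ids <- cbind(c("a1", "a2", "a3"), c("b1", "b2", "b4"))

# rebuild the filled recipient file by looking up Z through the donor ids
filled.rec <- data.rec[mtc.ids[, 1], ]
filled.rec$z <- data.don[mtc.ids[, 2], "z"]
```

Keeping mtc.ids alongside filled.rec makes it possible to trace every imputed Z value back to the donor record it came from.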
Marcello D'Orazio madorazi@istat.it
Bertsekas, D.P. and Tseng, P. (1994). “RELAX–IV: A Faster Version of the RELAX Code for Solving Minimum Cost Flow Problems”. Technical Report, LIDS-P-2276, Massachusetts Institute of Technology, Cambridge. http://web.mit.edu/dimitrib/www/RELAX4_doc.pdf
D'Orazio, M., Di Zio, M. and Scanu, M. (2005). “A comparison among different estimators of regression parameters on statistically matched files through an extensive simulation study”, Contributi, 2005/10, Istituto Nazionale di Statistica, Rome. http://www.istat.it/dati/pubbsci/contributi/Contributi/contr_2005/2005_10.pdf
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
Moriarity, C., and Scheuren, F. (2001). “Statistical matching: a paradigm for assessing the uncertainty in the procedure”. Journal of Official Statistics, 17, 407–422. http://www.jos.nu/Articles/abstract.asp?article=173407
Moriarity, C., and Scheuren, F. (2003). “A note on Rubin's statistical matching using file concatenation with adjusted weights and multiple imputation”, Journal of Business and Economic Statistics, 21, 65–73.
# Example with fictitious data

# Set the correlation matrix
mat.cor <- matrix(0, 4, 4)
mat.cor[lower.tri(mat.cor)] <- c(0.3, 0.5, 0.7, 0.8, 0.4, 0.8)
mat.cor <- mat.cor+t(mat.cor)
diag(mat.cor) <- 1
dimnames(mat.cor) <- list(c("x1","x2","y","z"), c("x1","x2","y","z"))

# generate data from multivariate normal distribution
library(mvtnorm)
data.all <- rmvnorm(n=100, mean=rep(0,4), sigma=mat.cor)
dimnames(data.all) <- list(1:100, c("x1","x2","y","z"))

# reproduce statistical matching framework
data.A <- data.all[1:50, 1:3] # z deleted
data.B <- data.all[51:100, c(1:2,4)] # y deleted

# ML estimation method under CIA (rho_YZ|X=0);
# only parameter estimates (micro=FALSE)
mtc.1 <- mixed.mtc(data.rec=data.A, data.don=data.B,
                   match.vars=c("x1","x2"), y.rec="y", z.don="z")
# estimated vs. true correlation matrix
mtc.1$cor - mat.cor

# ML estimation method with partial correlation coefficient
# set equal to 0.5 (rho_YZ|X=0.5)
# only parameter estimates (micro=FALSE)
mtc.2 <- mixed.mtc(data.rec=data.A, data.don=data.B,
                   match.vars=c("x1","x2"), y.rec="y", z.don="z",
                   rho.yz=0.5)
# estimated vs. true correlation matrix
mtc.2$cor - mat.cor

# ML estimation method with partial correlation coefficient
# set equal to 0.5 (rho_YZ|X=0.5)
# with imputation step (micro=TRUE)
mtc.3 <- mixed.mtc(data.rec=data.A, data.don=data.B,
                   match.vars=c("x1","x2"), y.rec="y", z.don="z",
                   rho.yz=0.5, micro=TRUE, constr.alg="lpSolve")
# estimated vs. true correlation matrix
mtc.3$cor - mat.cor
# first rows of data.rec filled in with z
head(mtc.3$filled.rec)

# Moriarity and Scheuren estimation method under CIA;
# only with parameter estimates (micro=FALSE)
mtc.4 <- mixed.mtc(data.rec=data.A, data.don=data.B,
                   match.vars=c("x1","x2"), y.rec="y", z.don="z",
                   method="MS")
# estimated vs. true correlation matrix
mtc.4$cor - mat.cor

# Moriarity and Scheuren estimation method
# with correlation coefficient set equal to 0.2 (rho_YZ=0.2)
# only parameter estimates (micro=FALSE)
mtc.5 <- mixed.mtc(data.rec=data.A, data.don=data.B,
                   match.vars=c("x1","x2"), y.rec="y", z.don="z",
                   method="MS", rho.yz=0.2)
# the starting value of rho.yz and the value used
# in computations
mtc.5$rho.yz
# estimated vs. true correlation matrix
mtc.5$cor - mat.cor

# Moriarity and Scheuren estimation method
# with correlation coefficient set equal to 0.6 (rho_YZ=0.6)
# with imputation step (micro=TRUE)
mtc.6 <- mixed.mtc(data.rec=data.A, data.don=data.B,
                   match.vars=c("x1","x2"), y.rec="y", z.don="z",
                   rho.yz=0.6, method="MS", micro=TRUE,
                   constr.alg="lpSolve")
# estimated vs. true correlation matrix
mtc.6$cor - mat.cor
# first rows of data.rec filled in with z imputed values
head(mtc.6$filled.rec)