cem {cem}    R Documentation

Coarsened Exact Matching

Description

Implementation of Coarsened Exact Matching

Usage

cem(treatment = NULL, data = NULL, datalist = NULL, cutpoints = NULL,
    grouping = NULL, drop = NULL, eval.imbalance = TRUE, k2k = FALSE,
    method = NULL, mpower = 2, L1.breaks = NULL, verbose = 0)

Arguments

treatment       character, name of the treatment variable
data            a data.frame
datalist        an optional list of multiply imputed data frames
cutpoints       named list, each element describing the cutpoints for a numerical variable (the names are variable names). Each list element is either a vector of cutpoints, a number of cutpoints, or a method for automatic bin construction. See Details.
grouping        named list, each element of which is a list of groupings for a single categorical variable. See Details.
drop            a vector of variable names in the data frame to ignore during matching
eval.imbalance  logical, whether to compute the imbalance measures. Default = TRUE. See Details.
k2k             logical, restrict to k-to-k matching? Default = FALSE
method          distance method to use in k2k matching. See Details.
mpower          power of the Minkowski distance. See Details.
L1.breaks       list of cutpoints for the calculation of the L1 measure.
verbose         controls the level of verbosity. Default = 0.

Details

When specifying cutpoints, several automatic methods may be chosen: "sturges" (Sturges' rule, the default), "fd" (Freedman-Diaconis rule), "scott" (Scott's rule) and "ss" (Shimazaki-Shinomoto rule). See the references for a description of each rule.
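
As a rough illustration of what these rules produce, base R's grDevices package provides bin-count functions for three of them (nclass.Sturges, nclass.FD and nclass.scott). The sketch below applies them to simulated data. The vector x is made up, and whether cem calls these exact functions internally is an assumption, not something stated on this page. The Shimazaki-Shinomoto rule has no base R counterpart and is left out.

```r
# Simulated numeric covariate (illustrative only, not from the LL data).
set.seed(1)
x <- rexp(200, rate = 1 / 5000)

# Number of bins suggested by each rule (functions from grDevices).
bins <- c(sturges = nclass.Sturges(x),
          fd      = nclass.FD(x),
          scott   = nclass.scott(x))
print(bins)
```

In a call to cem, each variable can be given a rule name, a number of cutpoints, or an explicit vector of cutpoints, for example cutpoints = list(age = "fd", education = 4).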

The grouping option is a list where each element is itself a list. For example, suppose variable quest1 has the possible levels "no answer", NA, "negative", "neutral", "positive" and you want to collect ("no answer", NA, "neutral") into a single group; then the grouping argument should contain list(quest1=list(c("no answer", NA, "neutral"))). Or, if you have a discrete variable elements with values 1:10 and you want to collect it into the groups "1:3,NA", "4", "5:9", "10", specify in grouping the list list(elements=list(c(1:3,NA), 5:9)). Values not listed in the grouping are left as they are. If cutpoints and groupings are defined for the same variable, the groupings take precedence and the corresponding cutpoints are set to NULL.
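
The effect of a grouping can be sketched in base R. The helper collapse_groups below is hypothetical and not part of cem; it only mimics the described behaviour (listed values are merged into one label, everything else is left alone) and makes no claim about how cem implements it.

```r
# Hypothetical helper, not part of cem: merge each group of values
# into a single label and leave all other values unchanged.
collapse_groups <- function(x, groups) {
  out <- as.character(x)
  for (g in groups) {
    # Build a readable label such as "1,2,3,NA" for the merged group.
    label <- paste(ifelse(is.na(g), "NA", as.character(g)), collapse = ",")
    out[x %in% g] <- label   # %in% also matches NA against NA
  }
  out
}

elements <- c(1:10, NA)
res <- collapse_groups(elements, list(c(1:3, NA), 5:9))
print(table(res))
```

The values 4 and 10 do not appear in the grouping, so they remain separate categories, giving four groups in total.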

verbose: a number greater than or equal to 0. The higher the value, the more information is printed during the execution of the algorithm.

If eval.imbalance = TRUE (the default), cem$imbalance contains the imbalance measured by the absolute difference in means for numerical variables and by the chi-square distance for categorical variables. If FALSE, cem$imbalance is set to NULL. If data contains missing values, the imbalance measures are not calculated.

If L1.breaks is missing, the default rule used to calculate the cutpoints is Scott's rule.

If k2k is set to TRUE, the algorithm returns strata with the same number of treated and control units per stratum; otherwise all the matched units are returned (the default). When k2k = TRUE, the user can choose a method ("euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski") for nearest-neighbor matching inside each CEM stratum. By default method is NULL, which means random matching inside CEM strata. For the Minkowski distance, the power can be specified via the argument mpower. For more information on method != NULL, refer to the help page for dist.
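
Because the distance options are those of dist, the role of the power can be checked directly in base R. The sketch below uses a made-up matrix and shows that the Minkowski distance with power 2 coincides with the Euclidean distance, and with power 1 coincides with the Manhattan distance. That cem forwards mpower to dist as its p argument is an assumption.

```r
# Toy covariate matrix: three units, two covariates.
m <- rbind(c(0, 0), c(3, 4), c(6, 8))

d_euc  <- dist(m, method = "euclidean")
d_mink <- dist(m, method = "minkowski", p = 2)  # p plays the role of mpower
d_man  <- dist(m, method = "minkowski", p = 1)  # p = 1 gives Manhattan

print(d_euc)
print(d_man)
```

With the default mpower = 2, choosing method = "minkowski" should therefore behave like method = "euclidean".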

By default, cem treats missing values as distinct categories and places observations with missing values in the same variable in the same stratum, provided that all the remaining (coarsened) covariates match.

If argument data is non-NULL and datalist is NULL, CEM is applied to the single data set in data.

Argument datalist is a list of (multiply imputed) data frames (i.e., with missing cell values imputed). If data is NULL, the function cem is applied independently to each element of the list, resulting in separately matched data sets with different numbers of treated and control units.

When data and datalist are both non-NULL, each multiply imputed observation is assigned to the stratum in which it has been matched most frequently. In this case, the algorithm outputs the same matching solution for each multiply imputed data set (i.e., an observation has the same meaning in every data set, and the number of treated and control units matched is the same for all).
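
The "most frequent stratum" rule can be sketched in base R. The stratum labels below are invented, the helper most_frequent is hypothetical, and the tie-breaking behaviour (first label in sorted order) is an assumption; this is not taken from the cem source.

```r
# Invented stratum labels: one row per observation, one column per
# imputed data set.
strata_by_imputation <- cbind(imp1 = c(1, 2, 3),
                              imp2 = c(1, 2, 4),
                              imp3 = c(1, 5, 4))

# Hypothetical helper: return the label that occurs most often in v.
most_frequent <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}

# One stratum per observation, shared across all imputed data sets.
assigned <- apply(strata_by_imputation, 1, most_frequent)
print(assigned)
```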

Value

Returns an object of class cem.match if only data is non-NULL; otherwise returns an object of class cem.match.list, which is a list of objects of class cem.match plus a field called unique, which is TRUE only if data and datalist are both non-NULL. A cem.match object is a list with the following slots:

call        the call
strata      vector giving the stratum number to which each observation belongs, NA if the observation has not been matched
n.strata    number of strata generated
vars        names of the variables used for the match
drop        variables removed from the match
breaks      named list of cutpoints, possibly NULL
treatment   name of the treatment variable
groups      factor, each observation belongs to one group generated by the treatment variable
n.groups    number of groups identified by the treatment variable
group.idx   named list, indices of the observations belonging to each group
group.len   sizes of the groups
tab         summary table of matched units by group
imbalance   NULL or a vector of imbalances. See Details.

Author(s)

Stefano Iacus, Gary King, and Giuseppe Porro

References

Stefano Iacus, Gary King, Giuseppe Porro, "Matching for Causal Inference Without Balance Checking," http://gking.harvard.edu/files/abs/cem-abs.shtml

Examples

data(LL)

todrop <- c("treated","re78")

imbalance(LL$treated, LL, drop=todrop)

# cem match: automatic bin choice
mat <- cem(treatment="treated", data=LL, drop="re78")
mat

# cem match: user-chosen coarsening
re74cut <- hist(LL$re74, breaks=seq(0, max(LL$re74)+1000, by=1000), plot=FALSE)$breaks
re75cut <- hist(LL$re75, breaks=seq(0, max(LL$re75)+1000, by=1000), plot=FALSE)$breaks
agecut <- hist(LL$age, breaks=seq(15, 55, length.out=14), plot=FALSE)$breaks
mycp <- list(re75=re75cut, re74=re74cut, age=agecut)
mat <- cem(treatment="treated", data=LL, drop="re78", cutpoints=mycp)
mat

# cem match: user-chosen coarsening, k-to-k matching
mat <- cem(treatment="treated", data=LL, drop="re78", cutpoints=mycp, k2k=TRUE)
mat

# Mahalanobis matching: we use MatchIt
if(require(MatchIt)){
  mah <- matchit(treated~age+education+re74+re75+black+hispanic+nodegree+married+u74+u75,
                 distance="mahalanobis", data=LL)
  print(mah)
  # imbalance after Mahalanobis matching
  imbalance(LL$treated, LL, drop=todrop, weights=mah$weights)
}

# Multiply imputed data
# making use of Amelia for multiple imputation
if(require(Amelia)){
  data(LL)
  n <- dim(LL)[1]
  k <- dim(LL)[2]

  set.seed(123)

  # set one randomly chosen covariate to NA in 30% of the rows
  LL1 <- LL
  idx <- sample(1:n, .3*n)
  invisible(sapply(idx, function(x) LL1[x,sample(2:k,1)] <<- NA))

  # (in Amelia II the imputed data sets are stored in the $imputations element)
  imputed <- amelia(LL1,noms=c("black","hispanic","treated","married",
                               "nodegree","u74","u75"))[1:5]

  # without information on which observation has missing values
  mat1 <- cem("treated", datalist=imputed, drop="re78")
  print(mat1)

  # ATT estimation
  out <- att(mat1, re78 ~ treated, data=imputed)

  # with information about missingness
  mat2 <- cem("treated", datalist=imputed, drop="re78", data=LL1)
  print(mat2)

  # ATT estimation
  out <- att(mat2, re78 ~ treated, data=imputed)
}

[Package cem version 1.0.90 Index]