GenMatch {Matching}    R Documentation

Genetic Matching

Description

This function finds optimal balance using multivariate matching, where a genetic search algorithm determines the weight each covariate is given. That is, it finds the weight each variable should be given by Match so as to achieve balance. Balance is determined by a variety of univariate tests, mainly paired t-tests for dichotomous variables and univariate Kolmogorov-Smirnov (KS) tests for multinomial and continuous variables. The loss criterion defining optimal balance is determined by the loss option. The object returned by GenMatch can be supplied as the Weight.matrix option of the Match function to obtain estimates. GenMatch, via the cluster option, supports the use of multiple computers, CPUs or cores to perform parallel computations.
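The balance statistics described above can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated matched pairs, balance_pvalues is a hypothetical helper rather than a function of the Matching package, and the plain ks.test from the stats package stands in for the bootstrap KS test.

```r
# Simulated matched pairs (row i of 'treated' is matched to row i of 'control').
set.seed(1)
n <- 50
treated <- cbind(black = rbinom(n, 1, 0.5), age = rnorm(n, 30, 5))
control <- cbind(black = rbinom(n, 1, 0.5), age = rnorm(n, 31, 5))

# Hypothetical helper: one p-value per covariate.
# Dichotomous columns get a paired t-test, all others a two-sample KS test.
balance_pvalues <- function(tr, co) {
  sapply(seq_len(ncol(tr)), function(j) {
    dichotomous <- length(unique(c(tr[, j], co[, j]))) == 2
    if (dichotomous) {
      t.test(tr[, j], co[, j], paired = TRUE)$p.value
    } else {
      ks.test(tr[, j], co[, j])$p.value
    }
  })
}

pvals <- balance_pvalues(treated, control)
lowest <- min(pvals)  # the most discrepant covariate drives the fit
```

A candidate set of weights is judged by p-values of this kind, computed on the matched data that the weights produce.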

Usage

GenMatch(Tr, X, BalanceMatrix=X, estimand="ATT", M=1,
         weights=rep(1,length(Tr)),
         pop.size = 50, max.generations=100,
         wait.generations=4, hard.generation.limit=FALSE,
         starting.values=rep(1,ncol(X)),
         data.type.integer=TRUE,
         MemoryMatrix=TRUE,
         exact=NULL, caliper=NULL, 
         nboots=0, ks=TRUE, verbose=FALSE,
         tolerance = 1e-05,
         distance.tolerance=tolerance,
         min.weight=0, max.weight=1000,
         Domains=NULL, print.level=2,
         project.path=NULL,
         paired=TRUE, loss=1,
         restrict=NULL,
         cluster=FALSE, balance=TRUE, ...)

Arguments

Tr A vector indicating the observations which are in the treatment regime and those which are not. This can either be a logical vector or a real vector where 0 denotes control and 1 denotes treatment.
X A matrix containing the variables we wish to match on. This matrix may contain the actual observed covariates or the propensity score or a combination of both.
BalanceMatrix A matrix containing the variables we wish to achieve balance on. This is by default equal to X, but it can in principle be a matrix which contains more or fewer variables than X or variables which are transformed in various ways. See the examples.
estimand A character string for the estimand. The default estimand is "ATT", the sample average treatment effect for the treated. "ATE" is the sample average treatment effect (for all), and "ATC" is the sample average treatment effect for the controls.
M A scalar for the number of matches which should be found (with replacement). The default is one-to-one matching.
weights A vector the same length as Tr which provides observation-specific weights.
pop.size Population Size. This is the number of individuals genoud uses to solve the optimization problem. See genoud for more details.
max.generations Maximum Generations. This is the maximum number of generations that genoud will run when attempting to optimize a function. This is a soft limit. The maximum generation limit will be binding for genoud only if hard.generation.limit has been set equal to TRUE. If it has not been set equal to TRUE, wait.generations controls when genoud stops. See genoud for more details.
wait.generations If there is no improvement in the objective function in this number of generations, genoud will think that it has found the optimum. The other variables controlling termination are max.generations and hard.generation.limit.
hard.generation.limit This logical variable determines if the max.generations variable is a binding constraint for genoud. If hard.generation.limit is FALSE, then genoud may exceed the max.generations count if the objective function has improved within a given number of generations (determined by wait.generations).
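The interaction of max.generations, wait.generations and hard.generation.limit can be sketched as a small decision function. should_stop is a hypothetical helper written for this illustration, not genoud's implementation, and the exact generation counting inside genoud may differ from what is assumed here.

```r
# Hypothetical helper: decide whether the search stops at 'generation',
# given the generation in which the objective last improved.
should_stop <- function(generation, last_improvement,
                        max.generations = 100, wait.generations = 4,
                        hard.generation.limit = FALSE) {
  stalled  <- (generation - last_improvement) >= wait.generations
  at_limit <- generation >= max.generations
  # A hard limit is binding as soon as it is reached.
  if (hard.generation.limit && at_limit) return(TRUE)
  # Otherwise the limit is soft: only a lack of recent improvement stops the run.
  stalled
}

soft <- should_stop(100, 98)                                # recent improvement, soft limit
hard <- should_stop(100, 98, hard.generation.limit = TRUE)  # hard limit reached
wait <- should_stop(10, 5)                                  # no improvement for 5 generations
```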
starting.values A vector whose length equals the number of variables in X. This vector contains the starting weight each of the variables is given. The starting.values vector is a way for the user to insert one individual into the starting population. genoud will randomly create the other individuals. These values correspond to the diagonal of the Weight.matrix as described in detail in the Match function.
data.type.integer By default only integer weights are considered. If this option is set to FALSE, the search will be done over floating point weights. This is usually an unnecessary degree of precision.
MemoryMatrix This variable controls if genoud sets up a memory matrix. Such a matrix ensures that genoud will request the fitness evaluation of a given set of parameters only once. The variable may be TRUE or FALSE. If it is FALSE, genoud will be aggressive in conserving memory. The most significant negative implication of this variable being set to FALSE is that genoud will no longer maintain a memory matrix of all evaluated individuals. Therefore, genoud may request evaluations which it has already previously requested. When the number of variables in X is large, the memory matrix consumes a large amount of RAM.

genoud's memory matrix will require significantly less memory if the user sets hard.generation.limit equal to TRUE. Doing this is a good way of conserving memory while still making use of the memory matrix structure.
exact A logical scalar or vector for whether exact matching should be done. If a logical scalar is provided, that logical value is applied to all covariates of X. If a logical vector is provided, a logical value should be provided for each covariate in X. Using a logical vector allows the user to specify exact matching for some but not other variables. When exact matches are not found, observations are dropped. distance.tolerance determines what is considered to be an exact match. The exact option takes precedence over the caliper option. Obviously, if exact matching is done using all of the covariates, one should not be using GenMatch unless the distance.tolerance has been set unusually high.
caliper A scalar or vector denoting the caliper(s) which should be used when matching. A caliper is the distance which is acceptable for any match. Observations which are outside of the caliper are dropped. If a scalar caliper is provided, this caliper is used for all covariates in X. If a vector of calipers is provided, a caliper value should be provided for each covariate in X. The caliper is interpreted to be in standardized units. For example, caliper=.25 means that all matches not equal to or within .25 standard deviations of each covariate in X are dropped. The ecaliper object which is returned by GenMatch shows the enforced caliper on the scale of the X variables.
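To make the standardized units concrete, the following sketch converts a caliper of .25 into the scale of each covariate. It assumes the enforced caliper is the caliper times the unweighted sample standard deviation of each column; the data, the name ecaliper_sketch and the helper within_caliper are made up for illustration, and GenMatch's own computation may differ, for example when observation weights are supplied.

```r
# Simulated covariates, not the lalonde data.
set.seed(2)
X <- cbind(age = rnorm(100, 30, 8), educ = rnorm(100, 10, 2))

caliper <- 0.25
# Assumed conversion: standardized caliper times each column's standard deviation.
ecaliper_sketch <- caliper * apply(X, 2, sd)

# Hypothetical helper: is the pair (i, j) within the caliper on every covariate?
within_caliper <- function(i, j) {
  all(abs(X[i, ] - X[j, ]) <= ecaliper_sketch)
}

same_obs <- within_caliper(1, 1)  # an observation is trivially within caliper of itself
```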
nboots The number of bootstrap samples to be run for the ks test.
ks A logical flag for whether the univariate bootstrap Kolmogorov-Smirnov (KS) test should be calculated. If the ks option is set to TRUE, the univariate KS test is calculated for all non-dichotomous variables. The bootstrap KS test is consistent even for non-continuous variables. If a given variable is dichotomous, a t-test is used even if the KS test is requested. See ks.boot for more details.
verbose A logical flag for whether details of each fit evaluation done by the genetic algorithm should be printed. verbose is set to FALSE if the cluster option is used.
tolerance This is a scalar which is used to determine numerical tolerances. This option is used by numerical routines such as those used to determine if a matrix is singular.
distance.tolerance This is a scalar which is used to determine if distances between two observations are different from zero. Values less than distance.tolerance are deemed to be equal to zero. This option can be used to perform a type of optimal matching.
min.weight This is the minimum weight any variable may be given.
max.weight This is the maximum weight any variable may be given.
Domains This is an ncol(X) by 2 matrix. The first column is the lower bound, and the second column is the upper bound for each variable over which genoud will search for weights. If the user does not provide this matrix, the bounds for each variable will be determined by the min.weight and max.weight options.
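A Domains matrix of the shape described above can be built as follows. This is a minimal sketch: X is a placeholder matrix with three covariates, and the tighter bounds chosen for the second variable are arbitrary.

```r
# Placeholder covariate matrix with three variables.
X <- matrix(0, nrow = 5, ncol = 3)

min.weight <- 0
max.weight <- 1000

# One row per variable in X: lower bound in column 1, upper bound in column 2.
Domains <- cbind(rep(min.weight, ncol(X)), rep(max.weight, ncol(X)))

# Restrict the search for the second variable to weights between 1 and 10.
Domains[2, ] <- c(1, 10)
```

The resulting matrix would then be passed as the Domains argument in place of relying on min.weight and max.weight.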
print.level This option controls the level of printing. There are four possible levels: 0 (minimal printing), 1 (normal), 2 (detailed), and 3 (debug). If level 2 is selected, GenMatch will print details about the population at each generation, including the best individual found so far. If debug level printing is requested, details of the genoud population are printed in the "genoud.pro" file which is located in the temporary R directory returned by the tempdir function. See the project.path option for more details. Because GenMatch runs may take a long time, it is important for the user to receive feedback. Hence, print level 2 has been set as the default.
project.path This is the path of the genoud project file. By default no file is produced unless print.level=3. In that case, genoud places its output in a file called "genoud.pro" located in the temporary directory provided by tempdir. If a file path is provided to the project.path option, a file will be created regardless of the print.level. The behavior of the project file, however, will depend on the print.level chosen. If the print.level variable is set to 1, then the project file is rewritten after each generation. Therefore, only the currently fully completed generation is included in the file. If the print.level variable is set to 2 or higher, then each new generation is simply appended to the project file. No project file is generated for print.level=0.
paired A flag for whether the paired t.test should be used when determining balance.
loss The loss function to be optimized. The default value, 1, implies "lexical" optimization: all of the balance statistics will be sorted from the most discrepant to the least and weights will be picked which minimize the maximum discrepancy. If multiple sets of weights result in the same maximum discrepancy, then the second largest discrepancy is examined to choose the best weights. The process continues iteratively until ties are broken.

If the value of 2 is used, then only the maximum discrepancy is examined. This was the default behavior prior to version 1.0. The user may also pass in any function she desires. Note that the option 1 corresponds to the sort function and option 2 to the min function. Any user specified function should expect a vector of balance statistics ("p-values") and it should return either a vector of values (in which case "lexical" optimization will be done) or a scalar value (which will be maximized). Some possible alternative functions are mean or median.
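The difference between the two built-in loss options can be illustrated with two made-up vectors of balance p-values. lexically_better and my_loss are hypothetical helpers written for this sketch; they are not part of the Matching package, and the comparison assumes both vectors have the same length.

```r
# Made-up p-values for two candidate sets of weights.
pvals_a <- c(0.40, 0.12, 0.75)
pvals_b <- c(0.12, 0.30, 0.90)

# loss=1 corresponds to sort: smallest p-value (largest discrepancy) first.
sorted_a <- sort(pvals_a)
sorted_b <- sort(pvals_b)

# loss=2 corresponds to min: only the smallest p-value is examined.
tie_under_min <- min(pvals_a) == min(pvals_b)

# Hypothetical helper: TRUE if 'a' beats 'b' at the first position
# where the sorted p-values differ.
lexically_better <- function(a, b) {
  sa <- sort(a)
  sb <- sort(b)
  d <- which(sa != sb)
  length(d) > 0 && sa[d[1]] > sb[d[1]]
}

a_wins <- lexically_better(pvals_a, pvals_b)

# A user-supplied loss returning a scalar, here the mean p-value.
my_loss <- function(pvals) mean(pvals)
```

Here both candidates tie on the smallest p-value, so the min rule cannot separate them, while the lexical rule prefers the first candidate because its second-smallest p-value is larger.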
restrict A matrix which restricts the possible matches. This matrix has one row for each restriction and three columns. The first two columns contain the two observation numbers which are to be restricted (for example 4 and 20), and the third column is the restriction imposed on the observation-pair. Negative numbers in the third column imply that the two observations cannot be matched under any circumstances, and positive numbers are passed on as the distance between the two observations for the matching algorithm. The most commonly used positive restriction is 0 which implies that the two observations will always be matched.

Exclusion restrictions are even more common. For example, if we want to exclude the observation pair 4 and 20 and the pair 6 and 55 from being matched, the restrict matrix would be: restrict=rbind(c(4,20,-1),c(6,55,-1))
cluster This can either be an object of the 'cluster' class returned by one of the makeCluster commands in the snow package or a vector of machine names so GenMatch can set up the cluster automatically. If it is the latter, the vector should look like:
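A restrict matrix combining both kinds of restriction might be built as in the sketch below. The two exclusions are the ones from the text; the zero-distance pair (3 and 15) is made up for illustration.

```r
# Columns: first observation number, second observation number, restriction.
restrict <- rbind(c(4, 20, -1),  # observations 4 and 20 may never be matched
                  c(6, 55, -1),  # observations 6 and 55 may never be matched
                  c(3, 15,  0))  # hypothetical pair given a distance of 0

n_exclusions <- sum(restrict[, 3] < 0)
```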
c("localhost","musil","musil","deckard").
This vector would create a cluster with four nodes: one on the localhost, another on "deckard", and two on the machine named "musil". Two nodes on a given machine make sense if the machine has two or more chips/cores. GenMatch will set up a SOCK cluster by a call to makeSOCKcluster. This will require the user to type in her password for each node as the cluster is by default created via ssh. A username can be added to the machine name if it differs from that of the current shell: "username@musil". Other cluster types, such as PVM and MPI, which do not require passwords can be created by directly calling makeCluster, and then passing the returned cluster object to GenMatch. For an example of how to manually set up a cluster with a direct call to makeCluster see http://sekhon.polisci.berkeley.edu/matching/R/cluster_manual.R. For an example of how to get around a firewall by ssh tunneling see: http://sekhon.polisci.berkeley.edu/matching/R/cluster_manual_tunnel.R.
balance This logical flag controls if load balancing is done across the cluster. Load balancing can result in better cluster utilization; however, increased communication can reduce performance. This option is best used if each individual call to Match takes at least several minutes to calculate or if the nodes in the cluster vary significantly in their performance. If cluster==FALSE, this option has no effect.
... Other options which are passed on to genoud.

Value

value The lowest p-value of the matched dataset.
par A vector of the weights given to each variable in X.
Weight.matrix A matrix whose diagonal corresponds to the weight given to each variable in X. This object corresponds to the Weight.matrix in the Match function.
matches A matrix where the first column contains the row numbers of the treated observations in the matched dataset. The second column contains the row numbers of the control observations. And the third column contains the weight that each matched pair is given. These columns respectively correspond to the index.treated, index.control and weights objects which are returned by Match.
ecaliper The size of the enforced caliper on the scale of the X variables. This object has the same length as the number of covariates in X.
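The relationship between the returned components can be sketched with made-up values. The sketch assumes that Weight.matrix is simply a diagonal matrix built from par, which is what the descriptions above suggest; the names par_weights and pair.weights and all numbers are illustrative only.

```r
# Made-up weights for three covariates, standing in for the 'par' component.
par_weights <- c(2, 0, 15)

# Assumed relationship: the weights sit on the diagonal of Weight.matrix.
# nrow is given explicitly so a single weight does not expand into an identity matrix.
Weight.matrix <- diag(par_weights, nrow = length(par_weights))

# Made-up 'matches' component: treated row, control row, pair weight.
matches <- rbind(c(1,  7, 1.0),
                 c(2,  9, 0.5),   # treated observation 2 has two tied controls,
                 c(2, 11, 0.5))   # each pair carrying half the weight

index.treated <- matches[, 1]
index.control <- matches[, 2]
pair.weights  <- matches[, 3]
```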

Author(s)

Jasjeet S. Sekhon, UC Berkeley, sekhon@berkeley.edu, http://sekhon.polisci.berkeley.edu/.

References

Diamond, Alexis and Jasjeet S. Sekhon. 2005. "Genetic Matching for Estimating Causal Effects: A General Multivariate Matching Method for Achieving Balance in Observational Studies." Working Paper. http://sekhon.polisci.berkeley.edu/papers/GenMatch.pdf

Sekhon, Jasjeet Singh and Walter R. Mebane, Jr. 1998. "Genetic Optimization Using Derivatives: Theory and Application to Nonlinear Models." Political Analysis, 7: 187-210. http://sekhon.polisci.berkeley.edu/genoud/genoud.pdf

See Also

Also see Match, summary.Match, MatchBalance, genoud, balanceMV, balanceUV, ks.boot, GerberGreenImai, lalonde

Examples

set.seed(38913)

data(lalonde)
attach(lalonde)

#The covariates we want to match on
X = cbind(age, educ, black, hisp, married, nodegr, u74, u75, re75, re74);

#The covariates we want to obtain balance on
BalanceMat <- cbind(age, educ, black, hisp, married, nodegr, u74, u75, re75, re74,
                    I(re74*re75));

#Let's call GenMatch() to find the optimal weight to give each
#covariate in 'X' so that we achieve balance on the covariates in
#'BalanceMat'. This is only an example, so we want GenMatch to be quick:
#the population size has been set to only 16 via the 'pop.size'
#option.
genout <- GenMatch(Tr=treat, X=X, BalanceMatrix=BalanceMat, estimand="ATE", M=1,
                   pop.size=16, max.generations=10, wait.generations=1)

#The outcome variable
Y=re78/1000;

# Now that GenMatch() has found the optimal weights, let's estimate
# our causal effect of interest using those weights
mout <- Match(Y=Y, Tr=treat, X=X, estimand="ATE", Weight.matrix=genout)
summary(mout)

#                        
#Let's determine if balance has actually been obtained on the variables of interest
#                        
mb <- MatchBalance(treat~age +educ+black+ hisp+ married+ nodegr+ u74+ u75+
                   re75+ re74+ I(re74*re75),
                   match.out=mout, nboots=500, ks=TRUE, mv=FALSE)

# For more examples see: http://sekhon.polisci.berkeley.edu/matching/R.

[Package Matching version 1.8-6 Index]