--- title: "Estimating Effects After Matching" author: "Noah Greifer" date: "`r Sys.Date()`" output: html_vignette: toc: true vignette: > %\VignetteIndexEntry{Estimating Effects After Matching} %\VignetteEngine{knitr::rmarkdown_notangle} %\VignetteEncoding{UTF-8} bibliography: references.bib link-citations: true --- ```{=html} ``` ```{r, include = FALSE} knitr::opts_chunk$set(echo = TRUE, eval=T) options(width = 200, digits= 4) me_ok <- requireNamespace("marginaleffects", quietly = TRUE) && requireNamespace("sandwich", quietly = TRUE) su_ok <- requireNamespace("survival", quietly = TRUE) boot_ok <- requireNamespace("boot", quietly = TRUE) ``` ```{r, include = FALSE} #Generating data similar to Austin (2009) for demonstrating treatment effect estimation gen_X <- function(n) { X <- matrix(rnorm(9 * n), nrow = n, ncol = 9) X[,5] <- as.numeric(X[,5] < .5) X } #~20% treated gen_A <- function(X) { LP_A <- - 1.2 + log(2)*X[,1] - log(1.5)*X[,2] + log(2)*X[,4] - log(2.4)*X[,5] + log(2)*X[,7] - log(1.5)*X[,8] P_A <- plogis(LP_A) rbinom(nrow(X), 1, P_A) } # Continuous outcome gen_Y_C <- function(A, X) { 2*A + 2*X[,1] + 2*X[,2] + 2*X[,3] + 1*X[,4] + 2*X[,5] + 1*X[,6] + rnorm(length(A), 0, 5) } #Conditional: # MD: 2 #Marginal: # MD: 2 # Binary outcome gen_Y_B <- function(A, X) { LP_B <- -2 + log(2.4)*A + log(2)*X[,1] + log(2)*X[,2] + log(2)*X[,3] + log(1.5)*X[,4] + log(2.4)*X[,5] + log(1.5)*X[,6] P_B <- plogis(LP_B) rbinom(length(A), 1, P_B) } #Conditional: # OR: 2.4 # logOR: .875 #Marginal: # RD: .144 # RR: 1.54 # logRR: .433 # OR: 1.92 # logOR .655 # Survival outcome gen_Y_S <- function(A, X) { LP_S <- -2 + log(2.4)*A + log(2)*X[,1] + log(2)*X[,2] + log(2)*X[,3] + log(1.5)*X[,4] + log(2.4)*X[,5] + log(1.5)*X[,6] sqrt(-log(runif(length(A)))*2e4*exp(-LP_S)) } #Conditional: # HR: 2.4 # logHR: .875 #Marginal: # HR: 1.57 # logHR: .452 set.seed(19599) n <- 2000 X <- gen_X(n) A <- gen_A(X) Y_C <- gen_Y_C(A, X) Y_B <- gen_Y_B(A, X) Y_S <- gen_Y_S(A, X) d <- data.frame(A, X, Y_C, Y_B, Y_S) ``` ## Introduction After assessing balance and deciding on a matching specification, it comes time to estimate the effect of the treatment in the matched sample. How the effect is estimated and interpreted depends on the desired estimand and the type of model used (if any). In addition to estimating effects, estimating the uncertainty of the effects is critical in communicating them and assessing whether the observed effect is compatible with there being no effect in the population. This guide explains how to estimate effects after various forms of matching and with various outcome types. There may be situations that are not covered here for which additional methodological research may be required, but some of the recommended methods here can be used to guide such applications. This guide is structured as follows: first, information on the concepts related to effect and standard error (SE) estimation is presented below. Then, instructions for how to estimate effects and SEs are described for the standard case (matching for the ATT with a continuous outcome) and some other common circumstances. Finally, recommendations for reporting results and tips to avoid making common mistakes are presented. ### Identifying the estimand Before an effect is estimated, the estimand must be specified and clarified. Although some aspects of the estimand depend not only on how the effect is estimated after matching but also on the matching method itself, other aspects must be considered at the time of effect estimation and interpretation. Here, we consider three aspects of the estimand: the population the effect is meant to generalize to (the target population), the effect measure, and whether the effect is marginal or conditional. **The target population.** Different matching methods allow you to estimate effects that can generalize to different target populations. The most common estimand in matching is the average treatment effect in the treated (ATT), which is the average effect of treatment for those who receive treatment. This estimand is estimable for matching methods that do not change the treated units (i.e., by weighting or discarding units) and is requested in `matchit()` by setting `estimand = "ATT"` (which is the default). The average treatment effect in the population (ATE) is the average effect of treatment for the population from which the sample is a random sample. This estimand is estimable only for methods that allow the ATE and either do not discard units from the sample or explicit target full sample balance, which in `MatchIt` is limited to full matching, subclassification, and profile matching when setting `estimand = "ATE"`. When treated units are discarded (e.g., through the use of common support restrictions, calipers, cardinality matching, or [coarsened] exact matching), the estimand corresponds to neither the population ATT nor the population ATE, but rather to an average treatment effect in the remaining matched sample (ATM), which may not correspond to any specific target population. See @greiferChoosingEstimandWhen2021 for a discussion on the substantive considerations involved when choosing the target population of the estimand. **Marginal and conditional effects.** A marginal effect is a comparison between the expected potential outcome under treatment and the expected potential outcome under control. This is the same quantity estimated in randomized trials without blocking or covariate adjustment and is particularly useful for quantifying the overall effect of a policy or population-wide intervention. A conditional effect is the comparison between the expected potential outcomes in the treatment groups within strata. This is useful for identifying the effect of a treatment for an individual patient or a subset of the population. **Effect measures.** The outcome types we consider here are continuous, with the effect measured by the mean difference; binary, with the effect measured by the risk difference (RD), risk ratio (RR), or odds ratio (OR); and time-to-event (i.e., survival), with the effect measured by the hazard ratio (HR). The RR, OR, and HR are *noncollapsible* effect measures, which means the marginal effect on that scale is not a (possibly) weighted average of the conditional effects within strata, even if the stratum-specific effects are of the same magnitude. For these effect measures, it is critical to distinguish between marginal and conditional effects because different statistical methods target different types of effects. The mean difference and RD are *collapsible* effect measures, so the same methods can be used to estimate marginal and conditional effects. Our primary focus will be on marginal effects, which are appropriate for all effect measures, easily interpretable, and require few modeling assumptions. The "Common Mistakes" section includes examples of commonly used methods that estimate conditional rather than marginal effects and should not be used when marginal effects are desired. ### G-computation To estimate marginal effects, we use a method known as g-computation [@snowdenImplementationGComputationSimulated2011] or regression estimation [@schaferAverageCausalEffects2008]. This involves first specifying a model for the outcome as a function of the treatment and covariates. Then, for each unit, we compute their predicted values of the outcome setting their treatment status to treated, and then again for control, leaving us with two predicted outcome values for each unit, which are estimates of the potential outcomes under each treatment level. We compute the mean of each of the estimated potential outcomes across the entire sample, which leaves us with two average estimated potential outcomes. Finally, the contrast of these average estimated potential outcomes (e.g., their difference or ratio, depending on the effect measure desired) is the estimate of the treatment effect. When doing g-computation after matching, a few additional considerations are required. First, when we take the average of the estimated potential outcomes under each treatment level, this must be a weighted average that incorporates the matching weights. Second, if we want to target the ATT or ATC, we only estimate potential outcomes for the treated or control group, respectively (though we still generate predicted values under both treatment and control). G-computation as a framework for estimating effects after matching has a number of advantages over other approaches. It works the same regardless of the form of the outcome model or type of outcome (e.g., whether a linear model is used for a continuous outcome or a logistic model is used for a binary outcome); the only difference might be how the average expected potential outcomes are contrasted in the final step. In simple cases, the estimated effect is numerically identical to effects estimated using other methods; for example, if no covariates are included in the outcome model, the g-computation estimate is equal to the difference in means from a t-test or coefficient of the treatment in a linear model for the outcome. There are analytic approximations to the SEs of the g-computation estimate, and these SEs can incorporate pair/subclass membership (described in more detail below). For all these reasons, we use g-computation when possible for all effect estimates, even if there are simpler methods that would yield the same estimates. Using a single workflow (with some slight modifications depending on the context; see below) facilitates implementing best practices regardless of what choices a user makes. ### Modeling the Outcome The goal of the outcome model is to generate good predictions for use in the g-computation procedure described above. The type and form of the outcome model should depend on the outcome type. For continuous outcomes, one can use a linear model regressing the outcome on the treatment; for binary outcomes, one can use a generalized linear model with, e.g., a logistic link; for time-to-event outcomes, one can use a Cox proportional hazards model. An additional decision to make is whether (and how) to include covariates in the outcome model. One may ask, why use matching at all if you are going to model the outcome with covariates anyway? Matching reduces the dependence of the effect estimate on correct specification of the outcome model; this is the central thesis of @ho2007. Including covariates in the outcome model after matching has several functions: it can increase precision in the effect estimate, reduce the bias due to residual imbalance, and make the effect estimate "doubly robust", which means it is consistent if either the matching reduces sufficient imbalance in the covariates or if the outcome model is correct. For these reasons, we recommend covariate adjustment after matching when possible. There is some evidence that covariate adjustment is most helpful for covariates with standardized mean differences greater than .1 [@nguyen2017], so these covariates and covariates thought to be highly predictive of the outcome should be prioritized in treatment effect models if not all can be included due to sample size constraints. Although there are many possible ways to include covariates (e.g., not just main effects but interactions, smoothing terms like splines, or other nonlinear transformations), it is important not to engage in specification search (i.e., trying many outcomes models in search of the "best" one). Doing so can invalidate results and yield a conclusion that fails to replicate. For this reason, we recommend only including the same terms included in the propensity score model unless there is a strong *a priori* and justifiable reason to model the outcome differently. It is important not to interpret the coefficients and tests of covariates in the outcome model. These are not causal effects and their estimates may be severely confounded. Only the treatment effect estimate can be interpreted as causal assuming the relevant assumptions about unconfoundedness are met. Inappropriately interpreting the coefficients of covariates in the outcome model is known as the Table 2 fallacy [@westreich2013]. To avoid this, we only display the results of the g-computation procedure and do not examine or interpret the outcome models themselves. ### Estimating Standard Errors and Confidence Intervals Uncertainty estimation (i.e., of SEs, confidence intervals, and p-values) may consider the variety of sources of uncertainty present in the analysis, including (but not limited to!) estimation of the propensity score (if used), matching (i.e., because treated units might be matched to different control units if others had been sampled), and estimation of the treatment effect (i.e., because of sampling error). In general, there are no analytic solutions to all these issues, so much of the research done on uncertainty estimation after matching has relied on simulation studies. The two primary methods that have been shown to perform well in matched samples are using cluster-robust SEs and the bootstrap, described below. To compute SEs after g-computation, a method known as the delta method is used; this is a way to compute the SEs of the derived quantities (the expected potential outcomes and their contrast) from the variance of the coefficients of the outcome models. For nonlinear models (e.g., logistic regression), the delta method is only an approximation subject to error (though in many cases this error is small and shrinks in large samples). Because the delta method relies on the variance of the coefficients from the outcome model, it is important to correctly estimate these variances, using either robust or cluster-robust methods as described below. #### Robust and Cluster-Robust Standard Errors **Robust standard errors.** Also known as sandwich SEs (due to the form of the formula for computing them), heteroscedasticity-consistent SEs, or Huber-White SEs, robust SEs are an adjustment to the usual maximum likelihood or ordinary least squares SEs that are robust to violations of some of the assumptions required for usual SEs to be valid [@mackinnon1985]. Although there has been some debate about their utility [@king2015], robust SEs rarely degrade inferences and often improve them. Generally, robust SEs **must** be used when any non-uniform weights are included in the estimation (e.g., with matching with replacement or inverse probability weighting). **Cluster-robust standard errors.** A version of robust SEs known as cluster-robust SEs [@liang1986] can be used to account for dependence between observations within clusters (e.g., matched pairs). @abadie2019 demonstrate analytically that cluster-robust SEs are generally valid after matching, whereas regular robust SEs can over- or under-estimate the true sampling variability of the effect estimator depending on the specification of the outcome model (if any) and degree of effect modification. A plethora of simulation studies have further confirmed the validity of cluster-robust SEs after matching [e.g., @austin2009a; @austin2014; @gayat2012; @wan2019; @austin2013]. Given this evidence favoring the use of cluster-robust SEs, we recommend them in most cases and use them judiciously in this guide[^1]. [^1]: Because they are only appropriate with a large number of clusters, cluster-robust SEs are generally not used with subclassification methods. Regular robust SEs are valid with these methods when using the subclassification weights to estimate marginal effects. #### Bootstrapping One problem when using robust and cluster-robust SEs along with the delta method is that the delta method is an approximation, as previously mentioned. One solution to this problem is bootstrapping, which is a technique used to simulate the sampling distribution of an estimator by repeatedly drawing samples with replacement and estimating the effect in each bootstrap sample [@efron1993]. From the bootstrap distribution, SEs and confidence intervals can be computed in several ways, including using the standard deviation of the bootstrap estimates as the SE estimate or using the 2.5 and 97.5 percentiles as 95% confidence interval bounds. Bootstrapping tends to be most useful when no analytic estimator of a SE is possible or has been derived yet. Although @abadie2008 found analytically that the bootstrap is inappropriate for matched samples, simulation evidence has found it to be adequate in many cases [@hill2006; @austin2014; @austin2017]. Typically, bootstrapping involves performing the entire estimation process in each bootstrap sample, including propensity score estimation, matching, and effect estimation. This tends to be the most straightforward route, though intervals from this method may be conservative in some cases (i.e., they are wider than necessary to achieve nominal coverage) [@austin2014]. Less conservative and more accurate intervals have been found when using different forms of the bootstrap, including the wild bootstrap develop by @bodory2020 and the matched/cluster bootstrap described by @austin2014 and @abadie2019. The cluster bootstrap involves sampling matched pairs/strata of units from the matched sample and performing the analysis within each sample composed of the sampled pairs. @abadie2019 derived analytically that the cluster bootstrap is valid for estimating SEs and confidence intervals in the same circumstances cluster robust SEs are; indeed, the cluster bootstrap SE is known to approximate the cluster-robust SE [@cameron2015]. With bootstrapping, more bootstrap replications are always better but can take time and increase the chances that at least one error will occur within the bootstrap analysis (e.g., a bootstrap sample with zero treated units or zero units with an event). In general, numbers of replications upwards of 999 are recommended, with values one less than a multiple of 100 preferred to avoid interpolation when using the percentiles as confidence interval limits [@mackinnon2006]. There are several methods of computing bootstrap confidence intervals, but the bias-corrected accelerated (BCa) bootstrap confidence interval often performs best [@austin2014; @carpenter2000] and is easy to implement, simply by setting `type = "bca"` in the call to `boot::boot.ci()` after running `boot::boot()`[^2]. [^2]: Sometimes, an error will occur with this method, which usually means more bootstrap replications are required. The number of replicates must be greater than the original sample size when using the full bootstrap and greater than the number of pairs/strata when using the block bootstrap. Most of this guide will consider analytic (i.e., non-bootstrapping) approaches to estimating uncertainty; the section "Using Bootstrapping to Estimate Confidence Intervals" describes broadly how to use bootstrapping. Although analytic estimates are faster to compute, in many cases bootstrap confidence intervals are more accurate. ## Estimating Treatment Effects and Standard Errors After Matching Below, we describe effect estimation after matching. We'll be using a simulated toy dataset `d` with several outcome types. Code to generate the dataset is at the end of this document. The focus here is not on evaluating the methods but simply on demonstrating them. In all cases, the correct propensity score model is used. Below we display the first six rows of `d`: ```{r} head(d) ``` `A` is the treatment variable, `X1` through `X9` are covariates, `Y_C` is a continuous outcome, `Y_B` is a binary outcome, and `Y_S` is a survival outcome. We will need to the following packages to perform the desired analyses: - `marginaleffects` provides the `avg_comparisons()` function for performing g-computation and estimating the SEs and confidence intervals of the average estimate potential outcomes and treatment effects - `sandwich` is used internally by `marginaleffects` to compute robust and cluster-robust SEs - `survival` provides `coxph()` to estimate the coefficients in a Cox-proportional hazards model for the marginal hazard ratio, which we will use for survival outcomes. Of course, we also need `MatchIt` to perform the matching. ```{r,message=FALSE,warning=FALSE} library("MatchIt") ``` ```{r,message=FALSE,warning=FALSE, eval=me_ok} library("marginaleffects") ``` All effect estimates will be computed using `marginaleffects::avg_comparions()`, even when its use may be superfluous (e.g., for performing a t-test in the matched set). As previously mentioned, this is because it is useful to have a single workflow that works no matter the situation, perhaps with very slight modifications to accommodate different contexts. Using `avg_comparions()` has several advantages, even when the alternatives are simple: it only provides the effect estimate, and not other coefficients; it automatically incorporates robust and cluster-robust SEs if requested; and it always produces average marginal effects for the correct population if requested. Other packages may be of use but are not used here. There are alternatives to the `marginaleffects` package for computing average marginal effects, including `margins` and `stdReg`. The `survey` package can be used to estimate robust SEs incorporating weights and provides functions for survey-weighted generalized linear models and Cox-proportional hazards models. ### The Standard Case For almost all matching methods, whether a caliper, common support restriction, exact matching specification, or $k$:1 matching specification is used, estimating the effect in the matched dataset is straightforward and involves fitting a model for the outcome that incorporates the matching weights[^3], then estimating the treatment effect using g-computation (i.e., using `marginaleffects::avg_comparisons()`) with a cluster-robust SE to account for pair membership. This procedure is the same for continuous and binary outcomes with and without covariates. [^3]: The matching weights are not necessary when performing 1:1 matching, but we include them here for generality. When weights are not necessary, including them does not affect the estimates. Because it may not always be clear when weights are required, we recommend always including them. There are a few adjustments that need to be made for certain scenarios, which we describe in the section "Adjustments to the Standard Case". These adjustments include for the following cases: when matching for the ATE rather than the ATT, for matching with replacement, for matching with a method that doesn't involve creating pairs (e.g., cardinality and profile matching and coarsened exact matching), for subclassification, for estimating effects with binary outcomes, and for estimating effects with survival outcomes. You must read the Standard Case to understand the basic procedure before reading about these special scenarios. Here, we demonstrate the faster analytic approach to estimating confidence intervals; for the bootstrap approach, see the section "Using Bootstrapping to Estimate Confidence Intervals" below. First, we will perform full matching on the propensity score for the ATT. Remember, all matching methods use this exact procedure or a slight variation, so this section is critical even if you are using a different matching method. ```{r} #Optimal full matching on the PS for the ATT mF <- matchit(A ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9, data = d, method = "full", estimand = "ATT") mF #Extract matched data md <- match.data(mF) head(md) ``` Typically one would assess balance and ensure that this matching specification works, but we will skip that step here to focus on effect estimation. See `vignette("MatchIt")` and `vignette("assessing-balance")` for more information on this necessary step. Because we did not use a caliper, the target estimand is the ATT. We perform all analyses using the matched dataset, `md`, which, for matching methods that involve dropping units, contains only the units retained in the sample. First, we fit a model for the outcome given the treatment and (optionally) the covariates. It's usually a good idea to include treatment-covariate interactions, which we do below, but this is not always necessary, especially when excellent balance has been achieved. You can also include the propensity score (usually labeled `distance` in the `match.data()` output), which can add some robustness, especially when modeled flexibly (e.g., with polynomial terms or splines) [@austinDoublePropensityscoreAdjustment2017]; see [here](https://stats.stackexchange.com/a/580174/116195) for an example. ```{r} #Linear model with covariates fit1 <- lm(Y_C ~ A * (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9), data = md, weights = weights) ``` Next, we use `marginaleffects::avg_comparisons()` to estimate the ATT. ```{r, eval=me_ok} avg_comparisons(fit1, variables = "A", vcov = ~subclass, newdata = subset(A == 1)) ``` Let's break down the call to `avg_comparisons()`: to the first argument, we supply the model fit, `fit1`; to the `variables` argument, the name of the treatment (`"A"`); to the `vcov` argument, a formula with subclass membership (`~subclass`) to request cluster-robust SEs; and to the `newdata` argument, a version of the matched dataset containing only the treated units (`subset(A == 1)`) to request the ATT. Some of these arguments differ depending on the specifics of the matching method and outcome type; see the sections below for information. If, in addition to the effect estimate, we want the average estimated potential outcomes, we can use `marginaleffects::avg_predictions()`, which we demonstrate below. Note the interpretation of the resulting estimates as the expected potential outcomes is only valid if all covariates present in the outcome model (if any) are interacted with the treatment. ```{r, eval=me_ok && packageVersion("marginaleffects") >= "0.11.0"} avg_predictions(fit1, variables = "A", vcov = ~subclass, newdata = subset(A == 1)) ``` We can see that the difference in potential outcome means is equal to the average treatment effect computed previously[^4]. All of the arguments to `avg_predictions()` are the same as those to `avg_comparisons()`. [^4]: To verify that they are equal, supply the output of `avg_predictions()` to `hypotheses(), e.g.,`avg_predictions(...) \|\> hypotheses("revpairwise")`; this explicitly compares the average potential outcomes and should yield identical estimates to the`avg_comparisons()\` call. ### Adjustments to the Standard Case This section explains how the procedure might differ if any of the following special circumstances occur. #### Matching for the ATE When matching for the ATE (including [coarsened] exact matching, full matching, subclassification, and cardinality matching), everything is identical to the Standard Case except that in the calls to `avg_comparisons()` and `avg_predictions()`, the `newdata` argument is omitted. This is because the estimated potential outcomes are computed for the full sample rather than just the treated units. #### Matching with replacement When matching with replacement (i.e., nearest neighbor or genetic matching with `replace = TRUE`), effect and SE estimation need to account for control unit multiplicity (i.e., repeated use) and within-pair correlations [@hill2006; @austin2020a]. Although @abadie2008 demonstrated analytically that bootstrap SEs may be invalid for matching with replacement, simulation work by @hill2006 and @bodory2020 has found that bootstrap SEs are adequate and generally slightly conservative. See the section "Using Bootstrapping to Estimate Confidence Intervals" for instructions on using the bootstrap and an example that use matching with replacement. Because control units do not belong to unique pairs, there is no pair membership in the `match.data()` output. One can simply change `vcov = ~subclass` to `vcov = "HC3"` in the calls to `comparisons()` and `predictions()` to use robust SEs instead of cluster-robust SEs, as recommended by @hill2006. There is some evidence for an alternative approach that incorporates pair membership and adjusts for reuse of control units, though this has only been studied for survival outcomes [@austin2020a]. This adjustment involves using two-way cluster-robust SEs with pair membership and unit ID as the clustering variables. For continuous and binary outcomes, this involves the following two changes: 1) replace `match.data()` with `get_matches()`, which produces a dataset with one row per unit per pair, meaning control units matched to multiple treated units will appear multiple times in the dataset; 2) set `vcov = ~subclass + id` in the calls to `avg_comparisons()` and `avg_predictions()`. For survival outcomes, a special procedure must be used; see the section on survival outcomes below. #### Matching without pairing Some matching methods do not involve creating pairs; these include cardinality and profile matching with `mahvars = NULL` (the default), exact matching, and coarsened exact matching with `k2k = FALSE` (the default). The only change that needs to be made to the Standard Case is that one should change `vcov = ~subclass` to `vcov = "HC3"` in the calls to `avg_comparisons()` and `avg_predictions()` to use robust SEs instead of cluster-robust SEs. Remember that if matching is done for the ATE (even if units are dropped), the `newdata` argument should be dropped. #### Propensity score subclassification There are two natural ways to estimate marginal effects after subclassification: the first is to estimate subclass-specific treatment effects and pool them using an average marginal effects procedure, and the second is to use the stratum weights to estimate a single average marginal effect. This latter approach is also known as marginal mean weighting through stratification (MMWS), and is described in detail by @hong2010[^5]. When done properly, both methods should yield similar or identical estimates of the treatment effect. [^5]: It is also known as fine stratification weighting, described by @desai2017. All of the methods described above for the Standard Case also work with MMWS because the formation of the weights is the same; the only difference is that it is not appropriate to use cluster-robust SEs with MMWS because of how few clusters are present, so one should change `vcov = ~subclass` to `vcov = "HC3"` in the calls to `avg_comparisons()` and `avg_predictions()` to use robust SEs instead of cluster-robust SEs. The subclasses can optionally be included in the outcome model (optionally interacting with treatment) as an alternative to including the propensity score. The subclass-specific approach omits the weights and uses the subclasses directly. It is only appropriate when there are a small number of subclasses relative to the sample size. In the outcome model, `subclass` should interact with all other predictors in the model (including the treatment, covariates, and interactions, if any), and the `weights` argument should be omitted. As with MMWS, one should change `vcov = ~subclass` to `vcov = "HC3"` in the calls to `avg_comparisons()` and `avg_predictions()`. See an example below: ```{r, eval=me_ok} #Subclassification on the PS for the ATT mS <- matchit(A ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9, data = d, method = "subclass", estimand = "ATT") #Extract matched data md <- match.data(mS) fitS <- lm(Y_C ~ subclass * (A * (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9)), data = md) avg_comparisons(fitS, variables = "A", vcov = "HC3", newdata = subset(A == 1)) ``` A model with fewer terms may be required when subclasses are small; removing covariates or their interactions with treatment may be required and can increase precision in smaller datasets. Remember that if subclassification is done for the ATE (even if units are dropped), the `newdata` argument should be dropped. #### Binary outcomes Estimating effects on binary outcomes is essentially the same as for continuous outcomes. The main difference is that there are several measures of the effect one can consider, which include the odds ratio (OR), risk ratio/relative risk (RR), and risk difference (RD), and the syntax to `avg_comparisons()` depends on which one is desired. The outcome model should be one appropriate for binary outcomes (e.g., logistic regression) but is unrelated to the desired effect measure because we can compute any of the above effect measures using `avg_comparisons()` after the logistic regression. To fit a logistic regression model, change `lm()` to `glm()` and set `family = quasibinomial()`[^6]. To compute the marginal RD, we can use exactly the same syntax as in the Standard Case; nothing needs to change[^7]. [^6]: We use `quasibinomial()` instead of `binomial()` simply to avoid a spurious warning that can occur with certain kinds of matching; the results will be identical regardless. [^7]: Note that for low or high average expected risks computed with `avg_predictions()`, the confidence intervals may go below 0 or above 1; this is because an approximation is used. To avoid this problem, bootstrapping or simulation-based inference can be used instead. To compute the marginal RR, we need to add `comparison = "lnratioavg"` to `avg_comparisons()`; this computes the marginal log RR. To get the marginal RR, we need to add `transform = "exp"` to `avg_comparisons()`, which exponentiates the marginal log RR and its confidence interval. The code below computes the effects and displays the statistics of interest: ```{r, eval=me_ok} #Logistic regression model with covariates fit2 <- glm(Y_B ~ A * (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9), data = md, weights = weights, family = quasibinomial()) #Compute effects; RR and confidence interval avg_comparisons(fit2, variables = "A", vcov = ~subclass, newdata = subset(A == 1), comparison = "lnratioavg", transform = "exp") ``` The output displays the marginal RR, its Z-value, the p-value for the Z-test of the log RR against 0, and its confidence interval. (Note that even though the `Contrast` label still suggests the log RR, the RR is actually displayed.) To view the log RR and its standard error, omit the `transform` argument. For the marginal OR, the only thing that needs to change is that `comparison` should be set to `"lnoravg"`. #### Survival outcomes There are several measures of effect size for survival outcomes. When using the Cox proportional hazards model, the quantity of interest is the hazard ratio (HR) between the treated and control groups. As with the OR, the HR is non-collapsible, which means the estimated HR will only be a valid estimate of the marginal HR when no other covariates are included in the model. Other effect measures, such as the difference in mean survival times or probability of survival after a given time, can be treated just like continuous and binary outcomes as previously described. For the HR, we cannot compute average marginal effects and must use the coefficient on treatment in a Cox model fit without covariates[^8]. This means that we cannot use the procedures from the Standard Case. Here we describe estimating the marginal HR using `coxph()` from the `survival` package. (See `help("coxph", package = "survival")` for more information on this model.) To request cluster-robust SEs as recommended by @austin2013a, we need to supply pair membership (stored in the `subclass` column of `md`) to the `cluster` argument and set `robust = TRUE`. For matching methods that don't involve pairing (e.g., cardinality and profile matching and [coarsened] exact matching), we can omit the `cluster` argument (but keep `robust = TRUE`)[^9]. [^8]: It is not immediately clear how to estimate a marginal HR when covariates are included in the outcome model; though @austin2020 describe several ways of including covariates in a model to estimate the marginal HR, they do not develop SEs and little research has been done on this method, so we will not present it here. Instead, we fit a simple Cox model with the treatment as the sole predictor. [^9]: For subclassification, only MMWS can be used; this is done simply by including the stratification weights in the Cox model and omitting the `cluster` argument. ```{r, eval=su_ok} library("survival") #Cox Regression for marginal HR coxph(Surv(Y_S) ~ A, data = md, robust = TRUE, weights = weights, cluster = subclass) ``` The `coef` column contains the log HR, and `exp(coef)` contains the HR. Remember to always use the `robust se` for the SE of the log HR. The displayed z-test p-value results from using the robust SE. For matching with replacement, a special procedure described by @austin2020a can be necessary for valid inference. According to the results of their simulation studies, when the treatment prevalence is low (\<30%), a SE that does not involve pair membership (i.e., the `match.data()` approach, as demonstrated above) is sufficient. When treatment prevalence is higher, the SE that ignores pair membership may be too low, and the authors recommend using a custom SE estimator that uses information about both multiplicity and pairing. Doing so must be done manually for survival models using `get_matches()` and several calls to `coxph()` as demonstrated in the appendix of @austin2020a. We demonstrate this below: ```{r, eval = F} #get_matches() after matching with replacement gm <- get_matches(mR) #Austin & Cafri's (2020) SE estimator fs <- coxph(Surv(Y_S) ~ A, data = gm, robust = TRUE, weights = weights, cluster = subclass) Vs <- fs$var ks <- nlevels(gm$subclass) fi <- coxph(Surv(Y_S) ~ A, data = gm, robust = TRUE, weights = weights, cluster = id) Vi <- fi$var ki <- length(unique(gm$id)) fc <- coxph(Surv(Y_S) ~ A, data = gm, robust = TRUE, weights = weights) Vc <- fc$var kc <- nrow(gm) #Compute the variance and sneak it back into the fit object fc$var <- (ks/(ks-1))*Vs + (ki/(ki-1))*Vi - (kc/(kc-1))*Vc fc ``` The `robust se` column contains the computed SE, and the reported Z-test uses this SE. The `se(coef)` column should be ignored. ### Using Bootstrapping to Estimate Confidence Intervals The bootstrap is an alternative to the delta method for estimating confidence intervals for estimated effects. See the section Bootstrapping above for details. Here, we'll demonstrate two forms of the bootstrap: 1) the standard bootstrap, which involve resampling units and performing matching and effect estimation within each bootstrap sample, and 2) the cluster bootstrap, which involves resampling pairs after matching and estimating the effect in each bootstrap sample. For both, we will use functionality in the `boot` package. It is critical to set a seed using `set.seed()` prior to performing the bootstrap in order for results to be replicable. #### The standard bootstrap For the standard bootstrap, we need a function that takes in the original dataset and a vector of sampled unit indices and returns the estimated quantity of interest. This function should perform the matching on the bootstrap sample, fit the outcome model, and estimate the treatment effect using g-computation. In this example, we'll use matching with replacement, since the standard bootstrap has been found to work well with it [@bodory2020; @hill2006], despite some analytic results recommending otherwise [@abadie2008]. We'll implement g-computation manually rather than using `avg_comparisons()`, as this dramatically improves the speed of the estimation since we don't require standard errors to be estimated in each sample (or other processing `avg_comparisons()` does). We'll consider the marginal RR ATT of `A` on the binary outcome `Y_B`. The first step is to write the estimation function, we call `boot_fun`. This function returns the marginal RR. In it, we perform the matching, estimate the effect, and return the estimate of interest. ```{r} boot_fun <- function(data, i) { boot_data <- data[i,] #Do 1:1 PS matching with replacement m <- matchit(A ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9, data = boot_data, replace = TRUE) #Extract matched dataset md <- match.data(m, data = boot_data) #Fit outcome model fit <- glm(Y_B ~ A * (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9), data = md, weights = weights, family = quasibinomial()) ## G-computation ## #Subset to treated units for ATT; skip for ATE md1 <- subset(md, A == 1) #Estimated potential outcomes under treatment p1 <- predict(fit, type = "response", newdata = transform(md1, A = 1)) Ep1 <- mean(p1) #Estimated potential outcomes under control p0 <- predict(fit, type = "response", newdata = transform(md1, A = 0)) Ep0 <- mean(p0) #Risk ratio return(Ep1 / Ep0) } ``` Next, we call `boot::boot()` with this function and the original dataset supplied to perform the bootstrapping. We'll request 199 bootstrap replications here, but in practice you should use many more, upwards of 999. More is always better. Using more also allows you to use the bias-corrected and accelerated (BCa) bootstrap confidence intervals (which you can request by setting `type = "bca"` in the call to `boot.ci()`), which are known to be the most accurate. See `?boot.ci` for details. Here, we'll just use a percentile confidence interval. ```{r, eval = boot_ok, message=F, warning=F} library("boot") set.seed(54321) boot_out <- boot(d, boot_fun, R = 199) boot_out boot.ci(boot_out, type = "perc") ``` ```{r, include = FALSE} b <- { if (boot_ok) boot::boot.ci(boot_out, type = "perc") else list(t0 = 1.347, percent = c(0, 0, 0, 1.144, 1.891)) } ``` We find a RR of `r round(b$t0, 3)` with a confidence interval of (`r round(b$percent[4], 3)`, `r round(b$percent[5], 3)`). If we had wanted a risk difference, we could have changed the final line in `boot_fun()` to be `Ep1 - Ep0`. #### The cluster bootstrap For the cluster bootstrap, we need a function that takes in a vector of subclass (e.g., pairs) and a vector of sampled pair indices and returns the estimated quantity of interest. This function should fit the outcome model and estimate the treatment effect using g-computation, but the matching step occurs prior to the bootstrap. Here, we'll use matching without replacement, since the cluster bootstrap has been found to work well with it [@austin2014; @abadie2019]. This could be used for any method that returns pair membership, including other pair matching methods without replacement and full matching. As before, we'll use g-computation to estimate the marginal RR ATT, and we'll do so manually rather than using `avg_comparisons()` for speed. Note that the cluster bootstrap is already much faster than the standard bootstrap because matching does not need to occur within each bootstrap sample. First, we'll do a round of matching. ```{r} mNN <- matchit(A ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9, data = d) mNN md <- match.data(mNN) ``` Next, we'll write the function that takes in cluster membership and the sampled indices and returns an estimate. ```{r} #Unique pair IDs pair_ids <- levels(md$subclass) #Unit IDs, split by pair membership split_inds <- split(seq_len(nrow(md)), md$subclass) cluster_boot_fun <- function(pairs, i) { #Extract units corresponding to selected pairs ids <- unlist(split_inds[pairs[i]]) #Subset md with block bootstrapped indices boot_md <- md[ids,] #Fit outcome model fit <- glm(Y_B ~ A * (X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9), data = boot_md, weights = weights, family = quasibinomial()) ## G-computation ## #Subset to treated units for ATT; skip for ATE md1 <- subset(boot_md, A == 1) #Estimated potential outcomes under treatment p1 <- predict(fit, type = "response", newdata = transform(md1, A = 1)) Ep1 <- mean(p1) #Estimated potential outcomes under control p0 <- predict(fit, type = "response", newdata = transform(md1, A = 0)) Ep0 <- mean(p0) #Risk ratio return(Ep1 / Ep0) } ``` Next, we call `boot::boot()` with this function and the vector of pair membership supplied to perform the bootstrapping. We'll request 199 bootstrap replications, but in practice you should use many more, upwards of 999. More is always better. Using more also allows you to use the bias-corrected and accelerated (BCa) boot strap confidence intervals, which are known to be the most accurate. See `?boot.ci` for details. Here, we'll just use a percentile confidence interval. ```{r, eval = boot_ok, message=F, warning=F} library("boot") set.seed(54321) cluster_boot_out <- boot(pair_ids, cluster_boot_fun, R = 199) cluster_boot_out boot.ci(cluster_boot_out, type = "perc") ``` ```{r, include = FALSE} b <- { if (boot_ok) boot::boot.ci(cluster_boot_out, type = "perc") else list(t0 = 1.588, percent = c(0,0,0, 1.348, 1.877)) } ``` We find a RR of `r round(b$t0, 3)` with a confidence interval of (`r round(b$percent[4], 3)`, `r round(b$percent[5], 3)`). If we had wanted a risk difference, we could have changed the final line in `cluster_boot_fun()` to be `Ep1 - Ep0`. ### Moderation Analysis Moderation analysis involves determining whether a treatment effect differs across levels of another variable. The use of matching with moderation analysis is described in @greenExaminingModerationAnalyses2014. The goal is to achieve balance within each subgroup of the potential moderating variable, and there are several ways of doing so. Broadly, one can either perform matching in the full dataset, requiring exact matching on the moderator, or one can perform completely separate analyses in each subgroup. We'll demonstrate the first approach below; see the blog post ["Subgroup Analysis After Propensity Score Matching Using R"](https://ngreifer.github.io/blog/subgroup-analysis-psm/) by Noah Greifer for an example of the other approach. There are benefits to using either approach, and @greenExaminingModerationAnalyses2014 find that either can be successful at balancing the subgroups. The first approach may be most effective with small samples, where separate propensity score models would be fit with greater uncertainty and an increased possibility of perfect prediction or failure to converge [@wangRelativePerformancePropensity2018]. The second approach may be more effective with larger samples or with matching methods that target balance in the matched sample, such as genetic matching [@kreifMethodsEstimatingSubgroup2012]. With genetic matching, separate subgroup analyses ensure balance is optimized within each subgroup rather than just overall. The chosen approach should be that which achieves the best balance, though we don't demonstrate assessing balance here to maintain focus on effect estimation. The full dataset approach involves pooling information across subgroups. This could involve estimating propensity scores using a single model for both groups but exact matching on the potential moderator. The propensity score model could include moderator-by-covariate interactions to allow the propensity score model to vary across subgroups on some covariates. It is critical that exact matching is done on the moderator so that matched pairs are not split across subgroups. We'll consider the binary variable `X5` to be the potential moderator of the effect of `A` on `Y_C`. Below, we'll estimate a propensity score using a single propensity score model with a few moderator-by-covariate interactions. We'll perform nearest neighbor matching on the propensity score and exact matching on the moderator, `X5`. ```{r} mP <- matchit(A ~ X1 + X2 + X5*X3 + X4 + X5*X6 + X7 + X5*X8 + X9, data = d, exact = ~X5, method = "nearest") mP ``` Although it is straightforward to assess balance overall using `summary()`, it is more challenging to assess balance within subgroups. The easiest way to check subgroup balance would be to use `cobalt::bal.tab()`, which has a `cluster` argument that can be used to assess balance within subgroups, e.g., by `cobalt::bal.tab(mP, cluster = "X5")`. See the vignette "Appendix 2: Using cobalt with Clustered, Multiply Imputed, and Other Segmented Data" on the `cobalt` [website](https://ngreifer.github.io/cobalt/index.html) for details. If we are satisfied with balance, we can then model the outcome with an interaction between the treatment and the moderator. ```{r} mdP <- match.data(mP) fitP <- lm(Y_C ~ A * X5, data = mdP, weights = weights) ``` To estimate the subgroup ATTs, we can use `avg_comparisons()`, this time specifying the `by` argument to signify that we want treatment effects stratified by the moderator. ```{r, eval=me_ok} avg_comparisons(fitP, variables = "A", vcov = ~subclass, newdata = subset(A == 1), by = "X5") ``` We can see that the subgroup mean differences are quite similar to each other. Finally, we can test for moderation using another call to `avg_comparisons()`, this time using the `hypothesis` argument to signify that we want to compare effects between subgroups: ```{r, eval=me_ok} avg_comparisons(fitP, variables = "A", vcov = ~subclass, newdata = subset(A == 1), by = "X5", hypothesis = "pairwise") ``` As expected, the difference between the subgroup treatment effects is small and nonsignificant, so there is no evidence of moderation by `X5`. ### Reporting Results It is important to be as thorough and complete as possible when describing the methods of estimating the treatment effect and the results of the analysis. This improves transparency and replicability of the analysis. Results should at least include the following: - a description of the outcome model used (e.g., logistic regression, a linear model with treatment-covariate interactions and covariates, a Cox proportional hazards model with the matching weights applied) - the way the effect was estimated (e.g., using g-computation or as the coefficient in the outcome model) - the way SEs and confidence intervals were estimated (e.g., using robust SEs, using cluster-robust SEs with pair membership as the cluster, using the BCa bootstrap with 4999 bootstrap replications and the entire process of matching and effect estimation included in each replication) - R packages and functions used in estimating the effect and its SE (e.g., `glm()` in base R, `avg_comparisons()` in `marginaleffects`, `boot()` and `boot.ci()` in `boot`) - The effect and its SE and confidence interval All this is in addition to information about the matching method, propensity score estimation procedure (if used), balance assessment, etc. mentioned in the other vignettes. ## Common Mistakes There are a few common mistakes that should be avoided. It is important not only to avoid these mistakes in one's own research but also to be able to spot these mistakes in others' analyses. ### 1. Failing to include weights Several methods involve weights that are to be used in estimating the treatment effect. With full matching and stratification matching (when analyzed using MMWS), the weights do the entire work of balancing the covariates across the treatment groups. Omitting weights essentially ignores the entire purpose of matching. Some cases are less obvious. When performing matching with replacement and estimating the treatment effect using the `match.data()` output, weights must be included to ensure control units matched to multiple treated units are weighted accordingly. Similarly, when performing k:1 matching where not all treated units receive k matches, weights are required to account for the differential weight of the matched control units. The only time weights can be omitted after pair matching is when performing 1:1 matching without replacement. Including weights even in this scenario will not affect the analysis and it can be good practice to always include weights to prevent this error from occurring. There are some scenarios where weights are not useful because the conditioning occurs through some other means, such as when using the direct subclass strategy rather than MMWS for estimating marginal effects after stratification. ### 2. Failing to use robust or cluster-robust standard errors Robust SEs are required when using weights to estimate the treatment effect. The model-based SEs resulting from weighted least squares or maximum likelihood are inaccurate when using matching weights because they assume weights are frequency weights rather than probability weights. Cluster-robust SEs account for both the matching weights and pair membership and should be used when appropriate. Sometimes, researchers use functions in the `survey` package to estimate robust SEs, especially with inverse probability weighting; this is a valid way to compute robust SEs and will give similar results to `sandwich::vcovHC()`.[^10] [^10]: To use `survey` to adjust for pair membership, one can use the following code to specify the survey design to be used with `svyglm()`: `svydesign(ids = ~subclass, weights = ~weights, data = md)` where `md` is the output of `match.data()`. After `svyglm()`, `comparisons()` can be used, and the `vcov` argument does not need to be specified. ### 3. Interpreting conditional effects as marginal effects The distinction between marginal and conditional effects is not always clear both in methodological and applied papers. Some statistical methods are valid only for estimating conditional effects and they should not be used to estimate marginal effects (without further modification). Sometimes conditional effects are desirable, and such methods may be useful for them, but when marginal effects are the target of inference, it is critical not to inappropriately interpret estimates resulting from statistical methods aimed at estimating conditional effects as marginal effects. Although this issue is particularly salient with binary and survival outcomes due to the general noncollapsibility of the OR, RR, and HR, this can also occur with linear models for continuous outcomes or the RD. The following methods estimate **conditional effects** for binary or survival outcomes (with noncollapsible effect measures) and should **not** be used to estimate marginal effects: - Logistic regression or Cox proportional hazards model with covariates and/or the propensity score included, using the coefficient on treatment as the effect estimate - Conditional logistic regression after matching - Stratified Cox regression after matching - Averaging stratum-specific effect estimates after stratification, including using Mantel-Haenszel OR pooling - Including pair or stratum fixed or random effects in a logistic regression model, using the coefficient on treatment as the effect estimate In addition, with continuous outcomes, conditional effects can be mistakenly interpreted as marginal effect estimates when treatment-covariate interactions are present in the outcome model. If the covariates are not centered at their mean in the target population (e.g., the treated group for the ATT, the full sample for the ATE, or the remaining matched sample for an ATM), the coefficient on treatment will not correspond to the marginal effect in the target population; it will correspond to the effect of treatment when the covariate values are equal to zero, which may not be meaningful or plausible. G-computation is always the safest way to estimate effects when including covariates in the outcome model, especially in the presence of treatment-covariate interactions. ## References ::: {#refs} ::: ## Code to Generate Data used in Examples ```{r, eval = FALSE} #Generating data similar to Austin (2009) for demonstrating treatment effect estimation gen_X <- function(n) { X <- matrix(rnorm(9 * n), nrow = n, ncol = 9) X[,5] <- as.numeric(X[,5] < .5) X } #~20% treated gen_A <- function(X) { LP_A <- - 1.2 + log(2)*X[,1] - log(1.5)*X[,2] + log(2)*X[,4] - log(2.4)*X[,5] + log(2)*X[,7] - log(1.5)*X[,8] P_A <- plogis(LP_A) rbinom(nrow(X), 1, P_A) } # Continuous outcome gen_Y_C <- function(A, X) { 2*A + 2*X[,1] + 2*X[,2] + 2*X[,3] + 1*X[,4] + 2*X[,5] + 1*X[,6] + rnorm(length(A), 0, 5) } #Conditional: # MD: 2 #Marginal: # MD: 2 # Binary outcome gen_Y_B <- function(A, X) { LP_B <- -2 + log(2.4)*A + log(2)*X[,1] + log(2)*X[,2] + log(2)*X[,3] + log(1.5)*X[,4] + log(2.4)*X[,5] + log(1.5)*X[,6] P_B <- plogis(LP_B) rbinom(length(A), 1, P_B) } #Conditional: # OR: 2.4 # logOR: .875 #Marginal: # RD: .144 # RR: 1.54 # logRR: .433 # OR: 1.92 # logOR .655 # Survival outcome gen_Y_S <- function(A, X) { LP_S <- -2 + log(2.4)*A + log(2)*X[,1] + log(2)*X[,2] + log(2)*X[,3] + log(1.5)*X[,4] + log(2.4)*X[,5] + log(1.5)*X[,6] sqrt(-log(runif(length(A)))*2e4*exp(-LP_S)) } #Conditional: # HR: 2.4 # logHR: .875 #Marginal: # HR: 1.57 # logHR: .452 set.seed(19599) n <- 2000 X <- gen_X(n) A <- gen_A(X) Y_C <- gen_Y_C(A, X) Y_B <- gen_Y_B(A, X) Y_S <- gen_Y_S(A, X) d <- data.frame(A, X, Y_C, Y_B, Y_S) ```