| Title: | Causal Inference Methods for Multilevel and Clustered Data |
|---|---|
| Description: | Provides an end-to-end workflow for estimating average treatment effects in clustered (multilevel) observational data. Core functionality includes cluster-aware propensity score estimation using fixed effects and Mundlak-style specifications, inverse probability weighting, within-cluster nearest-neighbor matching, covariate balance diagnostics at both individual and cluster-mean levels, outcome regression with cluster-robust standard errors, propensity score overlap visualization, and tipping-point sensitivity analysis for omitted cluster-level confounding. |
| Authors: | Subir Hait [aut, cre] (ORCID: <https://orcid.org/0009-0004-9871-9677>) |
| Maintainer: | Subir Hait <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-15 09:15:03 UTC |
| Source: | https://github.com/causalfragility-lab/mlcausal |
Computes standardised mean differences (SMDs) at two levels simultaneously:
the individual level and the cluster-mean level. Cluster-mean SMDs capture
between-cluster imbalance that individual-level diagnostics miss in
hierarchical data, and are the key diagnostic for assessing whether
ml_match's dual-balance optimisation has succeeded.
balance_ml(data, treatment, covariates, cluster, weights = NULL)balance_ml(data, treatment, covariates, cluster, weights = NULL)
data |
A |
treatment |
Character string. Name of the binary treatment variable
( |
covariates |
Character vector of covariate names. |
cluster |
Character string. Name of the cluster identifier variable. |
weights |
Optional character string. Name of a weight variable in
|
A common rule of thumb is for adequate balance.
Cluster-mean SMDs above this threshold indicate residual between-cluster
confounding; in that case increase lambda in ml_match
or switch to method = "fixed" in ml_ps.
When a cluster-mean SMD cannot be computed (e.g. because the matched sample
contains only one treatment arm across clusters), the smd column
returns the string
"Cluster-level SMD not estimable due to insufficient within-cluster variation after matching."
rather than NA. This makes the diagnostic failure explicit and
actionable rather than silently missing.
An object of class balance_ml, a named list:
Data frame with individual-level SMDs, one row per
covariate. The smd column is numeric.
Data frame with cluster-mean SMDs, one row per
covariate. The smd column is character: a numeric value
formatted to 4 decimal places, or a descriptive message when not
estimable.
Row-bound combination of overall and
cluster_means.
dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") dat_w <- ml_weight(ps) bal <- balance_ml(dat_w, treatment = "z", covariates = c("x1", "x2", "x3"), cluster = "school_id", weights = "weights") print(bal)dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") dat_w <- ml_weight(ps) bal <- balance_ml(dat_w, treatment = "z", covariates = c("x1", "x2", "x3"), cluster = "school_id", weights = "weights") print(bal)
Fits a (weighted) linear outcome model and reports the treatment effect
estimate with cluster-robust (HC1) standard errors via the sandwich
estimator. When covariates and weights are both supplied the
estimator is doubly robust.
estimate_att_ml( data, outcome, treatment, cluster, covariates = NULL, weights = NULL )estimate_att_ml( data, outcome, treatment, cluster, covariates = NULL, weights = NULL )
data |
A |
outcome |
Character string. Name of the continuous outcome variable. |
treatment |
Character string. Name of the binary treatment variable
( |
cluster |
Character string. Name of the cluster identifier variable. |
covariates |
Optional character vector of regression covariates for
doubly robust adjustment. |
weights |
Optional character string. Name of a weight variable in
|
An object of class estimate_att_ml:
The matched call.
Fitted lm object.
Cluster-robust variance-covariance matrix (HC1).
coeftest coefficient table.
Point estimate for the treatment coefficient.
Cluster-robust standard error.
Two-sided p-value.
dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") dat_w <- ml_weight(ps) est <- estimate_att_ml(dat_w, outcome = "y", treatment = "z", cluster = "school_id", weights = "weights") print(est)dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") dat_w <- ml_weight(ps) est <- estimate_att_ml(dat_w, outcome = "y", treatment = "z", cluster = "school_id", weights = "weights") print(est)
Performs greedy nearest-neighbour matching within clusters using a composite distance that simultaneously targets balance on individual-level propensity scores and cluster-mean covariates. This dual-balance approach is the core methodological innovation of MLCausal: standard within-cluster PS matching achieves individual-level balance but often leaves substantial between-cluster (cluster-mean) imbalance. The composite distance penalises matches that worsen cluster-mean balance.
ml_match(ps_fit, ratio = 1L, replace = FALSE, caliper = NULL, lambda = 1)ml_match(ps_fit, ratio = 1L, replace = FALSE, caliper = NULL, lambda = 1)
ps_fit |
An object of class |
ratio |
Positive integer. Controls matched per treated unit. Default 1. |
replace |
Logical. Allow control reuse? Default |
caliper |
Optional positive numeric. Maximum allowable logit-PS
distance. |
lambda |
Non-negative numeric. Weight on the cluster-mean balance
penalty in the composite distance. |
An object of class ml_match, a named list:
The matched call.
data.frame of matched units with columns
match_weight (1 for treated, for controls) and
pair_id.
data.frame of matched pairs with pair_id,
treated_row, control_row, ps_distance, and
composite_distance.
Stored arguments.
Number of treated units that could not be matched.
For each treated unit and candidate control in the same
cluster , the composite matching distance is
where is the propensity score, is
the running cluster- mean of covariate for treatment arm
after each match is tentatively accepted, and is the
pooled SD of covariate in the full sample. The tuning parameter
(lambda) controls the weight given to cluster-mean
balance relative to individual PS proximity. Setting lambda = 0
recovers standard within-cluster PS matching.
ml_ps, balance_ml,
estimate_att_ml
dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") matched <- ml_match(ps, ratio = 1, caliper = 0.5, lambda = 1) print(matched)dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") matched <- ml_match(ps, ratio = 1, caliper = 0.5, lambda = 1) print(matched)
Estimates propensity scores for clustered observational data using one of three specifications to account for between-cluster confounding.
ml_ps( data, treatment, covariates, cluster, method = c("mundlak", "fixed", "single"), estimand = c("ATT", "ATE"), family = stats::binomial() )ml_ps( data, treatment, covariates, cluster, method = c("mundlak", "fixed", "single"), estimand = c("ATT", "ATE"), family = stats::binomial() )
data |
A |
treatment |
Character string. Name of the binary treatment variable
(must be coded |
covariates |
Character vector of covariate names. |
cluster |
Character string. Name of the cluster identifier variable. |
method |
Character string. One of |
estimand |
Character string. Target estimand: |
family |
A GLM family object. Default |
An object of class ml_ps, a named list with elements:
The matched call.
The fitted glm object.
Working data frame with a .ps column of clipped
propensity scores and, when method = "mundlak", _cm
cluster-mean columns.
Numeric vector of clipped propensity scores.
Name of the treatment variable.
Names of the covariates used.
Name of the cluster variable.
PS estimation method used.
Target estimand.
"mundlak"Augments the propensity score model with cluster means of each covariate (Mundlak terms). Softly absorbs between-cluster confounding without requiring within-cluster treatment variation in every cluster. Recommended default.
"fixed"Adds cluster indicators as a factor. Fully absorbs between-cluster confounding but requires within-cluster treatment variation in every cluster.
"single"Standard logistic regression with no cluster adjustment. Useful only as a naive baseline.
ml_weight, ml_match,
plot_overlap_ml, balance_ml
dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, treatment = "z", covariates = c("x1", "x2", "x3"), cluster = "school_id", method = "mundlak", estimand = "ATT") print(ps)dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, treatment = "z", covariates = c("x1", "x2", "x3"), cluster = "school_id", method = "mundlak", estimand = "ATT") print(ps)
Computes ATT or ATE inverse probability weights from propensity scores
estimated by ml_ps. Stabilised weights (the default) are
recommended to reduce variance without sacrificing consistency.
ml_weight(ps_fit, estimand = c("ATT", "ATE"), stabilize = TRUE, trim = NULL)ml_weight(ps_fit, estimand = c("ATT", "ATE"), stabilize = TRUE, trim = NULL)
ps_fit |
An object of class |
estimand |
Character string. |
stabilize |
Logical. |
trim |
Optional positive numeric. Weights above this value are
Winsorised. |
The working data frame from ps_fit with an appended
weights column.
Let be the estimated propensity score and
the marginal treatment prevalence.
Unstabilised:
ATT: (treated), (control)
ATE: (treated), (control)
Stabilised:
ATT: (treated),
(control)
ATE: (treated),
(control)
ml_ps, balance_ml,
estimate_att_ml
dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") dat_w <- ml_weight(ps, estimand = "ATT", stabilize = TRUE, trim = 10) summary(dat_w$weights)dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") dat_w <- ml_weight(ps, estimand = "ATT", stabilize = TRUE, trim = 10) summary(dat_w$weights)
Creates a density overlap plot of estimated propensity scores by treatment group. Substantial overlap supports the positivity assumption. The plot can be shown overall or faceted by cluster.
plot_overlap_ml( x, treatment = NULL, cluster = NULL, ps = NULL, facet_clusters = FALSE, top_n_clusters = 12L )plot_overlap_ml( x, treatment = NULL, cluster = NULL, ps = NULL, facet_clusters = FALSE, top_n_clusters = 12L )
x |
An |
treatment |
Character string. Treatment variable name. Required when
|
cluster |
Character string. Cluster variable name. Required when
|
ps |
Character string. Propensity score variable name. Required when
|
facet_clusters |
Logical. Facet by cluster? Default |
top_n_clusters |
Positive integer. Maximum clusters shown when
|
A ggplot object.
dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") plot_overlap_ml(ps)dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") plot_overlap_ml(ps)
Reports how strong a hypothetical omitted confounder (in units of the
cluster-robust SE) would need to be to nullify the estimated treatment
effect (push below 1.96).
sens_ml(estimate, se, q = seq(0, 5, by = 0.1))sens_ml(estimate, se, q = seq(0, 5, by = 0.1))
estimate |
Numeric scalar. Treatment effect estimate from
|
se |
Numeric scalar. Cluster-robust standard error of the estimate. |
q |
Numeric vector. Confounder strengths |
A data.frame with columns:
The value examined.
Estimate after subtracting assumed bias.
Observed .
.
TRUE when .
The analysis assumes a linear bias model: an omitted variable with
confounder strength shifts the estimate by
and the z-statistic by
. This is transparent and easy to communicate, but is not a
full formalisation. For rigorous sensitivity analysis see Cinelli & Hazlett
(2020) or Rosenbaum bounds.
A message is printed if no tipping point is found within the examined range.
Cinelli, C. & Hazlett, C. (2020). Making sense of sensitivity: Extending omitted variable bias. JRSS-B, 82(1), 39–67. doi:10.1111/rssb.12348
dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") dat_w <- ml_weight(ps) est <- estimate_att_ml(dat_w, "y", "z", "school_id", weights = "weights") sens <- sens_ml(est$estimate, est$se) head(sens)dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) ps <- ml_ps(dat, "z", c("x1", "x2", "x3"), "school_id") dat_w <- ml_weight(ps) est <- estimate_att_ml(dat_w, "y", "z", "school_id", weights = "weights") sens <- sens_ml(est$estimate, est$se) head(sens)
Generates a multilevel dataset for prototyping and validating
MLCausal workflows. The data-generating process (DGP) includes a
cluster-level random effect that simultaneously drives treatment selection
and the outcome (i.e. unmeasured cluster-level confounding when
method = "single" is used in ml_ps), and
heterogeneous treatment effects across clusters.
simulate_ml_data( n_clusters = 30L, cluster_size = 20L, n_min = 10L, seed = NULL )simulate_ml_data( n_clusters = 30L, cluster_size = 20L, n_min = 10L, seed = NULL )
n_clusters |
Integer. Number of clusters (schools). Default |
cluster_size |
Integer. Expected cluster size drawn from
|
n_min |
Integer. Minimum guaranteed cluster size. Prevents tiny
clusters that lose entire treatment arms after matching. Default |
seed |
Optional integer. Random seed for reproducibility. |
A data.frame with columns:
Integer cluster identifier (1 to n_clusters).
Continuous individual-level covariate, .
Binary individual-level covariate, .
Continuous covariate correlated with the cluster random effect.
Binary treatment indicator (1 = treated, 0 = control).
Continuous outcome.
Because treatment effect heterogeneity correlates with the cluster-level
random effect, the true ATT ATE:
ATE (marginal mean of
).
ATT because treated units are
over-represented in high- clusters where .
dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) head(dat) table(dat$z)dat <- simulate_ml_data(n_clusters = 5, cluster_size = 10, seed = 1) head(dat) table(dat$z)