Causal Inference

From Estimands to Statistical Inference

Nima Hejazi
nhejazi@hsph.harvard.edu

Harvard Biostatistics

August 16, 2025

Causal inference and (bio)statistics

  • Study designs: Randomized controlled trials (RCTs), observational studies, “natural” experiments
  • Why do study designs matter?
    • Sampling: Are our inferences generalizable?
    • Confounding: Are we learning what we think we are?
    • What type of study should we aim for? What can we learn?
  • When can a statistical inference be interpreted as causal?
    • Randomization of treatment assignment, no (unmeasured) confounding (in observational studies)
    • Sufficient experimentation in treatment assignment (positivity)

Questions…association or causation

  • “Is risk of symptomatic disease higher/lower in vaccinated or unvaccinated groups?”
  • Questions of association inquire about the actual (or realized) state of the system under study; they do not require conceiving of a manipulation of the system, only the ability to observe it.
  • “Would the risk of symptomatic disease be increased or decreased by vaccination?”
  • Questions of causality inquire about a counterfactual state; they require that we conceive of how the system would have behaved had it been subjected to an intervention (by us or others).

Association and causation are distinct concepts, with different types of tools and assumptions necessary for their study.

Regression and causality

Example 1 How should we interpret the linear regression parameter β1:

E[Y∣X]=β0+β1X1+…+βpXp

  • Tempting to call β1 “the expected change in outcome Y if covariate X1 were to be increased by one unit, keeping the other covariates constant.” (It is tempting also to teach others to say so.)
  • With only the statement about the regression model above, is this true? Are there any key assumptions missing?
  • The linear form above makes no statement about mechanism; it does not claim to do any more than describe an observed reality—that is, it is not a statement about causality.
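This point can be illustrated with a small simulation (the variable names and data-generating process are illustrative assumptions, not from the slides): when an unmeasured variable U drives both X1 and Y, the fitted β1 is far from zero even though X1 has no causal effect on Y at all.

```python
import numpy as np

# Hypothetical simulation: a confounder U drives both the covariate X1 and
# the outcome Y, while X1 has NO causal effect on Y.
rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)           # unmeasured confounder
x1 = u + rng.normal(size=n)      # covariate, driven by U
y = 2 * u + rng.normal(size=n)   # outcome depends on U only, not on X1

# Ordinary least squares of Y on X1 (with intercept)
X = np.column_stack([np.ones(n), x1])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta[1])  # near 1.0 = Cov(X1, Y)/Var(X1), yet the causal effect is 0
```

The regression faithfully describes the observed association (slope near Cov(X1, Y)/Var(X1) = 1), but interpreting it as the effect of manipulating X1 would be wrong here.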

An early example of an RCT (per Senn (2022))

  • In 1747, James Lind, a surgeon on the HMS Salisbury, conducted one of the first medical trials—to assess the causal relationship, if any, between eating oranges and lemons and recovering from scurvy.
  • Lind took aside twelve men with advanced symptoms of scurvy “as similar as [he] could have them,” for a matched pairs experiment.
    • The first pair were given slightly alcoholic cider and the second an elixir of vitriol. The third pair took vinegar while the fourth drank sea water. The fifth were fed two oranges and one lemon daily for six days, and the sixth were given a medicinal paste and a mild laxative.
    • Of the six pairs, the pair who were fed the oranges and lemons were nearly recovered after only a week, and those who had drunk the cider responded favorably but were too weak to return to duty after two weeks. The other four pairs all improved little, or not at all.

The first RCT

  • In 1947, Great Britain’s Medical Research Council (MRC) conducted the first published instance of a blinded, randomized controlled trial (RCT).
  • Clinical setting: n=107 patients suffering from TB were assigned to experimental (streptomycin) or control regimens via a (manual) system of random number assignments devised by Sir Austin Bradford Hill.
    • Prior to this, alternative assignment strategies (e.g., “alternating allocation”) had been preferred over randomization
    • Hill himself harbored doubts about the ethics of randomization
    • Hill became persuaded, as a limited supply of streptomycin meant it was the sole way in which most patients could receive the drug
  • The MRC streptomycin trial was groundbreaking, pioneering the use of randomized treatment assignment in clinical settings.

The blessing of randomization

  • Today, RCTs are considered a “gold standard” for medical research (and causal inference!), due to the inferential safeguards they provide.
  • Randomization allows for the causal effect of a candidate treatment to be isolated from that of other variables (potential confounders).
  • Without randomization, any association between candidate treatment and the outcome could not be disentangled from the association of the confounders with either.
  • Randomization is not a panacea, however, and the validity of an RCT’s findings depends on the quality of its design and execution.
  • As a result of modern advances (ongoing since the 1980s), causal inference is now possible even without randomization (that is, in observational studies)—but this requires great care…

Learning from data…or trying to…

Question: What is “the effect” of drug A versus B on illness Y?

Mock dataset from a hypothetical study.
Patient ID | X (age) | D (drug) | Y (illness)
1 | 19 | A | 1
2 | 45 | B | 0
… | … | … | …
199 | 57 | B | 0
200 | 32 | A | 1
  • Drug D: standard-of-care (A) or an investigational drug (B).
  • Illness Y: some patients recover (Y=1), others don’t (Y=0).
  • Does the age (X) of the patients matter in our analysis?

When the ideal data are missing

Question: What is “the effect” of drug A versus B on illness Y?

Potential outcomes table
Patient ID | X | D | YA | YB | YB−YA
1 | 19 | A | 1 | ? | ? − 1
2 | 45 | B | ? | 0 | 0 − ?
… | … | … | … | … | …
199 | 57 | B | ? | 0 | 0 − ?
200 | 32 | A | 1 | ? | ? − 1
  • Potential outcomes: YiA is the outcome of patient i had they taken drug A and YiB the outcome had they taken drug B.
  • YiB−YiA is the individual causal effect (ICE) of patient i.

The fundamental problem of causal inference

  • YiA, YiB are potential outcomes or counterfactual RVs (Imbens and Rubin 2015; Hernán and Robins 2025).
  • For a given study unit i, we cannot observe both simultaneously: YiA is seen only when drug A is assigned, YiB only when drug B is taken…
  • The ICE is YiB−YiA, and we cannot observe both potential outcomes—this is the fundamental problem of causal inference (Holland 1986).
  • The ICE cannot be measured, but what about alternative metrics? θATE=E[YB−YA], where θATE is the average treatment effect (ATE).
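A small simulation sketch (the setup is an illustrative assumption, not from the slides) makes the problem concrete: in simulated data we can generate both potential outcomes, so the ICEs and the ATE are all computable; with real data, one of the two potential outcomes is always missing.

```python
import numpy as np

# Hypothetical potential-outcomes simulation: generate BOTH Y^A and Y^B for
# every unit, which is possible only in a simulation.
rng = np.random.default_rng(1)
n = 100_000
y_a = rng.binomial(1, 0.30, size=n)  # recovery under standard-of-care A
y_b = rng.binomial(1, 0.55, size=n)  # recovery under investigational drug B

ice = y_b - y_a    # individual causal effects (never all observable in practice)
ate = ice.mean()   # average treatment effect, E[Y^B - Y^A]
print(ate)         # near 0.55 - 0.30 = 0.25 in this simulation
```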

From causality to statistics in randomized studies

Proposition 1 (Identification) Let (A,Y)∼iidP and assume:

  1. Consistency: Y=Ya whenever A=a.
  2. Randomization: A⊥⊥Ya for each a∈A.

Then, E[Y∣A=a]=E[Ya].

With Proposition 1, we can re-express θATE=E[Y1]−E[Y0] as ψATE=E[Y∣A=1]−E[Y∣A=0] .

While ψATE is a statistical estimand that can be evaluated using data, θATE is a causal estimand, defined by unobservable potential outcomes.

The difference-in-means estimator

Since we have that ψATE=E[Y∣A=1]−E[Y∣A=0] is equivalent to θATE=E[Y1]−E[Y0] by Proposition 1, we can estimate the ATE: ψ^ATEDM=E^[Y∣A=1]−E^[Y∣A=0]=(1/n1)∑i:Ai=1Yi−(1/n0)∑i:Ai=0Yi, where ψ^ATEDM is the difference-in-means (DM) estimator, with variance: V(ψ^ATEDM)=V[Y∣A=1]/p+V[Y∣A=0]/(1−p).
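A minimal sketch of the DM estimator on simulated RCT data (the data-generating setup and variable names are illustrative assumptions):

```python
import numpy as np

# Simulated RCT: assignment A is randomized with probability p = 0.5, and
# the true ATE is 0.55 - 0.30 = 0.25.
rng = np.random.default_rng(2)
n, p = 10_000, 0.5
a = rng.binomial(1, p, size=n)                      # randomized assignment
y = rng.binomial(1, np.where(a == 1, 0.55, 0.30))   # binary outcome

n1, n0 = a.sum(), (1 - a).sum()
psi_dm = y[a == 1].mean() - y[a == 0].mean()        # DM point estimate

# Plug-in variance estimate: V^[Y|A=1]/n1 + V^[Y|A=0]/n0
var_dm = y[a == 1].var(ddof=1) / n1 + y[a == 0].var(ddof=1) / n0
print(psi_dm, var_dm ** 0.5)
```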

The Horvitz-Thompson (or IPW) estimator

The difference-in-means estimator can be expressed as ψ^ATEDM=E^[Y∣A=1]−E^[Y∣A=0]=E^[YA]/E^[A]−E^[Y(1−A)]/E^[(1−A)]=E^[YA]/p^1−E^[Y(1−A)]/p^0.

What if we knew the probability of treatment assignment for each arm? Substituting the known probabilities for their estimates yields the Horvitz-Thompson (HT) estimator: ψ^ATEHT=E^[YA]/E[A]−E^[Y(1−A)]/E[(1−A)]=E^[YA]/p1−E^[Y(1−A)]/p0.
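A sketch of the HT estimator on the same kind of simulated RCT data, assuming the known assignment probabilities p1 = p0 = 0.5 (an illustrative setup):

```python
import numpy as np

# Simulated RCT with KNOWN assignment probabilities p1 = p0 = 0.5.
rng = np.random.default_rng(3)
n, p1 = 10_000, 0.5
p0 = 1 - p1
a = rng.binomial(1, p1, size=n)
y = rng.binomial(1, np.where(a == 1, 0.55, 0.30))   # true ATE = 0.25

# Weight each observed outcome by the inverse of its assignment probability
psi_ht = (y * a).mean() / p1 - (y * (1 - a)).mean() / p0
print(psi_ht)  # consistent for the ATE
```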

Efficiency, or how unsure should we be?

  • When comparing two estimators of the same parameter (target quantity), we care about bias and efficiency
  • Assuming both estimators are unbiased (they recover the target parameter on average), the comparison boils down to relative efficiency
    • relative efficiency is a measure of the quality of two unbiased estimators: it is the ratio of their (asymptotic) variances
  • It turns out that between the Horvitz-Thompson (HT) and difference-in-means (DM) estimators, DM is more efficient (smaller variance)
    • this is unintuitive…HT used the known assignment probabilities (p1,p0) while DM used the estimated probabilities (p^1,p^0)
    • why does estimation of probabilities improve efficiency? when does the observed np^ get close to the theoretical np?
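The efficiency comparison can be checked by Monte Carlo (the simulation setup is an illustrative assumption): both estimators center on the true ATE, but across repeated trials the DM estimator, which uses the estimated p^1 and p^0, has the smaller variance.

```python
import numpy as np

# Monte Carlo comparison of DM vs HT across repeated simulated RCTs.
rng = np.random.default_rng(4)
n, p1, reps = 500, 0.5, 5_000
dm, ht = np.empty(reps), np.empty(reps)
for r in range(reps):
    a = rng.binomial(1, p1, size=n)
    y = rng.binomial(1, np.where(a == 1, 0.55, 0.30))  # true ATE = 0.25
    dm[r] = y[a == 1].mean() - y[a == 0].mean()
    ht[r] = (y * a).mean() / p1 - (y * (1 - a)).mean() / (1 - p1)

print(dm.mean(), ht.mean())  # both near 0.25 (unbiased)
print(dm.var(), ht.var())    # DM's Monte Carlo variance is smaller
```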

Observational studies: “Thank you for smoking”

  • Neither observational studies nor randomized experiments are uniformly bad or good
    • both types of studies range in quality, and it is not true that one type obviously dominates the other.
    • their evidential status is context-dependent and needs to be evaluated case-by-case, based on specific merits or faults.
  • R.A. Fisher was a major proponent of randomization for drawing valid causal inferences…and happened to enjoy smoking tobacco
    • he was a prominent critic of any evidence linking smoking to cancer and other diseases; see, e.g., Fisher (1957) in the The BMJ.
    • randomization of tobacco smoking is neither ethical nor feasible, yet overwhelming observational evidence links smoking to cancer
    • how to reconcile lack of experimental evidence with such strong observational evidence?

Observational studies and causal inference

In 1965, the (bio)statistician and epidemiologist Sir Austin Bradford Hill outlined a set of criteria for making causal judgments on the basis of observational evidence. An excerpt from Hill (1965) appears below:

When our observations reveal an association between two variables, perfectly clear-cut and beyond what we would care to attribute to the play of chance, what aspects of that association should we especially consider before deciding that the most likely interpretation of it is causation?

  1. Strength: First upon my list I would put the strength of the association.
  2. Consistency: Next on my list of features to be specially considered I would place the consistency of the observed association. Has it been repeatedly observed by different persons, in different places, circumstances and times?
  3. Specificity: One reason, needless to say, is the specificity of the association, the third characteristic which invariably we must consider.
  4. Temporality: My fourth characteristic is the temporal relationship of the association—which is the cart and which the horse?
  5. Biological gradient: Fifthly, if the association is one which can reveal a biological gradient, or dose-response curve, then we should look most carefully for such evidence.
  6. Plausibility: It will be helpful if the causation we suspect is biologically plausible. But this is a feature I am convinced we cannot demand. What is biologically plausible depends upon the biological knowledge of the day.
  7. Coherence: On the other hand the cause-and-effect interpretation of our data should not seriously conflict with the generally known facts of the natural history and biology of the disease—in the expression of the Advisory Committee to the Surgeon-General it should have coherence.
  8. Experiment: Occasionally it is possible to appeal to experimental, or semi-experimental, evidence. For example, because of an observed association some preventive action is taken. Does it in fact prevent?
  9. Analogy: In some circumstances it would be fair to judge by analogy. With the effects of thalidomide and rubella before us we would surely be ready to accept slighter but similar evidence with another drug or another viral disease in pregnancy.

From causality to statistics in observational studies

Proposition 2 (Identification) Let (L,A,Y)∼iidP and assume:

  1. Consistency: Y=Ya whenever A=a.
  2. Non-interference: Yia⊥⊥Aj for all i≠j.
  3. No unmeasured confounding: Ya⊥⊥A∣L for each a∈A.
  4. Positivity: 0<P(A=a∣L=l)<1 for all l∈L.

Then, E[E[Y∣A=a,L]]=E[Ya].

With Proposition 2, we can re-express θATE=E[Y1]−E[Y0] as ψATE=E[E[Y∣A=1,L]−E[Y∣A=0,L]] .
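Under the assumptions of Proposition 2, the standardized contrast can be computed by stratifying on L; a sketch on simulated data with a single binary confounder (the data-generating setup is an illustrative assumption):

```python
import numpy as np

# Simulated observational study: a binary confounder L drives both the
# treatment A and the outcome Y; the true ATE is 0.25 by construction.
rng = np.random.default_rng(5)
n = 200_000
l = rng.binomial(1, 0.4, size=n)                 # measured confounder
a = rng.binomial(1, np.where(l == 1, 0.7, 0.3))  # treatment depends on L
y = rng.binomial(1, 0.2 + 0.25 * a + 0.3 * l)    # outcome; true ATE = 0.25

# Naive contrast is confounded; standardize over L instead:
# psi = E[ E[Y|A=1,L] - E[Y|A=0,L] ]
naive = y[a == 1].mean() - y[a == 0].mean()
psi = sum(
    (y[(a == 1) & (l == s)].mean() - y[(a == 0) & (l == s)].mean())
    * (l == s).mean()
    for s in (0, 1)
)
print(naive, psi)  # naive overshoots 0.25; standardized estimate is near 0.25
```

The stratify-and-average step is the plug-in form of E[E[Y∣A=1,L]−E[Y∣A=0,L]], feasible here because L is discrete; with many or continuous confounders, regression models for E[Y∣A,L] take its place.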

References

Fisher, Ronald A. 1957. “Dangers of Cigarette-Smoking.” British Medical Journal 2 (5039): 297–98.
Hernán, Miguel A, and James M Robins. 2025. Causal Inference: What If. CRC Press.
Hill, Austin Bradford. 1965. “The Environment and Disease: Association or Causation?” Journal of the Royal Society of Medicine 58 (5): 295–300. https://doi.org/10.1177/0141076814562718.
Holland, Paul W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81 (396): 945–60. https://doi.org/10.1080/01621459.1986.10478354.
Imbens, Guido W, and Donald B Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press. https://doi.org/10.1017/CBO9781139025751.
Senn, Stephen. 2022. Dicing with Death: Living by Data. Cambridge University Press. https://doi.org/10.1017/9781009000185.

The difference-in-means estimator…another look…

ψ^ATEDM = E^[Y∣A=1]−E^[Y∣A=0]
= E^[YA]/E^[A] − E^[Y(1−A)]/E^[(1−A)]
= (1/n∑i=1nYiAi)/(1/n∑i=1nAi) − (1/n∑i=1nYi(1−Ai))/(1/n∑i=1n(1−Ai))
= (1/n∑i=1nYiAi)(n/n1) − (1/n∑i=1nYi(1−Ai))(n/n0)
= (1/n1)∑i:Ai=1Yi − (1/n0)∑i:Ai=0Yi

HST 190: Introduction to Biostatistics
