Clinical Trials and Estimands

Nima Hejazi

Harvard Biostatistics

July 8, 2025

Overview of clinical trials

  • Throughout the continuum of health care, patients and providers have to make decisions (e.g., one treatment or another).
  • A major goal of clinical research is to establish evidence to help in that decision-making process.
  • Randomized, controlled clinical trials are viewed as generating the strongest, most compelling evidence.
    • this is especially true for individual studies
    • synthesizing evidence across many studies may involve meta-analysis instead

Goals of clinical trials

Clinical trials are used to investigate a wide variety of potential decisions:

  1. Prevention, e.g., effect of Mediterranean-style diet on cardiovascular outcomes
  2. Screening or early detection, e.g., utility of breast cancer screening
  3. Diagnostic testing, e.g., to evaluate ventilator-associated pneumonia
  4. Therapeutics/treatments, e.g., effect of TB therapeutic regimens on disease-free survival
  5. Quality-of-life, e.g., family-support interventions in the ICU

Clinical trials of therapeutics

There are usually four phases of such clinical trials:

  • Phase I: 20-80 participants, lasting up to several months, with the goal of studying safety of the investigational regimen
  • Phase II: 100-300 participants, lasting up to 2 years, with the goal of studying efficacy of the investigational regimen
  • Phase III: 1000-3000 participants, lasting 1-4 years, with the goal of studying the safety, efficacy, and dosing of the investigational regimen
  • Phase IV: thousands of participants, with the goal of studying long-term effectiveness of new regimens as post-marketing surveillance

Clinical RCTs must be pre-registered (at ClinicalTrials.gov) and a detailed protocol must be submitted/evaluated before the study begins.

Components of a clinical trial

  • Scientific components include study population, interventions and hypotheses, primary (and possibly secondary) endpoints.
  • Study components include type of randomization, extent of blinding
  • Analytic components include choice of statistical analysis framework, the target sample size (and associated power), interim monitoring

Friedman et al. (2015) and Pocock (1983) provide in-depth treatments of the design and conduct of clinical trials.

We will briefly review key concepts in broad strokes.

The study population

  • Specification of the study population provides a basis for the interpretation of any findings, i.e., to whom do the results apply?
  • For example, the REMoxTB RCT enrolled adult patients (18+ years of age) “who had newly diagnosed, previously untreated M. tuberculosis infection,” including people living with HIV (PLwH) (Gillespie et al. 2014).
    • The study population was adult patients with drug-susceptible TB, including PLwH (subject to CD4+ and ART eligibility criteria)
    • This is the patient population to which results of the trial should generalize
  • There is no single correct choice of study population, but considerations should include
    • clinical meaningfulness and relevance
    • feasibility of recruiting and following participants

The study population

  • Choice of study population may be determined, in part, by
    • previous work and prior studies (e.g., PLwH eligibility in TB trials)
    • likelihood of adverse events and/or competing risks
  • Critical to recognize that interventions often have heterogeneous effects and to consider which sub-populations are most relevant to include/exclude
  • Interpretability trade-offs in selecting the study population
    • broader definitions are more inclusive but less specific
    • broader definitions are more appropriate for policy decisions but far less informative for personalized decision-making

Formulating hypotheses

  • We discussed the evaluation of hypotheses within the null hypothesis significance testing (NHST) framework:
    • The null hypothesis, \(H_0\), is the status quo
    • The alternative hypothesis, \(H_1\), is a deviation from the status quo
    • The strategy is to gather evidence (data) and evaluate degree to which evidence is compatible with \(H_0\) (quantified by a p-value)
  • In a clinical trial, the alternative hypothesis, \(H_1\), is aligned with the primary goals of the study (e.g., treatment-shortening in REMoxTB)

Formulating hypotheses: Superiority or non-inferiority

Consider an investigational regimen \(A\) and a standard-of-care or placebo control \(B\). Writing \(\mu_A\) and \(\mu_B\) for their mean outcomes (larger being better), there are two typical framings of hypotheses in clinical trials:

  1. superiority: investigational regimen is better than control, expressed as \(H_0: \mu_A = \mu_B\) and \(H_1: \mu_A > \mu_B\)
  2. non-inferiority: investigational regimen is no worse than control, expressed as \(H_0: \mu_A < \mu_B - \Delta\) and \(H_1: \mu_A \geq \mu_B - \Delta\)
    • investigational regimen may be better in terms of, say, cost, safety
    • requires specification of an acceptable non-inferiority margin \(\Delta\), which is often informed by prior studies and/or expert opinion

Formulating hypotheses: The REMoxTB trial

  • The REMoxTB trial (NCT00864383) (Gillespie et al. 2014) aimed to test two investigational regimens against standard-of-care:
    1. Control (6 mo.): isoniazid, rifampin, pyrazinamide, and ethambutol for 8 weeks, followed by 18 weeks of isoniazid and rifampin
    2. Isoniazid group (4 mo.): replaced ethambutol with moxifloxacin for 17 weeks, followed by 9 weeks of placebo
    3. Ethambutol group (4 mo.): replaced isoniazid with moxifloxacin for 17 weeks, followed by 9 weeks of placebo
  • Primary hypothesis concerned non-inferiority, “defined as a between-group difference of less than 6 percentage points in the upper boundary of the two-sided 97.5% Wald confidence interval for the proportion of patients with an unfavorable outcome.”
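
To make the quoted non-inferiority criterion concrete, here is a minimal sketch of a Wald interval for a difference in proportions. The counts are hypothetical (they are not the REMoxTB results), the function names are illustrative, and the trial's actual analysis may have included adjustments not shown here.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci_risk_difference(x_new, n_new, x_ctl, n_ctl, level=0.975):
    """Two-sided Wald CI for p_new - p_ctl (proportions with an unfavorable outcome)."""
    p_new, p_ctl = x_new / n_new, x_ctl / n_ctl
    diff = p_new - p_ctl
    se = sqrt(p_new * (1 - p_new) / n_new + p_ctl * (1 - p_ctl) / n_ctl)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 2.24 for a 97.5% interval
    return diff - z * se, diff + z * se


def is_noninferior(upper_bound, margin=0.06):
    """Declare non-inferiority only if the upper CI bound lies below the margin."""
    return upper_bound < margin


# Hypothetical counts of unfavorable outcomes, NOT the REMoxTB data
lower, upper = wald_ci_risk_difference(x_new=60, n_new=500, x_ctl=50, n_ctl=500)
print(f"97.5% CI for the difference: ({lower:.3f}, {upper:.3f})")
print("non-inferior" if is_noninferior(upper) else "non-inferiority not shown")
```

With these made-up counts the point estimate (2 percentage points) is well inside the margin, yet the upper bound of the interval is not, which is why the criterion is stated in terms of the interval rather than the estimate.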

Choosing a control group

  • When testing an investigational regimen (say, “thisworksumab”, a new drug candidate), what is the correct choice of control?
  • This choice has implications for both interpretation and quantification of the effect that is ultimately estimated.
  • A trial is conducted to gather evidence to inform future decisions, so the choice of control should reflect the decision that patients/regulators will ultimately face.
  • When is a placebo appropriate? When there is no established standard-of-care.
  • When a standard-of-care is available, an active control should be used:
    • unethical to deny patients known-effective choice of treatment
    • for example, a placebo would never be assigned in a TB treatment trial, but placebos were used in COVID-19 vaccine trials in 2020-2021

Choosing an endpoint

  • In REMoxTB, primary and secondary endpoints were used
    • primary efficacy endpoint was a “composite unfavorable outcome” based on “bacteriologically or clinically defined failure or relapse” up to 18 months after randomization (Gillespie et al. 2014)
    • secondary outcomes included “time to an unfavorable outcome” and “status at the end of treatment” (Gillespie et al. 2014)
  • The choice of primary endpoint must be pre-specified
    • this is the basis for the power analysis justifying the sample size
    • depending on the disease, a regulatory body or conventions in the field may mandate specific endpoints
    • typically must report results for the primary endpoint, regardless of how evaluation of the hypotheses goes

Choosing an endpoint

  • Primary endpoints may be co-primary (e.g., in Alzheimer’s disease, with functional and cognitive measures) or composite (e.g., in TB, integrating both clinical measures and biological information)
  • Secondary endpoints should also be pre-specified
    • typically held to less stringent error-rate control than the primary endpoint
    • consider accommodating power for such endpoints
  • Secondary endpoints are often either exploratory or confirmatory
    • exploratory: hypothesis-generating, may inform future studies
    • confirmatory: used to support or elaborate upon findings from primary analysis

Choosing an Estimand: The Estimands Framework

The ICH E9 (R1) addendum (US Food and Drug Administration 2021) outlines a framework for specifying the estimand of interest in a clinical trial.

Kahan et al. (2023) and Kahan et al. (2024) provide succinct overviews:

  • clear description of what the treatment effect represents
  • what happens to the same set of patients under different treatment conditions (i.e., a causal comparison for a common target population)
  • the estimand is the target of inference, the estimator is the statistical method applied to obtain an estimate of the estimand

Weir, Dufault, and Phillips (2024) give specifics in TB therapeutics trials and apply this framework to reanalyzing data from REMoxTB as an example.

Randomization

  • Central to clinical trials’ strength of evidence: on average, randomization rules out systematic differences between groups due to confounding factors
  • In REMoxTB, patients randomized 1:1:1 to one of the three arms with stratification based on patient weight group and study center.
  • A few common types of randomization:
    • Bernoulli: flip a digital coin, guarantees balance on average
    • Blocking: collect patients into blocks as enrollment occurs and randomly assign treatments within each block, guarantees sample balance across arms
    • Stratification: group patients based on key characteristics and assign randomly within strata, guarantees balance across factors
    • Adaptive: adjust randomization probabilities based on baseline prognostic factors or based on responses as study progresses
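
The first three schemes above can be sketched in a few lines. This is an illustrative sketch only: the function names, block sizes, and seed are assumptions, and real trials use validated randomization systems with allocation concealment.

```python
import random


def bernoulli_randomize(n, arms, rng):
    """Assign each participant independently; balance holds only on average."""
    return [rng.choice(arms) for _ in range(n)]


def permuted_block_randomize(n, arms, block_size, rng):
    """Shuffle arm labels within consecutive blocks; every full block is balanced."""
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    assignments = []
    while len(assignments) < n:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)
        assignments.extend(block)
    return assignments[:n]


def stratified_randomize(strata, arms, block_size, rng):
    """Run permuted-block randomization separately within each stratum."""
    by_stratum = {}
    for idx, stratum in enumerate(strata):
        by_stratum.setdefault(stratum, []).append(idx)
    out = [None] * len(strata)
    for indices in by_stratum.values():
        alloc = permuted_block_randomize(len(indices), arms, block_size, rng)
        for i, arm in zip(indices, alloc):
            out[i] = arm
    return out


rng = random.Random(2025)  # fixed seed so the sketch is reproducible
arms = ["control", "isoniazid", "ethambutol"]  # 1:1:1 allocation, as in REMoxTB
blocked = permuted_block_randomize(12, arms, block_size=6, rng=rng)
print({arm: blocked.count(arm) for arm in arms})
```

Calling `stratified_randomize` with a list of stratum labels (for example, hypothetical weight-group-by-center codes) applies the same blocking within each stratum, which is what keeps the arms balanced across those factors.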

Blinding

  • Potential bias can arise from patients or study staff knowing which treatment is being administered
  • Blinding hides the treatment being administered (e.g., same pill appearance or generic labeling of study materials)
  • Challenging to perform in some settings (e.g., surgical interventions or behavior-altering interventions)

Analytic framework

  • Key considerations: hypotheses being assessed, study design
  • Aspects of statistical methods may be informed by the nature of the endpoint too but scientific question and estimand should dominate
  • The estimand translates the scientific question into a quantity evaluable by way of a hypothesis test
    • e.g., difference of means: two-sample t-test, linear model
    • e.g., odds ratio: logistic regression
    • e.g., hazard ratio: Cox proportional hazards model
  • Should the nature of the endpoint impact the estimand, or should what you measure affect the question of interest?
  • Additional considerations include accommodating design features, e.g., via stratification or interactions (e.g., group-by-time effects)
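
As a small illustration of how the choice of estimand changes the number reported, the sketch below computes a risk difference and an odds ratio from the same 2x2 table. The counts are hypothetical and the function names are illustrative; a regression model would typically be used in practice to accommodate design features.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def risk_difference(a, b, c, d):
    """a, b: events / non-events in arm A; c, d: events / non-events in arm B."""
    return a / (a + b) - c / (c + d)


def odds_ratio_with_ci(a, b, c, d, level=0.95):
    """Odds ratio with a Wald interval computed on the log scale."""
    or_hat = (a * d) / (b * c)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return or_hat, exp(log(or_hat) - z * se_log), exp(log(or_hat) + z * se_log)


# Hypothetical 2x2 table: 30/100 events in arm A, 20/100 events in arm B
a, b, c, d = 30, 70, 20, 80
rd = risk_difference(a, b, c, d)
or_hat, or_lo, or_hi = odds_ratio_with_ci(a, b, c, d)
print(f"risk difference: {rd:.3f}")
print(f"odds ratio: {or_hat:.2f} (95% CI {or_lo:.2f} to {or_hi:.2f})")
```

Both summaries describe the same data, but they answer differently scaled questions, so the estimand should be fixed by the scientific question before the method is chosen.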

Design considerations

  • Randomization is critical but how many participants to recruit?
  • Sample size selection has implications for cost and feasibility
    • typically consider statistical power for fixed Type-I error rate \(\alpha\)
    • power: probability of rejecting the null when alternative is true, i.e., \(\Pr(\text{reject } H_0 \mid H_1 \text{ is true}) = 1 - \beta\), for Type-II error rate \(\beta\)
  • Increasing sample size: higher chance of rejecting \(H_0\) when it is false
  • Two strategies for power analysis:
    1. Pre-specify power (e.g., 80%) and calculate sample size
    2. Pre-specify sample size and calculate approximate power
  • In both cases, power and sample size also depend on effect size, variability in outcome in population, missing data (e.g., dropout)
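
Both strategies can be sketched with the usual normal approximation for a two-arm comparison of means with equal allocation and a two-sided test. The effect size, standard deviation, and dropout rate below are hypothetical, and the function names are illustrative; an exact t-based calculation gives slightly larger sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Strategy 1: fix the power, solve for the sample size per arm."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


def approx_power(n, delta, sigma, alpha=0.05):
    """Strategy 2: fix n per arm, compute approximate power (far tail ignored)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(abs(delta) / (sigma * sqrt(2 / n)) - z_alpha)


def inflate_for_dropout(n, dropout_rate):
    """Enroll enough participants that about n remain after anticipated dropout."""
    return ceil(n / (1 - dropout_rate))


# Hypothetical inputs: mean difference of 0.5 with outcome standard deviation 1.0
n = n_per_arm(delta=0.5, sigma=1.0)
print("n per arm:", n)
print("approximate power at that n:", round(approx_power(n, 0.5, 1.0), 3))
print("n per arm allowing for 20% dropout:", inflate_for_dropout(n, 0.20))
```

Halving the effect size roughly quadruples the required sample size, since \(n\) scales with \((\sigma / \delta)^2\) in this approximation.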

Statistical analysis

  • With all of the above in place, this part is easy! (except the details…)
  • First, explore and describe the data: tables, figures, summaries
  • Next, evaluate the primary hypothesis based on the pre-specified analysis plan
    • encodes the primary study question and the estimand (e.g., group difference from a two-sample t-test)
    • required to adhere to the analysis plan pre-specified in the protocol to avoid “gaming” based on the data
  • Finally, summarize and interpret results, both statistical and scientific

Statistical analysis…complications

  • Often, additional issues arise, including
    • challenges in recruiting and retaining study participants
    • imbalance between the study arms based on key factors
    • non-compliance of participants with the study protocol
    • drop out from the study
  • Each item above necessitates adjustment in the statistical analysis plan
  • Tread lightly…adjustments can be dangerous since they open the door to questioning the objectivity and integrity of the study
    • anticipate possible issues and pre-specify corrective strategies
    • sensitivity analyses may be designed to assess how much such deviations impact the results

Per-protocol versus intention-to-treat analyses

  • The study protocol specifies in detail what the intervention is, including
    • timing/schedule (e.g., single or multiple time point treatment)
    • dosage, both by arm and over time
    • clinically necessary modifications (e.g., dose reduction for adverse events or contraindications)
  • Ideally, participants and staff adhere to the protocol
  • In reality, deviations occur all the time, including
    • non-compliance with dosing or schedule
    • treatment switching (e.g., investigational to control regimens)
    • concomitant treatment (e.g., external to study)

Per-protocol versus intention-to-treat analyses

  • Two common analytic strategies:
    1. Per-protocol (PP): only participants who adhere to the protocol
    2. Intention-to-treat (ITT): all participants as randomized
  • The per-protocol estimand is defined as the effect of treatment among those who adhere to the protocol
    • classical approach: remove non-adherent participants and analyze remainder in each study arm, then compute the contrast
    • modern approaches adopt causal inference ideas to define the per-protocol estimand as a sub-population effect based on principal stratification (e.g., as the complier average causal effect)
  • The intention-to-treat estimand is defined as the effect of treatment among all participants as randomized
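
The difference between the ITT contrast and the classical (naive) per-protocol contrast can be seen on a toy data set. The records below are entirely hypothetical and the function names are illustrative. The naive per-protocol contrast simply drops non-adherent participants, which breaks the protection of randomization; it is not the principal-stratification approach mentioned above.

```python
def mean(values):
    return sum(values) / len(values)


def itt_estimate(records):
    """Difference in mean outcome by randomized arm, ignoring adherence."""
    arm_a = [r["y"] for r in records if r["arm"] == "A"]
    arm_b = [r["y"] for r in records if r["arm"] == "B"]
    return mean(arm_a) - mean(arm_b)


def naive_pp_estimate(records):
    """Classical per-protocol contrast: drop non-adherers, then compare arms."""
    adherent = [r for r in records if r["adhered"]]
    return itt_estimate(adherent)


# Hypothetical toy data: y = 1 is a favorable outcome
records = [
    {"arm": "A", "adhered": True, "y": 1},
    {"arm": "A", "adhered": True, "y": 1},
    {"arm": "A", "adhered": True, "y": 0},
    {"arm": "A", "adhered": False, "y": 0},
    {"arm": "B", "adhered": True, "y": 1},
    {"arm": "B", "adhered": True, "y": 0},
    {"arm": "B", "adhered": True, "y": 0},
    {"arm": "B", "adhered": True, "y": 0},
]

print("ITT estimate:", round(itt_estimate(records), 3))
print("naive PP estimate:", round(naive_pp_estimate(records), 3))
```

Here the two contrasts differ only because one non-adherent participant in arm A is excluded; whether that exclusion is defensible depends on why the participant did not adhere.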

Subgroup analyses

  • Primary analysis results are usually a single contrast across all study participants, with the goal of generalizing to the study population
  • Effects of interventions may differ in different sub-populations:
    • called “treatment effect heterogeneity”
    • e.g., by disease severity at presentation
    • e.g., for PLwH versus not in REMoxTB
    • e.g., by socio-demographic factors

Subgroup analyses

  • Subgroup analyses should be pre-specified in the study protocol
    • these may inform preliminary data for future, follow-up studies
    • subgroup effects can be evaluated via interaction terms in regression or stratified analyses
  • Subgroup analyses suffer a few key limitations:
    • more false positives due to multiple testing over subgroups
    • more false negatives due to low sample size in subgroups
    • non-randomized: no safeguards (Bland and Altman 2011)
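
The multiple-testing limitation can be quantified with a short calculation. This sketch assumes the subgroup tests are independent, which is a simplification, and the function names are illustrative; the Bonferroni correction shown is one common adjustment, not necessarily the one a given trial would pre-specify.

```python
def familywise_error_rate(k, alpha=0.05):
    """P(at least one false positive) across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k


def bonferroni_level(k, alpha=0.05):
    """Per-test level that keeps the family-wise error rate at or below alpha."""
    return alpha / k


for k in (1, 5, 10):
    unadjusted = familywise_error_rate(k)
    adjusted = familywise_error_rate(k, alpha=bonferroni_level(k))
    print(f"{k:2d} subgroups: unadjusted {unadjusted:.3f}, Bonferroni {adjusted:.3f}")
```

The adjustment controls false positives at the cost of further reducing power within each subgroup, which compounds the small-sample problem noted above.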

Interim monitoring

  • Clinical trials are approved by an institutional review board (IRB) and evaluated by a data and safety monitoring board (DSMB)
  • The DSMB is independent and composed of clinicians, statisticians, patient advocates, and ethicists
    • is responsible for reviewing study progress (accrual, conduct)
    • will pause a trial to evaluate adverse events
  • The DSMB may stop a trial early based on interim analyses:
    • such analyses are pre-specified and triggered by, e.g., patient accrual milestones and/or fixed time points
    • such analyses typically concern efficacy, futility, and safety

Interim monitoring and early stopping

  • Interim monitoring speeds up decision-making:
    • allows an effective treatment to be found quickly
    • minimizes exposure to an unsafe treatment
    • reduces costs when treatment is ineffective
    • risks failing to gain information for secondary endpoints
  • Interim analyses are still hypothesis tests and “spend” \(\alpha\):
    • opportunity to make an inferential mistake at each interim analysis
    • must adjust to preserve the overall Type-I error rate of the study
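
A small simulation illustrates why the adjustment is needed. The sketch below simulates trials under the null hypothesis with a known-variance z-test and compares a single final test against four unadjusted looks at the nominal 5% level. The look schedule, number of simulations, and seed are arbitrary assumptions, and the function names are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def rejects_at_any_look(rng, looks, z_crit):
    """Simulate one null trial; test the running z-statistic at each look."""
    total, n = 0.0, 0
    for look_n in looks:
        while n < look_n:
            total += rng.gauss(0.0, 1.0)  # H0 true: mean zero, unit variance
            n += 1
        if abs(total / sqrt(n)) > z_crit:
            return True  # trial would have stopped and rejected H0
    return False


def simulated_type1_error(looks, alpha=0.05, n_sims=4000, seed=1):
    """Share of null trials rejecting H0 at any look, each look tested at level alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = sum(rejects_at_any_look(rng, looks, z_crit) for _ in range(n_sims))
    return hits / n_sims


single = simulated_type1_error([100])
repeated = simulated_type1_error([25, 50, 75, 100])
print(f"one final test: {single:.3f}")
print(f"four unadjusted looks: {repeated:.3f}")
```

Because both runs use the same seed, they see the same simulated data, so any excess rejections in the second run come purely from the extra looks.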

Interim monitoring: \(\alpha\)-spending

  • In group-sequential designs, common approach is to use \(\alpha\)-spending functions (DeMets and Lan 1994):
    • preserve overall Type-I error rate \(\alpha\) across \(R\) interim analyses
    • define \(\alpha(\tau)\) as an increasing function in the information fraction \(\tau\) (e.g., \(\tau = n/N\) for \(n\) units enrolled of total \(N\))
    • consider \(R\) analyses at information fractions \(\tau_1 < \cdots < \tau_R = 1\); then \(\alpha(\tau_r)\) gives the cumulative Type-I error that may be spent through the \(r\)th analysis, so that the overall Type-I error rate \(\alpha(\tau_R) = \alpha\) is preserved at the final test
    • intuition: realizing that interim tests are correlated sequentially, how stringent must we be at each interim test to control overall \(\alpha\)?
  • The \(\alpha\)-spending approach led to several popular spending functions, including those yielding Pocock and O’Brien-Fleming boundaries
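
The two spending functions most often associated with these boundaries can be written down directly. The sketch below uses the standard Lan-DeMets forms, \(\alpha(\tau) = 2 - 2\Phi(z_{1-\alpha/2} / \sqrt{\tau})\) for the O’Brien-Fleming type and \(\alpha(\tau) = \alpha \ln(1 + (e - 1)\tau)\) for the Pocock type; the function names and the choice of four equally spaced looks are assumptions. It computes only the cumulative and incremental \(\alpha\) spent; turning these into critical values requires numerical integration over the joint distribution of the interim statistics, which is not shown.

```python
from math import e, log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obf_type_spending(tau, alpha=0.05):
    """O'Brien-Fleming-type spending: 2 - 2 * Phi(z_{1 - alpha/2} / sqrt(tau))."""
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _STD_NORMAL.cdf(z / sqrt(tau)))


def pocock_type_spending(tau, alpha=0.05):
    """Pocock-type spending: alpha * ln(1 + (e - 1) * tau)."""
    return alpha * log(1 + (e - 1) * tau)


def incremental_alpha(spend, fractions, alpha=0.05):
    """Type-I error newly spent at each analysis: alpha(tau_r) - alpha(tau_{r-1})."""
    cumulative = [spend(tau, alpha) for tau in fractions]
    return [cur - prev for cur, prev in zip(cumulative, [0.0] + cumulative[:-1])]


fractions = [0.25, 0.50, 0.75, 1.00]  # hypothetical equally spaced looks
for name, spend in [("OBF-type", obf_type_spending), ("Pocock-type", pocock_type_spending)]:
    steps = incremental_alpha(spend, fractions)
    print(name, [round(s, 5) for s in steps], "total:", round(sum(steps), 5))
```

The O’Brien-Fleming-type function spends almost nothing at early looks and saves most of \(\alpha\) for the final analysis, whereas the Pocock-type function spends it more evenly.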

References

Bland, J Martin, and Douglas G Altman. 2011. “Comparisons Within Randomised Groups Can Be Very Misleading.” The BMJ 342. https://doi.org/10.1136/bmj.d561.
DeMets, David L, and K K Gordon Lan. 1994. “Interim Analysis: The Alpha Spending Function Approach.” Statistics in Medicine 13 (13-14): 1341–52.
Friedman, Lawrence M, Curt D Furberg, David L DeMets, David M Reboussin, and Christopher B Granger. 2015. Fundamentals of Clinical Trials. Springer. https://doi.org/10.1007/978-3-319-18539-2.
Gillespie, Stephen H, Angela M Crook, Timothy D McHugh, Carl M Mendel, Sarah K Meredith, Stephen R Murray, Frances Pappas, Patrick P J Phillips, and Andrew J Nunn. 2014. “Four-Month Moxifloxacin-Based Regimens for Drug-Sensitive Tuberculosis.” New England Journal of Medicine 371 (17): 1577–87. https://doi.org/10.1056/NEJMoa1407426.
Kahan, Brennan C, Suzie Cro, Fan Li, and Michael O Harhay. 2023. “Eliminating Ambiguous Treatment Effects Using Estimands.” American Journal of Epidemiology 192 (6): 987–94. https://doi.org/10.1093/aje/kwad036.
Kahan, Brennan C, Joanna Hindley, Mark Edwards, Suzie Cro, and Tim P Morris. 2024. “The Estimands Framework: A Primer on the ICH E9 (R1) Addendum.” The BMJ 384: e076316. https://doi.org/10.1136/bmj-2023-076316.
Pocock, Stuart J. 1983. Clinical Trials: A Practical Approach. John Wiley & Sons. https://doi.org/10.1002/9781118793916.
US Food and Drug Administration. 2021. “Statistical Principles for Clinical Trials E9 (R1): Addendum: Estimands and Sensitivity Analysis in Clinical Trials.” https://www.fda.gov/regulatory-information/search-fda-guidance-documents/e9r1-statistical-principles-clinical-trials-addendum-estimands-and-sensitivity-analysis-clinical.
Weir, Isabelle R, Suzanne M Dufault, and Patrick P J Phillips. 2024. “Estimands for Clinical Endpoints in Tuberculosis Treatment Randomized Controlled Trials: A Retrospective Application in a Completed Trial.” Trials 25 (1). https://doi.org/10.1186/s13063-024-07999-w.