July 8, 2025
Two-sample data can be paired or unpaired (independent).
Paired measurements for each “participant” or study unit
Two independent sets of measurements
Nature of the data dictates testing procedure: two-sample test for paired data or two-sample test for independent group data.
Does the dietary (energy) intake of women differ pre- versus post-menstruation (across 10 days)? (Table 9.3 of Altman 1990)
A study was conducted to assess the effect of menstruation on energy intake.
pre | post |
---|---|
5260 | 3910 |
5470 | 4220 |
5640 | 3885 |
6180 | 5160 |
6390 | 5645 |
6515 | 4680 |
The energy intake measurements are paired within a given study unit (each of the women)—use this structure to advantage, i.e., exploit homogeneity.
For each woman \(i\), we have measurements \(x_{i, \text{pre}}\) and \(x_{i, \text{post}}\).
Define the measurement difference: \(d_i = x_{i, \text{pre}} - x_{i, \text{post}}\)
Base inference on \(\overline{d}\), sample mean of the \(d_i\), that is \(\overline{d} = \frac{1}{n} \sum_{i=1}^n d_i\)
Let \(\Delta\) be the population mean of the difference in energy intake for 10-day intervals pre- and post-menstruation for the population of women from which this random sample (of \(n=11\)) was taken.
The null and alternative hypotheses are \(H_0: \Delta = 0\) versus \(H_A: \Delta \neq 0\).
The test statistic for the paired \(t\)-test is
\(t\)-test and confidence interval for paired data
\[ t = \dfrac{\overline{d} - \Delta_{H_0}}{\dfrac{s_d}{\sqrt{n}}} \quad \quad \quad \text{CI}(\Delta) = \overline{d} \pm \left( t^{\star} \times \dfrac{s_d}{\sqrt{n}} \right) \] where \(\overline{d} = \frac{1}{n} \sum_{i=1}^n d_i\) is the mean of the paired differences \(d_i\), \(s_d\) is the sample standard deviation of the \(d_i\), and \(n\) is the number of differences (i.e., number of pairs).
A paired \(t\)-test is just a one-sample test of the differences \(d_i\); the \(p\)-value may be calculated from a \(t\) distribution with \(df = n - 1\). In the above, recall that \(\dfrac{s_d}{\sqrt{n}}\) is the standard error of the mean of the differences, \(\overline{d}\).
Letting `R` do the work

Paired \(t\)-test
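The call producing the output below is not shown on the slide; a minimal reconstruction, assuming the `intake` data (columns `pre` and `post`) are already loaded:

# `intake` assumed available, e.g., data(intake, package = "ISwR")
t.test(intake$pre, intake$post, paired = TRUE, alternative = "two.sided")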
Paired t-test
data: intake$pre and intake$post
t = 11.941, df = 10, p-value = 3.059e-07
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
1074.072 1566.838
sample estimates:
mean difference
1320.455
Differences \(t\)-test
# one-sample syntax: t-test on the differences d_i = pre - post
t.test(
  intake[, pre - post],  # data.table syntax for the vector of differences
  mu = 0, paired = FALSE,
  alternative = "two.sided"
)
One Sample t-test
data: intake[, pre - post]
t = 11.941, df = 10, p-value = 3.059e-07
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
1074.072 1566.838
sample estimates:
mean of x
1320.455
Independent \(t\)-test
# two-sample syntax: (incorrectly) treats pre and post as independent groups
t.test(
  intake$pre, intake$post,
  mu = 0, paired = FALSE,
  alternative = "two.sided"
)
Welch Two Sample t-test
data: intake$pre and intake$post
t = 2.6242, df = 19.92, p-value = 0.01629
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
270.5633 2370.3458
sample estimates:
mean of x mean of y
6753.636 5433.182
Like the one-sample version, the paired \(t\)-test requires assumptions: the differences \(d_i\) are independent and approximately normally distributed (or \(n\) is large enough for the CLT to apply).
Question: Does change in non-dominant arm strength (`ndrm.ch`) after resistance training differ between men and women?
sex | age | height (in) | weight (lb) | actn3.r577x | ndrm.ch |
---|---|---|---|---|---|
Female | 27 | 65.0 | 199 | CC | 40 |
Male | 36 | 71.7 | 189 | CT | 25 |
Female | 24 | 65.0 | 134 | CT | 40 |
Female | 40 | 68.0 | 171 | CT | 125 |
Female | 32 | 61.0 | 118 | CC | 40 |
Female | 24 | 62.2 | 120 | CT | 75 |
`famuss` dataset from Vu and Harrington (2020)
Framing the question: the null and alternative hypotheses are \(H_0: \mu_A = \mu_B\) versus \(H_A: \mu_A \neq \mu_B\), written generically in terms of the two group means \(\mu_A\) and \(\mu_B\) (here, women and men).
The test statistic for the two-group (independent) \(t\)-test is
Two-group (independent) \(t\)-test
\[t = \dfrac{(\bar{x}_A - \bar{x}_B) - (\mu_A - \mu_B)} {\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}} = \dfrac{d_{\overline{x}} - \Delta_{\mu}}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}},\] where \(\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}\) estimates the standard error of the difference in the two group means.
The two-sample \(t\)-test compares between-group differences; the \(p\)-value may be calculated from the \(t\) distribution, but the \(df\) differ.
A conservative approximation is \(df = \text{min}(n_A - 1, n_B - 1)\)
The Welch-Satterthwaite approximation (used by `R`): \[df = \dfrac{\left[s_A^2 / n_A + s_B^2 / n_B \right]^2}{\dfrac{(s_A^2/n_A)^2}{n_A - 1} + \dfrac{(s_B^2/n_B)^2}{n_B - 1}}\]
An alternative is to use Student's version, \(df = n_A + n_B - 2\), under a (strong!) assumption of equal variances in the two groups \(A\) and \(B\).
A \((1-\alpha)\)(100)% confidence interval (CI) for the difference in the two population means \(\Delta_{\mu} = \mu_A - \mu_B\) is of the form
\[\text{CI}(\mu_A - \mu_B) = (\overline{x}_A - \overline{x}_B) \pm \left(t^{\star} \times \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}} \right),\] where \(t^{\star}\) is the critical point (i.e., quantile) on a \(t\) distribution (with same \(df\) as for corresponding \(t\)-test) and area \(\dfrac{\alpha}{2}\) in either tail.
Letting `R` do the work
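The calls are not shown; a plausible reconstruction matching the two outputs below (the index vectors `female` and `male` are assumptions, named to match the `data:` lines):

# group indicators (names assumed from the output's data labels)
female <- famuss$sex == "Female"
male   <- famuss$sex == "Male"

# Welch's t-test (R's default): does not assume equal group variances
t.test(famuss$ndrm.ch[female], famuss$ndrm.ch[male])

# Student's t-test: assumes equal variances, so df = n_A + n_B - 2
t.test(famuss$ndrm.ch[female], famuss$ndrm.ch[male], var.equal = TRUE)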
Welch Two Sample t-test
data: famuss$ndrm.ch[female] and famuss$ndrm.ch[male]
t = 10.073, df = 574.01, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
19.07240 28.31175
sample estimates:
mean of x mean of y
62.92720 39.23512
Welch's version uses \(\text{SE} = \sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}\)
Two Sample t-test
data: famuss$ndrm.ch[female] and famuss$ndrm.ch[male]
t = 9.1425, df = 593, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
18.60259 28.78155
sample estimates:
mean of x mean of y
62.92720 39.23512
Student's version uses \(\text{SE} = s_{\text{pooled}} \sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}\), where \(s_{\text{pooled}}^2 = \dfrac{(n_A - 1) s_A^2 + (n_B - 1) s_B^2}{n_A + n_B - 2}\)
Permutation testing is a nonparametric framework for inference
Permutation/randomization inference evaluates evidence against a different null hypothesis (\(H_0: X_{i,A} = X_{i,B} \,\, \forall \,\, i = 1, \ldots, n\))
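The output format below matches `oneway_test()` from the `coin` package; a sketch of the presumed calls (the exact variant enumerates the permutation null distribution, the approximate variant resamples it):

library(coin)  # assumed; output headers match coin's Fisher-Pitman test

# exact permutation null distribution
oneway_test(ndrm.ch ~ sex, data = famuss, distribution = "exact")

# approximate (Monte Carlo) null distribution from 10,000 permutations
oneway_test(ndrm.ch ~ sex, data = famuss,
            distribution = approximate(nresample = 10000))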
Exact Two-Sample Fisher-Pitman Permutation Test
data: ndrm.ch by sex (Female, Male)
Z = 8.5664, p-value < 2.2e-16
alternative hypothesis: true mu is not equal to 0
Exact: How many ways are there to choose 242 males from \(n = 595\) individuals? \({595 \choose 242} \approx 1.29 \times 10^{173}\)!
Approximative Two-Sample Fisher-Pitman Permutation Test
data: ndrm.ch by sex (Female, Male)
Z = 8.5664, p-value < 1e-04
alternative hypothesis: true mu is not equal to 0
Approximate: What about using 10,000 random permutations instead? Is that enough?
Data on \(n\) units, measured as \(X_1, \ldots, X_n\), under two conditions \(A\) and \(B\), e.g., \(n = 1000\) patients assigned treatment \(A\) or placebo \(B\)
Weak null: averages
\(H_0: \mu_A = \mu_B\)
\(H_A: \mu_A \neq \mu_B\)
Is there no effect on average?
Sharp (strong) null: individuals
\(H_0: X_{i, A} = X_{i, B} \,\, \forall \,\, i\)
\(H_A: X_{i, A} \neq X_{i, B} \,\, \text{for some} \,\, i\)
Is there no effect for everyone?
The sharp null implies the weak null—no effect at all, no effect on average too.
Going beyond two-group comparisons
Is change in non-dominant arm strength after resistance training associated with genotype?
Suppose we are interested in comparing means across more than two groups. Why not conduct several two-sample \(t\)-tests?
ANOVA: Assesses equality of means across many groups, \(H_0: \mu_1 = \mu_2 = \cdots = \mu_k\) versus \(H_A:\) at least one mean differs from the others.
Under \(H_0\), there is no real difference between the groups—so any observed variation in group means must be due to chance alone.
ANOVA exploits differences in means from an overall mean and within-group variation to evaluate equality of a few group means.
Is the variability in the sample means large enough that it seems unlikely to be due to chance alone?
Compare two quantities: the variability between the group means (relative to the overall mean) and the variability of observations within each group.
The \(F\)-test for ANOVA
ANOVA measures discrepancies via the F-statistic, which follows an \(F\) distribution: \[ F = \frac{\text{variance between groups}}{\text{variance within groups}} = \frac{MS_{\text{Groups}}}{MS_{\text{Error}}} \]
Just like the \(t\)-test, ANOVA requires assumptions about the data: observations are independent, responses within each group are approximately normal, and the variance is roughly constant across groups.
When these assumptions are violated, hard to know whether \(H_0\) being rejected is due to evidence or failure of the assumptions…
As with the \(t\)-test, a permutational \(F\)-test can be formulated.
If the \(F\)-test indicates sufficient evidence of inequality of the group means, pairwise comparisons to identify the differing group(s) may follow.
Pairwise comparisons may use the two-group (independent) \(t\)-test:
Letting `R` do the work
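The call is not shown; a reconstruction consistent with the term label in the output below:

# one-way ANOVA of strength change on ACTN3 genotype
summary(aov(famuss$ndrm.ch ~ famuss$actn3.r577x))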
 Df Sum Sq Mean Sq F value Pr(>F)
famuss$actn3.r577x 2 7043 3522 3.231 0.0402 *
Residuals 592 645293 1090
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: \(p < \alpha\), sufficient evidence to reject \(H_0\) in favor of \(H_A\)—so at least one group differs in mean from the others.
But which group(s) is (are) the source of the difference? 🧐
If the \(F\)-test indicates sufficient evidence of inequality of the group means, pairwise comparisons (\(t\)-tests) may identify those groups.
Use `pairwise.t.test()` to perform these post hoc two-sample \(t\)-tests.

Pairwise comparisons using two-sample \(t\)-tests (`CC` to `CT`, `CC` to `TT`, `CT` to `TT`) can now be done if the Type I error rate is controlled.
Examine each of the three two-sample \(t\)-tests, evaluating evidence for group mean differences at the more stringent level of \(\alpha^{\star}\).
Letting `R` do the work
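A sketch of the presumed call, with unadjusted \(p\)-values so they can be compared directly against \(\alpha^{\star} = 0.05/3\):

# post hoc pairwise t-tests across the three genotype groups;
# pool.sd = TRUE by default (a common SD is used across groups)
pairwise.t.test(famuss$ndrm.ch, famuss$actn3.r577x, p.adjust.method = "none")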
Only `CC` versus `TT` resulted in a \(p\)-value less than \(\alpha^{\star} = 0.0167\).

Strength change for those with `CT` is not distinguishable from strength change for those with `CC` and `TT`.

`CC` and `TT` are different.

\(p\)-values on original scale (compare to \(\alpha^{\star}\))
Hypothesis testing was intended for controlled experiments or for studies with only a few comparisons (e.g., ANOVA).
Type I errors (rejecting \(H_0\) when \(H_0\) true) occur with probability \(\alpha\).
Multiple testing: compounding error
What happens to Type I error when making several comparisons? When conducting many tests, many chances to make a mistake.
A scientist is using two \(t\)-tests to examine a possible association of each of two genes with a disease type. Assume the tests are independent and each is conducted at level \(\alpha = 0.05\).
Example of compounding of Type I error
The probability of making at least one Type I error is the complement of making no Type I error on either test; with \(A\) and \(B\) denoting Type I errors on the two tests, \[1 - [\mathbb{P}(A^C) \mathbb{P}(B^C) ] = 1 - (1 - 0.05)^2 = 0.0975\]
10 tests… \(\text{experiment-wise error }= 1 - (1 - 0.05)^{10} = 0.401\)
25 tests… \(\text{experiment-wise error }= 1 - (1 - 0.05)^{25} = 0.723\)
100 tests… \(\text{experiment-wise error }= 1 - (1 - 0.05)^{100} = 0.994\)
With 100 independent tests: 99.4% chance an investigator will make at least one Type I error!
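These experiment-wise error rates are quick to verify in `R`:

# P(at least one Type I error) across k independent tests at level alpha
alpha <- 0.05
k <- c(2, 10, 25, 100)
1 - (1 - alpha)^k
# 0.0975 0.4013 0.7226 0.9941 (rounded)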
Advanced melanoma is an aggressive form of skin cancer that until recently was almost uniformly fatal.
Research is being conducted on therapies that might be able to trigger immune responses to the cancer that then cause the melanoma to stop progressing or disappear entirely.
In a study where 52 patients were treated concurrently with 2 new therapies, nivolumab and ipilimumab, 21 had immune responses.
Some research questions that can be addressed with inference…
In this study of melanoma, experiencing an immune response to the concurrent therapy is a binary event (i.e., binomial data).
Suppose \(X \sim \text{Bin}(n, p)\) is a binomial RV with parameters \(n\), the number of trials, and \(p\), the probability of success.
The sampling distribution of \(\hat{p}\) is approximately normal when \(np \geq 10\) and \(n(1-p) \geq 10\) (the success-failure condition).
Under these conditions, \(\hat{p}\) is approximately normally distributed with mean \(p\) and standard error \(\sqrt{\frac{p(1-p)}{n}}\), estimated as \(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\).
For CIs, use \(\hat{p}\) in place of \(p\); in hypothesis testing, use \(p_0\) in place of \(p\).
Test statistic \(z\) for null \(H_0: p = p_0\) is \[z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}\]
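In `R`, the approximate test (with continuity correction) can be run via `prop.test()`; the call below is reconstructed from the output that follows:

# 21 immune responses out of 52 patients; H_0: p = 0.3 vs. H_A: p > 0.3
prop.test(x = 21, n = 52, p = 0.3, alternative = "greater")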
1-sample proportions test with continuity correction
data: 21 out of 52, null probability 0.3
X-squared = 2.1987, df = 1, p-value = 0.06906
alternative hypothesis: true p is greater than 0.3
95 percent confidence interval:
0.2906582 1.0000000
sample estimates:
p
0.4038462
An exact procedure does not rely upon a limiting approximation.
The Clopper-Pearson CI is an exact method for binomial CIs: \((\inf S_{\geq}, \sup S_{\leq})\), where \[\begin{align*} S_{\leq} &:= \{p \lvert \mathbb{P} \{\text{Bin}(n,p) \leq x\} > \alpha/2\} \\ S_{\geq} &:= \{p \lvert \mathbb{P} \{\text{Bin}(n,p) \geq x\} > \alpha/2\} \end{align*}\]
Note: While exactness is by construction, the CIs can be conservative (i.e., too large) relative to those from an approximation.
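In `R`, the exact tail probability and the exact test itself can be obtained as follows (a reconstruction matching the output below):

# exact upper-tail probability P(X >= 21) for X ~ Bin(52, 0.3)
pbinom(20, size = 52, prob = 0.3, lower.tail = FALSE)

# exact binomial test with Clopper-Pearson confidence interval
binom.test(21, 52, p = 0.3, alternative = "greater")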
[1] 0.07167176
Exact binomial test
data: 21 and 52
number of successes = 21, number of trials = 52, p-value = 0.07167
alternative hypothesis: true probability of success is greater than 0.3
95 percent confidence interval:
0.2889045 1.0000000
sample estimates:
probability of success
0.4038462
The normal approximation can be applied to \(\hat{p}_A - \hat{p}_B\) if each sample separately satisfies the success-failure condition and the two samples are independent.
The standard error of the difference in sample proportions is \[\text{SE}(\hat{p}_A - \hat{p}_B) = \sqrt{\dfrac{p_A(1 - p_A)}{n_A} + \dfrac{p_B(1 - p_B)}{n_B}}\]
In resource-limited settings, single-dose nevirapine is given to HIV\(^{+}\) women during birth to prevent mother-to-child transmission of HIV.
Here, the possible outcomes are virologic failure (virus becomes resistant) versus stable disease (virus growth is prevented).
The results of a study comparing NVP vs. LPV in the treatment of HIV-infected infants (Violari et al. 2012). Children were randomized to receive NVP or LPV.
 | NVP | LPV | Total |
---|---|---|---|
Virologic Failure | 60 | 27 | 87 |
Stable Disease | 87 | 113 | 200 |
Total | 147 | 140 | 287 |
Is there evidence of a difference in NVP vs. LPV? How to see this?
Do data support the claim of a differential outcome by treatment?
If there is no difference in outcome by treatment, then knowing treatment provides no information about outcome, i.e., treatment assignment and outcome are independent (i.e., not associated).
Question: What would we expect if no association (i.e., under \(H_0\))?
Idea: How do the observed cell counts differ from those expected (under \(H_0\), i.e., as if the null hypothesis were true)? Let’s use this.
The Pearson \(\chi^2\) test of independence formulates a test statistic to quantify the magnitude of deviation of observed results from what would be expected under \(H_0\).
Provided observations are independent and each expected cell count is sufficiently large (a common rule of thumb is at least 5), the Pearson \(\chi^2\) test statistic approximately attains a \(\chi^2\) distribution with \(df\) tied to the number of comparisons.
The \(\chi^2\) test statistic is calculated as
The \(\chi^2\) test
\[\chi^2 = \sum_{i = 1}^r \sum_{j = 1}^c \dfrac{(O_{i, j} - E_{i, j})^2}{E_{i, j}},\] and is approximately distributed \(\chi^2\) with \(df= (r - 1)(c - 1)\), where \(r\) is the number of rows and \(c\) is the number of columns (e.g., in a \(2 \times 2\) table, \(df = 1\)).
\(O_{i,j}\) represents the observed count in row \(i\), column \(j\); \(E_{i,j}\) is its expected counterpart under \(H_0\).
If treatment has no effect on outcome, what would we expect?
Under the null hypothesis of independence, \[\mathbb{P}(A \cap B) = \mathbb{P}(A) \times \mathbb{P}(B) = \left(\frac{147}{287}\right) \left(\frac{87}{287}\right) \, ,\] where \(A\) = {assignment to NVP} and \(B\) = {virologic failure}. So, the expected cell count in the upper left corner of the \(2 \times 2\) table from Violari et al. (2012) would be \[\mathbb{E}(A \cap B) = n \times \mathbb{P}(A \cap B) = (287) \left(\frac{147}{287}\right) \left(\frac{87}{287}\right) = 44.56 \, .\] Thus, the scaled deviation, \((O_{i,j} - E_{i,j})^2 / E_{i,j}\), would be \((60 - 44.56)^2 / 44.56 = 5.35\)
Same logic applies for other cells…repeat three times and sum to get the \(\chi^2\) test statistic.
Letting `R` do the work
The data from Violari et al. (2012):
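The observed table is not shown as code; a reconstruction (the object name `hiv_table` matches the `data:` lines in the test output below):

# observed counts from Violari et al. (2012)
hiv_table <- matrix(
  c(60, 27,
    87, 113),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Virologic Failure", "Stable Disease"),
    c("NVP", "LPV")
  )
)

# the expected counts shown next can be recovered via
# round(chisq.test(hiv_table)$expected, 2)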
What would we expect to see under independence? (Table of \(E_{i,j}\))
NVP LPV
Virologic Failure 44.56 42.44
Stable Disease 102.44 97.56
By comparing cells of these two tables (of \(O_{i,j}\) and \(E_{i,j}\)), we get the \(\chi^2\) test statistic.
`R`'s `chisq.test()` does all of this:
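A minimal call, assuming the `hiv_table` matrix built above:

# Yates' continuity correction is applied by default for 2x2 tables
chisq.test(hiv_table)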
Pearson's Chi-squared test with Yates' continuity correction
data: hiv_table
X-squared = 14.733, df = 1, p-value = 0.0001238
Conclusion: The \(\chi^2\) test finds evidence against the claimed independence of treatment (NVP vs. LPV) and virologic failure at most reasonable significance standards (\(p\)-value < 0.001 and \(S\)-value \(= -\log_2(p) \approx 13\)).
R.A. Fisher proposed an exact test for contingency tables by exploiting the counts and margins, with the “reasoned basis” for the test being randomization and exchangeability (Fisher 1936).
 | NVP | LPV | Total |
---|---|---|---|
Virologic Failure | a | b | a+b |
Stable Disease | c | d | c+d |
Total | a+c | b+d | n |
\[\begin{align*} p &= \dfrac{{a+b \choose a}{c+d \choose c}}{{n \choose a+c}} = \dfrac{{a+b \choose b}{c+d \choose d}}{{n \choose b+d}} \\ p &= \dfrac{(a+b)! \, (c+d)! \, (a+c)! \, (b+d)!}{a! \, b! \, c! \, d! \, n!} \end{align*}\]
Here, under \(H_0\), the count of virologic failures in the NVP group follows exactly a hypergeometric distribution, so the probability of seeing as many cases as we do can be computed exactly.
`R`'s `fisher.test()` does this:
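Again assuming the `hiv_table` matrix from above:

# exact test; also returns the (conditional MLE) odds ratio and its CI
fisher.test(hiv_table)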
Fisher's Exact Test for Count Data
data: hiv_table
p-value = 0.0001037
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.643280 5.127807
sample estimates:
odds ratio
2.875513
Conclusion: The exact test finds evidence against independence of treatment (NVP vs. LPV) and virologic failure (\(p < 0.001\)).
“Dr. Muriel Bristol, a colleague of Fisher’s, claimed that when drinking tea she could distinguish whether milk or tea was added to the cup first (she preferred milk first). To test her claim, Fisher asked her to taste eight cups of tea, four of which had milk added first and four of which had tea added first.” Agresti (2012)
 | Tea first (real) | Milk first (real) | Total |
---|---|---|---|
Tea first (thinks) | 2 | 2 | 4 |
Milk first (thinks) | 2 | 2 | 4 |
Total | 4 | 4 | 8 |
\(H_0\): Order doesn’t impact taste, so Dr. Bristol can’t tell
\(H_A\): Order does impact taste and Dr. Bristol can tell
What's the probability of getting all four right? There are \({8 \choose 4} = 70\) equally likely ways to guess which four cups had tea first, so the chance of getting all four right is \(1/70 \approx 0.014\).
 | Tea first (real) | Milk first (real) | Total |
---|---|---|---|
Tea first (thinks) | 4 | 0 | 4 |
Milk first (thinks) | 0 | 4 | 4 |
Total | 4 | 4 | 8 |
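A sketch of the corresponding `R` call (the object name `ladys_tea` matches the `data:` line in the output below):

# the all-correct table from the tea-tasting experiment
ladys_tea <- matrix(
  c(4, 0,
    0, 4),
  nrow = 2, byrow = TRUE
)
fisher.test(ladys_tea)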
Fisher's Exact Test for Count Data
data: ladys_tea
p-value = 0.02857
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.339059 Inf
sample estimates:
odds ratio
Inf
Relative risk (RR) measures the risk of an event occurring in one group relative to the risk of the same event occurring in another group.
The risk of virologic failure among the NVP group is \[\dfrac{\text{# in NVP group and had virologic failure}} {\text{total # in NVP group}} = \dfrac{60}{147} = 0.408\]
The risk of virologic failure among the LPV group is \[\dfrac{\text{# in LPV group and had virologic failure}} {\text{total # in LPV group}} = \dfrac{27}{140} = 0.193\]
RR of virologic failure for NVP vs. LPV is \(0.408/0.193 = 2.11\), so children treated with NVP are estimated to be more than twice as likely to experience virologic failure.
Odds ratio (OR) measures the odds of an event occurring in one group relative to the odds of the event occurring in another group.
The odds of virologic failure among the NVP group is \[\dfrac{\text{# in NVP group and had virologic failure}} {\text{# in NVP group and did not have virologic failure}} = \dfrac{60}{87} = 0.690\]
The odds of virologic failure among the LPV group is \[\dfrac{\text{# in LPV group and had virologic failure}} {\text{# in LPV group and did not have virologic failure}} = \dfrac{27}{113} = 0.239\]
OR of virologic failure for NVP vs. LPV is \(0.690/0.239 = 2.89\), so the odds of virologic failure when given NVP are nearly three times as large as the odds when given LPV.
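Both measures are easy to compute directly from the observed table; a small sketch, reusing the `hiv_table` matrix assumed earlier:

# risks of virologic failure within each treatment arm
risk <- hiv_table["Virologic Failure", ] / colSums(hiv_table)
rr <- risk["NVP"] / risk["LPV"]    # relative risk, about 2.11

# odds of virologic failure within each treatment arm
odds <- hiv_table["Virologic Failure", ] / hiv_table["Stable Disease", ]
or <- odds["NVP"] / odds["LPV"]    # odds ratio, about 2.89

c(RR = unname(rr), OR = unname(or))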
The relative risk cannot be used in studies that use outcome-dependent sampling, such as a case-control study: when sampling depends on the outcome, the sample proportions no longer estimate the population risks (the odds ratio, by contrast, remains estimable).