July 8, 2025
Statistics uses tools from probability and data analysis to draw conclusions about a population from a sample.
We’ll illustrate inferential principles in the setting of estimating a population mean.
Some pithy philosophy on statistics
Consider the Youth Risk Behavior Surveillance System (YRBSS), a survey conducted by the CDC to measure health-related activity in high school-aged youth.
The yrbss dataset in the oibiostat R package contains the responses from the 13,583 participants from the year 2013. The CDC used 13,572 students’ responses to estimate health behaviors in a target population: 21.2 million high school-aged students in the US in 2013.
The mean weight among the 21.2 million students is an example of a population parameter, i.e., \(\mu_{\text{weight}}\).
The mean within a sample (e.g., as with the 13,572 students in YRBSS) is a point estimate \(\bar{x}_{\text{weight}}\) of a population parameter.
Estimating the population mean weight from the sample of 13,572 participants is an example of statistical inference.
Why inference? It is too tedious to gather this information for all 21.2 million students—also, it is unnecessary.
In nearly all studies, there is one target population and one sample.
Suppose a different random sample (of the same size) were taken from the same population—different participants, different \(\bar{x}_{\text{weight}}\).
Sampling variability describes the degree to which a point estimate varies from sample to sample (assuming fixed sampling scheme).
Properties of sampling variability (randomness) allow us to account for its effect on estimates based on a sample.
The estimator \(\bar{X}\) is a random variable—randomness from sampling.
The statistic \(\bar{X}\) is a random variable for which \(\bar{X} \sim \text{N}(\mu, \sigma^2_{\bar{X}} )\)
Any sample statistic is a random variable since each sample drawn from the population ought to be different.
When the data have not yet been observed, the statistic, like the corresponding RV, is a function of the same random elements.
If \(\bar{X}\) could be observed through repeated sampling, its standard deviation would be \(\text{SE}_{\bar{X}} = \dfrac{\sigma_X}{\sqrt{n}}\) (n.b., \(\text{SE}_{\bar{X}} = \sqrt{\sigma^2_{\bar{X}}}\))
The variability of a sample mean decreases as sample size increases: \(\text{SE}_{\bar{X}}\) characterizes that behavior more precisely.
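As a rough added illustration (the population values here are hypothetical), a short simulation can show the empirical standard deviation of \(\bar{X}\) across repeated samples matching \(\sigma_X/\sqrt{n}\):

```r
# Hypothetical population: weights with mean 68 kg and SD 15 kg; samples of n = 100
set.seed(1)
n     <- 100
xbars <- replicate(5000, mean(rnorm(n, mean = 68, sd = 15)))

sd(xbars)     # empirical SD of the sample mean across repeated samples
15 / sqrt(n)  # theoretical SE = sigma_X / sqrt(n) = 1.5
```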
A confidence interval gives a plausible range of values for the population parameter, coupling an estimate and a margin of error:
Confidence intervals: Definition and construction
A confidence interval with coverage rate \(1-\alpha\) for a population mean \(\mu\) is any random interval \(\text{CI}_{(1-\alpha)}(\mu)\) such that \(\mathbb{P}[\mu \in \text{CI}_{(1-\alpha)}(\mu)] \geq 1 - \alpha\). A common form is \[\bar{x} \pm m \rightarrow (\bar{x} - m, \bar{x} + m),\] where the margin of error, \(m\), draws on the sampling variability of \(\bar{X}\).
Since \(\sqrt{n}(\bar{X} - \mu) \to_d \text{N}(0, \sigma^2)\), the margin of error may be based on the properties (e.g., quantiles) of the normal distribution.
The confidence level may also be called the confidence coefficient. Confidence intervals have a nuanced interpretation:
Asymptotic (\(1-\alpha\))(100)% CI
Random interval \(\text{CI}_{(1-\alpha)}(\mu)\) is a \((1-\alpha)\)(100)% CI if \(\lim_{n \to \infty} \mathbb{P}[\mu \in \text{CI}_{(1-\alpha)}(\mu)] \geq 1 - \alpha\)
The \(t\) distribution is symmetric, bell-shaped, and centered at 0; it is like a standard normal distribution \(\text{N}(0,1)\), almost…
It has an additional parameter: degrees of freedom (\(df\) or \(\nu\)).
A (\(1 - \alpha\))(100)% confidence interval (CI) for a population mean \(\mu\) based on a single sample with mean \(\bar{x}\) is
\[\bar{x} \pm t^\star \times \frac{s}{\sqrt{n}} \rightarrow \left( \bar{x} - t^\star\frac{s}{\sqrt{n}},\bar{x} + t^\star\frac{s}{\sqrt{n}} \right),\] where \(t^\star\) is the quantile of a \(t\) distribution (with \(\nu = n-1\) df) for which there is \(1 - \dfrac{\alpha}{2}\) area to its left.
For a 95% CI, find \(t^\star\) with 0.975 area to its left (or, equivalently, 0.025 area to its right).
The R function qt(p, df) finds the quantile of a \(t\) distribution with df degrees of freedom that has area \(p = \mathbb{P}(T \leq t)\) to its left.
\(t^\star\) for a 95% confidence interval where \(n = 10\) is 2.262.
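As a check (an added sketch, with made-up data in the second part), the quantile above can be reproduced with qt() and plugged into the CI formula:

```r
# t* for a 95% CI when n = 10 (df = 9)
qt(0.975, df = 9)   # 2.262157

# Hypothetical sample of n = 10 weights, used only to illustrate the formula
x <- c(62, 70, 68, 75, 64, 71, 66, 73, 69, 65)
mean(x) + c(-1, 1) * qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
```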
Just let R do the work for you… a 95% CI from t.test().
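A minimal sketch, assuming the oibiostat package is installed and that the yrbss data include a weight column (as in the running example):

```r
library(oibiostat)
data("yrbss")

# conf.level = 0.95 is the default; $conf.int extracts just the interval
t.test(yrbss$weight)$conf.int
```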
Question: Do Americans tend to be overweight?
Category | BMI range |
---|---|
Underweight | \(< 18.5\) |
Healthy weight | \(18.5 - 24.99\) |
Overweight | \(25 - 29.99\) |
Obese | \(\geq 30\) |
Based on the sample nhanes.samp.adult (from the oibiostat R package), the 95% confidence interval suggests that the population average BMI is well outside the range defined as healthy (BMI of 18.5-24.99).
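One way such an interval could be obtained (a sketch, assuming oibiostat and its nhanes.samp.adult data are available):

```r
library(oibiostat)
data("nhanes.samp.adult")

# 95% confidence interval for the mean BMI from a one-sample t procedure
t.test(nhanes.samp.adult$BMI)$conf.int
```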
[1] 27.81388 30.38524
attr(,"conf.level")
[1] 0.95
If a (\(1 - \alpha\))(100)% confidence interval for a population mean does not contain a hypothesized value \(\mu_0\), then a test of \(H_0: \mu = \mu_0\) against a two-sided alternative at significance level \(\alpha\) would reject \(H_0\).
For our BMI inquiry, there are a few possible choices for \(H_0\) and \(H_A\). To simplify and demonstrate, let’s use \(H_0: \mu_{\text{BMI}} = 21.7\) and \(H_A: \mu_{\text{BMI}} > 21.7\).
The form of \(H_A\) above is a one-sided alternative. One could also write a two-sided alternative, \(H_A: \mu_{\text{BMI}} \neq 21.7\).
The choice of one- or two-sided alternative is context-dependent and should be driven by the motivating scientific question.
The significance level \(\alpha\) quantifies how rare or unlikely an event must be in order to represent sufficient evidence against \(H_0\).
In other words, it sets the bar for the degree of evidence necessary for a difference to be considered “real” (or significant).
In the context of decision errors, \(\alpha\) is the probability of committing a Type I error (incorrectly rejecting \(H_0\) when it is true).
The test statistic measures the discrepancy between the observed data and what would be expected if the null hypothesis were true.
When testing hypotheses about a mean, a valid test statistic is \[T = \frac{\bar{X} - \mu_0}{\dfrac{s}{\sqrt{n}}},\] where, under \(H_0\), the test statistic \(T\) follows a \(t\) distribution with \(\nu = n-1\) degrees of freedom.
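A small sketch of computing \(T\) by hand, using rough summary values similar to the BMI example later in these notes:

```r
# Approximate summaries (illustrative): x_bar, s, n, and the null value mu_0
x_bar <- 29.1
s     <- 7.54
n     <- 135
mu_0  <- 21.7

t_stat <- (x_bar - mu_0) / (s / sqrt(n))
t_stat  # about 11.4, compared against a t distribution with n - 1 = 134 df
```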
We will go on to talk about a few more practical versions of the \(t\)-test (e.g., 2-sample \(t\)-test, 1-sample \(t\)-test of paired differences).
In each of these cases, some assumptions are required…
Justifying your assumptions is the hardest part
Given the context of your scientific problem, are these assumptions true or reasonable?
What is the probability that we would observe a result as or more extreme than the observed sample value, if the null hypothesis is true? This probability is the \(p\)-value.
Despite their popularity, \(p\)-values are notoriously hard to interpret. Rafi and Greenland (2020)’s \(S\)-values (“binary surprisal value”) are a cognitive tool for interpreting and understanding \(p\)-values.
The surprisal value \(S\) for interpreting \(p\)-values
The \(S\)-value is defined via \(p = \left(\frac{1}{2}\right)^s\) as \(s = -\log_2(p)\), where \(p\) is a \(p\)-value.
\(S\) quantifies the degree of surprise: a result with the given \(p\)-value is about as surprising as observing \(s\) heads in a row when tossing a fair coin \(s\) times.
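For example (an added illustration), converting \(p = 0.05\) to an \(S\)-value:

```r
# Binary surprisal: s = -log2(p)
p <- 0.05
-log2(p)  # about 4.3, i.e., roughly as surprising as ~4 heads in a row from a fair coin
```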
For a two-sided alternative, \(H_A: \mu \neq \mu_0\), the \(p\)-value of a \(t\)-test is the total area from both tails of the \(t\) distribution beyond the absolute value of the observed \(t\) statistic:
\[p = 2 \mathbb{P}(T \geq \lvert t \rvert) = \mathbb{P}(T \leq - \lvert t \rvert) + \mathbb{P}(T \geq \lvert t \rvert)\]
For a one-sided alternative, the \(p\)-value is the area in the tail of the \(t\) distribution that matches the direction of the alternative (a computational sketch follows these cases).
For \(H_A: \mu > \mu_0\): \(p = \mathbb{P}(T \geq t)\)
For \(H_A: \mu < \mu_0\): \(p = \mathbb{P}(T \leq t)\)
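These tail areas can be computed with pt(); the observed statistic and degrees of freedom below are assumed values for illustration:

```r
# Assumed values: observed t statistic and degrees of freedom
t_obs <- 2.1
df    <- 24

2 * pt(abs(t_obs), df, lower.tail = FALSE)  # two-sided:  H_A: mu != mu_0
pt(t_obs, df, lower.tail = FALSE)           # one-sided:  H_A: mu > mu_0
pt(t_obs, df, lower.tail = TRUE)            # one-sided:  H_A: mu < mu_0
```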
The smaller the \(p\)-value, the stronger the evidence against \(H_0\).
Always state conclusions in the context of the research problem.
Question: Do Americans tend to be overweight?
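A sketch of the call that could produce output like the following, assuming nhanes.samp.adult is loaded from the oibiostat package:

```r
# One-sample, one-sided t-test of H_0: mu = 21.7 vs. H_A: mu > 21.7
t.test(nhanes.samp.adult$BMI, mu = 21.7, alternative = "greater")
```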
One Sample t-test
data: nhanes.samp.adult$BMI
t = 11.383, df = 134, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 21.7
95 percent confidence interval:
28.02288 Inf
sample estimates:
mean of x
29.09956
The Kolmogorov-Smirnov (KS) test is a nonparametric test that evaluates the equality of two distributions (n.b., different from testing mean differences).
The empirical cumulative distribution function (eCDF) is \[F_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{(-\infty, x]}(X_i) = \frac{\# \text{ elements} \leq x}{n}\]
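In R, the eCDF is available through ecdf(); a tiny illustration with made-up numbers:

```r
# ecdf() returns a step function F_n; evaluating it gives the proportion of values <= x
x   <- c(2.3, 1.7, 3.1, 2.8, 2.0)
F_n <- ecdf(x)
F_n(2.5)  # 0.6: three of the five observations are <= 2.5
```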
The KS test uses as its test statistic \(D_n = \sup_x \lvert F_n(x) - F(x) \rvert\), where \(F(x)\) is a theoretical (i.e., assumed) CDF.
Applying the KS test evaluates the evidence against the \(H_0\) that BMI (in the NHANES population) arises from a normal distribution.
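A sketch of one way to carry this out (the z-standardization of BMI into bmi_zstd is an assumption here, since that step is not shown):

```r
# Standardize BMI so it can be compared against a standard normal CDF
bmi_zstd <- as.numeric(scale(nhanes.samp.adult$BMI))

# One-sample KS test against N(0, 1)
ks.test(bmi_zstd, "pnorm")
```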
Asymptotic one-sample Kolmogorov-Smirnov test
data: bmi_zstd
D = 0.09895, p-value = 0.1422
alternative hypothesis: two-sided
HST 190: Introduction to Biostatistics