Regression methods examine the association between a response variable and a set of possible predictor variables (covariates).
Linear regression posits an approximately linear relationship between the response and predictor variables.
The response variable can be referred to as the dependent variable, and the predictor variable as the independent variable.
A simple linear regression model takes the form $y = \beta_0 + \beta_1 x + \epsilon$.
Simple linear regression quantifies how the mean of a response variable $Y$ varies with a single predictor $X$, assuming linearity.
Multiple regression
Multiple linear regression evaluates the relationship, assuming linearity, between the mean of a response variable, $Y$, and a vector of predictors, $X = (X_1, \ldots, X_p)$, that is, $\mathbb{E}[Y \mid X] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$.
This is conceptually similar to the simpler case of evaluating $Y$'s mean with respect to a single $X$, except that the interpretation is much more nuanced (more on this later).
Linearity is an approximation; so, think of regression as projection onto a simpler worldview (“if the phenomenon were linear…”).
Back to the PREVEND study
As adults age, cognitive function changes over time, largely due to cerebrovascular and neurodegenerative changes. The PREVEND study measured clinical and demographic data for participants from 1997 to 2006.
Data from 4,095 participants appear in the prevend dataset in the oibiostatR package.
The Ruff Figural Fluency Test (RFFT) was used to assess cognitive function (planning and the ability to switch between different tasks).
Assumptions for linear regression
A few assumptions justify the use of linear regression to describe how the mean of $Y$ varies with $X$.
Independent observations: the pairs $(x_i, y_i)$ are independent; that is, the values of one pair provide no information on those of others.
Linearity: $\mathbb{E}[Y \mid X = x]$ is linear in $x$ and appears justifiably so in the observed data.
Constant variability (homoscedasticity): the variability of the response about the regression line is constant across different values of the predictor $X$.
Approximate normality of residuals: the residuals are approximately normally distributed, i.e., $\epsilon \sim N(0, \sigma^2)$.
What happens when these assumptions do not hold? 🤔
Linear regression via ordinary least squares
The distance between an observed point $(x_i, y_i)$ and the corresponding predicted value $\hat{y}_i$ from the regression line is the residual for the $i$th unit.
For $(x_i, y_i)$, where $\hat{y}_i = b_0 + b_1 x_i$, the residual is $e_i = y_i - \hat{y}_i$.
The least squares regression line minimizes the sum of squared residuals $\sum_{i=1}^{n} e_i^2$ among all candidate lines to get the estimates $(b_0, b_1)$.
The mean squared error (MSE), a metric of prediction quality, is also based on the residuals: $\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} e_i^2$.
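As a quick sketch (assuming the prevend.samp data used below are loaded), residuals and the MSE can be extracted from a fitted lm() object:
# fit a simple linear model of RFFT by age
fit <- lm(RFFT ~ Age, data = prevend.samp)
e <- resid(fit)   # residuals e_i = y_i - yhat_i
mean(e^2)         # mean squared error, (1/n) * sum(e_i^2)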
Coefficients in least squares linear regression
$\beta_0$ and $\beta_1$ are parameters with estimates $b_0$ and $b_1$; the estimates can be calculated from summary statistics as $b_1 = r \, \frac{s_y}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$, where
$\bar{x}$, $\bar{y}$: sample means of $x$ and $y$
$s_x$, $s_y$: sample SDs of $x$ and $y$
$r$: correlation between $x$ and $y$
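As a sketch (assuming prevend.samp is loaded), these formulas can be applied directly and checked against the lm() output below:
# least squares estimates from summary statistics
x <- prevend.samp$Age
y <- prevend.samp$RFFT
b1 <- cor(x, y) * sd(y) / sd(x)   # slope: r * s_y / s_x
b0 <- mean(y) - b1 * mean(x)      # intercept: ybar - b1 * xbar
c(b0, b1)                         # should match the lm() coefficients below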
lm(prevend.samp$RFFT ~ prevend.samp$Age)$coef
(Intercept) prevend.samp$Age
137.549716 -1.261359
The least squares regression line for the association of RFFT and age in the PREVEND data is $\widehat{\text{RFFT}} = 137.55 - 1.26 \times \text{Age}$.
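As a sketch, the fitted line can be used to predict the mean RFFT score at a given age, e.g., age 50:
# predicted mean RFFT at age 50 (assumes prevend.samp is loaded)
fit <- lm(RFFT ~ Age, data = prevend.samp)
predict(fit, newdata = data.frame(Age = 50))   # 137.55 - 1.26 * 50, about 74.5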
Linear regression: The population view
For a population of ordered pairs $(x, y)$, the population regression line is $y = \beta_0 + \beta_1 x + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$.
Since $\mathbb{E}[\epsilon] = 0$, the regression line is also $\mathbb{E}[Y \mid X = x] = \beta_0 + \beta_1 x$, where $\mathbb{E}[Y \mid X = x]$ denotes the expectation of $Y$ when $X = x$.
So, the regression line is a statement about averages: what do we expect the mean of $Y$ to be when $X$ takes on the value $x$? That is, if the mean of $Y$ in fact follows a linear relationship in $X$.
Linear regression: Checking assumptions?
The assumptions of linear regression are independence of study units, linearity (in parameters), constant variability, and normality of residuals.
Independence should be enforced by well-considered study design.
Other assumptions may be checked empirically…but should we?
Residual plots: scatterplots in which predicted values are on the $x$-axis and residuals are on the $y$-axis
Normal probability plots: theoretical quantiles of a normal distribution versus observed quantiles (of the residuals)
…whether these hold (or not), the claim made is that linearity is a faithful representation of the underlying phenomenon. But is it?
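As a sketch (assuming the prevend.samp data are loaded), both diagnostic plots can be produced from a fitted lm() object:
# residual plot: predicted values vs. residuals
fit <- lm(RFFT ~ Age, data = prevend.samp)
plot(fitted(fit), resid(fit), xlab = "Predicted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# normal probability (Q-Q) plot of the residuals
qqnorm(resid(fit))
qqline(resid(fit))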
Linear regression with categorical predictors
Although the response variable in linear regression is necessarily numerical, the predictor may be either numerical or categorical.
Simple linear regression only accommodates categorical predictor variables with two levels.
Simple linear regression with a two-level categorical predictor compares the means of the two groups defined by $X$: $y = \beta_0 + \beta_1 x + \epsilon$, where $x \in \{0, 1\}$ encodes group membership.
Back to FAMuSS: Comparing ndrm.ch by sex
Let’s re-examine the association between change in non-dominant arm strength after resistance training and sex in the FAMuSS data.
# calculate mean ndrm.ch in each group
tapply(famuss$ndrm.ch, famuss$sex, mean)
Female Male
62.92720 39.23512
# fit a linear model of ndrm.ch by sex
lm(famuss$ndrm.ch ~ famuss$sex)$coef
(Intercept) famuss$sexMale
62.92720 -23.69207
Intercept $b_0 = 62.93$: mean of ndrm.ch in the baseline group (Female)
Slope $b_1 = -23.69$: difference of group means (Male minus Female: $39.24 - 62.93$)
Strength of a regression fit: Using $R^2$
The correlation coefficient $r$ measures the strength of a linear relationship between $X$ and $Y$; $r = \frac{\text{Cov}(X, Y)}{s_x s_y}$ is the covariance normalized by the SDs.
$R^2$ (or $r^2$) is more common as a measure of the strength of a linear fit, since $R^2$ describes the amount of variation in the response explained by the regression fit.
If a linear regression fit perfectly captured the variability in the observed data, then $\hat{y}_i$ would equal $y_i$ and $R^2$ would be 1 (its maximum).
The variability of the residuals about the regression line represents the variability remaining after the fit; $1 - R^2$ is the variability unexplained by the regression fit.
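As a sketch, $R^2$ is reported by summary(), and in simple linear regression it equals the squared correlation:
# R^2 from the fitted model (assumes prevend.samp is loaded)
fit <- lm(RFFT ~ Age, data = prevend.samp)
summary(fit)$r.squared
cor(prevend.samp$RFFT, prevend.samp$Age)^2   # identical in simple linear regression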
Statistical inference in regression
Assume the observed data to have been randomly sampled from a population where the explanatory variable $X$ and response variable $Y$ are related as $Y = \beta_0 + \beta_1 X + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$.
Under this assumption, the slope and intercept of the fitted regression line are estimates of the parameters $\beta_1$ and $\beta_0$.
Goal: Inference for the slope $\beta_1$, the association of $Y$ with $X$.
Hypothesis testing in regression
The null hypothesis is most often about no association:
$H_0: \beta_1 = 0$, that is, $X$ and $Y$ are not associated
$H_A: \beta_1 \neq 0$, that is, $X$ and $Y$ are associated
Use the $t$-test? Recall the $t$-statistic (here, with $df = n - 2$): $t = \dfrac{b_1 - \beta_1^{(0)}}{\text{SE}(b_1)}$, where $\beta_1^{(0)} = 0$ under the null hypothesis of no association.
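As a sketch, the coefficient table from summary() reports these $t$-statistics and their p-values:
# t-statistics and p-values for the fitted coefficients
fit <- lm(RFFT ~ Age, data = prevend.samp)
summary(fit)$coef   # columns: Estimate, Std. Error, t value, Pr(>|t|)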
Confidence intervals in regression
A $(1 - \alpha) \times 100\%$ confidence interval for $\beta_1$ is $b_1 \pm t^\star \times \text{SE}(b_1)$, where $t^\star$ is the appropriate quantile of a $t$-distribution with $n - 2$ degrees of freedom, and the standard error of the estimator is $\text{SE}(b_1) = \sqrt{\dfrac{\hat{\sigma}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$.
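In R, confint() computes these intervals directly from a fitted model (a sketch; the 95% level is the default):
# 95% confidence intervals for the regression coefficients
fit <- lm(RFFT ~ Age, data = prevend.samp)
confint(fit, level = 0.95)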
Example: Linear Regression in an RCT
Consider a randomized controlled trial (RCT) that recruits $n$ patients, randomly assigns each to drug or placebo, and monitors them until the trial's end, at which point an outcome $Y$ (e.g., a disease severity score) is measured.
$L_1$   $L_2$   $X$   $Y$
1       1       0     1.39
0       0       0     -0.18
0       0       0     -0.20
1       1       1     3.61
The treatment is randomized, so it will be balanced on the baseline factors (i.e., confounders $L_1$, $L_2$) on average. Linear regression, $Y = \beta_0 + \beta_1 X + \epsilon$, gives the difference of means $\mathbb{E}[Y \mid X = 1] - \mathbb{E}[Y \mid X = 0]$, which is $\beta_1$.
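A minimal sketch checks this numerically with the four toy rows shown above:
# toy data from the table above (illustrative values only)
dat <- data.frame(L1 = c(1, 0, 0, 1), L2 = c(1, 0, 0, 1),
                  X  = c(0, 0, 0, 1), Y  = c(1.39, -0.18, -0.20, 3.61))
coef(lm(Y ~ X, data = dat))["X"]                    # slope estimate b1
mean(dat$Y[dat$X == 1]) - mean(dat$Y[dat$X == 0])   # same difference of means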
Statistical power and sample size
Question: A collaborator approaches you about this hypothetical RCT, wondering whether the planned sample size $n$ is sufficient. Is it?
In terms of hypothesis testing: $H_0: \beta_1 = 0$ and $H_A: \beta_1 \neq 0$ (the drug "works"). How large does $n$ have to be for the effect to be detectable? Is the planned $n$ enough?
The power of a statistical test is the probability that the test (correctly) rejects the null hypothesis when the alternative hypothesis is true. Power depends on…
the hypothesized effect size ($\beta_1$ in an RCT, but not generally)
the variance of each of the two groups (i.e., $\sigma_1^2$, $\sigma_0^2$ for treatment, placebo)
the sample sizes of each of the two groups ($n_1$, $n_0$)
Outcomes and errors in testing
                     Result of test
State of nature      Reject $H_0$                               Fail to reject $H_0$
$H_0$ is true        Type I error, $\alpha$ (false positive)    No error, $1 - \alpha$ (true negative)
$H_A$ is true        No error, $1 - \beta$ (true positive)      Type II error, $\beta$ (false negative)
Choosing the right sample size
Study design includes calculating a study size (sample size) such that the probability of rejecting $H_0$ (when $H_A$ is true) is acceptably large, typically 80%-90%.
It is important to have a precise estimate of an appropriate study size, since
a study needs to be large enough to allow for sufficient power to detect a difference between groups when one exists, but
not so unnecessarily large that it is cost-prohibitive or unethical.
Often, simulation is a quick and feasible way to conduct a power analysis.
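A minimal sketch of such a simulation, under assumed (hypothetical) values of the effect size, outcome SD, and sample size:
# simulation-based power for a two-arm comparison via lm()
set.seed(1)
power_sim <- function(n, delta, sigma, n_sims = 1000, alpha = 0.05) {
  rejections <- replicate(n_sims, {
    x <- rep(c(0, 1), each = n / 2)           # 1:1 assignment to the two arms
    y <- delta * x + rnorm(n, sd = sigma)     # outcomes under the alternative
    summary(lm(y ~ x))$coef["x", "Pr(>|t|)"] < alpha
  })
  mean(rejections)   # proportion of simulated trials rejecting H0
}
power_sim(n = 100, delta = 0.5, sigma = 1)   # hypothetical inputs, not from the slides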
Multiple regression
In most practical settings, more than one explanatory variable is likely to be associated with a response.
Multiple linear regression evaluates the relationship between a response $Y$ and several (say, $p$) predictors $X_1, \ldots, X_p$.
Multiple linear regression takes the form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$.
PREVEND: Statin use and cognitive function
The PREVEND study collected data on statin use and demographic factors.
Statins are a class of drugs widely used to lower cholesterol.
Recent guidelines for prescribing statins suggest statin use for almost half of Americans 40-75 years old, as well as nearly all men over 60.
A few small (low $n$) studies have found evidence of a negative association between statin use and cognitive ability.
Age, statin use, and RFFT score
[Scatterplot of RFFT score versus age: red dots represent statin users; blue dots represent non-users.]
Multiple (linear) regression takes the form $\mathbb{E}[Y \mid X_1, \ldots, X_p] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, where $p$ is the number of predictors (or covariates).
Recall the correspondence of the regression equation with the conditional mean $\mathbb{E}[Y \mid X_1, \ldots, X_p]$.
The coefficient $\beta_j$ of $X_j$: the predicted difference in the mean of $Y$ for groups that differ by one unit in $X_j$ and for whom all other predictors take on the same value.
Practically, a coefficient $\beta_j$ in multiple regression is the association between the response $Y$ and predictor $X_j$, after adjusting for the other predictors $X_k$, $k \neq j$.
In action: RFFT vs. statin use and age
Fit the multiple regression with lm():
# fit the linear model
prevend_multreg <- lm(RFFT ~ Statin + Age, data = prevend.samp)
prevend_multreg
Call:
lm(formula = RFFT ~ Statin + Age, data = prevend.samp)
Coefficients:
(Intercept) StatinUser Age
137.8822 0.8509 -1.2710
The coefficient of StatinUser is 0.851, so it seems statin use is associated with a positive difference in RFFT, when adjusting for age (i.e., for study units in the same age range).
# print the model summary for the coefficients
summary(prevend_multreg)$coef |> round(3)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 137.882 5.122 26.919 0.000
StatinUser 0.851 2.596 0.328 0.743
Age -1.271 0.094 -13.478 0.000
Assumptions for multiple regression
Analogous to those of simple linear regression…
Independence: the study units $i = 1, \ldots, n$ are independent.
Linearity: for each predictor variable , difference in the predictor is linearly related to difference in the response variable in groups defined by all other predictors taking on the same value.
Constant variability: the residuals $\epsilon_i$ have approximately constant variance.
Normality of residuals: the residuals $\epsilon_i$ are approximately normally distributed.
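As a sketch, base R's plot() method for lm objects produces the standard diagnostic plots for checking these assumptions:
# standard diagnostic plots for the multiple regression fit
prevend_multreg <- lm(RFFT ~ Statin + Age, data = prevend.samp)
par(mfrow = c(2, 2))    # arrange the four plots in a grid
plot(prevend_multreg)   # residuals vs. fitted, Q-Q, scale-location, leverage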
References
Tsiatis, Anastasios A., Marie Davidian, Min Zhang, and Xiaomin Lu. 2008. "Covariate Adjustment for Two-Sample Treatment Comparisons in Randomized Clinical Trials: A Principled yet Flexible Approach." Statistics in Medicine 27 (23): 4658–77. https://doi.org/10.1002/sim.3113.
Introduction to Linear Regression
Nima Hejazi (nhejazi@hsph.harvard.edu)
Harvard Biostatistics
September 3, 2025