Introduction to Linear Regression

Nima Hejazi
nhejazi@hsph.harvard.edu

Harvard Biostatistics

September 3, 2025

Regression

Regression methods examine the association between a response variable and a set of possible predictor variables (covariates).

  • Linear regression posits an approximately linear relationship between the response and predictor variables.
  • The response variable y can be referred to as the dependent variable, and the predictor variable x as the independent variable.
  • A simple linear regression model takes the form: Y=β0+β1X+ϵ

Simple linear regression quantifies how the mean of a response variable Y varies with a single predictor X, assuming linearity.
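
As a minimal sketch (using simulated data, not the PREVEND data introduced later), one can generate data from such a model and recover the coefficients with lm():

# simulate n = 100 observations from Y = 2 + 0.5 X + eps, with eps ~ N(0, 1)
set.seed(42)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)

# fit the simple linear regression and inspect the estimated coefficients
lm(y ~ x)$coef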

Multiple regression

Multiple linear regression evaluates the relationship, assuming linearity, between the mean of a response variable, Y, and a vector of predictors, X1,X2,…,Xp, that is, Y=β0+β1X1+β2X2+⋯+βpXp+ϵ .

This is conceptually similar to the simpler case of evaluating Y’s mean with respect to a single X, except that the interpretation is much more nuanced (more on this later).

Linearity is an approximation; so, think of regression as projection onto a simpler worldview (“if the phenomenon were linear…”).

Back to the PREVEND study

As adults age, cognitive function changes over time, largely due to cerebrovascular and neurodegenerative changes. The PREVEND study measured clinical and demographic data for participants from 1997 to 2006.

  • Data from 4,095 participants appear in the prevend dataset in the oibiostat R package.
  • The Ruff Figural Fluency Test (RFFT) was used to assess cognitive function (planning and the ability to switch between different tasks).

Assumptions for linear regression

A few assumptions are needed to justify the use of linear regression to describe how the mean of Y varies with X.

  1. Independent observations: (x,y) pairs are independent; that is, values of one pair provide no information on those of others.
  2. Linearity: E[Y∣X]=f(X) is linear and appears justifiably so in the observed data.
  3. Constant variability (homoscedasticity): variability of response Y about the regression line is constant across different values of predictor X.
  4. Approximate normality of residuals: ϵ∼N(0,σϵ2)

What happens when these assumptions do not hold? 🤔

Linear regression via ordinary least squares

The distance between an observed value yi and the corresponding predicted value y^i from the regression line is the residual for the ith unit.

For (xi,yi), where y^i=β^0+β^1xi, the residual ei is ei=yi−y^i.

The least squares regression line minimizes the sum of squared residuals¹, $\sum_{i=1}^{n} e_i^2$, over the observed pairs (xi,yi) to obtain the estimates (β^0,β^1).

The mean squared error (MSE), a metric of prediction quality, is also based on the residuals: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} e_i^2$.

  1. In other words, the least squares line is the line with coefficients β0 and β1 such that the quantity e12+e22+⋯+en2 is minimized, where n is the number of data points.
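
A small sketch of these quantities in R, assuming the prevend.samp data frame used later in these slides has been loaded (e.g., from the oibiostat package):

# fit the simple linear regression of RFFT score on age
fit <- lm(RFFT ~ Age, data = prevend.samp)

# residuals e_i = y_i - yhat_i
e <- residuals(fit)

# sum of squared residuals, the quantity minimized by least squares
sum(e^2)

# mean squared error, MSE = (1/n) * sum(e_i^2)
mean(e^2)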

Coefficients in least squares linear regression

β0 and β1 are parameters with estimates β^0 and β^1; estimates can be calculated from summary statistics:

$\hat\beta_1 = r\,\frac{s_y}{s_x}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1\,\bar{x}$

  • x―, y―: sample means of x and y
  • sx, sy: sample SD’s of x and y
  • r: correlation between x and y
lm(prevend.samp$RFFT ~ prevend.samp$Age)$coef
     (Intercept) prevend.samp$Age 
      137.549716        -1.261359 

The least squares regression line Y^=β^0+β^1X for the association of RFFT and age in the PREVEND data is RFFT^=137.55−(1.26)(Age)
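
These summary-statistic formulas can be checked numerically against lm() (a sketch, again assuming prevend.samp is loaded):

x <- prevend.samp$Age
y <- prevend.samp$RFFT

# slope from the correlation and SDs: beta1-hat = r * s_y / s_x
b1 <- cor(x, y) * sd(y) / sd(x)

# intercept from the sample means: beta0-hat = ybar - beta1-hat * xbar
b0 <- mean(y) - b1 * mean(x)

c(intercept = b0, slope = b1)   # matches lm(y ~ x)$coef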

Linear regression: The population view

For a population of ordered pairs (x,y), the population regression line is Y=β0+β1X+ϵ, where ϵ∼N(0,σϵ2).¹

Since E[ϵ]=0, the regression line is also E[Y∣x]=β0+β1x, where E[Y∣x] denotes the expectation of Y when X=x.

So, the regression line is a statement about averages: what do we expect the mean of Y to be when X takes on the value x, assuming the mean of Y does, in fact, follow a linear relationship determined by (β0,β1)?

  1. The error term ϵ can be thought of as a population parameter for the residuals e.

Linear regression: Checking assumptions?

Assumptions of linear regression are independence of study units, linearity (in parameters), constant variability, normality of residuals.

  • Independence should be enforced by well-considered study design.
  • Other assumptions may be checked empirically…but should we?
    • Residual plots: scatterplots in which predicted values are on the x-axis and residuals are on the y-axis
    • Normal probability plots: theoretical quantiles for a normal versus observed quantiles (of residuals)
    …whether these hold (or not), the claim made is that linearity is a faithful representation of the underlying phenomenon. But is it? (Both diagnostics are sketched in code below.)
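
A minimal sketch of both diagnostic plots for the simple regression of RFFT on age (assuming prevend.samp is loaded):

fit <- lm(RFFT ~ Age, data = prevend.samp)

# residual plot: predicted values on the x-axis, residuals on the y-axis
plot(fitted(fit), residuals(fit),
     xlab = "Predicted RFFT", ylab = "Residual")
abline(h = 0, lty = 2)

# normal probability (Q-Q) plot of the residuals
qqnorm(residuals(fit))
qqline(residuals(fit))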

Linear regression with categorical predictors

Although the response variable in linear regression is necessarily numerical, the predictor may be either numerical or categorical.

Simple linear regression only accommodates categorical predictor variables with two levels1.

Simple linear regression with a two-level categorical predictor X compares the means of the two groups defined by X: E[Y∣X=1]−E[Y∣X=0]=β0+β1−β0=β1

  1. Examining categorical predictors with more than two levels requires the use of multiple linear regression, since each level must be encoded separately.

Back to FAMuSS: Comparing ndrm.ch by sex

Let’s re-examine the association between change in non-dominant arm strength after resistance training and sex in the FAMuSS data.

# calculate mean ndrm.ch in each group
tapply(famuss$ndrm.ch, famuss$sex, mean)
  Female     Male 
62.92720 39.23512 
# fit a linear model of ndrm.ch by sex
lm(famuss$ndrm.ch ~ famuss$sex)$coef
   (Intercept) famuss$sexMale 
      62.92720      -23.69207 

ndrm.ch^=62.93−23.69(sex = male)

  • Intercept β0: mean in baseline group
  • Slope β1: difference of group means

Strength of a regression fit: Using R2

  • Correlation coefficient r measures the strength of a linear relationship between (X,Y); rX,Y=Cov(X,Y)/(σXσY) is the covariance normalized by the SDs
  • r2 (or R2) more common as a measure of the strength of a linear fit, since R2 describes amount of variation in response Y explained by the regression fit

$R^2 = \frac{\mathrm{Var}(\hat{y}_i)}{\mathrm{Var}(y_i)} = \frac{\mathrm{Var}(y_i) - \mathrm{Var}(e_i)}{\mathrm{Var}(y_i)} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$

  • If a linear regression fit perfectly captured the variability in the observed data, then Var(y^i) would equal Var(yi) and R2 would be 1 (its maximum).
  • The variability of the residuals about the regression line represents the variability remaining after the fit; Var(ei) is the variability unexplained by the regression fit.
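
A quick numerical check of these identities, using the same RFFT-on-age fit (a sketch, assuming prevend.samp is loaded):

fit <- lm(RFFT ~ Age, data = prevend.samp)
y <- prevend.samp$RFFT

# R^2 as the share of variance explained by the fitted values
var(fitted(fit)) / var(y)

# R^2 as one minus the ratio of residual to total sum of squares
1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)

# matches the value reported by summary()
summary(fit)$r.squared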

Statistical inference in regression

Assume the observed data (xi,yi) to have been randomly sampled from a population in which the explanatory variable X and the response variable Y are related as Y=β0+β1X+ϵ, where ϵ∼N(0,σϵ2).

Under this assumption, the slope β^1 and intercept β^0 of the fitted regression line are estimates of the parameters β1 and β0, respectively.

Goal: Inference for the slope β1, the association1 of X with Y.

  1. Note that β1=Cov(X,Y)/Var(X), so the association measures how large the covariance of X and Y is relative to the variance of X.

Hypothesis testing in regression

The null hypothesis H0 is most often about no association:

  • H0:β1=0, that is, X and Y are not associated
  • HA:β1≠0, that is, X and Y variables are associated

Use the t-test? Recall the t-statistic (here, with df=n−2): $t = \frac{\hat\beta_1 - \beta_{1,H_0}}{SE(\hat\beta_1)} = \frac{\hat\beta_1}{SE(\hat\beta_1)}$, where β1,H0=0 under the null hypothesis of no association.
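
The t-statistic and its two-sided p-value can be recomputed by hand from the fit (a sketch, assuming prevend.samp is loaded):

fit <- lm(RFFT ~ Age, data = prevend.samp)
coefs <- summary(fit)$coef

# t = (beta1-hat - 0) / SE(beta1-hat)
tstat <- coefs["Age", "Estimate"] / coefs["Age", "Std. Error"]

# two-sided p-value from a t-distribution with n - 2 degrees of freedom
2 * pt(abs(tstat), df = df.residual(fit), lower.tail = FALSE)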

Confidence intervals in regression

A (1−α)·100% confidence interval for β1 is $\hat\beta_1 \pm t^\star \times SE(\hat\beta_1)$, where $t^\star$ is the appropriate quantile of a t-distribution with n−2 degrees of freedom, and the standard error¹ of the estimator β^1 is $SE(\hat\beta_1) = \sqrt{\frac{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$

  1. The estimator β^1 is asymptotically normal: $\sqrt{n}\,(\hat\beta_1 - \beta_1) \overset{D}{\to} N\!\left(0, \frac{\sigma_\epsilon^2}{\mathrm{Var}(X)}\right)$
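
In R, confint() returns this interval directly; it can also be assembled by hand (a sketch, assuming prevend.samp is loaded):

fit <- lm(RFFT ~ Age, data = prevend.samp)

# built-in 95% confidence interval for the slope on Age
confint(fit, "Age", level = 0.95)

# by hand: estimate +/- t-star * SE, with t-star the 0.975 quantile of t_{n-2}
est   <- coef(fit)["Age"]
se    <- summary(fit)$coef["Age", "Std. Error"]
tstar <- qt(0.975, df = df.residual(fit))
c(est - tstar * se, est + tstar * se)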

Example: Linear Regression in an RCT

Consider a randomized controlled trial (RCT) that recruits n=200 patients, assigns (X) each to drug or placebo, and monitors them until the trial’s end, at which point an outcome Y (e.g., a disease severity score) is measured.

  L1  L2  X      Y
   1   1  0   1.39
   0   0  0  -0.18
   0   0  0  -0.20
   1   1  1   3.61

The treatment X is randomized, so it will be balanced on the baseline factors (i.e., confounders L1, L2) on average. Linear regression, E[Y∣X]=β0+β1X, gives E[Y∣X=1]=β0+β1 and E[Y∣X=0]=β0, the difference of which is E[Y∣X=1]−E[Y∣X=0]=β1.¹

  1. In an RCT, the parameter β1 matches the average treatment effect (ATE) (Tsiatis et al. 2008)
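
A minimal simulation of such a trial; the data-generating values (covariate effects of 0.5, a treatment effect of 1, standard normal noise) are illustrative assumptions, not taken from the source:

set.seed(1)
n <- 200

# baseline covariates and a randomized treatment assignment
L1 <- rbinom(n, 1, 0.5)
L2 <- rbinom(n, 1, 0.5)
X  <- rbinom(n, 1, 0.5)   # randomization: X independent of L1 and L2

# outcome with a true treatment effect of 1
Y <- 0.5 * L1 + 0.5 * L2 + 1 * X + rnorm(n)

# the coefficient on X estimates the ATE (here, the truth is 1)
lm(Y ~ X)$coef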

Statistical power and sample size

Question: A collaborator approaches you about this hypothetical RCT, wondering whether n=200 is a sufficient sample size — is it?

In terms of hypothesis testing: H0:β1=0 and HA:β1>0 (the drug “works”). How large does β1 have to be for it to be detectable? Is n=200 enough?

The power of a statistical test is the probability that the test (correctly) rejects the null hypothesis H0 when the alternative hypothesis HA is true. Power depends on…

  • the hypothesized effect size (β1 in an RCT, though not in general)
  • the variance of each of the two groups (i.e., σ1, σ2 for treatment, placebo)
  • the sample sizes of each of the two groups (n1, n2)

Outcomes and errors in testing

                      Result of test
  State of nature     Reject H0                               Fail to reject H0
  H0 is true          Type I error, P = α (false positive)    No error, P = 1 − α (true negative)
  HA is true          No error, P = 1 − β (true positive)     Type II error, P = β (false negative)

Choosing the right sample size

Study design includes calculating a study size (sample size) such that the probability of rejecting H0 when the alternative holds (i.e., the power) is acceptably large, typically 80% to 90%.

It is important to have a precise estimate of an appropriate study size, since

  • a study needs to be large enough to allow for sufficient power to detect a difference between groups when one exists, but
  • not so unnecessarily large that it is cost-prohibitive or unethical.

Often, simulation is a quick and feasible way to conduct a power analysis.
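
For example, a simulation-based power calculation for a two-arm trial analyzed with simple linear regression; the effect size, outcome SD, and significance level below are illustrative assumptions:

# estimate power by simulating many trials and recording how often H0 is rejected
power_sim <- function(n, beta1, sd = 1, alpha = 0.05, nsim = 1000) {
  rejections <- replicate(nsim, {
    X <- rbinom(n, 1, 0.5)                    # randomized 1:1 assignment
    Y <- beta1 * X + rnorm(n, sd = sd)        # outcome under the alternative
    summary(lm(Y ~ X))$coef["X", "Pr(>|t|)"] < alpha
  })
  mean(rejections)                            # proportion of rejections = power
}

set.seed(1)
power_sim(n = 200, beta1 = 0.4)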

Multiple regression

In most practical settings, more than one explanatory variable is likely to be associated with a response.

Multiple linear regression evaluates the relationship between a response Y and several (say, p) predictors X1,X2,…,Xp.

Multiple linear regression takes the form Y=β0+β1X1+β2X2+⋯+βpXp+ϵ

There are several applications of multiple regression, including

  • Estimating an association between a response variable and primary predictor of interest while adjusting for possible confounding variables
  • Constructing a model that effectively explains the observed variation in the response variable

PREVEND: Statin use and cognitive function

The PREVEND study collected data on statin use and demographic factors.

  • Statins are a class of drugs widely used to lower cholesterol.
  • Recent guidelines for prescribing statins suggest statin use for almost half of Americans 40-75 years old, as well as nearly all men over 60.
  • A few small (low n) studies have found evidence of a negative association between statin use and cognitive ability.

Age, statin use, and RFFT score

[Figure: red dots represent statin users; blue dots represent non-users.]


Call:
lm(formula = RFFT ~ Statin, data = prevend.samp)

Coefficients:
(Intercept)   StatinUser  
      70.71       -10.05  

Interpretation of regression coefficients

Multiple (linear) regression takes the form Y=β0+β1X1+β2X2+⋯+βpXp+ϵ, where p is the number of predictors (or covariates).

Recall the correspondence of the regression equation with E[Y∣X1,⋯,Xp]=β0+β1X1+β2X2+⋯+βpXp.

The coefficient βj of Xj: the predicted difference in the mean of Y for groups that differ by one unit in Xj and for whom all other predictors take on the same value.

Practically, a coefficient βj in multiple regression is the association between the response Y and predictor Xj, after adjusting for the other predictors {Xi : i=1,⋯,p, i≠j}.

In action: RFFT vs. statin use and age

Fit the multiple regression with lm():

# fit the linear model
prevend_multreg <- lm(RFFT ~ Statin + Age, data = prevend.samp)
prevend_multreg

Call:
lm(formula = RFFT ~ Statin + Age, data = prevend.samp)

Coefficients:
(Intercept)   StatinUser          Age  
   137.8822       0.8509      -1.2710  

βstatin = 0.851, so it seems statin use is associated with a positive difference in RFFT when adjusting for age (i.e., among study units of the same age).

# print the model summary for the coefficients
summary(prevend_multreg)$coef |> round(3)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  137.882      5.122  26.919    0.000
StatinUser     0.851      2.596   0.328    0.743
Age           -1.271      0.094 -13.478    0.000

Assumptions for multiple regression

Analogous to those of simple linear regression…

  1. Independence: the units (yi,x1,i,x2,i,…,xp,i) are independent across i.
  2. Linearity: for each predictor xj, a difference in the predictor is linearly related to a difference in the mean response when all other predictors are held at the same value.
  3. Constant variability: the errors ϵ have approximately constant variance.
  4. Normality of residuals: the errors ϵ are approximately normally distributed.
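
As in the simple regression case, residual diagnostics give a quick (if imperfect) look at assumptions 2 through 4; a sketch using the fit from the previous slides:

prevend_multreg <- lm(RFFT ~ Statin + Age, data = prevend.samp)

# residuals vs. fitted values, and a normal Q-Q plot of the residuals
plot(prevend_multreg, which = 1:2)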

References

Tsiatis, Anastasios A, Marie Davidian, Min Zhang, and Xiaomin Lu. 2008. “Covariate Adjustment for Two-Sample Treatment Comparisons in Randomized Clinical Trials: A Principled yet Flexible Approach.” Statistics in Medicine 27 (23): 4658–77. https://doi.org/10.1002/sim.3113.
