August 16, 2025
With a binary predictor, linear regression estimates the difference in the mean response between groups defined by the levels.
For example, the estimated equation for the linear regression of RFFT score on statin use, fit with the data prevend.samp, has the form $\widehat{RFFT} = b_0 + b_1 \times Statin$, where $b_1$ is the estimated difference in mean RFFT score between statin users and non-users.
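A minimal sketch of this fit, assuming prevend.samp from the oibiostat package is available (fit.statin is an illustrative name):

```r
# Regression of RFFT on a binary statin-use indicator (sketch)
library(oibiostat)
data("prevend.samp")
fit.statin <- lm(RFFT ~ Statin, data = prevend.samp)
coef(fit.statin)  # intercept: mean RFFT among non-users; slope: difference in group means
```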
For example: Is RFFT score associated with education?
The variable Education indicates the highest level of education an individual completed in the Dutch educational system, with four levels: Primary, LowerSecond, HigherSecond, and Univ.
The baseline category represents individuals who at most completed primary school (Education = 0, labeled Primary); the intercept is the sample mean RFFT score for this group. Each remaining coefficient indicates the difference in estimated average RFFT relative to the baseline group, e.g., a predicted increase of 44.96 points for Univ:
Coefficient | Estimate
---|---
(Intercept) | 40.94118
EducationLowerSecond | 14.77857
EducationHigherSecond | 32.13345
EducationUniv | 44.96389
Education level | Estimated mean RFFT
---|---
Primary | 40.94118
LowerSecond | 55.71975
HigherSecond | 73.07463
Univ | 85.90506
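The second table follows from the first by adding each coefficient to the intercept. A sketch, assuming Education has already been converted to a factor with the labels shown:

```r
# Recover the four group means from the dummy-coded coefficients (sketch)
fit.edu <- lm(RFFT ~ Education, data = prevend.samp)
b <- coef(fit.edu)
c(Primary      = unname(b[1]),
  LowerSecond  = unname(b[1] + b["EducationLowerSecond"]),
  HigherSecond = unname(b[1] + b["EducationHigherSecond"]),
  Univ         = unname(b[1] + b["EducationUniv"]))
```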
The coefficients of multiple regression estimate the association between the response and each predictor while holding the other predictors in the model constant.
Inference is usually about slope parameters. Typically, the hypotheses of interest are

$$H_0: \beta_j = 0 \quad \text{versus} \quad H_A: \beta_j \neq 0.$$

The test statistic is

$$t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)},$$

which follows a $t$ distribution with $n - k - 1$ degrees of freedom under $H_0$, where $k$ is the number of slope coefficients. Then, a 95% confidence interval for the slope is

$$\hat{\beta}_j \pm t^{*} \times SE(\hat{\beta}_j),$$

where $t^{*}$ is the point on that $t$ distribution with area 0.975 to its left. The $p$-value is the probability, under $H_0$, of observing a test statistic at least as extreme as the one computed. There is sufficient evidence to reject $H_0$ when the $p$-value is smaller than the significance level $\alpha$.
summary(lm()) reports the estimate, standard error, $t$ statistic, and two-sided $p$-value for each coefficient:
Call:
lm(formula = RFFT ~ Education, data = prevend.samp)
Residuals:
Min 1Q Median 3Q Max
-55.905 -15.975 -0.905 16.068 63.280
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.941 3.203 12.783 < 2e-16 ***
EducationLowerSecond 14.779 3.686 4.009 7.04e-05 ***
EducationHigherSecond 32.133 3.763 8.539 < 2e-16 ***
EducationUniv 44.964 3.684 12.207 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.87 on 496 degrees of freedom
Multiple R-squared: 0.3072, Adjusted R-squared: 0.303
F-statistic: 73.3 on 3 and 496 DF, p-value: < 2.2e-16
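The confidence intervals described above can be computed directly from the fitted object; a sketch using the fit.edu object defined earlier:

```r
# 95% confidence intervals for the Education coefficients (sketch)
confint(fit.edu, level = 0.95)
```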
# fit model adjusting for age, edu, cvd
model1 <- lm(RFFT ~ Statin + Age + Education + CVD,
data = prevend.samp)
summary(model1)
Call:
lm(formula = RFFT ~ Statin + Age + Education + CVD, data = prevend.samp)
Residuals:
Min 1Q Median 3Q Max
-56.348 -15.586 -0.136 13.795 63.935
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.03507 6.33012 15.645 < 2e-16 ***
Statin 4.69045 2.44802 1.916 0.05594 .
Age -0.92029 0.09041 -10.179 < 2e-16 ***
EducationLowerSecond 10.08831 3.37556 2.989 0.00294 **
EducationHigherSecond 21.30146 3.57768 5.954 4.98e-09 ***
EducationUniv 33.12464 3.54710 9.339 < 2e-16 ***
CVDPresent -7.56655 3.65164 -2.072 0.03878 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 20.71 on 493 degrees of freedom
Multiple R-squared: 0.4355, Adjusted R-squared: 0.4286
F-statistic: 63.38 on 6 and 493 DF, p-value: < 2.2e-16
When performing multiple regression, the standard model assumes that each predictor has the same association with the response regardless of the values of the other predictors.
A statistical interaction describes the case when the effect of an explanatory variable on the response depends on the value of another explanatory variable.
For reference, the coefficients from the additive model lm(TotChol ~ Age + Diabetes, data = nhanes.samp.adult.500), which forces a common age slope for both groups, are:
(Intercept)         Age DiabetesYes
4.800011340 0.007491805 -0.317665963
What if two distinct regressions of total cholesterol on age were fit, one for each diabetes status?
Consider the model

$$E(TotChol) = \beta_0 + \beta_1 Age + \beta_2 DiabetesYes + \beta_3 (Age \times DiabetesYes).$$

The term $\beta_3 (Age \times DiabetesYes)$ is the interaction term; it allows the slope of age to differ between individuals with and without diabetes.
Call:
lm(formula = TotChol ~ Age * Diabetes, data = nhanes.samp.adult.500)
Residuals:
Min 1Q Median 3Q Max
-2.3587 -0.7448 -0.0845 0.6307 4.2480
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.695703 0.159691 29.405 < 2e-16 ***
Age 0.009638 0.003108 3.101 0.00205 **
DiabetesYes 1.718704 0.763905 2.250 0.02492 *
Age:DiabetesYes -0.033452 0.012272 -2.726 0.00665 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.061 on 469 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.03229, Adjusted R-squared: 0.0261
F-statistic: 5.216 on 3 and 469 DF, p-value: 0.001498
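The two group-specific slopes can be read off the fit; a sketch, assuming nhanes.samp.adult.500 is loaded (it ships with the oibiostat package) and using fit.int as an illustrative name:

```r
# Age slope by diabetes status from the interaction model (sketch)
fit.int <- lm(TotChol ~ Age * Diabetes, data = nhanes.samp.adult.500)
b <- coef(fit.int)
b["Age"]                         # slope for those without diabetes:  0.0096
b["Age"] + b["Age:DiabetesYes"]  # slope for those with diabetes:    -0.0238
```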
Interpretation: For those with diabetes, increasing age has a negative (not positive!) association with total cholesterol (estimated slope $0.0096 - 0.0335 = -0.0238$ per year of age).
As in simple regression, $R^2$ measures the proportion of variability in the response explained by the model.
As variables are added, $R^2$ can only increase, even when the added variables have little association with the response.
In the summary(lm()) output, Multiple R-squared is the unadjusted $R^2$.
Adjusted R-squared penalizes for the number of slope coefficients $k$:

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1},$$

and is the preferred measure when comparing models with different numbers of predictors.
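To make the penalty concrete, here is the adjusted $R^2$ computation for model1 above ($n = 500$ observations, $k = 6$ slope coefficients, so 493 residual degrees of freedom):

```r
# Adjusted R^2 from multiple R^2 for model1 (sketch)
n <- 500; k <- 6; R2 <- 0.4355
1 - (1 - R2) * (n - 1) / (n - k - 1)  # 0.4286, matching the summary output
```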
Logistic regression is a generalization of methods for two-way tables, allowing for joint association between a binary response and several predictors.
It is similar in intent to linear regression, but the details differ a bit…
Patients admitted to ICUs are very ill — usually from a serious medical event (e.g., respiratory failure, stroke) or from trauma (e.g., traffic accident).
Question: Can patient features available at admission be used to estimate the probability of survival to hospital discharge?
The icu dataset in the aplore3 package is from a study conducted at Baystate Medical Center in Springfield, MA. The sta variable includes labels Died (0, for death before discharge) and Lived (1, for survival to discharge). Let's take a look at how ICU patients fare with CPR:
Survival | No prior CPR | Prior CPR | Sum
---|---|---|---
Died | 33 | 7 | 40
Lived | 154 | 6 | 160
Sum | 187 | 13 | 200
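This table can be reproduced in R; a minimal sketch, assuming the aplore3 package is installed (row order may differ depending on the factor's level ordering):

```r
# Cross-tabulate survival status by prior CPR, with margins (sketch)
library(aplore3)
data("icu")
addmargins(table(Survival = icu$sta, `Prior CPR` = icu$cpr))
```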
Probability $p$ | Odds $= p/(1-p)$ | Odds as a ratio
---|---|---
0 | 0/1 = 0 | 0 : 1
1/100 = 0.01 | 1/99 = 0.0101 | 1 : 99
1/10 = 0.10 | 1/9 = 0.11 | 1 : 9
1/4 = 0.25 | 1/3 = 0.33 | 1 : 3
1/3 = 0.33 | 1/2 = 0.50 | 1 : 2
1/2 = 0.50 | 1 | 1 : 1
2/3 = 0.67 | 2 | 2 : 1
3/4 = 0.75 | 3 | 3 : 1
1 | 1/0 (undefined) | undefined
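A small helper pair makes the probability-to-odds mapping concrete (a sketch; the function names are illustrative):

```r
# Convert probability to odds and back (sketch)
odds <- function(p) p / (1 - p)
prob <- function(o) o / (1 + o)
odds(c(0.01, 0.10, 0.25, 0.50, 0.75))  # 0.0101 0.1111 0.3333 1.0000 3.0000
```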
Suppose $Y$ is a binary response, and let $p = P(Y = 1)$ denote the probability of the outcome. Logistic regression models the association between the probability $p$ and a predictor $x$. Since a probability can only take values from 0 to 1, ordinary linear regression can fail, with the predicted response falling outside the interval $[0, 1]$. Recall that the odds corresponding to a probability $p$ are $p/(1-p)$, so the log-odds $\log\left(\frac{p}{1-p}\right)$ can take any real value. The model for a single variable logistic regression is

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x.$$

The intercept $\beta_0$ is the log-odds of the outcome when $x = 0$. The slope coefficient $\beta_1$ is the change in log-odds associated with a one unit difference in $x$; equivalently, $e^{\beta_1}$ is the odds ratio comparing groups that differ by one unit in $x$.
The odds of survival in patients who previously received CPR: $6/7 \approx 0.86$. By comparison, the odds of survival in patients without prior CPR are $154/33 \approx 4.67$.
As with linear regression, the regression slope captures information about the association between a response and a predictor.
Typically, the hypotheses of interest are $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$. These hypotheses can also be written in terms of the odds ratio: $H_0: e^{\beta_1} = 1$ versus $H_A: e^{\beta_1} \neq 1$.
Suppose there are several predictors $x_1, \ldots, x_k$. With several predictors, the model becomes

$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$
Each coefficient estimates the difference in log(odds) for a one unit difference in that variable, for groups in which the other variables are the same.
Call:
glm(formula = sta ~ cpr + cre + age, family = binomial(link = "logit"),
data = icu)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.32901 0.74884 4.446 8.77e-06 ***
cprYes -1.69680 0.62145 -2.730 0.00633 **
cre> 2.0 -1.13328 0.70191 -1.615 0.10641
age -0.02814 0.01125 -2.502 0.01235 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 200.16 on 199 degrees of freedom
Residual deviance: 181.47 on 196 degrees of freedom
AIC: 189.47
Number of Fisher Scoring iterations: 5
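The coefficients above are on the log-odds scale; exponentiating gives odds ratios. A sketch, assuming sta is coded so that survival is the modeled outcome, as in the output above (fit.icu is an illustrative name):

```r
# Refit the model and report odds ratios with Wald confidence intervals (sketch)
fit.icu <- glm(sta ~ cpr + cre + age, family = binomial(link = "logit"), data = icu)
exp(coef(fit.icu))            # e.g., exp(-1.697) = 0.18: prior CPR lowers survival odds
exp(confint.default(fit.icu)) # Wald intervals transformed to the odds-ratio scale
```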
Typically, the hypotheses of interest are $H_0: \beta_j = 0$ versus $H_A: \beta_j \neq 0$; summary(glm()) reports a Wald statistic $z = \hat{\beta}_j / SE(\hat{\beta}_j)$ and a two-sided $p$-value for each coefficient.
These hypotheses can also be written in terms of the odds ratio: $H_0: e^{\beta_j} = 1$ versus $H_A: e^{\beta_j} \neq 1$.
We considered the least squares estimator, which minimizes $\sum_{i=1}^{n} (y_i - x_i^\top \beta)^2$.
The ridge (or Tikhonov) regularized regression estimator minimizes a sum of squares subject to an $\ell_2$ penalty on the coefficients:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{k} \beta_j^2.$$

The lasso regularized regression estimator minimizes a sum of squares subject to an $\ell_1$ penalty:

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{k} |\beta_j|.$$

The elastic net regularized regression estimator combines the ridge and lasso penalties. It is defined as

$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \left[ \alpha \sum_{j=1}^{k} |\beta_j| + (1 - \alpha) \sum_{j=1}^{k} \beta_j^2 \right].$$
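All three penalized fits are available in the glmnet package, where the mixing parameter alpha interpolates between ridge and lasso; a minimal sketch on simulated data:

```r
# Ridge, lasso, and elastic net with glmnet (sketch; simulated data)
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), nrow = 100)  # 100 observations, 10 predictors
y <- x[, 1] - 2 * x[, 2] + rnorm(100)     # response driven by two predictors
fit.ridge <- glmnet(x, y, alpha = 0)      # alpha = 0: ridge (L2) penalty
fit.lasso <- glmnet(x, y, alpha = 1)      # alpha = 1: lasso (L1) penalty
fit.enet  <- glmnet(x, y, alpha = 0.5)    # 0 < alpha < 1: elastic net
```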
Regression methods borrow information across observations to estimate the regression function.
What if we evaluate a large collection of candidate estimators?
Can we run a fair competition between estimators to determine the best? If so, how would we go about doing this?
How can we estimate the risk of an estimator?
Idea: Use the empirical risk estimate

$$\hat{R}(\hat{f}) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{f}(x_i)\right)^2.$$

Problem: the empirical risk is optimistically biased, because the same data are used both to fit $\hat{f}$ and to evaluate it.
By analogy: A student got a glimpse of the exam before taking it.
With flexible estimators the optimism can be severe: when we use the same data to fit the regression function and to evaluate it, the apparent error understates the true risk, and overfitting is rewarded.
What we need is external data to properly evaluate estimators.
Instead, we use cross-validation to estimate the risk.
Data are divided into $K$ folds; each fold in turn is held out as a validation set while the model is fit on the remaining folds, and the $K$ held-out risk estimates are averaged.
Several factors to consider when choosing $K$: larger values reduce bias but increase variance and computation; $K = 5$ or $K = 10$ are common choices. A sketch follows below.
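A minimal K-fold cross-validation sketch in base R, assuming prevend.samp is loaded and using squared error as the loss:

```r
# 5-fold cross-validated mean squared error for a linear model (sketch)
set.seed(1)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(prevend.samp)))
cv.err <- sapply(1:K, function(k) {
  train <- prevend.samp[folds != k, ]  # fit on K - 1 folds
  test  <- prevend.samp[folds == k, ]  # evaluate on the held-out fold
  fit   <- lm(RFFT ~ Age + Education, data = train)
  mean((test$RFFT - predict(fit, newdata = test))^2)
})
mean(cv.err)  # cross-validation estimate of the risk
```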
HST 190: Introduction to Biostatistics