Regression: Linear, Logistic, Regularized

Nima Hejazi

Harvard Biostatistics

July 8, 2025


Categorical predictors

With a binary predictor, linear regression estimates the difference in the mean response between groups defined by the levels.

For example, the equation for the linear regression of RFFT score on statin use with the data prevend.samp is \[\widehat{\text{RFFT}} = 70.71 - 10.05(\text{StatinUser})\]

  • Mean RFFT score for individuals not using statins is 70.71.
  • Mean RFFT score for individuals using statins is 10.05 points lower than non-users, \(70.71 - 10.05 = 60.66\).
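
A minimal sketch of this fit in R, assuming prevend.samp is loaded with the RFFT and Statin variables:

# regress RFFT score on statin use; the slope is the difference in group means
coef(lm(RFFT ~ Statin, data = prevend.samp))
# group means computed directly, for comparison
tapply(prevend.samp$RFFT, prevend.samp$Statin, mean)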

What about with more than two levels?

For example: Is RFFT score associated with education?

The variable Education indicates the highest level of education an individual completed in the Dutch educational system:

  • 0: primary school
  • 1: lower secondary school
  • 2: higher secondary education
  • 3: university education
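
The coefficient names in the output below (e.g., EducationLowerSecond) suggest Education has been converted to a labeled factor; a sketch of one way to do this, with level labels assumed from that output:

# convert the numeric codes 0-3 into a factor with descriptive labels
prevend.samp$Education <- factor(prevend.samp$Education,
                                 levels = 0:3,
                                 labels = c("Primary", "LowerSecond",
                                            "HigherSecond", "Univ"))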

Interpreting regression on Education

The baseline category represents individuals who at most completed primary school (Education = 0); the intercept is the sample mean RFFT score for this group.

Each coefficient indicates the difference in estimated average RFFT score relative to the baseline group, e.g., a predicted increase of 44.96 points for Univ: \(40.94 + 44.96 = 85.90\)

Results of linear regression:

  (Intercept)            40.94118
  EducationLowerSecond   14.77857
  EducationHigherSecond  32.13345
  EducationUniv          44.96389

Group means:

  Primary       40.94118
  LowerSecond   55.71975
  HigherSecond  73.07463
  Univ          85.90506
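
A short sketch in R illustrating how the coefficients and group means relate, assuming the Education factor defined above:

fit.edu <- lm(RFFT ~ Education, data = prevend.samp)
coef(fit.edu)        # intercept = Primary mean; other coefficients are offsets
# group means computed directly
tapply(prevend.samp$RFFT, prevend.samp$Education, mean)
# e.g., Univ group mean = intercept + Univ coefficient
coef(fit.edu)["(Intercept)"] + coef(fit.edu)["EducationUniv"]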

Inference for multiple regression

The coefficients of multiple regression \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p\) are estimates of the population parameters in the regression:

\[y = \beta_0 + \beta_1x_1 + \cdots + \beta_p x_p + \epsilon, \quad \epsilon \sim N(0, \sigma)\]

Inference is usually about slope parameters: \(\beta_1, \beta_2, \dots, \beta_p\)

Typically, the hypotheses of interest are

  • \(H_0: \beta_k = 0\), the variables \(X_k\) and \(Y\) are not associated
  • \(H_A: \beta_k \neq 0\), the variables \(X_k\) and \(Y\) are associated

Testing hypotheses about slope coefficients

The \(t\)-statistic has \(df = n - p - 1\), where \(n\) is the sample size and \(p\) is the number of predictors in the model.

\[t = \dfrac{\hat{\beta}_k - \beta_{k, H_0}}{\text{SE}(\hat{\beta}_k)} = \dfrac{\hat{\beta}_k}{\text{SE}(\hat{\beta}_k)}\]

Then, a 95% confidence interval for the slope \(\beta_k\) is given by \[\hat{\beta}_k \pm \left(t^{\star} \times \text{SE}(\hat{\beta}_k) \right),\] where \(t^{\star}\) is the quantile of a \(t\)-distribution with \(n - p - 1\) degrees of freedom with \(\alpha/2\) area to the right.
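
A sketch of these calculations in R, using confint() and the \(t\) quantile directly (the RFFT model here is chosen only for illustration):

fit <- lm(RFFT ~ Statin + Age, data = prevend.samp)
# 95% confidence intervals for all coefficients
confint(fit, level = 0.95)
# by hand for one slope: estimate +/- t* x SE, with df = n - p - 1
est   <- coef(summary(fit))["Age", "Estimate"]
se    <- coef(summary(fit))["Age", "Std. Error"]
tstar <- qt(0.975, df = df.residual(fit))
est + c(-1, 1) * tstar * se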

The \(F\)-statistic in multiple regression

The \(F\)-statistic gives a test (ANOVA) to assess whether predictors in the regression, as a group, are associated with \(Y\).

  • \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\)
  • \(H_A:\) At least one of the slope coefficients is not 0

There is sufficient evidence to reject \(H_0\) (no association) if the \(p\)-value of the \(F\)-statistic is smaller than or equal to \(\alpha\).

summary(lm()) reports the \(F\)-statistic and its \(p\)-value.

Inference: RFFT vs. Education

# model summary
summary(lm(RFFT ~ Education, data = prevend.samp))

Call:
lm(formula = RFFT ~ Education, data = prevend.samp)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.905 -15.975  -0.905  16.068  63.280 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)             40.941      3.203  12.783  < 2e-16 ***
EducationLowerSecond    14.779      3.686   4.009 7.04e-05 ***
EducationHigherSecond   32.133      3.763   8.539  < 2e-16 ***
EducationUniv           44.964      3.684  12.207  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.87 on 496 degrees of freedom
Multiple R-squared:  0.3072,    Adjusted R-squared:  0.303 
F-statistic:  73.3 on 3 and 496 DF,  p-value: < 2.2e-16
# connection to ANOVA
summary(aov(prevend.samp$RFFT ~ prevend.samp$Education))
                        Df Sum Sq Mean Sq F value Pr(>F)    
prevend.samp$Education   3 115041   38347    73.3 <2e-16 ***
Residuals              496 259469     523                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Reanalyzing the PREVEND data

# fit model adjusting for age, edu, cvd
model1 <- lm(RFFT ~ Statin + Age + Education + CVD,
             data = prevend.samp)
summary(model1)

Call:
lm(formula = RFFT ~ Statin + Age + Education + CVD, data = prevend.samp)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.348 -15.586  -0.136  13.795  63.935 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           99.03507    6.33012  15.645  < 2e-16 ***
Statin                 4.69045    2.44802   1.916  0.05594 .  
Age                   -0.92029    0.09041 -10.179  < 2e-16 ***
EducationLowerSecond  10.08831    3.37556   2.989  0.00294 ** 
EducationHigherSecond 21.30146    3.57768   5.954 4.98e-09 ***
EducationUniv         33.12464    3.54710   9.339  < 2e-16 ***
CVDPresent            -7.56655    3.65164  -2.072  0.03878 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 20.71 on 493 degrees of freedom
Multiple R-squared:  0.4355,    Adjusted R-squared:  0.4286 
F-statistic: 63.38 on 6 and 493 DF,  p-value: < 2.2e-16

Interaction in regression

When performing multiple regression, the model \[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_px_p + \epsilon\] assumes that a 1-unit difference in a predictor \(x_j\) changes the predicted response by \(\beta_j\), holding the other covariates constant.

A statistical interaction arises when the effect of a predictor \(x_j\) on the response depends on the value(s) of another covariate.

Cholesterol vs. Age and Diabetes

 (Intercept)          Age  DiabetesYes 
 4.800011340  0.007491805 -0.317665963 
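
These are coefficients from an additive (no-interaction) model; a hedged sketch of a call that would produce output of this form, assuming the same nhanes.samp.adult.500 data used in the interaction model below:

# additive model: common slope for Age, shift in intercept for Diabetes
model.main <- lm(TotChol ~ Age + Diabetes, data = nhanes.samp.adult.500)
coef(model.main)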

Cholesterol vs. Age and Diabetes

What if two distinct regressions were fit to evaluate total cholesterol and age (varying by diabetic status)?

Adding an interaction term

Consider the model \[E(TotChol) = \beta_0 + \beta_1(Age) + \beta_2(Diabetes) + \beta_3(Diabetes \times Age).\]

The term \((Diabetes \times Age)\) is the interaction between diabetes status and age, and \(\beta_3\) is the coefficient of the interaction.

# fit the model
model.interact <- lm(TotChol ~ Age * Diabetes, data = nhanes.samp.adult.500)
coef(model.interact)
    (Intercept)             Age     DiabetesYes Age:DiabetesYes 
    4.695702513     0.009638183     1.718704342    -0.033451562 

Significance of an interaction term

summary(model.interact)

Call:
lm(formula = TotChol ~ Age * Diabetes, data = nhanes.samp.adult.500)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3587 -0.7448 -0.0845  0.6307  4.2480 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      4.695703   0.159691  29.405  < 2e-16 ***
Age              0.009638   0.003108   3.101  0.00205 ** 
DiabetesYes      1.718704   0.763905   2.250  0.02492 *  
Age:DiabetesYes -0.033452   0.012272  -2.726  0.00665 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.061 on 469 degrees of freedom
  (27 observations deleted due to missingness)
Multiple R-squared:  0.03229,   Adjusted R-squared:  0.0261 
F-statistic: 5.216 on 3 and 469 DF,  p-value: 0.001498

Interpretation: For those with diabetes, increasing age has a negative (not positive!) association with total cholesterol (\(\hat{\beta}_1 + \hat{\beta}_3 = 0.0096 - 0.0335 = -0.024\)).

\(R^2\) with multiple regression

As in simple regression, \(R^2\) represents the proportion of variability in the response variable explained by the model.

As variables are added, \(R^2\) never decreases (and typically increases), even if the added variables are uninformative.

In the summary(lm()) output, Multiple R-squared is \(R^2\).

# extract R^2 of a model
summary(lm(RFFT ~ Statin + Age, data = prevend.samp))$r.squared
[1] 0.2851629

Adjusted \(R^2\) as a tool for model assessment

\[R_{adj}^{2} = 1 - \left( \frac{\text{Var}(e_i)}{\text{Var}(y_i)} \times \frac{n-1}{n-p-1} \right),\] where \(n\) is the number of units and \(p\) is the number of predictors.

Adjusted \(R^2\) incorporates a penalty for the inclusion of predictors that fail to contribute much to explaining observed variation in \(Y\).

  • often used to balance predictive ability and model complexity.
  • unlike \(R^2\), \(R^2_{adj}\) does not have an inherent interpretation.
# extract adjusted R^2 of a model
summary(lm(RFFT ~ Statin + Age, data = prevend.samp))$adj.r.squared
[1] 0.2822863
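
A sketch verifying the adjusted \(R^2\) formula against the summary output (using sums of squares, which is equivalent since the \(n-1\) denominators of the variances cancel; assumes no missing values in these variables):

fit <- lm(RFFT ~ Statin + Age, data = prevend.samp)
n <- nobs(fit)
p <- length(coef(fit)) - 1                    # number of predictors
sse <- sum(residuals(fit)^2)                  # numerator of Var(e_i)
sst <- sum((prevend.samp$RFFT - mean(prevend.samp$RFFT))^2)  # numerator of Var(y_i)
1 - (sse / sst) * (n - 1) / (n - p - 1)       # should match summary(fit)$adj.r.squared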

Logistic regression

A generalization of methods for two-way tables that models the association between a binary response and several predictors jointly.

Similar in intent to linear regression, but details differ a bit…

  • the response variable is categorical (specifically, binary)
  • the model is not estimated via minimizing least squares
  • the model coefficients have a (very) different interpretation

Survival to discharge in the ICU

Patients admitted to ICUs are very ill — usually from a serious medical event (e.g., respiratory failure, stroke) or from trauma (e.g., traffic accident).

Question: Can patient features available at admission be used to estimate the probability of survival to hospital discharge?

The icu dataset in the aplore3 package is from a study conducted at Baystate Medical Center in Springfield, MA.

  • The dataset contains information about patient characteristics at admission, including heart rate, diagnosis, kidney function, and much more.
  • The sta variable includes labels Died (0 for death before discharge) and Lived (1 for survival to discharge).

Survival to discharge in the ICU: To CPR or not?

Let’s take a look at how ICU patients fare with CPR:

        Prior CPR
Survival  No Yes Sum
   Died   33   7  40
   Lived 154   6 160
   Sum   187  13 200
  • Odds of survival for those who did not receive CPR: \(154/33 = 4.67\)
    • \(\text{odds} = \dfrac{p}{1 - p} = \dfrac{0.824}{1 - 0.824} = 4.67\)
  • Probability of survival for those who did not receive CPR: \(154/187 = 0.824\)
    • \(p = \frac{\text{odds}}{1 + \text{odds}} = \dfrac{4.67}{1 + 4.67} = 0.824\)
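
A tiny sketch of these odds/probability conversions in R:

p <- 154 / 187              # probability of survival without prior CPR
odds <- p / (1 - p)         # 4.67
odds / (1 + odds)           # recovers p = 0.824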

Odds and probabilities

Probability         Odds \(= p/(1-p)\)         Odds (as a ratio)
0                   \(0/1 = 0\)                0
\(1/100 = 0.01\)    \(1/99 = 0.0101\)          1 : 99
\(1/10 = 0.10\)     \(1/9 = 0.11\)             1 : 9
\(1/4\)             \(1/3\)                    1 : 3
\(1/3\)             \(1/2\)                    1 : 2
\(1/2\)             \((1/2)/(1/2) = 1\)        1 : 1
\(2/3\)             \((2/3)/(1/3) = 2\)        2 : 1
\(3/4\)             \((3/4)/(1/4) = 3\)        3 : 1
1                   \(1/0 \to \infty\)         \(\infty\)

Logistic regression

Suppose \(Y\) is a binary variable and \(X\) is a predictor variable.

  • In our ICU example: \(Y\) is survival to discharge, \(X\) is prior CPR.
  • Let \(p(x) = \mathbb{P}(Y = 1 \mid X = x)\), for \(Y = 1\) denoting survival.

The model for a single variable logistic regression is \[\log\left[\frac{p(x)}{1 - p(x)}\right] = \beta_0 + \beta_1x\]

Logistic regression models the association between the probability \(p\) of an event of interest and one or more predictor variables.

Why log(odds) in regression models?

Since a probability can only take values between 0 and 1, modeling it directly with linear regression can fail: the predicted response may fall outside the interval \([0,1]\).

  • The odds, \(p/(1-p)\), ranges from 0 to \(\infty\).
  • The natural log of the odds (log odds) ranges from \(-\infty\) to \(\infty\).
glm(sta ~ cpr, data = icu, family = binomial(link = "logit"))

Call:  glm(formula = sta ~ cpr, family = binomial(link = "logit"), data = icu)

Coefficients:
(Intercept)       cprYes  
      1.540       -1.695  

Degrees of Freedom: 199 Total (i.e. Null);  198 Residual
Null Deviance:      200.2 
Residual Deviance: 192.2    AIC: 196.2

Interpreting the output

Recall that \(p(x) = \mathbb{P}(Y = 1 \mid X = x) = p(\text{survival = 1} \mid \text{CPR = yes})\).

The logistic regression model’s equation:

\[\log\left[\frac{\hat{p}( \text{status = lived} \mid \text{CPR = yes})}{1 - \hat{p}(\text{status = lived} \mid \text{CPR = yes})}\right] = 1.540 -1.695(\text{CPR = yes})\]

The intercept estimate \(\hat{\beta}_0\) is the log(\(\widehat{\text{odds}}\) of survival) for ICU patients who did not receive prior CPR:

  • estimated odds of survival \(= \exp(1.540) = 4.66\)
  • estimated probability of survival \(= \dfrac{\text{odds}}{\text{1 + odds}} = \dfrac{4.66}{1 + 4.66} = 0.823\)

Interpreting the output…

The estimated slope coefficient \(\hat{\beta}_1\) of \(\text{CPR = yes}\) represents the difference in log(odds of survival) between the prior-CPR group and the no-prior-CPR group.

The odds of survival in patients who previously received CPR:

  • \(\log\left[\frac{\hat{p}( \text{status = lived} \mid \text{CPR = yes})}{1 - \hat{p}(\text{status = lived} \mid \text{CPR = yes})}\right] = 1.540 - 1.695(1)\)
  • \(\widehat{\text{odds}}\) of survival \(= \exp(1.540 - 1.695) = 0.856\)

\(\hat{\beta}_1\) is the log of the estimated odds ratio for survival, comparing patients with prior CPR to those without:

  • \(\widehat{OR} = \exp(-1.695) = 0.184\)
  • \(\widehat{OR} = \dfrac{\text{odds}_{\text{CPR = yes}}}{\text{odds}_{\text{CPR = no}}} = \dfrac{0.856}{4.66} = 0.184\)
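
A sketch pulling these quantities from the fitted model in R, assuming the icu data from the aplore3 package are loaded:

library(aplore3)
fit.cpr <- glm(sta ~ cpr, data = icu, family = binomial(link = "logit"))
exp(coef(fit.cpr))       # odds of survival without prior CPR, and the odds ratio for CPR
exp(sum(coef(fit.cpr)))  # odds of survival with prior CPR: exp(1.540 - 1.695) = 0.856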

Inference for simple logistic regression

As with linear regression, the regression slope captures information about the association between a response and a predictor.

  • \(H_0: \beta_1 = 0\), the \(X\) and \(Y\) variables are not associated
  • \(H_A: \beta_1 \neq 0\), the \(X\) and \(Y\) variables are associated

These hypotheses can also be written in terms of the odds ratio.

Logistic regression with multiple predictors

Suppose \(p(x) = p(x_1, x_2, \ldots, x_p) = \mathbb{P}(Y = 1 \mid x_1, x_2, \ldots, x_p)\).

With several predictors \(x_1, x_2, \ldots, x_p\) the regression equation is

\[\log\left[\frac{p(x)}{1 - p(x)}\right] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\]

Each coefficient estimates the difference in log(odds) for a one unit difference in that variable, for groups in which the other variables are the same.

Survival versus CPR and age

summary(glm(sta ~ cpr + cre + age, data = icu,
            family = binomial(link = "logit")))

Call:
glm(formula = sta ~ cpr + cre + age, family = binomial(link = "logit"), 
    data = icu)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.32901    0.74884   4.446 8.77e-06 ***
cprYes      -1.69680    0.62145  -2.730  0.00633 ** 
cre> 2.0    -1.13328    0.70191  -1.615  0.10641    
age         -0.02814    0.01125  -2.502  0.01235 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.16  on 199  degrees of freedom
Residual deviance: 181.47  on 196  degrees of freedom
AIC: 189.47

Number of Fisher Scoring iterations: 5

Inference for multiple logistic regression

Typically, the hypotheses of interest are

  • \(H_0: \beta_k = 0\), the variables \(X_k\) and \(Y\) are not associated
  • \(H_A: \beta_k \neq 0\), the variables \(X_k\) and \(Y\) are associated

These hypotheses can also be written in terms of the odds ratio.

Regression recap

We considered \(\E[Y \mid X] = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p\), the multiple linear regression model with outcome \(Y\) and predictors \(X_1, \ldots, X_p\).

  • When fitting the regression model, we used ordinary least squares to minimize the sum of squared residuals: \[ (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p) = \argmin_{\beta_0, \ldots, \beta_p} \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip})^2 \]
  • This doesn’t work when the number of predictors exceeds the number of observations, i.e., \(p > n\), in which case the least-squares problem is ill-posed (an underdetermined system of equations with no unique solution).
  • The case \(p > n\) is a common problem in high-dimensional data analysis.

Regularized regression

  • In this high-dimensional setting, how do we choose the relevant predictors to fit the best model that we can?
  • A penalized or regularized regression procedure changes the quantity we minimize to discourage choosing too many variables.
    • Force estimates of \(\beta\) for unimportant predictors to zero, excluding these from the chosen model
  • Regularized regression is a general family of procedures involving different forms of penalization, including lasso, ridge, and elastic net.

Regularized regression: Ridge

The ridge (or Tikhonov) regularized regression estimator minimizes a sum of squares subject to an \(L_2\) (squared) penalty on the \(\beta_j\). It is defined

\[ (\tilde{\beta}_0, \tilde{\beta}_1, \ldots, \tilde{\beta}_p) = \argmin_{\beta_0, \ldots, \beta_p} \left(\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip})^2 + \lambda \sum_{j=1}^p \beta_j^2 \right) \ . \]

  • The penalty (degree of regularization) is set by the tuning parameter \(\lambda\).
  • As \(\lambda\) is increased, the ridge penalty shrinks the estimated coefficients towards zero, but it never sets them exactly to zero, so all variables remain in the model.
  • The choice of \(\lambda\) determines the selected regression model, so how do we go about choosing \(\lambda\)?
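
A minimal sketch of ridge regression using the glmnet package, which is not otherwise used in these notes (alpha = 0 selects the ridge penalty, x must be a numeric matrix, and the value of lambda below is arbitrary):

library(glmnet)
# build the design matrix (dummy-coded factors), dropping the intercept column
x <- model.matrix(RFFT ~ Statin + Age + Education + CVD, data = prevend.samp)[, -1]
y <- prevend.samp$RFFT
fit.ridge <- glmnet(x, y, alpha = 0)   # ridge: L2 penalty only
coef(fit.ridge, s = 1)                 # coefficients at lambda = 1 (illustrative)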

Regularized regression: Lasso

The lasso regularized regression estimator minimizes a sum of squares subject to an \(L_1\) (absolute) penalty on the \(\beta_j\). It is defined

\[ (\tilde{\beta}_0, \tilde{\beta}_1, \ldots, \tilde{\beta}_p) = \argmin_{\beta_0, \ldots, \beta_p} \left(\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip})^2 + \lambda \sum_{j=1}^p \lvert \beta_j \rvert \right) \ . \]

  • Again, the penalty (degree of regularization) is controlled by the tuning parameter \(\lambda\).
  • As \(\lambda\) is increased, the lasso penalty shrinks estimated coefficients towards zero, setting many to zero exactly.
  • There are many variations to the lasso (group lasso, fused lasso, etc.), and it is a popular variable selection strategy.
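
A sketch of the lasso with glmnet, letting cross-validation (discussed below) choose \(\lambda\); the design matrix is built as in the ridge sketch:

library(glmnet)
x <- model.matrix(RFFT ~ Statin + Age + Education + CVD, data = prevend.samp)[, -1]
y <- prevend.samp$RFFT
fit.lasso <- cv.glmnet(x, y, alpha = 1)  # lasso: L1 penalty; lambda chosen by CV
coef(fit.lasso, s = "lambda.min")        # sparse coefficients at the selected lambda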

Regularized regression: Elastic net

The elastic net regularized regression estimator combines the ridge and lasso penalties. It is defined as

\[ (\tilde{\beta}_0, \tilde{\beta}_1, \ldots, \tilde{\beta}_p) = \argmin_{\beta_0, \ldots, \beta_p} \left(\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_{i1} - \ldots - \beta_p x_{ip})^2 + \lambda_1 \sum_{j=1}^p \lvert \beta_j \rvert + \lambda_2 \sum_{j=1}^p \beta_j^2 \right) \ . \]

  • Note the correspondence to ridge regression (when \(\lambda_1 = 0\)) and also to lasso regression (when \(\lambda_2 = 0\)).
  • The elastic net overcomes some pitfalls of the lasso, including that, among a group of highly correlated predictors, the lasso tends to select just one essentially arbitrarily; the added \(L_2\) penalty also ensures a unique minimum.
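
A sketch of the elastic net with glmnet; note that glmnet parameterizes the penalty with a single \(\lambda\) and a mixing weight alpha between the \(L_1\) and \(L_2\) parts, rather than two separate \(\lambda\)'s:

library(glmnet)
x <- model.matrix(RFFT ~ Statin + Age + Education + CVD, data = prevend.samp)[, -1]
y <- prevend.samp$RFFT
fit.enet <- cv.glmnet(x, y, alpha = 0.5)  # equal mix of L1 and L2 penalties
coef(fit.enet, s = "lambda.min")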

Generalization in regression

Regression methods borrow information across \(X\) to learn about the form of the regression function \(\E[Y \mid X]\) via some selected model.

  • How do we choose \(X\) to include, or, equivalently, how do we choose \(\lambda\) in regularized regression?
  • Borrowing (or smoothing) across all \(X\) to inform \(\E[Y \mid X]\), regression is a global learning problem (for \(X_j = x\), \(\beta_j\) is fixed).
    • borrowing too much across \(X\) leads to high bias
    • borrowing too little across \(X\) leads to high variance
    • the goal is to find the right balance: bias-variance trade-off
  • Can we use the available data to make an optimal choice of \(\lambda\)?

Generalization in regression

What if we evaluate a large collection (e.g., in \(\lambda\)) of pre-specified regression techniques, choosing the best one as our final estimate \(\hat{\E}[Y \mid X]\)?

  • we can still entirely pre-specify the whole estimation procedure
  • …but we need to determine how to pick the best technique

Can we run a fair competition between estimators to determine the best? If so, how would we go about doing this?

Cross-validation and generalization

How can we estimate the risk of an estimator?

Idea: Use the empirical risk estimate \[ \hat{R}(\hat{\E}) = \frac{1}{n}\sum_{i=1}^n \left\{Y_i - \hat{\E}(X_i)\right\}^2 \]

Problem: \(\hat{\E}(\cdot)\) would be fitted using the same data that we use to estimate its risk! (We would be testing the estimator on its own training data.)
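
A sketch of this (optimistic) in-sample calculation, using one of the earlier models for illustration:

# empirical risk = mean squared residual, computed on the data used to fit the model
fit <- lm(RFFT ~ Statin + Age, data = prevend.samp)
mean(residuals(fit)^2)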

Cross-validation and generalization

By analogy: A student got a glimpse of the exam before taking it.

  • the student focuses on learning the test questions very well… but
  • the test result doesn’t reflect how well the student learned the subject

With \(n=100\) units, fit a linear regression on \(X, X^2, \dots, X^{20}\), \[ \hat{\E}(x) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \cdots + \hat{\beta}_{20} x^{20}, \] versus fitting the true but unknown regression, say on \(X, X^2, X^3\): \[ \hat{\E}(x) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \hat{\beta}_3 x^3 \]

Cross-validation and generalization

When we use the same data to fit the regression function and to evaluate it:

  • we get an overly optimistic picture of how well each estimator fits;
  • so, we would select the wrong estimator.

What we need is external data to properly evaluate estimators.

  • We could get a better idea of how well estimators are doing.
  • We could objectively select the correct estimator to use.
  • …but such data are rarely available.

Cross-validation

Instead, we use cross-validation to estimate the risk.

  • Fit regression on part of the data, evaluate on another part.
  • Mimics the evaluation of the given fit on external data.
  • Many forms of cross-validation exist; \(V\)-fold cross-validation is a simple and widely used scheme.

Cross-validation

Data are divided into \(V\) folds, each of size \(\approx \frac{n}{V}\); for example, \(V=10\).

  • Training sample is used to fit (“train”, “learn”) the regressions.
  • Validation sample is used to estimate risk (“validate”, “test”, “evaluate”).
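
A minimal sketch of \(V\)-fold cross-validation in base R, estimating the risk of a single regression fit (the model and \(V = 10\) are illustrative):

V <- 10
n <- nrow(prevend.samp)
folds <- sample(rep(1:V, length.out = n))      # random fold assignment
cv.mse <- sapply(1:V, function(v) {
  train <- prevend.samp[folds != v, ]          # fit on the other V - 1 folds
  valid <- prevend.samp[folds == v, ]          # evaluate on the held-out fold
  fit <- lm(RFFT ~ Statin + Age, data = train)
  mean((valid$RFFT - predict(fit, newdata = valid))^2)
})
mean(cv.mse)                                   # cross-validated risk estimate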

Several factors to consider when choosing \(V\):

  • large \(V\) = more data available to fit the regressions, but greater computation time
  • small \(V\) = more data available to evaluate the risk in each fold