HST 190 at HMS – Biostatistical Science and Data Analysis

*OI Biostat* Table 1.6
sex	age	race	height	weight	actn3.r577x	ndrm.ch
Female	20	Caucasian	60.0	90	CT	80.0
Female	21	Caucasian	68.0	149	CT	57.1
Male	18	Caucasian	74.0	183	CT	50.0
Male	24	Other	72.6	135	TT	85.7
Male	21	Caucasian	65.0	133	CT	40.0
Male	28	Asian	71.0	141	CC	42.9
Female	23	Hispanic	63.2	129	TT	30.0

Distributions and summary measures

The collection of values for a numerical, continuous variable (e.g., weight) is the distribution of that variable.

Numerical and graphical summaries convey characteristics of a distribution without listing all the values.

Important characteristics include…

Center: where is the middle of the distribution?
- Measures of center: mean, median
Spread: how similar to, or different from, each other are the values?
- Measures of spread: standard deviation, interquartile range

Measures of center

The sample mean of a variable is the sum of all observations divided by the number of observations:

$\bar{x} = \frac{x_{1} + x_{2} + \dots + x_{n}}{n}$, where $x_{1}, x_{2}, \dots, x_{n}$ represent the $n$ observed values in a sample.

The mean weight in the famuss dataset is 155.648 pounds.
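
As a small illustration, the sketch below applies the formula to the seven weights listed in Table 1.6. Because these rows are only a subset of the famuss data, the result is not expected to match the full-dataset mean of 155.648 pounds. The variable names are illustrative.

```python
# Weights (pounds) of the seven participants shown in Table 1.6.
weights = [90, 149, 183, 135, 133, 141, 129]

# Sample mean: sum of all observations divided by the number of observations.
n = len(weights)
mean_weight = sum(weights) / n

print(f"n = {n}, mean weight = {mean_weight:.2f} pounds")
```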

The median is the value of the middle observation when the sample is sorted from smallest to largest. If the number of observations is

- odd, then the median is the middle observation
- even, then the median is the average of the two middle observations

The median is the $50^{th}$ percentile: about half of the observations lie below the median and about half lie above it.

The median weight in the famuss dataset is 150 pounds.
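
The odd/even rule can be sketched as a short function. The example uses the seven weights from Table 1.6 (a subset of famuss, so the result differs from the full-dataset median of 150 pounds), and then drops the largest value purely to show the even case. The function name `median` is illustrative.

```python
def median(values):
    """Return the median: the middle value, or the average of the two middle values."""
    ordered = sorted(values)          # the rule applies to sorted observations
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: single middle observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle ones

# Weights (pounds) from Table 1.6.
weights = [90, 149, 183, 135, 133, 141, 129]

odd_case = median(weights)                 # 7 observations
even_case = median(sorted(weights)[:-1])   # 6 observations (largest value dropped)

print(odd_case, even_case)
```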

Measures of spread

The variance and standard deviation measure the distance between a “typical” observation and the mean.

An observation’s deviation is the distance between its value $x$ and the sample mean $\bar{x}$, that is, $d = x - \bar{x}$.
The sample variance $s^{2}$ is the sum of squared deviations divided by $n - 1$, the number of observations minus 1: $s^{2} = \frac{(x_{1} - \bar{x})^{2} + (x_{2} - \bar{x})^{2} + \dots + (x_{n} - \bar{x})^{2}}{n - 1},$ where $x_{1}, x_{2}, \dots, x_{n}$ represent the $n$ observed values.
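
A minimal sketch of this calculation, again using only the seven weights from Table 1.6 rather than the full famuss data, computes each deviation and then the sample variance. Variable names are illustrative.

```python
# Weights (pounds) from Table 1.6.
weights = [90, 149, 183, 135, 133, 141, 129]

n = len(weights)
mean_weight = sum(weights) / n

# Deviation of each observation from the sample mean: d = x - x_bar.
deviations = [x - mean_weight for x in weights]

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(d ** 2 for d in deviations) / (n - 1)

print(f"sample variance = {variance:.2f} pounds^2")
```

Note that the deviations themselves sum to zero, which is why they are squared before being added.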

The standard deviation is roughly the distance between a “typical” observation and the mean, expressed in the same units as the data.

The standard deviation $s$ is the square root of the variance $s^{2}$: $s = \sqrt{\frac{(x_{1} - \bar{x})^{2} + (x_{2} - \bar{x})^{2} + \dots + (x_{n} - \bar{x})^{2}}{n - 1}}$

In the famuss dataset, the standard deviation of the variable weight is 34.59 pounds.
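
As a quick check of the relationship between $s$ and $s^{2}$, the sketch below uses Python's standard `statistics` module on the seven weights from Table 1.6. These rows are a subset, so the value differs from the 34.59 pounds reported for the full dataset.

```python
import math
import statistics

# Weights (pounds) from Table 1.6.
weights = [90, 149, 183, 135, 133, 141, 129]

# statistics.variance uses the n - 1 denominator (sample variance).
variance = statistics.variance(weights)

# The standard deviation is the square root of the variance,
# so it is back on the original scale (pounds, not pounds squared).
sd = math.sqrt(variance)

print(f"s^2 = {variance:.2f} pounds^2, s = {sd:.2f} pounds")
```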

Measures of Spread: Percentiles/Quartiles

The $p^{th}$ percentile is the observation such that $p\%$ of the remaining observations fall below this observation.

The first quartile ($Q_{1}$) is the $25^{th}$ percentile.
The second quartile ($Q_{2}$), i.e., the median, is the $50^{th}$ percentile.
The third quartile ($Q_{3}$) is the $75^{th}$ percentile.

The interquartile range (IQR) is the distance between the third and first quartiles: $IQR = Q_{3} - Q_{1}$

In the famuss dataset, the IQR for the variable weight is 42 pounds.
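
A minimal sketch of the quartiles and IQR for the seven weights in Table 1.6 follows. It assumes the `"inclusive"` interpolation method of `statistics.quantiles`; software packages use different quartile conventions, so other tools may give slightly different values, and the result for this subset will not match the full-dataset IQR of 42 pounds.

```python
import statistics

# Weights (pounds) from Table 1.6.
weights = [90, 149, 183, 135, 133, 141, 129]

# n=4 splits the data into quarters and returns the three cut points Q1, Q2, Q3.
# method="inclusive" is an assumption; other quartile conventions exist.
q1, q2, q3 = statistics.quantiles(weights, n=4, method="inclusive")

# Interquartile range: distance between the third and first quartiles.
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}, IQR = {iqr}")
```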

Robust estimates

The median and IQR are often called robust estimates since they are less affected by extreme values than are means and standard deviations.

For distributions containing extreme observations, the median and IQR provide a more accurate sense of center and spread.
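
To illustrate, the sketch below adds one extreme weight to the seven values from Table 1.6 and compares how far the mean and the median move. The added value of 400 pounds is hypothetical and is not part of the famuss data.

```python
import statistics

# Weights (pounds) from Table 1.6.
weights = [90, 149, 183, 135, 133, 141, 129]

# Hypothetical extreme observation, added only for illustration.
with_extreme = weights + [400]

mean_before = sum(weights) / len(weights)
mean_after = sum(with_extreme) / len(with_extreme)

median_before = statistics.median(weights)
median_after = statistics.median(with_extreme)

# How far each measure of center moved after adding the extreme value.
mean_shift = mean_after - mean_before
median_shift = median_after - median_before

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f} (shift {mean_shift:.2f})")
print(f"median: {median_before} -> {median_after} (shift {median_shift})")
```

In this example the mean moves by more than 30 pounds while the median moves by 3 pounds.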

Histograms

Histograms show important features of the shape of a distribution:

Symmetry, or lack of it (skew)
Minimum and maximum values
Regions of high frequency (modes)

Histograms are less useful for:

Displaying the median or quartiles
Showing subtle skewing
Identifying extreme values
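
A histogram is built by grouping values into bins and counting the observations in each bin. The sketch below does this by hand for the seven weights in Table 1.6 and prints a rough text version of the plot. The bin width of 20 pounds is an arbitrary choice for illustration, and a different width would change the apparent shape.

```python
from collections import Counter

# Weights (pounds) from Table 1.6.
weights = [90, 149, 183, 135, 133, 141, 129]

bin_width = 20  # arbitrary choice for illustration

# Map each weight to the lower edge of its bin, e.g. 133 -> 120 for [120, 140).
counts = Counter((w // bin_width) * bin_width for w in weights)

# Print one row per bin, including empty bins between the minimum and maximum.
for lower in range(min(counts), max(counts) + bin_width, bin_width):
    print(f"[{lower}, {lower + bin_width}): {'#' * counts[lower]}")
```

Only the bin counts are displayed, which is why exact quartiles and individual extreme values are hard to read off a histogram.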

Box-and-whisker plots

Vu and Harrington (2020), Figure 1.20 (frog data)

Boxplots

A boxplot indicates the first, second, and third quartiles of a distribution
It also identifies potential outliers – observations far from the center
- Large outliers are > $Q_{3} + (1.5 \times IQR)$
- Small outliers are < $Q_{1} - (1.5 \times IQR)$

On a boxplot:

The rectangle extends from the first quartile to the third quartile, with a line at the second quartile (median).
Whiskers capture data that fall between $Q_{1} - (1.5 \times IQR)$ and $Q_{3} + (1.5 \times IQR)$, and they must end at data points.
Potential outliers are plotted as individual dots.
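
The outlier cutoffs and whisker endpoints can be sketched numerically. The example uses the seven weights from Table 1.6 and assumes the `"inclusive"` quartile method of `statistics.quantiles`; plotting software may use a different quartile convention and so draw slightly different boxes. Variable names such as `lower_fence` are illustrative.

```python
import statistics

# Weights (pounds) from Table 1.6.
weights = [90, 149, 183, 135, 133, 141, 129]

# Quartiles (the "inclusive" method is an assumption; conventions vary).
q1, q2, q3 = statistics.quantiles(weights, n=4, method="inclusive")
iqr = q3 - q1

# Cutoffs beyond which observations are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Potential outliers: observations outside the cutoffs.
outliers = sorted(x for x in weights if x < lower_fence or x > upper_fence)

# Whiskers end at the most extreme data points that are still inside the cutoffs.
inside = [x for x in weights if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"box: {q1} to {q3}, median line at {q2}")
print(f"cutoffs: {lower_fence} and {upper_fence}")
print(f"whiskers end at {whisker_low} and {whisker_high}; potential outliers: {outliers}")
```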
