Distributions and summary measures
The collection of values for a numerical, continuous variable (e.g., `weight`) is the distribution of that variable.
Numerical and graphical summaries convey characteristics of a distribution without listing all the values.
Important characteristics include…
- Center: where is the middle of the distribution?
- Measures of center: mean, median
- Spread: how similar or varied are the values to each other?
- Measures of spread: standard deviation, interquartile range
Measures of center
The sample mean of a variable is the sum of all observations divided by the number of observations:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where $x_1, x_2, \ldots, x_n$ represent the $n$ observed values in a sample.
The mean weight in the `famuss` dataset is 155.648 pounds.
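As a minimal sketch of the definition above, the mean can be computed directly as a sum divided by a count. The weights below are hypothetical values chosen for illustration, not the `famuss` data, and the function name `sample_mean` is an assumption.

```python
# Hypothetical weights in pounds (illustrative values, not the famuss data).
weights = [120.0, 135.0, 150.0, 165.0, 210.0]


def sample_mean(values):
    """Return the sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


mean_weight = sample_mean(weights)
print(mean_weight)
```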
Measures of center
The median is the value of the middle observation when the sample is sorted from smallest to largest. If the number of observations is
- odd, then the median is the middle observation
- even, then the median is the average of the two middle observations
The median is the $50^{th}$ percentile: 50% of observations lie below (and 50% above) the median.
The median weight in the `famuss` dataset is 150 pounds.
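The odd and even cases can be sketched as follows. The samples are hypothetical, and `sample_median` is an illustrative name rather than a function from any particular package.

```python
def sample_median(values):
    """Return the middle observation, or the average of the two middle ones."""
    ordered = sorted(values)  # the median is defined on the sorted sample
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: take the single middle value.
        return ordered[mid]
    # Even number of observations: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical samples (illustrative values, not the famuss data).
odd_sample = [150, 120, 210, 135, 165]
even_sample = [150, 120, 210, 135, 165, 180]

print(sample_median(odd_sample))
print(sample_median(even_sample))
```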
Measures of spread
The variance and standard deviation measure the distance between a “typical” observation and the mean.
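- An observation’s deviation is the distance between its value $x$ and the sample mean $\bar{x}$, that is, $x - \bar{x}$.
- The sample variance $s^2$ is the sum of squared deviations divided by the number of observations minus 1:
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
where $x_1, x_2, \ldots, x_n$ represent the $n$ observed values.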
Measures of spread
The standard deviation is the distance between a “typical” observation and the mean, expressed in the same units as the observations.
- The standard deviation $s$ is the square root of the variance $s^2$.
In the `famuss` dataset, the standard deviation of the variable `weight` is 34.59 pounds.
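The variance and standard deviation can be sketched step by step: compute deviations from the mean, square them, sum them, divide by the number of observations minus 1, and take the square root. The data are hypothetical and the function names are illustrative assumptions.

```python
import math

# Hypothetical weights in pounds (illustrative values, not the famuss data).
weights = [120.0, 135.0, 150.0, 165.0, 210.0]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def sample_sd(values):
    """Square root of the sample variance, in the same units as the data."""
    return math.sqrt(sample_variance(values))


print(sample_variance(weights))
print(sample_sd(weights))
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which use the same $n - 1$ divisor.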
Measures of Spread: Percentiles/Quartiles
The $p^{th}$ percentile is the observation such that $p$% of the remaining observations fall below this observation.
- The first quartile ($Q_1$) is the $25^{th}$ percentile.
- The second quartile ($Q_2$), i.e., the median, is the $50^{th}$ percentile.
- The third quartile ($Q_3$) is the $75^{th}$ percentile.
Measures of Spread: Percentiles/Quartiles
The interquartile range (IQR) is the distance between the third and first quartiles:
$$\text{IQR} = Q_3 - Q_1$$
In the `famuss` dataset, the IQR for the variable `weight` is 42 pounds.
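A short sketch of quartiles and the IQR using Python's standard `statistics.quantiles`. The data are hypothetical. Software packages use slightly different interpolation rules for percentiles, so the default method here is one convention among several and will not necessarily reproduce values computed by other tools.

```python
import statistics

# Hypothetical weights in pounds (illustrative values, not the famuss data).
weights = [120, 135, 150, 165, 180, 195, 210]

# n=4 splits the data into four equal-probability groups,
# returning the three cut points Q1, Q2 (the median), and Q3.
q1, q2, q3 = statistics.quantiles(weights, n=4)

iqr = q3 - q1  # distance between the third and first quartiles

print(q1, q2, q3, iqr)
```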
Robust estimates
The median and IQR are often called robust estimates since they are less affected by extreme values than are means and standard deviations.
For distributions containing extreme observations, the median and IQR provide a more accurate sense of center and spread.
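To illustrate robustness, the sketch below appends one extreme value to a hypothetical sample and compares how far the mean and the median move. The numbers are made up for illustration.

```python
import statistics

# Hypothetical weights in pounds (illustrative values, not the famuss data).
weights = [120, 135, 150, 165, 180, 195, 210]
with_extreme = weights + [600]  # add a single extreme observation

# How much does each measure of center change?
mean_shift = statistics.mean(with_extreme) - statistics.mean(weights)
median_shift = statistics.median(with_extreme) - statistics.median(weights)

print(f"mean shifts by {mean_shift}, median shifts by {median_shift}")
```

In this example the mean moves several times farther than the median, which is the sense in which the median is the more robust estimate.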
Histograms
Histograms show important features of the shape of a distribution:
- Symmetry, or lack of it (skew)
- Minimum and maximum values
- Regions of high frequency (modes)
Histograms are less well suited to:
- Displaying the median or quartiles
- Showing subtle skewing
- Identifying extreme values
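A histogram is built by counting how many observations fall into each bin of equal width. The sketch below does that counting and prints a rough text histogram; the data, the bin width of 20, and the name `bin_counts` are all illustrative assumptions.

```python
# Hypothetical weights in pounds (illustrative values, not the famuss data).
weights = [112, 118, 125, 131, 134, 138, 142, 145,
           147, 151, 153, 158, 164, 171, 189, 236]
bin_width = 20


def bin_counts(values, width):
    """Count the values in each bin [start, start + width), keeping empty bins."""
    starts = [(v // width) * width for v in values]
    counts = {s: 0 for s in range(min(starts), max(starts) + width, width)}
    for s in starts:
        counts[s] += 1
    return counts


# One row per bin; the tallest row marks the region of highest frequency.
for start, count in bin_counts(weights, bin_width).items():
    print(f"{start}-{start + bin_width - 1} | {'*' * count}")
```

The printed rows show the overall shape, but the median and quartiles cannot be read off exactly, which is one of the limitations listed above.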
Box-and-whisker plots
Vu and Harrington (2020), Figure 1.20 (frog data)
Boxplots
A boxplot indicates the first, second, and third quartiles of a distribution.
It also identifies potential outliers – observations far from the center
- Large outliers are > $Q_3 + 1.5 \times \text{IQR}$
- Small outliers are < $Q_1 - 1.5 \times \text{IQR}$
On a boxplot
- The rectangle extends from the first quartile to the third quartile, with a line at the second quartile (median).
- Whiskers capture data that fall between $Q_1 - 1.5 \times \text{IQR}$ and $Q_3 + 1.5 \times \text{IQR}$, and they must end at data points.
- Potential outliers are drawn as individual dots beyond the whiskers.
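The boxplot rules above can be sketched numerically: compute the quartiles, place fences at 1.5 times the IQR beyond them, flag observations outside the fences, and end each whisker at the most extreme data point inside the fences. The data are hypothetical, and the quartile method is the default of `statistics.quantiles`, which may differ slightly from other software.

```python
import statistics

# Hypothetical values with one unusually large observation.
values = [120, 135, 150, 165, 180, 195, 210, 400]

q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Fences: observations beyond these are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Whiskers end at actual data points inside the fences, not at the fences.
inside = [v for v in values if lower_fence <= v <= upper_fence]
whiskers = (min(inside), max(inside))

print(outliers, whiskers)
```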
Relationships between two variables
Summarizing relationships between two variables
Approaches for summarizing relationships between two variables vary depending on variable types…
- Two numerical variables
- Two categorical variables
- One numerical variable and one categorical variable
Two numerical variables
Two variables $x$ and $y$ are
- positively associated if $y$ increases as $x$ increases.
- negatively associated if $y$ decreases as $x$ increases.
Height and weight are positively associated.
Two numerical variables
Correlation is a numerical summary that measures the strength of a linear relationship between two variables.
- Introduced in Vu and Harrington (2020) Section 1.6.1; details in Ch. 6 (on regression).
- The correlation coefficient $r$ takes on values between -1 and 1.
- The closer $r$ is to $\pm 1$, the stronger the linear association.
In the `famuss` dataset, the correlation between `height` and `weight` is 0.5309.
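A minimal sketch of the Pearson correlation coefficient, computed from deviations about each variable's mean. The heights and weights are hypothetical values, not the `famuss` data, and `correlation` is an illustrative function name.

```python
import math


def correlation(xs, ys):
    """Pearson correlation between paired observations xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical heights (inches) and weights (pounds).
heights = [62, 65, 68, 70, 73]
weights = [120, 150, 145, 170, 190]

r = correlation(heights, weights)
print(round(r, 4))
```

A positive value of `r` matches the positive association between height and weight; points lying exactly on an increasing straight line would give a value of 1.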
Two categorical variables
Relative risk (RR) is one way of summarizing data presented in a two-way table of study outcome by participant group.
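As a rough illustration, relative risk is the proportion of participants with the outcome in one group divided by the corresponding proportion in a comparison group. The two-way table below is entirely hypothetical, and the group labels and function name are assumptions made for this sketch.

```python
# Hypothetical two-way table: counts of outcome by participant group.
table = {
    "treatment": {"outcome": 30, "no_outcome": 70},
    "control": {"outcome": 15, "no_outcome": 85},
}


def risk(row):
    """Proportion of a group's participants who experienced the outcome."""
    return row["outcome"] / (row["outcome"] + row["no_outcome"])


def relative_risk(tbl, group, reference):
    """Risk in `group` divided by risk in `reference`."""
    return risk(tbl[group]) / risk(tbl[reference])


rr = relative_risk(table, "treatment", "control")
print(rr)
```

An RR of 1 would indicate equal risk in the two groups; values above or below 1 indicate higher or lower risk relative to the reference group.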
A numerical variable and categorical variable
FAMuSS was designed to study the relationship between genotype at the location r577x in the gene ACTN3 and muscle strength.
Muscle strength was assessed by the percent change in non-dominant arm strength after resistance training (`ndrm.ch`).
What visualization would be a good choice to make this comparison?
A numerical variable and categorical variable