Visualizing distributions 1

Introduction

In this worksheet, we will discuss how to display distributions of data values using histograms and density plots.

We will be using the R package tidyverse, which includes ggplot() and related functions.

# load required library
library(tidyverse)

The dataset we will be working with contains information about passengers on the Titanic, including their age, sex, the class in which they traveled on the ship, and whether they survived or not:

titanic

Histograms

We start by drawing a histogram of the passenger ages (column age in the dataset titanic). We can do this in ggplot with the geom geom_histogram(). Try this for yourself.

ggplot(titanic, aes(___)) +
  ___

ggplot(titanic, aes(age)) +
  geom_histogram()

If you don’t specify how many bins you want or how wide you want them to be, geom_histogram() will make an automatic choice, but it will also give you a warning that the automatic choice is probably not good. Make a better choice by setting the binwidth and center parameters. Try the values 5 and 2.5, respectively.

ggplot(titanic, aes(age)) +
  geom_histogram(___)

ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = ___, center = ___)

ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5)

Try a few more different binwidths, e.g. 1 or 10. What are good values for center that go with these choices?

Density plots

Density plots are a good alternative to histograms. We can create them with geom_density(). Try this out by drawing a density plot of the passenger ages (column age in the dataset titanic). Also, by default geom_density() does not draw a filled area under the density line. We can change this by setting an explicit fill color, e.g. “cornsilk”.

ggplot(titanic, aes(___)) +
  ___

ggplot(titanic, aes(age)) +
  geom_density(___)

ggplot(titanic, aes(age)) +
  geom_density(fill = "cornsilk")

Just like for histograms, there are options to modify how much detail a density plot shows. A small binwidth in a histogram corresponds to a low bandwidth (bw) in a density plot and similarly a large binwidth corresponds to a high bandwidth. In addition, you can change the kernel, e.g. kernel = "rectangular" or kernel = "triangular". Try this out by using a bandwidth of 1 and a triangular kernel.

ggplot(titanic, aes(age)) +
  geom_density(fill = "cornsilk", ___)

ggplot(titanic, aes(age)) +
  geom_density(fill = "cornsilk", bw = ___, kernel = ___)

ggplot(titanic, aes(age)) +
  geom_density(fill = "cornsilk", bw = 1, kernel = "triangular")

Try a few more different bandwidth and kernel choices, e.g. 0.1 or 10, or rectangular or gaussian kernels. How does the density plot depend on these choices?

Small multiples (facets)

We can also draw separate histograms for passengers meeting different criteria, for example for passengers traveling in the different classes. Whenever we draw multiple plot panels containing the same type of plot but for different subsets of the data, we speak of “small multiples”. In ggplot, we generate small multiples with the function facet_wrap(). The function facet_wrap() takes as its argument a list of data columns to subdivide the data by. This list is provided via the vars() function. For example, vars(class) means draw a separate panel for each class, vars(survived) means draw a separate panel for each survival status, and vars(class, survived) means draw a separate panel for each combination of class and survival status.

As an example, the following code generates small multiple histograms by class:

ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(class))

Now use the same principle to draw small multiple histograms by survival status.

ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  ___

ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(___))

ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(survived))

Now make a plot that breaks down the data by both survival status and class.

ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(survived, ___))

ggplot(titanic, aes(age)) +
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(survived, class))

Finally, do the same but drawing density plots rather than histograms.

ggplot(titanic, aes(age)) +
  ___ +
  facet_wrap(vars(survived, class))

ggplot(titanic, aes(age)) +
  geom_density(fill = "cornsilk", bw = ___) +
  facet_wrap(vars(survived, class))

ggplot(titanic, aes(age)) +
  geom_density(fill = "cornsilk", bw = 2) +
  facet_wrap(vars(survived, class))

Notice the difference between this plot and the corresponding histogram plot. Histograms show absolute counts whereas the density plots are normalized so that the area under the curve is 1. As a consequence, the density plot does not provide an accurate representation of the number of passengers in each grouping. This can be changed. See next section.

Manipulating stats

You may have noticed that neither geom_histogram() nor geom_density() require you to define an aesthetic mapping for the y variable. This is because under the hood, a statistical transformation (called a “stat”) calculates the histogram or density from the raw data and then sets the appropriate y mapping.

Sometimes it can be useful to access or modify this mapping directly. We tell ggplot that we want to map a value calculated by a stat, rather than one that is in the original data, by writing after_stat(...) inside the aes() function. So, for example, the default y mapping for geom_density() is y = after_stat(density). An alternative mapping, y = after_stat(count) scales densities by the number of points in each grouping, thus producing something more similar to a histogram. You can see the difference between these two choices in the following two examples:

# use the default y mapping
ggplot(titanic, aes(age, y = after_stat(density))) + 
  geom_density(fill = "cornsilk", bw = 2) +
  facet_wrap(vars(survived, class))

# use a modified y mapping
ggplot(titanic, aes(age, y = after_stat(count))) + 
  geom_density(fill = "cornsilk", bw = 2) +
  facet_wrap(vars(survived, class))

The same options of after_stat(count) and after_stat(density) exist for geom_histogram() as well. Try this by making histograms that use the calculated density for the y value.

ggplot(titanic, aes(age, y = ___)) + 
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(survived, class))

ggplot(titanic, aes(age, y = after_stat(___))) + 
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(survived, class))

ggplot(titanic, aes(age, y = after_stat(density))) + 
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(survived, class))

Now, instead, try mapping the calculated counts onto the fill aesthetic.

ggplot(titanic, aes(age, fill = ___)) + 
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(survived, class))

ggplot(titanic, aes(age, fill = after_stat(___))) + 
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(survived, class))

ggplot(titanic, aes(age, fill = after_stat(count))) + 
  geom_histogram(binwidth = 5, center = 2.5) +
  facet_wrap(vars(survived, class))

Finally, we can make our own combination of geoms and stats, by setting the stat argument of a geom, e.g. stat = "density" to use the density stat. To try this out, draw a density plot using geom_point(), and also map the calculated density values onto the point color.

ggplot(titanic, aes(age, ___)) +
  geom_point(___) +
  facet_wrap(vars(survived, class))

ggplot(titanic, aes(age, color = ___)) + 
  geom_point(stat = "density") +
  facet_wrap(vars(survived, class))