Effective Data Visualisation with R
The {ggplot2} Package

Paul Murrell
The University of Auckland

Review

A data visualisation is made up of data symbols and labels.
Data symbols represent data values.
Our goal is to learn how to choose an effective mapping from data values to data symbols.

How To Draw

What to draw:
Many sections will focus primarily on the choice of data symbols to use in a data visualisation.
How to draw:
In this section we will focus on writing R code to create data visualisations, using the {ggplot2} package.
Later sections will supplement advice on what to draw with further information about {ggplot2}.

{ggplot2}

Geoms
Aesthetics
Scales
Stats

A Grammar of Graphics

{ggplot2} is built on a set of basic building blocks:
- The data to visualise, as a data frame.
- A geom to use as a data symbol
  (e.g., a point or a bar).
- Mappings from data to aesthetics
  (e.g., map the number of crimes to the height of a bar).

ggplot2 Code

library(ggplot2)

The code for a {ggplot2} data visualisation has the following format:


ggplot(data) +
    geom_shape(aes(mappings))

ggplot2 Code

For simple demonstrations, we will work with the total number of youth offences from 2010 to 2020, categorised by the level of crime severity.

head(crimeLevelTotal)

# A tibble: 5 × 3
  level       total   prop
  <ord>       <int>  <dbl>
1 Low         24242 0.283 
2 Low-Medium  21513 0.251 
3 Medium      13823 0.161 
4 Medium-High 18315 0.214 
5 High         7793 0.0909
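
The crimeLevelTotal data frame is not supplied with these slides. The following is a minimal sketch that rebuilds it from the printed values so that the later examples can be run. It assumes that a plain data frame is an acceptable stand-in for the original tibble and that prop is each level's share of the overall total, which is consistent with the printed output.

```r
# Rebuild crimeLevelTotal from the values printed above.
# Assumption: a plain data frame stands in for the original tibble.
lv <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(lv, levels = lv, ordered = TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Assumption: prop is the proportion of all offences at each level.
crimeLevelTotal$prop <- crimeLevelTotal$total / sum(crimeLevelTotal$total)

print(crimeLevelTotal)
```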

ggplot2 Code

In the following code:
- the data is crimeLevelTotal.
- the geom is a column (or bar).
- level maps to the x position of the columns and total maps to the y position of (the tops of) the columns.
```
ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total))
```
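
One way to check what the x and y mappings produce is to ask {ggplot2} for the values it computed for each column. The sketch below rebuilds the data from the printed table (an assumption, since the original object is not supplied) and uses ggplot_build() to look at the computed positions.

```r
library(ggplot2)

# Assumption: data rebuilt from the values printed earlier.
lv <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(lv, levels = lv, ordered = TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

p <- ggplot(crimeLevelTotal) +
    geom_col(aes(x = level, y = total))

# The first (and only) layer holds one row per column.
built <- ggplot_build(p)$data[[1]]

# x is the centre of each column; ymin and ymax are its bottom and top.
print(built[, c("x", "xmin", "xmax", "ymin", "ymax")])
```

Each level is placed at an integer x position, and the top of each column (ymax) is the corresponding total.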

WARNING

Many code examples have been deliberately simplified.
Some code examples will not exactly correspond to the details of the accompanying data visualisation.
In some cases, warning messages have also been suppressed.

Geoms

We choose a geom as the shape to draw for the data symbols.

Geoms

We can draw rectangles with geom_col().

ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total))

Geoms

We can draw points with geom_point().

ggplot(crimeLevelTotal) +
    geom_point(aes(x=level, y=total))

We can draw lines with geom_line().

ggplot(crimeLevelTotal) +
    geom_line(aes(x=level, y=total, group=1))

We need group=1 to tell {ggplot2} that all of the positions belong to the same line.
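
The effect of group=1 can be seen in the group that {ggplot2} assigns to each row. This is a sketch using data rebuilt from the printed table (an assumption); it compares the computed groups with and without the group mapping.

```r
library(ggplot2)

# Assumption: data rebuilt from the values printed earlier.
lv <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(lv, levels = lv, ordered = TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

withGroup <- ggplot(crimeLevelTotal) +
    geom_line(aes(x = level, y = total, group = 1))
withoutGroup <- ggplot(crimeLevelTotal) +
    geom_line(aes(x = level, y = total))

# Number of distinct groups in each version
nWith <- length(unique(ggplot_build(withGroup)$data[[1]]$group))
nWithout <- length(unique(ggplot_build(withoutGroup)$data[[1]]$group))
print(c(with = nWith, without = nWithout))
```

Without the group mapping, each level of the categorical x variable forms its own group, so each group has only one position and no line can be drawn between positions.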

Geoms

Shape	Functions
Rectangle	`geom_col()`, `geom_bar()`, `geom_tile()`, `geom_rect()`
Line	`geom_line()`, `geom_path()`, `geom_segment()`, `geom_curve()`
Data point	`geom_point()`, `geom_jitter()`
Text	`geom_text()`, `geom_label()`
Polygon	`geom_polygon()`
Bitmap	`geom_raster()`

Aesthetics

An aesthetic is something that affects how a geom is drawn.
- Where do we draw the geom?
- What size is it?
- What colour is it?
An aesthetic mapping specifies that a data variable controls how a geom is drawn.

Aesthetics

The x and y aesthetics map data values to positions.

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y=level))

Aesthetics

The size aesthetic maps data values to size.

ggplot(crimeLevelTotal) +
    geom_point(aes(x="", y=level, size=total))

Aesthetics

The shape aesthetic maps data values to shape.

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", shape=level))

Aesthetics

The colour aesthetic maps data values to colour.

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", colour=level))

Aesthetics

The fill aesthetic maps data values to fill colour.
Only point shapes 21 to 25 have a fill, so the shape is set to 21 here.

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", fill=level),
               shape=21)

Aesthetics

Different geoms can have different aesthetics.
- Only point-like geoms, such as geom_point(), have a shape aesthetic.
The same aesthetic can mean different things for different geoms.
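
As a rough check of which aesthetics a geom understands, the underlying geom objects can be queried directly. This is only a sketch: it relies on the exported objects GeomPoint, GeomCol and GeomLine and their aesthetics() method, and the help page for each geom remains the documented place to look.

```r
library(ggplot2)

# All aesthetics understood by geom_point()
print(GeomPoint$aesthetics())

# Does each geom have a shape aesthetic?
pointHasShape <- "shape" %in% GeomPoint$aesthetics()
colHasShape <- "shape" %in% GeomCol$aesthetics()

# Lines have a linetype aesthetic instead
lineHasLinetype <- "linetype" %in% GeomLine$aesthetics()

print(c(point = pointHasShape, col = colHasShape))
```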

Aesthetics

The x and y aesthetics map data values to positions.
For geom_line() that means x- and y-locations.

ggplot(crimeLevelTotal) +
    geom_line(aes(x=level, y=total, group=1))

Aesthetics

The x and y aesthetics map data values to positions.
For geom_col() that means the x-centre and y-top.

ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total))

Aesthetics

Aesthetics outside of aes() are settings (not mappings).

ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total), 
             fill="red")
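
A common slip is to put a constant inside aes(). The sketch below, using data rebuilt from the printed table (an assumption), compares the fill colours that {ggplot2} computes for a setting and for a constant mapping.

```r
library(ggplot2)

# Assumption: data rebuilt from the values printed earlier.
lv <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(lv, levels = lv, ordered = TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Setting: fill is outside aes(), so every column is drawn in red.
setting <- ggplot(crimeLevelTotal) +
    geom_col(aes(x = level, y = total), fill = "red")

# Mapping: "red" is treated as a data value and passed through a
# colour scale, so the columns get the first default palette colour.
mapping <- ggplot(crimeLevelTotal) +
    geom_col(aes(x = level, y = total, fill = "red"))

settingFill <- unique(ggplot_build(setting)$data[[1]]$fill)
mappingFill <- unique(ggplot_build(mapping)$data[[1]]$fill)
print(settingFill)
print(mappingFill)
```

The mapped version also produces a legend with a single entry labelled "red", which is usually a sign that a setting was intended.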

Aesthetics

Some aesthetics are used more for settings.

ggplot(crimeLevelTotal) +
    geom_line(aes(x=level, y=total, group=1), 
              linewidth=3, linetype="dashed")

Aesthetic	Meaning	Aesthetic	Meaning
`x`	Position (x)	`y`	Position (y)
`width`	Length (x)	`height`	Length (y)
`colour`	Border colour	`fill`	Fill colour
`alpha`	Level of transparency
`linetype`	Line style	`linewidth`	Line width
`shape`	Shape of points
`size`	Size of points and text
`label`	Text label
`family`	Font family	`fontface`	Font face
`hjust`	Justification (x)	`vjust`	Justification (y)
`lineheight`	Interline spacing
`group`	Group membership

Scales

Scales control how data values are mapped to aesthetic values.
{ggplot2} will provide a default scale, but we can change the defaults using scale functions.
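
To illustrate the default scales, the following sketch adds the position scales explicitly, with no arguments, and checks that the computed values are unchanged. The data is rebuilt from the printed table (an assumption).

```r
library(ggplot2)

# Assumption: data rebuilt from the values printed earlier.
lv <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(lv, levels = lv, ordered = TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Default scales are chosen from the type of data mapped to x and y.
implicit <- ggplot(crimeLevelTotal) +
    geom_col(aes(x = level, y = total))

# The same plot with the scales written out and left at their defaults.
explicit <- implicit +
    scale_x_discrete() +
    scale_y_continuous()

sameValues <- all.equal(ggplot_build(implicit)$data[[1]],
                        ggplot_build(explicit)$data[[1]])
print(sameValues)
```

Changing a default then only requires adding arguments to the relevant scale function.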

Scales

scale_x_continuous() controls the axis range and the axis labelling when quantitative data is mapped to x.
scale_x_discrete() controls the axis range and labelling when qualitative data is mapped to x.
There are also scale_y_continuous() and scale_y_discrete().

Scales

ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total))

Scales

ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total)) +
    scale_y_continuous(limits=c(0, 25000), expand=expansion(0)) +
    scale_x_discrete(labels=c("Low", "Low-Med", "Med", 
                              "Med-High", "High"))

Scales

scale_colour_continuous() controls the colour gradient when quantitative data is mapped to colour.
scale_colour_discrete() controls the palette of colours when qualitative data is mapped to colour.
There are also scale_fill_continuous() and scale_fill_discrete().
We will talk more about selecting colours later on.

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=level, y="", colour=total))

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=level, y="", colour=total)) + 
    scale_colour_continuous(low="yellow", high="red")

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", colour=level))

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", colour=level)) +
    scale_colour_discrete(type=scale_colour_hue)

Scales

scale_size() controls the range of sizes used (in millimetres).
There are “manual” versions of scales, e.g., scale_shape_manual() and scale_linetype_manual(), that allow a set of explicit values to control the aesthetic mappings.

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=level, y="", size=total))

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=level, y="", size=total)) + 
    scale_size(range=c(0, 9))

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", shape=level))

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", shape=level)) + 
    scale_shape_manual(values=1:5)

Stats

Some geoms do not map the raw data.
Some geoms apply a stat to transform the raw data to a set of data statistics.
The data statistics are then mapped to the aesthetics of a geom.

Stats

The offenders data frame contains information on individual offenders, including age group.

head(offenders["Age.Group"])

  Age.Group
1     15-19
2     15-19
3     20-24
4     20-24
5     20-24
6     20-24

Stats

In order to draw a bar plot of the number of offenders per age group, we need a table of counts.

table(offenders$Age.Group)


          0-4           5-9         10-14         15-19         20-24 
            1            91          4036         11648         10920 
        25-29         30-34         35-39         40-44         45-49 
        11879         11468          8980          7096          5642 
        50-54         55-59         60-64         65-69         70-74 
         4446          2983          1853           984           616 
        75-79 80yearsorover  NotSpecified 
          268           252             7

geom_bar() takes individual raw data and uses stat_count() to generate a table of counts.
```
ggplot(offenders) +
    geom_bar(aes(y=Age.Group))
```

Stats

The unique categories map to the bar position and the counts map to the bar lengths.
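
The offenders data frame is not supplied here, so the sketch below uses only the six rows shown by head() above as a small stand-in. It compares the counts from table() with the counts that geom_bar() computes via stat_count().

```r
library(ggplot2)

# Assumption: a six-row stand-in built from the head() output above.
offendersHead <- data.frame(
    Age.Group = c("15-19", "15-19", "20-24", "20-24", "20-24", "20-24")
)

# Counts computed by hand
counts <- table(offendersHead$Age.Group)

# Counts computed by the stat behind geom_bar()
p <- ggplot(offendersHead) +
    geom_bar(aes(y = Age.Group))
stats <- ggplot_build(p)$data[[1]]

# One row per unique category, with the count for each
print(stats[, c("y", "count")])
```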

Stats

- geom_histogram() uses stat_bin().
  This bins raw quantitative data and maps the resulting bin boundaries and counts to the position and length of bars.
- geom_density() uses stat_density().
  This calculates a density estimate from raw quantitative data and maps the resulting density to the position of a line.
- geom_smooth() uses stat_smooth().
  This takes raw quantitative x and y data, calculates smoothed y values, and maps those to the position of a line.
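
The pairing of geoms and stats can also be checked in code. This is a sketch that relies on the stat component of the layer object that each geom function returns; statOf() is a hypothetical helper written for this example, not part of {ggplot2}.

```r
library(ggplot2)

# Hypothetical helper: name of the stat used by a layer
statOf <- function(layer) class(layer$stat)[1]

stats <- c(
    geom_col = statOf(geom_col()),
    geom_bar = statOf(geom_bar()),
    geom_histogram = statOf(geom_histogram()),
    geom_density = statOf(geom_density()),
    geom_smooth = statOf(geom_smooth())
)
print(stats)
```

geom_col() reports an identity stat, which leaves the raw data unchanged, while the other geoms report the stats described above.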

Rugby World Cup

The RWCperGame data frame contains measures of performance at the Rugby World Cup of 2023 for different countries, plus the hemisphere that each country is from.

head(tibble(RWCperGame))

# A tibble: 6 × 11
  country   hemisphere yellowcards redcards cleanbreaks
  <chr>     <fct>            <dbl>    <dbl>       <dbl>
1 Namibia   South             1        0.5         2.5 
2 Romania   North             1.25     0           2.75
3 Chile     South             1.25     0           3.75
4 Samoa     South             1.25     0.25        3.75
5 Australia South             0.5      0           5.25
6 Georgia   North             0.5      0           5.25
# ℹ 6 more variables: tackles <dbl>, points <dbl>,
#   conversions <dbl>, offloads <dbl>, tries <dbl>,
#   runs <dbl>

Stats

ggplot(RWCperGame, aes(x = cleanbreaks, y = points)) + 
    geom_point() +
    geom_smooth(method="lm")
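
To see what the stat does here, the smoothed values can be compared with a linear model fitted by hand. The sketch below uses only the six rows printed above as a stand-in for the full RWCperGame data frame (an assumption), so the fitted line will differ from the one for the complete data.

```r
library(ggplot2)

# Assumption: only the six printed rows, not the full data frame.
rwcHead <- data.frame(
    cleanbreaks = c(2.50, 2.75, 3.75, 3.75, 5.25, 5.25),
    points = c(9.25, 8.00, 6.75, 23.00, 22.50, 16.00)
)

p <- ggplot(rwcHead, aes(x = cleanbreaks, y = points)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is the smooth: a grid of x values with fitted y values.
smooth <- ggplot_build(p)$data[[2]]

# Fit the same linear model directly and predict at the same x values.
fit <- lm(points ~ cleanbreaks, data = rwcHead)
expected <- predict(fit, newdata = data.frame(cleanbreaks = smooth$x))

sameLine <- all.equal(unname(expected), smooth$y)
print(sameLine)
```

The line is drawn from the fitted values on this grid, not from the raw points values.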

Stats

Note that if a data symbol is produced from data statistics, a viewer can only map the symbol back to those statistics, not to the raw data values.

{ggplot2} Help

The help pages for individual geoms describe:
- the aesthetics for a geom.
- the stat for a geom.

Summary

Exercises

Exercise

We will produce several data visualisations using the rates data frame.

head(rates)

  Age Year Rate
1  10 2010   66
2  11 2010  120
3  12 2010  211
4  13 2010  392
6  10 2011   62
7  11 2011  107

Exercise

Identify the geoms, aesthetic mappings, and scales employed in this data visualisation.

Write {ggplot2} code to produce the data visualisation (ignoring labels).

Exercise

Identify the geoms, aesthetic mappings, and scales employed in this data visualisation.

Write {ggplot2} code to produce the data visualisation (ignoring labels).

Exercise

The next data visualisation uses the RWCperGame data frame.

head(RWCperGame)

     country hemisphere yellowcards redcards cleanbreaks tackles points
11   Namibia      South        1.00     0.50        2.50  102.00   9.25
14   Romania      North        1.25     0.00        2.75  142.50   8.00
3      Chile      South        1.25     0.00        3.75  132.25   6.75
15     Samoa      South        1.25     0.25        3.75  109.50  23.00
2  Australia      South        0.50     0.00        5.25  108.25  22.50
7    Georgia      North        0.50     0.00        5.25  149.75  16.00
   conversions offloads tries   runs
11        0.50     2.25  0.75  92.00
14        0.75     1.50  1.00  81.00
3         0.50     5.25  1.00 102.25
15        2.00     9.25  2.75 102.50
2         1.75     7.75  2.75 110.75
7         1.00     8.00  1.75 114.00

Exercise

What values are being mapped in the data visualisation below?

What stat is being used?
