Effective Data Visualisation with R
The {ggplot2} Package

Paul Murrell
The University of Auckland

Review

  • A data visualisation is made up of data symbols and labels.

  • Data symbols represent data values.

  • Our goal is to learn how to choose an effective mapping from data values to data symbols.

How To Draw

  • What to draw:
    Many sections will focus primarily on the choice of data symbols to use in a data visualisation.

  • How to draw:
    In this section we will focus on writing R code to create data visualisations, using the {ggplot2} package.

  • Later sections will supplement advice on what to draw with further information about {ggplot2}.

{ggplot2}

  • Geoms
  • Aesthetics
  • Scales
  • Stats

A Grammar of Graphics

  • {ggplot2} is built on a set of basic building blocks:

    • The data to visualise, as a data frame.
    • A geom to use as a data symbol
      (e.g., a point or a bar).
    • Mappings from data to aesthetics
      (e.g., map the number of crimes to the height of a bar).

ggplot2 Code

library(ggplot2)

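The code for a {ggplot2} data visualisation has the following format, where data, geom_shape, and mappings are placeholders for a data frame, a geom function, and aesthetic mappings: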


ggplot(data) +
    geom_shape(aes(mappings))

ggplot2 Code

  • For simple demonstrations, we will work with the total number of youth offences from 2010 to 2020, categorised by the level of crime severity.
head(crimeLevelTotal)
# A tibble: 5 × 3
  level       total   prop
  <ord>       <int>  <dbl>
1 Low         24242 0.283 
2 Low-Medium  21513 0.251 
3 Medium      13823 0.161 
4 Medium-High 18315 0.214 
5 High         7793 0.0909
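
To run the later examples without the course data, the printed rows can be typed in by hand. This is only a sketch that rebuilds the table from the output above; it assumes level is an ordered factor (as the <ord> column type suggests), and the prop values are the rounded ones shown.

```r
# Rebuild crimeLevelTotal from the printed output above.
# Assumption: the course supplies the real object; prop is rounded here.
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(severity, levels=severity, ordered=TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L),
    prop  = c(0.283, 0.251, 0.161, 0.214, 0.0909)
)
str(crimeLevelTotal)
```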

ggplot2 Code

  • In the following code:
    • the data is crimeLevelTotal.
    • the geom is a column (or bar).
    • level maps to the x position of the columns and total maps to the y position of (the tops of) the columns.
    ggplot(crimeLevelTotal) +
        geom_col(aes(x=level, y=total))


WARNING

  • Many code examples have been deliberately simplified.

  • Some code examples will not exactly correspond to the details of the accompanying data visualisation.

  • In some cases, warning messages have also been suppressed.

Geoms

  • We choose a geom as the shape to draw for the data symbols.

Geoms

  • We can draw rectangles with geom_col().
ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total))

Geoms

  • We can draw points with geom_point().
ggplot(crimeLevelTotal) +
    geom_point(aes(x=level, y=total))

  • We can draw lines with geom_line().
ggplot(crimeLevelTotal) +
    geom_line(aes(x=level, y=total, group=1))

  • We need group=1 to tell {ggplot2} that all of the positions belong to the same line.
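
Geoms can also be layered in one data visualisation. The sketch below draws both a line and points; it places the mappings in ggplot() so that both layers inherit them. The data frame is retyped from the printed output earlier, which is an assumption about the real crimeLevelTotal object.

```r
library(ggplot2)

# Data retyped from the printed table (assumption: matches the course object).
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(severity, levels=severity, ordered=TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Mappings given to ggplot() are shared by every layer that follows.
linePlusPoints <- ggplot(crimeLevelTotal, aes(x=level, y=total, group=1)) +
    geom_line() +
    geom_point()
linePlusPoints
```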

Geoms

Function        Shape         Function         Shape
geom_col()      Rectangle     geom_line()      Line
geom_bar()      Rectangle     geom_path()      Line
geom_tile()     Rectangle     geom_segment()   Line
geom_rect()     Rectangle     geom_curve()     Line
geom_point()    Data point    geom_text()      Text
geom_jitter()   Data point    geom_label()     Text
geom_polygon()  Polygon       geom_raster()    Bitmap
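
As one illustration of combining geoms from this table, the sketch below draws a line segment from zero up to each total and puts a point on the end. The choice of geom_segment() with geom_point() is ours, not the slides'; note that geom_segment() needs xend and yend as well as x and y. The data frame is retyped from the printed output earlier.

```r
library(ggplot2)

# Data retyped from the printed table (assumption: matches the course object).
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(severity, levels=severity, ordered=TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Each segment runs vertically from y=0 to y=total at its level.
segmentsAndPoints <- ggplot(crimeLevelTotal) +
    geom_segment(aes(x=level, xend=level, y=0, yend=total)) +
    geom_point(aes(x=level, y=total))
segmentsAndPoints
```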

Aesthetics

  • An aesthetic is something that affects how a geom is drawn.

    • Where do we draw the geom?
    • What size is it?
    • What colour is it?
  • An aesthetic mapping specifies that a data variable controls how a geom is drawn.

Aesthetics

  • The x and y aesthetics map data values to positions.
ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y=level))

Aesthetics

  • The size aesthetic maps data values to size.
ggplot(crimeLevelTotal) +
    geom_point(aes(x="", y=level, size=total))

Aesthetics

  • The shape aesthetic maps data values to shape.
ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", shape=level))

Aesthetics

  • The colour aesthetic maps data values to colour.
ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", colour=level))

Aesthetics

  • The fill aesthetic maps data values to fill colour.
ggplot(crimeLevelTotal) +
    # shape=21 is a fillable circle; the default point shape ignores fill
    geom_point(aes(x=total, y="", fill=level), shape=21)

Aesthetics

  • Different geoms can have different aesthetics.

    • For example, point geoms such as geom_point() have a shape aesthetic, but geom_col() and geom_line() do not.
  • The same aesthetic can mean different things for different geoms.

Aesthetics

  • The x and y aesthetics map data values to positions.
  • For geom_line() that means x- and y-locations.
ggplot(crimeLevelTotal) +
    geom_line(aes(x=level, y=total, group=1))

Aesthetics

  • The x and y aesthetics map data values to positions.
  • For geom_col() that means the x-centre and y-top.
ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total))

Aesthetics

  • Aesthetics outside of aes() are settings (not mappings).
ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total), 
             fill="red")
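
The difference between a setting and a mapping is easy to get wrong, so here is a small sketch of both. With fill="red" inside aes(), {ggplot2} treats "red" as a data value with one category and picks a colour from the default discrete palette (and adds a legend), rather than drawing red bars. The data frame is retyped from the printed output earlier; layer_data() is used only to inspect the fill values that were actually drawn.

```r
library(ggplot2)

# Data retyped from the printed table (assumption: matches the course object).
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(severity, levels=severity, ordered=TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Setting: outside aes(), so every column is literally red.
asSetting <- ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total), fill="red")

# Mapping: inside aes(), so "red" is a one-category variable
# that the fill scale converts to a palette colour.
asMapping <- ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total, fill="red"))

# Compare the fill values that each plot ends up using.
unique(layer_data(asSetting)$fill)
unique(layer_data(asMapping)$fill)
```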

Aesthetics

  • Some aesthetics are used more for settings.
ggplot(crimeLevelTotal) +
    geom_line(aes(x=level, y=total, group=1), 
              linewidth=3, linetype="dashed")

Aesthetic   Meaning                  Aesthetic   Meaning
x           Position (x)             y           Position (y)
width       Length (x)               height      Length (y)
colour      Border colour            fill        Fill colour
alpha       Level of transparency
linetype    Line style               linewidth   Line width
shape       Shape of points
size        Size of points and text
label       Text label
family      Font family              fontface    Font face
hjust       Justification (x)        vjust       Justification (y)
lineheight  Interline spacing
group       Group membership
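
A few of the aesthetics in this table can be tried together. The following sketch uses width, fill, colour, and alpha as settings on the columns, and maps total to the label aesthetic of geom_text(), with vjust as a setting to lift the text above each column. The particular values are arbitrary choices for illustration, and the data frame is retyped from the printed output earlier.

```r
library(ggplot2)

# Data retyped from the printed table (assumption: matches the course object).
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(severity, levels=severity, ordered=TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

labelledColumns <- ggplot(crimeLevelTotal, aes(x=level, y=total)) +
    # Settings: narrower, semi-transparent grey columns with a black border.
    geom_col(width=0.5, fill="grey70", colour="black", alpha=0.8) +
    # Mapping: the text drawn is the total; vjust moves it above the column top.
    geom_text(aes(label=total), vjust=-0.5)
labelledColumns
```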

Scales

  • Scales control how data values are mapped to aesthetic values.

  • {ggplot2} will provide a default scale, but we can change the defaults using scale functions.
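
To see what a scale actually does, we can look at the aesthetic values it produces. The sketch below maps total to size, replaces the default size scale with scale_size(range=c(1, 10)), and uses layer_data() to list the size computed for each data value. The range is an arbitrary choice and the data frame is retyped from the printed output earlier.

```r
library(ggplot2)

# Data retyped from the printed table (assumption: matches the course object).
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(severity, levels=severity, ordered=TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

sized <- ggplot(crimeLevelTotal) +
    geom_point(aes(x=level, y="", size=total)) +
    scale_size(range=c(1, 10))

# Data values on the left, the sizes chosen by the scale on the right.
mapped <- layer_data(sized)
data.frame(total=crimeLevelTotal$total, size=mapped$size)
```

The smallest total should receive the lower end of the range and the largest total the upper end, with the other totals in between.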

Scales

  • scale_x_continuous() controls the axis range and the axis labelling when quantitative data is mapped to x.

  • scale_x_discrete() controls the axis range and labelling when qualitative data is mapped to x.

  • There are also scale_y_continuous() and scale_y_discrete().

Scales

ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total))
 
 
 

Scales

ggplot(crimeLevelTotal) +
    geom_col(aes(x=level, y=total)) +
    scale_y_continuous(limits=c(0, 25000), expand=expansion(0)) +
    scale_x_discrete(labels=c("Low", "Low-Med", "Med", 
                              "Med-High", "High"))

Scales

  • scale_colour_continuous() controls the colour gradient when quantitative data is mapped to colour.

  • scale_colour_discrete() controls the palette of colours when qualitative data is mapped to colour.

  • There are also scale_fill_continuous() and scale_fill_discrete().

  • We will talk more about selecting colours later on.

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=level, y="", colour=total))
 

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=level, y="", colour=total)) + 
    scale_colour_continuous(low="yellow", high="red")

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", colour=level))
 

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", colour=level)) +
    scale_colour_discrete(type=scale_colour_hue)

Scales

  • scale_size() controls the range of sizes used (in millimetres).

  • There are “manual” versions of scales, e.g., scale_shape_manual() and scale_linetype_manual(), that allow a set of explicit values to control the aesthetic mappings.

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=level, y="", size=total))
 

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=level, y="", size=total)) + 
    scale_size(range=c(0, 9))

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", shape=level))
 

Scales

ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", shape=level)) + 
    scale_shape_manual(values=1:5)
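
Manual scales exist for colour as well. As a sketch, scale_colour_manual() can take a named vector so that each level is tied to an explicit colour; the colours chosen here are arbitrary, and the data frame is retyped from the printed output earlier.

```r
library(ggplot2)

# Data retyped from the printed table (assumption: matches the course object).
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(severity, levels=severity, ordered=TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Names match the levels, so the pairing does not depend on order.
severityColours <- c("Low"="grey80", "Low-Medium"="grey60", "Medium"="gold",
                     "Medium-High"="orange", "High"="red")

manualColours <- ggplot(crimeLevelTotal) +
    geom_point(aes(x=total, y="", colour=level)) +
    scale_colour_manual(values=severityColours)
manualColours
```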

Stats

  • Some geoms do not map the raw data.

  • Some geoms apply a stat to transform the raw data to a set of data statistics.

  • The data statistics are then mapped to the aesthetics of a geom.

Stats

  • The offenders data frame contains information on individual offenders, including age group.

    head(offenders["Age.Group"])
      Age.Group
    1     15-19
    2     15-19
    3     20-24
    4     20-24
    5     20-24
    6     20-24

Stats

  • In order to draw a bar plot of the number of offenders per age group, we need a table of counts.

    table(offenders$Age.Group)
    
              0-4           5-9         10-14         15-19         20-24 
                1            91          4036         11648         10920 
            25-29         30-34         35-39         40-44         45-49 
            11879         11468          8980          7096          5642 
            50-54         55-59         60-64         65-69         70-74 
             4446          2983          1853           984           616 
            75-79 80yearsorover  NotSpecified 
              268           252             7 

  • geom_bar() takes individual raw data and uses stat_count() to generate a table of counts.

    ggplot(offenders) +
        geom_bar(aes(y=Age.Group))

Stats

  • The unique categories map to the bar position and the counts map to the bar lengths.
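
The two routes to a bar plot can be compared directly. This sketch uses a tiny made-up data frame (ages is hypothetical, standing in for offenders): one plot lets geom_bar() count the raw rows, the other counts with table() first and draws the result with geom_col(). layer_data() shows the counts that stat_count() computed.

```r
library(ggplot2)

# Hypothetical raw data: one row per individual.
ages <- data.frame(
    Age.Group = factor(c("15-19", "15-19", "20-24", "20-24", "20-24", "25-29"))
)

# Route 1: geom_bar() applies stat_count() to the raw rows.
viaStat <- ggplot(ages) +
    geom_bar(aes(y=Age.Group))

# Route 2: count first, then map the counts directly with geom_col().
counts <- as.data.frame(table(ages$Age.Group), responseName="total")
viaTable <- ggplot(counts) +
    geom_col(aes(x=total, y=Var1))

# The statistic computed by route 1 matches the table from route 2.
layer_data(viaStat)$count
counts$total
```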

Stats

  • geom_histogram() uses stat_bin().

    This bins raw quantitative data and maps the resulting bin boundaries and counts to the position and length of bars.

  • geom_density() uses stat_density().

    This calculates a density estimate from raw quantitative data and maps the resulting density to the position of a line.

  • geom_smooth() uses stat_smooth().

    This takes raw quantitative x and y data, calculates smoothed y values, and maps those to the position of a line.
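
As a sketch of what stat_bin() hands to the geom, the code below bins six values and then lists the bin boundaries and counts with layer_data(). The six values are the points column from the printed RWCperGame rows; binwidth=5 and boundary=0 are arbitrary choices for illustration.

```r
library(ggplot2)

# Six values retyped from the printed RWCperGame rows (points per game).
scores <- data.frame(points = c(9.25, 8, 6.75, 23, 22.5, 16))

binned <- ggplot(scores) +
    geom_histogram(aes(x=points), binwidth=5, boundary=0)

# The data statistics: one row per bin, not one row per raw value.
bins <- layer_data(binned)[c("xmin", "xmax", "count")]
bins
```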

Rugby World Cup

  • The RWCperGame data frame contains measures of performance at the Rugby World Cup of 2023 for different countries, plus the hemisphere that each country is from.

    head(tibble(RWCperGame))
    # A tibble: 6 × 11
      country   hemisphere yellowcards redcards cleanbreaks
      <chr>     <fct>            <dbl>    <dbl>       <dbl>
    1 Namibia   South             1        0.5         2.5 
    2 Romania   North             1.25     0           2.75
    3 Chile     South             1.25     0           3.75
    4 Samoa     South             1.25     0.25        3.75
    5 Australia South             0.5      0           5.25
    6 Georgia   North             0.5      0           5.25
    # ℹ 6 more variables: tackles <dbl>, points <dbl>,
    #   conversions <dbl>, offloads <dbl>, tries <dbl>,
    #   runs <dbl>

Stats

ggplot(RWCperGame, aes(x = cleanbreaks, y = points)) + 
    geom_point() +
    geom_smooth(method="lm")

Stats

  • Note that if we produce a data symbol from data statistics, a viewer can only map the symbol back to the data statistics, not to the original raw data values.
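
To make the statistic behind geom_smooth(method="lm") concrete, the sketch below fits the same straight line with lm() and checks that it agrees with the values {ggplot2} draws. It uses only the six RWCperGame rows printed in these slides, so the fit is illustrative rather than the one from the full data; formula=y ~ x is written out to match the default for method="lm".

```r
library(ggplot2)

# Six rows retyped from the printed RWCperGame output.
rwc <- data.frame(
    cleanbreaks = c(2.5, 2.75, 3.75, 3.75, 5.25, 5.25),
    points      = c(9.25, 8, 6.75, 23, 22.5, 16)
)

smoothed <- ggplot(rwc, aes(x=cleanbreaks, y=points)) +
    geom_point() +
    geom_smooth(method="lm", formula=y ~ x)

# Layer 2 holds the statistics: fitted y values on a grid of x positions.
fitted <- layer_data(smoothed, 2)

# The same line, fitted by hand and evaluated at the same x positions.
fit <- lm(points ~ cleanbreaks, data=rwc)
manual <- predict(fit, newdata=data.frame(cleanbreaks=fitted$x))

all.equal(as.numeric(manual), fitted$y)
```

The line shows fitted values on a grid, so individual countries cannot be read back from it; only the points layer carries the raw data.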

{ggplot2} Help

Summary

Exercises

Exercise

  • We will produce several data visualisations using the rates data frame.

    head(rates)
      Age Year Rate
    1  10 2010   66
    2  11 2010  120
    3  12 2010  211
    4  13 2010  392
    6  10 2011   62
    7  11 2011  107

Exercise

  • Identify the geoms, aesthetic mappings, and scales employed in this data visualisation.

    Write {ggplot2} code to produce the data visualisation (ignoring labels).

Exercise

  • Identify the geoms, aesthetic mappings, and scales employed in this data visualisation.

    Write {ggplot2} code to produce the data visualisation (ignoring labels).

Exercise

  • The next data visualisation uses the RWCperGame data frame.

    head(RWCperGame)
         country hemisphere yellowcards redcards cleanbreaks tackles points
    11   Namibia      South        1.00     0.50        2.50  102.00   9.25
    14   Romania      North        1.25     0.00        2.75  142.50   8.00
    3      Chile      South        1.25     0.00        3.75  132.25   6.75
    15     Samoa      South        1.25     0.25        3.75  109.50  23.00
    2  Australia      South        0.50     0.00        5.25  108.25  22.50
    7    Georgia      North        0.50     0.00        5.25  149.75  16.00
       conversions offloads tries   runs
    11        0.50     2.25  0.75  92.00
    14        0.75     1.50  1.00  81.00
    3         0.50     5.25  1.00 102.25
    15        2.00     9.25  2.75 102.50
    2         1.75     7.75  2.75 110.75
    7         1.00     8.00  1.75 114.00

Exercise

  • What values are being mapped in the data visualisation below?

    What stat is being used?