Effective Data Visualisation with R
Introduction

Paul Murrell
The University of Auckland

Goals

  • What to draw:
    Provide a framework for reasoning about what makes an effective data visualisation.

  • How to draw:
    Learn to make effective data visualisations with {ggplot2}.

What to Draw

  • We want a framework so that we can make deliberate and rational decisions when designing a data visualisation.

  • When we create something bad, we want to know what we did wrong so that we can make it better (and not do the bad thing ever again).

  • When we create something good, we want to know what we did right so that we can do it again.

Assumptions & Limitations

  • This course focuses only on static graphics.

  • This course focuses on graphics for presentation; we have a message to convey.

  • We will only consider producing a data visualisation by writing code.

  • You should be familiar with R
    (or be able to make friends quickly).

Course Structure

  • 6 sessions
  • 2 parts per session (30 + 15)

Day 1

  1. Introduction and
    The visual system.

  2. {ggplot2} and
    Quantitative data.

  3. Qualitative data and
    Accuracy.

Day 2

  1. Combining features and
    Multiple features.

  2. Labels and
    Graphic design.

  3. Customisation and
    Review.

Introduction to Data Visualisation

  • What is data visualisation?

  • Terminology

  • Making use of the visual system

  • Mapping data to data symbols

What is Data Visualisation?

  • Data Visualisation is NOT Photography

An ultrasound of a beating heart

What is Data Visualisation?

  • Data Visualisation is NOT Scientific Visualisation

A 3D simulation of a beating heart

What is Data Visualisation?

  • Data Visualisation is NOT Art or Entertainment

A cute animated cartoon heart doing crunchies

What is Data Visualisation?

  • A Data Visualisation is an artificial, abstract image
    that uses geometric shapes to represent data.

A line plot of the (normal) electrical activity of a human heart (an ECG)

Terminology

  • What do a line plot, a bar plot, and a pie chart have in common?

A simple line plot.  What it shows is not important.

A simple bar plot.  What it shows is not important.

A donut plot (a pie chart with a hole in the middle).  What it shows is not important.

Terminology

  • Data symbols are the geometric shapes that represent data values (e.g., points and lines).

The line plot with the line highlighted (in red) to show that it is the data symbol in this case.  All other plot components are light grey.

The bar plot with the bars highlighted (in red) to show that they are the data symbols in this case.  All other plot components are light grey.

The donut plot with the donut segments highlighted (in red) to show that they are the data symbols in this case.  All other plot components are light grey.

Terminology

  • Guides are elements of a data visualisation that explain how data values are represented (e.g., axes and legends).

The line plot with the axis lines, tick marks, and tick mark labels highlighted (in red) to show that they are the guides in this case.  All other plot components are light grey.

The bar plot with the axis lines, grid lines, tick marks, and tick mark labels highlighted (in red) to show that they are the guides in this case.  All other plot components are light grey.

The donut plot with the legend boxes and the legend labels highlighted (in red) to show that they are the guides in this case.  All other plot components are light grey.

  • Guides are themselves mini data visualisations.

Terminology

  • Labels provide context and background information about a data visualisation (e.g., titles and captions).

The line plot with the axis titles and the plot title highlighted (in red) to show that they are the labels in this case.  All other plot components are light grey.

The bar plot with the axis titles and the plot title highlighted (in red) to show that they are the labels in this case.  All other plot components are light grey.

The donut plot with the legend title and the plot title highlighted (in red) to show that they are the labels in this case.  All other plot components are light grey.

  • Not all text is a label.

A data visualisation is made up of data symbols, guides, and labels.

  • We will spend most of our time on the task of selecting data symbols for a data visualisation.

  • We will also spend some time towards the end on the labelling of data visualisations.

Making Use of the Visual System

Example Data: NZ Youth Crime

  • The data frame rates contains the rate of youth crime (the number of youth offenders per 10,000 of population), for each age (from 10 to 13) in each year (from 2010 to 2020).
head(rates)
  Age Year Rate
1  10 2010   66
2  11 2010  120
3  12 2010  211
4  13 2010  392
6  10 2011   62
7  11 2011  107
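
The slides do not show how rates is constructed. As a minimal sketch, it can be rebuilt from the table of rates by age and year that is shown elsewhere in this section. The intermediate name wide is an assumption made for illustration, and the row names produced here may not match the head(rates) output above.

```r
# Rates per 10,000 population: one row per age (10 to 13),
# one column per year (2010 to 2020), copied from the table of rates.
wide <- rbind(
  c( 66,  62,  65,  43,  41,  39,  29,  33,  22,  24,  16),
  c(120, 107, 107,  77,  82,  60,  62,  58,  46,  50,  38),
  c(211, 189, 166, 154, 133, 115, 116, 115,  97,  88,  76),
  c(392, 356, 293, 263, 230, 216, 195, 203, 176, 158, 144)
)

# Long format: one row per combination of Age and Year.
# as.vector() reads the matrix column by column, so Age cycles fastest.
rates <- data.frame(
  Age  = rep(10:13, times = 11),
  Year = rep(2010:2020, each = 4),
  Rate = as.vector(wide)
)

head(rates)
```

The long layout (one row per observation) is the shape that {ggplot2} generally expects, which is why the sketch converts the wide table rather than keeping it as a matrix.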

Making Use of the Visual System

  • We use data visualisation to help answer questions.

  • Is the crime rate increasing or decreasing over time for each age group?

Age    2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020
10       66    62    65    43    41    39    29    33    22    24    16
11      120   107   107    77    82    60    62    58    46    50    38
12      211   189   166   154   133   115   116   115    97    88    76
13      392   356   293   263   230   216   195   203   176   158   144
Total   197   179   160   137   122   107   100   101    83    79    69

Making Use of the Visual System

  • The visual system is good at answering some questions.

  • Is the crime rate increasing or decreasing over time for each age group?

A line plot of youth crime rates in New Zealand over time.  Time ranges from 2010 to 2020 and number of offenders (per 10,000) ranges from almost zero to almost 400.  There is a separate line for each age, from 10 to 13.  All lines decrease nearly monotonically, with the higher lines decreasing faster than the lower lines.  The lines for older ages are all higher than the lines for younger ages.

Making Use of the Visual System

  • We use data visualisation to help answer questions.

  • Was the total crime rate higher in 2016 or 2017?

Age    2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020
10       66    62    65    43    41    39    29    33    22    24    16
11      120   107   107    77    82    60    62    58    46    50    38
12      211   189   166   154   133   115   116   115    97    88    76
13      392   356   293   263   230   216   195   203   176   158   144
Total   197   179   160   137   122   107   100   101    83    79    69

Making Use of the Visual System

  • Our visual system is less good at answering some questions.

  • Was the total crime rate higher in 2016 or 2017?

A line plot of youth crime rates in New Zealand over time.  Time ranges from 2010 to 2020 and number of offenders (per 10,000) ranges from almost zero to almost 400.  There is a separate line for each age, from 10 to 13.  All lines decrease nearly monotonically, with the higher lines decreasing faster than the lower lines.  The lines for older ages are all higher than the lines for younger ages.

Making Use of the Visual System

  • Our visual system is very good at answering some questions.

  • Was the total crime rate higher in 2016 or 2017?

A line plot of youth crime rates in New Zealand over time.  Time ranges from 2010 to 2020 and number of offenders (per 10,000) ranges from almost zero to almost 400.  There is only one line showing the total crime rate for all ages (from 10 to 13). The line decreases almost monotonically, except for a very small increase from 2016 to 2017.
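
A plot like the one described above might be drawn as follows. This is a sketch and not necessarily the code behind the slide: the data frame name totals and the axis settings are assumptions, and the values are the Total row of the table of rates. The last lines compute the year-on-year change directly, which is the calculation that the single-line plot lets the visual system avoid.

```r
library(ggplot2)

# Total youth crime rate per 10,000 population, from the Total row of the table.
totals <- data.frame(
  Year = 2010:2020,
  Rate = c(197, 179, 160, 137, 122, 107, 100, 101, 83, 79, 69)
)

# One line (plus points) showing only the total rate over time.
p <- ggplot(totals, aes(x = Year, y = Rate)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 2010:2020) +
  labs(y = "Offenders per 10,000")
print(p)

# Year-on-year change, named by the later year of each pair.
change <- diff(totals$Rate)
names(change) <- totals$Year[-1]
change["2017"]
```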

An effective data visualisation takes advantage of the strengths of the visual system.

  • We will spend a lot of our time exploring the strengths and weaknesses of the visual system.

  • Asking the visual system to perform some calculations is asking too much.

  • We need to choose a data visualisation based on the question, not just on the data.

Mapping Data to Data Symbols

  • Data symbols represent data values.

  • Our goal is to learn how to choose an effective mapping from data values to data symbols.

    A diagram with a box labelled “data values”, on the left, connected with an arrow to a box labelled “data symbols”, on the right.

Mapping Data to Data Symbols

    Symbol:     points (circles) and lines
    Mappings:   year  ->  x-location
                rate  ->  y-location
                age   ->  colour

This image embodies the mappings from data values to data symbols that are described in the text above the image.  A line plot of youth crime rates in New Zealand over time.  Time ranges from 2010 to 2020 and number of offenders (per 10,000) ranges from almost zero to almost 400.  There is a separate line for each age, from 10 to 13.  All lines decrease nearly monotonically, with the higher lines decreasing faster than the lower lines.  The lines for older ages are all higher than the lines for younger ages.
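
The mappings listed above translate fairly directly into {ggplot2} code. The following is a sketch under assumptions, not necessarily the code that produced the image: the rates data frame is rebuilt from the table of rates, and the axis and legend titles are illustrative.

```r
library(ggplot2)

# Rebuild the rates data (per 10,000 population) from the table of rates:
# one matrix row per age (10 to 13), one column per year (2010 to 2020).
wide <- rbind(
  c( 66,  62,  65,  43,  41,  39,  29,  33,  22,  24,  16),
  c(120, 107, 107,  77,  82,  60,  62,  58,  46,  50,  38),
  c(211, 189, 166, 154, 133, 115, 116, 115,  97,  88,  76),
  c(392, 356, 293, 263, 230, 216, 195, 203, 176, 158, 144)
)
rates <- data.frame(
  Age  = rep(10:13, times = 11),
  Year = rep(2010:2020, each = 4),
  Rate = as.vector(wide)
)

# year -> x-location, rate -> y-location, age -> colour.
# factor(Age) makes age a discrete variable, giving one line per age.
p <- ggplot(rates, aes(x = Year, y = Rate, colour = factor(Age))) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = 2010:2020) +
  labs(y = "Offenders per 10,000", colour = "Age")
print(p)
```

Each entry in the Mappings list becomes one argument to aes(), and each kind of data symbol becomes one geom layer.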

Mapping Data to Data Symbols

    Symbol:     bars (rectangles)
    Mappings:   year  ->  x-location
                rate  ->  length
                age   ->  colour (and x-location)

This image embodies the mappings from data values to data symbols that are described in the text above the image.  A bar plot of youth crime rates in New Zealand over time.  Time ranges from 2010 to 2020 and number of offenders (per 10,000) ranges from almost zero to almost 400.  For each year, there is a separate bar for each age, from 10 to 13.  The bars for the same age have the same colour.  For each age, the bars decrease nearly monotonically, with the bars for higher ages decreasing faster than the bars for lower ages.  The bars for older ages are all higher than the bars for younger ages.
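
A bar plot with these mappings might be sketched as follows, again under assumptions: the rates data frame is rebuilt from the table of rates, and the labels are illustrative. In {ggplot2} the interior colour of a bar is the fill aesthetic, and position = "dodge" places the bars for different ages side by side within each year, which corresponds to the "(and x-location)" part of the age mapping.

```r
library(ggplot2)

# Rebuild the rates data (per 10,000 population) from the table of rates:
# one matrix row per age (10 to 13), one column per year (2010 to 2020).
wide <- rbind(
  c( 66,  62,  65,  43,  41,  39,  29,  33,  22,  24,  16),
  c(120, 107, 107,  77,  82,  60,  62,  58,  46,  50,  38),
  c(211, 189, 166, 154, 133, 115, 116, 115,  97,  88,  76),
  c(392, 356, 293, 263, 230, 216, 195, 203, 176, 158, 144)
)
rates <- data.frame(
  Age  = rep(10:13, times = 11),
  Year = rep(2010:2020, each = 4),
  Rate = as.vector(wide)
)

# year -> x-location, rate -> bar length, age -> fill colour.
# Dodging also shifts each age's bar sideways within its year.
p <- ggplot(rates, aes(x = Year, y = Rate, fill = factor(Age))) +
  geom_col(position = "dodge") +
  scale_x_continuous(breaks = 2010:2020) +
  labs(y = "Offenders per 10,000", fill = "Age")
print(p)
```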

Mapping Data to Data Symbols

    Symbol:     tiles (rectangles)
    Mappings:   year  ->  x-location
                rate  ->  colour
                age   ->  y-location

This image embodies the mappings from data values to data symbols that are described in the text above the image.  A heatmap of the crime rate, with a cell for each combination of year (2010 to 2020) and age (10 to 13).  Each cell is coloured with a different shade of red, with darker reds indicating higher crime rates.  Each row of the heatmap, which represents one age over time, shows darker reds changing to lighter reds, which corresponds to decreasing crime rates.
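
A heatmap with these mappings might be sketched as follows. The rates data frame is rebuilt from the table of rates, and the specific colours ("mistyrose" to "darkred") are assumptions chosen to give lighter reds for low rates and darker reds for high rates; they are not necessarily the colours used in the image.

```r
library(ggplot2)

# Rebuild the rates data (per 10,000 population) from the table of rates:
# one matrix row per age (10 to 13), one column per year (2010 to 2020).
wide <- rbind(
  c( 66,  62,  65,  43,  41,  39,  29,  33,  22,  24,  16),
  c(120, 107, 107,  77,  82,  60,  62,  58,  46,  50,  38),
  c(211, 189, 166, 154, 133, 115, 116, 115,  97,  88,  76),
  c(392, 356, 293, 263, 230, 216, 195, 203, 176, 158, 144)
)
rates <- data.frame(
  Age  = rep(10:13, times = 11),
  Year = rep(2010:2020, each = 4),
  Rate = as.vector(wide)
)

# year -> x-location, age -> y-location, rate -> fill colour.
# Rate stays numeric, so the fill scale is a continuous gradient.
p <- ggplot(rates, aes(x = Year, y = factor(Age), fill = Rate)) +
  geom_tile() +
  scale_fill_gradient(low = "mistyrose", high = "darkred") +
  scale_x_continuous(breaks = 2010:2020) +
  labs(y = "Age", fill = "Offenders\nper 10,000")
print(p)
```

Compared with the line plot, only the aes() arguments and the geom change; the data are identical.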

An effective data visualisation makes use of a good mapping from data to data symbols.

  • We will spend a lot of time talking about how to choose a good mapping.

Summary

  • A data visualisation consists of data symbols, guides, and labels.

  • A data visualisation can help to answer questions.

    • An effective data visualisation will pose questions that the visual system is good at answering.

  • We need to choose a mapping from data values to data symbols.

    • An effective data visualisation will have good mappings from data to data symbols.

Exercises

  • Can you identify the data symbols, guides, and labels in this image?

A line plot of crime rate (y-axis) against time (x-axis). There is a separate line for each of several ethnicities (Maori, Pacific Peoples, and European/Other), plus a line for total crime rate. Each line has a different colour. There are axis tick labels, axis titles, and a legend with lines and labels for each ethnicity. There are also text labels at the start and end of each line (giving the start and end crime rates for each line).

Source: 2023 Youth Justice Indicators Summary Report

Exercises

  • Can you identify the mappings (of data values to data symbols) that are used in this image?

A line plot of crime rate (y-axis) against time (x-axis). There is a separate line for each of several ethnicities (Maori, Pacific Peoples, and European/Other), plus a line for total crime rate. Each line has a different colour. There are axis tick labels, axis titles, and a legend with lines and labels for each ethnicity. There are also text labels at the start and end of each line (giving the start and end crime rates for each line).

  • The data visualisation below is from a graduate student project.

    A scatterplot of unidentified values against fraction of the year (0 to 1). The data points are coloured blue for real values and orange for predicted values from a model. There is a very large overlap between blue and orange symbols, but in some places the y-range of blue symbols extends beyond the range of the orange symbols or vice versa. The orange symbols extend above the blue symbols about 0.75 of the way through the year.

  • The task of interest is: identify the month(s) in which the prediction is higher than the actual (and vice versa).

    A scatterplot of unidentified values against fraction of the year (0 to 1). The data points are coloured blue for real values and orange for predicted values from a model. There is a very large overlap between blue and orange symbols, but in some places the y-range of blue symbols extends beyond the range of the orange symbols or vice versa. The orange symbols extend above the blue symbols about 0.75 of the way through the year.

  • Can you identify ways in which the visual system finds it either easy or difficult to answer this question?

    A scatterplot of unidentified values against fraction of the year (0 to 1). The data points are coloured blue for real values and orange for predicted values from a model. There is a very large overlap between blue and orange symbols, but in some places the y-range of blue symbols extends beyond the range of the orange symbols or vice versa. The orange symbols extend above the blue symbols about 0.75 of the way through the year.