Effective Data Visualisation with R
Quantitative Data

Paul Murrell
The University of Auckland

Review

  • A data visualisation consists of data symbols, guides, and labels.

  • A data visualisation can help to answer questions.

    • An effective data visualisation will pose questions that the visual system is good at answering.
  • We need to choose a mapping from data values to data symbols.

    • An effective data visualisation will have good mappings from data to data symbols.

Quantitative Data

  • Visual features

  • Data types

  • Quantitative features

  • Quantitative scales

  • Feature mismatch

Visual Features

  • Data symbols are the geometric shapes that represent data values.

  • In {ggplot2}, our choice of geom selects a geometric shape.

  • geom_col() produces a rectangle.

library(ggplot2)
# Draw one rectangle (bar) per row of crimeLevelTotal.
ggplot(crimeLevelTotal) + 
    geom_col(aes(x=total, y=level))

  • In {ggplot2}, our choice of geom and aesthetic selects a visual feature.

  • geom_col() plus x=total produces rectangles with different lengths.

ggplot(crimeLevelTotal) + 
    geom_col(aes(x=total, y=level))

  • In {ggplot2}, our choice of geom and aesthetic selects a visual feature.

  • geom_col() plus y=level produces rectangles at different positions.

ggplot(crimeLevelTotal) + 
    geom_col(aes(x=total, y=level))

Visual Features

  • The combination of geom and aesthetic that we choose maps data values to a visual feature.

  • Visual features are the properties of data symbols that our visual system identifies very rapidly, in parallel, and without conscious effort.

  • Our task is to identify a good mapping from data values to visual features.

Visual Features

  • Some common visual features are: position (or location), length (height or width), area, angle, colour, and shape.

Effective Visual Features

  • How do we choose an effective visual feature to represent data values?

  • It is important that we can map backwards from a visual feature to the data.

Effective Visual Features

  • One idea is to select geoms and aesthetics so that data values are mapped to visual features that have the same type as the data.

  • For visualising quantitative data, we need to identify which visual features are quantitative visual features.

Data Types

  • Nominal data only allows us to calculate whether X is equal to a specific value.

  • Ordinal data also allows us to calculate whether X is greater than a specific value.

  • Ordinal and nominal data are collectively referred to as qualitative data and typically involve character values (text labels).

Nominal Data

  • crimeGroup contains the total count of offenders aged 14-16, per year, broken down by ethnic group.
head(crimeGroup)
           group year count       pop      rate   yearDate
1          Māori 2011  5957  42960.00 1407.6156 2011-06-30
2       Pasifika 2011  1092  16947.80  654.0788 2011-06-30
3 European/Other 2011  5775 125542.24  466.9634 2011-06-30
6          Māori 2012  5458  43190.03 1287.8118 2012-06-30
7       Pasifika 2012  1040  16934.88  625.8260 2012-06-30
8 European/Other 2012  4831 125175.13  393.2976 2012-06-30

Nominal Data

  • crimeGroup$group is an example of nominal data.

  • We can calculate whether group is equal to a specific value.

head(crimeGroup$group == "Pasifika")
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE

Ordinal Data

  • crimeLevel contains the total count of offenders aged 14-16, per year, broken down by the level of crime.
head(crimeLevel)
        level year count      prop levelNumeric   yearDate
1         Low 2011  3986 30.619143            1 2011-06-30
2  Low-Medium 2011  4010 30.803503            2 2011-06-30
3      Medium 2011  1770 13.596559            3 2011-06-30
4 Medium-High 2011  2346 18.021201            4 2011-06-30
5        High 2011   906  6.959594            5 2011-06-30
6         Low 2012  3578 30.991771            1 2012-06-30

Ordinal Data

  • crimeLevel$level is an example of ordinal data.

  • We can calculate whether level is greater than a specific value.

head(crimeLevel$level > "Medium")
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE
  • We can also calculate whether level is equal to a specific value.
head(crimeLevel$level == "Medium")
[1] FALSE FALSE  TRUE FALSE FALSE FALSE

  • A character vector can take any value.

    char <- c("Low", "Medium", "High")
    char
    [1] "Low"    "Medium" "High"  
  • A factor has a set of valid levels.

    fac <- factor(char)
    fac
    [1] Low    Medium High  
    Levels: High Low Medium
  • An ordered factor has an order to the levels.

    ordered(fac, levels=char)
    [1] Low    Medium High  
    Levels: Low < Medium < High
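The practical difference between a plain factor and an ordered factor appears when we attempt an order comparison. The following is a minimal sketch that reuses the char vector from above; the names ord, facGreater, ordGreater, and facEqual are illustrative only.

```r
char <- c("Low", "Medium", "High")

# A plain factor: the levels have no order, so ">" is not meaningful.
# R returns NA with a warning (suppressed here to keep the output tidy).
fac <- factor(char)
facGreater <- suppressWarnings(fac > "Low")

# An ordered factor: the levels are ordered Low < Medium < High,
# so ">" compares positions within that order.
ord <- factor(char, levels=char, ordered=TRUE)
ordGreater <- ord > "Low"

# Equality works for both, because it only needs nominal information.
facEqual <- fac == "Medium"

print(facGreater)
print(ordGreater)
print(facEqual)
```

This is why an ordinal variable such as level should be stored as an ordered factor if we want to ask "greater than" questions of it.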

Data Types

  • Interval data also allows us to calculate the difference between values.

  • Ratio data also allows us to calculate the ratio between values.

  • Ratio and interval data are collectively referred to as quantitative data and involve numeric values.

Interval and Ratio Data

  • crimeLevel$year is an example of interval data.

  • We can calculate the difference between year values.

head(crimeLevel$year - 2000)
[1] 11 11 11 11 11 12
  • We can also calculate whether year is greater than a specific value or whether year is equal to a specific value.
head(crimeLevel$year > 2010)
[1] TRUE TRUE TRUE TRUE TRUE TRUE
head(crimeLevel$year == 2010)
[1] FALSE FALSE FALSE FALSE FALSE FALSE

  • crimeLevel$count is an example of ratio data.

  • We can calculate the ratio of count values.

crimeLevel$count[1] / crimeLevel$count[3]
[1] 2.251977
  • We can also calculate the difference between count values, and whether one value is greater than or equal to another.
crimeLevel$count[1] - crimeLevel$count[3]
[1] 2216
crimeLevel$count[1] > crimeLevel$count[3]
[1] TRUE
crimeLevel$count[1] == crimeLevel$count[3]
[1] FALSE
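One way to see why year is interval data while count is ratio data is to move the zero point of the scale. The sketch below assumes two years and two counts taken from the crimeLevel output above; the variable names are illustrative.

```r
# Two years (interval data) and two counts (ratio data) from crimeLevel.
years  <- c(2011, 2012)
counts <- c(3986, 1770)

# Shift the origin of the year scale (count years since 2000 instead).
shifted <- years - 2000

# Differences are unchanged by the shift, so they are meaningful.
diffOriginal <- years[2] - years[1]
diffShifted  <- shifted[2] - shifted[1]

# Ratios depend on the arbitrary origin, so they are not meaningful.
ratioOriginal <- years[2] / years[1]
ratioShifted  <- shifted[2] / shifted[1]

# Counts have a true zero (no offenders), so their ratio is meaningful.
countRatio <- counts[1] / counts[2]

print(c(diffOriginal, diffShifted))
print(c(ratioOriginal, ratioShifted))
print(countRatio)
```

The two differences agree, the two year ratios do not, and the count ratio matches the value calculated above.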

Effective Visual Features

  • One idea is to select geoms and aesthetics so that data values are mapped to visual features that have the same type as the data.

  • For visualising quantitative data, we need to identify which visual features are quantitative visual features.

Quantitative Features

  • Position, length, area, and angle are all examples of features that are able to represent quantitative values.

  • All of these features can also represent ratio data, because each of them can meaningfully represent zero.

Quantitative Features

  • Colour and shape are examples of features that are not appropriate for representing quantitative values.

Quantitative Data

  • The crimeLevelTotal data frame contains the total count of offenders (from 2010 to 2020) for different levels of severity of crime.

  • total is an example of quantitative data.

    crimeLevelTotal
    # A tibble: 5 × 3
      level       total   prop
      <ord>       <int>  <dbl>
    1 Low         24242 0.283 
    2 Low-Medium  21513 0.251 
    3 Medium      13823 0.161 
    4 Medium-High 18315 0.214 
    5 High         7793 0.0909

Quantitative Data

  • If we map total to a quantitative visual feature, we can answer questions like:

    • What is the total for Low offences?
    • What is the difference in total for Low versus Low-Medium offences?
    • What is the ratio of total offenders for Medium-High versus High offences?
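These are the values that a viewer should be able to read back from a good visualisation. As a rough check, the sketch below rebuilds the printed totals as a plain data frame (an assumption for self-containment; the original is a tibble) and calculates the three answers directly. The names severity, lowTotal, lowDiff, and highRatio are illustrative.

```r
# Rebuild the printed data as a plain data frame (a tibble is not required).
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(severity, levels=severity, ordered=TRUE),
    total = c(24242L, 21513L, 13823L, 18315L, 7793L))

# Index the totals by level name for convenient lookup.
total <- setNames(crimeLevelTotal$total, severity)

# What is the total for Low offences?
lowTotal <- total[["Low"]]

# What is the difference in total for Low versus Low-Medium offences?
lowDiff <- total[["Low"]] - total[["Low-Medium"]]

# What is the ratio of total offenders for Medium-High versus High offences?
highRatio <- total[["Medium-High"]] / total[["High"]]

print(lowTotal)
print(lowDiff)
print(round(highRatio, 2))
```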

We can visualise quantitative data using the position of points.

  • What is the total for Low offences?
  • What is the difference in total for Low versus Low-Medium offences?

Quantitative Features in {ggplot2}

  • We need to select a geom and an aesthetic that map data values to a quantitative visual feature:

    • geom_point() and the x and y aesthetics map data values to position.

    • geom_col() and the x or y aesthetic (whichever carries the quantitative variable) map data values to length.

    • geom_point() and the size aesthetic map data values to area.

Quantitative Features in {ggplot2}

ggplot(crimeLevelTotal) + 
    geom_point(aes(x=total, y=level), size=2)

We can visualise quantitative data using the length of bars.

  • What is the ratio of total offenders for Medium-High versus High offences?

Quantitative Features in {ggplot2}

ggplot(crimeLevelTotal) + 
    geom_col(aes(x=total, y=level))

We can visualise quantitative data using the area of points.

  • Which is larger: the difference in total between High and Medium-High offences, or the difference in total between Medium-High and Medium offences?

Quantitative Features in {ggplot2}

ggplot(crimeLevelTotal) + 
    geom_point(aes(x="", y=level, size=total), shape=1)

We can visualise quantitative data using the angle/area of wedges.

  • What proportion of total offences were Low offences?

Quantitative Features in {ggplot2}

  • coord_polar() changes the mapping of x and y so that these aesthetics map data values to angle and radius.

  • geom_col() with constant y produces a stacked barplot.

  • geom_col() with constant y and coord_polar() produces a pie chart.

Quantitative Features in {ggplot2}

ggplot(crimeLevelTotal) +
    geom_col(aes(x=total, y="", fill=level))
ggplot(crimeLevelTotal) +
    geom_col(aes(x=total, y="", fill=level)) +
    coord_polar() 
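In the pie chart, each wedge's angle is proportional to that level's share of the overall total. The sketch below works out those shares and angles by hand in base R, assuming the totals printed earlier for crimeLevelTotal; it does not reproduce how {ggplot2} draws the wedges, and the names prop, angle, and wedges are illustrative.

```r
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
total <- c(24242, 21513, 13823, 18315, 7793)

# Share of the overall total for each level.
prop <- total / sum(total)

# A full circle is 360 degrees, so each wedge gets its share of 360.
angle <- 360 * prop

wedges <- data.frame(level=severity,
                     prop=round(prop, 3),
                     angle=round(angle, 1))
print(wedges)
```

The prop values agree with the prop column of crimeLevelTotal, which is the quantity the pie chart question asks about.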

Quantitative Scales

  • The mapping from data values to a visual feature also depends on the scale of the mapping.

  • How are data values converted to values on the visual feature?

The Principle of Unambiguity

  • The scale should be chosen so that differences in the data are visible in the data visualisation.

Quantitative Scales

  • {ggplot2} provides a reasonable default.

  • We can control the x- and y-scale limits using scale_x_continuous() and scale_y_continuous().

Quantitative Scales

  • Position, length, area, and angle are only ratio features if they are measured relative to zero.

  • This is particularly important for length, area, and angle because zero length, area, and angle naturally correspond to a data value of zero.

If we visualise ratio data using position of points, a zero position must be shown if we expect the viewer to make ratio comparisons.

Quantitative Scales

  • We can control the x- and y-scale limits using scale_x_continuous() and scale_y_continuous().
ggplot(crimeLevelTotal) + 
    geom_point(aes(x=total, y=level), size=2) +
    scale_x_continuous(limits=c(0, NA))
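To see why the zero position matters, compare the ratio a viewer would judge from distances along the axis with the true ratio in the data. The sketch below uses the Medium-High and High totals from crimeLevelTotal; the axis start of 5000 is a hypothetical value chosen only for illustration.

```r
mediumHigh <- 18315
high <- 7793

# The ratio in the data.
trueRatio <- mediumHigh / high

# Hypothetical axis that starts at 5000 rather than zero:
# the viewer judges distances from the start of the axis.
axisStart <- 5000
apparentRatio <- (mediumHigh - axisStart) / (high - axisStart)

# With the axis starting at zero, distances match the data values.
zeroRatio <- (mediumHigh - 0) / (high - 0)

print(round(c(true=trueRatio, apparent=apparentRatio, zero=zeroRatio), 2))
```

With the truncated axis, the Medium-High point appears more than four times as far along as the High point, although the true ratio is less than two and a half.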

If we visualise data using length of bars, the bars should always start at zero.

  • This is the default behaviour for geom_col().

Area

  • We perceive the area of a circle rather than the radius.
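The consequence is that data values should be mapped to the area of a circle, not to its radius. A small sketch of the arithmetic, assuming the Medium-High and High totals from crimeLevelTotal (the variable names are illustrative):

```r
total <- c("Medium-High"=18315, "High"=7793)
dataRatio <- total[[1]] / total[[2]]

# Naive mapping: radius proportional to the data value.
# The areas then differ by the SQUARE of the data ratio.
radiusNaive <- total
areaNaive <- pi * radiusNaive^2
naiveRatio <- areaNaive[[1]] / areaNaive[[2]]

# Area mapping: radius proportional to the square root of the data value.
# The areas then differ by exactly the data ratio.
radiusArea <- sqrt(total / pi)
areaCorrect <- pi * radiusArea^2
correctRatio <- areaCorrect[[1]] / areaCorrect[[2]]

print(round(c(data=dataRatio, naive=naiveRatio, area=correctRatio), 2))
```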

Ratio Features in {ggplot2}

  • scale_size() (the default) maps values to area.

  • scale_size_area() maps zero to an area of zero.

ggplot(crimeLevelTotal) +
    geom_point(aes("", level, size=total)) +
    scale_size_area()
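One way to check the difference between the two scales is to inspect the sizes that {ggplot2} computes. The sketch below assumes the totals from crimeLevelTotal rebuilt as a plain data frame, uses layer_data() to extract the computed size values, and treats the square of size as proportional to the drawn area (size is a linear dimension of the point). The names base, sizeDefault, and sizeArea are illustrative.

```r
library(ggplot2)

severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(severity, levels=severity, ordered=TRUE),
    total = c(24242, 21513, 13823, 18315, 7793))

base <- ggplot(crimeLevelTotal) +
    geom_point(aes(x="", y=level, size=total))

# Computed point sizes under each scale (one value per row of the data).
sizeDefault <- layer_data(base + scale_size())$size
sizeArea <- layer_data(base + scale_size_area())$size

# Ratio of areas for Medium-High (row 4) versus High (row 5).
areaRatioDefault <- sizeDefault[4]^2 / sizeDefault[5]^2
areaRatioArea <- sizeArea[4]^2 / sizeArea[5]^2
dataRatio <- crimeLevelTotal$total[4] / crimeLevelTotal$total[5]

print(round(c(data=dataRatio, default=areaRatioDefault,
              area=areaRatioArea), 2))
```

Under scale_size_area() the ratio of areas should match the ratio in the data, whereas the default scale maps the smallest value to a non-zero minimum size, so ratios of area are not preserved.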

Feature Mismatch

  • If the visual feature cannot represent quantitative values, we lose information.

Feature Mismatch

  • We may deliberately lose information if we are only asking simple questions.

  • Which level of offence has the highest total?
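This question only needs the order of the totals, not their differences or ratios. The sketch below stands in for an order-only visual feature by replacing the totals with their ranks; this is an illustration of the information that survives, not a description of how {ggplot2} maps any particular aesthetic. It assumes the totals from crimeLevelTotal, and the variable names are illustrative.

```r
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
total <- c(24242, 21513, 13823, 18315, 7793)

# Answer from the full quantitative values.
highest <- severity[which.max(total)]

# Keep only the order of the totals (differences and ratios are lost).
totalRank <- rank(total)
highestFromRank <- severity[which.max(totalRank)]

print(highest)
print(highestFromRank)
```

Both versions give the same answer, so a feature that preserves only order is sufficient for this question, even though it could not answer the difference or ratio questions.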

An effective data visualisation will select a visual feature that allows the visual system to answer the question of interest.

Summary

  • In {ggplot2}, the choice of geom plus aesthetic mapping dictates the visual feature that we end up using.

  • It makes sense to map quantitative data to quantitative visual features.

    • position, length, area, and angle are quantitative visual features.

    • A zero reference is necessary for visualising ratio data.

Summary

  • It makes sense to map quantitative data to quantitative visual features.

    • Quantitative data can be visualised with:

      • bars of different lengths.
      • points at different positions or with different areas.
      • pie wedges with different angles/areas.
  • Caveat: we only need fully quantitative visual features if we are asking fully quantitative questions.

Exercises

Exercise

  • We want to compare the total of offences for different ethnic groups:

    • What is the ratio of Māori versus European totals?
    • Which ethnic groups have similar totals?
    crimeGroupTotal[1:2]
    # A tibble: 3 × 2
      group          total
      <chr>          <int>
    1 European/Other 31274
    2 Māori          41787
    3 Pasifika        6993

Exercise

  • What type of data are total and group?

  • Write {ggplot2} code to produce data visualisations that:

    • map total to length.
    • map total to position.
    • map total to angle.
    • map total to colour.
  • Which data visualisations allow us to answer the questions of interest?

Exercise

  • We want to compare the count of offences for different ethnic groups in different years.

    head(crimeGroup[1:3], 10)
                group year count
    1           Māori 2011  5957
    2        Pasifika 2011  1092
    3  European/Other 2011  5775
    6           Māori 2012  5458
    7        Pasifika 2012  1040
    8  European/Other 2012  4831
    11          Māori 2013  4866
    12       Pasifika 2013   880
    13 European/Other 2013  3775
    16          Māori 2014  4054

Exercise

  • Which visual features are count, group, and year being mapped to?

  • Are these mappings effective?

Exercise

  • Can you see anything wrong with this data visualisation?