Effective Data Visualisation with R
Quantitative Data

Paul Murrell
The University of Auckland

Review

A data visualisation consists of data symbols, guides, and labels.
A data visualisation can help to answer questions.
- An effective data visualisation will pose questions that the visual system is good at answering.
We need to choose a mapping from data values to data symbols.
- An effective data visualisation will have good mappings from data to data symbols.

Quantitative Data

Visual features
Data types
Quantitative features
Quantitative scales
Feature mismatch

Visual Features

Data symbols are the geometric shapes that represent data values.

In {ggplot2}, our choice of geom selects a geometric shape.
geom_col() produces a rectangle.

ggplot(crimeLevelTotal) + 
    geom_col(aes(x=total, y=level))

In {ggplot2}, our choice of geom and aesthetic selects a visual feature.
geom_col() plus x=total produces rectangles with different lengths.

ggplot(crimeLevelTotal) + 
    geom_col(aes(x=total, y=level))

In {ggplot2}, our choice of geom and aesthetic selects a visual feature.
geom_col() plus y=level produces rectangles at different positions.

ggplot(crimeLevelTotal) + 
    geom_col(aes(x=total, y=level))

Visual Features

The combination of geom and aesthetic that we choose maps data values to a visual feature.
Visual features are the properties of data symbols that our visual system identifies very rapidly, in parallel, and without conscious effort.
Our task is to identify a good mapping from data values to visual features.

Visual Features

Some common visual features are: position (or location), length (height or width), area, angle, colour, and shape.

Effective Visual Features

How do we choose an effective visual feature to represent data values?
It is important that we can map backwards from a visual feature to the data.

Effective Visual Features

One idea is to select geoms and aesthetics so that data values are mapped to visual features that have the same type as the data.
For visualising quantitative data, we need to identify which visual features are quantitative visual features.

Data Types

Nominal data only allows us to calculate whether X is equal to a specific value.
Ordinal data also allows us to calculate whether X is greater than a specific value.
Ordinal and nominal data are collectively referred to as qualitative data and typically involve character values (text labels).

Nominal Data

crimeGroup contains the total count of offenders aged 14-16, per year, broken down by ethnic group.

head(crimeGroup)

           group year count       pop      rate   yearDate
1          Māori 2011  5957  42960.00 1407.6156 2011-06-30
2       Pasifika 2011  1092  16947.80  654.0788 2011-06-30
3 European/Other 2011  5775 125542.24  466.9634 2011-06-30
6          Māori 2012  5458  43190.03 1287.8118 2012-06-30
7       Pasifika 2012  1040  16934.88  625.8260 2012-06-30
8 European/Other 2012  4831 125175.13  393.2976 2012-06-30

Nominal Data

crimeGroup$group is an example of nominal data.
We can calculate whether group is equal to a specific value.

head(crimeGroup$group == "Pasifika")

[1] FALSE  TRUE FALSE FALSE  TRUE FALSE

Ordinal Data

crimeLevel contains the total count of offenders aged 14-16, per year, broken down by the level of crime.

head(crimeLevel)

        level year count      prop levelNumeric   yearDate
1         Low 2011  3986 30.619143            1 2011-06-30
2  Low-Medium 2011  4010 30.803503            2 2011-06-30
3      Medium 2011  1770 13.596559            3 2011-06-30
4 Medium-High 2011  2346 18.021201            4 2011-06-30
5        High 2011   906  6.959594            5 2011-06-30
6         Low 2012  3578 30.991771            1 2012-06-30

Ordinal Data

crimeLevel$level is an example of ordinal data.
We can calculate whether level is greater than a specific value.

head(crimeLevel$level > "Medium")

[1] FALSE FALSE FALSE  TRUE  TRUE FALSE

We can also calculate whether level is equal to a specific value.

head(crimeLevel$level == "Medium")

[1] FALSE FALSE  TRUE FALSE FALSE FALSE

A character vector can take any value.

char <- c("Low", "Medium", "High")
char

[1] "Low"    "Medium" "High"

A factor has a set of valid levels.

fac <- factor(char)
fac

[1] Low    Medium High  
Levels: High Low Medium

An ordered factor has an order to the levels.

ordered(fac, levels=char)

[1] Low    Medium High  
Levels: Low < Medium < High
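The distinction between these types determines which comparisons are meaningful. As a minimal sketch (it reuses the char vector from above; the names ord and unorderedResult are introduced only for illustration), an ordered factor supports greater-than comparisons, while an unordered factor only supports tests of equality.

```r
char <- c("Low", "Medium", "High")

# Unordered factor: the levels are sorted alphabetically and have no order
fac <- factor(char)
# Ordered factor: the levels follow the order given in 'levels'
ord <- ordered(char, levels = char)

# Equality works for both kinds of factor
fac == "Medium"
ord == "Medium"

# Greater-than is only meaningful for the ordered factor
ord > "Low"

# For the unordered factor, R gives a warning and returns NA values
unorderedResult <- suppressWarnings(fac > "Low")
unorderedResult
```

This is why crimeLevel$level needs to be an ordered factor for the comparison with "Medium" to work.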

Data Types

Interval data also allows us to calculate the difference between values.
Ratio data also allows us to calculate the ratio between values.
Ratio and interval data are collectively referred to as quantitative data and involve numeric values.

Interval and Ratio Data

crimeLevel$year is an example of interval data.
We can calculate the difference between year values.

head(crimeLevel$year - 2000)

[1] 11 11 11 11 11 12

We can also calculate whether year is greater than a specific value or whether year is equal to a specific value.

head(crimeLevel$year > 2010)

[1] TRUE TRUE TRUE TRUE TRUE TRUE

head(crimeLevel$year == 2010)

[1] FALSE FALSE FALSE FALSE FALSE FALSE

crimeLevel$count is an example of ratio data.
We can calculate the ratio of count values.

crimeLevel$count[1] / crimeLevel$count[3]

[1] 2.251977

We can also calculate the difference between count values, whether one value is greater than another, and whether two values are equal.

crimeLevel$count[1] - crimeLevel$count[3]

[1] 2216

crimeLevel$count[1] > crimeLevel$count[3]

[1] TRUE

crimeLevel$count[1] == crimeLevel$count[3]

[1] FALSE
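One way to see why year is interval data rather than ratio data is to move the zero point. The sketch below uses two year values from crimeLevel and an alternative origin of 2000 (an arbitrary choice, made only for illustration): the difference is unchanged, but the ratio is not.

```r
year <- c(2011, 2012)

# Differences do not depend on where zero is
diffFromZero <- year[2] - year[1]
diffFrom2000 <- (year[2] - 2000) - (year[1] - 2000)
c(diffFromZero, diffFrom2000)

# Ratios change when the zero point moves, so they are not meaningful
ratioFromZero <- year[2] / year[1]
ratioFrom2000 <- (year[2] - 2000) / (year[1] - 2000)
c(ratioFromZero, ratioFrom2000)
```

A count has a natural zero (no offenders), which is why the ratio of count values is meaningful.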

Effective Visual Features

One idea is to select geoms and aesthetics so that data values are mapped to visual features that have the same type as the data.
For visualising quantitative data, we need to identify which visual features are quantitative visual features.

Quantitative Features

Position, length, area, and angle are all examples of features that are able to represent quantitative values.
These features can all represent ratio data as well, because each one can represent a value of zero in a meaningful way.

Quantitative Features

Colour and shape are examples of features that are not appropriate for representing quantitative values.

Quantitative Data

The crimeLevelTotal data frame contains the total count of offenders (from 2010 to 2020) for different levels of severity of crime.

total is an example of quantitative data.

crimeLevelTotal

# A tibble: 5 × 3
  level       total   prop
  <ord>       <int>  <dbl>
1 Low         24242 0.283 
2 Low-Medium  21513 0.251 
3 Medium      13823 0.161 
4 Medium-High 18315 0.214 
5 High         7793 0.0909

Quantitative Data

If we map total to a quantitative visual feature, we can answer questions like:
- What is the total for Low offences?
- What is the difference in total for Low versus Low-Medium offences?
- What is the ratio of total offenders for Medium-High versus High offences?
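For reference, these questions can also be answered by direct calculation. The sketch below re-creates the level and total columns of crimeLevelTotal as a plain data frame (an assumption made to keep the example self-contained; the original is a tibble with an ordered level column).

```r
crimeLevelTotal <- data.frame(
    level = c("Low", "Low-Medium", "Medium", "Medium-High", "High"),
    total = c(24242, 21513, 13823, 18315, 7793)
)

# Named vector so that totals can be looked up by level
total <- setNames(crimeLevelTotal$total, crimeLevelTotal$level)

# What is the total for Low offences?
lowTotal <- total[["Low"]]
# What is the difference in total for Low versus Low-Medium offences?
lowDiff <- total[["Low"]] - total[["Low-Medium"]]
# What is the ratio of total offenders for Medium-High versus High offences?
highRatio <- total[["Medium-High"]] / total[["High"]]

c(lowTotal, lowDiff, highRatio)
```

The difference is 2729 and the ratio is roughly 2.35. A data visualisation that maps total to a quantitative visual feature should let the viewer estimate values close to these.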

We can visualise quantitative data using the position of points.

What is the total for Low offences?
What is the difference in total for Low versus Low-Medium offences?

Quantitative Features in {ggplot2}

We need to select a geom and an aesthetic that map data values to a quantitative visual feature:
- geom_point() and the x and y aesthetics map data values to position.
- geom_col() and the x or y aesthetic map data values to length.
- geom_point() and the size aesthetic map data values to area.

Quantitative Features in {ggplot2}

ggplot(crimeLevelTotal) + 
    geom_point(aes(x=total, y=level), size=2)

We can visualise quantitative data using the length of bars.

What is the ratio of total offenders for Medium-High versus High offences?

Quantitative Features in {ggplot2}

ggplot(crimeLevelTotal) + 
    geom_col(aes(x=total, y=level))

We can visualise quantitative data using the area of points.

Which is larger: the difference in total between High and Medium-High offences, or the difference in total between Medium-High and Medium offences?

Quantitative Features in {ggplot2}

ggplot(crimeLevelTotal) + 
    geom_point(aes(x="", y=level, size=total), shape=1)

We can visualise quantitative data using the angle/area of wedges.

What proportion of total offences were Low offences?

Quantitative Features in {ggplot2}

coord_polar() changes the mapping of x and y so that these aesthetics map data values to angle and radius.
geom_col() with constant y produces a stacked barplot.
geom_col() with constant y and coord_polar() produces a pie chart.

Quantitative Features in {ggplot2}

ggplot(crimeLevelTotal) +
    geom_col(aes(x=total, y="", fill=level))

ggplot(crimeLevelTotal) +
    geom_col(aes(x=total, y="", fill=level)) +
    coord_polar()
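The angle of each wedge in the pie chart encodes that level's share of the overall total. As a check, the shares can be computed directly. This sketch re-creates the total column as a named vector (an assumption made to keep the example self-contained), and the rounded shares should agree with the prop column of crimeLevelTotal.

```r
total <- c("Low" = 24242, "Low-Medium" = 21513, "Medium" = 13823,
           "Medium-High" = 18315, "High" = 7793)

# Each level's share of the overall total
prop <- total / sum(total)
round(prop, 3)

# The corresponding wedge angles, in degrees
wedgeAngle <- 360 * prop
round(wedgeAngle)
```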

Quantitative Scales

The mapping from data values to a visual feature also depends on the scale of the mapping.
How are data values converted to values on the visual feature?

The Principle of Unambiguity

The scale should be chosen so that differences in the data are visible in the data visualisation.

Quantitative Scales

{ggplot2} provides a reasonable default.
We can control the x- and y-scale limits using scale_x_continuous() and scale_y_continuous().

Quantitative Scales

Position, length, area, and angle are only ratio features if they are measured relative to zero.
This is particularly important for length, area, and angle because zero length, area, and angle naturally correspond to a data value of zero.

If we visualise ratio data using position of points, a zero position must be shown if we expect the viewer to make ratio comparisons.

Quantitative Scales

We can control the x- and y-scale limits using scale_x_continuous() and scale_y_continuous().

ggplot(crimeLevelTotal) + 
    geom_point(aes(x=total, y=level), size=2) +
    scale_x_continuous(limits=c(0, NA))

If we visualise data using length of bars, the bars should always start at zero.

This is the default behaviour for geom_col().
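One way to confirm this is to inspect the rectangles that {ggplot2} computes for the bars. The sketch below re-creates crimeLevelTotal as a plain data frame (an assumption made so that the example is self-contained) and uses layer_data() to extract the computed coordinates of each bar.

```r
library(ggplot2)

levelNames <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(levelNames, levels = levelNames, ordered = TRUE),
    total = c(24242, 21513, 13823, 18315, 7793)
)

p <- ggplot(crimeLevelTotal) +
    geom_col(aes(x=total, y=level))

# Each bar is a rectangle that runs from xmin to xmax
bars <- layer_data(p)
bars[c("xmin", "xmax")]
```

Every bar should have an xmin of zero and an xmax equal to its total, so the length of each bar is proportional to the data value.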

Area

We perceive the area of a circle rather than the radius.
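This matters when a data value is encoded as the size of a circle. As a small worked example (the values 1 and 2 are arbitrary, and circleArea() is a helper defined only for this sketch), making the radius proportional to the value exaggerates the difference, whereas making the radius proportional to the square root of the value keeps the areas in the same ratio as the data.

```r
value <- c(1, 2)

# Helper for this sketch: area of a circle with the given radius
circleArea <- function(radius) pi * radius^2

# Radius proportional to value: the second circle has 4 times the area
areaFromRadius <- circleArea(value)
areaFromRadius[2] / areaFromRadius[1]

# Radius proportional to sqrt(value): the second circle has 2 times the area
areaFromSqrt <- circleArea(sqrt(value))
areaFromSqrt[2] / areaFromSqrt[1]
```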

Ratio Features in {ggplot2}

scale_size() (the default) maps values to area.
scale_size_area() maps zero to an area of zero.

ggplot(crimeLevelTotal) +
    geom_point(aes("", level, size=total)) +
    scale_size_area()

Feature Mismatch

If the visual feature cannot represent quantitative values, we lose information.

Feature Mismatch

We may deliberately lose information if we are only asking simple questions.
Which level of offence has the highest total?

An effective data visualisation will select a visual feature that allows the visual system to answer the question of interest.
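As an illustration of a mismatch, the sketch below maps total to colour rather than to position, length, or area. The data frame is re-created from the crimeLevelTotal values (an assumption made so that the example is self-contained), and the point size of 5 is an arbitrary choice.

```r
library(ggplot2)

levelNames <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
    level = factor(levelNames, levels = levelNames, ordered = TRUE),
    total = c(24242, 21513, 13823, 18315, 7793)
)

# total is mapped to colour; position only distinguishes the levels
p <- ggplot(crimeLevelTotal) +
    geom_point(aes(x="", y=level, colour=total), size=5)
p
```

With this mapping it may still be possible to pick out the level with the highest total, but differences and ratios between totals are hard to judge.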

Summary

In {ggplot2}, the choice of geom plus aesthetic mapping dictates the visual feature that we end up using.
It makes sense to map quantitative data to quantitative visual features.
- position, length, area, and angle are quantitative visual features.
- A zero reference is necessary for visualising ratio data.

Summary

It makes sense to map quantitative data to quantitative visual features.
- Quantitative data can be visualised with:
  - bars of different lengths.
  - points at different positions or with different areas.
  - pie wedges with different angles/areas.
Caveat: we only need fully quantitative visual features if we are asking fully quantitative questions.

Exercises

Exercise

We want to compare the total of offences for different ethnic groups:
- What is the ratio of Māori versus European totals?
- Which ethnic groups have similar totals?
```
crimeGroupTotal[1:2]
```
```
# A tibble: 3 × 2
  group          total
  <chr>          <int>
1 European/Other 31274
2 Māori          41787
3 Pasifika        6993
```

Exercise

What type of data are total and group?
Write {ggplot2} code to produce data visualisations that:
- map total to length.
- map total to position.
- map total to angle.
- map total to colour.
Which data visualisations allow us to answer the questions of interest?

Exercise

We want to compare the count of offences for different ethnic groups in different years.

head(crimeGroup[1:3], 10)

            group year count
1           Māori 2011  5957
2        Pasifika 2011  1092
3  European/Other 2011  5775
6           Māori 2012  5458
7        Pasifika 2012  1040
8  European/Other 2012  4831
11          Māori 2013  4866
12       Pasifika 2013   880
13 European/Other 2013  3775
16          Māori 2014  4054

Exercise

Which visual features are count, group, and year being mapped to?
Are these mappings effective?

Exercise

Can you see anything wrong with this data visualisation?
