A data visualisation consists of data symbols, guides, and labels.
A data visualisation can help to answer questions.
We need to choose a mapping from data values to data symbols.
Visual features
Data types
Quantitative features
Quantitative scales
Feature mismatch
Visual Features
In {ggplot2}, our choice of geom selects a geometric shape.
geom_col() produces a rectangle.
In {ggplot2}, our choice of geom and aesthetic selects a visual feature.
geom_col() plus x=total produces rectangles with different lengths.
geom_col() plus y=level produces rectangles at different positions.
The combination of geom and aesthetic that we choose maps data values to a visual feature.
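As a minimal sketch of this mapping, the code below rebuilds a small copy of crimeLevelTotal from the totals printed elsewhere in these notes (how the original data frame was created is not shown, so the construction is an assumption) and combines geom_col() with the x and y aesthetics.

```r
library(ggplot2)

# Hypothetical reconstruction of crimeLevelTotal from the printed totals.
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
  level = factor(severity, levels = severity, ordered = TRUE),
  total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# x = total gives rectangles with different lengths;
# y = level places each rectangle at a different position.
p <- ggplot(crimeLevelTotal) +
  geom_col(aes(x = total, y = level))
p
```

Swapping the geom or the aesthetic changes which visual feature carries the data values.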
Visual features are the properties of data symbols that our visual system identifies very rapidly, in parallel, and without conscious effort.
Our task is to identify a good mapping from data values to visual features.
Some common visual features are: position (or location), length (height or width), area, angle, colour, and shape.
How do we choose an effective visual feature to represent data values?
It is important that we can map backwards from a visual feature to the data.
One idea is to select geoms and aesthetics so that data values are mapped to visual features that have the same type as the data.
For visualising quantitative data, we need to identify which visual features are quantitative visual features.
Data Types
Nominal data only allows us to calculate whether X is equal to a specific value.
Ordinal data also allows us to calculate whether X is greater than a specific value.
Ordinal and nominal data are collectively referred to as qualitative data and typically involve character values (text labels).
crimeGroup contains the total count of offenders aged 14-16, per year, broken down by ethnic group.
           group year count       pop      rate   yearDate
1          Māori 2011  5957  42960.00 1407.6156 2011-06-30
2       Pasifika 2011  1092  16947.80  654.0788 2011-06-30
3 European/Other 2011  5775 125542.24  466.9634 2011-06-30
6          Māori 2012  5458  43190.03 1287.8118 2012-06-30
7       Pasifika 2012  1040  16934.88  625.8260 2012-06-30
8 European/Other 2012  4831 125175.13  393.2976 2012-06-30
crimeGroup$group is an example of nominal data.
We can calculate whether group is equal to a specific value.
[1] FALSE  TRUE FALSE FALSE  TRUE FALSE
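The code behind this output is not shown. One comparison that is consistent with it, assuming the value tested was "Pasifika" and that the vector holds the six group values printed in the table, is sketched here.

```r
# Hypothetical reconstruction of the six group values shown in the table.
group <- c("Māori", "Pasifika", "European/Other",
           "Māori", "Pasifika", "European/Other")

# Nominal data supports a test of equality against a specific value.
# "Pasifika" is an assumed choice that matches the printed result.
isPasifika <- group == "Pasifika"
isPasifika
```

Asking whether one group is "greater than" another would not be meaningful here, which is what makes the data nominal rather than ordinal.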
crimeLevel contains the total count of offenders aged 14-16, per year, broken down by the level of crime.
        level year count      prop levelNumeric   yearDate
1         Low 2011  3986 30.619143            1 2011-06-30
2  Low-Medium 2011  4010 30.803503            2 2011-06-30
3      Medium 2011  1770 13.596559            3 2011-06-30
4 Medium-High 2011  2346 18.021201            4 2011-06-30
5        High 2011   906  6.959594            5 2011-06-30
6         Low 2012  3578 30.991771            1 2012-06-30
crimeLevel$level is an example of ordinal data.
We can calculate whether level is greater than a specific value.
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE
We can also calculate whether level is equal to a specific value.
[1] FALSE FALSE  TRUE FALSE FALSE FALSE
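The comparisons that produced these results are not shown. A sketch that is consistent with both outputs, assuming the reference value was "Medium" and that level is stored as an ordered factor, looks like this.

```r
# Hypothetical reconstruction of the six level values shown in the table.
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
level <- factor(c(severity, "Low"), levels = severity, ordered = TRUE)

# Ordinal data supports order comparisons as well as equality.
# "Medium" is an assumed reference value that matches the printed results.
aboveMedium <- level > "Medium"
isMedium <- level == "Medium"
aboveMedium
isMedium
```

The "greater than" comparison only works because the factor is ordered; for an unordered factor, R returns NA with a warning.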
A character vector can take any value.
[1] "Low" "Medium" "High"
A factor has a set of valid levels.
[1] Low Medium High
Levels: High Low Medium
An ordered factor has an order to the levels.
[1] Low Medium High
Levels: Low < Medium < High
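The three outputs above can be reproduced with base R. This is a minimal sketch; the variable names x, f, and o are illustrative and not taken from the original slides.

```r
# A character vector can take any value.
x <- c("Low", "Medium", "High")

# A factor has a set of valid levels (sorted alphabetically by default).
f <- factor(x)

# An ordered factor has an explicit order to its levels.
o <- factor(x, levels = c("Low", "Medium", "High"), ordered = TRUE)

levels(f)
levels(o)

# A value outside the valid levels becomes NA.
invalid <- factor("Extreme", levels = levels(o))

# Order comparisons are only meaningful for the ordered factor.
lowBelowHigh <- o[1] < o[3]
```

Setting the levels explicitly is what stops "High" from sorting ahead of "Low".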
Interval data also allows us to calculate the difference between values.
Ratio data also allows us to calculate the ratio between values.
Ratio and interval data are collectively referred to as quantitative data and involve numeric values.
crimeLevel$year is an example of interval data.
We can calculate the difference between year values.
[1] 11 11 11 11 11 12
We can also calculate whether year is greater than a specific value or whether year is equal to a specific value.
[1] TRUE TRUE TRUE TRUE TRUE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE
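The expressions behind these outputs are not shown. One guess that reproduces them is to subtract a baseline year of 2000 and to compare against 2010; both constants are assumptions made for this sketch.

```r
# Hypothetical reconstruction of the six year values shown in the table.
year <- c(2011L, 2011L, 2011L, 2011L, 2011L, 2012L)

# Interval data supports differences between values.
# The baseline of 2000 is an assumption that matches the printed result.
yearsSince2000 <- year - 2000L

# Order and equality comparisons still apply (2010 is an assumed value).
after2010 <- year > 2010L
is2010 <- year == 2010L
```

A ratio such as 2012 / 2011 is not meaningful because year zero is an arbitrary origin, which is why year counts as interval rather than ratio data.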
crimeLevel$count is an example of ratio data.
We can calculate the ratio of count values.
[1] 2.251977
We can also calculate the difference between count values and whether values are greater than or equal to a specific value.
[1] 2216
[1] TRUE
[1] FALSE
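The printed values are consistent with comparing the Low count for 2011 (row 1) with the Medium count for 2011 (row 3). That choice of rows is an inference from the numbers, not something stated in the notes.

```r
# Hypothetical reconstruction of the six count values shown in the table.
count <- c(3986L, 4010L, 1770L, 2346L, 906L, 3578L)

# Ratio data supports ratios because zero offenders is a true zero.
lowToMedium <- count[1] / count[3]

# Differences and comparisons also remain available.
lowMinusMedium <- count[1] - count[3]
lowGreater <- count[1] > count[3]
lowEqual <- count[1] == count[3]

round(lowToMedium, 6)
lowMinusMedium
```

Here the ratio says there were a little over twice as many Low offenders as Medium offenders in that year.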
Quantitative Features
Position, length, area, and angle are all examples of features that are able to represent quantitative values.
In all of these cases, this includes ratio data because zero can be meaningfully represented by the feature.
Colour and shape are examples of features that are not appropriate for representing quantitative values.
The crimeLevelTotal data frame contains the total count of offenders (from 2010 to 2020) for different levels of severity of crime.
total is an example of quantitative data.
# A tibble: 5 × 3
  level       total   prop
  <ord>       <int>  <dbl>
1 Low         24242 0.283
2 Low-Medium  21513 0.251
3 Medium      13823 0.161
4 Medium-High 18315 0.214
5 High         7793 0.0909
If we map total to a quantitative visual feature, we can answer questions like:
What is the total for Low offences?
How does total compare for Low versus Low-Medium offences?
How do total offenders compare for Medium-High versus High offences?
We can visualise quantitative data using the position of points.
What is the total for Low offences?
How does total compare for Low versus Low-Medium offences?
We need to select a geom and an aesthetic that map data values to a quantitative visual feature:
geom_point() and the x and y aesthetics map data values to position.
geom_col() and the y aesthetic map data values to length.
geom_point() and the size aesthetic map data values to area.
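The three combinations listed above can be sketched as follows. The data frame is rebuilt by hand from the printed totals, and the plot object names are illustrative; neither comes from the original slides.

```r
library(ggplot2)

# Hypothetical reconstruction of crimeLevelTotal from the printed totals.
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
  level = factor(severity, levels = severity, ordered = TRUE),
  total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Position: geom_point() with x and y.
pPosition <- ggplot(crimeLevelTotal) +
  geom_point(aes(x = total, y = level))

# Length: geom_col() with y.
pLength <- ggplot(crimeLevelTotal) +
  geom_col(aes(x = level, y = total))

# Area: geom_point() with size (y is held constant).
pArea <- ggplot(crimeLevelTotal) +
  geom_point(aes(x = level, y = "", size = total))
```

Printing each object in turn shows the same five totals carried by three different visual features.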
We can visualise quantitative data using the length of bars.
How do total offenders compare for Medium-High versus High offences?
We can visualise quantitative data using the area of points.
Which is larger: the difference in total between High and Medium-High offences, or the difference in total between Medium-High and Medium offences?
We can visualise quantitative data using the angle/area of wedges.
What proportion of total offences were Low offences?
coord_polar() changes the mapping of x and y so that these aesthetics map data values to angle and radius.
geom_col() with constant y produces a stacked barplot.
geom_col() with constant y and coord_polar() produces a pie chart.
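A minimal sketch of the stacked bar and the pie chart follows. It assumes the constant y is the empty string and that fill distinguishes the levels; the slides do not show the actual code, so both choices are illustrative.

```r
library(ggplot2)

# Hypothetical reconstruction of crimeLevelTotal from the printed totals.
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
  level = factor(severity, levels = severity, ordered = TRUE),
  total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Constant y: all bars share one position, so they stack along x.
pStack <- ggplot(crimeLevelTotal) +
  geom_col(aes(x = total, y = "", fill = level))

# coord_polar() maps x to angle by default, turning the stack into wedges.
pPie <- pStack + coord_polar()
```

The only difference between the two plots is the coordinate system: the same stacked lengths become angles.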
Quantitative Scales
The mapping from data values to a visual feature also depends on the scale of the mapping.
How are data values converted to values on the visual feature?
The scale should be chosen so that differences in the data are visible in the data visualisation.
{ggplot2} provides a reasonable default.
We can control the x- and y-scale limits using scale_x_continuous() and scale_y_continuous().
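As a sketch of controlling scale limits, the code below sets the lower y limit to zero and leaves the upper limit to be computed from the data (NA). The data frame is rebuilt by hand from the printed totals, and the choice of zero as the lower limit is just one option.

```r
library(ggplot2)

# Hypothetical reconstruction of crimeLevelTotal from the printed totals.
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
  level = factor(severity, levels = severity, ordered = TRUE),
  total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# limits = c(0, NA): fix the lower limit at zero, keep the default upper limit.
pZero <- ggplot(crimeLevelTotal) +
  geom_point(aes(x = level, y = total)) +
  scale_y_continuous(limits = c(0, NA))
```

Without the explicit limits, the default y scale for these points would start just below the smallest total rather than at zero.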
Position, length, area, and angle are only ratio features if the position or length is relative to zero.
This is particularly important for length, area, and angle because zero length, area, and angle naturally correspond to a data value of zero.
If we visualise ratio data using position of points, a zero position must be shown if we expect the viewer to make ratio comparisons.
We can include zero by setting the scale limits with scale_x_continuous() and scale_y_continuous().
If we visualise data using length of bars, the bars should always start at zero.
geom_col() draws bars that start at zero.
We perceive the area of a circle rather than the radius.
scale_size() (the default) maps values to area.
scale_size_area() maps zero to an area of zero.
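The difference between the two size scales can be sketched as follows. The data frame is rebuilt by hand from the printed totals, and max_size = 10 is an arbitrary illustrative value.

```r
library(ggplot2)

# Hypothetical reconstruction of crimeLevelTotal from the printed totals.
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
  level = factor(severity, levels = severity, ordered = TRUE),
  total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Default scale_size(): the smallest total gets the smallest point size,
# not a size proportional to its value.
pDefault <- ggplot(crimeLevelTotal) +
  geom_point(aes(x = level, y = "", size = total))

# scale_size_area(): area is proportional to total, so zero maps to zero area.
pProportional <- pDefault + scale_size_area(max_size = 10)
```

With scale_size_area(), a point with half the total has half the area, which is what ratio comparisons require.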
Feature Mismatch
If the visual feature cannot represent quantitative values, we lose information.
We may deliberately lose information if we are only asking simple questions.
Which level of offence has the highest total?
An effective data visualisation will select a visual feature that allows the visual system to answer the question of interest.
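As an illustration of a mismatch, the sketch below maps total to colour. The data frame is rebuilt by hand from the printed totals, and the fixed point size is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical reconstruction of crimeLevelTotal from the printed totals.
severity <- c("Low", "Low-Medium", "Medium", "Medium-High", "High")
crimeLevelTotal <- data.frame(
  level = factor(severity, levels = severity, ordered = TRUE),
  total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Colour carries total; position only separates the levels.
pColour <- ggplot(crimeLevelTotal) +
  geom_point(aes(x = level, y = "", colour = total), size = 8)
```

A viewer can probably still pick out the level with the highest total, but judging a difference or a ratio between two colours is much harder than with position or length.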
Summary
In {ggplot2}, the choice of geom plus aesthetic mapping dictates the visual feature that we end up using.
It makes sense to map quantitative data to quantitative visual features.
Position, length, area, and angle are quantitative visual features.
A zero reference is necessary for visualising ratio data.
Quantitative data can be visualised with the position of points, the length of bars, the area of points, or the angle/area of wedges.
Caveat: we only need fully quantitative visual features if we are asking fully quantitative questions.
Exercises
We want to compare the total of offences for different ethnic groups:
... totals?
... totals?
# A tibble: 3 × 2
  group          total
  <chr>          <int>
1 European/Other 31274
2 Māori          41787
3 Pasifika        6993
What type of data are total and group?
Write {ggplot2} code to produce data visualisations that:
map total to length.
map total to position.
map total to angle.
map total to colour.
Which data visualisations allow us to answer the questions of interest?
We want to compare the count of offences for different ethnic groups in different years.
            group year count
1           Māori 2011  5957
2        Pasifika 2011  1092
3  European/Other 2011  5775
6           Māori 2012  5458
7        Pasifika 2012  1040
8  European/Other 2012  4831
11          Māori 2013  4866
12       Pasifika 2013   880
13 European/Other 2013  3775
16          Māori 2014  4054
Which visual features are count, group, and year being mapped to?
Are these mappings effective?
Can you see anything wrong with this data visualisation?