A data visualisation is made up of data symbols and labels.
Data symbols represent data values.
Our goal is to learn how to choose an effective mapping from data values to data symbols.
What to draw: Many sections will focus primarily on the choice of data symbols to use in a data visualisation.
How to draw: In this section we will focus on writing R code to create data visualisations, using the {ggplot2} package.
Later sections will supplement advice on what to draw with further information about {ggplot2}.
{ggplot2} is built on a set of basic building blocks; this section covers geoms, aesthetics, scales, and stats.
The code for a {ggplot2} data visualisation has the following format:
ggplot(data) + geom_shape(aes(mappings))
# A tibble: 5 × 3
level total prop
<ord> <int> <dbl>
1 Low 24242 0.283
2 Low-Medium 21513 0.251
3 Medium 13823 0.161
4 Medium-High 18315 0.214
5 High 7793 0.0909
For example, with the crimeLevelTotal data frame shown above as the data, the shape is a column (or bar): level maps to the x position of the columns and total maps to the y position of (the tops of) the columns.
Many code examples have been deliberately simplified.
Some code examples will not exactly correspond to the details of the accompanying data visualisation.
In some cases, warning messages have also been suppressed.
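As a concrete instance of the ggplot(data) + geom_shape(aes(mappings)) format, the following is a minimal sketch. It assumes that crimeLevelTotal can be rebuilt by hand from the printed tibble (the original source of the data is not shown), and the name p_col is only illustrative.

```r
library(ggplot2)

# Rebuild crimeLevelTotal from the printed values (an assumption: the
# real data frame is created elsewhere in the course material).
crimeLevelTotal <- data.frame(
  level = factor(
    c("Low", "Low-Medium", "Medium", "Medium-High", "High"),
    levels = c("Low", "Low-Medium", "Medium", "Medium-High", "High"),
    ordered = TRUE
  ),
  total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)
crimeLevelTotal$prop <- crimeLevelTotal$total / sum(crimeLevelTotal$total)

# data = crimeLevelTotal, shape = column, mappings = level -> x, total -> y
p_col <- ggplot(crimeLevelTotal) +
  geom_col(aes(x = level, y = total))
p_col
```

Each piece of the template is filled in exactly once: the data frame goes inside ggplot(), the shape picks the geom function, and the mappings go inside aes().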
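Geoms
A geom determines the shape that is drawn for each data value.
geom_col() draws columns (or bars).
geom_point() draws data points.
geom_line() draws a line.
When x is a qualitative variable, geom_line() needs group=1 to tell {ggplot2} that all of the positions belong to the same line.
Function | Shape | Function | Shape |
---|---|---|---|
geom_col() | Rectangle | geom_line() | Line |
geom_bar() | Rectangle | geom_path() | Line |
geom_tile() | Rectangle | geom_segment() | Line |
geom_rect() | Rectangle | geom_curve() | Line |
geom_point() | Data point | geom_text() | Text |
geom_jitter() | Data point | geom_label() | Text |
geom_polygon() | Polygon | geom_raster() | Bitmap |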
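Swapping the geom changes the shape while the data and the mappings stay the same. The sketch below assumes a hand-built copy of crimeLevelTotal (level and total only); the names p_point and p_line are illustrative.

```r
library(ggplot2)

# Hand-built copy of the level and total columns (an assumption).
crimeLevelTotal <- data.frame(
  level = factor(
    c("Low", "Low-Medium", "Medium", "Medium-High", "High"),
    levels = c("Low", "Low-Medium", "Medium", "Medium-High", "High"),
    ordered = TRUE
  ),
  total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Same mappings as the column plot, but one data point per level.
p_point <- ggplot(crimeLevelTotal) +
  geom_point(aes(x = level, y = total))

# level is qualitative, so group = 1 puts every position on one line.
p_line <- ggplot(crimeLevelTotal) +
  geom_line(aes(x = level, y = total, group = 1))

p_point
p_line
```

Without group = 1, {ggplot2} would treat each level as its own group, and a group with a single position cannot form a line.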
Aesthetics
An aesthetic is something that affects how a geom is drawn.
An aesthetic mapping specifies that a data variable controls how a geom is drawn.
x and y aesthetics map data values to positions.
size aesthetics map data values to size.
shape aesthetics map data values to shape.
colour aesthetics map data values to colour.
fill aesthetics map data values to fill colour.
Different geoms can have different aesthetics.
For example, geom_point() has a shape aesthetic.
The same aesthetic can mean different things for different geoms.
x and y aesthetics map data values to positions; for geom_line() that means x- and y-locations.
x and y aesthetics map data values to positions; for geom_col() that means the x-centre and y-top.
Values specified outside of aes() are settings (not mappings).
Aesthetic | Meaning | Aesthetic | Meaning |
---|---|---|---|
x | Position (x) | y | Position (y) |
width | Length (x) | height | Length (y) |
colour | Border colour | fill | Fill colour |
alpha | Level of transparency | | |
linetype | Line style | linewidth | Line width |
shape | Shape of points | | |
size | Size of points and text | | |
label | Text label | | |
family | Font family | fontface | Font face |
hjust | Justification (x) | vjust | Justification (y) |
lineheight | Interline spacing | | |
group | Group membership | | |
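The difference between a mapping and a setting can be seen by placing fill inside and then outside of aes(). This is a sketch only, using a hand-built copy of crimeLevelTotal; the colour "steelblue" and the names p_mapped and p_set are arbitrary choices for illustration.

```r
library(ggplot2)

# Hand-built copy of the level and total columns (an assumption).
crimeLevelTotal <- data.frame(
  level = factor(
    c("Low", "Low-Medium", "Medium", "Medium-High", "High"),
    levels = c("Low", "Low-Medium", "Medium", "Medium-High", "High"),
    ordered = TRUE
  ),
  total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# Mapping: fill is inside aes(), so the data variable level controls
# the fill colour and each column gets its own colour.
p_mapped <- ggplot(crimeLevelTotal) +
  geom_col(aes(x = level, y = total, fill = level))

# Setting: fill is outside aes(), so every column gets the same
# constant colour, regardless of the data.
p_set <- ggplot(crimeLevelTotal) +
  geom_col(aes(x = level, y = total), fill = "steelblue")

p_mapped
p_set
```

A mapping also produces a legend, because the reader needs to decode the colours back to data values; a setting does not.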
Scales
Scales control how data values are mapped to aesthetic values.
{ggplot2} will provide a default scale, but we can change the defaults using scale functions.
scale_x_continuous() controls the axis range and the axis labelling when quantitative data is mapped to x.
scale_x_discrete() controls the axis range and labelling when qualitative data is mapped to x.
There are also scale_y_continuous() and scale_y_discrete().
scale_colour_continuous() controls the colour gradient when quantitative data is mapped to colour.
scale_colour_discrete() controls the palette of colours when qualitative data is mapped to colour.
There are also scale_fill_continuous() and scale_fill_discrete().
We will talk more about selecting colours later on.
scale_size() controls the range of sizes used (in millimetres).
There are “manual” versions of scales, e.g., scale_shape_manual() and scale_linetype_manual(), that allow a set of explicit values to control the aesthetic mappings.
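Scale functions are added to a plot with + in the same way as geoms. The sketch below overrides two defaults on a hand-built copy of crimeLevelTotal: the y-axis range and breaks, and the fill palette (via scale_fill_manual(), the manual scale for the fill aesthetic). The limits, breaks, and hex colours are arbitrary illustrative values, not recommendations.

```r
library(ggplot2)

# Hand-built copy of the level and total columns (an assumption).
crimeLevelTotal <- data.frame(
  level = factor(
    c("Low", "Low-Medium", "Medium", "Medium-High", "High"),
    levels = c("Low", "Low-Medium", "Medium", "Medium-High", "High"),
    ordered = TRUE
  ),
  total = c(24242L, 21513L, 13823L, 18315L, 7793L)
)

# One explicit fill colour per level, in level order (illustrative).
fills <- c("#edf8fb", "#b3cde3", "#8c96c6", "#8856a7", "#810f7c")

p_scaled <- ggplot(crimeLevelTotal) +
  geom_col(aes(x = level, y = total, fill = level)) +
  # Quantitative data on y: set the axis range and the tick positions.
  scale_y_continuous(limits = c(0, 30000),
                     breaks = seq(0, 30000, by = 10000)) +
  # Qualitative data on fill: replace the default palette.
  scale_fill_manual(values = fills)
p_scaled
```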
Stats
Some geoms do not map the raw data.
Some geoms apply a stat to transform the raw data to a set of data statistics.
The data statistics are then mapped to the aesthetics of a geom.
The offenders data frame contains information on individual offenders, including age group.
Age.Group
1 15-19
2 15-19
3 20-24
4 20-24
5 20-24
6 20-24
In order to draw a bar plot of the number of offenders per age group, we need a table of counts.
0-4 5-9 10-14 15-19 20-24
1 91 4036 11648 10920
25-29 30-34 35-39 40-44 45-49
11879 11468 8980 7096 5642
50-54 55-59 60-64 65-69 70-74
4446 2983 1853 984 616
75-79 80yearsorover NotSpecified
268 252 7
geom_bar() takes individual raw data and uses stat_count() to generate a table of counts.
The unique categories map to the bar position and the counts map to the bar lengths.
geom_histogram() uses stat_bin().
This bins raw quantitative data and maps the resulting bin boundaries and counts to the position and length of bars.
geom_density() uses stat_density().
This calculates a density estimate from raw quantitative data and maps the resulting density to the position of a line.
geom_smooth() uses stat_smooth().
This takes raw quantitative x and y data, calculates smoothed y values, and maps those to the position of a line.
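The counting step performed by geom_bar() can be illustrated on a small scale. The sketch below assumes a stand-in offenders data frame that contains only the six rows printed earlier, so its counts are for those six rows and not the full table of counts.

```r
library(ggplot2)

# Stand-in for offenders: only the six printed rows (an assumption;
# the full data frame has many more rows and columns).
offenders <- data.frame(
  Age.Group = c("15-19", "15-19", "20-24", "20-24", "20-24", "20-24")
)

# The table of counts that we would otherwise have to build ourselves.
table(offenders$Age.Group)

# geom_bar() only needs the raw categories; stat_count() does the
# counting, then categories map to x and counts map to bar length.
p_bar <- ggplot(offenders) +
  geom_bar(aes(x = Age.Group))
p_bar
```

Note that no y mapping is given: the y values come from the stat, not from a column in the data frame.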
The RWCperGame data frame contains measures of performance at the Rugby World Cup of 2023 for different countries, plus the hemisphere that each country is from.
# A tibble: 6 × 11
country hemisphere yellowcards redcards cleanbreaks
<chr> <fct> <dbl> <dbl> <dbl>
1 Namibia South 1 0.5 2.5
2 Romania North 1.25 0 2.75
3 Chile South 1.25 0 3.75
4 Samoa South 1.25 0.25 3.75
5 Australia South 0.5 0 5.25
6 Georgia North 0.5 0 5.25
# ℹ 6 more variables: tackles <dbl>, points <dbl>,
# conversions <dbl>, offloads <dbl>, tries <dbl>,
# runs <dbl>
Note that if we produce a data symbol from data statistics, we can only map back to the data statistics.
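The data statistics computed by a stat can be mapped explicitly with after_stat(). The sketch below assumes a stand-in offenders data frame with only the six rows printed earlier, and maps the prop statistic from stat_count() to y in place of the default count.

```r
library(ggplot2)

# Stand-in for offenders: only the six printed rows (an assumption).
offenders <- data.frame(
  Age.Group = c("15-19", "15-19", "20-24", "20-24", "20-24", "20-24")
)

# after_stat(prop) refers to a statistic computed by stat_count(),
# not to a column of offenders. group = 1 makes the proportions
# relative to all rows, not relative to each bar on its own.
p_prop <- ggplot(offenders) +
  geom_bar(aes(x = Age.Group, y = after_stat(prop), group = 1))
p_prop
```

The bars now encode proportions, so reading a bar back gives a proportion for an age group; the individual offenders behind it cannot be recovered from the data symbol.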
The help pages for individual geoms describe the aesthetics that the geom understands and the variables that its stat computes.
Summary
Exercises
We will produce several data visualisations using the rates data frame.
Age Year Rate
1 10 2010 66
2 11 2010 120
3 12 2010 211
4 13 2010 392
6 10 2011 62
7 11 2011 107
Identify the geoms, aesthetic mappings, and scales employed in this data visualisation.
Write {ggplot2} code to produce the data visualisation (ignoring labels).
Identify the geoms, aesthetic mappings, and scales employed in this data visualisation.
Write {ggplot2} code to produce the data visualisation (ignoring labels).
The next data visualisation uses the RWCperGame data frame.
country hemisphere yellowcards redcards cleanbreaks tackles points
11 Namibia South 1.00 0.50 2.50 102.00 9.25
14 Romania North 1.25 0.00 2.75 142.50 8.00
3 Chile South 1.25 0.00 3.75 132.25 6.75
15 Samoa South 1.25 0.25 3.75 109.50 23.00
2 Australia South 0.50 0.00 5.25 108.25 22.50
7 Georgia North 0.50 0.00 5.25 149.75 16.00
conversions offloads tries runs
11 0.50 2.25 0.75 92.00
14 0.75 1.50 1.00 81.00
3 0.50 5.25 1.00 102.25
15 2.00 9.25 2.75 102.50
2 1.75 7.75 2.75 110.75
7 1.00 8.00 1.75 114.00
What values are being mapped in the data visualisation below?
What stat is being used?