Data Visualization with ggplot2

Overview

Teaching: 60 min
Exercises: 30 min

Questions

How do I plot my data in R?

How do I adjust plots to better display my data?

Objectives

Describe the role of data, aesthetics, and geoms in ggplot functions.

Choose aesthetics and alter the geom parameters for a scatter plot, histogram, or box plot.

Customize plot scales, titles, subtitles, themes, fonts, layout, and orientation.

Apply a facet to a plot.

Save a ggplot to a file.

List resources for getting help with ggplot and creating informative scientific plots.

We start by loading the package ggplot2.

library(ggplot2)

The gg in ggplot2 stands for “grammar of graphics”. The idea is that any tidy data set can be made into an informative plot in a systematic way, by mapping variables in the data to positions, colours, shapes, and sizes of visual output. These visual features are referred to as “aesthetics”, or “aes” for short. There are different ways of mapping aesthetics to visual features - bar plots, line plots, and so on. These differ in their geometry and so are referred to as “geometries”, or “geom” for short.

Some simple examples will help. We will first plot one of R’s built-in datasets, the Iris flower dataset, which records the sizes of iris flowers of different species, measured in the 1930s. The data set gives the length and width of two different parts of the flower, the sepal and the petal, for three species (Iris setosa, Iris virginica, and Iris versicolor).

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
        5.1         3.5          1.4         0.2  setosa
        4.9         3.0          1.4         0.2  setosa
        4.7         3.2          1.3         0.2  setosa
        4.6         3.1          1.5         0.2  setosa
        5.0         3.6          1.4         0.2  setosa
        5.4         3.9          1.7         0.4  setosa

This scatter plot displays points (geom_point), where:

x-axis position is given by Petal.Width
y-axis position is given by Petal.Length
colour is given by Species

ggplot(data = iris, aes(x = Petal.Width, y = Petal.Length, colour = Species)) +
  geom_point()

We could just as easily make a scatter plot where:

x-axis position is given by Species
y-axis position is given by Petal.Length
colour is given by Petal.Width

ggplot(data = iris, aes(x = Species, y = Petal.Length, colour = Petal.Width)) +
  geom_point()

You may find this plot less informative, or even uglier. This makes the point that the choice of how to map data features onto visual aesthetics is important.

Plotting with `ggplot2`

ggplot2 is a plotting package that makes it simple to create complex plots from a tidy data frame. It provides a programmatic interface for specifying what variables to plot, how they are displayed, and other visual properties. This helps in creating publication quality plots with minimal amounts of adjustments and tweaking.

ggplot2 works with tidy data, i.e., a column for every variable, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2. If you need to reorganise your data, you can use other tidyverse packages such as dplyr and tidyr, and then afterwards plot the reorganised data using ggplot2.

ggplot graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

To build a ggplot, we will use the following basic template that can be used for different types of plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) +  <GEOM_FUNCTION>()

The ggplot() function sets up an empty plot, which is “bound” to a specific data frame, using the data argument. We will use the variants data frame from earlier in this workshop.

ggplot(data = variants)

The aesthetic (aes) function defines a mapping by selecting the variables to be plotted and specifying how to present them in the plot. Here we use genomic position POS for the x position, and read depth DP for the y position.

ggplot(data = variants, aes(x = POS, y = DP))

‘geoms’ provide graphical representations of the data in the plot. ggplot2 offers many different geoms; we will use some common ones today, including:

  * `geom_point()` for scatter plots, dot plots, etc.
  * `geom_boxplot()` for, well, boxplots!
  * `geom_line()` for trend lines, time series, etc.

To add a geom to the plot we use the + operator. Because we have two continuous variables, let’s use geom_point() first:

ggplot(data = variants, aes(x = POS, y = DP)) +
  geom_point()

The + in the ggplot2 package allows you to modify existing ggplot objects, by adding more plot elements. This means you can easily set up plot templates and conveniently explore different types of plots, like this:

# Assign plot to a variable
coverage_plot <- ggplot(data = variants, aes(x = POS, y = DP))

# Draw the plot
coverage_plot +
    geom_point()
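
As a minimal sketch of this template idea, the example below uses the built-in iris data instead of variants so that it runs on its own. The object names (petal_base, petal_points, petal_boxes) are purely illustrative.

```r
library(ggplot2)

# Base plot: data and aesthetic mappings only, no geom yet
petal_base <- ggplot(data = iris, aes(x = Species, y = Petal.Length))

# Reuse the same base with two different geoms
petal_points <- petal_base + geom_point(alpha = 0.5)
petal_boxes  <- petal_base + geom_boxplot()

# Typing the name of a plot object draws it
petal_points
petal_boxes
```

Because the base object is never modified, you can keep adding different layers to it while you explore.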

Notes

Data and aesthetics in the ggplot() function can be seen by geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes().
You can instead specify mappings for a given geom to replace or add to those mappings defined globally in the ggplot() function.
The + sign used to add new layers must be placed at the end of the line containing the previous layer. This is because R assumes that the end of a line is the end of a command, unless you tell it otherwise. So, if the + sign is added at the beginning of the next line containing the new layer, ggplot2 will not add that new layer and will return an error message.

# This is the correct syntax for adding layers
coverage_plot +
  geom_point()

# This will not add the new layer and will return an error message
coverage_plot
  + geom_point()
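
The difference between global and per-geom mappings can be sketched as follows. This example assumes the built-in iris data rather than variants, and the name layered_plot is only for illustration.

```r
library(ggplot2)

# Global mappings: x and y are seen by every layer
layered_plot <- ggplot(data = iris, aes(x = Species, y = Petal.Length)) +
  # This layer uses only the global x and y mappings
  geom_boxplot() +
  # Local mapping: colour is added for the points only
  geom_point(aes(colour = Petal.Width), alpha = 0.5)

layered_plot
```

The boxplots are drawn without any colour mapping, while the points are coloured by Petal.Width, because that mapping was given only to geom_point().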

Building your plots iteratively: geoms.

Building plots with ggplot2 is typically an iterative process. We start by defining the dataset we’ll use, lay out the axes, and choose a geom:

ggplot(data = variants, aes(x = POS, y = DP)) +
  geom_point()

To be clear, this plot tells us the position and read depth of each genomic variant.

To learn more, we start modifying this plot. For example, nearby points may be “overplotted”, so that it is hard to distinguish one point from many nearby points. To reduce overplotting, we can add transparency using the alpha aesthetic:

ggplot(data = variants, aes(x = POS, y = DP)) +
    geom_point(alpha = 0.5)

Using colors as an aesthetic

If a different color makes the plot clearer, we can set one color for all the points:

ggplot(data = variants, aes(x = POS, y = DP)) +
  geom_point(alpha = 0.5, color = "blue")

To add sample-specific visual information, we can color each sample in the plot differently. ggplot2 is designed to make it easy to map variables to visual aesthetics, such as color. Here is an example where we color with sample_id:

ggplot(data = variants, aes(x = POS, y = DP, color = sample_id)) +
  geom_point(alpha = 0.5)

Challenge:

What is the difference between color = sample_id (no quotes) and color = "sample_id" (quotes)?

What is the difference between specifying color inside the aes() function, or outside?
Solution
ggplot(data = variants, aes(x = POS, y = DP, color = "sample_id")) +
 geom_point(alpha = 0.5)
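In this case, ggplot interprets color = "sample_id" as mapping the constant string "sample_id", not the variable sample_id, so every point gets the same color. More generally, color inside aes() is mapped to data, whereas color outside aes(), as in geom_point(color = "blue"), is set to one fixed value for all points. These distinctions turn out to be important.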
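
The three cases can be compared side by side in a short sketch. It assumes the built-in iris data so that it runs without the variants data frame; the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(data = iris, aes(x = Petal.Width, y = Petal.Length))

# Inside aes(), unquoted: colour is mapped to the Species column,
# giving one colour per species and a legend
mapped <- base + geom_point(aes(colour = Species))

# Inside aes(), quoted: the constant string "Species" is mapped,
# giving a single colour and a legend with one entry
quoted <- base + geom_point(aes(colour = "Species"))

# Outside aes(): colour is set to a fixed value, with no legend
fixed <- base + geom_point(colour = "blue")

mapped
quoted
fixed
```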

We can plot the same aesthetic mappings using different geometries. For example, we can change the geom from point to column, and positions and colors will be still determined in the same way, but they will look different:

ggplot(data = variants, aes(x = POS, y = DP, color = sample_id)) +
  geom_col(alpha = 0.5)

To find out more about the different aesthetics, you can use the help functions, such as ?geom_point and ?geom_col. There are many more useful resources that we link to below.

Building plots iteratively: axis labels

To make our plot more interpretable, we can add axis labels:

ggplot(data = variants, aes(x = POS, y = DP, color = sample_id)) +
  geom_jitter(alpha = 0.5) +
  labs(x = "Base Pair Position",
       y = "Read Depth (DP)")

ggplot also allows you to change the way in which variables are mapped to aesthetics, using scales. For example, variables that are widely dispersed may be easier to visualise on a log scale, where equal spacings on the page represent equal fold-changes in the variable. Here we use the built-in scale_y_log10 to display the y-axis on a log10 scale.

ggplot(data = variants, aes(x = POS, y = DP, color = sample_id)) +
  geom_jitter(alpha = 0.5) +
  labs(x = "Base Pair Position",
       y = "Read Depth (DP)") +
  scale_y_log10()

Although we could transform the x-axis using scale_x_log10, that would be absurd for genomic position. To find out more about different scales, you can again use the help functions, such as ?scale_y_log10. Or, type scale_ and use tab autocompletion to find out what other options are available.
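
To see what “equal spacings represent equal fold-changes” means, here is a small sketch with made-up numbers. The fold_changes data frame is hypothetical, and layer_data() is a ggplot2 helper, not otherwise used in this lesson, that returns the values a layer actually draws.

```r
library(ggplot2)

# Hypothetical data: each value is ten times the previous one
fold_changes <- data.frame(step = 1:4, value = c(1, 10, 100, 1000))

log_plot <- ggplot(data = fold_changes, aes(x = step, y = value)) +
  geom_point() +
  scale_y_log10()

log_plot

# y positions after the log10 transformation
layer_data(log_plot)$y
```

On the log10 axis the four points are evenly spaced, even though the raw values span three orders of magnitude.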

Challenge

Use what you just learned to create a scatter plot of mapping quality (MQ) over position (POS) with the samples showing in different colors. Make sure to give your plot relevant axis labels.

Does it make sense to try a different y-axis scale here?
Solution
 ggplot(data = variants, aes(x = POS, y = MQ, color = sample_id)) +
  geom_point() +
  labs(x = "Base Pair Position",
       y = "Mapping Quality (MQ)")

Creating subplots by Faceting

ggplot2 has a feature called faceting that allows the user to split one plot into multiple plots based on a factor included in the dataset. This offers an alternative to distinguishing the levels of a factor by, for example, colour or shape. We will use it to split our read depth plot into three panels, one for each sample.

ggplot(data = variants, aes(x = POS, y = DP, color = sample_id)) +
 geom_point() +
 labs(x = "Base Pair Position",
      y = "Read Depth (DP)") +
 facet_grid(. ~ sample_id)

This looks ok, but it would be easier to read if the plot facets were stacked vertically rather than horizontally. The facet_grid function allows you to explicitly specify how you want your plots to be arranged via formula notation (rows ~ columns; a . can be used as a placeholder that indicates only one row or column).

ggplot(data = variants, aes(x = POS, y = DP, color = sample_id)) +
 geom_point() +
 labs(x = "Base Pair Position",
      y = "Read Depth (DP)") +
 facet_grid(sample_id ~ .)

In this case, stacking the plots vertically makes sense as all samples measure read depth along the same genome, represented by the x-axis.
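
The rows ~ columns notation can be tried out on any data set with a grouping variable. The sketch below assumes the built-in iris data, with Species playing the role of sample_id; the object names are illustrative.

```r
library(ggplot2)

petal_scatter <- ggplot(data = iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point()

# Species on the row side of the formula: panels stacked vertically
facet_rows <- petal_scatter + facet_grid(Species ~ .)

# Species on the column side of the formula: panels side by side
facet_cols <- petal_scatter + facet_grid(. ~ Species)

facet_rows
facet_cols
```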

Building plots iteratively: adjusting plot appearance with themes

Plots have many visual elements besides the data - axis and tick labels, grid lines, background colours, and so on. All of these can vary in their appearance. Text elements vary in font. Most elements vary by their size and color, and even by their presence or absence. ggplot controls the appearance of these other plot elements with themes.

For example, plots with a white background are often easier to read when printed. We can set the background to white using the built-in theme theme_bw():

ggplot(data = variants, aes(x = POS, y = MQ, color = sample_id)) +
  geom_point() +
  labs(x = "Base Pair Position",
       y = "Mapping Quality (MQ)") +
  facet_grid(sample_id ~ .) +
  theme_bw()

We can remove individual plot elements that aren’t needed, for example the grid lines:

ggplot(data = variants, aes(x = POS, y = MQ, color = sample_id)) +
  geom_point() +
  labs(x = "Base Pair Position",
       y = "Mapping Quality (MQ)") +
  facet_grid(sample_id ~ .) +
  theme_bw() +
  theme(panel.grid = element_blank())

Challenge

Use what you just learned to create a scatter plot of PHRED scaled quality (QUAL) over position (POS) with the samples showing in different colors. Make sure to give your plot relevant axis labels.
Solution
 ggplot(data = variants, aes(x = POS, y = QUAL, color = sample_id)) +
  geom_point() +
  labs(x = "Base Pair Position",
       y = "PHRED-scaled Quality (QUAL)") +
  facet_grid(sample_id ~ .)

In addition to theme_bw(), which changes the plot background to white, ggplot2 comes with several other themes which can be useful to quickly change the look of your visualization. The complete list of themes is available at https://ggplot2.tidyverse.org/reference/ggtheme.html. theme_minimal() and theme_light() are popular, and theme_void() can be useful as a starting point to create a new hand-crafted theme.
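
Titles, subtitles, and fonts can be adjusted in the same step-by-step way. The following is a minimal sketch assuming the built-in iris data; the label text and the particular theme settings are illustrative choices, not recommendations from this lesson.

```r
library(ggplot2)

themed_plot <- ggplot(data = iris,
                      aes(x = Petal.Width, y = Petal.Length, colour = Species)) +
  geom_point() +
  labs(title = "Iris petal dimensions",
       subtitle = "Built-in iris data set",
       x = "Petal width (cm)",
       y = "Petal length (cm)") +
  # Start from a complete built-in theme ...
  theme_minimal() +
  # ... then adjust individual elements
  theme(text = element_text(size = 14),              # base font size
        plot.title = element_text(face = "bold"),    # bold title
        axis.text.x = element_text(angle = 45, hjust = 1))  # rotated tick labels

themed_plot
```

The call to theme() comes after theme_minimal() because a complete theme added later would replace the individual adjustments.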

The ggthemes package provides a wide variety of options (including an Excel 2003 theme). The ggplot2 extensions website provides a list of packages that extend the capabilities of ggplot2, including additional themes.

Making barplots to count the number of observations in groups

Plots with similar appearances can mean different things. Earlier we used geom_col to plot read depth (DP) as the height of a column. However, bar plots are commonly used to count the number of observations in groups. ggplot distinguishes between geom_col, which plots the raw data supplied, and geom_bar, which does the counting for you.
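
A small sketch of this distinction, assuming the built-in iris data (which has 50 flowers per species) rather than variants. The object names are illustrative.

```r
library(ggplot2)

# geom_bar() counts the rows in each group for you
bar_counts <- ggplot(data = iris, aes(x = Species)) +
  geom_bar()

# geom_col() plots the values you supply, so we count first
species_counts <- as.data.frame(table(Species = iris$Species))
col_counts <- ggplot(data = species_counts, aes(x = Species, y = Freq)) +
  geom_col()

bar_counts
col_counts
```

Both plots show the same bar heights; the difference is whether ggplot or you did the counting.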

We can create barplots of counts using geom_bar: Let’s make a barplot showing the number of variants for each sample that are indels.

ggplot(data = variants, aes(x = INDEL)) +
  geom_bar() +
  facet_grid(sample_id ~ .)

Challenge

Since we already have the sample_id labels on the individual plot facets, we don’t need the legend. Use the help file for geom_bar and any other online resources you want to use to remove the legend from the plot.
Solution
ggplot(data = variants, aes(x = INDEL, color = sample_id)) +
   geom_bar(show.legend = FALSE) +
   facet_grid(sample_id ~ .)

There is much more flexibility in terms of statistical summaries and counting in ggplot, but we don’t have time to cover it in this lesson.

To export a plot to a file, use the Export button, in the bottom right plotting window in RStudio. Or you can write this into a script using the function ggsave - remember, use ?ggsave to ask for help from your R session.
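
A minimal sketch of saving from a script, assuming the built-in iris data. The file name, the PDF format, and the dimensions are illustrative choices, and tempdir() is used only so that the example does not write into your project folder.

```r
library(ggplot2)

petal_plot <- ggplot(data = iris,
                     aes(x = Petal.Width, y = Petal.Length, colour = Species)) +
  geom_point()

# Hypothetical output location
out_file <- file.path(tempdir(), "petal_plot.pdf")

# The file format is taken from the extension; width and height are in inches
ggsave(filename = out_file, plot = petal_plot,
       width = 6, height = 4, units = "in")
```

Passing the plot explicitly avoids relying on ggsave's default, which is to save the last plot displayed.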

Challenge

With all of this information in hand, please take another five minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own. Use the RStudio ggplot2 cheat sheet for inspiration. Here are some ideas:

See if you can change the size or shape of the plotting symbol.

Can you find a way to change the name of the legend? What about its labels?

Try using a different color palette (see http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/).

Tip: Where to learn more about plotting

There are amazing online and print resources for learning more about ggplot2. Some of them talk about the ideas behind plotting, and give galleries of examples. Others are more practical about how to program with ggplot2.

R for ecology ggplot2: Episode from Data Carpentry lesson introducing ggplot with a different example dataset.

ggplot2: elegant graphics for data analysis: By Hadley Wickham, Danielle Navarro, and Thomas Lin Pedersen. Gives some basics of ggplot2, but its primary focus is explaining the Grammar of Graphics that ggplot2 uses. https://ggplot2-book.org/

Data Visualization: A Practical Introduction: By Kieran Healy. Teaches how to make data visualizations with R and ggplot2 in a clear, sensible, and reproducible way. Accompanied by the socviz package with example code.

Fundamentals of Data Visualization: By Claus Wilke. A guide to making visualizations that accurately reflect the data, tell a story, and look professional. Complete source code, with plots made in ggplot2, is available.

Key Points

ggplot2 maps variables in a tidy data frame to visual aesthetics such as position and color.

Plots are built iteratively from different components: data, geoms, scales, and themes.

You can customize any aspect of a plot to best display your data.
