How to Make a Plot in R
Mar 06, 2023
Grammar of Graphics
A grammar of graphics is a layered approach to creating a plot. The basic idea is to build a plot up layer by layer, specifying different parts of the plot as you go. The main plotting package in R is called
ggplot2, and the gg stands for grammar of graphics. This package is one of the many packages included in the
tidyverse collection of packages. Check out the How to Get Started Using R post for more about packages.
To create a plot in R, we start by specifying the data layer, the data we want to visualize. Then, we define how the data will be mapped to plot elements, such as x and y, using aesthetic mappings. After that, we add geometric objects to define how those plot elements will be displayed, for example as points, lines, or bars. Next, we can define scales, coordinate systems, and statistical transformations to change how the data is presented. Finally, we can add labels and legends and adjust the theme, or the look and feel, of the plot.
Every piece of a plot can be defined programmatically and is completely customizable, making this a very powerful process. Admittedly, it can feel a bit overwhelming at first, and it takes some time to get a feel for the different functions used in the process. But the layered approach and the flow of the process remain the same with each plot, allowing us to repeat the same framework each time and experiment with different visual elements.
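To make the layer order concrete, here is a rough sketch of how the pieces might stack up in code. It assumes the ggplot2 and palmerpenguins packages (both introduced later in this post) are installed. The palette, title, and axis labels are illustrative choices, not something this post prescribes.

```r
library(ggplot2)
library(palmerpenguins)

# Each line adds one layer of the grammar of graphics
p <- ggplot(data = penguins,                              # data layer
            mapping = aes(x = bill_depth_mm,              # aesthetic mappings
                          y = bill_length_mm,
                          color = species)) +
  geom_point(na.rm = TRUE) +                              # geometry: one point per penguin
  scale_color_brewer(palette = "Dark2") +                 # scale: which colors to use (illustrative)
  labs(title = "Penguin bill dimensions",                 # labels (illustrative wording)
       x = "Bill depth (mm)",
       y = "Bill length (mm)",
       color = "Species") +
  theme_minimal()                                         # theme: overall look and feel

p  # printing the object draws the plot
```

Storing the plot in an object such as p is optional, but it makes it easy to keep adding layers one at a time while experimenting.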
Making Our First Plot
Let’s make our first plot!
To begin, create a new R script (refer to the R Script section of the How to Get Started Using R post for a reminder on how to create an R script).
The first step for any R script is to load the packages required. We will be using two packages for our first few plots: tidyverse and palmerpenguins.
Make sure you have both packages installed before loading them.
The tidyverse package is actually a collection of packages used for data processing, analysis, and visualization. When we load it, we will see a list of all the packages included. For plotting, we use functions from the ggplot2 package. We will be using functions from other packages included in the collection later on.
The palmerpenguins package contains the data we’ll use for our first plot (and several follow-on plots).
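If either package is missing, a one-time installation from CRAN would typically take care of it. The snippet below is a minimal sketch that assumes an internet connection and a configured CRAN mirror. The helper names pkgs, is_installed, and missing_pkgs are made up for illustration.

```r
# Packages this post relies on
pkgs <- c("tidyverse", "palmerpenguins")

# Check which of them can already be found on this machine
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Install only what is missing; this needs to happen once, not in every session
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```

Installation is usually run once from the Console rather than kept in the script, since only loading has to be repeated in each session.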
The data in the
palmerpenguins package is a set of penguin observations. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
There are three species of penguins in the data: Adélie, Chinstrap, and Gentoo. The data set includes information about the penguins such as the island on which they were observed, their bill length and depth, and their flipper length. Here are ten rows from the data set.
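One way to print ten rows yourself, assuming the palmerpenguins package is installed, is head(). This sketch shows the first ten rows, which are not necessarily the ten rows the author picked.

```r
library(palmerpenguins)

# First ten rows of the penguins data set
ten_rows <- head(penguins, 10)
ten_rows
```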
Loading the packages gives us access to the data and functions included within them. We have to load packages every time we open RStudio. It’s a good practice to load your packages at the top of your R script.
Let’s do that now. Type the following code at the top of your R script in the Source pane. Then highlight those two lines and press CTRL / COMMAND + ENTER to run the code. You should see the output printed below in your Console.
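The two lines in question are presumably the library() calls for the two packages named earlier. A minimal sketch, assuming both are installed:

```r
# Load the packages at the top of the script
library(tidyverse)       # attaches ggplot2, dplyr, and the rest of the collection
library(palmerpenguins)  # makes the penguins data set available
```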
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1
✔ readr   2.1.2      ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
The messages under library(tidyverse) in the Console list the packages loaded in the collection. Notice ggplot2, which is our main plotting package. There is also a message about Conflicts. This portion of the message tells us that some functions, specifically filter() and lag() from the dplyr package, are masking (taking precedence over) the functions with those same names in the stats package, which is part of base R. The two colons :: after the package name help us identify the source of the function. For example, dplyr::filter() can be read as “the filter function from the dplyr package”.
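To see why the distinction matters, here is a small sketch that calls both versions of filter() explicitly with the :: notation. It assumes both packages are installed; the object names gentoo and moving_avg are made up for illustration.

```r
library(tidyverse)
library(palmerpenguins)

# dplyr's filter(): keep only the rows that match a condition
gentoo <- dplyr::filter(penguins, species == "Gentoo")

# stats' filter(): apply a linear filter to a numeric series
# (here, a centered three-point moving average)
moving_avg <- stats::filter(1:10, rep(1 / 3, 3))

# After library(tidyverse), a plain filter() refers to the dplyr version
identical(filter, dplyr::filter)
```

The masked stats functions are still available; they just need the stats:: prefix.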
We’ve now completed the first two steps of creating a plot in R - we created an R script, then we loaded the packages we need.
The next step is to load our data. In this case, we’ve actually already done that by loading the
palmerpenguins package, but in later examples, we’ll read a file to load the data.
Let’s take a look at the data in the
penguins data set. In your R script, type the following then press CTRL / COMMAND + ENTER to run the code.
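The command meant here is view(penguins), as named in the caption and discussion that follow. The sketch below wraps it in an interactive() check, which is an addition of mine so the script also runs outside RStudio, and adds glimpse() as a console-only alternative.

```r
library(tidyverse)
library(palmerpenguins)

# Open the spreadsheet-style viewer (only makes sense in an interactive session)
if (interactive()) {
  view(penguins)
}

# Console alternative: one line per column, with its type and first few values
glimpse(penguins)
```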
This code will open a table view of the
penguins data set. The screenshot below shows part of this view.
A screenshot of the penguins data set table view accessed by running view(penguins).
There are a few things to note about this table view we have accessed by running
view(penguins). First, the column names are shown at the top of each column in bold font. The first column is named “species”, the second column is named “island”, and so on. Second, the bottom of the view says “Showing 1 to 19 of 344 entries, 8 total columns”. This tells us that our data set has 344 rows and 8 columns. We can also see the individual data points available and where there is missing data indicated by an NA.
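The same facts can be checked from the Console without the table view. This is a small sketch using base R functions, assuming the palmerpenguins package is loaded.

```r
library(palmerpenguins)

# Number of rows and columns
dim(penguins)

# Column names, in order
names(penguins)

# Count of missing values (NA) in each column
colSums(is.na(penguins))
```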
We loaded the necessary packages and we have our data, so now we’re ready to create our first plot!
Referring back to the grammar of graphics principle, we build our plot layer by layer. The first layer is the data, and we use a function called
ggplot() to specify the data we will use for our plot. Let’s see how this works.
Within the ggplot() function there is an argument (or input) called data. We use the equals sign to set the data argument equal to our penguins data set.
ggplot(data = penguins)
If you type and run the code above in your R script, you will see a gray square appear in the Plots pane. This is the foundation of our first plot.
The second layer is the mapping which tells R what columns or variables in our data set we want to place on certain elements of the plot. Let’s see if there is a relationship between the bill depth and the bill length of the penguins in our data set. To explore this relationship, we will put the bill depth on the x-axis and the bill length on the y-axis. We tell R to do this using the
mapping argument inside the ggplot() function. After the data = penguins argument within the ggplot() function, we add a comma to tell R we’re going to add more arguments or inputs. Then we write mapping = to specify that we’re now going to define the aesthetic mappings, or how we want R to place variables onto plot elements.
This time, we have to use another function called aes() on the right-hand side of the equals sign. The aes() function allows us to define how variables in our data are mapped to visual elements (or aesthetics) of the plot. Within this function, we set the x and y arguments to the column names from our data set for bill depth and bill length.
ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm))
Running this code adds grid lines to our plot and puts each of the variables on their respective axes. But we don’t see any data yet. This brings us to the third layer of the process, adding geometries. Geometries tell R how we want to see the data. In this case, we want to see a point for each penguin in the data set showing its bill depth and bill length, so we add on to the
ggplot() function by adding a
+ to the end of the line of code and then typing
geom_point(). This will add points for each x and y coordinate defined by our data and mapping arguments.
ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point()
If you ran the code above, you received a warning in the Console: “Warning: Removed 2 rows containing missing values (geom_point).” This warning tells us that two rows of data had missing values for bill depth or bill length (in this case, these rows are actually missing values for both), so the plot does not include those two rows.
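To see which rows triggered the warning, one option is to filter for missing bill measurements. The sketch below assumes both packages are loaded, and missing_bills is a name made up for illustration. It also shows the na.rm = TRUE argument of geom_point(), which drops the same rows without printing the warning.

```r
library(tidyverse)
library(palmerpenguins)

# Rows where either bill measurement is missing
missing_bills <- penguins %>%
  filter(is.na(bill_depth_mm) | is.na(bill_length_mm))
missing_bills

# Same plot, with missing values removed silently
ggplot(data = penguins, mapping = aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point(na.rm = TRUE)
```

Leaving the warning on is a reasonable default while exploring, since it acts as a reminder that some observations are not shown.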
Now that we have our basic plot created by defining the data, mapping, and geometry, we can add on to it. Can we add color to each point to show the species?
Yes, we can! We do this by adding to the
aes() function in the
mapping argument. After
y = bill_length_mm, add a comma and then type
color = species. Now, R will map data in the species column onto the color for each point.
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point()
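It may help to contrast mapping a column to color with setting a single fixed color. In the sketch below, the first plot puts color inside aes(), so color varies with species and a legend appears. The second passes a constant to geom_point(), so every point gets the same color. The object names and the color "steelblue" are illustrative choices.

```r
library(ggplot2)
library(palmerpenguins)

# Color mapped to a variable: goes inside aes()
p_mapped <- ggplot(data = penguins,
                   mapping = aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point(na.rm = TRUE)

# Color set to a constant: goes outside aes(), as an argument to the geometry
p_constant <- ggplot(data = penguins,
                     mapping = aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point(color = "steelblue", na.rm = TRUE)

p_mapped
p_constant
```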
Now you've made your first plot in R!
This is an excerpt from my upcoming book, Data Viz in R.