Web scraping in R

Antoine Soetewey 2023-01-16 15 minute read

Introduction
- HTML and CSS
- Web scraping vs. APIs
  - Why does web scraping exist if APIs are so powerful and do exactly the same work?
Web scraping in R
A real application of web scraping in R
To go further
Conclusion

Note: This post has been written in collaboration with Pietro Zanotta.

Introduction

Almost everyone is familiar with web pages (otherwise you would not be here), but what if we told you that the way you see a site is different from the way Google or your browser sees it?

In fact, when you type a site address in your browser, the browser downloads and renders the page for you, but to render the page it needs some instructions.

There are 3 types of instructions:

HTML: describes the structure of a web page;
CSS: defines the appearance of a site;
JavaScript: decides the behavior of the page.

Web scraping is the art of extracting information from the HTML, CSS and JavaScript code of a page. The term usually refers to an automated process, which is less error-prone and faster than gathering data by hand.

It is important to note that web scraping can raise ethical concerns, as it involves accessing and using data from websites without the explicit permission of the website owner. It is a good practice to respect the terms of use for a website, and to seek written permission before scraping large amounts of data.

This article aims to cover the basics of how to do web scraping in R. We will conclude by creating a database on Formula 1 drivers from Wikipedia.

Note that this article does not aim to be exhaustive on the topic. To learn more, see the "To go further" section.

HTML and CSS

Before starting, it is important to have a basic knowledge of HTML and CSS. This section briefly explains how they work; to learn more, see the resources listed at the end of this article.

Feel free to skip this section if you are already familiar with the topic.

Starting from HTML, an HTML file looks like the following piece of code.

<!DOCTYPE html>
<html lang="en">
<body>

<h1 href="https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss"> Carl Friedrich Gauss</h1>
<h2> Biography </h2>
<p> Johann Carl Friedrich Gauss was born on 30 April 1777 in Brunswick. </p>
<h2> Profession </h2>
<p> Gauss is considered one of the greatest mathematicians, statisticians and physicists of all time. </p>

</body>
</html>

Those instructions produce the following:

As mentioned above, HTML is used to describe the structure of a web page: for example, we may want to define the headings, the paragraphs, etc.

This structure is represented by what are called tags (for example <h1>...</h1> or <p>...</p>). Tags are the core of an HTML document, as they indicate the nature of what they enclose (for example, h1 stands for heading 1). It is important to observe that there are two types of tags:

starting tags (e.g. <h1>)
ending tags (e.g. </h1>)

This pairing of starting and ending tags is what makes it possible to nest tags inside one another.

Tags can also have attributes. For example, in <h1 href="https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss">Carl Friedrich Gauss</h1>, href is an attribute of the tag h1 that specifies a URL. (On real pages, href is normally carried by <a> link tags; it is placed on h1 here only for illustration.)

As the output of the above HTML code is not super elegant, CSS is used to style the final website. For example CSS is used to define the font, the color, the size, the spacing and many more features of a website.

What is important for this article are CSS selectors, which are patterns used to select elements. The most important is the .class selector, which selects all elements with the same class. For example the .xyz selector selects all elements with class="xyz".
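As a small illustration of the .class selector, the sketch below parses a tiny page held in a character string and selects the elements with class "xyz". It uses the rvest package, which is introduced later in this article; the HTML snippet and the object names are made up for the example.

```r
library(rvest)

# A made-up page: two elements carry class "xyz", one does not
toy_page <- read_html('
  <p class="xyz">first</p>
  <p>second</p>
  <div class="xyz">third</div>
')

# ".xyz" selects every element whose class list includes "xyz",
# whatever the tag (here a <p> and a <div>)
xyz_text <- toy_page %>%
  html_elements(".xyz") %>%
  html_text()

xyz_text
```

Note that the selector does not care about the tag name: to restrict the match to paragraphs only, the selector would be p.xyz.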

Web scraping vs. APIs

Going back to web scraping, you may know that APIs are another way to access data from websites and online services.

In fact an API is a set of rules and protocols that allows two different software systems to communicate with each other. When a website or online service provides an API, it means that they have made it possible for developers to access their data in a structured and controlled way.

Why does web scraping exist if APIs are so powerful and do exactly the same work?

The main difference between web scraping and using APIs is that APIs are typically provided by the website or service to allow access to their data, while web scraping involves accessing data without the explicit permission of the website owner.

This means that using APIs is generally considered more ethical than web scraping, as it is done with the explicit permission of the website or service.

However, there are also some limitations to using APIs:

many APIs have rate limits, which means that they only allow a certain number of requests within a given time period, so retrieving large amounts of data can be slow or impossible;
not all websites or online services provide APIs, which means the only way to access their data is via web scraping.
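Whether the data come from an API or from scraping, it is considerate to space out requests rather than sending them as fast as possible. The following is a minimal sketch, assuming a hypothetical helper fetch_politely() (it is not part of any package) that applies a user-supplied function to each URL and pauses between calls. The one-second default delay is an arbitrary choice.

```r
# Hypothetical helper: apply `fetch` to each URL, pausing between requests
fetch_politely <- function(urls, fetch, delay = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch(urls[i])
    # Wait before the next request (no need to wait after the last one)
    if (i < length(urls)) Sys.sleep(delay)
  }
  results
}

# Usage with a stand-in for a real download function such as rvest::read_html
demo <- fetch_politely(c("page-1", "page-2"), fetch = toupper, delay = 0.1)
demo
```

In a real script, fetch would be the function that downloads a page, and the delay would be chosen according to the limits stated by the website or API.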

Web scraping in R

There are several packages for web scraping in R, each with its own strengths and limitations. We will cover only the rvest package, since it is among the most widely used.

To get started with web scraping in R, you will first need R and RStudio installed. Once they are installed, you need to install the rvest package:

install.packages("rvest")

rvest

Inspired by Beautiful Soup and RoboBrowser (two Python libraries for web scraping), rvest has a similar syntax, which makes it a natural choice for those who come from Python.

rvest provides functions to access a web page and select specific elements using CSS selectors and XPath. The package is part of the tidyverse collection of packages, i.e. it shares some coding conventions (e.g. the pipe) with other packages such as tibble and ggplot2.

Before the real scraping it is necessary to load the rvest package:

library(rvest)

Now that everything is set up, we can start the web scraping operation, which is usually done in 3 steps:

HTTP GET request
Parsing HTML content
Getting HTML element attributes

These steps are detailed in the following sections.

HTTP GET request

The HTTP GET method is used to ask a server for certain data. It is important to note that a GET request is not meant to change the state of the server.

To send a GET request, we need the link (as a character string) to the page we want to scrape:

link <- "https://www.nytimes.com/"

Sending the request to the page is simple: rvest provides the read_html function, which returns an object of html_document type:

NYT_page <- read_html(link)

NYT_page

## {html_document}
## <html lang="en" class=" nytapp-vi-homepage" xmlns:og="http://opengraphprotocol.org/schema/">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n    <div id="app">\n<a class="css-kgn7zc" href="#site-content">Sk ...
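Requests can fail (no connection, a page that has moved, and so on). The sketch below assumes a hypothetical wrapper, safe_read_html(), that returns NULL with a message instead of stopping the script. The wrapper name and the choice of returning NULL are illustrative and not part of rvest.

```r
library(rvest)

# Hypothetical wrapper: return NULL (with a message) if the page cannot be read
safe_read_html <- function(url) {
  tryCatch(
    read_html(url),
    error = function(e) {
      message("Could not read ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# A local path that does not exist triggers the error branch
missing_page <- safe_read_html("this-file-does-not-exist.html")
is.null(missing_page)
```

When scraping several pages in a loop, a wrapper of this kind lets the loop continue and the failed pages be retried later.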

Parsing HTML content

As we saw in the last chunk of code, printing NYT_page shows the raw HTML code, which is not so easily readable.

To work with it from R, the HTML has to be parsed, which means generating a Document Object Model (DOM) from the raw HTML. The DOM is what connects scripts and web pages by representing the structure of a document in memory. read_html takes care of this parsing, so what remains is to select the elements of interest. If you retrieve the HTTP response with another tool (for example Node.js), you can still pass the raw HTML to R for parsing and further analysis.
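To illustrate parsing HTML that was obtained elsewhere, the sketch below passes a character string of raw HTML to read_html() and then queries the parsed document. The HTML string and object names are made up for the example.

```r
library(rvest)

# Raw HTML held in a character string (made up for this example)
raw_html <- "<html><body><h1>Carl Friedrich Gauss</h1><p>Born in 1777.</p></body></html>"

# read_html() parses the string into a document tree
doc <- read_html(raw_html)

# The parsed tree can then be queried like any downloaded page
heading <- doc %>%
  html_element("h1") %>%
  html_text()

heading
```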

rvest provides 2 ways to select HTML elements:

XPath
CSS selectors

Selecting elements with rvest is simple. For XPath we use the following syntax:

NYT_page %>%
  html_elements(xpath = "")

while for CSS selectors we need:

NYT_page %>%
  html_elements(css = "")

CSS selector

Suppose that for a project you need the summaries of the articles of the NYT (note that what is in the following picture is not what you see in the New York Times web page).

Searching in the HTML code, it is not that complex to find <p class="summary-class">, which is the markup of what we are looking for. To select these elements with this selector, we use the html_elements function:

summaries_css <- NYT_page %>%
  html_elements(css = ".summary-class")

head(summaries_css)

## {xml_nodeset (6)}
## [1] <p class="summary-class css-fkntuz">Even though Donald Trump defeated Nik ...
## [2] <p class="summary-class css-8h5y1w">Discussions between the British forei ...
## [3] <p class="summary-class css-8h5y1w">Hamas has pressed Israel for a commit ...
## [4] <p class="summary-class css-8h5y1w">Beijing is set to further increase it ...
## [5] <p class="summary-class css-8h5y1w">Temu, Shein, and streaming and gaming ...
## [6] <p class="summary-class css-fkntuz">The publication of “Until August” add ...

The easiest way to obtain a CSS selector is to open the browser’s inspect mode, find the element you want and right-click on it. Then click on Copy and Copy selector.
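rvest offers both html_elements() and html_element(), and the difference matters when a selector matches several nodes. The sketch below, on a made-up two-paragraph page, shows that the plural form returns every match while the singular form returns only the first one.

```r
library(rvest)

# Made-up page with two paragraphs sharing the same class
toy_page <- read_html('<p class="s">one</p><p class="s">two</p>')

# html_elements() returns every match as a node set
all_matches <- toy_page %>% html_elements(".s")
length(all_matches)

# html_element() returns only the first match
first_match <- toy_page %>%
  html_element(".s") %>%
  html_text()

first_match
```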

XPath

Parsing with XPath is similar to parsing with CSS selectors. In fact, we just need to repeat what we did above using the XPath of the element of interest. Moreover, obtaining an element’s XPath is no different from obtaining its selector: inspect mode -> right-click on the element of interest -> Copy -> Copy XPath.

Repeating what we did above with XPath:

summaries_xpath <- NYT_page %>%
  html_elements(xpath = "//*[contains(@class, 'summary-class')]")

head(summaries_xpath)

## {xml_nodeset (6)}
## [1] <p class="summary-class css-fkntuz">Even though Donald Trump defeated Nik ...
## [2] <p class="summary-class css-8h5y1w">Discussions between the British forei ...
## [3] <p class="summary-class css-8h5y1w">Hamas has pressed Israel for a commit ...
## [4] <p class="summary-class css-8h5y1w">Beijing is set to further increase it ...
## [5] <p class="summary-class css-8h5y1w">Temu, Shein, and streaming and gaming ...
## [6] <p class="summary-class css-fkntuz">The publication of “Until August” add ...

Obviously the data we collected with CSS selector and XPath are exactly the same.
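The equivalence of the two approaches can be checked on a small page that does not depend on the live NYT site. The sketch below uses a made-up page whose class names mimic the ones above, and compares the text returned by the CSS selector and by the XPath expression.

```r
library(rvest)

# Made-up page: two summaries and one unrelated paragraph
toy_page <- read_html('
  <p class="summary-class a">First summary</p>
  <p class="other">Not a summary</p>
  <p class="summary-class b">Second summary</p>
')

via_css <- toy_page %>%
  html_elements(css = ".summary-class") %>%
  html_text()

via_xpath <- toy_page %>%
  html_elements(xpath = "//*[contains(@class, 'summary-class')]") %>%
  html_text()

identical(via_css, via_xpath)
```

One caveat: contains(@class, 'summary-class') is a substring test, so it would also match a class such as summary-class-extra, whereas the CSS selector .summary-class matches whole class names only. On pages where such classes exist, the two queries may differ.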

Getting attributes

Since the chunk of code above collects all the p elements with the class summary-class, we convert all the elements of summaries_css (and summaries_xpath) to text using the html_text function:

NYT_summaries_css <- html_text(summaries_css)
NYT_summaries_xpath <- html_text(summaries_xpath)

We only print some of them:

head(NYT_summaries_css)

## [1] "Even though Donald Trump defeated Nikki Haley, the primary results suggested he still has long-term problems with suburban voters, moderates and independents."
## [2] "Discussions between the British foreign secretary, David Cameron, and a popular Israeli minister, Benny Gantz, stressed the frustration of Israel’s allies."   
## [3] "Hamas has pressed Israel for a commitment to a permanent cease-fire after a multistage release of all hostages, but Israel has refused, officials said."       
## [4] "Beijing is set to further increase its manufacturing and installation of solar panels as it seeks to master global markets and wean itself from imports."      
## [5] "Temu, Shein, and streaming and gaming apps looking to break into the U.S. market are spending huge sums to get their wares in front of American consumers."    
## [6] "The publication of “Until August” adds a twist to Gabriel García Márquez’s legacy, and may stir questions about posthumous releases."
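Besides the text of an element, its attributes (for example the href of a link) can be extracted with html_attr(). A minimal sketch on a made-up page follows; the class name story and the example.com URLs are placeholders, not taken from the NYT page.

```r
library(rvest)

# Made-up page with two links
toy_page <- read_html('
  <a class="story" href="https://example.com/a">Story A</a>
  <a class="story" href="https://example.com/b">Story B</a>
')

links <- toy_page %>% html_elements("a.story")

# Text shown to the reader
link_text <- html_text(links)

# Value of the href attribute of each link
link_urls <- html_attr(links, "href")

link_urls
```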

A real application of web scraping in R

To conclude this brief introduction to web scraping, we want to use the rvest package in a real-world application. The goal is to scrape data from the Wikipedia page listing Formula 1 drivers and create a CSV file containing the name, the nationality, the number of podiums and some other statistics for every driver.

The table we are going to scrape is the following:

If you haven’t done so, you need to install the rvest package:

install.packages("rvest")

and then load it:

library(rvest)

HTTP GET request

The GET request is the easiest part of scraping; we just need the link to the page:

link <- "https://en.wikipedia.org/wiki/List_of_Formula_One_drivers"

Parsing HTML content and getting attributes

Again we repeat what we did before with the NYT example:

page <- read_html(link)

Searching in the HTML code, we find that the table is a table element with the class sortable. Therefore we run the following lines of code:

drivers_F1 <- html_element(page, "table.sortable") %>%
  html_table()

In the chunk of code above, the html_table function is used to convert the HTML table into a data frame.
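To see what html_table() does without depending on the live Wikipedia page, here is a minimal sketch on a made-up table. The driver names and numbers are placeholders.

```r
library(rvest)

# Made-up page containing a small table with a header row
toy_page <- read_html('
  <table class="sortable">
    <tr><th>Driver</th><th>Wins</th></tr>
    <tr><td>Driver A</td><td>3</td></tr>
    <tr><td>Driver B</td><td>0</td></tr>
  </table>
')

# Select the table by tag and class, then convert it to a data frame;
# the header cells become the column names
toy_table <- toy_page %>%
  html_element("table.sortable") %>%
  html_table()

toy_table
```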

To inspect it, we display the first and last observations, and the structure of the dataset:

head(drivers_F1) # first 6 rows

## # A tibble: 6 × 11
##   `Driver name`     Nationality    `Seasons competed` `Drivers' Championships`
##   <chr>             <chr>          <chr>              <chr>                   
## 1 Carlo Abate       Italy          1962–1963          0                       
## 2 George Abecassis  United Kingdom 1951–1952          0                       
## 3 Kenny Acheson     United Kingdom 1983, 1985         0                       
## 4 Andrea de Adamich Italy          1968, 1970–1973    0                       
## 5 Philippe Adams    Belgium        1994               0                       
## 6 Walt Ader         United States  1950               0                       
## # ℹ 7 more variables: `Race entries` <chr>, `Race starts` <chr>,
## #   `Pole positions` <chr>, `Race wins` <chr>, Podiums <chr>,
## #   `Fastest laps` <chr>, `Points[a]` <chr>

tail(drivers_F1) # last 6 rows

## # A tibble: 6 × 11
##   `Driver name`  Nationality `Seasons competed`   `Drivers' Championships`
##   <chr>          <chr>       <chr>                <chr>                   
## 1 Emilio Zapico  Spain       1976                 0                       
## 2 Zhou Guanyu*   China       2022–2024            0                       
## 3 Ricardo Zonta  Brazil      1999–2001, 2004–2005 0                       
## 4 Renzo Zorzi    Italy       1975–1977            0                       
## 5 Ricardo Zunino Argentina   1979–1981            0                       
## 6 Driver name    Nationality Seasons competed     Drivers' Championships  
## # ℹ 7 more variables: `Race entries` <chr>, `Race starts` <chr>,
## #   `Pole positions` <chr>, `Race wins` <chr>, Podiums <chr>,
## #   `Fastest laps` <chr>, `Points[a]` <chr>

str(drivers_F1) # structure of the dataset

## tibble [868 × 11] (S3: tbl_df/tbl/data.frame)
##  $ Driver name           : chr [1:868] "Carlo Abate" "George Abecassis" "Kenny Acheson" "Andrea de Adamich" ...
##  $ Nationality           : chr [1:868] "Italy" "United Kingdom" "United Kingdom" "Italy" ...
##  $ Seasons competed      : chr [1:868] "1962–1963" "1951–1952" "1983, 1985" "1968, 1970–1973" ...
##  $ Drivers' Championships: chr [1:868] "0" "0" "0" "0" ...
##  $ Race entries          : chr [1:868] "3" "2" "10" "36" ...
##  $ Race starts           : chr [1:868] "0" "2" "3" "30" ...
##  $ Pole positions        : chr [1:868] "0" "0" "0" "0" ...
##  $ Race wins             : chr [1:868] "0" "0" "0" "0" ...
##  $ Podiums               : chr [1:868] "0" "0" "0" "0" ...
##  $ Fastest laps          : chr [1:868] "0" "0" "0" "0" ...
##  $ Points[a]             : chr [1:868] "0" "0" "0" "6" ...

Now that we have a tibble (a sort of dataframe used in the tidyverse universe), we just need to select the variables of interest and eliminate the last row that contains the name of the variables:

drivers_F1 <- drivers_F1[c(1:4, 7:9)] # select variables

drivers_F1 <- drivers_F1[-nrow(drivers_F1), ] # remove last row
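Removing the last row by position works as long as the repeated header is the last row of the table. A possible alternative, sketched here on a made-up data frame, is to drop any row whose first column simply repeats the column name. The toy values are placeholders, and this is only one way of doing it.

```r
# Toy data frame mimicking the structure above (values are made up)
toy <- data.frame(
  `Driver name` = c("Driver A", "Driver B", "Driver name"),
  Nationality = c("Italy", "Spain", "Nationality"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep only the rows whose first column is not the column's own name
toy_clean <- toy[toy$`Driver name` != "Driver name", ]

nrow(toy_clean)
```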

At this point we may want to clean our data. For example, we notice that Drivers' Championships has a small formatting issue: it returns not only the number of championships the driver won, but also the years of the victories. To extract only the number of victories (without the years) we use the substr() function:

drivers_F1$`Drivers' Championships` <- substr(drivers_F1$`Drivers' Championships`,
  start = 1, stop = 1
)

With this code, we actually extract only the first character since we start at 1 and stop at 1. At the moment, the maximum number of championships won by a driver is 7 (Lewis Hamilton & Michael Schumacher), so it is fine to extract only the first digit.
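Should a driver ever reach 10 Championships, taking only the first character would no longer be enough. A possible alternative is to keep the whole leading run of digits with a regular expression. The sketch below uses made-up strings and assumes that the count comes first and is followed by the years; the exact layout of the scraped cells may differ.

```r
# Made-up values: a count, optionally followed by years
championships_raw <- c("7 2008, 2014-2015", "0", "10 1990")

# Keep the leading run of digits, whatever its length, and convert to integer
championships <- as.integer(sub("^([0-9]+).*$", "\\1", championships_raw))

championships
```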

Et voilà! With only a few lines of code, we scraped a table and we are now ready to perform our analysis.

If you want to save the dataset, you can always do so:

write.csv(drivers_F1, "F1_drivers.csv", row.names = FALSE)

Analysis on the database

To convince you that this is a real database, we will now answer some simple questions.

First of all, we load the tidyverse package:

library(tidyverse)
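Because every column was scraped as character, the analyses that follow wrap the columns in as.double(). An alternative is to convert the count columns once, up front. The sketch below shows the idea in base R on a made-up data frame; the column selection is an assumption for the example.

```r
# Toy data frame with counts stored as character (values are made up)
toy <- data.frame(
  `Driver name` = c("Driver A", "Driver B"),
  `Pole positions` = c("12", "0"),
  Podiums = c("30", "1"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Convert the count columns to numeric in one step
count_cols <- c("Pole positions", "Podiums")
toy[count_cols] <- lapply(toy[count_cols], as.numeric)

str(toy)
```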

Which country has the largest number of Championships?

drivers_F1 %>%
  group_by(Nationality) %>%
  summarise(championship_country = sum(as.double(`Drivers' Championships`))) %>%
  arrange(desc(championship_country))

## # A tibble: 48 × 2
##    Nationality    championship_country
##    <chr>                         <dbl>
##  1 United Kingdom                   20
##  2 Germany                          12
##  3 Brazil                            8
##  4 Argentina                         5
##  5 Australia                         4
##  6 Austria                           4
##  7 Finland                           4
##  8 France                            4
##  9 Italy                             3
## 10 Netherlands                       3
## # ℹ 38 more rows

Who has the most Championships?

drivers_F1 %>%
  group_by(`Driver name`) %>%
  summarise(championship_pilot = sum(as.double(`Drivers' Championships`))) %>%
  arrange(desc(championship_pilot))

## # A tibble: 867 × 2
##    `Driver name`       championship_pilot
##    <chr>                            <dbl>
##  1 Lewis Hamilton~                      7
##  2 Michael Schumacher^                  7
##  3 Juan Manuel Fangio^                  5
##  4 Alain Prost^                         4
##  5 Sebastian Vettel^                    4
##  6 Ayrton Senna^                        3
##  7 Jack Brabham^                        3
##  8 Jackie Stewart^                      3
##  9 Max Verstappen~                      3
## 10 Nelson Piquet^                       3
## # ℹ 857 more rows

Sorry Michael, it looks like Lewis has caught up with you: both have 7 Championships.

Is there a relation between the number of Championships won and the number of pole positions?

drivers_F1 %>%
  filter(as.double(`Pole positions`) > 1) %>% # column is character, so convert before comparing
  ggplot(aes(x = as.double(`Pole positions`), y = as.double(`Drivers' Championships`))) +
  geom_point(position = "jitter") +
  labs(y = "Championships won", x = "Pole positions") +
  theme_minimal()

As expected, there seems to be a positive relationship between the number of pole positions and the number of Championships won. To quantify this relationship, we could build a linear model but this is beyond the scope of the article.

To go further

As you have seen, rvest is a powerful tool. The goal of the article is to show just the tip of the iceberg regarding web scraping in R.

There are many resources online that you can read if you want to know more:

Web scraping: Basics by Paul Bauer
rvest CRAN documentation
xml2 CRAN documentation
httr CRAN documentation
rvest CRAN vignette
httr CRAN vignette
Beautiful Soup documentation
RoboBrowser documentation
HTML documentation
CSS documentation

Conclusion

Thanks for reading.

I hope this article helped you to learn about web scraping in R, and gave you the incentive to use it for your projects. If you are interested in seeing another example, see how to scrape Yahoo search engine results with R.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.
