Web Scraping Using R and rvest

Step-by-step guide to scraping IMDB for TV Series ratings and votes using R and `rvest` package

Karat Sidhu
Jun 18, 2022

Introduction

Several tutorials cover scraping IMDB for TV series ratings and votes using Python, but equivalents in R are scarce, so I wanted to write my own using the rvest package. Even when such guides do exist in R, they are not very clear, and they usually deal with scraping IMDB's top 100/top 1000 lists rather than the individual seasons of a particular TV series.

The series I am going to scrape is South Park, one of the most popular TV shows in the world. Moreover, the show has run for over 25 seasons with a large number of episodes, so it makes for a better scraping exercise with a relatively long dataset than a top 100 list would.

The basic steps to scrape data are:

  • Step 1: Find the URL of the page that contains the data you want to scrape
  • Step 2: Parse the data
  • Step 3: Repeat the steps above for each page you want to scrape
  • Step 4: Transform the data
  • Step 5: Check the data, clean it
  • Step 6: Save the data

Prerequisites

You need a working R installation on your computer. In addition, you will need the rvest package, which you can install by running the following command in your R console:

install.packages("rvest")

The rvest package helps us scrape data from the web and is part of the tidyverse. To find out more about it, visit the rvest documentation.

Additionally, it is recommended that you install the SelectorGadget extension for your web browser. It is a Chrome extension that helps you select elements on a web page and makes it much simpler to find the elements you want to scrape. It is not a necessary dependency for this tutorial, but it is recommended; alternatively, the "Inspect Element" option in your browser works as well.

The lubridate package is useful when working with dates. Since we will be extracting the air date of each episode, lubridate is helpful for converting it into a usable format.
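As a quick illustration (using a made-up date string in the day-month-year style IMDB uses), lubridate::dmy() turns such text into a proper Date:

```r
library(lubridate)

# A made-up date string in the day-month-year style seen on IMDB episode pages
raw_date <- "13 Aug. 1997"

# dmy() parses day-month-year text, tolerating the abbreviated month and period
parsed <- dmy(raw_date)

class(parsed) # "Date"
```

Once the column is a Date, it can be sorted, filtered by year, and plotted on a time axis without further conversion.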

I will be using the following packages for this tutorial:

  • stringr
  • readr
  • magrittr

To make it simpler, install the entire tidyverse set of packages, since they are useful for analyzing and visualizing the scraped data.

Scraping

We will be extracting the text elements from the IMDB pages that contain the data we want to scrape. Once extracted, those elements are saved as character vectors that become the columns of a data frame. Assembling the data frame is the last step of the scraping process.
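As a minimal sketch with made-up values, equal-length character vectors line up by position and become the columns of a data frame:

```r
# Hypothetical stand-ins for the vectors returned by html_text()
episode_name <- c("Cartman Gets an Anal Probe", "Weight Gain 4000")
rating       <- c("8.2", "8.0")

# Each vector becomes a column; rows line up by position
episodes <- data.frame(episode_name, rating, stringsAsFactors = FALSE)

nrow(episodes) # 2
```

Note that data.frame() expects the vectors to be the same length, which is why each selector should return exactly one value per episode.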

Find & navigate to the URL

To find the exact URL of the page we want to extract data from, we first look at the South Park homepage on IMDB. That page shows the rating and vote count for the series as a whole, but it does not contain the ratings and votes for each individual episode or season.

Near the top of the page is a link to the list of South Park episodes; clicking it takes us to the latest season, which is season 25 at the time of writing this post.

For simplicity, we will start with the first season, which is under the link Season 1.

Parse the URL

Using the read_html() function, we can parse the HTML code of the page:

link <- "https://www.imdb.com/title/tt0121955/episodes?season=1"
page <- read_html(link)

The parsed page should now appear in the R environment, and inspecting it shows a list of HTML elements. Fetching the page can take a moment, depending on your internet connection and processor speed.

Extract each element

Using SelectorGadget, we can find the elements we want to scrape. The following code snippets find the elements that contain the episode name, episode number, rating, votes, air date, and description, and extract the text from each of them.

Episode Name

episode_name <- page |>
  html_nodes("#episodes_content strong a") |>
  html_text()

Episode Number

episode_season <- page |>
  html_nodes(".zero-z-index div") |>
  html_text()

Episode Rating

rating <- page |>
  html_nodes(".ipl-rating-star.small .ipl-rating-star__rating") |>
  html_text()

Episode Votes

total_votes <- page |>
  html_nodes(".ipl-rating-star__total-votes") |>
  html_text()

Air Date

air_date <- page |>
  html_nodes(".airdate") |>
  html_text()

Episode Description

description <- page |>
  html_nodes(".item_description") |>
  html_text()

Repeat for each season

The code above scrapes the data for one season. To scrape all seasons, we repeat those steps for each season using a for loop, changing the link variable to point to the correct URL each time.
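Since the season pages differ only in the final query parameter, paste0() and a numeric sequence can generate every URL up front:

```r
# Build the episode-list URL for each of the 25 seasons;
# paste0() is vectorized, so one call produces all 25 strings
season_urls <- paste0("https://www.imdb.com/title/tt0121955/episodes?season=", 1:25)

season_urls[1]      # "https://www.imdb.com/title/tt0121955/episodes?season=1"
length(season_urls) # 25
```

Inside the loop below, the same paste0() call builds one URL per iteration instead.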

Final Code

The final code snippet will look like the following:

library(rvest)
library(tidyverse)

south_park <- data.frame()

for (seasons in seq(from = 1, to = 25, by = 1)) {
  link <- paste0("https://www.imdb.com/title/tt0121955/episodes?season=", seasons)
  page <- read_html(link)
  episode_name <- page |>
    html_nodes("#episodes_content strong a") |>
    html_text()
  rating <- page |>
    html_nodes(".ipl-rating-star.small .ipl-rating-star__rating") |>
    html_text()
  total_votes <- page |>
    html_nodes(".ipl-rating-star__total-votes") |>
    html_text()
  air_date <- page |>
    html_nodes(".airdate") |>
    html_text()
  description <- page |>
    html_nodes(".item_description") |>
    html_text()
  episode_season <- page |>
    html_nodes(".zero-z-index div") |>
    html_text()
  south_park <- rbind(south_park, data.frame(episode_name,
    episode_season,
    rating,
    total_votes,
    air_date,
    description,
    stringsAsFactors = FALSE
  ))
}

Make the Dataframe

The final code snippet will create a data frame with the following columns:

  • episode_name
  • episode_season
  • rating
  • total_votes
  • air_date
  • description

and save it under the variable south_park.

Let us look at the data frame.

south_park |> head(10) |> gt::gt() |> gtExtras::gt_theme_538()

The scraped data is now in the south_park data frame, and can be saved from here. However, to further analyze the data, it needs to be cleaned up.
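For example, the raw data frame can be written to disk with readr (assuming readr is installed), so the scrape does not need to be rerun later:

```r
library(readr)

# Write the scraped data to a CSV file
write_csv(south_park, "south_park_raw.csv")

# Later, read it back without touching IMDB again
south_park <- read_csv("south_park_raw.csv")
```

Saving a raw copy before cleaning is a useful habit: if a cleaning step goes wrong, you can reload the file instead of re-scraping 25 seasons.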

Data Cleaning

The data frame we created above contains the data we want to analyze. However, the data is not in the most usable format. We need to clean the data to make it more useful.

Total Votes to Integer

Remove the parentheses and thousands-separator commas from the total votes string, and convert it to an integer type.

south_park$total_votes <- south_park$total_votes |>
  str_replace_all("\\(|\\)|\\,", "") |>
  as.integer()
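To see what that regex does, here it is applied to a single made-up vote string:

```r
library(stringr)

# A made-up vote count in the "(1,234)" format IMDB displays
votes <- "(1,234)"

# The alternation \\(|\\)|\\, matches an opening parenthesis, a closing
# parenthesis, or a comma; str_replace_all() deletes every match
cleaned <- votes |>
  str_replace_all("\\(|\\)|\\,", "") |>
  as.integer()

cleaned # 1234
```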

Rating to a double

south_park$rating <- as.double(south_park$rating)

Air_date to a Date

south_park$air_date <-
  str_replace_all(south_park$air_date, "\\n", "") |>
  trimws() |>
  lubridate::dmy()

Description column cleaning

The description column contains newline characters, so we need to remove them.

south_park$description <-
  str_replace_all(south_park$description, "\\n", "")

Save the data

Finally, store the cleaned-up data in a new data frame:

clean_sp <- south_park |>
  mutate(
    season = str_extract(episode_season, "(?<=S)[:digit:]+"),
    episode = str_extract(episode_season, "[:digit:]+$")
  ) |>
  relocate(season, .before = rating) |>
  relocate(episode, .before = rating) |>
  select(-episode_season) |>
  mutate(
    id = row_number()
  )
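The episode_season values follow IMDB's "S1, Ep3" pattern. On a made-up example, the lookbehind pulls out the digits after the S, while the $-anchored pattern grabs the trailing episode number:

```r
library(stringr)

# A made-up season/episode label in IMDB's "S1, Ep3" format
label <- "S1, Ep3"

# "(?<=S)[:digit:]+" matches digits immediately preceded by a literal "S"
season <- str_extract(label, "(?<=S)[:digit:]+")

# "[:digit:]+$" matches the run of digits at the end of the string
episode <- str_extract(label, "[:digit:]+$")

season  # "1"
episode # "3"
```

Both results are still character strings; convert them with as.integer() if you plan to sort or filter on them numerically.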

Conclusion

For static webpages, the rvest package is a great tool for scraping data. It is easy to use and very flexible, and the scraped data comes out in a usable format that needs little cleanup. Like most tidyverse packages, every function is well documented and the code is easy to read.

This post was originally published on my blog:

Karat Sidhu — Today I learned
