Web Scraping Using R and rvest
Step-by-step guide to scraping IMDB for TV Series ratings and votes using R and `rvest` package
Introduction
There are a few tutorials available for scraping IMDB for TV series ratings and votes using Python, but I wanted to write my own tutorial using R and the `rvest` package because of the lack of such tutorials in R. Even when such guides are available, they are not very clear, and they usually deal with scraping top 100/top 1000 lists from IMDB rather than specific seasons of a particular TV series.
The series I am going to scrape is South Park, a TV series that is popular in the US and one of the most popular TV shows in the world. Moreover, the show has over 25 seasons and a lot of episodes, so it gives us a relatively long dataset compared to a top 100 list, which makes it good practice for scraping.
The basic steps to scrape data are:
- Step 1: Find the URL of the page that contains the data you want to scrape
- Step 2: Parse the data
- Step 3: Repeat the steps above for each page you want to scrape
- Step 4: Transform the data
- Step 5: Check the data, clean it
- Step 6: Save the data
Prerequisites
You need a working R installation on your computer. In addition, you will need the `rvest` package installed. You can install it by running the following command in your R console:
install.packages("rvest")
The `rvest` package helps us to scrape data from the web. rvest is part of the tidyverse. To find out more about it, visit the rvest documentation.
Additionally, it is recommended that you install the CSS Selector Gadget for your web browser. It is a Chrome extension that helps you select elements on a web page and makes it much simpler to find the elements you want to scrape. It is not a necessary dependency for this tutorial, but it is recommended; alternatively, the “Inspect Element” option in your browser works as well.
The `lubridate` package is useful when working with dates. Since we will be extracting the air date of each episode, `lubridate` is helpful in converting it to a usable format.
I will be using the following packages for this tutorial:
stringr
readr
magrittr
To make it simpler, install the entire `tidyverse` set of packages, because they are useful for analysis and visualization of the scraped data.
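Putting the prerequisites together, a typical setup block looks like the following (a sketch; `lubridate` is loaded separately because older tidyverse versions do not attach it by default):

```r
# One-time installation (tidyverse bundles stringr, readr, and magrittr)
install.packages(c("rvest", "tidyverse", "lubridate"))

# Load the packages for this session
library(rvest)      # web scraping
library(tidyverse)  # data manipulation and visualization
library(lubridate)  # date parsing
```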
Scraping
We will be extracting the text elements from the IMDB pages that contain the data we want to scrape. Once extracted, those elements are saved as lists that become the columns of a data frame. Assembling the data frame is the last step of the scraping process.
Find & navigate to the URL
To find the exact URL of the page we want to extract the data from, we first look at the South Park homepage on IMDB. This page contains the rating and vote count for the South Park TV series as a whole, but it does not contain the ratings and votes for the individual episodes of each season.
At the top of the page, we see the link to the list of episodes of South Park, and clicking that link takes us to the latest season of South Park, which is 25 at the time of writing this post.
For simplicity, we will start with the first season, which is under the link Season 1.
Parse the URL
Using the `read_html` function, we can parse the HTML code of the page:
link <- "https://www.imdb.com/title/tt0121955/episodes?season=1"
page <- read_html(link)
The parsed page is now stored in the `page` object in your R environment, and we can see that it contains a list of HTML elements. Fetching the page can take a minute, depending on your internet connection and processor speed.
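To confirm that `read_html()` returned what we expect, we can check the class of the parsed object. The snippet below uses a small inline HTML string so it runs without a network connection; the class names are what rvest/xml2 return for any parsed page:

```r
library(rvest)

# Parse a tiny HTML document instead of the live IMDB page
demo_page <- read_html("<html><body><h1>South Park</h1></body></html>")
class(demo_page)  # "xml_document" "xml_node"
```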
Extract each element
Using the CSS Selector Gadget, we can find the elements we want to scrape. The following code snippets find the elements that contain the episode name, episode number, rating, vote count, air date, and description, and extract the text from each of them.
Episode Name
episode_name <- page |>
html_nodes("#episodes_content strong a") |>
html_text()
Episode Number
episode_season <- page |>
html_nodes(".zero-z-index div") |>
html_text()
Episode Rating
rating <- page |>
html_nodes(".ipl-rating-star.small .ipl-rating-star__rating") |>
html_text()
Episode Votes
total_votes <- page |>
html_nodes(".ipl-rating-star__total-votes") |>
html_text()
Air Date
air_date <- page |>
html_nodes(".airdate") |>
html_text()
Episode Description
description <- page |>
html_nodes(".item_description") |>
html_text()
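Before assembling these vectors into a data frame, it is worth checking that they all have the same length, since `data.frame()` will otherwise fail or silently recycle values. This sanity check is not part of the original workflow, and the vectors below are mock stand-ins for the scraped results:

```r
# Mock stand-ins for the scraped vectors (illustrative only)
episode_name <- c("Cartman Gets an Anal Probe", "Weight Gain 4000")
rating       <- c("8.2", "7.9")
air_date     <- c("13 Aug. 1997", "20 Aug. 1997")

# All vectors should have one entry per episode
n_items <- c(length(episode_name), length(rating), length(air_date))
stopifnot(length(unique(n_items)) == 1)
```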
Repeat for each season
The code above scrapes the data for one season. To scrape the data for all seasons, we repeat the steps above for each season using a `for` loop. Additionally, we need to change the `link` variable to point to the correct URL for each season.
Final Code
The final code snippet will look like the following:
library(rvest)
library(tidyverse)

south_park <- data.frame()

for (seasons in seq(from = 1, to = 25, by = 1)) {
  link <- paste0("https://www.imdb.com/title/tt0121955/episodes?season=", seasons)
  page <- read_html(link)

  episode_name <- page |>
    html_nodes("#episodes_content strong a") |>
    html_text()

  rating <- page |>
    html_nodes(".ipl-rating-star.small .ipl-rating-star__rating") |>
    html_text()

  total_votes <- page |>
    html_nodes(".ipl-rating-star__total-votes") |>
    html_text()

  air_date <- page |>
    html_nodes(".airdate") |>
    html_text()

  description <- page |>
    html_nodes(".item_description") |>
    html_text()

  episode_season <- page |>
    html_nodes(".zero-z-index div") |>
    html_text()

  south_park <- rbind(south_park, data.frame(
    episode_name,
    episode_season,
    rating,
    total_votes,
    air_date,
    description,
    stringsAsFactors = FALSE
  ))
}
Make the Dataframe
The final code snippet will create a data frame with the following columns:
episode_name
episode_season
rating
total_votes
air_date
description
and save it under the variable `south_park`.
Let us look at the data frame.
south_park |> head(10) |> gt::gt() |> gtExtras::gt_theme_538()
The scraped data is now in the `south_park` data frame and can be saved from here. However, to further analyze the data, it needs to be cleaned up.
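For instance, the raw data could be written to disk at this point with `readr` (the file name here is just an example):

```r
library(readr)

# Assuming `south_park` is the data frame built by the scraping loop above,
# write the raw scraped data to a CSV file
write_csv(south_park, "south_park_raw.csv")
```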
Data Cleaning
The data frame we created above contains the data we want to analyze. However, the data is not in the most usable format. We need to clean the data to make it more useful.
Total Votes to Integer
Remove the parentheses and commas from the total votes string, and convert it into an integer type.
south_park$total_votes <- south_park$total_votes |>
str_replace_all("\\(|\\)|\\,", "") |>
as.integer()
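As a quick illustration of what the regular expression does, a vote count such as "(2,345)" loses its parentheses and comma and becomes an integer (the sample value is made up):

```r
library(stringr)

raw_votes <- "(2,345)"
clean_votes <- raw_votes |>
  str_replace_all("\\(|\\)|\\,", "") |>  # drop parentheses and commas
  as.integer()
clean_votes  # 2345
```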
Rating to a double
south_park$rating <- as.double(south_park$rating)
Air_date to a Date
south_park$air_date <-
str_replace_all(south_park$air_date, "\\n", "") |>
trimws() |>
lubridate::dmy()
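For example, `dmy()` turns an IMDB-style air date string into a proper `Date` object (the sample string matches the "day Month. year" format the episode pages use):

```r
library(lubridate)

dmy("13 Aug. 1997")  # Date: "1997-08-13"
```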
Description column cleaning
The description column contains newline characters, so we need to remove them.
south_park$description <-
str_replace_all(south_park$description, "\\n", "")
Save the data
Finally save the data as a new dataframe:
clean_sp <- south_park |>
mutate(
season = str_extract(episode_season, "(?<=S)[:digit:]+"),
episode = str_extract(episode_season, "[:digit:]+$")
) |>
relocate(season, .before = rating) |>
relocate(episode, .before = rating) |>
select(-episode_season) |>
mutate(
id = row_number()
)
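As a quick illustration of the two regular expressions, assuming the episode labels scraped into `episode_season` look like "S1.E3": the lookbehind `(?<=S)` grabs the digits immediately after the "S", and `[:digit:]+$` grabs the trailing episode digits:

```r
library(stringr)

label <- "S1.E3"
str_extract(label, "(?<=S)[:digit:]+")  # "1"  (season)
str_extract(label, "[:digit:]+$")       # "3"  (episode)
```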
Conclusion
For static webpages, the rvest package is a great tool for scraping data. It is easy to use, and it is very flexible. Further, the data it returns is in a very usable format and does not need a great deal of cleaning. Each function in the package is well documented, and the code is easy to read, like most tidyverse packages.
This post was originally posted on my blog: