Goal: Scrape information from https://www.cheese.com to obtain a dataset of characteristics about different cheeses, and gain deeper insight into your coding process. 🪤
Part 1: Locate and examine the robots.txt file for this website. Summarize what you learn from it.
The robots.txt file is located at https://www.cheese.com/robots.txt. It is nearly empty, containing only two lines. These tell us that any crawler is allowed (`User-agent` is set to `*`, with no `Disallow` rules following it) and where the sitemap is located (https://www.cheese.com/sitemap.xml).
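It is easy to verify this directly from R before scraping (a minimal sketch; `readLines()` simply prints the raw file, and the robotstxt package, if installed, can confirm that a specific path is crawlable):

```r
library(robotstxt)

# print the raw robots.txt
writeLines(readLines("https://www.cheese.com/robots.txt"))

# check that the page we plan to scrape is allowed
paths_allowed("https://www.cheese.com/alphabetical")
```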
Part 2: Learn about the html_attr() function from rvest. Describe how this function works with a small example.
The `html_attr()` function returns a character vector containing the value of a specified attribute (e.g., `href`, `class`, `id`) for each element passed to it. The function requires two arguments: `x`, the document, node, or node set (usually the output of `read_html()` or `html_elements()`), and `name`, the name of the attribute to retrieve, such as `href` or `class`. There is a third optional argument, `default`, which controls how non-existent attributes are handled; it defaults to `NA_character_`, meaning `NA` is returned when an element lacks the attribute (passing, say, `default = ""` would return an empty string instead). As a small example, the code below extracts the `href` attribute of every link on the cheese.com homepage:
```r
# define url
cheese_url <- "https://www.cheese.com"

# fetch html content
html_doc_cheese <- read_html(cheese_url) # note: in the real world, add error handling

# select all <a> tags on the page
all_links_nodes <- html_nodes(html_doc_cheese, "a")

# use html_attr() to extract the 'href' attribute from each link
all_hrefs_cheese <- html_attr(all_links_nodes, "href")

head(all_hrefs_cheese, 10)
```
Part 3: (Do this alongside Part 4 below.) I used ChatGPT to start the process of scraping cheese information with the following prompt:

> Write R code using the rvest package that allows me to scrape cheese information from cheese.com.

Fully document your process of checking this code. Record any observations you make about where ChatGPT is useful / not useful. ChatGPT suggested the following code:
```r
# load libraries
library(rvest)
library(dplyr)

# define url
url <- "https://www.cheese.com/alphabetical"

# read html content from the page
webpage <- read_html(url)

# extract cheese names and urls
cheese_data <- webpage %>%
  html_nodes(".cheese-item") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  paste0("https://cheese.com", .)

cheese_names <- webpage %>%
  html_nodes(".cheese-item h3") %>%
  html_text()

# create df to store results
cheese_df <- data.frame(Name = cheese_names,
                        URL = cheese_data,
                        stringsAsFactors = FALSE)

print(cheese_df)
```
Not useful:

- There were many empty results. Both `cheese_data` (the URLs) and `cheese_names` (the cheese names) came back as empty character vectors, so the `cheese_df` data frame was empty as well.
- The selectors were not specific enough. The CSS selectors `.cheese-item` and `.cheese-item h3` suggested by ChatGPT were too generic and did not accurately reflect the current structure of the cheese.com/alphabetical page. Websites frequently update their structure, and AI models may be trained on old versions or make incorrect assumptions about common class names.
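A quick diagnostic made this failure easy to confirm (a sketch, reusing the `webpage` object from ChatGPT's code; the second pipeline is just one ad hoc way to see which class names actually occur on the page):

```r
# how many nodes does the suggested selector match? (0 means it finds nothing)
length(html_elements(webpage, ".cheese-item"))

# list the div class attributes that really appear on the page
webpage %>%
  html_elements("div") %>%
  html_attr("class") %>%
  unique() %>%
  head(20)
```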
Useful:

- ChatGPT provided a basic template of the rvest workflow (`read_html()`, `html_nodes()`, `html_attr()`, `html_text()`), which was conceptually helpful for recalling the overall process.
Part 4: Obtain the following information for all cheeses in the database:
- cheese name
- URL for the cheese’s webpage (e.g., https://www.cheese.com/gouda/)
- whether or not the cheese has a picture (e.g., gouda has a picture, but bianco does not).
To be kind to the website owners, please add a 1 second pause between page queries. (Note that you can view 100 cheeses at a time.)
```r
library(rvest)
library(dplyr)
library(purrr)
library(stringr)

# defining url and pages to look at
base_url <- "https://www.cheese.com/alphabetical/?per_page=100"
page_numbers <- 1:21

# Helper function to extract text or attributes based on tag structure
extract_info <- function(page, outer_selector, inner_selector, attr = NULL) {
  nodes <- page %>%
    html_elements(outer_selector) %>%
    html_elements(inner_selector)

  if (!is.null(attr)) { # return the attribute value when one is requested
    html_attr(nodes, attr)
  } else {
    html_text(nodes)
  }
}

# Function to scrape a single page
scrape_cheese_page <- function(page_number) {
  full_url <- paste0(base_url, "&page=", page_number)
  page <- read_html(full_url)

  data.frame(
    Name = extract_info(page, "div.product-item", "h3"), # cheese name
    url = paste0("https://www.cheese.com",
                 extract_info(page, "div.product-item", "h3 a", "href")), # cheese url
    whether = extract_info(page, "div.product-item", "img", "class"), # whether there is an image
    stringsAsFactors = FALSE
  )
}

# Map over all pages and bind results
cheese_data <- map_dfr(page_numbers, function(pg) {
  result <- scrape_cheese_page(pg)
  Sys.sleep(1) # delay to be nice
  result
})

head(cheese_data)
```
```
                               Name
1           2 Year Aged Cumin Gouda
2            3-Cheese Italian Blend
3 30 Month Aged Parmigiano Reggiano
4           3yrs Aged Vintage Gouda
5                        Aarewasser
6                  Abbaye de Belloc
                                                             url       whether
1                https://www.cheese.com/2-year-aged-cumin-gouda/  image-exists
2                 https://www.cheese.com/3-cheese-italian-blend/ image-missing
3 https://www.cheese.com/30-month-aged-parmigiano-reggiano-150g/  image-exists
4                https://www.cheese.com/3yrs-aged-vintage-gouda/  image-exists
5                             https://www.cheese.com/aarewasser/  image-exists
6                       https://www.cheese.com/abbaye-de-belloc/  image-exists
```
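If a TRUE/FALSE flag is preferred over the raw class string, the `whether` column is easy to recode (an optional sketch; `has_picture` is a new column name introduced here):

```r
# convert the img class string into a logical indicator
cheese_data <- cheese_data %>%
  mutate(has_picture = whether == "image-exists")
```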
Part 5: When you go to a particular cheese’s page (like gouda), you’ll see more detailed information about the cheese. For just 10 of the cheeses in the database, obtain the following detailed information:
- milk information
- country of origin
- family
- type
- flavour
(Just 10 to avoid overtaxing the website! Continue adding a 1 second pause between page queries.)
```r
# Helper to extract the text of all nodes matching a selector
extract_text <- function(page, selector) {
  page %>%
    html_elements(selector) %>%
    html_text()
}

# Scrape cheese detail from a single page URL
scrape_cheese_details <- function(url) {
  Sys.sleep(1) # delay to be nice
  page <- read_html(url)

  tibble(
    family = extract_text(page, ".summary_family p"),
    milk = extract_text(page, ".summary_milk p"),
    country_of_origin = extract_text(page, ".summary_country p"),
    type = extract_text(page, ".summary_moisture_and_type p"),
    flavour = extract_text(page, ".summary_taste p")
  )
}

# Select cheese URLs of interest
cheeses <- c(
  "Gouda", "Colby", "Applewood",
  "Vacherin", "Pecorino Romano",
  "Cornish Blue", "Camembert",
  "Stella Feta", "Dubliner", "Paneer"
)

cheese_urls <- cheese_data %>%
  filter(Name %in% cheeses) %>%
  pull(url)

# Map and combine all details into a single tibble
df_cheeses <- map_dfr(cheese_urls, scrape_cheese_details)

# cleaning df for readability
df_cheeses <- df_cheeses %>%
  mutate( # removing unnecessary labels in vars
    family = str_remove(family, "Family: "),
    milk = str_remove(milk, "Made from "),
    country_of_origin = str_remove(country_of_origin, "Country of origin: "),
    type = str_remove(type, "Type: "),
    flavour = str_remove(flavour, "Flavour: ")
  )

names <- cheese_data %>%
  filter(Name %in% cheeses) %>%
  select(Name)

df_cheeses <- cbind(names, df_cheeses)
df_cheeses
```
```
              Name    family
1        Applewood   Cheddar
2        Camembert Camembert
3            Colby   Cheddar
4     Cornish Blue      Blue
5         Dubliner   Cheddar
6            Gouda     Gouda
7           Paneer   Cottage
8  Pecorino Romano  Pecorino
9      Stella Feta      Feta
10        Vacherin      Brie
                                                          milk
1                                       pasteurized cow's milk
2                                                   cow's milk
3                                                   cow's milk
4                                       pasteurized cow's milk
5                                       pasteurized cow's milk
6  pasteurized or unpasteurized cow's, goat's or sheep's milk
7                    pasteurized cow's or water buffalo's milk
8                                                 sheep's milk
9                                       pasteurized cow's milk
10                                      pasteurized cow's milk
        country_of_origin               type        flavour
1                 England          semi-hard         smokey
2                  France      soft, artisan          sweet
3           United States          semi-hard          sweet
4                 England semi-soft, artisan  creamy, sweet
5                 Ireland               hard   nutty, sweet
6             Netherlands               hard  full-flavored
7    Bangladesh and India         fresh firm          milky
8                   Italy               hard   salty, sharp
9           United States      firm, artisan          tangy
10 France and Switzerland      soft, artisan         smooth
```
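One fragility worth flagging in the code above: `cbind(names, df_cheeses)` assumes the two objects have identical row order, which only holds because each detail page yields exactly one row. A sturdier pattern (a sketch, assuming tidyr is available) carries the name along with its URL in a list-column, so no positional matching is needed; the `str_remove()` cleaning would then follow as before:

```r
library(tidyr)

df_cheeses <- cheese_data %>%
  filter(Name %in% cheeses) %>%
  mutate(details = map(url, scrape_cheese_details)) %>% # one tibble per cheese
  unnest(details) %>%
  select(Name, family, milk, country_of_origin, type, flavour)
```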
Part 6: Evaluate the code that you wrote in terms of efficiency. To what extent do your function(s) adhere to the principles for writing good functions? To what extent are your functions efficient? To what extent is your iteration of these functions efficient?
The functions we wrote follow the principles of good function design by being modular, clear, and reusable: each function performs a single task. For example, `extract_info()` and `extract_text()` handle only node selection, while `scrape_cheese_page()` and `scrape_cheese_details()` each scrape exactly one page. Using `purrr::map_dfr()` improves efficiency over a `for` loop because it combines iteration and row-binding in a memory-friendly way, performing a single bind at the end rather than the repeated `rbind()` calls that re-copy the accumulated result on every iteration. While `Sys.sleep(1)` adds an intentional delay, it is necessary for responsible scraping; since those pauses (plus network latency) dominate the total runtime, further micro-optimization of the R code would have little practical effect.
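For contrast, here is a sketch of the loop-and-bind pattern that `map_dfr()` replaces; each `rbind()` call copies every row accumulated so far, which is what makes this version scale poorly:

```r
# less efficient alternative: growing a data frame inside a for loop
results <- data.frame()
for (pg in page_numbers) {
  results <- rbind(results, scrape_cheese_page(pg)) # re-copies all prior rows
  Sys.sleep(1) # same polite delay
}
```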