Goal: Scrape information from https://www.cheese.com to obtain a dataset of characteristics about different cheeses, and gain deeper insight into your coding process. 🪤
Part 1: Locate and examine the robots.txt file for this website. Summarize what you learn from it.
The robots.txt file is located at https://www.cheese.com/robots.txt. It is nearly empty, containing only two lines. These tell us that any crawler is allowed (`User-agent` is set to `*`, with no `Disallow` rules following it) and where the sitemap is located (https://www.cheese.com/sitemap.xml).
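It is easy to verify this directly from R before scraping (a minimal sketch; `readLines()` simply prints the raw file, and the robotstxt package, if installed, can confirm that a specific path is crawlable):

```r
library(robotstxt)

# print the raw robots.txt
writeLines(readLines("https://www.cheese.com/robots.txt"))

# check that the page we plan to scrape is allowed
paths_allowed("https://www.cheese.com/alphabetical")
```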
Part 2: Learn about the html_attr() function from rvest. Describe how this function works with a small example.
The `html_attr()` function returns a character vector containing the value of a specified attribute (e.g., `href`, `class`, `id`) for each element passed to it. The function requires two arguments: `x`, the document, node, or node set (usually the output of `read_html()` or `html_elements()`), and `name`, the name of the attribute to retrieve, such as `href` or `class`. There is a third optional argument, `default`, which controls how non-existent attributes are handled; it defaults to `NA_character_`, meaning `NA` is returned when an element lacks the attribute (passing, say, `default = ""` would return an empty string instead). As a small example, the code below extracts the `href` attribute of every link on the cheese.com homepage:
```r
# define url
cheese_url <- "https://www.cheese.com"

# fetch html content
html_doc_cheese <- read_html(cheese_url) # note: in the real world, add error handling

# select all <a> tags on the page
all_links_nodes <- html_nodes(html_doc_cheese, "a")

# use html_attr() to extract the 'href' attribute from each link
all_hrefs_cheese <- html_attr(all_links_nodes, "href")

head(all_hrefs_cheese, 10)
```
Part 3: (Do this alongside Part 4 below.) I used ChatGPT to start the process of scraping cheese information with the following prompt:

> Write R code using the rvest package that allows me to scrape cheese information from cheese.com.

Fully document your process of checking this code. Record any observations you make about where ChatGPT is useful / not useful. ChatGPT suggested the following code:
```r
# load libraries
library(rvest)
library(dplyr)

# define url
url <- "https://www.cheese.com/alphabetical"

# read html content from the page
webpage <- read_html(url)

# extract cheese names and urls
cheese_data <- webpage %>%
  html_nodes(".cheese-item") %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  paste0("https://cheese.com", .)

cheese_names <- webpage %>%
  html_nodes(".cheese-item h3") %>%
  html_text()

# create df to store results
cheese_df <- data.frame(Name = cheese_names,
                        URL = cheese_data,
                        stringsAsFactors = FALSE)

print(cheese_df)
```
Not useful:

- There were many empty results. Both `cheese_data` (the URLs) and `cheese_names` (the cheese names) came back as empty character vectors, so the `cheese_df` data frame was empty as well.
- The selectors were not specific enough. The CSS selectors `.cheese-item` and `.cheese-item h3` suggested by ChatGPT were too generic and did not accurately reflect the current structure of the cheese.com/alphabetical page. Websites frequently update their structure, and AI models may be trained on old versions or make incorrect assumptions about common class names.
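A quick diagnostic made this failure easy to confirm (a sketch, reusing the `webpage` object from ChatGPT's code; the second pipeline is just one ad hoc way to see which class names actually occur on the page):

```r
# how many nodes does the suggested selector match? (0 means it finds nothing)
length(html_elements(webpage, ".cheese-item"))

# list the div class attributes that really appear on the page
webpage %>%
  html_elements("div") %>%
  html_attr("class") %>%
  unique() %>%
  head(20)
```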
Useful:

- ChatGPT provided a basic template of the rvest workflow (`read_html()`, `html_nodes()`, `html_attr()`, `html_text()`), which was conceptually helpful for recalling the overall process.
Part 4: Obtain the following information for all cheeses in the database:
- cheese name
- URL for the cheese’s webpage (e.g., https://www.cheese.com/gouda/)
- whether or not the cheese has a picture (e.g., gouda has a picture, but bianco does not).
To be kind to the website owners, please add a 1 second pause between page queries. (Note that you can view 100 cheeses at a time.)
```r
library(rvest)
library(dplyr)
library(purrr)
library(stringr)

# defining url and pages to look at
base_url <- "https://www.cheese.com/alphabetical/?per_page=100"
page_numbers <- 1:21

# Helper function to extract text or attributes based on tag structure
extract_info <- function(page, outer_selector, inner_selector, attr = NULL) {
  nodes <- page %>%
    html_elements(outer_selector) %>%
    html_elements(inner_selector)

  if (!is.null(attr)) { # return the attribute value when one is requested
    html_attr(nodes, attr)
  } else {
    html_text(nodes)
  }
}

# Function to scrape a single page
scrape_cheese_page <- function(page_number) {
  full_url <- paste0(base_url, "&page=", page_number)
  page <- read_html(full_url)

  data.frame(
    Name = extract_info(page, "div.product-item", "h3"), # cheese name
    url = paste0("https://www.cheese.com",
                 extract_info(page, "div.product-item", "h3 a", "href")), # cheese url
    whether = extract_info(page, "div.product-item", "img", "class"), # whether there is an image
    stringsAsFactors = FALSE
  )
}

# Map over all pages and bind results
cheese_data <- map_dfr(page_numbers, function(pg) {
  result <- scrape_cheese_page(pg)
  Sys.sleep(1) # delay to be nice
  result
})

head(cheese_data)
```
```
                               Name
1           2 Year Aged Cumin Gouda
2            3-Cheese Italian Blend
3 30 Month Aged Parmigiano Reggiano
4           3yrs Aged Vintage Gouda
5                        Aarewasser
6                  Abbaye de Belloc
                                                             url       whether
1                https://www.cheese.com/2-year-aged-cumin-gouda/  image-exists
2                 https://www.cheese.com/3-cheese-italian-blend/ image-missing
3 https://www.cheese.com/30-month-aged-parmigiano-reggiano-150g/  image-exists
4                https://www.cheese.com/3yrs-aged-vintage-gouda/  image-exists
5                             https://www.cheese.com/aarewasser/  image-exists
6                       https://www.cheese.com/abbaye-de-belloc/  image-exists
```
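If a TRUE/FALSE flag is preferred over the raw class string, the `whether` column is easy to recode (an optional sketch; `has_picture` is a new column name introduced here):

```r
# convert the img class string into a logical indicator
cheese_data <- cheese_data %>%
  mutate(has_picture = whether == "image-exists")
```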
Part 5: When you go to a particular cheese’s page (like gouda), you’ll see more detailed information about the cheese. For just 10 of the cheeses in the database, obtain the following detailed information:
- milk information
- country of origin
- family
- type
- flavour
(Just 10 to avoid overtaxing the website! Continue adding a 1 second pause between page queries.)
```r
# Helper to extract the text of all nodes matching a selector
extract_text <- function(page, selector) {
  page %>%
    html_elements(selector) %>%
    html_text()
}

# Scrape cheese detail from a single page URL
scrape_cheese_details <- function(url) {
  Sys.sleep(1) # delay to be nice
  page <- read_html(url)

  tibble(
    family = extract_text(page, ".summary_family p"),
    milk = extract_text(page, ".summary_milk p"),
    country_of_origin = extract_text(page, ".summary_country p"),
    type = extract_text(page, ".summary_moisture_and_type p"),
    flavour = extract_text(page, ".summary_taste p")
  )
}

# Select cheese URLs of interest
cheeses <- c(
  "Gouda", "Colby", "Applewood",
  "Vacherin", "Pecorino Romano",
  "Cornish Blue", "Camembert",
  "Stella Feta", "Dubliner", "Paneer"
)

cheese_urls <- cheese_data %>%
  filter(Name %in% cheeses) %>%
  pull(url)

# Map and combine all details into a single tibble
df_cheeses <- map_dfr(cheese_urls, scrape_cheese_details)

# cleaning df for readability
df_cheeses <- df_cheeses %>%
  mutate( # removing unnecessary labels in vars
    family = str_remove(family, "Family: "),
    milk = str_remove(milk, "Made from "),
    country_of_origin = str_remove(country_of_origin, "Country of origin: "),
    type = str_remove(type, "Type: "),
    flavour = str_remove(flavour, "Flavour: ")
  )

names <- cheese_data %>%
  filter(Name %in% cheeses) %>%
  select(Name)

df_cheeses <- cbind(names, df_cheeses)
df_cheeses
```
```
              Name    family
1        Applewood   Cheddar
2        Camembert Camembert
3            Colby   Cheddar
4     Cornish Blue      Blue
5         Dubliner   Cheddar
6            Gouda     Gouda
7           Paneer   Cottage
8  Pecorino Romano  Pecorino
9      Stella Feta      Feta
10        Vacherin      Brie
                                                          milk
1                                       pasteurized cow's milk
2                                                   cow's milk
3                                                   cow's milk
4                                       pasteurized cow's milk
5                                       pasteurized cow's milk
6  pasteurized or unpasteurized cow's, goat's or sheep's milk
7                    pasteurized cow's or water buffalo's milk
8                                                 sheep's milk
9                                       pasteurized cow's milk
10                                      pasteurized cow's milk
        country_of_origin               type        flavour
1                 England          semi-hard         smokey
2                  France      soft, artisan          sweet
3           United States          semi-hard          sweet
4                 England semi-soft, artisan  creamy, sweet
5                 Ireland               hard   nutty, sweet
6             Netherlands               hard  full-flavored
7    Bangladesh and India         fresh firm          milky
8                   Italy               hard   salty, sharp
9           United States      firm, artisan          tangy
10 France and Switzerland      soft, artisan         smooth
```
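One fragility worth flagging in the code above: `cbind(names, df_cheeses)` assumes the two objects have identical row order, which only holds because each detail page yields exactly one row. A sturdier pattern (a sketch, assuming tidyr is available) carries the name along with its URL in a list-column, so no positional matching is needed; the `str_remove()` cleaning would then follow as before:

```r
library(tidyr)

df_cheeses <- cheese_data %>%
  filter(Name %in% cheeses) %>%
  mutate(details = map(url, scrape_cheese_details)) %>% # one tibble per cheese
  unnest(details) %>%
  select(Name, family, milk, country_of_origin, type, flavour)
```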
Part 6: Evaluate the code that you wrote in terms of efficiency. To what extent do your function(s) adhere to the principles for writing good functions? To what extent are your functions efficient? To what extent is your iteration of these functions efficient?
The functions we wrote follow the principles of good function design by being modular, clear, and reusable: each function performs a single task. For example, `extract_info()` and `extract_text()` handle only node selection, while `scrape_cheese_page()` and `scrape_cheese_details()` each scrape exactly one page. Using `purrr::map_dfr()` improves efficiency over a `for` loop because it combines iteration and row-binding in a memory-friendly way, performing a single bind at the end rather than the repeated `rbind()` calls that re-copy the accumulated result on every iteration. While `Sys.sleep(1)` adds an intentional delay, it is necessary for responsible scraping; since those pauses (plus network latency) dominate the total runtime, further micro-optimization of the R code would have little practical effect.
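For contrast, here is a sketch of the loop-and-bind pattern that `map_dfr()` replaces; each `rbind()` call copies every row accumulated so far, which is what makes this version scale poorly:

```r
# less efficient alternative: growing a data frame inside a for loop
results <- data.frame()
for (pg in page_numbers) {
  results <- rbind(results, scrape_cheese_page(pg)) # re-copies all prior rows
  Sys.sleep(1) # same polite delay
}
```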