Web Scraping in R: A Comprehensive Guide with OkeyProxy

Web scraping transforms unstructured web pages into clean datasets for analysis, reporting, or dashboards, and R’s rich package ecosystem makes it easy. This guide takes you from zero to advanced techniques: environment setup, HTML/CSS fundamentals, basic rvest usage, HTTP control, dynamic content handling, proxy integration, parallel crawling, error resilience, and ethics.

Understanding HTML & CSS Basics

Before you can scrape data, you need to know how web pages are structured and how to pinpoint the elements you care about.

Why HTML & CSS Matter

Web pages are built from HTML—a hierarchy of elements (tags)—and styled or organized using CSS. To extract specific text, images, or tables, you tell your scraping tools exactly which elements to target via selectors.

Term Check

Element: An HTML tag, e.g. <p>, <h1>, <table>.

Attribute: Additional information inside a tag, e.g. class="price", id="main".

Selector: A pattern (CSS or XPath) that matches one or more elements on the page.

Example HTML

html

<div class="product">
  <h2 id="title">Widget A</h2>
  <p class="price">$10.00</p>
</div>

  • The <div> groups a product.
  • The <h2> has an id of title.
  • The <p> has a class of price.

Common CSS Selectors

Class selector

css

.price       /* matches <p class="price"> */

ID selector

css

#title       /* matches <h2 id="title"> */

Nested selector

css

div.product > p.price   /* matches any <p class="price"> directly inside <div class="product"> */

Simple XPath Example

xpath

//div[@class="product"]/p[@class="price"]

This finds any <p> with class="price" inside a <div> of class product.
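
If you prefer XPath, rvest accepts it too via the xpath argument of html_elements(). A minimal sketch, assuming page is a document already parsed with read_html() (as in Step 2 below):

r

library(rvest)

prices_css   <- page %>% html_elements("div.product > p.price") %>% html_text2()
prices_xpath <- page %>% html_elements(xpath = '//div[@class="product"]/p[@class="price"]') %>% html_text2()

identical(prices_css, prices_xpath)  # the two selectors target the same nodes here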

Pro Tip

Install the SelectorGadget browser extension (or try CSS Diner) to click on elements and generate CSS selectors visually—no manual trial and error needed.

Step 1: Environment & Prerequisites

1. Install R & RStudio (or VS Code)

2. Create a new project

3. Install packages

r

# "parallel" ships with base R, so it does not need to be installed
install.packages(c(
  "rvest", "xml2", "httr2",
  "RSelenium", "chromote",
  "polite", "Rcrawler",
  "tidyverse", "purrr", "stringr", "lubridate"
))

4. Load libraries once per script

r

library(rvest); library(xml2); library(httr2)
library(RSelenium); library(chromote)
library(polite); library(parallel); library(Rcrawler)
library(tidyverse); library(purrr); library(stringr); library(lubridate)

Core Tools & Packages Overview

You need the right packages loaded before writing any scraping code.

| Task | Package(s) | Purpose |
| --- | --- | --- |
| Static HTML scraping | rvest, xml2 | Fetch and parse page content |
| JavaScript-rendered pages | RSelenium, chromote | Drive headless browsers for dynamic content |
| HTTP control | httr2 | Custom headers, cookies, rate limiting |
| Parallel crawling | parallel, Rcrawler | Multi-core scraping, depth/pagination control |
| Data hygiene & export | tidyverse, jsonlite | Clean, transform, save to CSV or JSON |
| Polite scraping | polite, httr2 (req_throttle()) | Respect robots.txt and throttle requests |

Step 2: Basic Static Scraping with rvest

rvest makes fetching and parsing HTML trivial.

1. Fetch & Parse

r

url  <- "https://example.com/products"
page <- read_html(url)

2. Extract Elements

r

titles <- page %>% html_elements(".product-name") %>% html_text2()
prices <- page %>% html_elements(".price")        %>% html_text2()
df     <- tibble(title = titles, price = prices)
head(df)

Beginner: html_elements() returns a set of matching nodes (an xml_nodeset); html_text2() extracts their text and cleans whitespace.

Professional: Chain pipes for readability; inspect intermediate objects with print().
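
Attributes are extracted with html_attr(). A hedged sketch, assuming each product name wraps a link (the .product-name a selector is illustrative):

r

links <- page %>%
  html_elements(".product-name a") %>%   # hypothetical anchor inside each product name
  html_attr("href") %>%
  url_absolute(url)                      # resolve relative links against the page URL

head(links)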

Step 3: Pagination, Tables & Early Error Handling

Most sites spread data across pages and tables, and sometimes requests fail.

1. Scrape a Table

r

table <- page %>% html_node("table.inventory") %>% html_table(header = TRUE)

2. Loop Through Pages

r

pages <- sprintf("https://example.com/page/%d", 1:5)

safe_read  <- possibly(read_html, otherwise = NULL)
safe_table <- possibly(html_table, otherwise = tibble())

all_tables <- map_dfr(pages, function(url) {
  pg <- safe_read(url)
  if (is.null(pg)) return(tibble())   # skip pages that failed to download
  pg %>% html_node("table.inventory") %>% safe_table(header = TRUE)
})

Beginner: map_dfr() binds rows and handles lists of tibbles.

Professional: Wrapping in possibly() ensures your loop continues even if one page errors.

Step 4: HTTP Control with httr2

Customize headers and throttle to avoid blocks.

r

response <- request("https://example.com/data") %>%
  req_headers(`User-Agent` = "R Scraper v1.0") %>%
  req_throttle(5 / 60) %>%    # 5 requests per minute
  req_perform()

page <- resp_body_html(response)

Beginner: Change User-Agent to mimic a browser and bypass simple bot filters.

Professional: Randomize delays:

r

Sys.sleep(runif(1, 1, 3))
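
httr2 can also retry transient failures (such as HTTP 429 or 503) for you. A minimal sketch combining req_user_agent(), req_throttle(), and req_retry() on the same example URL:

r

resp <- request("https://example.com/data") %>%
  req_user_agent("R Scraper v1.0") %>%                                 # identify your scraper
  req_throttle(5 / 60) %>%                                             # 5 requests per minute
  req_retry(max_tries = 3, backoff = function(i) runif(1, 1, 3)) %>%   # up to 3 tries, random pause between them
  req_perform()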

Step 5: Dynamic Pages via RSelenium & chromote

Some sites render content client-side with JavaScript.

1. RSelenium (Headless Chrome)

r

rD <- rsDriver(
  browser = "chrome", chromever = "latest",
  extraCapabilities = list(chromeOptions = list(args = list("--headless")))
)
remDr <- rD$client
remDr$navigate("https://example.com/js-content")
Sys.sleep(4)  # wait for JS to load

html   <- remDr$getPageSource()[[1]] %>% read_html()
titles <- html %>% html_elements(".js-title") %>% html_text2()

remDr$close(); rD$server$stop()

2. chromote (Lightweight JS)

r

session <- ChromoteSession$new()
session$Page$navigate("https://example.com/js-content")
session$Page$loadEventFired()

html  <- session$Runtime$evaluate("document.documentElement.outerHTML")$result$value
doc   <- read_html(html)
items <- doc %>% html_elements(".item") %>% html_text2()

Beginner: RSelenium automates what you’d do manually in a browser.

Professional: Pass proxy settings via Chrome arguments if needed. chromote has no proxy option of its own, so launch Chrome with the flag and build the session from that browser instance (note that Chrome ignores credentials embedded in --proxy-server, so use an IP-whitelisted endpoint here):

r

b <- Chromote$new(
  browser = Chrome$new(args = "--proxy-server=http://proxy.okeyproxy.com:8000")
)
session <- ChromoteSession$new(parent = b)

Step 6: Integrating OkeyProxy & Rate-Limiting

Proxies keep your IP fresh, help you avoid geo-blocks, and distribute load. Sign up for OkeyProxy and select a rotating residential or rotating datacenter proxy plan as needed.

1. Single-Request Proxy

r

proxy <- "http://user:[email protected]:8000"

resp <- request("https://example.com") %>%
  req_proxy(proxy) %>%
  req_throttle(2 / 1) %>%  # 2 requests per second
  req_perform()

page <- resp_body_html(resp)

2. Automated IP Rotation

r

proxies <- c(
  "http://u:[email protected]:8000",
  "http://u:[email protected]:8000",
  "http://u:[email protected]:8000"
)

scrape_page <- function(url, proxy) {
  request(url) %>% req_proxy(proxy) %>% req_perform() %>%
    resp_body_html() %>%                      # already returns a parsed document
    html_elements(".data") %>% html_text2() %>%
    tibble(data = .)
}

results <- map2_dfr(pages, rep(proxies, length.out = length(pages)), scrape_page)

Professional: Rotate proxies round-robin to spread traffic and stay under per-IP limits.
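
Rotation also enables failover: if a request through one proxy fails, fall back to the next endpoint in the pool. A hedged sketch built on the scrape_page() helper above:

r

# try each proxy in turn until one succeeds; return an empty tibble if all fail
scrape_with_failover <- function(url, proxy_pool) {
  for (proxy in proxy_pool) {
    result <- tryCatch(scrape_page(url, proxy), error = function(e) NULL)
    if (!is.null(result)) return(result)
    message("Proxy failed for ", url, "; trying the next one")
  }
  tibble(data = character())
}

results <- map_dfr(pages, scrape_with_failover, proxy_pool = proxies)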

Step 7: Scaling with Parallel & Rcrawler

Parallelization and automated crawling speed up large jobs.

1. Parallel Scraping

r

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(rvest))

out <- parLapply(cl, pages, function(u) {
  read_html(u) %>% html_elements(".product") %>% html_text2()
})

stopCluster(cl)

2. Automated Crawling with Rcrawler

r

Rcrawler(
  Website       = "https://example.com",
  no_cores      = 4,
  no_conn       = 4,
  MaxDepth      = 2,
  ExtractCSSPat = c("h1"),      # grab each page's <h1> while crawling
  PatternsNames = c("title")
)
# Rcrawler stores results in the INDEX and DATA objects it creates in your workspace

Professional: Monitor memory, checkpoint progress, and gracefully stop clusters on repeated failures.
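
One simple way to checkpoint a long crawl is to cache each page's result on disk and skip URLs that are already done. A minimal sketch, assuming a local cache/ directory (created below) and the scrape_page() helper from Step 6; the file-naming scheme is just illustrative:

r

dir.create("cache", showWarnings = FALSE)

# scrape one URL, reusing a cached result if a previous run already saved it
scrape_cached <- function(url, proxy) {
  path <- file.path("cache", paste0(gsub("[^A-Za-z0-9]", "_", url), ".rds"))
  if (file.exists(path)) return(readRDS(path))
  result <- scrape_page(url, proxy)
  saveRDS(result, path)   # checkpoint: a later run can resume where this one stopped
  result
}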

Step 8: Errors, Cleaning & Selector Tools

Clean data and resilient scripts save time down the road.

1. Error Handling

r

safe_parse <- possibly(
  function(url) read_html(url) %>% html_elements(".item") %>% html_text2(),
  otherwise = character()
)
results <- map(pages, safe_parse)

2. Data Cleaning

r

# assumes df also has date and text columns; drop those lines if it does not
df <- df %>%
  mutate(
    price = str_remove_all(price, "[^0-9\\.]"),
    date  = parse_date(date, "%B %d, %Y"),
    text  = str_squish(text)
  )

3. Selector Discovery

Use SelectorGadget, DevTools, or CSS Diner to refine your CSS or XPath queries.

4. CAPTCHA & Bot Defenses

Introduce randomized delays, header rotations, or leverage headless browsers.
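
Header rotation can be as simple as sampling a User-Agent per request. A hedged sketch with an illustrative pool of UA strings:

r

# illustrative User-Agent strings; swap in whatever browsers you want to mimic
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
)

resp <- request("https://example.com/data") %>%
  req_user_agent(sample(user_agents, 1)) %>%  # pick a random UA for this request
  req_perform()

Sys.sleep(runif(1, 1, 3))  # randomized delay before the next request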

Tip: In your Mini-Project below, see how cleaning functions produce a polished CSV.

Ethics & Best Practices

Responsible scraping protects you and the sites you crawl.

1. Respect robots.txt:

r

session <- bow("https://example.com", delay = 2)
page    <- scrape(session)

2. Check Terms: Look at the site’s Terms of Service.

3. Throttle Requests: Use req_throttle() or Sys.sleep().

4. Identify Yourself: Use a clear User-Agent string unless anonymity is needed.

5. Privacy Compliance: Don’t scrape personal data without permission; follow GDPR.

6. Monitor & Log: Record successes, failures, and response times for debugging.
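
A lightweight way to monitor and log runs is to wrap each fetch in tryCatch() and record the outcome and elapsed time. A minimal sketch, assuming the pages vector from Step 3:

r

# fetch each page, recording success/failure and elapsed seconds for later review
log <- map_dfr(pages, function(url) {
  started <- Sys.time()
  ok <- tryCatch({ read_html(url); TRUE }, error = function(e) FALSE)
  tibble(
    url     = url,
    success = ok,
    seconds = as.numeric(difftime(Sys.time(), started, units = "secs"))
  )
})

write_csv(log, "scrape_log.csv")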

Putting It All Together: A Mini-Project Example

Scrape a paginated e‑commerce site, rotate OkeyProxy endpoints, clean data, and export a CSV:

r

library(rvest); library(httr2); library(tidyverse)

proxies <- c(
  "http://u:[email protected]:8000",
  "http://u:[email protected]:8000"
)
urls <- paste0("https://shop.example.com/page", 1:5)

scrape_page <- function(url, proxy) {
  resp <- request(url) %>% req_proxy(proxy) %>% req_throttle(1 / 1) %>% req_perform()
  doc  <- resp_body_html(resp)
  tibble(
    title = doc %>% html_elements(".title") %>% html_text2(),
    price = doc %>% html_elements(".price") %>% html_text2()
  )
}

raw_data <- map2_dfr(urls, rep(proxies, length.out = length(urls)), scrape_page)

cleaned <- raw_data %>%
  mutate(price = str_remove_all(price, "[^0-9\\.]") %>% as.numeric()) %>%
  drop_na(price)

write_csv(cleaned, "products.csv")

Conclusion

You now have a full-stack R scraping workflow—from rvest basics through JS rendering, OkeyProxy rotation, parallel crawls, and robust data cleaning. Happy scraping—and remember: scrape responsibly.

Unlock the full potential of your R scrapers by signing up for a free trial of OkeyProxy today. Its high-quality, affordable rotating proxies help you stay under rate limits and avoid blocks. See how easy it is to power fast, reliable, and ethical data collection at scale!