Web Scraping in R: A Comprehensive Guide with OkeyProxy

Web scraping transforms unstructured web pages into clean datasets for analysis, reporting, or dashboards, and R’s rich package ecosystem makes it easy. This guide takes you from zero to advanced techniques: environment setup, HTML/CSS fundamentals, basic rvest usage, HTTP control, dynamic content handling, proxy integration, parallel crawling, error resilience, and ethics.

Understanding HTML & CSS Basics

Before you can scrape data, you need to know how web pages are structured and how to pinpoint the elements you care about.

Why HTML & CSS Matter

Web pages are built from HTML—a hierarchy of elements (tags)—and styled or organized using CSS. To extract specific text, images, or tables, you tell your scraping tools exactly which elements to target via selectors.

Term Check

Element: An HTML tag, e.g. <p>, <h1>, <table>.

Attribute: Additional information inside a tag, e.g. class="price", id="main".

Selector: A pattern (CSS or XPath) that matches one or more elements on the page.

Example HTML

html

<div class="product">
  <h2 id="title">Widget A</h2>
  <p class="price">$10.00</p>
</div>

  • The <div> groups a product.
  • The <h2> has an id of title.
  • The <p> has a class of price.

Common CSS Selectors

Class selector

css

.price       /* matches <p class="price"> */

ID selector

css

#title       /* matches <h2 id="title"> */

Nested selector

css

div.product > p.price   /* matches any <p class="price"> directly inside <div class="product"> */

Simple XPath Example

xpath

//div[@class="product"]/p[@class="price"]

This finds any <p> with class="price" inside a <div> of class product.
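
If you prefer XPath, rvest accepts it too via the xpath argument of html_elements(). A minimal sketch, assuming page is a document already parsed with read_html() (as in Step 2 below):

r

library(rvest)

prices_css   <- page %>% html_elements("div.product > p.price") %>% html_text2()
prices_xpath <- page %>% html_elements(xpath = '//div[@class="product"]/p[@class="price"]') %>% html_text2()

identical(prices_css, prices_xpath)  # the two selectors target the same nodes here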

Pro Tip

Install the SelectorGadget browser extension (or try CSS Diner) to click on elements and generate CSS selectors visually—no manual trial and error needed.

Step 1: Environment & Prerequisites

1. Install R & RStudio (or VS Code)

2. Create a new project

3. Install packages

r

# "parallel" ships with base R, so it does not need to be installed
install.packages(c(
  "rvest", "xml2", "httr2",
  "RSelenium", "chromote",
  "polite", "Rcrawler",
  "tidyverse", "purrr", "stringr", "lubridate"
))

4. Load libraries once per script

r

library(rvest); library(xml2); library(httr2)
library(RSelenium); library(chromote)
library(polite); library(parallel); library(Rcrawler)
library(tidyverse); library(purrr); library(stringr); library(lubridate)

Core Tools & Packages Overview

You need the right packages loaded before writing any scraping code.

| Task | Package(s) | Purpose |
| --- | --- | --- |
| Static HTML scraping | rvest, xml2 | Fetch and parse page content |
| JavaScript-rendered pages | RSelenium, chromote | Drive headless browsers for dynamic content |
| HTTP control | httr2 | Custom headers, cookies, rate limiting |
| Parallel crawling | parallel, Rcrawler | Multi-core scraping, depth/pagination control |
| Data hygiene & export | tidyverse, jsonlite | Clean, transform, save to CSV or JSON |
| Polite scraping | polite, httr2 (req_throttle()) | Respect robots.txt and throttle requests |

Step 2: Basic Static Scraping with rvest

rvest makes fetching and parsing HTML trivial.

1. Fetch & Parse

r

url  <- "https://example.com/products"
page <- read_html(url)

2. Extract Elements

r

titles <- page %>% html_elements(".product-name") %>% html_text2()
prices <- page %>% html_elements(".price")        %>% html_text2()
df     <- tibble(title = titles, price = prices)
head(df)

Beginner: html_elements() returns a set of matching nodes (an xml_nodeset); html_text2() extracts their text and cleans whitespace.

Professional: Chain pipes for readability; inspect intermediate objects with print().
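
Attributes are extracted with html_attr(). A hedged sketch, assuming each product name wraps a link (the .product-name a selector is illustrative):

r

links <- page %>%
  html_elements(".product-name a") %>%   # hypothetical anchor inside each product name
  html_attr("href") %>%
  url_absolute(url)                      # resolve relative links against the page URL

head(links)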

Step 3: Pagination, Tables & Early Error Handling

Most sites spread data across pages and tables, and sometimes requests fail.

1. Scrape a Table

r

table <- page %>% html_node("table.inventory") %>% html_table(header = TRUE)

2. Loop Through Pages

r

pages <- sprintf("https://example.com/page/%d", 1:5)

safe_read  <- possibly(read_html, otherwise = NULL)
safe_table <- possibly(html_table, otherwise = tibble())

all_tables <- map_dfr(pages, function(url) {
  pg <- safe_read(url)
  if (is.null(pg)) return(tibble())   # skip pages that failed to download
  pg %>% html_node("table.inventory") %>% safe_table(header = TRUE)
})

Beginner: map_dfr() binds rows and handles lists of tibbles.

Professional: Wrapping in possibly() ensures your loop continues even if one page errors.

Step 4: HTTP Control with httr2

Customize headers and throttle to avoid blocks.

r

response <- request("https://example.com/data") %>%
  req_headers(`User-Agent` = "R Scraper v1.0") %>%
  req_throttle(5 / 60) %>%    # 5 requests per minute
  req_perform()

page <- resp_body_html(response)

Beginner: Change User-Agent to mimic a browser and bypass simple bot filters.

Professional: Randomize delays:

r

Sys.sleep(runif(1, 1, 3))
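
httr2 can also retry transient failures (such as HTTP 429 or 503) for you. A minimal sketch combining req_user_agent(), req_throttle(), and req_retry() on the same example URL:

r

resp <- request("https://example.com/data") %>%
  req_user_agent("R Scraper v1.0") %>%                                 # identify your scraper
  req_throttle(5 / 60) %>%                                             # 5 requests per minute
  req_retry(max_tries = 3, backoff = function(i) runif(1, 1, 3)) %>%   # up to 3 tries, random pause between them
  req_perform()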

Step 5: Dynamic Pages via RSelenium & chromote

Some sites render content client-side with JavaScript.

1. RSelenium (Headless Chrome)

r

rD <- rsDriver(
  browser = "chrome", chromever = "latest",
  extraCapabilities = list(chromeOptions = list(args = list("--headless")))
)
remDr <- rD$client
remDr$navigate("https://example.com/js-content")
Sys.sleep(4)  # wait for JS to load

html   <- remDr$getPageSource()[[1]] %>% read_html()
titles <- html %>% html_elements(".js-title") %>% html_text2()

remDr$close(); rD$server$stop()

2. chromote (Lightweight JS)

r

session <- ChromoteSession$new()
session$Page$navigate("https://example.com/js-content")
session$Page$loadEventFired()

html  <- session$Runtime$evaluate("document.documentElement.outerHTML")$result$value
doc   <- read_html(html)
items <- doc %>% html_elements(".item") %>% html_text2()

Beginner: RSelenium automates what you’d do manually in a browser.

Professional: Pass proxy settings via Chrome arguments if needed. chromote has no proxy option of its own, so launch Chrome with the flag and build the session from that browser instance (note that Chrome ignores credentials embedded in --proxy-server, so use an IP-whitelisted endpoint here):

r

b <- Chromote$new(
  browser = Chrome$new(args = "--proxy-server=http://proxy.okeyproxy.com:8000")
)
session <- ChromoteSession$new(parent = b)

Step 6: Integrating OkeyProxy & Rate-Limiting

Proxies keep your IP fresh, help you avoid geo-blocks, and distribute load. Sign up for OkeyProxy and select a rotating residential or rotating datacenter proxy plan as needed.

1. Single-Request Proxy

r

proxy <- "http://user:[email protected]:8000"

resp <- request("https://example.com") %>%
  req_proxy(proxy) %>%
  req_throttle(2 / 1) %>%  # 2 requests per second
  req_perform()

page <- resp_body_html(resp)

2. Automated IP Rotation

r

proxies <- c(
  "http://u:[email protected]:8000",
  "http://u:[email protected]:8000",
  "http://u:[email protected]:8000"
)

scrape_page <- function(url, proxy) {
  request(url) %>% req_proxy(proxy) %>% req_perform() %>%
    resp_body_html() %>%                      # already returns a parsed document
    html_elements(".data") %>% html_text2() %>%
    tibble(data = .)
}

results <- map2_dfr(pages, rep(proxies, length.out = length(pages)), scrape_page)

Professional: Rotate proxies round-robin to spread traffic and stay under per-IP limits.
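
Rotation also enables failover: if a request through one proxy fails, fall back to the next endpoint in the pool. A hedged sketch built on the scrape_page() helper above:

r

# try each proxy in turn until one succeeds; return an empty tibble if all fail
scrape_with_failover <- function(url, proxy_pool) {
  for (proxy in proxy_pool) {
    result <- tryCatch(scrape_page(url, proxy), error = function(e) NULL)
    if (!is.null(result)) return(result)
    message("Proxy failed for ", url, "; trying the next one")
  }
  tibble(data = character())
}

results <- map_dfr(pages, scrape_with_failover, proxy_pool = proxies)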

Step 7: Scaling with Parallel & Rcrawler

Parallelization and automated crawling speed up large jobs.

1. Parallel Scraping

r

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(rvest))

out <- parLapply(cl, pages, function(u) {
  read_html(u) %>% html_elements(".product") %>% html_text2()
})

stopCluster(cl)

2. Automated Crawling with Rcrawler

r

Rcrawler(
  Website       = "https://example.com",
  no_cores      = 4,
  no_conn       = 4,
  MaxDepth      = 2,
  ExtractCSSPat = c("h1"),      # grab each page's <h1> while crawling
  PatternsNames = c("title")
)
# Rcrawler stores results in the INDEX and DATA objects it creates in your workspace

Professional: Monitor memory, checkpoint progress, and gracefully stop clusters on repeated failures.
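
One simple way to checkpoint a long crawl is to cache each page's result on disk and skip URLs that are already done. A minimal sketch, assuming a local cache/ directory (created below) and the scrape_page() helper from Step 6; the file-naming scheme is just illustrative:

r

dir.create("cache", showWarnings = FALSE)

# scrape one URL, reusing a cached result if a previous run already saved it
scrape_cached <- function(url, proxy) {
  path <- file.path("cache", paste0(gsub("[^A-Za-z0-9]", "_", url), ".rds"))
  if (file.exists(path)) return(readRDS(path))
  result <- scrape_page(url, proxy)
  saveRDS(result, path)   # checkpoint: a later run can resume where this one stopped
  result
}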

Step 8: Errors, Cleaning & Selector Tools

Clean data and resilient scripts save time down the road.

1. Error Handling

r

safe_parse <- possibly(
  function(url) read_html(url) %>% html_elements(".item") %>% html_text2(),
  otherwise = character()
)
results <- map(pages, safe_parse)

2. Data Cleaning

r

# assumes df also has date and text columns; drop those lines if it does not
df <- df %>%
  mutate(
    price = str_remove_all(price, "[^0-9\\.]"),
    date  = parse_date(date, "%B %d, %Y"),
    text  = str_squish(text)
  )

3. Selector Discovery

Use SelectorGadget, DevTools, or CSS Diner to refine your CSS or XPath queries.

4. CAPTCHA & Bot Defenses

Introduce randomized delays, header rotations, or leverage headless browsers.
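
Header rotation can be as simple as sampling a User-Agent per request. A hedged sketch with an illustrative pool of UA strings:

r

# illustrative User-Agent strings; swap in whatever browsers you want to mimic
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
)

resp <- request("https://example.com/data") %>%
  req_user_agent(sample(user_agents, 1)) %>%  # pick a random UA for this request
  req_perform()

Sys.sleep(runif(1, 1, 3))  # randomized delay before the next request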

Tip: In your Mini-Project below, see how cleaning functions produce a polished CSV.

Ethics & Best Practices

Responsible scraping protects you and the sites you crawl.

1. Respect robots.txt:

r

session <- bow("https://example.com", delay = 2)
page    <- scrape(session)

2. Check Terms: Look at the site’s Terms of Service.

3. Throttle Requests: Use req_throttle() or Sys.sleep().

4. Identify Yourself: Use a clear User-Agent string unless anonymity is needed.

5. Privacy Compliance: Don’t scrape personal data without permission; follow GDPR.

6. Monitor & Log: Record successes, failures, and response times for debugging.
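
A lightweight way to monitor and log runs is to wrap each fetch in tryCatch() and record the outcome and elapsed time. A minimal sketch, assuming the pages vector from Step 3:

r

# fetch each page, recording success/failure and elapsed seconds for later review
log <- map_dfr(pages, function(url) {
  started <- Sys.time()
  ok <- tryCatch({ read_html(url); TRUE }, error = function(e) FALSE)
  tibble(
    url     = url,
    success = ok,
    seconds = as.numeric(difftime(Sys.time(), started, units = "secs"))
  )
})

write_csv(log, "scrape_log.csv")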

Putting It All Together: A Mini-Project Example

Scrape a paginated e‑commerce site, rotate OkeyProxy endpoints, clean data, and export a CSV:

r

library(rvest); library(httr2); library(tidyverse)

proxies <- c(
  "http://u:[email protected]:8000",
  "http://u:[email protected]:8000"
)
urls <- paste0("https://shop.example.com/page", 1:5)

scrape_page <- function(url, proxy) {
  resp <- request(url) %>% req_proxy(proxy) %>% req_throttle(1 / 1) %>% req_perform()
  doc  <- resp_body_html(resp)
  tibble(
    title = doc %>% html_elements(".title") %>% html_text2(),
    price = doc %>% html_elements(".price") %>% html_text2()
  )
}

raw_data <- map2_dfr(urls, rep(proxies, length.out = length(urls)), scrape_page)

cleaned <- raw_data %>%
  mutate(price = str_remove_all(price, "[^0-9\\.]") %>% as.numeric()) %>%
  drop_na(price)

write_csv(cleaned, "products.csv")

Conclusion

You now have a full-stack R scraping workflow—from rvest basics through JS rendering, OkeyProxy rotation, parallel crawls, and robust data cleaning. Happy scraping—and remember: scrape responsibly.

Unlock the full potential of your R scrapers by signing up for a free trial of OkeyProxy today. Its high-quality, affordable rotating proxies help you stay under rate limits and avoid blocks. See how easy it is to power fast, reliable, and ethical data collection at scale!