Web Scraping in R: A Comprehensive Guide with OkeyProxy
Web scraping transforms unstructured web pages into clean datasets for analysis, reporting, or dashboards—and R’s rich package ecosystem makes it easy. This guide takes you from zero to advanced techniques: environment setup, HTML/CSS fundamentals, basic rvest usage, HTTP control, dynamic content handling, proxy integration, parallel crawling, error resilience, and ethics.

Understanding HTML & CSS Basics
Before you can scrape data, you need to know how web pages are structured and how to pinpoint the elements you care about.
Why HTML & CSS Matter
Web pages are built from HTML—a hierarchy of elements (tags)—and styled or organized using CSS. To extract specific text, images, or tables, you tell your scraping tools exactly which elements to target via selectors.
Term Check
Element: An HTML tag, e.g. <p>, <h1>, <table>.
Attribute: Additional information inside a tag, e.g. class="price", id="main".
Selector: A pattern (CSS or XPath) that matches one or more elements on the page.
Example HTML
html
<div class="product">
<h2 id="title">Widget A</h2>
<p class="price">$10.00</p>
</div>
- The <div> groups a product.
- The <h2> has an id of title.
- The <p> has a class of price.
Common CSS Selectors
Class selector
css
.price /* matches <p class="price"> */
ID selector
css
#title /* matches <h2 id="title"> */
Nested selector
css
div.product > p.price
/* matches any <p class="price"> directly inside <div class="product"> */
Simple XPath Example
xpath
//div[@class="product"]/p[@class="price"]
This finds any <p> with class="price" inside a <div> of class product.
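To see these selectors in action, here is a minimal rvest sketch (rvest is installed in Step 1 below) that parses the example HTML above and queries it with both the CSS patterns and the XPath expression:
r
library(rvest)

# Parse the example HTML from a string (read_html also accepts URLs and file paths)
doc <- read_html('
  <div class="product">
    <h2 id="title">Widget A</h2>
    <p class="price">$10.00</p>
  </div>
')

doc %>% html_element("#title") %>% html_text2()                 # "Widget A"
doc %>% html_element("div.product > p.price") %>% html_text2()  # "$10.00"

# The same price element selected via XPath
doc %>% html_element(xpath = '//div[@class="product"]/p[@class="price"]') %>% html_text2()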
Pro Tip
Install the SelectorGadget browser extension (or try CSS Diner) to click on elements and generate CSS selectors visually—no manual trial and error needed.
Step 1: Environment & Prerequisites
1. Install R & RStudio (or VS Code)
2. Create a new project
3. Install packages
r
install.packages(c(
"rvest", "xml2", "httr2",
"RSelenium", "chromote",
"polite", "parallel", "Rcrawler",
"tidyverse", "purrr", "stringr", "lubridate"
))
4. Load libraries once per script
r
library(rvest); library(xml2); library(httr2)
library(RSelenium); library(chromote)
library(polite); library(parallel); library(Rcrawler)
library(tidyverse); library(purrr); library(stringr); library(lubridate)
Core Tools & Packages Overview
You need the right packages loaded before writing any scraping code.
| Task | Package(s) | Purpose |
| --- | --- | --- |
| Static HTML scraping | rvest, xml2 | Fetch and parse page content |
| JavaScript-rendered pages | RSelenium, chromote | Drive headless browsers for dynamic content |
| HTTP control | httr2 | Custom headers, cookies, rate limiting |
| Parallel crawling | parallel, Rcrawler | Multi-core scraping, depth/pagination control |
| Data hygiene & export | tidyverse, jsonlite | Clean, transform, save to CSV or JSON |
| Polite scraping | polite, httr2 (req_throttle) | Respect robots.txt and throttle requests |
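As a quick illustration of the Data hygiene & export row, a cleaned tibble can be written to CSV or JSON in a couple of lines (the products tibble here is invented for illustration):
r
library(tidyverse)
library(jsonlite)

# A hypothetical cleaned result
products <- tibble(title = c("Widget A", "Widget B"), price = c(10.00, 12.50))

write_csv(products, "products.csv")                   # CSV via readr
write_json(products, "products.json", pretty = TRUE)  # JSON via jsonlite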
Step 2: Basic Static Scraping with rvest
rvest makes fetching and parsing HTML trivial.
1. Fetch & Parse
r
url <- "https://example.com/products"
page <- read_html(url)
2. Extract Elements
r
titles <- page %>% html_elements(".product-name") %>% html_text2()
prices <- page %>% html_elements(".price") %>% html_text2()
df <- tibble(title = titles, price = prices)
head(df)
Beginner: html_elements() returns a node set (one entry per match); html_text2() extracts the text and normalizes whitespace.
Professional: Chain pipes for readability; inspect intermediate objects with print().
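For example, you can break a pipe chain apart to inspect each step, or pull attributes instead of text (the .product-name and a.product-link selectors are assumptions about the target page):
r
nodes <- page %>% html_elements(".product-name")
print(nodes)     # inspect the matched node set
length(nodes)    # how many matches?

# Attributes instead of text (assumes links marked up as <a class="product-link" href="...">)
links <- page %>% html_elements("a.product-link") %>% html_attr("href")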
Step 3: Pagination, Tables & Early Error Handling
Most sites spread data across pages and tables, and sometimes requests fail.
1. Scrape a Table
r
table <- page %>% html_element("table.inventory") %>% html_table(header = TRUE)
2. Loop Through Pages
r
pages <- sprintf("https://example.com/page/%d", 1:5)
safe_read <- possibly(read_html, otherwise = NULL)
safe_table <- possibly(function(pg) html_table(html_element(pg, "table.inventory"), header = TRUE), otherwise = tibble())
all_tables <- map_dfr(pages, ~ {
pg <- safe_read(.x)
if (is.null(pg)) tibble() else safe_table(pg)
})
Beginner: map_dfr() applies a function to each element and row-binds the resulting tibbles into one data frame.
Professional: Wrapping in possibly() ensures your loop continues even if one page errors.
Step 4: HTTP Control with httr2
Customize headers and throttle to avoid blocks.
r
response <- request("https://example.com/data") %>%
req_headers(`User-Agent` = "R Scraper v1.0") %>%
req_throttle(5 / 60) %>% # 5 requests per minute
req_perform()
page <- resp_body_html(response)
Beginner: Change User-Agent to mimic a browser and bypass simple bot filters.
Professional: Randomize delays:
r
Sys.sleep(runif(1, 1, 3))
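A minimal sketch combining both ideas, a custom header plus a randomized pause before each request (the URLs are placeholders):
r
urls <- sprintf("https://example.com/data?page=%d", 1:3)

responses <- map(urls, function(u) {
  Sys.sleep(runif(1, 1, 3))                        # pause a random 1-3 seconds per request
  request(u) %>%
    req_headers(`User-Agent` = "R Scraper v1.0") %>%
    req_perform() %>%
    resp_body_html()
})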
Step 5: Dynamic Pages via RSelenium & chromote
Some sites render content client-side with JavaScript.
1. RSelenium (Headless Chrome)
r
rD <- rsDriver(browser = "chrome", chromever = "latest",
extraCapabilities = list("goog:chromeOptions" = list(args = list("--headless"))))
remDr <- rD$client
remDr$navigate("https://example.com/js-content")
Sys.sleep(4) # wait for JS to load
html <- remDr$getPageSource()[[1]] %>% read_html()
titles <- html %>% html_elements(".js-title") %>% html_text2()
remDr$close(); rD$server$stop()
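Fixed sleeps like Sys.sleep(4) are fragile. A common alternative is to poll until a target element appears; this is a sketch, assuming the page eventually renders an element matching .js-title, and it would be called right after navigate(), before the session is closed:
r
# Poll until an element matching `css` appears, instead of a fixed Sys.sleep()
wait_for_element <- function(remDr, css, timeout = 15) {
  t0 <- Sys.time()
  repeat {
    found <- remDr$findElements(using = "css selector", value = css)
    if (length(found) > 0) return(invisible(TRUE))
    if (as.numeric(difftime(Sys.time(), t0, units = "secs")) > timeout)
      stop("Timed out waiting for: ", css)
    Sys.sleep(0.5)
  }
}

# Usage: call in place of Sys.sleep(4) after remDr$navigate()
# wait_for_element(remDr, ".js-title")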
2. chromote (Lightweight JS)
r
session <- ChromoteSession$new()
session$Page$navigate("https://example.com/js-content")
session$Page$loadEventFired()
html <- session$Runtime$evaluate("document.documentElement.outerHTML")$result$value
doc <- read_html(html)
items <- doc %>% html_elements(".item") %>% html_text2()
Beginner: RSelenium automates what you’d do manually in a browser.
Professional: Pass proxy settings via Chrome arguments if needed:
r
chrome <- Chrome$new(args = c("--proxy-server=http://proxy.okeyproxy.com:8000"))
session <- ChromoteSession$new(parent = Chromote$new(browser = chrome))
Note: Chrome's --proxy-server flag does not accept embedded credentials, so authenticate via IP allowlisting or handle proxy authentication separately.
Step 6: Integrating OkeyProxy & Rate-Limiting

Proxies keep your IP fresh, avoid geo-blocks, and distribute load. Sign up for OkeyProxy and select a rotating residential or rotating datacenter proxy plan as needed.
1. Single-Request Proxy
r
proxy <- "http://user:[email protected]:8000"
resp <- request("https://example.com") %>%
req_proxy(proxy) %>%
req_throttle(2 / 1) %>% # 2 requests/sec
req_perform()
page <- resp_body_html(resp)
2. Automated IP Rotation
r
proxies <- c(
"http://u:[email protected]:8000",
"http://u:[email protected]:8000",
"http://u:[email protected]:8000"
)
scrape_page <- function(url, proxy) {
request(url) %>% req_proxy(proxy) %>% req_perform() %>%
resp_body_html() %>%
html_elements(".data") %>% html_text2() %>%
tibble(data = .)
}
results <- map2_dfr(pages, rep(proxies, length.out = length(pages)), scrape_page)  # `pages` is the URL vector from Step 3
Professional: Rotate proxies round-robin to spread traffic and stay under per-IP limits.
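One way to formalize round-robin rotation is a small helper that cycles through the proxy pool; a sketch reusing scrape_page() and proxies from above:
r
# Return the i-th proxy, wrapping around the pool (simple round-robin)
next_proxy <- function(i, pool) pool[((i - 1) %% length(pool)) + 1]

results <- imap_dfr(pages, function(url, i) {
  scrape_page(url, next_proxy(i, proxies))
})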
Step 7: Scaling with Parallel & Rcrawler
Parallelization and automated crawling speed up large jobs.
1. Parallel Scraping
r
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(rvest))
out <- parLapply(cl, pages, function(u) {
read_html(u) %>% html_elements(".product") %>% html_text2()
})
stopCluster(cl)
2. Automated Crawling with Rcrawler
r
Rcrawler(
Website = "https://example.com",
no_cores = 4,
no_conn = 4,
MaxDepth = 2,
ExtractXpathPat = c("//h1"),
PatternsNames = c("title")
)
Rcrawler does not take a custom extraction function; instead, pass XPath (or CSS) patterns and it stores results in the global INDEX data frame and DATA list.
Professional: Monitor memory, checkpoint progress, and gracefully stop clusters on repeated failures.
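A hedged sketch of those ideas: checkpoint each page's result to disk so a crashed run can resume, and release the cluster even if something fails (the checkpoints directory and file naming are assumptions):
r
dir.create("checkpoints", showWarnings = FALSE)

scrape_all <- function(pages) {
  cl <- makeCluster(detectCores() - 1)
  on.exit(stopCluster(cl), add = TRUE)      # cluster released even if a worker errors
  clusterEvalQ(cl, library(rvest))
  parLapply(cl, pages, function(u) {
    out_file <- file.path("checkpoints", paste0(make.names(u), ".rds"))
    if (file.exists(out_file)) return(readRDS(out_file))   # resume from checkpoint
    res <- tryCatch(
      read_html(u) %>% html_elements(".product") %>% html_text2(),
      error = function(e) character()
    )
    saveRDS(res, out_file)
    res
  })
}

out <- scrape_all(pages)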
Step 8: Errors, Cleaning & Selector Tools
Clean data and resilient scripts save time down the road.
1. Error Handling
r
safe_parse <- possibly(function(u) html_text2(html_elements(read_html(u), ".item")), otherwise = character())
results <- map(pages, safe_parse)
2. Data Cleaning
r
df <- df %>%
mutate(
price = str_remove_all(price, "[^0-9\\.]"),
date = parse_date(date, "%B %d, %Y"),
text = str_squish(text)
)
3. Selector Discovery
Use SelectorGadget, DevTools, or CSS Diner to refine your CSS or XPath queries.
4. CAPTCHA & Bot Defenses
Introduce randomized delays, header rotations, or leverage headless browsers.
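A light-touch sketch of header rotation plus jittered delays (the User-Agent strings are illustrative, not a curated pool):
r
user_agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
  "Mozilla/5.0 (X11; Linux x86_64)"
)

polite_get <- function(url) {
  Sys.sleep(runif(1, 2, 5))                                   # jittered delay
  request(url) %>%
    req_headers(`User-Agent` = sample(user_agents, 1)) %>%    # rotate User-Agent per request
    req_perform() %>%
    resp_body_html()
}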
Tip: The Mini-Project below shows how these cleaning functions produce a polished CSV.
Ethics & Best Practices
Responsible scraping protects you and the sites you crawl.
1. Respect robots.txt:
r
session <- bow("https://example.com", delay = 2)
page <- scrape(session)
2. Check Terms: Look at the site’s Terms of Service.
3. Throttle Requests: Use req_throttle() or Sys.sleep().
4. Identify Yourself: Set a clear, descriptive User-Agent unless anonymity is genuinely required.
5. Privacy Compliance: Don’t scrape personal data without permission; follow GDPR.
6. Monitor & Log: Record successes, failures, and response times for debugging (see the sketch after this list).
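Here is a minimal logging sketch built on httr2; the log file name and fields are assumptions:
r
# Log timestamp, URL, status, and elapsed time for each request
logged_request <- function(url) {
  t0 <- Sys.time()
  resp <- tryCatch(request(url) %>% req_perform(), error = function(e) e)
  elapsed <- round(as.numeric(difftime(Sys.time(), t0, units = "secs")), 2)
  status <- if (inherits(resp, "error")) "error" else resp_status(resp)
  cat(sprintf("%s | %s | %s | %ss\n", Sys.time(), url, status, elapsed),
      file = "scrape_log.txt", append = TRUE)
  resp
}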
Putting It All Together: A Mini-Project Example
Scrape a paginated e‑commerce site, rotate OkeyProxy endpoints, clean data, and export a CSV:
r
library(rvest); library(httr2); library(tidyverse)
proxies <- c(
"http://u:[email protected]:8000",
"http://u:[email protected]:8000"
)
urls <- paste0("https://shop.example.com/page", 1:5)
scrape_page <- function(url, proxy) {
resp <- request(url) %>% req_proxy(proxy) %>% req_throttle(1 / 1) %>% req_perform()  # 1 request per second
doc <- resp_body_html(resp)
tibble(
title = doc %>% html_elements(".title") %>% html_text2(),
price = doc %>% html_elements(".price") %>% html_text2()
)
}
raw_data <- map2_dfr(urls, rep(proxies, length.out = length(urls)), scrape_page)
cleaned <- raw_data %>%
mutate(
price = str_remove_all(price, "[^0-9\\.]") %>% as.numeric()
) %>%
drop_na(price)
write_csv(cleaned, "products.csv")
Conclusion
You now have a full-stack R scraping workflow—from rvest basics through JS rendering, OkeyProxy rotation, parallel crawls, and robust data cleaning. Happy scraping—and remember: scrape responsibly.
Unlock the full potential of your R scrapers—sign up for a free trial of OkeyProxy today. Its high-quality, affordable rotating proxies help you stay under rate limits and avoid blocks. See how easy it is to power fast, reliable, and ethical data collection at scale!








