Web Scraping Dynamic Pages with Python: Selenium & APIs
Web scraping dynamic pages requires specialized techniques to extract data that appears only after the initial page load. Our tutorial guides you through two methods: browser automation with Selenium or requests-html, and reverse-engineering a page's JavaScript to fetch data directly from its underlying API calls.
Method 1: Browser Automation with Selenium and requests-html
This method uses browser automation tools to render JavaScript and capture the fully loaded HTML, mimicking how a user interacts with a website.
Step 1: Set Up Your Environment
Install Python (3.8 or higher) and the necessary libraries:
● Selenium: For robust browser automation.
● requests-html: A lightweight alternative for rendering JavaScript.
● BeautifulSoup: For parsing HTML.
Run:
bash
pip install selenium requests-html beautifulsoup4
For Selenium, download a ChromeDriver build that matches your Chrome version from chromedriver.chromium.org, then place it in your system’s PATH or specify its path in your script. On Selenium 4.6 and newer, the bundled Selenium Manager can resolve a matching driver automatically, so this manual step is optional. To avoid IP blocks when scraping, consider integrating OkeyProxy’s residential proxies for seamless access to dynamic content.
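If you rely on Selenium Manager, a minimal smoke test of your setup might look like the sketch below (the URL is a placeholder):
python
from selenium import webdriver

# Selenium 4.6+ resolves a matching ChromeDriver automatically via
# Selenium Manager, so no explicit driver path is needed here.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder URL
print(driver.title)
driver.quit()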
Step 2: Scrape with Selenium
Here’s a sample script to scrape product names and prices from a dynamic e-commerce site:
python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import csv
import time

# Set up Chrome WebDriver
driver_path = "path/to/chromedriver"  # Replace with your ChromeDriver path
service = Service(driver_path)
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # Run in headless mode
driver = webdriver.Chrome(service=service, options=options)

try:
    # Navigate to the target page
    url = "https://example.com/products"  # Replace with your target URL
    driver.get(url)

    # Wait for dynamic content
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "product-item"))
    )

    # Scroll to load more content (if needed)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Adjust based on site

    # Parse page source with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, "html.parser")
    products = soup.find_all("div", class_="product-item")

    # Save to CSV
    with open("products_selenium.csv", "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Product Name", "Price"])
        for product in products:
            try:
                name = product.find("h2", class_="product-name").text.strip()
                price = product.find("span", class_="product-price").text.strip()
                writer.writerow([name, price])
                print(f"Product: {name}, Price: {price}")
            except AttributeError:
                print("Skipping product: Missing data")
finally:
    driver.quit()
This script uses headless mode for efficiency, waits for elements, scrolls for lazy-loaded content, and saves data to a CSV. To enhance reliability, OkeyProxy’s proxies can be integrated to rotate IPs and avoid detection.
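The single scroll above may not be enough for infinite-scroll pages. A common pattern, sketched below on the assumption that the page keeps growing as you scroll (it reuses the driver from the script above), is to repeat until the document height stabilizes:
python
import time

def scroll_to_bottom(driver, pause=2, max_rounds=20):
    """Hypothetical helper: scroll until the page height stops growing."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content loaded
        last_height = new_height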
Step 3: Alternative with requests-html
For simpler sites, requests-html is a lightweight option; note that its first render() call downloads a bundled Chromium build, which can take a moment. Here’s an example:
python
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv

# Initialize session
session = HTMLSession()

# Fetch and render page
url = "https://example.com/products"  # Replace with your target URL
response = session.get(url)
response.html.render(sleep=2, timeout=20)  # Render JavaScript

# Parse with BeautifulSoup
soup = BeautifulSoup(response.html.html, "html.parser")
products = soup.find_all("div", class_="product-item")

# Save to CSV
with open("products_requests_html.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "Price"])
    for product in products:
        try:
            name = product.find("h2", class_="product-name").text.strip()
            price = product.find("span", class_="product-price").text.strip()
            writer.writerow([name, price])
            print(f"Product: {name}, Price: {price}")
        except AttributeError:
            print("Skipping product: Missing data")

session.close()
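requests-html also ships its own CSS selector API, so BeautifulSoup is optional. A minimal sketch using the same (hypothetical) class names as above:
python
# Same page, parsed with requests-html's built-in find() instead of BeautifulSoup
for product in response.html.find("div.product-item"):
    name = product.find("h2.product-name", first=True)
    price = product.find("span.product-price", first=True)
    if name and price:
        print(f"Product: {name.text}, Price: {price.text}")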
Pros:
● Handles complex JavaScript and user interactions.
● Aligns with browser developer tools for easy selector identification.
Cons:
● Resource-intensive (Selenium more than requests-html).
● Slower due to full-page rendering.
● May face IP blocks without proxy rotation.
Tip: Use requests-html for lightweight tasks and Selenium for complex interactions. Pair with OkeyProxy to manage IP rotation effortlessly.
Why OkeyProxy?
OkeyProxy provides Residential Proxies, Rotating Proxies, Static Proxies, and Datacenter Proxies with over 150 million IPs across 200+ countries, supporting HTTP(S) and SOCKS5 for web scraping, ad verification, and more. Its user-friendly dashboard integrates with tools like Selenium, ensuring 99.9% uptime and GDPR compliance. Learn more.
Method 2: Reverse-Engineer JavaScript for API Calls
This method involves identifying and replicating the API calls that load dynamic data, often yielding clean JSON output.
Step 1: Inspect Network Activity
1. Open the target website in Chrome and press F12 to access Developer Tools.
2. Go to the “Network” tab, filter by “XHR” or “Fetch” to find API requests.
3. Reload the page and identify JSON-returning requests (e.g., https://example.com/api/products?page=1).
4. Note the URL, headers, and parameters.
Example JSON response:
json
{
  "products": [
    {"name": "Widget A", "price": "$19.99"},
    {"name": "Widget B", "price": "$29.99"}
  ]
}
Step 2: Replicate the API Call
Use the requests library to fetch data directly:
python
import requests
import csv
import time

# API endpoint and headers
url = "https://example.com/api/products"  # Replace with actual API URL
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124",
    "Accept": "application/json"
}

# Make the request
response = requests.get(url, headers=headers, params={"page": 1})
response.raise_for_status()

# Parse JSON data
data = response.json()
products = data.get("products", [])

# Save to CSV
with open("products_api.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "Price"])
    for product in products:
        name = product.get("name", "N/A")
        price = product.get("price", "N/A")
        writer.writerow([name, price])
        print(f"Product: {name}, Price: {price}")

time.sleep(1)  # Avoid rate limits
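If the endpoint intermittently returns errors or 429s, a retry policy with backoff helps. Here is a sketch using requests’ standard Retry adapter (the parameters are illustrative, not tuned for any particular site):
python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures and rate limits with exponential backoff
session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[429, 500, 502, 503])
session.mount("https://", HTTPAdapter(max_retries=retries))

response = session.get(url, headers=headers, params={"page": 1}, timeout=10)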
Step 3: Handle Pagination
For paginated APIs, loop through pages:
python
page = 1
products = []
while True:
    response = requests.get(url, headers=headers, params={"page": page})
    if response.status_code != 200:
        break
    data = response.json()
    if not data.get("products"):
        break
    products.extend(data["products"])
    page += 1
    time.sleep(1)  # Avoid rate limits
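The loop only accumulates results in memory; persisting them afterwards follows the same CSV pattern as Step 2:
python
# Write the accumulated results once pagination finishes
with open("products_api.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "Price"])
    for product in products:
        writer.writerow([product.get("name", "N/A"), product.get("price", "N/A")])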
Step 4: Enhance with Proxies
To prevent rate-limiting or IP bans, integrate OkeyProxy’s residential proxies:
python
proxies = {
    "http": "http://your_okeyproxy_username:your_password@proxy.okeyproxy.com:port",
    "https": "http://your_okeyproxy_username:your_password@proxy.okeyproxy.com:port"
}
response = requests.get(url, headers=headers, params={"page": 1}, proxies=proxies)
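With several proxy endpoints available, rotating them per request spreads traffic further. A minimal sketch (the proxy1/proxy2 hostnames are placeholders, not real OkeyProxy endpoints):
python
import itertools

# Hypothetical pool of proxy endpoints, cycled per request
proxy_pool = itertools.cycle([
    "http://your_username:your_password@proxy1.okeyproxy.com:port",
    "http://your_username:your_password@proxy2.okeyproxy.com:port",
])

for page in range(1, 4):
    proxy = next(proxy_pool)
    response = requests.get(url, headers=headers,
                            params={"page": page},
                            proxies={"http": proxy, "https": proxy})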
Pros:
● Clean JSON data, no HTML parsing needed.
● Faster and less resource-intensive.
● Lower server load.
Cons:
● Requires reverse-engineering skills.
● APIs may change or require authentication.
● Risk of rate-limiting without proper proxy management.
Tip: Use OkeyProxy to rotate IPs and maintain uninterrupted access to APIs. Explore our plans and start scraping reliably.
Comparison of Methods
The table below compares browser automation (Selenium/requests-html) and API scraping based on key factors:
| Factor | Browser Automation (Selenium/requests-html) | API Scraping |
| --- | --- | --- |
| Ease of Setup | Moderate: requires installing libraries and browser drivers (Selenium) or a lightweight setup (requests-html). | Harder: requires reverse-engineering API calls using Developer Tools. |
| Performance | Slower: renders full pages, consuming more CPU and memory. | Faster: direct HTTP requests with minimal overhead. |
| Data Output | HTML: requires parsing with BeautifulSoup or similar. | JSON: clean, structured data that is easier to process. |
| Scalability | Limited: resource-intensive, especially for large-scale scraping. | High: lightweight requests scale better with proxy rotation. |
| Handling JavaScript | Excellent: fully renders JavaScript and handles user interactions. | Limited: relies on accessible API endpoints. |
| Anti-Scraping Resistance | Moderate: headless mode and proxies (e.g., OkeyProxy) reduce detection, but CAPTCHAs remain a risk. | Higher risk: direct API calls may trigger rate limits or blocks without proxies. |
| Use Case Fit | Best for complex sites with heavy JavaScript or user interactions (e.g., infinite scroll, forms). | Best for sites with accessible APIs and structured data needs. |
| Resource Usage | High: Selenium uses significant memory; requests-html is lighter. | Low: minimal resource consumption for HTTP requests. |
Pros and Cons of Browser Automation:
● Pros:
○ Handles complex JavaScript and user interactions (e.g., clicking, scrolling).
○ Aligns with browser Developer Tools for easy selector identification (XPath, CSS).
○ Works when APIs are inaccessible or obfuscated.
● Cons:
○ Resource-intensive and slower due to full-page rendering.
○ Higher risk of detection without proxy rotation (e.g., via OkeyProxy).
○ Requires parsing HTML, which can be error-prone with dynamic class names.
Pros and Cons of API Scraping:
● Pros:
○ Clean JSON output eliminates HTML parsing.
○ Faster and less resource-intensive.
○ Lower server load; more considerate of the target site when requests are rate-limited properly.
● Cons:
○ Requires reverse-engineering skills to identify API endpoints.
○ APIs may change, require authentication, or be rate-limited.
○ Less effective for sites without accessible APIs.
Tip: OkeyProxy’s residential proxies enhance both methods by rotating IPs to avoid blocks.
FAQs
1. What should I do if Selenium times out waiting for elements?
Increase the WebDriverWait timeout (e.g., from 10 to 20 seconds) or use EC.presence_of_element_located with more specific selectors (e.g., XPath or CSS). Debug by taking screenshots with driver.save_screenshot("debug.png") to inspect the page state. Ensure your proxy (like OkeyProxy) provides stable connections to avoid network delays.
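For example, reusing the imports and driver from Step 2 (the selector is illustrative):
python
from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(driver, 20).until(  # raised from 10 to 20 seconds
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.product-item h2"))
    )
except TimeoutException:
    driver.save_screenshot("debug.png")  # inspect what actually rendered
    raise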
2. How do I integrate proxies with Selenium or requests-html?
For Selenium, add proxy settings to ChromeOptions:
python
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://proxy.okeyproxy.com:port")
For requests-html, pass proxies to the session:
python
session.get(url, proxies={"http": "http://your_okeyproxy_username:your_password@proxy.okeyproxy.com:port"})
OkeyProxy’s dashboard simplifies proxy configuration for seamless integration.
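Note that Chrome’s --proxy-server flag does not accept credentials embedded in the URL. For authenticated proxies, one common workaround is the third-party selenium-wire package; a sketch assuming it is installed (pip install selenium-wire):
python
from seleniumwire import webdriver  # third-party drop-in for selenium's webdriver

seleniumwire_options = {
    "proxy": {
        "http": "http://your_username:your_password@proxy.okeyproxy.com:port",
        "https": "http://your_username:your_password@proxy.okeyproxy.com:port",
    }
}
driver = webdriver.Chrome(seleniumwire_options=seleniumwire_options)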
3. Why is my script failing due to changing class names or CAPTCHAs?
Dynamic class names require robust selectors (e.g., partial matches with XPath: //div[contains(@class, "product")]). For CAPTCHAs, use OkeyProxy’s residential proxies to mimic real user traffic, reducing detection. Test selectors in Developer Tools to ensure stability.
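For example, in Selenium (assuming the By import from Step 2; the class fragment is illustrative):
python
# Partial class match survives auto-generated suffixes like "product-x7f3"
items = driver.find_elements(By.XPATH, '//div[contains(@class, "product")]')
print(f"Matched {len(items)} elements")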
4. When should I use browser automation vs. API scraping?
Use browser automation for sites with complex JavaScript or user interactions (e.g., infinite scroll, form submissions). Use API scraping for cleaner, faster data extraction when endpoints are accessible. For large-scale scraping, OkeyProxy ensures reliable access to both methods.
5. How do I troubleshoot API requests that return errors?
Check the response status code (response.status_code) and headers in Developer Tools. Ensure correct parameters and authentication tokens. If blocked, integrate OkeyProxy’s proxies to rotate IPs. Log errors (print(response.text)) to diagnose issues like rate limits or invalid endpoints.
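A quick diagnostic sketch, using the same placeholder endpoint as Step 2:
python
response = requests.get(url, headers=headers, params={"page": 1})
if response.status_code != 200:
    # Status plus the start of the body usually reveals rate limits or auth errors
    print(response.status_code, response.text[:500])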
How To Choose The Right Approach
● Browser Automation (Selenium/requests-html): Ideal for complex JavaScript, user interactions, or when APIs are inaccessible.
● API Scraping: Best for clean JSON data, faster execution, and lower server impact, provided you can identify endpoints.
OkeyProxy enhances both approaches by providing reliable proxy solutions to avoid blocks and ensure consistent data access. Explore their services at https://www.okeyproxy.com to streamline your scraping projects.
Conclusion
Scraping dynamic pages unlocks valuable data for analysis, from product prices to market trends. Browser automation with Selenium or requests-html is robust for complex sites, while API scraping offers efficiency and clean data.
OkeyProxy’s proxy services ensure uninterrupted access, making your scraping reliable and scalable. Start with browser automation for simplicity, then explore APIs for performance, and leverage OkeyProxy to overcome blocks and restrictions.
Ready to scale your web scraping? Discover our affordable proxy solutions tailored for dynamic web scraping.