Web Scraping Dynamic Pages with Python: Selenium & APIs


Web scraping dynamic pages requires specialized techniques to extract data that appears after the initial page load. Our tutorial will guide you through two methods: browser automation with Selenium and requests-html, and reverse-engineering JavaScript to directly fetch data from API calls. 

Scraping Dynamic Pages with Python

Method 1: Browser Automation with Selenium and requests-html

This method uses browser automation tools to render JavaScript and capture the fully loaded HTML, mimicking how a user interacts with a website.

Step 1: Set Up Your Environment

Install Python (3.8 or higher) and the necessary libraries:

 ● Selenium: For robust browser automation.

 ● requests-html: A lightweight alternative for rendering JavaScript.

 ● BeautifulSoup: For parsing HTML.

Run:

bash

pip install selenium requests-html beautifulsoup4

For Selenium, download ChromeDriver from chromedriver.chromium.org, ensuring it matches your Chrome version, and place it in your system’s PATH or specify its path in your script (Selenium 4.6+ can also fetch a matching driver automatically via Selenium Manager). To avoid IP blocks when scraping, consider integrating OkeyProxy’s residential proxies for seamless access to dynamic content.

Step 2: Scrape with Selenium

Here’s a sample script to scrape product names and prices from a dynamic e-commerce site:

python

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import csv
import time

# Set up Chrome WebDriver
driver_path = "path/to/chromedriver"  # Replace with your ChromeDriver path
service = Service(driver_path)
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # Run in headless mode
driver = webdriver.Chrome(service=service, options=options)

try:
    # Navigate to the target page
    url = "https://example.com/products"  # Replace with your target URL
    driver.get(url)

    # Wait for dynamic content
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "product-item"))
    )

    # Scroll to load more content (if needed)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Adjust based on site

    # Parse page source with BeautifulSoup
    soup = BeautifulSoup(driver.page_source, "html.parser")
    products = soup.find_all("div", class_="product-item")

    # Save to CSV
    with open("products_selenium.csv", "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Product Name", "Price"])
        for product in products:
            try:
                name = product.find("h2", class_="product-name").text.strip()
                price = product.find("span", class_="product-price").text.strip()
                writer.writerow([name, price])
                print(f"Product: {name}, Price: {price}")
            except AttributeError:
                print("Skipping product: Missing data")
finally:
    driver.quit()

This script uses headless mode for efficiency, waits for elements to appear, scrolls once for lazy-loaded content, and saves the data to a CSV. For pages that keep loading more items as you scroll, see the scroll loop sketched below. To enhance reliability, OkeyProxy’s proxies can be integrated to rotate IPs and avoid detection.
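A single scroll is often not enough for infinite-scroll pages. Here is a minimal sketch, reusing the driver and the time import from the script above, that keeps scrolling until the page height stops growing:

python

# Scroll repeatedly until no new content loads
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Give lazy-loaded content time to appear; tune per site
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # Height is stable: assume all content has loaded
    last_height = new_height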

Step 3: Alternative with requests-html

For simpler sites, requests-html is a lightweight option; note that the first call to response.html.render() downloads a Chromium build for its headless renderer. Here’s an example:

python

from requests_html import HTMLSession
from bs4 import BeautifulSoup
import csv

# Initialize session
session = HTMLSession()

# Fetch and render page
url = "https://example.com/products"  # Replace with your target URL
response = session.get(url)
response.html.render(sleep=2, timeout=20)  # Render JavaScript

# Parse with BeautifulSoup
soup = BeautifulSoup(response.html.html, "html.parser")
products = soup.find_all("div", class_="product-item")

# Save to CSV
with open("products_requests_html.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "Price"])
    for product in products:
        try:
            name = product.find("h2", class_="product-name").text.strip()
            price = product.find("span", class_="product-price").text.strip()
            writer.writerow([name, price])
            print(f"Product: {name}, Price: {price}")
        except AttributeError:
            print("Skipping product: Missing data")

session.close()

Pros:

 ● Handles complex JavaScript and user interactions.

 ● Aligns with browser developer tools for easy selector identification.

Cons:

 ● Resource-intensive (Selenium more than requests-html).

 ● Slower due to full-page rendering.

 ● May face IP blocks without proxy rotation.

Tip: Use requests-html for lightweight tasks and Selenium for complex interactions. Pair with OkeyProxy to manage IP rotation effortlessly.

Why OkeyProxy?

OkeyProxy provides Residential Proxies, Rotating Proxies, Static Proxies, and Datacenter Proxies with over 150 million IPs across 200+ countries, supporting HTTP(S) and SOCKS5 for web scraping, ad verification, and more. Its user-friendly dashboard integrates with tools like Selenium, ensuring 99.9% uptime and GDPR compliance. Learn more.

Method 2: Reverse-Engineer JavaScript for API Calls

This method involves identifying and replicating the API calls that load dynamic data, often yielding clean JSON output.

Step 1: Inspect Network Activity

1. Open the target website in Chrome and press F12 to access Developer Tools.

2. Go to the “Network” tab and filter by “XHR” or “Fetch” to find API requests.

3. Reload the page and identify JSON-returning requests (e.g., https://example.com/api/products?page=1).

4. Note the URL, headers, and parameters.

Example JSON response:

json

{
  "products": [
    {"name": "Widget A", "price": "$19.99"},
    {"name": "Widget B", "price": "$29.99"}
  ]
}

Step 2: Replicate the API Call

Use the requests library to fetch data directly:

python

import requests
import csv
import time  # Used to pace requests in the pagination loop (Step 3)

# API endpoint and headers
url = "https://example.com/api/products"  # Replace with actual API URL
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/91.0.4472.124",
    "Accept": "application/json"
}

# Make the request
response = requests.get(url, headers=headers, params={"page": 1})
response.raise_for_status()

# Parse JSON data
data = response.json()
products = data.get("products", [])

# Save to CSV (no delay needed here: writing rows is local, not an API call)
with open("products_api.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Product Name", "Price"])
    for product in products:
        name = product.get("name", "N/A")
        price = product.get("price", "N/A")
        writer.writerow([name, price])
        print(f"Product: {name}, Price: {price}")

Step 3: Handle Pagination

For paginated APIs, loop through pages:

python

page = 1
products = []
while True:
    response = requests.get(url, headers=headers, params={"page": page})
    if response.status_code != 200:
        break
    data = response.json()
    if not data.get("products"):
        break
    products.extend(data["products"])
    page += 1
    time.sleep(1)  # Pause between requests to avoid rate limits
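If the API starts answering with HTTP 429 (Too Many Requests), a fixed one-second delay may not be enough. Below is a minimal retry sketch with exponential backoff; fetch_page is a hypothetical helper, and url and headers are assumed to be the ones defined in Step 2:

python

import time
import requests

def fetch_page(url, headers, page, max_retries=3):
    """Fetch one page of results, backing off exponentially on HTTP 429."""
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers, params={"page": page})
        if response.status_code == 429:
            time.sleep(2 ** attempt)  # Wait 1s, 2s, 4s, ... before retrying
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError(f"Page {page} still rate-limited after {max_retries} retries")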

Step 4: Enhance with Proxies

To prevent rate-limiting or IP bans, integrate OkeyProxy’s residential proxies:

python

proxies = {
    "http": "http://your_okeyproxy_username:[email protected]:port",
    "https": "http://your_okeyproxy_username:[email protected]:port"
}
response = requests.get(url, headers=headers, params={"page": 1}, proxies=proxies)
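For multi-page jobs it is tidier to set the proxies once on a requests.Session so that every call in the pagination loop goes through the proxy automatically. A short sketch, reusing the proxies and headers defined above:

python

import requests

# Configure the session once; all subsequent requests inherit these settings
session = requests.Session()
session.proxies.update(proxies)
session.headers.update(headers)
response = session.get(url, params={"page": 1})  # Proxy applied automatically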

Pros:

 ● Clean JSON data, no HTML parsing needed.

 ● Faster and less resource-intensive.

 ● Lower server load.

Cons:

 ● Requires reverse-engineering skills.

 ● APIs may change or require authentication.

 ● Risk of rate-limiting without proper proxy management.

Tip: Use OkeyProxy to rotate IPs and maintain uninterrupted access to APIs. Explore our plans and start scraping reliably.

Comparison of Methods

The table below compares browser automation (Selenium/requests-html) and API scraping based on key factors:

| Factor | Browser Automation (Selenium/requests-html) | API Scraping |
| --- | --- | --- |
| Ease of Setup | Moderate: requires installing libraries and browser drivers (Selenium) or a lightweight setup (requests-html). | Harder: requires reverse-engineering API calls using Developer Tools. |
| Performance | Slower: renders full pages, consuming more CPU and memory. | Faster: direct HTTP requests with minimal overhead. |
| Data Output | HTML: requires parsing with BeautifulSoup or similar. | JSON: clean, structured data, easier to process. |
| Scalability | Limited: resource-intensive, especially for large-scale scraping. | High: lightweight requests scale better with proxy rotation. |
| Handling JavaScript | Excellent: fully renders JavaScript and handles user interactions. | Limited: relies on accessible API endpoints. |
| Anti-Scraping Resistance | Moderate: headless mode and proxies (e.g., OkeyProxy) reduce detection, but still vulnerable to CAPTCHAs. | Higher risk: direct API calls may trigger rate limits or blocks without proxies. |
| Use Case Fit | Best for complex sites with heavy JavaScript or user interactions (e.g., infinite scroll, forms). | Best for sites with accessible APIs and structured data needs. |
| Resource Usage | High: Selenium uses significant memory; requests-html is lighter. | Low: minimal resource consumption for HTTP requests. |

 Pros and Cons of Browser Automation:

 ● Pros: 

 ○ Handles complex JavaScript and user interactions (e.g., clicking, scrolling).

 ○ Aligns with browser Developer Tools for easy selector identification (XPath, CSS).

 ○ Works when APIs are inaccessible or obfuscated.

 ● Cons: 

 ○ Resource-intensive and slower due to full-page rendering.

 ○ Higher risk of detection without proxy rotation (e.g., via OkeyProxy).

 ○ Requires parsing HTML, which can be error-prone with dynamic class names.

Pros and Cons of API Scraping:

 ● Pros: 

 ○ Clean JSON output eliminates HTML parsing.

 ○ Faster and less resource-intensive.

 ○ Lower server load, more ethical when rate-limited properly.

 ● Cons: 

 ○ Requires reverse-engineering skills to identify API endpoints.

 ○ APIs may change, require authentication, or be rate-limited.

 ○ Less effective for sites without accessible APIs.

Tip: OkeyProxy’s residential proxies enhance both methods by rotating IPs to avoid blocks.

FAQs

1.  What should I do if Selenium times out waiting for elements? 

Increase the WebDriverWait timeout (e.g., from 10 to 20 seconds) or use EC.presence_of_element_located with more specific selectors (e.g., XPath or CSS). Debug by taking screenshots with driver.save_screenshot("debug.png") to inspect the page state. Ensure your proxy (like OkeyProxy) provides stable connections to avoid network delays.
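For example, a longer wait plus a screenshot on failure might look like this (a sketch assuming the driver and imports from the Step 2 script):

python

from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(driver, 20).until(  # Timeout raised from 10 to 20 seconds
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.product-item"))
    )
except TimeoutException:
    driver.save_screenshot("debug.png")  # Inspect what actually rendered
    raise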

2.  How do I integrate proxies with Selenium or requests-html? 

For Selenium, add proxy settings to ChromeOptions: 

python

options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://proxy.okeyproxy.com:port")

For requests-html, pass proxies to the session: 

python

session.get(url, proxies={"http": "http://your_okeyproxy_username:[email protected]:port"})

OkeyProxy’s dashboard simplifies proxy configuration for seamless integration.

3.  Why is my script failing due to changing class names or CAPTCHAs? 

Dynamic class names require robust selectors (e.g., partial matches with XPath: //div[contains(@class, "product")]). For CAPTCHAs, use OkeyProxy’s residential proxies to mimic real user traffic, reducing detection. Test selectors in Developer Tools to ensure stability.
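A sketch of such a partial-match selector in Selenium, reusing the WebDriverWait imports from the Step 2 script:

python

# Match any div whose class attribute contains "product", which survives
# auto-generated suffixes like "product-item-x7f3a"
items = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, '//div[contains(@class, "product")]')
    )
)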

4.  When should I use browser automation vs. API scraping? 

Use browser automation for sites with complex JavaScript or user interactions (e.g., infinite scroll, form submissions). Use API scraping for cleaner, faster data extraction when endpoints are accessible. For large-scale scraping, OkeyProxy ensures reliable access to both methods.

5.  How do I troubleshoot API requests that return errors?

Check the response status code (response.status_code) and headers in Developer Tools. Ensure correct parameters and authentication tokens. If blocked, integrate OkeyProxy’s proxies to rotate IPs. Log errors (print(response.text)) to diagnose issues like rate limits or invalid endpoints. 

How To Choose The Right Approach

 ● Browser Automation (Selenium/requests-html): Ideal for complex JavaScript, user interactions, or when APIs are inaccessible.

 ● API Scraping: Best for clean JSON data, faster execution, and lower server impact, provided you can identify endpoints.

OkeyProxy enhances both approaches by providing reliable proxy solutions to avoid blocks and ensure consistent data access. Explore our services at https://www.okeyproxy.com to streamline your scraping projects.

Conclusion

Scraping dynamic pages unlocks valuable data for analysis, from product prices to market trends. Browser automation with Selenium or requests-html is robust for complex sites, while API scraping offers efficiency and clean data. 

OkeyProxy’s proxy services ensure uninterrupted access, making your scraping reliable and scalable. Start with browser automation for simplicity, then explore APIs for performance, and leverage OkeyProxy to overcome blocks and restrictions.

Ready to scale your web scraping? Discover our affordable proxy solutions tailored for dynamic web scraping.