
How to Scrape Walmart Data with OkeyProxy: A Clear, Step-by-Step Guide

Tutorial
OkeyProxy

Scraping Walmart’s catalog unlocks powerful insights—price tracking, product research, trend analysis—but also comes with anti‑bot hurdles like CAPTCHAs, rate limits, and IP bans. This guide walks you through every step, with two extraction methods and full OkeyProxy integration, so you can copy-paste and run, then customize, optimize, and scale.


Why Scrape Walmart? 

Scraping Walmart provides actionable insights that power business and research decisions. Here are the top reasons why users want this data:

Price Monitoring: Automatically spot discounts and flash sales.

Competitive Analysis: Collect SKU, category, and stock data to benchmark rivals.

Trend Tracking: Follow bestseller movements and review counts over time.

 

Why You Need Proxies: No More Request Hassle

Repeated requests from a single IP get blocked. OkeyProxy’s rotating residential proxies keep you under the radar.
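If you want proof the rotation works, a quick sketch like the one below should print a different exit IP on each request. It uses httpbin.org/ip as a neutral echo service, and the credentials and host are placeholders—substitute your own OkeyProxy details.

python

import requests

# Placeholder credentials and host; use your own OkeyProxy details
PROXY = "http://USER:[email protected]:10000"

# A rotating pool should report a different exit IP each time
for _ in range(3):
    ip = requests.get("https://httpbin.org/ip",
                      proxies={"http": PROXY, "https": PROXY},
                      timeout=10).json()
    print(ip)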

Prerequisites: Get Your Environment Ready

1. Install Python 3.8+

Download from python.org.

2. Create & Activate a Virtual Environment

Keeps dependencies tidy:

bash

python -m venv venv

# macOS/Linux
source venv/bin/activate

# Windows
venv\Scripts\activate

3. Install Key Libraries

bash

pip install requests beautifulsoup4 pandas "httpx[http2]" parsel

  • requests: Beginner-friendly HTTP client
  • beautifulsoup4: HTML parsing
  • pandas: Data storage/export
  • httpx + parsel: HTTP/2 support (hence the [http2] extra) and structured parsing
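To confirm everything installed cleanly, a one-off import check like this should print a success message; any ImportError names the missing package.

python

# Run once after installation; an ImportError points to the missing package.
import requests, bs4, pandas, httpx, parsel
print("All scraping libraries imported successfully.")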

Pro Tip: Use VS Code or PyCharm with Python linting to catch syntax issues early.

Inspecting Walmart Pages: Find the JSON Path

Understanding Walmart’s page layout is key to extracting data efficiently. Walmart uses Next.js, embedding much of its product info in JSON within the HTML:

1. Open any Walmart product page in your browser.

2. Right-click → View Page Source.

3. Search for <script id="__NEXT_DATA__">.

4. Inside, you’ll find the product data at this JSON path:

data["props"]["pageProps"]["initialData"]["data"]["product"]

  • This holds all core fields: name, price, availability, etc.

Beginner Note: If that JSON isn’t there, you’ll use Method B (HTML fallback).
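To explore that JSON before writing the full scraper, you can run a short sketch like this against a saved copy of the page source (product.html is just an example file name, saved via View Page Source):

python

import json
from parsel import Selector

# Assumes you saved a product page (View Page Source) as product.html
html = open("product.html", encoding="utf-8").read()
raw = Selector(html).xpath('//script[@id="__NEXT_DATA__"]/text()').get()
product = json.loads(raw)["props"]["pageProps"]["initialData"]["data"]["product"]
print(sorted(product.keys()))  # list the available product fields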

Method A: JSON Extraction (Preferred)

This method grabs clean, structured data in one go, with no HTML-parsing hassle.

python

import json, time
import httpx
from parsel import Selector

# Configure OkeyProxy and headers -- replace USER:PASS with your credentials
PROXY_URL = "http://USER:[email protected]:10000"
OKEY_PROXY = {"http": PROXY_URL, "https": PROXY_URL}  # requests-style mapping, used in Method B

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
    )
}

def fetch_product_json(url, retries=3):
    for attempt in range(1, retries + 1):
        try:
            # "proxy=" requires httpx >= 0.26; older releases used "proxies="
            with httpx.Client(proxy=PROXY_URL, headers=HEADERS, http2=True, timeout=10) as client:
                r = client.get(url)
                r.raise_for_status()
            # Extract the embedded JSON from the Next.js script tag
            raw = Selector(r.text).xpath('//script[@id="__NEXT_DATA__"]/text()').get()
            if raw is None:
                raise ValueError("No __NEXT_DATA__ script found; fall back to Method B")
            p = json.loads(raw)["props"]["pageProps"]["initialData"]["data"]["product"]
            return {
                "id": p["itemId"],
                "name": p["productName"],
                "price": p["offers"]["buyBox"]["price"]["value"],
                "availability": p.get("availabilityStatus", "Unknown"),
            }
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
            time.sleep(2 ** attempt)  # exponential backoff: 2 s, 4 s, 8 s
    raise RuntimeError("All JSON extraction attempts failed.")

# Test it out
print(fetch_product_json("https://www.walmart.com/ip/Apple-AirPods-Pro/520468661"))

Tip for Beginners: Just replace USER:PASS with your credentials, then run the script to see your first record!

Method B: HTML Parsing (Fallback)

Use this only if Method A fails or you need extra fields not in JSON.

python

import requests
from bs4 import BeautifulSoup

def fetch_product_html(url):
    # Reuses HEADERS and OKEY_PROXY from Method A
    resp = requests.get(url, headers=HEADERS, proxies=OKEY_PROXY, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Guard each lookup: selectors break whenever Walmart changes its markup
    name_el = soup.find("h1", itemprop="name")
    price_el = soup.find("span", itemprop="price")
    rating_el = soup.select_one(".stars-container .visuallyhidden")
    return {
        "name": name_el.get_text(strip=True) if name_el else "N/A",
        "price": price_el.get_text(strip=True) if price_el else "N/A",
        "rating": rating_el.get_text(strip=True) if rating_el else "N/A",
    }

# Quick check
print(fetch_product_html("https://www.walmart.com/ip/SAMSUNG-58-Class-4K-TV/820835173"))

Pro Tip: Inspect elements (F12) to update CSS selectors when Walmart’s HTML changes.

Integrating OkeyProxy for Reliability

1. Sign up at OkeyProxy → choose a rotating residential proxy plan → receive your proxy host, port, and credentials.

2. Expect 1–2 sec latency per request—rotation happens automatically.

3. Detect Blocks/CAPTCHAs:

python

# Drop this into your fetch function, right after r = client.get(url)
if "Robot or human" in r.text or r.status_code in (403, 429):
    raise RuntimeError("Blocked or CAPTCHA encountered.")

Test Your Proxy: Run a few manual requests and confirm no “Robot or human” text appears.
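Putting those checks together, a small smoke test like the sketch below (reusing HEADERS and OKEY_PROXY from Method A) confirms the proxy works before a full run:

python

import requests

def proxy_smoke_test(url="https://www.walmart.com"):
    # HEADERS and OKEY_PROXY are the Method A settings
    r = requests.get(url, headers=HEADERS, proxies=OKEY_PROXY, timeout=10)
    if "Robot or human" in r.text or r.status_code in (403, 429):
        print("Blocked: rotate credentials or slow down.")
    else:
        print(f"OK: status {r.status_code}, {len(r.text)} bytes received.")

proxy_smoke_test()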

Saving & Verifying Your Data

python

import random, time
import pandas as pd

urls = [
    "https://www.walmart.com/ip/Apple-AirPods-Pro/520468661",
    "https://www.walmart.com/ip/SAMSUNG-58-Class-4K-TV/820835173",
    # add more URLs…
]

records = []
for url in urls:
    try:
        rec = fetch_product_json(url)   # Method A first
    except Exception:
        rec = fetch_product_html(url)   # fall back to Method B
    records.append(rec)
    time.sleep(2 + random.random())     # simple rate limit: ~1 request every 2-3 s

df = pd.DataFrame(records)
df.to_csv("walmart_data.csv", index=False)
print("Data written to walmart_data.csv")

Beginner Tip: Open walmart_data.csv in Excel or Google Sheets to confirm your columns.
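You can also verify the export programmatically; a short pandas check does the trick:

python

import pandas as pd

# Spot-check the export: first rows plus the column list
check = pd.read_csv("walmart_data.csv")
print(check.head())
print("Columns:", list(check.columns))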

Scaling Up: Concurrency & Scheduling

Concurrency (Advanced)

python

import asyncio
import httpx

async def fetch_async(url):
    # PROXY_URL and HEADERS come from Method A; "proxy=" requires httpx >= 0.26
    async with httpx.AsyncClient(proxy=PROXY_URL, headers=HEADERS, timeout=10) as client:
        r = await client.get(url)
        # parse JSON or HTML here...
        return r.text

async def main(urls):
    results = await asyncio.gather(*(fetch_async(u) for u in urls))
    print(f"Fetched {len(results)} pages.")

asyncio.run(main(urls))
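An unbounded gather can overwhelm both your proxy plan and Walmart. One common refinement (a sketch, not part of the script above) is to cap in-flight requests with a semaphore and share a single client:

python

import asyncio
import httpx

SEM = asyncio.Semaphore(5)  # at most 5 requests in flight at once

async def fetch_limited(client, url):
    async with SEM:
        r = await client.get(url)
        return r.text

async def main_limited(urls):
    # PROXY_URL and HEADERS are the Method A settings
    async with httpx.AsyncClient(proxy=PROXY_URL, headers=HEADERS, timeout=10) as client:
        return await asyncio.gather(*(fetch_limited(client, u) for u in urls))

pages = asyncio.run(main_limited(urls))
print(f"Fetched {len(pages)} pages with capped concurrency.")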

Scheduling with Cron (Example)

To run daily at 2 AM, add to your crontab (crontab -e):

cron

0 2 * * * /usr/bin/python3 /path/to/your/script.py >> /path/to/logfile.log 2>&1

  • Task Scheduler (Windows): Use the GUI to schedule python C:\path\to\script.py.
  • Cloud: AWS Lambda + EventBridge or GCP Cloud Scheduler with Pub/Sub trigger.

Best Practices & Compliance Checklist

Check Walmart’s robots.txt for disallowed paths.

Respect Rate Limits: Aim for ≤ 1 req/sec to reduce blocks.

Log Errors: Capture HTTP status codes (403, 429) and exception traces (see the logging sketch after this list).

Review TOS: Scraping may violate Walmart’s Terms of Service—consult legal counsel for commercial use.

Data Ethics: Avoid collecting personal or sensitive data.
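For the error-logging point above, a minimal setup might look like this; the file name and message format are just examples, and fetch_product_json comes from Method A:

python

import logging

# Write errors to a file so unattended runs can be audited later
logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

url = "https://www.walmart.com/ip/Apple-AirPods-Pro/520468661"
try:
    rec = fetch_product_json(url)  # from Method A
    logging.info("Scraped %s OK", url)
except Exception:
    logging.exception("Extraction failed for %s", url)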

Common Concerns Addressed

“Will I Get Blocked?”

With OkeyProxy’s rotating pool, realistic headers, and backoff retries, block risks drop dramatically. Always start with small batches.

“Is It Legal?”

Scraping publicly accessible product data is generally acceptable—but check robots.txt and seek legal advice for large-scale operations.

“Too Hard for Beginners?”

Follow each numbered step. Copy-paste the code, replace placeholders, and run. Pros can adjust or add more fields as needed.

Conclusion

Scraping Walmart data is a valuable skill that’s within reach—whether you’re a novice or a veteran developer. By combining JSON extraction, HTML fallback, and OkeyProxy’s rotating residential proxies, you’ll build a reliable, scalable scraper. Start small, refine your code, and scale up with confidence. Happy scraping!

Ready to supercharge your scraper? Register and try OkeyProxy free today, and never worry about blocks again!