How to Scrape Walmart Data with OkeyProxy: A Clear, Step-by-Step Guide
Scraping Walmart’s catalog unlocks powerful insights—price tracking, product research, trend analysis—but also comes with anti‑bot hurdles like CAPTCHAs, rate limits, and IP churn. This guide walks you through every step, with two extraction methods and full OkeyProxy integration, so you can copy-paste and run, then customize, optimize, and scale.

Why Scrape Walmart?
Scraping Walmart provides actionable insights that power business and research decisions. Here are the top reasons why users want this data:
Price Monitoring: Automatically spot discounts and flash sales.
Competitive Analysis: Collect SKU, category, stock data to benchmark rivals.
Trend Tracking: Follow bestseller movements and review counts over time.
Why You Need Proxies: No More Request Hassle
Repeated requests from a single IP quickly get rate-limited or blocked outright. OkeyProxy's rotating residential proxies spread your traffic across many real-user IPs, keeping your scraper under the radar.
Prerequisites: Get Your Environment Ready
1. Install Python 3.8+
Download from python.org.
2. Create & Activate a Virtual Environment
Keeps dependencies tidy:
```bash
python -m venv venv
# macOS/Linux
source venv/bin/activate
# Windows
venv\Scripts\activate
```
3. Install Key Libraries
```bash
pip install requests beautifulsoup4 pandas "httpx[http2]" parsel
```
- requests: Beginner-friendly HTTP client
- beautifulsoup4: HTML parsing
- pandas: Data storage/export
- httpx + parsel: HTTP/2-capable client & structured parsing (the [http2] extra pulls in the h2 package needed for the http2=True option used below)
Pro Tip: Use VS Code or PyCharm with Python linting to catch syntax issues early.
Inspecting Walmart Pages: Find the JSON Path
Understanding Walmart’s page layout is key to extracting data efficiently. Walmart uses Next.js, embedding much of its product info in JSON within the HTML:
1. Open any Walmart product page in your browser.
2. Right-click → View Page Source.
3. Search for <script id="__NEXT_DATA__">.
4. Inside it, the product data lives at the JSON path data["props"]["pageProps"]["initialData"]["data"]["product"].
- This node holds all the core fields: name, price, availability, etc.
Beginner Note: If that JSON isn’t there, you’ll use Method B (HTML fallback).
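To get a feel for the structure before writing the scraper, you can dump the embedded JSON offline. Here's a minimal exploration sketch, assuming you've saved a product page's source locally as page.html (the filename is arbitrary):
```python
import json, re

# Load a locally saved copy of the page source (View Page Source, save as page.html)
html = open("page.html", encoding="utf-8").read()

# Pull out the Next.js data blob
match = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.S)
data = json.loads(match.group(1))

# Walk down to the product node and list its top-level keys
product = data["props"]["pageProps"]["initialData"]["data"]["product"]
print(json.dumps(sorted(product.keys()), indent=2))
```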
Method A: JSON Extraction (Preferred)
This method grabs clean, structured data in one go, with zero HTML parsing hassle.
```python
import json, time
import httpx
from parsel import Selector

# OkeyProxy endpoint: replace USER:PASS with your credentials
PROXY_URL = "http://USER:[email protected]:10000"

# requests-style proxy mapping, reused by Method B below
OKEY_PROXY = {
    "http": PROXY_URL,
    "https": PROXY_URL,
}

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
    )
}

def fetch_product_json(url, retries=3):
    for attempt in range(1, retries + 1):
        try:
            # httpx >= 0.26 takes a single proxy URL (older releases used a proxies mapping)
            with httpx.Client(proxy=PROXY_URL, headers=HEADERS, http2=True, timeout=10) as client:
                r = client.get(url)
            r.raise_for_status()
            # Extract the JSON embedded in the Next.js script tag
            sel = Selector(r.text)
            raw = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
            data = json.loads(raw)
            p = data["props"]["pageProps"]["initialData"]["data"]["product"]
            return {
                "id": p["itemId"],
                "name": p["productName"],
                "price": p["offers"]["buyBox"]["price"]["value"],
                "availability": p.get("availabilityStatus", "Unknown"),
            }
        except Exception as e:
            print(f"Attempt {attempt} failed: {e}")
            time.sleep(2 ** attempt)  # exponential backoff
    raise RuntimeError("All JSON extraction attempts failed.")

# Test it out
print(fetch_product_json("https://www.walmart.com/ip/Apple-AirPods-Pro/520468661"))
```
Tip for Beginners: Just replace USER:PASS with your credentials, then run the script to see your first record!
Method B: HTML Parsing (Fallback)
Use this only if Method A fails or you need extra fields not in JSON.
```python
import requests
from bs4 import BeautifulSoup

# Reuses OKEY_PROXY and HEADERS from Method A
def fetch_product_html(url):
    resp = requests.get(url, headers=HEADERS, proxies=OKEY_PROXY, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    name_el = soup.find("h1", itemprop="name")
    price_el = soup.find("span", itemprop="price")
    if name_el is None or price_el is None:
        raise RuntimeError("Selectors matched nothing; Walmart's HTML may have changed.")
    rating_el = soup.select_one(".stars-container .visuallyhidden")
    rating = rating_el.get_text(strip=True) if rating_el else "N/A"
    return {
        "name": name_el.get_text(strip=True),
        "price": price_el.get_text(strip=True),
        "rating": rating,
    }

# Quick check
print(fetch_product_html("https://www.walmart.com/ip/SAMSUNG-58-Class-4K-TV/820835173"))
```
Pro Tip: Inspect elements (F12) to update CSS selectors when Walmart’s HTML changes.
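When a selector stops matching, a quick smoke test finds a replacement faster than rerunning the full scraper. A sketch, reusing HEADERS and OKEY_PROXY from above (the url is whichever product page you're debugging):
```python
# Fetch the page once, then try each candidate selector against it
url = "https://www.walmart.com/ip/SAMSUNG-58-Class-4K-TV/820835173"
resp = requests.get(url, headers=HEADERS, proxies=OKEY_PROXY, timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")
for css in ['h1[itemprop="name"]', 'span[itemprop="price"]']:
    el = soup.select_one(css)
    print(css, "->", el.get_text(strip=True) if el else "NOT FOUND")
```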
Integrating OkeyProxy for Reliability
1. Sign Up at OkeyProxy → Choose a rotating residential proxy plan → Receive your proxy host, port, and credentials.
2. Expect 1–2 sec latency per request—rotation happens automatically.
3. Detect Blocks/CAPTCHAs:
```python
if "Robot or human" in r.text or r.status_code in (403, 429):
    raise RuntimeError("Blocked or CAPTCHA encountered.")
```
Test Your Proxy: Run a few manual requests and confirm no “Robot or human” text appears.
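A minimal check, assuming your plan rotates IPs per request (httpbin.org/ip is just a convenient echo service; any IP-reporting endpoint works):
```python
import httpx

# Each request through a per-request rotating proxy should report a different exit IP
for _ in range(3):
    with httpx.Client(proxy=PROXY_URL, timeout=10) as client:
        print(client.get("https://httpbin.org/ip").json()["origin"])
```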
Saving & Verifying Your Data
```python
import pandas as pd
import time, random

urls = [
    "https://www.walmart.com/ip/Apple-AirPods-Pro/520468661",
    "https://www.walmart.com/ip/SAMSUNG-58-Class-4K-TV/820835173",
    # add more URLs…
]

records = []
for url in urls:
    try:
        rec = fetch_product_json(url)  # Method A first
    except Exception:
        rec = fetch_product_html(url)  # fall back to Method B
    records.append(rec)
    time.sleep(2 + random.random())  # simple rate limit

df = pd.DataFrame(records)
df.to_csv("walmart_data.csv", index=False)
print("Data written to walmart_data.csv")
```
Beginner Tip: Open walmart_data.csv in Excel or Google Sheets to confirm your columns.
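If you'd rather verify from Python, a quick pandas sanity check does the same job:
```python
import pandas as pd

# Reload the CSV and eyeball the result
df = pd.read_csv("walmart_data.csv")
print(df.head())        # first few rows
print(df.isna().sum())  # missing values per column
```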
Scaling Up: Concurrency & Scheduling
Concurrency (Advanced)
```python
import asyncio, httpx

async def fetch_async(url):
    # httpx >= 0.26 takes a single proxy URL; reuses PROXY_URL and HEADERS from Method A
    async with httpx.AsyncClient(proxy=PROXY_URL, headers=HEADERS, timeout=10) as client:
        r = await client.get(url)
        # parse JSON or HTML here...
        return r.text

async def main(urls):
    results = await asyncio.gather(*(fetch_async(u) for u in urls))
    print(f"Fetched {len(results)} pages.")

asyncio.run(main(urls))
```
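asyncio.gather with no cap fires every request at once, which invites blocks even through a proxy pool. A sketch of bounding concurrency with asyncio.Semaphore (the limit of 5 is an arbitrary starting point, not a recommendation):
```python
import asyncio, httpx

SEM = asyncio.Semaphore(5)  # at most 5 requests in flight at a time

async def fetch_bounded(url):
    async with SEM:
        async with httpx.AsyncClient(proxy=PROXY_URL, headers=HEADERS, timeout=10) as client:
            r = await client.get(url)
            return r.text
```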
Scheduling with Cron (Example)
To run daily at 2 AM, add to your crontab (crontab -e):
```cron
0 2 * * * /usr/bin/python3 /path/to/your/script.py >> /path/to/logfile.log 2>&1
```
- Task Scheduler (Windows): Use the GUI to schedule python C:\path\to\script.py.
- Cloud: AWS Lambda + EventBridge or GCP Cloud Scheduler with Pub/Sub trigger.
Best Practices & Compliance Checklist
Check Walmart robots.txt for disallowed paths.
Respect Rate Limits: Aim for ≤ 1 req/sec to reduce blocks.
Log Errors: Capture HTTP status codes (403, 429) and full exception traces; a minimal logging sketch follows this checklist.
Review TOS: Scraping may violate Walmart’s Terms of Service—consult legal counsel for commercial use.
Data Ethics: Avoid collecting personal or sensitive data.
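For the logging item above, Python's standard logging module is enough. A minimal sketch; fetch_logged is a hypothetical wrapper name, not part of the earlier scripts:
```python
import logging
import httpx

logging.basicConfig(
    filename="scraper.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

# Hypothetical wrapper around the Method A fetcher
def fetch_logged(url):
    try:
        return fetch_product_json(url)
    except httpx.HTTPStatusError as e:
        # raise_for_status() raises this with the response attached
        logging.error("HTTP %s on %s", e.response.status_code, url)
        raise
    except Exception:
        logging.exception("Unexpected failure on %s", url)
        raise
```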
Common Concerns Addressed
“Will I Get Blocked?”
With OkeyProxy’s rotating pool, realistic headers, and backoff retries, block risks drop dramatically. Always start with small batches.
“Is It Legal?”
Scraping publicly accessible product data is generally acceptable—but check robots.txt and seek legal advice for large-scale operations.
“Too Hard for Beginners?”
Follow each numbered step. Copy-paste the code, replace placeholders, and run. Pros can adjust or add more fields as needed.
Conclusion
Scraping Walmart data is a valuable skill that’s within reach—whether you’re a novice or a veteran developer. By combining JSON extraction, HTML fallback, and OkeyProxy’s rotating residential proxies, you’ll build a reliable, scalable scraper. Start small, refine your code, and scale up with confidence. Happy scraping!
Ready to supercharge your scraper? Register and try OkeyProxy free today, and never worry about blocks again!