Scrape Jobs from the Internet with OkeyProxy
Whether you’re building a niche job board, tracking hiring trends, or generating sales leads, scraping job listings is an invaluable skill. However, public websites often enforce rate limits, geo-restrictions, and CAPTCHAs to block heavy traffic. By integrating rotating residential proxies, you can seamlessly overcome these barriers, capture up-to-date job data, and scale your operation without interruptions.

Why Use Proxies for Job Scraping?
Avoid IP Bans: Rotating through OkeyProxy’s large IP pool prevents sites from throttling or blocking your requests.
Geo-Targeted Results: Some boards show region-specific vacancies. Switch proxy locations (e.g., US, UK, SG) to gather localized listings.
Higher Throughput: Distribute requests across many IPs for faster data collection.
Reduced Bot Detection: Residential IPs mimic real users, minimizing CAPTCHAs and anti-bot triggers.
User Scenarios & Key Concerns
When scraping jobs, users commonly need to:
1. Multi-Site Aggregation: Combine listings from major boards (Indeed, LinkedIn) and company career pages.
2. Data Freshness: Schedule hourly or daily runs to catch newly posted jobs.
3. Regional Filtering: Target remote roles or specific cities by rotating proxy regions.
4. Dynamic Content: Handle JavaScript-rendered pages that load listings asynchronously.
5. Budget vs. Complexity: Balance OkeyProxy usage costs against development time and maintenance overhead.
Overview of Approaches
| Method | Pros | Cons |
| --- | --- | --- |
| No-Code Tools | GUI-driven, quick setup | Limited customization, CAPTCHA hurdles |
| Managed APIs | Auto-handles JavaScript & bans | Higher per-request fees |
| In-House Scrapers | Full control, extensible pipelines | Requires dev resources, ongoing maintenance |
In this guide, we build an in-house Python scraper powered by OkeyProxy’s rotating residential proxies—combining flexibility with cost-efficiency.
Step-By-Step: Build Your Job Scraper
Step 1. Prepare Your Environment
1. Install Python & Create Virtual Environment
```bash
# Ensure you have Python 3.8+
python3 --version

# Create and activate a virtual environment
python3 -m venv env
source env/bin/activate   # macOS/Linux
.\env\Scripts\activate    # Windows PowerShell
```
2. Install Core Libraries
- requests for HTTP calls
- beautifulsoup4 for HTML parsing
- pyppeteer for JavaScript‐rendered pages (optional)
- ratelimit for polite request throttling
```bash
pip install requests beautifulsoup4 pyppeteer ratelimit
```
3. Store Your OkeyProxy Credentials Securely
Create a file .env (add to .gitignore) with:
```ini
OKP_USER=your_user
OKP_PASS=your_pass
OKP_HOST=proxy.okeyproxy.com
OKP_PORT=8000
```
Load environment variables in Python using python-dotenv (install with pip install python-dotenv).
Step 2. Configure & Verify Rotating Proxies
1. Helper Function to Build Proxy Dict
```python
import os
from dotenv import load_dotenv

load_dotenv()

USER = os.getenv("OKP_USER")
PASS = os.getenv("OKP_PASS")
HOST = os.getenv("OKP_HOST")
PORT = os.getenv("OKP_PORT")

def get_proxy(region_code=None):
    """
    Returns a proxies dict for requests. region_code like 'us', 'de', 'sg'.
    """
    # Prefix the host with a region code for geo-targeted exits, e.g. us.proxy.okeyproxy.com
    host = f"{region_code}.{HOST}" if region_code else HOST
    proxy_url = f"http://{USER}:{PASS}@{host}:{PORT}"
    return {"http": proxy_url, "https": proxy_url}
```
2. Verify Connectivity
```python
import requests

try:
    resp = requests.get("https://httpbin.org/ip", proxies=get_proxy(), timeout=5)
    print("Connected via proxy, your IP:", resp.json()["origin"])
except Exception as e:
    print("Proxy connection failed:", e)
```
Step 3. Crawl & Discover Target Pages
Static List vs. Sitemap vs. Search API
- Static list: Hard-code known job URLs for MVP.
- Sitemap parsing: Use requests + xml.etree.ElementTree to extract <loc> entries.
- SERP API: Query Google/Bing to find new job-board pages on the fly.
Example: Extract URLs from Sitemap
```python
import xml.etree.ElementTree as ET

def fetch_sitemap_urls(sitemap_url):
    resp = requests.get(sitemap_url, proxies=get_proxy(), timeout=10)
    root = ET.fromstring(resp.content)
    return [elem.text for elem in root.iter("{http://www.sitemaps.org/schemas/sitemap/0.9}loc")]
```
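As a quick usage sketch, you can seed the crawl from a board's sitemap and keep only job-detail pages. The sitemap URL and the /jobs/ path filter below are hypothetical placeholders:

```python
# Hypothetical sitemap and URL pattern; adjust to the board you target
sitemap_urls = fetch_sitemap_urls("https://example-jobs.com/sitemap.xml")
list_of_urls = [u for u in sitemap_urls if "/jobs/" in u]
print(f"Discovered {len(list_of_urls)} candidate job pages")
```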
Step 4. Extract Job Data from Pages
1. Polite Fetching with Rate Limiting
```python
import asyncio
from ratelimit import limits, sleep_and_retry

@sleep_and_retry
@limits(calls=1, period=1)  # at most one request per second
def safe_get(url, render=False):
    if render:
        # fetch_rendered (Pyppeteer) is defined in the next subsection
        html = asyncio.run(fetch_rendered(url))
    else:
        r = requests.get(url, proxies=get_proxy(), timeout=10)
        r.raise_for_status()
        html = r.text
    return html
```
2. HTML Parsing (BeautifulSoup)
```python
from bs4 import BeautifulSoup

def parse_jobs(html):
    soup = BeautifulSoup(html, "html.parser")
    records = []
    # Selectors below are examples; adjust them to the markup of your target board
    for card in soup.select("div.job-card"):
        records.append({
            "title": card.select_one("h2.title").get_text(strip=True),
            "company": card.select_one("div.company").get_text(strip=True),
            "location": card.select_one("div.location").get_text(strip=True),
            "date": card.select_one("time")["datetime"],
        })
    return records
```
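Putting the two together, a single listing page can be fetched and parsed in one pass; the URL below is a hypothetical placeholder:

```python
# Hypothetical listing URL; in practice, iterate over the URLs discovered in Step 3
html = safe_get("https://example-jobs.com/jobs?page=1")
jobs = parse_jobs(html)
print(f"Parsed {len(jobs)} listings from the page")
```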
3. JS-Rendered Pages (Pyppeteer)
```python
import asyncio
from pyppeteer import launch

async def fetch_rendered(url):
    # Route the headless browser through the same OkeyProxy gateway
    browser = await launch(
        headless=True,
        args=["--no-sandbox", f"--proxy-server={HOST}:{PORT}"],
    )
    page = await browser.newPage()
    # Supply the proxy credentials to Chromium
    await page.authenticate({"username": USER, "password": PASS})
    await page.goto(url, timeout=60000)
    await page.waitForSelector("div.job-card")
    content = await page.content()
    await browser.close()
    return content
```
Step 5. Scale with Concurrency
1. Thread-Based Concurrency (Beginner):
```python
from concurrent.futures import ThreadPoolExecutor

def scrape_and_parse(url):
    html = safe_get(url)
    return parse_jobs(html)

with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(scrape_and_parse, list_of_urls)
    all_jobs = [job for sublist in results for job in sublist]
```
2. AsyncIO + aiohttp (Advanced): Non-blocking fetches for higher throughput, which you can combine with JS rendering when needed.
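Below is a minimal asyncio + aiohttp sketch (install with pip install aiohttp), assuming the get_proxy() and parse_jobs() helpers from earlier steps. Note that aiohttp takes a single proxy URL via its proxy argument rather than a requests-style dict, and a semaphore caps concurrency:

```python
import asyncio
import aiohttp

async def fetch(session, url, sem):
    # get_proxy() returns {"http": ..., "https": ...}; aiohttp wants one proxy URL
    proxy_url = get_proxy()["http"]
    async with sem:
        async with session.get(url, proxy=proxy_url,
                               timeout=aiohttp.ClientTimeout(total=10)) as resp:
            resp.raise_for_status()
            return await resp.text()

async def scrape_all(urls, max_concurrency=5):
    sem = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u, sem) for u in urls))
    return [job for html in pages for job in parse_jobs(html)]

# all_jobs = asyncio.run(scrape_all(list_of_urls))
```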
Step 6. Clean, Dedupe & Store Data
1. Normalize Fields
Trim whitespace, unify date formats (YYYY-MM-DD), standardize location strings.
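A minimal normalization pass might look like this; it assumes the date field holds an ISO-like datetime string (as produced by the time element's datetime attribute), so adjust the parsing to whatever formats your sources actually emit:

```python
from datetime import datetime

def normalize(job):
    # Trim whitespace on all string fields
    job = {k: v.strip() if isinstance(v, str) else v for k, v in job.items()}
    # Unify dates to YYYY-MM-DD (assumes ISO-like input such as 2024-05-01T09:30:00)
    try:
        job["date"] = datetime.fromisoformat(job["date"]).date().isoformat()
    except (KeyError, ValueError):
        pass  # leave unparseable dates untouched
    # Collapse repeated whitespace in location strings
    job["location"] = " ".join(job.get("location", "").split())
    return job

all_jobs = [normalize(j) for j in all_jobs]
```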
2. Deduplication
```python
seen = set()
unique_jobs = []
for job in all_jobs:
    key = (job["title"], job["company"], job["location"])
    if key not in seen:
        seen.add(key)
        unique_jobs.append(job)
```
3. Export to CSV / JSON
```python
import json, csv

# CSV
with open("jobs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=unique_jobs[0].keys())
    writer.writeheader()
    writer.writerows(unique_jobs)

# JSON
with open("jobs.json", "w", encoding="utf-8") as f:
    json.dump(unique_jobs, f, ensure_ascii=False, indent=2)
```
Step 7. Automate & Maintain
Scheduling: Use cron (Linux/macOS) or Task Scheduler (Windows) to run your script hourly or daily.
Health Checks: Alert when the error rate exceeds 10% or proxy failures spike (a simple tracker is sketched after this list).
Selector Validation: Periodic test runs to detect site layout changes.
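One lightweight approach is to count failures during a run and flag the run when the error rate crosses your threshold. A minimal sketch reusing scrape_and_parse() from Step 5, with a placeholder alert (swap in email, Slack, or your monitoring stack):

```python
failures, total = 0, 0
all_jobs = []

for url in list_of_urls:
    total += 1
    try:
        all_jobs.extend(scrape_and_parse(url))
    except Exception:
        failures += 1

error_rate = failures / total if total else 0.0
if error_rate > 0.10:
    # Placeholder alert: replace with email, Slack, or a metric pushed to your monitoring stack
    print(f"ALERT: error rate {error_rate:.0%} ({failures}/{total} requests failed)")
```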
Advanced Tips & Best Practices
Geo-Rotation: Pass region codes (e.g., "us", "de") to get_proxy() for localized data.
Exponential Backoff: On HTTP 429 or 5xx responses, retry after increasing delays (see the sketch after this list).
Containerization: Dockerize your scraper for consistent deployments.
Dashboarding: Store data in a database (PostgreSQL, MongoDB) and visualize it using Grafana or Metabase.
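One way to layer exponential backoff on top of safe_get() is sketched below; it assumes requests' HTTPError carries the response (so the status code can be inspected) and treats 429 and 5xx as retryable. Tune the retry cap, base delay, and jitter to your needs.

```python
import random
import time
import requests

def get_with_backoff(url, max_retries=4, base_delay=1.0):
    """Retry transient failures (429 / 5xx / network errors) with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return safe_get(url)
        except requests.HTTPError as e:
            status = e.response.status_code if e.response is not None else None
            if status not in (429, 500, 502, 503, 504):
                raise  # non-transient HTTP error: give up immediately
        except requests.RequestException:
            pass  # timeouts, connection resets, etc. are worth retrying
        # Exponential backoff with a little jitter: ~1s, 2s, 4s, 8s
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts")
```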
Legal & Ethical Considerations
robots.txt & ToS: Always check and honor each site’s crawling directives.
Privacy Laws: Avoid collecting personal identifiers; focus on publicly posted job details.
Rate Limits: Keep request rates reasonable—1–2 requests per second per endpoint as a starting point.
Conclusion
By following these steps—environment setup, proxy configuration, polite crawling, HTML/JS parsing, concurrency, and data storage—you’ll build a robust job-scraping pipeline that scales with OkeyProxy. Start small, iterate on your selectors and proxy rotations, then expand to capture comprehensive, up-to-date job postings across the web.
Ready to start? Sign up and get your free trial of rotating residential proxies today!








