How to Scrape Twitter (X) Data in 2025 for AI Model Training: Step-by-Step Guide

Twitter, now known as X, offers a rich source of real-world, user-generated text data that reflects how people communicate across different topics, cultures, and languages. 

With millions of short-form posts created daily, the platform provides a dynamic stream of content ideal for training AI models in natural language processing (NLP), sentiment analysis, conversational modeling, and event detection. Tweets often include informal language, slang, emojis, and hashtags, making them particularly useful for building models that need to understand contemporary, informal, and context-sensitive communication. 

In addition, metadata such as timestamps, user interactions, and conversation threads provide valuable context that can enhance model performance.

For these reasons, scraping Twitter data can significantly benefit AI research and application development.

This tutorial provides a comprehensive, step-by-step guide to scraping X data.

Why This Matters in 2025

The explosion of generative AI and large language models (LLMs) has made high-quality, diverse datasets more valuable than ever. X's real-time, conversational data is ideal for:

 ● AI Model Training: Fine-tuning LLMs for natural language understanding, dialogue generation, or domain-specific tasks.

 ● Sentiment Analysis: Gauging public opinion on brands, products, or events.

 ● Market Intelligence: Tracking trends, competitor strategies, or consumer preferences.

 ● Research: Analyzing social movements, political discourse, or cultural shifts.

With X's evolving platform and stricter API limits, scraping remains a viable alternative for accessing public data, provided it’s done ethically and in compliance with terms of service.

Tools and Setup

To scrape X data, you’ll need the following tools:

 ● Python 3.11+: The backbone of our script.

 ● Libraries:

 ○ requests: For HTTP requests.

 ○ BeautifulSoup: For parsing HTML.

 ○ pandas: For data structuring and export.

 ○ time and random: For adding delays to avoid rate limits.

 ○ fake-useragent: For rotating user agents.

 ○ OkeyProxy (or similar proxy service): For IP rotation to bypass blocks.

 ● Browser Developer Tools: To inspect X’s HTML structure.

 ● Optional: A virtual environment to keep dependencies isolated.

Installation Commands

bash

pip install requests beautifulsoup4 pandas fake-useragent

Proxy Setup with OkeyProxy

 1.  Sign up for an account at OkeyProxy Rotating Residential Proxies and obtain your API key or proxy list.

 2.  Configure proxies in your script (example provided in the code walkthrough).

 3.  Verify proxy connectivity: 

python

import requests

proxies = {"http": "http://your_proxy:port", "https": "http://your_proxy:port"}

response = requests.get("http://ipinfo.io/ip", proxies=proxies)

print(response.text) # Should return proxy IP

Setup Verification

 ● Ensure Python and libraries are installed: pip list.

 ● Test internet connectivity and proxy setup.

 ● Open X in a browser and confirm you can access public posts without authentication.

Technical Deep Dive

X’s front-end in 2025 is a mix of static HTML and JavaScript-rendered content, making scraping a blend of traditional HTML parsing and handling dynamic elements. 

Key points:

 ● HTML Structure: Posts are typically nested in <article> tags with specific classes (e.g., css-1dbjc4n for post containers).

 ● Dynamic Loading: X uses infinite scroll, loading new posts via JavaScript as users scroll.

 ● Key Selectors: 

 ○ Post text: Often found in <div> or <span> elements with classes like css-901oao.

 ○ Username: Usually in <a> tags with user handle data.

 ○ Timestamp: Found in <time> tags or similar.

 ● Challenges: Rate limits, CAPTCHAs, and frequent DOM changes require robust error handling and selector updates.

Finding Selectors

 1.  Open X in Chrome or Firefox.

 2.  Right-click a post and select “Inspect” to open Developer Tools.

 3.  Identify the parent <article> tag and child elements for text, username, etc.

 4.  Note class names or attributes (e.g., data-testid="tweet").

 5.  Test selectors using BeautifulSoup or browser console (e.g., document.querySelectorAll('article')).
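As a quick sanity check for step 5, here is a minimal sketch that fetches a search page and counts matching post containers. It uses the selectors discussed in this guide, which may need updating, and a plain request without JavaScript rendering may return few or no posts:

python

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# Fetch a search page and count candidate post containers
headers = {"User-Agent": UserAgent().random}
html = requests.get("https://x.com/search?q=AI&src=typed_query&f=live",
                    headers=headers, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

articles = soup.select('article[data-testid="tweet"]')
print(f"Found {len(articles)} post containers")
if articles:
    # Confirm the child text selector still matches
    text_el = articles[0].select_one('div.css-901oao')
    print(text_el.get_text(strip=True) if text_el else "Text selector needs updating")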

Code Walkthrough

Below is a Python script to scrape public X posts based on a search query. It includes user-agent rotation, proxy integration, pagination, and error handling.

python

# x_scraper.py
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import pandas as pd
import time
import random
import logging

# Setup logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Initialize user-agent rotator
ua = UserAgent()

# Proxy configuration (replace with your OkeyProxy credentials)
proxies = {
    "http": "http://username:password@your_proxy_address:port",
    "https": "http://username:password@your_proxy_address:port"
}

def get_x_posts(query, max_posts=100):
    """Scrape X posts for a given search query."""
    base_url = "https://x.com/search"
    headers = {"User-Agent": ua.random}
    posts = []
    page = 1

    while len(posts) < max_posts:
        try:
            # Construct search URL with pagination
            url = f"{base_url}?q={query}&src=typed_query&f=live"
            if page > 1:
                url += f"&page={page}"

            # Make request with retries
            for attempt in range(3):
                response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
                if response.status_code == 200:
                    break
                logging.warning(f"Attempt {attempt + 1} failed with status {response.status_code}")
                time.sleep(random.uniform(2, 5))  # Random delay to avoid rate limits
            else:
                logging.error("Max retries reached. Skipping page.")
                break

            # Parse HTML
            soup = BeautifulSoup(response.text, 'html.parser')
            post_elements = soup.select('article[data-testid="tweet"]')

            for post in post_elements:
                try:
                    # Extract post details
                    text = post.select_one('div.css-901oao').text.strip()
                    username = post.select_one('a[role="link"]').text.strip()
                    timestamp = post.select_one('time').get('datetime', 'N/A')
                    posts.append({"username": username, "text": text, "timestamp": timestamp})
                except AttributeError as e:
                    logging.warning(f"Failed to parse post: {e}")
                    continue

            logging.info(f"Scraped {len(post_elements)} posts from page {page}")
            if not post_elements or len(posts) >= max_posts:
                break

            page += 1
            time.sleep(random.uniform(1, 3))  # Avoid rate limits

        except Exception as e:
            logging.error(f"Error on page {page}: {e}")
            break

    return posts[:max_posts]

def save_to_csv(posts, filename="x_posts.csv"):
    """Save scraped posts to CSV."""
    df = pd.DataFrame(posts)
    df.to_csv(filename, index=False)
    logging.info(f"Saved {len(posts)} posts to {filename}")

if __name__ == "__main__":
    query = "AI model training"  # Replace with your search term
    posts = get_x_posts(query, max_posts=100)
    save_to_csv(posts)

Code Explanation

 ● User-Agent Rotation: Uses fake-useragent to rotate user agents per request.

 ● Proxy Integration: Configures OkeyProxy for IP rotation to avoid blocks.

 ● Pagination: Handles multiple pages by appending &page={page} to the URL.

 ● Error Handling: Retries failed requests up to three times and logs errors.

 ● Delays: Random delays (1–5 seconds) prevent rate limiting.

 ● Data Extraction: Uses BeautifulSoup to parse post elements and extract text, username, and timestamp.

AI & Machine Learning Use Case

Structuring and Exporting Data

The script exports data to a CSV file with columns for username, text, and timestamp. 

This structured format is ideal for:

 ● Preprocessing: Clean text by removing emojis, URLs, or mentions using libraries like re or nltk (a minimal sketch follows this list).

 ● Feature Engineering: Extract hashtags, mentions, or sentiment scores for model input.

 ● Export: The CSV can be loaded into pandas or PyTorch/TensorFlow datasets.
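As a minimal preprocessing sketch for the bullets above, assuming the x_posts.csv produced by the scraper; the cleaning rules are illustrative rather than exhaustive:

python

import re
import pandas as pd

df = pd.read_csv("x_posts.csv")  # Output of the scraper above

def clean_text(text):
    text = re.sub(r"https?://\S+", "", text)   # Remove URLs
    text = re.sub(r"@\w+", "", text)           # Remove mentions
    text = re.sub(r"[^\w\s#.,!?']", "", text)  # Crudely strip emojis and symbols
    return re.sub(r"\s+", " ", text).strip()

df["clean_text"] = df["text"].astype(str).apply(clean_text)
df["hashtags"] = df["text"].astype(str).apply(lambda t: re.findall(r"#\w+", t))
df.to_csv("x_posts_clean.csv", index=False)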

Training/Fine-Tuning Models

X data is perfect for:

 ● Chatbots: Fine-tune models such as LLaMA or BERT on conversational data for better dialogue generation.

 ● Sentiment Analysis: Train classifiers to predict sentiment on product-related posts.

 ● Domain-Specific Models: Scrape posts with hashtags like #AI or #FinTech to create specialized datasets for niche models.

Domain-Specific Scraping

To improve model accuracy, focus on targeted queries (e.g., #MachineLearning, @Influencer). This ensures your dataset aligns with your AI model’s domain, reducing noise and improving performance.
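For example, reusing get_x_posts and save_to_csv from the walkthrough above (assuming they are in the same file or imported), a hashtag-focused run might look like this:

python

from urllib.parse import quote

# URL-encode the hashtag so '#' survives in the query string
query = quote("#MachineLearning")
posts = get_x_posts(query, max_posts=200)
save_to_csv(posts, filename="machine_learning_posts.csv")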

Common Challenges & Troubleshooting

1. Rate Limits

Challenge: X imposes strict rate limits to prevent excessive requests from a single IP address, often resulting in HTTP 429 (Too Many Requests) errors. This can halt scraping operations, especially when collecting large datasets for AI model training.

OkeyProxy provides a pool of rotating residential and datacenter IPs, allowing you to distribute requests across multiple IP addresses to bypass rate limits. Because the IPs rotate, X perceives the requests as coming from different users, reducing the likelihood of hitting limits.

Implementation:

 ● Sign Up and Obtain Proxy List: Register with OkeyProxy and choose a plan (e.g., residential proxies for better anonymity). You’ll receive access to a proxy pool or a list of IPs with credentials.

 ● Configure Rotating Proxies: OkeyProxy supports rotating IPs via an API or sticky sessions. Use the API to fetch a new IP for each request or set a rotation interval (e.g., every 5 minutes):

python

import requests
import time
from random import uniform
from fake_useragent import UserAgent

# OkeyProxy API endpoint for rotating IPs
proxy_api = "http://api.okeyproxy.com/v1/get_proxy?key=YOUR_API_KEY"

def get_proxy():
    response = requests.get(proxy_api)
    proxy_data = response.json()
    return {
        "http": f"http://{proxy_data['username']}:{proxy_data['password']}@{proxy_data['ip']}:{proxy_data['port']}",
        "https": f"http://{proxy_data['username']}:{proxy_data['password']}@{proxy_data['ip']}:{proxy_data['port']}"
    }

# Example request with rotating proxy
def make_request(url, headers):
    proxies = get_proxy()
    for attempt in range(3):
        try:
            response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
            if response.status_code == 200:
                return response
            print(f"Attempt {attempt + 1} failed with status {response.status_code}")
            time.sleep(uniform(2, 5))  # Random delay
        except Exception as e:
            print(f"Error: {e}")
            time.sleep(uniform(2, 5))
    return None

# Usage
headers = {"User-Agent": UserAgent().random}
url = "https://x.com/search?q=AI&src=typed_query&f=live"
response = make_request(url, headers)

 

 ● Random Delays: Combine IP rotation with random delays (1–5 seconds) to mimic human behavior and further reduce rate limit triggers.

 ● Monitoring: Log the number of requests per IP and monitor for 429 errors. OkeyProxy’s dashboard provides usage analytics to track proxy performance.
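One way to track this is a simple per-IP counter; the sketch below assumes the get_proxy() helper from the snippet above and is illustrative, not an OkeyProxy feature:

python

import logging
import requests
from collections import Counter

request_counts = Counter()    # Requests sent per proxy endpoint
error_429_counts = Counter()  # 429 responses per proxy endpoint

def tracked_request(url, headers):
    proxies = get_proxy()                          # Helper defined in the earlier snippet
    endpoint = proxies["http"].rsplit("@", 1)[-1]  # host:port portion, used for logging only
    request_counts[endpoint] += 1
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    if response.status_code == 429:
        error_429_counts[endpoint] += 1
        logging.warning(f"429 from {endpoint} (total: {error_429_counts[endpoint]})")
    return response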

OkeyProxy's large pool of residential IPs (often millions) ensures high anonymity and low block rates. The service also offers geo-targeting, allowing you to use IPs from specific regions to match X’s expected user demographics.

2. CAPTCHAs

Challenge: Frequent requests from a single IP or repetitive request patterns can trigger X’s CAPTCHA challenges, stopping your scraper until manual intervention or CAPTCHA-solving services are used.

OkeyProxy’s residential proxies mimic real user IPs, making it harder for X to flag your requests as bots. Rotating user agents alongside proxies further enhances anonymity.

Implementation:

 ● Integrate Residential Proxies: Residential proxies are less likely to trigger CAPTCHAs compared to datacenter proxies. Configure your script to use OkeyProxy’s residential proxy pool: 

 

python

proxies = {
    "http": "http://username:password@your_residential_proxy:port",
    "https": "http://username:password@your_residential_proxy:port"
}

 ● Rotate User Agents: Use the fake-useragent library to change user agents per request, reducing the chance of pattern detection:

python

from fake_useragent import UserAgent

ua = UserAgent()
headers = {"User-Agent": ua.random}

 ● Proxy Rotation Frequency: Set OkeyProxy to rotate IPs every 1–5 requests or after a fixed interval (e.g., 10 minutes). This can be done via OkeyProxy’s API or by cycling through a pre-fetched proxy list (see the sketch after this list).

 ● CAPTCHA Monitoring: If CAPTCHAs still appear, integrate a CAPTCHA-solving service like 2Captcha or Anti-CAPTCHA with OkeyProxy. However, residential proxies typically minimize CAPTCHA triggers.
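As a minimal sketch of the pre-fetched-list option, here is one way to rotate every few requests; the proxy entries below are hypothetical placeholders, and the real format depends on your OkeyProxy plan:

python

import requests
from itertools import cycle

# Hypothetical pre-fetched proxy list (replace with entries from your provider)
proxy_list = [
    "http://user:pass@proxy1.example:8000",
    "http://user:pass@proxy2.example:8000",
    "http://user:pass@proxy3.example:8000",
]
proxy_pool = cycle(proxy_list)
ROTATE_EVERY = 5  # Switch to the next IP every 5 requests

request_count = 0
current_proxy = next(proxy_pool)

def rotating_get(url, headers):
    """Issue a GET request, moving to the next proxy every ROTATE_EVERY requests."""
    global request_count, current_proxy
    if request_count and request_count % ROTATE_EVERY == 0:
        current_proxy = next(proxy_pool)
    request_count += 1
    proxies = {"http": current_proxy, "https": current_proxy}
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)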

OkeyProxy's residential proxies are sourced from real devices, making them less detectable by X’s anti-bot systems. The ability to rotate IPs frequently ensures your scraper remains under the radar.

3. Dynamic Content

Challenge: X’s posts often load dynamically via JavaScript, meaning a simple requests.get() may not retrieve all content, as the HTML returned lacks fully rendered posts.

While OkeyProxy itself doesn’t directly address dynamic content, it complements tools like selenium or playwright by providing IPs to prevent blocks during browser-based scraping. These tools render JavaScript, ensuring all posts are accessible.

Implementation:

 ● Use Selenium with OkeyProxy: 

python

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import time

# Configure proxy
proxy = "username:password@your_proxy_address:port"
chrome_options = Options()
chrome_options.add_argument(f'--proxy-server=http://{proxy}')
chrome_options.add_argument(f'--user-agent={UserAgent().random}')

# Initialize browser
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://x.com/search?q=AI&src=typed_query&f=live")

# Scroll to load dynamic content
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(3)  # Wait for content to load

# Parse with BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
posts = soup.select('article[data-testid="tweet"]')

driver.quit()

 ● Proxy Rotation with Selenium: Fetch a new proxy from OkeyProxy’s API for each browser session to avoid IP-based blocks.

 ● Optimize Performance: Use headless mode (chrome_options.add_argument('--headless')) to reduce resource usage, but test rendering first, as some dynamic elements may behave differently.
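Since a single scroll rarely loads everything, here is a rough sketch that continues from the driver created in the example above and keeps scrolling until the post count stops growing. It reuses the selectors from this guide, which may need updating:

python

import random  # time, BeautifulSoup, and driver come from the Selenium example above

collected = set()
target = 200  # Stop once this many unique posts are collected
stalls = 0    # Consecutive scrolls that produced no new posts

while len(collected) < target and stalls < 3:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(random.uniform(2, 4))  # Give new posts time to render
    soup = BeautifulSoup(driver.page_source, "html.parser")
    before = len(collected)
    for article in soup.select('article[data-testid="tweet"]'):
        text_el = article.select_one('div.css-901oao')
        if text_el:
            collected.add(text_el.get_text(strip=True))
    stalls = stalls + 1 if len(collected) == before else 0

print(f"Collected {len(collected)} unique posts")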

Browser-based scraping with selenium or playwright is resource-intensive and prone to IP bans due to prolonged sessions. OkeyProxy’s rotating IPs ensure your browser instances remain unblocked, even during long scraping sessions.

4. Selector Changes

Challenge: X frequently updates its DOM structure, causing selectors (e.g., article[data-testid="tweet"]) to break, leading to failed scrapes.

While OkeyProxy doesn’t directly address selector changes, its reliable proxy infrastructure ensures uninterrupted access to X during testing and selector updates. This allows you to quickly re-inspect the site and adjust selectors without IP bans disrupting your workflow.

Implementation:

 ● Regular Selector Updates: Use browser Developer Tools to re-inspect X’s HTML after a scrape fails.

 ○ Update selectors in your script (e.g., change div.css-901oao to a new class like div.css-1xyz).

 ○ Test with a single request using OkeyProxy to confirm the new selector works: 

 

python

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

proxies = get_proxy()  # From the OkeyProxy API helper defined earlier
response = requests.get("https://x.com/search?q=AI", headers={"User-Agent": UserAgent().random}, proxies=proxies)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.select_one('article[data-testid="tweet"]'))  # Test selector

 ● Automate Selector Testing: Write a script to log selector failures and alert you when elements are missing, then use OkeyProxy to fetch fresh pages for inspection.
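A minimal health-check sketch along those lines, using the selectors from this guide (the dictionary layout and alerting hook are illustrative):

python

import logging

# Selectors currently assumed by the scraper; update these when X's DOM changes
SELECTORS = {
    "post": 'article[data-testid="tweet"]',
    "text": 'div.css-901oao',
    "timestamp": 'time',
}

def check_selectors(soup):
    """Log which selectors no longer match anything on a freshly fetched page."""
    broken = [name for name, css in SELECTORS.items() if soup.select_one(css) is None]
    if broken:
        logging.error(f"Selectors need updating: {', '.join(broken)}")
    else:
        logging.info("All selectors matched.")
    return broken

Call check_selectors(soup) on the soup returned by the test request above, and route the error log to whatever alerting you already use.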

Frequent DOM changes require repeated site access to test selectors. OkeyProxy’s IP rotation prevents blocks during this iterative process, ensuring you can reliably access X’s front-end.

Additional Tips for Using OkeyProxy

 ● Choose the Right Proxy Type: Residential proxies are ideal for X due to their high anonymity. Datacenter proxies are cheaper but more likely to be flagged.

 ● Geo-Targeting: If scraping region-specific data (e.g., posts from the US), configure OkeyProxy to use IPs from that region to avoid geo-restrictions.

 ● Monitor Usage: Use OkeyProxy’s dashboard to track bandwidth and request limits, ensuring you stay within your plan’s quota.

 ● Combine with Other Tools: Pair OkeyProxy with fake-useragent and random delays for a robust anti-detection setup.

Manual vs Proxy-Enabled vs API Comparison

Method | Pros | Cons
Manual | Free, simple for small-scale scraping | Slow, prone to rate limits, manual selector updates
Proxy-Enabled | Bypasses rate limits, scalable, anonymous | Costs for proxy services, requires setup
API | Official, reliable, structured data | Expensive, strict limits, requires authentication

For most AI training use cases, proxy-enabled scraping strikes a balance between cost and scalability.

What Is OkeyProxy?

OkeyProxy is a leading proxy service provider offering rotating residential and datacenter proxies to ensure uninterrupted web scraping. Its robust IP pool and high-speed connections make it ideal for scaling X (Twitter) scraping projects while avoiding blocks. 

With OkeyProxy, you can rotate IPs seamlessly, ensuring your scraper runs smoothly even at scale.

Try OkeyProxy Today: Sign up for a free trial to enhance your scraping workflow.

FAQs

1. Why am I getting blocked while scraping X (Twitter)?

You're likely hitting rate limits; use rotating proxies like OkeyProxy and add random delays to avoid detection.

2. How do I set up a proxy with the scraper?

Just replace the proxy details in the script with your OkeyProxy credentials to rotate IPs and stay anonymous.

3. The script runs, but I get no tweets. What is wrong?

X's layout may have changed; inspect the page with Developer Tools and update your selectors accordingly.

4. Can this work with hashtags or specific topics?

Yes, just change the search query in the script to target hashtags, keywords, or usernames.

5. What if tweets aren’t fully loading in the HTML?

Use Selenium or Playwright with OkeyProxy to render JavaScript and access all dynamic content.

Conclusion

Scraping X data in 2025 is a powerful way to gather real-time, conversational datasets for AI model training, sentiment analysis, or market research. By combining robust tools like Python, BeautifulSoup, and proxy services like OkeyProxy, you can build a scalable, ethical scraping pipeline. 

Always prioritize responsible scraping, respect platform policies, and ensure compliance with data privacy laws. 

With the right approach, X data can unlock new possibilities for your AI projects. 

For further reading, explore OkeyProxy’s blog or try their free trial to scale your scraping efforts.

Ethical Note: Ensure compliance with X’s terms and local data privacy laws. Only scrape publicly available data and avoid sharing sensitive user information.