Skip to main content
Thirdwatchthirdwatch
Social media

Build a Twitter Data Pipeline for Research (2026 Guide)

Build a Twitter/X research data pipeline using Thirdwatch. Tweet ingestion + entity extraction + Postgres + analysis recipes.

Apr 27, 2026 · 6 min read · 1,314 words
See the scraper →

Thirdwatch's Twitter Scraper makes research-pipeline building a structured workflow — keyword + handle + hashtag ingestion, entity extraction, sentiment-enrichment, Postgres-friendly schema. Built for social-science researchers, finance research, marketing-research teams, and academic-research labs that need scalable Twitter datasets.

Why build a Twitter research pipeline

Twitter remains the canonical public-discourse dataset. According to academic studies indexed in Web of Science, more than 50,000 peer-reviewed papers use Twitter data as primary or supplementary research source — making it the most-studied social-media platform for academic and applied research. For social-science labs, finance-research functions, and marketing-research teams, a structured Twitter data pipeline is the foundation of most discourse-analysis work.

The job-to-be-done is structured. A social-science research lab ingests 5M tweets per study covering 200 keywords across 24 months. A finance-research function tracks $TICKER cashtag mentions for sentiment-driven backtest research. A marketing-research team studies brand-mention sentiment across 50 brands quarterly. An academic discourse-analysis project ingests political conversations across 12 election cycles. All reduce to query + ingestion + enrichment + persistence.

How does this compare to the alternatives?

Three options for Twitter research data:

Approach Cost per 10M tweets Reliability Setup time Maintenance
Twitter API v2 Academic Research Free (when available) Authoritative, gated 4–12 weeks (approval) Requires re-application
Brandwatch / Sprinklr (research tier) $50K–$200K/year High, with sentiment tools Days Vendor contract
Thirdwatch Twitter Scraper Pay per result HTTP via syndication API 5 minutes Thirdwatch tracks Twitter changes

Twitter API Academic Research is the gold standard for academic projects when accessible. Brandwatch/Sprinklr offer commercial-research depth. The Twitter Scraper actor page gives you raw access at the lowest unit cost without gatekeeping.

How to build a research pipeline in 4 steps

Step 1: How do I authenticate against Apify?

Sign in at apify.com (free tier, no credit card), open Settings → Integrations, and copy your personal API token:

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"

Step 2: How do I ingest tweets matching a keyword?

Pass keyword query and a high maxResultsPerQuery.

import os, requests, datetime, json, pathlib

ACTOR = "thirdwatch~twitter-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

KEYWORDS = ["climate change", "renewable energy", "carbon tax",
            "fossil fuel", "ev adoption", "solar power",
            "wind energy", "battery storage"]

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={"queries": KEYWORDS, "maxResults": 8000,
          "maxResultsPerQuery": 1000},
    timeout=3600,
)
records = resp.json()
ts = datetime.datetime.utcnow().strftime("%Y%m%d")
pathlib.Path(f"raw/twitter-{ts}.jsonl").write_text(
    "\n".join(json.dumps(r) for r in records)
)
print(f"{ts}: {len(records)} tweets ingested")

8 keywords × 1000 = up to 8000 tweets per daily snapshot.

Step 3: How do I extract entities and enrich?

Parse mentions, hashtags, URLs, then add language and sentiment.

import pandas as pd, re
from langdetect import detect

df = pd.DataFrame(records)
df["createdAt"] = pd.to_datetime(df.createdAt)

MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")
CASHTAG_RE = re.compile(r"\$([A-Z]{2,5})\b")

def extract(t):
    if not isinstance(t, str):
        return [], [], [], []
    return (MENTION_RE.findall(t), HASHTAG_RE.findall(t),
            URL_RE.findall(t), CASHTAG_RE.findall(t))

df[["mentions", "hashtags", "urls", "cashtags"]] = df.text.apply(
    lambda t: pd.Series(extract(t))
)

def safe_lang(t):
    try:
        return detect(t) if t and len(t) > 10 else "unknown"
    except Exception:
        return "unknown"
df["lang"] = df.text.apply(safe_lang)

# Simple sentiment via keyword scoring (replace with FinBERT for real work)
POS = {"great", "good", "win", "rise", "growth", "innovate"}
NEG = {"bad", "loss", "lose", "drop", "fail", "scandal", "fraud"}
def score(t):
    if not isinstance(t, str):
        return 0
    words = set(t.lower().split())
    return len(words & POS) - len(words & NEG)
df["sentiment_score"] = df.text.apply(score)

print(df[["text", "lang", "sentiment_score", "hashtags"]].head(10))

For production, swap the keyword sentiment with FinBERT or LLM-based classification.

Step 4: How do I persist to Postgres for analysis?

Indexed schema for query-friendly research access.

import psycopg2

with psycopg2.connect(...) as conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS tweets (
            id BIGINT PRIMARY KEY,
            text TEXT,
            author_username TEXT,
            author_followers BIGINT,
            created_at TIMESTAMPTZ,
            like_count INT,
            reply_count INT,
            retweet_count INT,
            view_count BIGINT,
            mentions TEXT[],
            hashtags TEXT[],
            cashtags TEXT[],
            lang TEXT,
            sentiment_score INT,
            search_query TEXT
        );
        CREATE INDEX IF NOT EXISTS idx_tweets_created_at ON tweets(created_at);
        CREATE INDEX IF NOT EXISTS idx_tweets_hashtags_gin ON tweets USING GIN(hashtags);
        CREATE INDEX IF NOT EXISTS idx_tweets_search_query ON tweets(search_query);
    """)
    for _, t in df.iterrows():
        cur.execute(
            """INSERT INTO tweets VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
               ON CONFLICT (id) DO NOTHING""",
            (int(t.id), t.text, t.authorUsername, int(t.authorFollowers or 0),
             t.createdAt, int(t.likeCount or 0), int(t.replyCount or 0),
             int(t.retweetCount or 0), int(t.viewCount or 0),
             list(t.mentions), list(t.hashtags), list(t.cashtags),
             t.lang, int(t.sentiment_score), t.searchString)
        )

print(f"Persisted {len(df)} tweets")

GIN indexes on hashtag/mention arrays enable fast cross-tweet research queries.

Sample output

A single tweet record looks like this. Five rows weigh ~5 KB.

{
  "id": "1781234567890123456",
  "text": "Excited to see EV adoption accelerate in 2026 — every major automaker now committed #climate #ev",
  "authorUsername": "policyresearcher",
  "authorFollowers": 12500,
  "createdAt": "2026-04-27T14:30:00Z",
  "likeCount": 245,
  "replyCount": 18,
  "retweetCount": 87,
  "viewCount": 12500,
  "lang": "en"
}

id is the canonical natural key — stable across snapshots. The four engagement metrics (likes, replies, retweets, views) provide the core research-relevant signals. createdAt enables temporal research analysis.

Common pitfalls

Three things go wrong in research pipelines. Sampling bias — Twitter's syndication API returns a non-uniform sample weighted toward verified and high-engagement accounts; for representative research, document the sampling bias explicitly and apply re-weighting where necessary. Tweet-deletion handling — accounts delete tweets at varying rates; persist raw scraped payloads alongside derived fields so research can re-link to the original event-time content. Language-detection error — short tweets and emoji-heavy tweets confuse language-detection libraries; for accurate per-language research, supplement with multilingual transformer-based detection.

Thirdwatch's actor handles the scraping plumbing so you can focus on the data. Pair Twitter with Reddit Scraper and Instagram Scraper for cross-platform discourse research. A fourth subtle issue worth flagging: Twitter's viewCount field has been increasingly noisy since the visibility-restriction changes of 2023 — accounts under reduced visibility see proportionally undercounted views vs equivalent-engagement unrestricted accounts; for academic discourse-analysis, weight engagement (likes + replies + retweets) more heavily than views as the canonical reach proxy. A fifth pattern unique to research-pipeline work: maintaining IRB-compliant data minimization while preserving research utility requires careful schema design — for human-subjects research, anonymize authorUsername to a hashed ID before persisting and store the username-to-hash mapping separately under access controls. A sixth and final pitfall: longitudinal research projects spanning multiple years face Twitter platform-change drift (2022 algorithm changes, 2023 verification changes, 2025 ad-policy changes); for time-series comparability, document the platform-change events and consider time-window-specific normalizations rather than treating multi-year tweet datasets as comparable across the full span. A seventh and final pattern worth flagging for production teams: data-pipeline cost optimization. The actor's pricing scales linearly with record volume, so for high-cadence operations (hourly polling on large watchlists), the dominant cost driver is the size of the watchlist rather than the per-record fee. For cost-disciplined teams, tier the watchlist (Tier 1 hourly, Tier 2 daily, Tier 3 weekly) rather than running everything at the highest cadence — typical 60-80% cost reduction with minimal signal loss. Combine tiered cadence with explicit dedup keys and incremental snapshot diffing to keep storage and downstream-compute proportional to new signal rather than total watchlist size. This is the difference between a lean research pipeline and one that quietly burns ten times the budget for the same actionable output. An eighth subtle issue worth flagging: snapshot-storage strategy materially affects long-term pipeline economics. Raw JSON snapshots compressed with gzip typically run 4-8x smaller than uncompressed; for multi-year retention, always compress at write-time. For high-frequency snapshots, partition storage by date prefix (snapshots/YYYY/MM/DD/) to enable fast date-range queries and incremental processing rather than full-scan re-aggregation. Most production pipelines keep 90 days of raw snapshots at full fidelity + 12 months of derived per-record aggregates + indefinite retention of derived metric time-series — three retention tiers managed separately.

Related use cases

Frequently asked questions

What research uses Twitter data?

Three major research surfaces: (1) social-science research (polarization, misinformation, network dynamics) — Twitter as the canonical public-discourse dataset; (2) financial-research (FinTwit sentiment, earnings reactions); (3) marketing-and-brand-research (consumer attitudes, viral-content mechanics). For all three, Twitter remains the highest-velocity, broadest public-discourse surface available without official-API gatekeeping.

What pipeline architecture works for research?

Four-stage pipeline: (1) ingestion (actor pulls tweets matching keywords or handles); (2) entity extraction (parsing mentions, hashtags, URLs, cashtags); (3) enrichment (sentiment, language detection, topic clustering); (4) persistence (Postgres + indexed for query). For longitudinal research, keep raw payloads alongside derived fields for re-processing as ML models improve.

How fresh does research data need to be?

For real-time research (event-tracking, breaking-news studies), hourly cadence is justified. For social-science longitudinal research, daily or weekly snapshots are sufficient. For financial-research (earnings reactions, news-driven trades), 15-minute cadence around key events. The actor's syndication-API path supports any of these without rate-limit issues.

How big can a research dataset get?

Practically, 50M-200M tweets is the sweet spot for academic research datasets. Beyond that, storage and processing costs dominate. Pay-per-result pricing with volume tiers means cost scales linearly with dataset size and drops at higher tiers — affordable for grant-funded projects but still meaningful at the highest scales. Most academic teams scope to 1-10M tweets per study question.

Can I publish research using scraped Twitter data?

Yes, with caveats. Most academic publishers accept research using Twitter data sourced via web scraping for non-commercial research, particularly when paired with proper IRB approval and tweet-ID-only citation (rather than full-text re-publication, which Twitter's TOS prohibits). Always consult your institution's compliance office for specific publication paths.

How does this compare to Twitter API v2 Academic Research access?

Twitter API v2 Academic Research access (when available) provides authoritative, TOS-compliant access at no cost for vetted academic users. Application + approval typically takes 4-12 weeks and access has been intermittent in 2023-2026. For projects within Academic Research access, that path is preferred. For projects outside it (most commercial research, time-sensitive academic projects), the actor is the available alternative.

Related

Try it yourself

100 free credits, no credit card.

About 30 real searches. Add the MCP to Claude or Cursor in two minutes.