Build a LinkedIn Thought-Leadership Dataset for AI Research

Assemble a clean B2B LinkedIn post corpus for AI fine-tuning, qualitative research, or content analysis with Thirdwatch's actor: schema, sampling, and ethics.

May 12, 2026

Thirdwatch's LinkedIn Post Scraper is the cleanest way to assemble a B2B thought-leadership dataset from public LinkedIn posts — full post bodies, author identity and headline, engagement counts, edit status, and media URLs in a stable per-post schema. Built for AI teams fine-tuning B2B voice models, researchers studying B2B content patterns, and founders mining the executive-publishing layer of LinkedIn for narrative intelligence. This is the practical guide.

Why build a LinkedIn thought-leadership corpus

LinkedIn is the single largest published archive of professional voice on the open web. Estimates vary, but LinkedIn's own platform data places member-generated posts at the scale of millions per day. For AI teams, that is a uniquely structured training distribution — long-form professional prose, clearly attributable to a single author, with built-in engagement labels that proxy for quality. For qualitative researchers, it is the largest naturally-occurring corpus of B2B narrative there is. For founders studying how a category's thought leaders frame the problem they are about to attack, it is the raw material for narrative positioning work.

The job-to-be-done is consistent: assemble a clean, deduplicated, license-aware sample of N posts across M authors in a category. The actor solves the per-post extraction layer; the rest of the work is sourcing, sampling, and storage hygiene. Done right, you end up with a corpus that is reproducible (rows keyed on stable IDs), refreshable (re-scrape to catch edits), and ethically defensible (public posts only, with documented sampling).

How does this compare to alternatives?

Three realistic paths for a thought-leadership dataset:

| Approach | Reliability | Setup time | Maintenance |
| --- | --- | --- | --- |
| Buy a pre-built B2B content dataset | Coverage opaque, expensive at scale | Days (vendor procurement) | Vendor-managed |
| DIY LinkedIn scraping + IP pool | Brittle, weeks of anti-bot engineering | 2-3 weeks | Ongoing |
| Thirdwatch LinkedIn Post Scraper + your sourcing layer | Production-tested, transparent per-row pricing | A morning to wire up | Thirdwatch tracks LinkedIn changes |

For research-quality datasets the actor is usually the right primitive — you control sampling, you own the data, you can document the methodology, and per-row pricing makes a 50,000-post corpus economically reasonable. See the LinkedIn Post Scraper actor page for the live spec.

How to build a thought-leadership dataset in 6 steps

Step 1: How do I authenticate against Apify?

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"

API tokens live in Settings → Integrations on apify.com.

Step 2: How do I define the sampling frame?

Define your corpus shape before scraping. The four decisions that matter:

  1. Category — fintech founders, B2B SaaS marketing leaders, AI researchers, etc.
  2. Author list — 30 to 5,000 LinkedIn handles, depending on depth-vs-breadth.
  3. Time window — last 12 months is typical; longer windows skew toward earlier-platform formats.
  4. Per-author cap — 20 to 200 posts per author to prevent prolific posters from dominating the corpus.

Written down as a plain dict, a frame with those four fields looks like this:

sampling_frame = {
    "category": "b2b-saas-founders",
    "authors": ["author-handle-1", "author-handle-2", "..."],
    "time_window": "12mo",
    "max_per_author": 50,
}

Document this in a SAMPLING.md alongside the dataset — future-you and any reviewer will need it.
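One way to keep SAMPLING.md in sync with the frame is to generate it from the same dict. The sketch below is a minimal illustration, not part of the actor: `render_sampling_md` and `example_frame` are hypothetical names, and the layout of the file is an assumption you can change freely.

```python
def render_sampling_md(frame: dict) -> str:
    """Render a sampling-frame dict as a short, human-readable record."""
    lines = [
        "# Sampling frame",
        "",
        f"- Category: {frame['category']}",
        f"- Authors: {len(frame['authors'])} handles",
        f"- Time window: {frame['time_window']}",
        f"- Per-author cap: {frame['max_per_author']} posts",
        "",
        "## Author handles",
        "",
    ]
    # List every handle so a reviewer can reconstruct the exact frame.
    lines += [f"- {handle}" for handle in frame["authors"]]
    return "\n".join(lines) + "\n"


# Hypothetical frame with the same keys as the one defined in Step 2.
example_frame = {
    "category": "b2b-saas-founders",
    "authors": ["author-handle-1", "author-handle-2"],
    "time_window": "12mo",
    "max_per_author": 50,
}
sampling_md = render_sampling_md(example_frame)
print(sampling_md)
```

Writing the result next to the dataset is then one line, for example `pathlib.Path("SAMPLING.md").write_text(sampling_md, encoding="utf-8")`.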

Step 3: How do I source post URLs?

The actor enriches URLs; sourcing them is a separate layer. Practical patterns:

# Sketch of URL sourcing. source_recent_post_urls is a placeholder you implement.
def source_recent_post_urls(handle, max_n):
    """Return up to max_n public post URLs for one author.

    Back this with one or more sourcing paths:
    (a) profile activity scrape (recent posts surface as activity URNs)
    (b) Google search: site:linkedin.com/posts {handle}
    (c) curated newsroom/blog feeds that link to the LinkedIn post
    """
    urls = []  # replace with your own sourcing logic
    return urls[:max_n]

post_urls = []
for handle in sampling_frame["authors"]:
    post_urls.extend(
        source_recent_post_urls(handle, max_n=sampling_frame["max_per_author"])
    )

print(f"Sourced {len(post_urls)} URLs across {len(sampling_frame['authors'])} authors")

A good rule: over-source URLs by ~30% and let downstream filters drop the unfit ones (off-topic, too short, off-period).
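The off-period filter can be sketched against the `posted_relative` field that appears in the sample output. This is a rough illustration under a loud assumption: only the value "2w" is shown in this guide, so the other unit suffixes (`m`, `h`, `d`, `mo`, `yr`) and the helper names `approx_age_days` and `within_window` are guesses to check against your own rows.

```python
import re

# ASSUMPTION: suffixes other than "w" are guesses; verify against real rows.
_UNIT_DAYS = {"m": 1 / 1440, "h": 1 / 24, "d": 1, "w": 7, "mo": 30, "yr": 365}
_PATTERN = re.compile(r"^\s*(\d+)\s*(mo|yr|m|h|d|w)\b")


def approx_age_days(posted_relative):
    """Return the approximate post age in days, or None if unparseable."""
    if not posted_relative:
        return None
    match = _PATTERN.match(posted_relative)
    if match is None:
        return None
    count, unit = int(match.group(1)), match.group(2)
    return count * _UNIT_DAYS[unit]


def within_window(posted_relative, max_days=365):
    """True when the post parses and is no older than max_days."""
    age = approx_age_days(posted_relative)
    return age is not None and age <= max_days


print(approx_age_days("2w"), within_window("2w"), within_window("2yr"))
```

Relative timestamps are coarse (a post labelled "1yr" may be well over twelve months old), so treat the window boundary as approximate and record that in your sampling notes.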

Step 4: How do I batch-enrich URLs through the actor?

Chunk the URL list into 200-post runs (the per-run cap) and fire them in parallel. A 10,000-post corpus needs 50 runs; the thread pool below keeps 8 of them in flight at a time, which Apify's API handles comfortably.

import os
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
import requests

ACTOR = "thirdwatch~linkedin-post-scraper"
TOKEN = os.environ["APIFY_TOKEN"]
CHUNK_SIZE = 200  # per-run cap

def enrich_chunk(urls):
    """Run the actor synchronously on one chunk and return its dataset rows."""
    r = requests.post(
        f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
        params={"token": TOKEN},
        json={"postUrls": urls, "maxPosts": len(urls)},
        timeout=900,
    )
    r.raise_for_status()  # fail loudly instead of treating an error body as rows
    return r.json()

chunks = [post_urls[i:i + CHUNK_SIZE] for i in range(0, len(post_urls), CHUNK_SIZE)]
all_rows = []
with ThreadPoolExecutor(max_workers=8) as ex:
    # ex.map preserves chunk order and re-raises any exception from a worker
    for rows in ex.map(enrich_chunk, chunks):
        all_rows.extend(rows)

df = pd.DataFrame(all_rows)
df = df.drop_duplicates("post_id")  # dedup across reshares
print(f"Corpus: {len(df)} unique posts, {df.author_name.nunique()} authors")

drop_duplicates("post_id") collapses reshare wrappers and accidental duplicate URLs in your sourcing list.
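Accidental duplicates can also be caught before enrichment, so they are never sent to the actor at all. The sketch below normalizes URLs with the standard library and keeps the first occurrence. It assumes that query strings and fragments on LinkedIn post URLs are tracking noise rather than part of the post identity; `normalize_post_url`, `dedupe_urls`, and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_post_url(url: str) -> str:
    """Lower-case the host, drop query and fragment, strip trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def dedupe_urls(urls):
    """Return normalized URLs in first-seen order, without repeats."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_post_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Made-up URLs: the first three differ only in slash, query string, or host case.
sourced = [
    "https://www.linkedin.com/posts/example-handle_example-activity-123/",
    "https://www.linkedin.com/posts/example-handle_example-activity-123?utm_source=share",
    "https://WWW.linkedin.com/posts/example-handle_example-activity-123",
    "https://www.linkedin.com/posts/example-handle_example-activity-456",
]
unique_urls = dedupe_urls(sourced)
print(len(sourced), "->", len(unique_urls))
```

This does not reconcile different URL forms for the same post (for example a /posts/ link versus a /feed/update/ link), so the post_id dedup after enrichment is still needed.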

Step 5: How do I apply a thought-leadership quality filter?

Not every public post is a thought-leadership post. A useful filter combines four signals:

df["text_len"] = df["text"].fillna("").str.len()
df["reaction_to_comment_ratio"] = (
    df["reactions_count"].fillna(0) / (df["comments_count"].fillna(0) + 1)
)

corpus = df[
    (df["author_is_company"] == False) &        # individual voice
    (df["text_len"] >= 800) &                   # substance, not one-liners
    (df["reaction_to_comment_ratio"] < 30) &    # discussion-driving
    (df["reactions_count"] >= 20)               # not buried-and-forgotten
].copy()

print(f"After thought-leadership filter: {len(corpus)} posts "
      f"({100*len(corpus)/len(df):.0f}% of raw corpus)")

As a rough guide, expect 20-40% of a typical B2B handle's posts to pass this filter; that band is what separates thought-leadership content from daily chatter.

Step 6: How do I persist with reproducibility in mind?

For research-grade work, version your dataset so analyses against it are reproducible.

import json, hashlib, pathlib, datetime as dt

snapshot_dir = pathlib.Path("datasets/linkedin-thought-leadership")
snapshot_dir.mkdir(parents=True, exist_ok=True)

today = dt.date.today().isoformat()
csv_path = snapshot_dir / f"corpus-{today}.csv"
corpus.to_csv(csv_path, index=False)

corpus_hash = hashlib.sha256(csv_path.read_bytes()).hexdigest()[:16]
manifest = {
    "snapshot_date": today,
    "row_count": int(len(corpus)),
    "author_count": int(corpus.author_name.nunique()),
    "sampling_frame": sampling_frame,
    "sha256_prefix": corpus_hash,
}
(snapshot_dir / f"manifest-{today}.json").write_text(
    json.dumps(manifest, indent=2)
)
print(f"Snapshot {today} written ({corpus_hash})")

The manifest + hash pair means you can publish "the corpus we analyzed was the 2026-05-12 snapshot, SHA256 prefix 7c8a..." and another researcher can verify they have the same data.
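The receiving side of that check might look like the following sketch. It assumes the manifest carries a `sha256_prefix` key as written in Step 6; `verify_snapshot` is a hypothetical helper, and the demo files are created in a temporary directory purely for illustration.

```python
import hashlib
import json
import pathlib
import tempfile


def verify_snapshot(csv_path, manifest_path) -> bool:
    """Recompute the CSV's SHA-256 and compare it to the manifest's prefix."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    digest = hashlib.sha256(pathlib.Path(csv_path).read_bytes()).hexdigest()
    return digest.startswith(manifest["sha256_prefix"])


with tempfile.TemporaryDirectory() as tmp:
    csv_path = pathlib.Path(tmp) / "corpus-demo.csv"
    manifest_path = pathlib.Path(tmp) / "manifest-demo.json"

    # Write a tiny stand-in corpus and a manifest that matches it.
    csv_path.write_bytes(b"post_id,text\n1,hello\n")
    prefix = hashlib.sha256(csv_path.read_bytes()).hexdigest()[:16]
    manifest_path.write_text(json.dumps({"sha256_prefix": prefix}))
    ok_before = verify_snapshot(csv_path, manifest_path)

    # Any change to the bytes, however small, breaks the match.
    csv_path.write_bytes(b"post_id,text\n1,tampered\n")
    ok_after = verify_snapshot(csv_path, manifest_path)

print(ok_before, ok_after)
```

A 16-character prefix is enough to catch accidental mismatches; store the full digest in the manifest if you need stronger guarantees.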

Sample output

A typical row from a B2B thought-leadership corpus, identifiers redacted:

{
  "url": "https://www.linkedin.com/feed/update/urn:li:activity:74440XXXXX0000000/",
  "urn": "urn:li:activity:74440XXXXX0000000",
  "post_id": "74440XXXXX0000000",
  "text": "Most B2B SaaS pricing pages are optimized for the wrong reader. The buyer reading your pricing page in 2026 is not a procurement gatekeeper — it is an engineer or a director-level operator who already knows the category and is comparing three vendors at once. Three things that work...",
  "posted_relative": "2w",
  "edited": false,
  "author_name": "[REDACTED]",
  "author_headline": "Co-founder & CEO at [REDACTED] · ex-[REDACTED]",
  "author_profile_url": "https://www.linkedin.com/in/redacted-handle",
  "author_is_company": false,
  "reactions_count": 612,
  "comments_count": 47,
  "reposts_count": null,
  "media": []
}

This row passes the thought-leadership filter cleanly: individual author, text over 800 characters, reaction-to-comment ratio of ~13:1 (discussion-driving), 600+ reactions (not buried). At scale, a 10,000-row dataset of similar posts is the right primitive for fine-tuning a B2B voice model or running a qualitative content analysis.
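To make the arithmetic explicit, the four Step 5 signals can be evaluated for this one row in plain Python. The sample text is truncated in this guide, so the `text_len=1200` below is an assumed stand-in for the real length, and `thought_leadership_checks` is a hypothetical helper.

```python
def thought_leadership_checks(row, text_len):
    """Return (checks, ratio): one boolean per Step 5 signal, plus the ratio."""
    reactions = row.get("reactions_count") or 0
    comments = row.get("comments_count") or 0
    # Same smoothing as Step 5: +1 avoids division by zero on zero-comment posts.
    ratio = reactions / (comments + 1)
    checks = {
        "individual_author": row.get("author_is_company") is False,
        "long_enough": text_len >= 800,
        "discussion_driving": ratio < 30,
        "enough_reactions": reactions >= 20,
    }
    return checks, ratio


sample = {"author_is_company": False, "reactions_count": 612, "comments_count": 47}
# ASSUMPTION: the untruncated post body is about 1,200 characters long.
checks, ratio = thought_leadership_checks(sample, text_len=1200)
print(checks, ratio)
```

With the +1 smoothing from Step 5 the ratio works out to 612 / 48 = 12.75; the unsmoothed 612 / 47 is the roughly 13:1 figure quoted above. Either way the row sits well under the 30:1 threshold.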

Common pitfalls

Four issues recur:

  1. Sampling bias from your sourcing layer: if you only source URLs via Google search, you over-represent posts that ranked well organically. Use multiple sourcing paths and document them.
  2. PII leakage in post text: B2B authors occasionally name customers, employees, or deal sizes; for AI training, run a named-entity redaction pass first.
  3. Engagement counts as labels are not stable: counts keep accumulating after you scrape; freeze labels at snapshot time.
  4. Reshare wrappers inflate row counts: dedup on post_id, not url.
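For the PII pitfall, a real redaction pass would rely on a named-entity model to find people and company names. The sketch below is deliberately narrower: it swaps in plain regular expressions that only catch pattern-shaped identifiers (email addresses and currency amounts), as a cheap first layer. The patterns, placeholder tokens, and the `redact_patterns` name are all assumptions for illustration.

```python
import re

# Pattern-based first pass only; this is NOT named-entity recognition.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MONEY = re.compile(
    r"[$€£]\s?\d[\d,]*(?:\.\d+)?(?:\s?(?:[kKmMbB]|million|billion)\b)?"
)


def redact_patterns(text: str) -> str:
    """Replace email addresses and currency amounts with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = MONEY.sub("[AMOUNT]", text)
    return text


redacted = redact_patterns(
    "We closed a $1.2M deal; ping jane.doe@example.com for details."
)
print(redacted)
```

Names of customers and employees pass straight through this filter, so it complements an entity-level pass rather than replacing it.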

Thirdwatch's actor uses production-grade anti-bot tooling and rotates outbound IPs by default, which makes the parallel-chunked enrichment pattern above practical at 10,000-post scale without hand-rolling extra infrastructure. Two more notes: edited posts shift text retroactively, so persist a scraped_at column and treat the corpus as a slowly-changing dimension; and authors who delete their LinkedIn accounts cause all their post URLs to 404 on re-scrape — if dataset stability matters more than freshness, freeze a snapshot rather than refreshing in place.
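The slowly-changing-dimension idea can be sketched as an append-only history per post: a new version is stored only when the text actually changed, and each version carries its own `scraped_at`. The in-memory dict, the `text_hash` field, and the `record_version` helper are hypothetical; a real corpus would persist this to a table.

```python
import hashlib
from datetime import datetime, timezone


def record_version(history, row, scraped_at=None):
    """Append row to history[post_id] if its text differs from the latest version.

    Returns True when a new version was stored, False when nothing changed.
    """
    scraped_at = scraped_at or datetime.now(timezone.utc).isoformat()
    text_hash = hashlib.sha256((row.get("text") or "").encode("utf-8")).hexdigest()
    versions = history.setdefault(row["post_id"], [])
    if versions and versions[-1]["text_hash"] == text_hash:
        return False
    versions.append({**row, "scraped_at": scraped_at, "text_hash": text_hash})
    return True


history = {}
first = record_version(history, {"post_id": "p1", "text": "v1"}, "2026-05-12T00:00:00+00:00")
same = record_version(history, {"post_id": "p1", "text": "v1"}, "2026-05-19T00:00:00+00:00")
edited = record_version(history, {"post_id": "p1", "text": "v2"}, "2026-05-26T00:00:00+00:00")
print(first, same, edited, len(history["p1"]))
```

Because earlier versions are never overwritten, an analysis can always be re-run against the text as it stood on a given scrape date.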

Frequently asked questions

Is the actor suitable for building a training corpus for LLM fine-tuning?

Yes. The actor returns clean post text, author identity, headline, and engagement counts in a per-post row schema — the exact shape supervised fine-tuning pipelines expect. The full post body (not the truncated og:description) is captured, which matters for thought-leadership posts where the value sits in the back half of the text.

What sample size do I need for a defensible B2B thought-leadership corpus?

It depends on use case. For qualitative content analysis, 500-1,000 posts across 30-50 creators is usually enough. For supervised LLM fine-tuning on B2B voice, 10,000-50,000 high-engagement posts gives stable results. For pretraining-scale work, you are in 1M+ territory and the cost-per-post matters more than per-creator depth.

How do I avoid duplication when posts get reshared?

Dedup on the post_id field. When a creator reshares their own old post or another creator reshares a viral post, LinkedIn assigns a new urn:li:activity to the reshare wrapper but the original post_id stays consistent. Persist your dataset keyed on post_id and you'll naturally collapse reshare chains to the canonical original.

Can I filter for high-quality thought-leadership posts specifically?

Yes, by combining a few signals: author_is_company is false (individual voice, not a brand page), text length over 800 characters (substance, not one-liners), a reactions_count to comments_count ratio under 30:1 (discussion-driving rather than passively liked), and edited is false (first-draft clarity). Together these approximate a thought-leadership filter well. Step 5 above uses a close variant that swaps the edited check for a minimum of 20 reactions.

What about ethics and consent for training on public LinkedIn posts?

Public posts are visible to anonymous visitors and published with the expectation of public reading. That does not automatically mean they are appropriate to ingest into AI training without scrutiny. Standard research-ethics practice is to redact personally identifying information beyond the author headline, exclude posts that mention private individuals, and document your sampling frame. This guide is not legal or ethics-board advice.

How do I keep the dataset fresh over time?

Schedule incremental re-scrapes of recent URLs to catch edits and updated engagement counts, plus a weekly fresh pull of new post URLs from your sourcing layer. Persist with INSERT OR REPLACE keyed on post_id (the same key used for deduplication, since one post can appear under several URLs) so existing rows refresh in place. For long-running corpora, version snapshots quarterly so you can reproduce earlier analyses against a fixed dataset state.
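A refresh-in-place store can be sketched with SQLite from the standard library. This example keys the table on post_id, in line with the deduplication guidance earlier in the guide. The reduced column set, the in-memory database, and the `upsert_posts` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real corpus
conn.execute(
    """CREATE TABLE posts (
        post_id TEXT PRIMARY KEY,
        url TEXT,
        text TEXT,
        reactions_count INTEGER,
        scraped_at TEXT
    )"""
)


def upsert_posts(conn, rows):
    """Insert new rows; replace existing rows that share a post_id."""
    conn.executemany(
        "INSERT OR REPLACE INTO posts "
        "(post_id, url, text, reactions_count, scraped_at) "
        "VALUES (:post_id, :url, :text, :reactions_count, :scraped_at)",
        rows,
    )
    conn.commit()


# The first scrape, then a re-scrape of the same post a week later.
upsert_posts(conn, [{"post_id": "p1", "url": "https://example.com/p1",
                     "text": "v1", "reactions_count": 10, "scraped_at": "2026-05-12"}])
upsert_posts(conn, [{"post_id": "p1", "url": "https://example.com/p1",
                     "text": "v1", "reactions_count": 25, "scraped_at": "2026-05-19"}])

row_count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
latest = conn.execute(
    "SELECT reactions_count, scraped_at FROM posts WHERE post_id = 'p1'"
).fetchone()
print(row_count, latest)
```

INSERT OR REPLACE overwrites the whole row, so the earlier engagement count is gone after the second call; that is why the quarterly snapshots matter for reproducing older analyses.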
