Research Competitor Content with Google Search (2026)

Map competitor content footprints via Google Search using Thirdwatch: site: queries, topic-cluster mapping, and content-gap analysis.

Apr 27, 2026 · 5 min read · 1,216 words

Thirdwatch's Google Search Scraper lets content-strategy teams, SEO research functions, and competitive-intelligence analysts map competitor content footprints — site: query batches, topic-cluster mapping, content-gap detection. Built for content-strategy planning, SEO competitive research, and topic-coverage gap analysis at scale.

Why research competitor content with Google Search

Content strategy is competitive. According to Ahrefs' 2024 SEO report, the top-ranking page for any commercial keyword captures 30%+ of organic traffic, and content gaps (topics where competitors rank but you don't) typically account for 40-60% of viable organic-growth opportunity. For content-strategy teams, SEO research functions, and competitive-intelligence analysts, mapping competitor content footprints at scale is the foundation of opportunity-discovery work.

The job-to-be-done is structured. A content-strategy team maps competitor topic coverage across 5-10 direct competitors quarterly. An SEO research function builds content-gap reports for SMB clients. A competitive-intelligence analyst tracks topic-cluster expansion across competitor portfolios for strategy briefings. A platform-builder integrates competitor content discovery into their SEO SaaS feature set. All reduce to site:competitor.com query batches + topic clustering + gap detection.

How does this compare to the alternatives?

Three options for competitor content mapping:

| Approach | Cost per 10,000 pages | Reliability | Setup time | Maintenance |
| --- | --- | --- | --- | --- |
| Ahrefs / SEMrush | $99–$999/month per seat | High, with traffic estimates | Hours | Per-seat license |
| Manual sitemap parsing | Effectively unbounded analyst time | Sitemap may be incomplete | Continuous | Per-competitor |
| Thirdwatch Google Search Scraper | Pay per record | Production-tested with search-engine-friendly proxy | 5 minutes | Thirdwatch tracks Google changes |

Ahrefs and SEMrush offer comprehensive SEO data at the high end of the price range. The Google Search Scraper actor gives you a surface-level content map at the lowest unit cost, which makes it a useful, cost-optimized first pass before deeper Ahrefs analysis on shortlisted opportunities.

How to research competitor content in 4 steps

Step 1: How do I authenticate against Apify?

Sign in at apify.com (free tier, no credit card), open Settings → Integrations, and copy your personal API token:

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"

Step 2: How do I pull competitor site: queries?

Pass site:competitor.com queries scoped to topic keywords.

import os
from itertools import product

import pandas as pd
import requests

ACTOR = "thirdwatch~google-search-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

COMPETITORS = ["stripe.com", "adyen.com", "checkout.com", "square.com"]
TOPICS = ["payments", "pricing", "fraud detection", "subscriptions",
          "tutorial", "case study", "integration"]

# One query per (competitor, topic) pair, e.g. "site:stripe.com pricing"
queries = [f"site:{c} {t}" for c, t in product(COMPETITORS, TOPICS)]

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={"queries": queries, "maxResults": 50},
    timeout=3600,
)
resp.raise_for_status()  # fail fast on HTTP errors instead of parsing an error payload
# The same page can match several topic queries, so keep one row per URL
df = pd.DataFrame(resp.json()).drop_duplicates(subset="url")
print(f"{len(df)} indexed pages across {len(COMPETITORS)} competitors")

4 competitors × 7 topics = 28 queries × 50 results = up to 1,400 pages — well within budget for a quarterly competitive sweep.
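
The budget arithmetic above can be checked before spending any credits. The following is a minimal sketch, not part of the actor's API: the helper names build_site_queries and max_rows are hypothetical, and the row count is an upper bound that assumes every query returns its full quota.

```python
from itertools import product


def build_site_queries(competitors, topics):
    """Return one 'site:<domain> <topic>' query per (competitor, topic) pair."""
    return [f"site:{c} {t}" for c, t in product(competitors, topics)]


def max_rows(n_queries, max_results):
    """Upper bound on returned rows, assuming every query fills its quota."""
    return n_queries * max_results


competitors = ["stripe.com", "adyen.com", "checkout.com", "square.com"]
topics = ["payments", "pricing", "fraud detection", "subscriptions",
          "tutorial", "case study", "integration"]

queries = build_site_queries(competitors, topics)
print(len(queries), max_rows(len(queries), 50))  # 28 1400
```

Real runs usually come in under the bound, since narrow topic queries often return fewer than 50 results.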

Step 3: How do I cluster pages into topic pillars?

Cluster page titles and snippets with TF-IDF vectors and KMeans.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Cluster on title + snippet; missing fields become empty strings
df["text"] = df["title"].fillna("") + " " + df["snippet"].fillna("")
# Host without a leading "www."; expand=False returns a Series, not a one-column DataFrame
df["competitor"] = df["url"].str.extract(r"https?://(?:www\.)?([^/]+)", expand=False)

vectorizer = TfidfVectorizer(max_features=500, stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(df["text"])

kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
df["topic"] = kmeans.fit_predict(X)

# Label each cluster with its five highest-weighted terms
feature_names = vectorizer.get_feature_names_out()
topic_terms = []
for i, center in enumerate(kmeans.cluster_centers_):
    top_terms = [feature_names[idx] for idx in center.argsort()[-5:][::-1]]
    topic_terms.append((i, ", ".join(top_terms)))

print("Topic clusters:")
for t, terms in topic_terms:
    print(f"  Topic {t}: {terms}")

# Rows: competitor hosts; columns: topic IDs; values: page counts
per_competitor_topic = df.groupby(["competitor", "topic"]).size().unstack(fill_value=0)
print(per_competitor_topic)

Topic-cluster volume per competitor reveals content-strategy bets.
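
Raw counts favor whichever competitor has the most indexed pages, so it can help to compare shares instead. The sketch below is one possible way to do that; the counts table is hypothetical data standing in for the competitor-by-topic count table, not real results.

```python
import pandas as pd

# Hypothetical counts standing in for the competitor x topic table
counts = pd.DataFrame(
    {0: [12, 3], 1: [4, 9], 2: [4, 0]},
    index=["adyen.com", "stripe.com"],
)
counts.index.name = "competitor"
counts.columns.name = "topic"

# Row-normalize so each competitor's topic shares sum to 1.0
share = counts.div(counts.sum(axis=1), axis=0)

# The topic with the largest share is that competitor's biggest bet
top_topic = share.idxmax(axis=1)

print(share.round(2))
print(top_topic)
```

With these made-up numbers, adyen.com puts 60% of its pages into topic 0, while stripe.com puts 75% into topic 1.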

Step 4: How do I detect content gaps?

Run the topic queries without the site: operator, then compare which competitors rank against whether your own domain does.

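TOPIC_QUERIES = [f"{t} for ecommerce" for t in TOPICS]
resp2 = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={"queries": TOPIC_QUERIES, "maxResults": 20},
    timeout=3600,
)
resp2.raise_for_status()

gap_df = pd.DataFrame(resp2.json())
gap_df["domain"] = gap_df["url"].str.extract(r"https?://(?:www\.)?([^/]+)", expand=False)
YOUR_DOMAIN = "yoursite.com"


def owner(host, roots):
    """Return the root domain in `roots` that `host` equals or is a subdomain of, else None."""
    return next((r for r in roots if host == r or host.endswith("." + r)), None)


gaps = []
for query, grp in gap_df.groupby("searchString"):
    hosts = grp["domain"].dropna().tolist()
    # Count distinct competitors, so one domain ranking twice counts once
    ranking = {owner(h, COMPETITORS) for h in hosts} - {None}
    you_rank = any(owner(h, [YOUR_DOMAIN]) for h in hosts)
    if len(ranking) >= 2 and not you_rank:
        gaps.append({"query": query, "competitor_count": len(ranking)})

print(f"{len(gaps)} content gaps where 2+ competitors rank but you don't")
if gaps:  # sort_values would raise KeyError on an empty frame
    print(pd.DataFrame(gaps).sort_values("competitor_count", ascending=False))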

Gaps with 3+ competitors ranking are highest-priority — strong evidence the topic is worth pursuing.
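
The prioritization rule above can be made explicit with a small helper. This is a sketch under stated assumptions: prioritize_gaps is a hypothetical name, the "high" and "medium" labels are illustrative, and the sample rows are invented rather than real output.

```python
import pandas as pd


def prioritize_gaps(gaps):
    """Sort gap rows by competitor_count and label 3+ as 'high', else 'medium'."""
    if not gaps:
        return pd.DataFrame(columns=["query", "competitor_count", "priority"])
    out = pd.DataFrame(gaps).sort_values("competitor_count", ascending=False)
    out["priority"] = out["competitor_count"].map(
        lambda n: "high" if n >= 3 else "medium"
    )
    return out.reset_index(drop=True)


# Invented sample rows in the same shape as the gaps list
sample = [
    {"query": "pricing for ecommerce", "competitor_count": 2},
    {"query": "fraud detection for ecommerce", "competitor_count": 4},
]
ranked = prioritize_gaps(sample)
print(ranked)
```

Returning an empty frame with the expected columns keeps downstream code working when a sweep finds no gaps.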

Sample output

A single Google Search result looks like this. Five rows weigh ~5 KB.

{
  "title": "Pricing | Stripe",
  "url": "https://stripe.com/pricing",
  "snippet": "Stripe's pricing is simple, transparent, and integrated...",
  "displayedUrl": "stripe.com › pricing",
  "position": 1
}

url is the canonical natural key. position is the result's rank for its query (1 = top result). snippet provides ~160 chars of page content for topic-clustering. title usually carries the page and brand name (here, "Pricing | Stripe"), which helps confirm which property a page belongs to.
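
Because url is the natural key, a page that matches several queries can be collapsed to a single row. The sketch below shows one way to do that on plain dicts; dedupe_by_url is a hypothetical helper, the rows are invented, and keeping the lowest position is an assumption (positions from different queries are not strictly comparable).

```python
def dedupe_by_url(rows):
    """Keep one row per url, preferring the lowest (best) position."""
    best = {}
    for row in rows:
        key = row["url"]
        if key not in best or row["position"] < best[key]["position"]:
            best[key] = row
    return list(best.values())


# Invented rows: the pricing page matched two different queries
rows = [
    {"url": "https://stripe.com/pricing", "position": 3, "title": "Pricing | Stripe"},
    {"url": "https://stripe.com/pricing", "position": 1, "title": "Pricing | Stripe"},
    {"url": "https://stripe.com/docs", "position": 2, "title": "Documentation"},
]
unique_rows = dedupe_by_url(rows)
print(len(unique_rows))  # 2
```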

Common pitfalls

Three things commonly go wrong in competitor-content research.

site: query coverage gaps: Google's site: operator returns at most 100-200 pages per competitor regardless of true index size; for comprehensive coverage, supplement with sitemap parsing.

Topic-cluster instability: KMeans with k=10 can produce different clusters across runs. Fixing random_state makes a run reproducible on the same data, but cluster IDs can still shift when the underlying pages change between snapshots; consider hierarchical clustering for more interpretable hierarchies.

Subdomain confusion: competitors often publish on blog.competitor.com and docs.competitor.com separately. A site:competitor.com query generally includes subdomains, but they all share the same capped result set, so explicitly query each subdomain to get fuller per-property coverage.
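
Sitemap parsing, the supplement suggested for coverage gaps, needs only the standard library. The sketch below parses a sitemap that has already been downloaded and lists URLs the search results missed; the function names and the competitor.example URLs are hypothetical, and real sitemaps may be index files that point to further sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the set of <loc> values in a sitemap document.

    For a sitemap index file, these are the URLs of child sitemaps.
    """
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text}


def missing_from_search(sitemap_set, search_urls):
    """URLs listed in the sitemap that never appeared in search results."""
    return sorted(sitemap_set - set(search_urls))


# Invented sitemap content for illustration
example_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://competitor.example/pricing</loc></url>
  <url><loc>https://competitor.example/blog/fraud-guide</loc></url>
</urlset>"""

found_in_search = ["https://competitor.example/pricing"]
missed = missing_from_search(sitemap_urls(example_xml), found_in_search)
print(missed)
```

URL normalization (trailing slashes, http versus https) is left out here and would matter on real data.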

Thirdwatch's actor handles the scraping plumbing so you can focus on the data. Pair Google Search with Google News Scraper for competitor PR + content cross-research.

A fourth issue: Google's index of any given site can lag by 1-4 weeks for fresh content, so a competitor that published 20 new pages this week may show only a handful indexed today. For currency-sensitive content-gap analysis, supplement search-result scraping with direct sitemap.xml fetching from competitor domains.

A fifth pattern, unique to content-research work: large competitors run multiple content properties (Stripe has stripe.com, stripe.com/blog, increment.com, stripe.com/atlas-magazine, etc.). site:stripe.com covers paths under stripe.com but misses content on separately hosted properties such as increment.com; for true content-footprint mapping, build a per-competitor property list and query each separately.

A sixth pitfall: site: queries are subject to Google's regional ranking. A US-based scraping IP sees the US-region index, while an EU IP sees the EU one. For multi-region competitive coverage, run separate batches per target region with an explicit country parameter on the actor.

A seventh pattern worth flagging for production teams is data-pipeline cost optimization. The actor's pricing scales linearly with record volume, so for high-cadence operations (hourly polling on large watchlists), the dominant cost driver is the size of the watchlist rather than the per-record fee. For cost-disciplined teams, tier the watchlist (Tier 1 hourly, Tier 2 daily, Tier 3 weekly) rather than running everything at the highest cadence; this typically yields a 60-80% cost reduction with minimal signal loss. Combine tiered cadence with explicit dedup keys and incremental snapshot diffing to keep storage and downstream compute proportional to new signal rather than total watchlist size. This can be the difference between a $200/month research pipeline and a $2,000/month one for the same actionable output.
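
The effect of tiering can be estimated with simple arithmetic before committing to a schedule. The sketch below is illustrative only: the tier sizes are invented, a 30-day month with 4 weekly runs is assumed, and one record per item per run is assumed so that cost is proportional to record volume.

```python
# Runs per month for each cadence (assumes a 30-day month, 4 weekly runs)
RUNS_PER_MONTH = {"hourly": 24 * 30, "daily": 30, "weekly": 4}


def monthly_records(watchlist):
    """watchlist maps a cadence name to the number of tracked items at that cadence."""
    return sum(RUNS_PER_MONTH[cadence] * n for cadence, n in watchlist.items())


# Invented example: 1,000 items, all hourly versus split across three tiers
flat = monthly_records({"hourly": 1000})
tiered = monthly_records({"hourly": 250, "daily": 350, "weekly": 400})
reduction = 1 - tiered / flat

print(f"{flat:,} -> {tiered:,} records/month ({reduction:.0%} fewer)")
# 720,000 -> 192,100 records/month (73% fewer)
```

The hourly tier dominates the total, so the size of Tier 1 is the main lever on cost.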

An eighth issue: snapshot-storage strategy materially affects long-term pipeline economics. Raw JSON snapshots compressed with gzip typically run 4-8x smaller than uncompressed; for multi-year retention, always compress at write time. For high-frequency snapshots, partition storage by date prefix (snapshots/YYYY/MM/DD/) to enable fast date-range queries and incremental processing rather than full-scan re-aggregation. Most production pipelines manage three retention tiers separately: 90 days of raw snapshots at full fidelity, 12 months of derived per-record aggregates, and indefinite retention of derived metric time series.
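
A date-partitioned, gzip-compressed snapshot layout can be sketched with the standard library. The helper names write_snapshot and read_snapshot and the file name results.json.gz are hypothetical; only the snapshots/YYYY/MM/DD/ prefix comes from the text above.

```python
import gzip
import json
import tempfile
from datetime import date
from pathlib import Path


def write_snapshot(rows, root, day, name="results.json.gz"):
    """Write rows as gzip-compressed JSON under snapshots/YYYY/MM/DD/."""
    path = Path(root) / "snapshots" / f"{day:%Y/%m/%d}" / name
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(rows, fh)
    return path


def read_snapshot(path):
    """Load a snapshot written by write_snapshot."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


rows = [{"url": "https://stripe.com/pricing", "position": 1}]
with tempfile.TemporaryDirectory() as tmp:
    saved = write_snapshot(rows, tmp, date(2026, 4, 27))
    restored = read_snapshot(saved)
    relative = saved.relative_to(tmp).as_posix()

print(relative)  # snapshots/2026/04/27/results.json.gz
```

Compressing at write time means no later job has to touch uncompressed files, and the date prefix lets a range query list only the relevant directories.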

Frequently asked questions

How does Google site: search work for content research?

The `site:competitor.com` operator returns Google's indexed pages from a specific domain. Combined with topic keywords (`site:competitor.com pricing` or `site:competitor.com tutorial`), it surfaces the indexed pages where the competitor covers that topic, up to Google's per-query result cap. For content-strategy teams, this maps competitor topic coverage at scale without needing access to their CMS or sitemap.

What's a content-gap analysis?

A content-gap analysis lists topics where competitors rank but you don't. For each topic in your target keyword universe, run a Google query and check whether your domain appears in the top 20 results. Topics where 2+ competitors rank but you're absent are gaps — high-priority for content production. Most content-strategy teams find 50-200 viable gaps per quarter on a refreshed analysis.

How do I cluster competitor content into topics?

Pull `site:competitor.com` pages, extract titles + meta descriptions, run NLP topic-modeling (BERTopic or simpler TF-IDF + clustering). Most competitor sites cluster into 5-15 distinct content pillars. Cross-reference pillar coverage against your own to identify under-served topics. For SEO-research SaaS builders, topic clustering is the canonical content-gap signal.

How fresh does competitor content data need to be?

Quarterly cadence catches meaningful competitor content shifts: most companies publish 1-5 new pages per week, so a 13-week quarter captures roughly 13-65 new pieces per competitor. For aggressive content competitors (publishing daily), monthly cadence is justified. For long-term strategic research, annual cadence suffices.

Can I detect competitor topic-cluster shifts?

Yes. Snapshot competitor content quarterly and compute topic-cluster volume deltas. A competitor going from 5 articles on topic X to 25 in a quarter signals strategic doubling-down on that topic — usually because they're seeing organic traffic growth or competitive necessity. For your own strategy, this is high-signal: their bet on X is empirical evidence X is worth pursuing.
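
The delta computation can be sketched in a few lines of pandas. This assumes topics carry stable labels across snapshots (for example, keyword-based names), because integer cluster IDs from separate KMeans fits are not comparable; topic_deltas is a hypothetical helper and the counts are invented.

```python
import pandas as pd


def topic_deltas(prev, curr):
    """Compare two Series of page counts indexed by a stable topic label."""
    both = pd.concat([prev.rename("prev"), curr.rename("curr")], axis=1)
    both = both.fillna(0).astype(int)  # a topic absent from a snapshot counts as 0
    both["delta"] = both["curr"] - both["prev"]
    return both.sort_values("delta", ascending=False)


# Invented quarterly counts for one competitor
q1 = pd.Series({"fraud detection": 5, "pricing": 8})
q2 = pd.Series({"fraud detection": 25, "pricing": 9, "subscriptions": 4})

deltas = topic_deltas(q1, q2)
print(deltas)
```

In this invented example, "fraud detection" grows by 20 pages and sorts to the top, matching the 5-to-25 pattern described above.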

How does this compare to Ahrefs or SEMrush?

Ahrefs and SEMrush bundle backlink data + organic traffic estimates with content-research at $99-$999/month per seat. Their organic-rankings index is broader and more accurate than scraping live Google Search. The actor gives you raw search results — for surface-level competitor content mapping, the actor is materially cheaper. For full SEO competitive analysis (backlinks, traffic, paid keywords), Ahrefs/SEMrush are the canonical tools.
