Research Competitor Content with Google Search (2026)
Map competitor content footprint via Google Search using Thirdwatch. Site: queries + topic-cluster mapping + content-gap analysis.

Thirdwatch's Google Search Scraper lets content-strategy teams, SEO research functions, and competitive-intelligence analysts map competitor content footprints —
site:query batches, topic-cluster mapping, content-gap detection. Built for content-strategy planning, SEO competitive research, and topic-coverage gap analysis at scale.
Why research competitor content with Google Search
Content strategy is competitive. According to Ahrefs' 2024 SEO report, the top-ranking page for any commercial keyword captures 30%+ of organic traffic, and content gaps (topics where competitors rank but you don't) typically account for 40-60% of viable organic-growth opportunity. For content-strategy teams, SEO research functions, and competitive-intelligence analysts, mapping competitor content footprint at scale is the foundation of opportunity-discovery work.
The job-to-be-done is structured. A content-strategy team maps competitor topic coverage across 5-10 direct competitors quarterly. An SEO research function builds content-gap reports for SMB clients. A competitive-intelligence analyst tracks topic-cluster expansion across competitor portfolios for strategy briefings. A platform-builder integrates competitor content discovery into their SEO SaaS feature set. All reduce to site:competitor.com query batches + topic clustering + gap detection.
How does this compare to the alternatives?
Three options for competitor content mapping:
| Approach | Cost per 10,000 pages | Reliability | Setup time | Maintenance |
|---|---|---|---|---|
| Ahrefs / SEMrush | $99–$999/month per seat | High, with traffic estimates | Hours | Per-seat license |
| Manual sitemap parsing | Effectively unbounded analyst time | Sitemap may be incomplete | Continuous | Per-competitor |
| Thirdwatch Google Search Scraper | Pay per record | Production-tested with search-engine-friendly proxy | 5 minutes | Thirdwatch tracks Google changes |
Ahrefs/SEMrush offer comprehensive SEO data at the high end. The Google Search Scraper actor page gives you the surface-level content map at the lowest unit cost — useful as cost-optimized first pass before deeper Ahrefs analysis on shortlisted opportunities.
How to research competitor content in 4 steps
Step 1: How do I authenticate against Apify?
Sign in at apify.com (free tier, no credit card), open Settings → Integrations, and copy your personal API token:
export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"Step 2: How do I pull competitor site: queries?
Pass site:competitor.com queries scoped to topic keywords.
import os, requests, pandas as pd
from itertools import product
ACTOR = "thirdwatch~google-search-scraper"
TOKEN = os.environ["APIFY_TOKEN"]
COMPETITORS = ["stripe.com", "adyen.com", "checkout.com", "square.com"]
TOPICS = ["payments", "pricing", "fraud detection", "subscriptions",
"tutorial", "case study", "integration"]
queries = [f"site:{c} {t}" for c, t in product(COMPETITORS, TOPICS)]
resp = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
params={"token": TOKEN},
json={"queries": queries, "maxResults": 50},
timeout=3600,
)
df = pd.DataFrame(resp.json())
print(f"{len(df)} indexed pages across {len(COMPETITORS)} competitors")4 competitors × 7 topics = 28 queries × 50 results = up to 1,400 pages — well within budget for a quarterly competitive sweep.
Step 3: How do I cluster pages into topic pillars?
TF-IDF + KMeans clustering of page titles + descriptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
df["text"] = df.title.fillna("") + " " + df.snippet.fillna("")
df["competitor"] = df.url.str.extract(r"https?://(?:www\.)?([^/]+)")
vectorizer = TfidfVectorizer(max_features=500, stop_words="english", ngram_range=(1, 2))
X = vectorizer.fit_transform(df.text)
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10)
df["topic"] = kmeans.fit_predict(X)
topic_terms = []
for i, center in enumerate(kmeans.cluster_centers_):
top_terms = [vectorizer.get_feature_names_out()[idx]
for idx in center.argsort()[-5:][::-1]]
topic_terms.append((i, ", ".join(top_terms)))
print("Topic clusters:")
for t, terms in topic_terms:
print(f" Topic {t}: {terms}")
per_competitor_topic = df.groupby(["competitor", "topic"]).size().unstack(fill_value=0)
print(per_competitor_topic)Topic-cluster volume per competitor reveals content-strategy bets.
Step 4: How do I detect content gaps?
Run topic queries without the site: operator, check competitor rankings vs yours.
TOPIC_QUERIES = [f"{t} for ecommerce" for t in TOPICS]
resp2 = requests.post(
f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
params={"token": TOKEN},
json={"queries": TOPIC_QUERIES, "maxResults": 20},
timeout=3600,
)
gap_df = pd.DataFrame(resp2.json())
gap_df["domain"] = gap_df.url.str.extract(r"https?://(?:www\.)?([^/]+)")
YOUR_DOMAIN = "yoursite.com"
gaps = []
for query, grp in gap_df.groupby("searchString"):
domains = grp.domain.tolist()
competitor_count = sum(1 for d in domains if d in COMPETITORS)
if competitor_count >= 2 and YOUR_DOMAIN not in domains:
gaps.append({"query": query, "competitor_count": competitor_count})
print(f"{len(gaps)} content gaps where 2+ competitors rank but you don't")
print(pd.DataFrame(gaps).sort_values("competitor_count", ascending=False))Gaps with 3+ competitors ranking are highest-priority — strong evidence the topic is worth pursuing.
Sample output
A single Google Search result looks like this. Five rows weigh ~5 KB.
{
"title": "Pricing | Stripe",
"url": "https://stripe.com/pricing",
"snippet": "Stripe's pricing is simple, transparent, and integrated...",
"displayedUrl": "stripe.com › pricing",
"position": 1
}url is the canonical natural key. position indicates ranking rank (1 = top result). snippet provides ~160 chars of page content for topic-clustering. title enables exact-match domain identification.
Common pitfalls
Three things go wrong in competitor-content research. site: query coverage gaps — Google's site: returns at most 100-200 pages per competitor regardless of true index size; for comprehensive coverage, supplement with sitemap-parsing. Topic-cluster instability — KMeans with k=10 can produce different clusters across runs; for stable topic IDs across snapshots, fix random_state and consider hierarchical clustering for more interpretable hierarchies. Subdomain confusion — competitors often have blog.competitor.com and docs.competitor.com separately; site:competitor.com may not catch subdomains by default; explicitly query each subdomain.
Thirdwatch's actor handles the scraping plumbing so you can focus on the data. Pair Google Search with Google News Scraper for competitor PR + content cross-research. A fourth subtle issue worth flagging: Google's index of any given site lags by 1-4 weeks for fresh content — so a competitor that just published 20 new pages this week will show only ~5 indexed today; for currency-sensitive content gap analysis, supplement search-result scraping with direct sitemap.xml fetching from competitor domains. A fifth pattern unique to content-research work: large competitors run multiple content properties (Stripe has stripe.com, stripe.com/blog, increment.com, stripe.com/atlas-magazine, etc.) — site:stripe.com misses content on the affiliated properties; for true content-footprint mapping, build a per-competitor property-list and query each separately. A sixth and final pitfall: site: queries are subject to Google's regional ranking — a US-based scraping IP sees the US-region index, while an EU IP sees EU. For multi-region competitive coverage, run separate batches per target region with explicit country parameter on the actor. A seventh and final pattern worth flagging for production teams: data-pipeline cost optimization. The actor's pricing scales linearly with record volume, so for high-cadence operations (hourly polling on large watchlists), the dominant cost driver is the size of the watchlist rather than the per-record fee. For cost-disciplined teams, tier the watchlist (Tier 1 hourly, Tier 2 daily, Tier 3 weekly) rather than running everything at the highest cadence — typical 60-80% cost reduction with minimal signal loss. Combine tiered cadence with explicit dedup keys and incremental snapshot diffing to keep storage and downstream-compute proportional to new signal rather than total watchlist size. This is the difference between a $200/month research pipeline and a $2,000/month one for the same actionable output. An eighth subtle issue worth flagging: snapshot-storage strategy materially affects long-term pipeline economics. Raw JSON snapshots compressed with gzip typically run 4-8x smaller than uncompressed; for multi-year retention, always compress at write-time. For high-frequency snapshots, partition storage by date prefix (snapshots/YYYY/MM/DD/) to enable fast date-range queries and incremental processing rather than full-scan re-aggregation. Most production pipelines keep 90 days of raw snapshots at full fidelity + 12 months of derived per-record aggregates + indefinite retention of derived metric time-series — three retention tiers managed separately.
Related use cases
Frequently asked questions
How does Google site: search work for content research?
The `site:competitor.com` operator returns Google's indexed pages from a specific domain. Combined with topic keywords (`site:competitor.com pricing` or `site:competitor.com tutorial`), it surfaces every indexed page where the competitor covers that topic. For content-strategy teams, this maps competitor topic coverage at scale — without needing access to their CMS or sitemap.
What's a content-gap analysis?
A content-gap analysis lists topics where competitors rank but you don't. For each topic in your target keyword universe, run a Google query and check whether your domain appears in the top 20 results. Topics where 2+ competitors rank but you're absent are gaps — high-priority for content production. Most content-strategy teams find 50-200 viable gaps per quarter on a refreshed analysis.
How do I cluster competitor content into topics?
Pull `site:competitor.com` pages, extract titles + meta descriptions, run NLP topic-modeling (BERTopic or simpler TF-IDF + clustering). Most competitor sites cluster into 5-15 distinct content pillars. Cross-reference pillar coverage against your own to identify under-served topics. For SEO-research SaaS builders, topic clustering is the canonical content-gap signal.
How fresh does competitor content data need to be?
Quarterly cadence catches meaningful competitor content shifts — most companies publish 1-5 new pages per week, so a quarter captures 20-60 new pieces per competitor. For aggressive content competitors (publishing daily), monthly cadence is justified. For long-term strategic-research, annual cadence suffices.
Can I detect competitor topic-cluster shifts?
Yes. Snapshot competitor content quarterly and compute topic-cluster volume deltas. A competitor going from 5 articles on topic X to 25 in a quarter signals strategic doubling-down on that topic — usually because they're seeing organic traffic growth or competitive necessity. For your own strategy, this is high-signal: their bet on X is empirical evidence X is worth pursuing.
How does this compare to Ahrefs or SEMrush?
Ahrefs and SEMrush bundle backlink data + organic traffic estimates with content-research at $99-$999/month per seat. Their organic-rankings index is broader and more accurate than scraping live Google Search. The actor gives you raw search results — for surface-level competitor content mapping, the actor is materially cheaper. For full SEO competitive analysis (backlinks, traffic, paid keywords), Ahrefs/SEMrush are the canonical tools.
Related
100 free credits, no credit card.
About 30 real searches. Add the MCP to Claude or Cursor in two minutes.