Find YouTube Content Gaps from Transcript Analysis (2026)
Analyze YouTube video transcripts to find underserved topics and content gaps in any niche. Extract captions at scale and run keyword frequency analysis.

Thirdwatch's YouTube Transcripts Scraper extracts transcripts from YouTube videos at scale, returning full text and timestamps that power content gap analysis. Pull captions from competitor channels, run keyword frequency analysis on the transcript corpus, and identify underserved topics in any niche. No YouTube API key is needed; the examples below authenticate with an Apify token. Built for SEO strategists, content marketers, and creators looking for data-driven topic opportunities.
Why analyze YouTube transcripts to find content gaps
Content gap analysis on YouTube traditionally relies on title and description text: a few hundred characters per video. But the real substance of a video is in what the creator says, which is roughly 3,000 words for a typical 20-minute tutorial or commentary video (assuming a speaking rate of about 150 words per minute). According to Pew Research, 85% of YouTube creators producing educational content enable captions, making transcript text the largest untapped signal for topic analysis on the platform.
The job is straightforward. A SaaS content team wants to know which product features competitors discuss on YouTube that the team itself has not yet covered. A niche creator wants to find topics that existing channels mention briefly but never dedicate a full video to. An SEO strategist building a video content calendar wants data on which subtopics have high search volume but low video coverage. All of these reduce to the same three steps: extract transcripts from a corpus of videos, compute which topics are discussed and at what depth, and compare the result against your own coverage. Title-only analysis misses the nuance; transcript-level analysis catches it.
How does this compare to the alternatives?
| Approach | Analysis depth | Setup time | Ongoing cost and limits |
|---|---|---|---|
| Manual video watching + note-taking | Perfect understanding, does not scale | Hours per video | Not feasible beyond 20 videos |
| Title/description keyword tools (TubeBuddy, vidIQ) | Surface-level only | 10 minutes | Subscription fees |
| Thirdwatch YouTube Transcripts Scraper + keyword analysis | Full transcript depth at scale | 30 minutes | Thirdwatch maintains extraction |
Title-based tools like vidIQ and TubeBuddy analyze what creators put in titles and tags, but many videos cover topics that never appear in the title. A video titled "My Morning Routine" might spend 10 minutes discussing a specific productivity tool — invisible to title-based analysis but clear in the transcript. Transcript-level gap analysis catches these hidden topic signals.
How to find content gaps from YouTube transcripts in 4 steps
Step 1: How do I collect competitor video IDs for analysis?
Start with the YouTube Scraper to pull video metadata from competitor channels or search results. Then feed the video IDs into the transcript extractor:
import os
from collections import Counter

import pandas as pd
import requests

TRANSCRIPT_ACTOR = "thirdwatch~youtube-transcripts-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

# Video IDs from a prior YouTube Scraper run
# (e.g., top 100 videos from 5 competitor channels in your niche)
competitor_video_ids = [
    "VIDEO_ID_1", "VIDEO_ID_2", "VIDEO_ID_3",
    "VIDEO_ID_4", "VIDEO_ID_5", "VIDEO_ID_6",
    # ... up to 100-200 video IDs
]

For best results, collect 50-200 videos from 5-10 channels in your niche. Include both top performers (by views) and recent uploads (last 6 months) to capture both evergreen and trending topics.
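The selection rule above (top performers plus recent uploads) could be sketched as follows. The metadata field names (`video_id`, `views`, `published_at`), the sample rows, and the `select_video_ids` helper are assumptions for illustration, not the YouTube Scraper's confirmed output schema.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# Hypothetical metadata rows; field names are assumed, not a confirmed schema.
now = datetime(2026, 1, 15, tzinfo=timezone.utc)
meta = pd.DataFrame([
    {"video_id": "a1", "views": 900_000, "published_at": now - timedelta(days=700)},
    {"video_id": "b2", "views": 12_000, "published_at": now - timedelta(days=30)},
    {"video_id": "c3", "views": 450_000, "published_at": now - timedelta(days=400)},
    {"video_id": "d4", "views": 3_000, "published_at": now - timedelta(days=90)},
    {"video_id": "e5", "views": 1_000, "published_at": now - timedelta(days=900)},
])


def select_video_ids(meta, top_n=2, recent_days=182, as_of=None):
    """Union of the top-N videos by views and uploads from roughly the last 6 months."""
    as_of = as_of or datetime.now(timezone.utc)
    top = meta.nlargest(top_n, "views")["video_id"]
    cutoff = as_of - timedelta(days=recent_days)
    recent = meta.loc[meta["published_at"] >= cutoff, "video_id"]
    # dict.fromkeys removes duplicates while preserving order
    return list(dict.fromkeys([*top, *recent]))


selected_ids = select_video_ids(meta, as_of=now)
print(selected_ids)
```

Taking the union rather than only the top videos by views keeps newer uploads in the corpus even when they have not yet accumulated many views.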
Step 2: How do I extract transcripts from the competitor corpus?
Pass the video IDs to the actor. For gap analysis, you only need the full text — disable timestamps to reduce payload size:
resp = requests.post(
    f"https://api.apify.com/v2/acts/{TRANSCRIPT_ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "videoIds": competitor_video_ids,
        "languageCode": "en",
        "preferManual": True,
        "includeTimestamps": False,
        "maxResults": 200,
    },
    timeout=600,
)
resp.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
records = resp.json()

# Videos without captions come back as records with an "error" key
transcripts = [r for r in records if "error" not in r]
errors = [r for r in records if "error" in r]
print(f"Extracted {len(transcripts)} transcripts "
      f"({len(errors)} videos had no captions)")

df = pd.DataFrame(transcripts)
df["word_count"] = df["transcript_text"].apply(lambda t: len(t.split()))
print(f"Total corpus: {df.word_count.sum():,} words")

A 100-video corpus of 10-15 minute videos yields roughly 150,000-225,000 words at a typical speaking rate of about 150 words per minute, which is a substantial text corpus for keyword frequency analysis.
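If a single synchronous call proves too slow or too large for a long ID list, one option is to split the IDs into smaller batches and send one request per batch. The sketch below shows only the splitting; the `chunked` helper and the batch size of 3 are illustrative assumptions, not documented limits of the actor.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


video_ids = [f"VIDEO_ID_{n}" for n in range(1, 8)]
batches = list(chunked(video_ids, 3))

# Each batch would become the "videoIds" value of one request like the one above;
# the returned records can then be concatenated before building the DataFrame.
for batch in batches:
    print(len(batch), batch)
```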
Step 3: How do I extract topic frequencies from the transcript corpus?
Run n-gram frequency analysis on the full corpus to identify the most discussed topics. Filter common stop words and normalize for corpus size:
import re

STOP_WORDS = set(["the", "a", "an", "is", "are", "was", "were", "be",
    "been", "being", "have", "has", "had", "do", "does", "did", "will",
    "would", "could", "should", "may", "might", "shall", "can", "to",
    "of", "in", "for", "on", "with", "at", "by", "from", "as", "into",
    "through", "during", "before", "after", "and", "but", "or", "not",
    "no", "so", "if", "than", "too", "very", "just", "about", "up",
    "out", "that", "this", "it", "i", "you", "we", "they", "he", "she",
    "my", "your", "our", "their", "what", "which", "who", "when",
    "where", "how", "all", "each", "every", "both", "few", "more",
    "most", "other", "some", "such", "only", "own", "same", "then",
    "now", "here", "there", "also", "like", "going", "really", "know",
    "think", "get", "got", "one", "make", "want", "right"])

def extract_bigrams(text):
    """Return adjacent word pairs after lowercasing and stop word removal."""
    words = re.findall(r'[a-z]+', text.lower())
    words = [w for w in words if w not in STOP_WORDS and len(w) > 2]
    return [f"{words[i]} {words[i+1]}" for i in range(len(words) - 1)]

# Count bigrams across the entire corpus, and track how many videos contain
# each bigram. Counting from the filtered tokens (rather than substring-matching
# the raw text) keeps the per-video count consistent with how bigrams are built.
corpus_bigrams = Counter()
video_counts = Counter()
for text in df["transcript_text"]:
    bigrams = extract_bigrams(text)
    corpus_bigrams.update(bigrams)
    video_counts.update(set(bigrams))

# Top 50 topic phrases in competitor content
print("Top 50 topics discussed by competitors:")
for phrase, count in corpus_bigrams.most_common(50):
    print(f"  {phrase:30s}  mentions={count:5d}  "
          f"videos={video_counts[phrase]}/{len(df)}")

Bigrams like "email marketing", "landing page", and "conversion rate" reveal what competitors talk about most. The videos count shows topic breadth: a phrase in 80% of videos is table stakes, while a phrase in only 10% might be an emerging subtopic.
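The step above mentions normalizing for corpus size, while the counts printed are raw totals. One simple way to normalize, shown here as a sketch, is to express each phrase as occurrences per 10,000 corpus words so that corpora of different sizes can be compared. The `rate_per_10k` helper and the toy counts are assumptions for illustration.

```python
from collections import Counter


def rate_per_10k(counts, total_words):
    """Convert raw phrase counts into occurrences per 10,000 corpus words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return {phrase: n * 10_000 / total_words for phrase, n in counts.items()}


# Toy counts standing in for corpus_bigrams; the numbers are made up.
small_corpus = Counter({"email marketing": 30, "landing page": 12})
large_corpus = Counter({"email marketing": 60, "landing page": 90})

small_rates = rate_per_10k(small_corpus, total_words=20_000)
large_rates = rate_per_10k(large_corpus, total_words=100_000)

for phrase in small_corpus:
    print(f"{phrase:20s} small={small_rates[phrase]:.1f} large={large_rates[phrase]:.1f}")
```

In this toy data "email marketing" has the higher raw count in the large corpus, but its normalized rate is lower there, which is the distortion normalization is meant to remove.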
Step 4: How do I identify actual content gaps from the frequency data?
Compare the competitor topic list against your own channel's transcripts. Topics with high competitor frequency but zero or low coverage on your channel are gaps:
# Extract your own channel's transcripts the same way
my_video_ids = ["MY_VID_1", "MY_VID_2", "MY_VID_3"]  # your channel
resp_mine = requests.post(
    f"https://api.apify.com/v2/acts/{TRANSCRIPT_ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "videoIds": my_video_ids,
        "languageCode": "en",
        "preferManual": True,
        "includeTimestamps": False,
        "maxResults": 100,
    },
    timeout=600,
)
resp_mine.raise_for_status()
my_transcripts = [r for r in resp_mine.json() if "error" not in r]

# Count bigrams in your own transcripts with the same tokenizer,
# so phrases are matched the same way they were extracted
my_bigrams = Counter()
for t in my_transcripts:
    my_bigrams.update(extract_bigrams(t["transcript_text"]))

# Per-video bigram sets for the competitor corpus
competitor_sets = [set(extract_bigrams(text)) for text in df["transcript_text"]]

# Find gaps: high competitor frequency, absent or rare in your content
gaps = []
for phrase, count in corpus_bigrams.most_common(200):
    competitor_coverage = sum(phrase in s for s in competitor_sets)
    my_mentions = my_bigrams[phrase]
    if competitor_coverage >= len(df) * 0.2 and my_mentions < 3:
        gaps.append({
            "topic": phrase,
            "competitor_mentions": count,
            "competitor_videos": competitor_coverage,
            "your_mentions": my_mentions,
        })

# Explicit columns keep sort_values from failing when no gaps are found
gaps_df = pd.DataFrame(
    gaps,
    columns=["topic", "competitor_mentions", "competitor_videos", "your_mentions"],
).sort_values("competitor_videos", ascending=False)
print(f"\nContent gaps ({len(gaps_df)} topics competitors cover that you don't):")
print(gaps_df.head(20).to_string(index=False))

Each row in the output is a candidate content gap: a topic that at least 20% of competitor videos discuss but your channel mentions fewer than three times. Prioritize by competitor_videos (breadth of coverage) and cross-reference with search volume data from Google Trends for validation.
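The cross-referencing step could be sketched as a simple priority score. Everything here is an assumption for illustration: the toy gap rows, the hand-entered `search_interest` scores (imagined as 0-100 values noted from a keyword tool), the corpus size of 100 videos, and the product formula itself.

```python
import pandas as pd

# Toy gap rows in the same shape as gaps_df; all numbers are illustrative.
gap_rows = pd.DataFrame([
    {"topic": "email marketing", "competitor_videos": 40, "your_mentions": 0},
    {"topic": "landing page", "competitor_videos": 25, "your_mentions": 2},
    {"topic": "conversion rate", "competitor_videos": 30, "your_mentions": 1},
])

# Hypothetical 0-100 interest scores entered by hand from a keyword tool.
search_interest = {"email marketing": 50, "landing page": 90, "conversion rate": 20}

total_videos = 100  # size of the competitor corpus (assumed)
ranked = gap_rows.assign(
    coverage=gap_rows["competitor_videos"] / total_videos,
    interest=gap_rows["topic"].map(search_interest).fillna(0),
)
# Simple product score: broad competitor coverage and strong search interest.
ranked["priority"] = ranked["coverage"] * ranked["interest"]
ranked = ranked.sort_values("priority", ascending=False).reset_index(drop=True)
print(ranked[["topic", "coverage", "interest", "priority"]].to_string(index=False))
```

A product score is only one possible choice; it pushes a topic down when either signal is weak, which may or may not match how you want to weigh breadth against demand.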
Sample output
A single transcript record used as input to the analysis. With includeTimestamps set to false, the payload is lean — just the text and metadata.
{
  "video_id": "VIDEO_ID_1",
  "video_url": "https://www.youtube.com/watch?v=VIDEO_ID_1",
  "language_code": "en",
  "language_name": "English",
  "is_auto_generated": true,
  "auto_translated": false,
  "available_languages": [
    {"code": "en", "name": "English (auto-generated)", "is_auto_generated": true}
  ],
  "transcript_text": "today we're going to talk about content marketing strategies that actually work in 2026 the first thing you need to understand is ...",
  "segment_count": 342,
  "total_duration_seconds": 1247.6,
  "data_source": "youtube_innertube"
}

is_auto_generated: true flags that this transcript may have word-level inaccuracies, which is important to know before computing keyword frequencies. total_duration_seconds: 1247.6 (about 21 minutes) and segment_count: 342 indicate dense conversational content with roughly one caption line every 3.6 seconds. The transcript_text field contains the full text that feeds into the bigram analysis, roughly 3,000 words for a video of this length at about 150 words per minute.
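The figures quoted above can be derived from the record itself. The sketch below uses the field names from the sample record; the `transcript_stats` helper is hypothetical, and the transcript text is truncated here, so the words-per-minute value is only meaningful on a full transcript.

```python
record = {
    "video_id": "VIDEO_ID_1",
    "is_auto_generated": True,
    # Truncated sample text; a real record holds the full transcript.
    "transcript_text": "today we're going to talk about content marketing strategies",
    "segment_count": 342,
    "total_duration_seconds": 1247.6,
}


def transcript_stats(rec):
    """Derive duration, caption pacing, and speaking rate from one record."""
    minutes = rec["total_duration_seconds"] / 60
    words = len(rec["transcript_text"].split())
    return {
        "minutes": round(minutes, 1),
        "seconds_per_segment": round(
            rec["total_duration_seconds"] / rec["segment_count"], 2
        ),
        "words_per_minute": round(words / minutes, 1),
    }


stats = transcript_stats(record)
print(stats)
```

On full transcripts, a words-per-minute value far outside a normal speaking range can be a hint that captions are incomplete or duplicated.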
Common pitfalls
Three traps in transcript-based content gap analysis:
- Stop word pollution: without aggressive stop word filtering, your top bigrams will be "going to", "want to", "need to" instead of actual topics. Use a domain-specific stop word list and filter to bigrams where both words are meaningful.
- Auto-caption noise: auto-generated captions misspell proper nouns and technical terms, fragmenting your keyword counts. "machine learning" might be transcribed correctly in some videos and as "machine earning" in others. Normalize text with a spell-checker before counting, or filter to is_auto_generated: false videos for cleaner data.
- Small sample bias: analyzing fewer than 30 videos from a single channel tells you about that creator's vocabulary, not the niche's topics. Pull from at least 5 channels and 50+ videos for statistically meaningful patterns.
Thirdwatch's actor returns the is_auto_generated flag on every record, letting you weight manual captions higher in your analysis. Structured error records for no-caption videos ensure your coverage metrics are accurate rather than silently inflated.
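Weighting manual captions higher could be sketched as follows. The weight of 0.5 for auto-generated transcripts, the simplified `bigrams` tokenizer (no stop word filtering), and the `weighted_bigram_counts` helper are all assumptions for illustration; only the `is_auto_generated` and `transcript_text` field names come from the sample record.

```python
import re
from collections import Counter

AUTO_WEIGHT = 0.5  # assumed down-weight for auto-generated captions


def bigrams(text):
    """Adjacent word pairs from lowercase ASCII tokens (no stop word filter)."""
    words = re.findall(r"[a-z]+", text.lower())
    return list(zip(words, words[1:]))


def weighted_bigram_counts(records, auto_weight=AUTO_WEIGHT):
    """Sum bigram counts, counting auto-generated transcripts at reduced weight."""
    totals = Counter()
    for rec in records:
        weight = auto_weight if rec["is_auto_generated"] else 1.0
        for pair in bigrams(rec["transcript_text"]):
            totals[" ".join(pair)] += weight
    return totals


records = [
    {"is_auto_generated": False, "transcript_text": "email marketing works"},
    {"is_auto_generated": True, "transcript_text": "email marketing email marketing"},
]
weighted = weighted_bigram_counts(records)
print(weighted.most_common(3))
```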
Related use cases
- YouTube Transcripts Scraper on Apify Store
- Scrape YouTube transcripts for content research
- Build a video content dataset from YouTube captions
- Extract YouTube podcast transcripts at scale
- Research YouTube keyword competition
- Build a YouTube content trend pipeline
- The complete guide to scraping social media data
- All Thirdwatch use-case guides
Frequently asked questions
How many videos do I need for meaningful content gap analysis?
A minimum of 50-100 videos from 5-10 channels in your niche produces reliable keyword frequency distributions. Fewer than 30 videos risks noise from individual creator style rather than true topic patterns.
Can I compare transcript topics across competing channels?
Yes. Extract transcripts per channel, compute keyword frequencies independently, then compare. Topics that appear in high-performing channels but are absent in yours represent direct content gaps.
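A per-channel comparison might look like the sketch below. The channel labels and phrase lists are toy data, and `channel_gaps` is a hypothetical helper; in practice the phrases would come from a bigram extractor and the channel label from your own video metadata.

```python
from collections import Counter, defaultdict

# Toy (channel, phrases) pairs, one entry per video.
video_phrases = [
    ("competitor_a", ["email marketing", "landing page"]),
    ("competitor_a", ["email marketing"]),
    ("competitor_b", ["landing page", "conversion rate"]),
    ("my_channel", ["email marketing"]),
]

# Independent phrase frequencies for each channel
per_channel = defaultdict(Counter)
for channel, phrases in video_phrases:
    per_channel[channel].update(phrases)


def channel_gaps(per_channel, mine):
    """Map each phrase that `mine` never uses to the other channels that use it."""
    mine_counts = per_channel.get(mine, Counter())
    gaps = defaultdict(set)
    for channel, counts in per_channel.items():
        if channel == mine:
            continue
        for phrase in counts:
            if mine_counts[phrase] == 0:
                gaps[phrase].add(channel)
    return {phrase: sorted(channels) for phrase, channels in gaps.items()}


channel_gap_map = channel_gaps(per_channel, "my_channel")
print(channel_gap_map)
```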
Does this work for non-English niches?
Yes. Set languageCode to any ISO 639-1 code. The frequency analysis itself is language-agnostic, but the [a-z]+ tokenizer in Step 3 only matches unaccented Latin letters, so you will need a Unicode-aware word pattern and a language-specific stop word list for filtering.
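One caveat is that a tokenizer based on `[a-z]+` splits words at accented characters, so it needs adapting for most non-English text. A Unicode-aware variant of the bigram extractor might look like the sketch below. The tiny Spanish stop word list and the `extract_bigrams_unicode` name are illustrative assumptions, and languages written without spaces between words would need a dedicated word segmenter instead.

```python
import re

# [^\W\d_]+ matches runs of Unicode letters, unlike [a-z]+.
WORD_RE = re.compile(r"[^\W\d_]+")

# Tiny illustrative Spanish stop word list; a real list would be much longer.
STOP_WORDS_ES = {"el", "la", "de", "en", "y", "que", "para", "es", "un", "una"}


def extract_bigrams_unicode(text, stop_words, min_len=3):
    """Adjacent word pairs after lowercasing and stop word removal."""
    words = [w for w in WORD_RE.findall(text.lower())
             if w not in stop_words and len(w) >= min_len]
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


sample = "La optimización de la página para el público español"
es_bigrams = extract_bigrams_unicode(sample, STOP_WORDS_ES)
print(es_bigrams)

# For comparison: the ASCII-only pattern breaks words at accented letters.
ascii_tokens = re.findall(r"[a-z]+", sample.lower())
print(ascii_tokens)
```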
How do I know if a content gap is actually an opportunity?
Cross-reference transcript gaps with search volume data from Google Trends or a keyword tool. A topic absent from competitor transcripts but with rising search volume is a validated opportunity.
Can auto-generated captions skew keyword analysis?
Yes, especially for technical terms and proper nouns. Filter to videos with is_auto_generated set to false for higher accuracy, or apply spell-check normalization before computing frequencies.