Build a Video Content Dataset from YouTube Captions (2026)

Build a searchable dataset of YouTube video content by extracting captions at scale. Full transcript text, timestamps, language metadata. Python recipes.

May 26, 2026 · 6 min read · 1,287 words

Thirdwatch's YouTube Transcripts Scraper extracts captions and transcripts from YouTube videos and returns structured records with full text, per-segment timestamps, language metadata, and caption type. It is aimed at developers creating searchable video content datasets, RAG pipelines, topic classifiers, and analytics dashboards over YouTube libraries. No YouTube API key is needed: pass video IDs or URLs, choose a language, and receive JSON records ready for embedding or analysis. The actor handles manual and auto-generated captions across all languages YouTube supports.

Why build a video content dataset from YouTube captions

Over 500 hours of video are uploaded to YouTube every minute. For researchers, product teams, and AI engineers, that volume represents an enormous corpus of spoken knowledge — tutorials, lectures, product reviews, industry commentary — trapped in a format that cannot be searched, filtered, or analyzed with standard data tools.

Building a transcript dataset converts that video content into rows and columns. A product team can search 1,000 competitor demo videos for feature mentions. An AI engineer can index 10,000 conference talks into a vector store for semantic retrieval. A market researcher can track how often specific topics appear across an industry's YouTube presence over time. The use case is always the same: make video content queryable as text. YouTube's official Data API captions endpoint requires OAuth and restricts access to your own channels, which rules it out for third-party corpus building. The Thirdwatch actor provides unrestricted transcript extraction from any public video with structured output designed for dataset construction.

How does this compare to the alternatives?

Approach	Reliability	Setup time	Maintenance
DIY youtube-transcript-api + pandas	Manual retry and proxy management	3-6 hours	You fix breakage on YouTube changes
OpenAI Whisper (audio transcription)	High accuracy, very slow at scale	4-8 hours	GPU costs, audio download pipeline
Thirdwatch YouTube Transcripts Scraper	Handles retries, errors, proxy	5 minutes	Thirdwatch maintains

Running Whisper on downloaded audio produces excellent transcripts but requires downloading video, extracting audio, running inference on GPU, and managing storage. For videos that already have captions, extracting the existing text is orders of magnitude faster and cheaper. The Thirdwatch actor returns both human-written and auto-generated captions, and you can check is_auto_generated to decide whether the quality meets your threshold before falling back to Whisper for caption-less videos.
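The caption-versus-Whisper decision can be sketched as a small triage step. This is a minimal illustration, assuming each record is a dict shaped like the actor output described in this article (an error key on failures, is_auto_generated on successes). The triage helper, its accept_auto switch, and the presence of video_id on error records are assumptions made for the example, not part of the actor.

```python
def triage(records, accept_auto=True):
    """Split records into caption-backed rows and a Whisper queue.

    Assumes failures carry an "error" key and successes carry
    "is_auto_generated". Some failures (for example private videos)
    cannot be transcribed from audio either, so filter the queue by
    error reason if your records expose one.
    """
    keep, whisper_queue = [], []
    for rec in records:
        if "error" in rec:
            # No usable captions were returned for this video.
            whisper_queue.append(rec.get("video_id"))
        elif rec.get("is_auto_generated") and not accept_auto:
            # Captions exist but fall below the chosen quality bar.
            whisper_queue.append(rec.get("video_id"))
        else:
            keep.append(rec)
    return keep, whisper_queue


sample = [
    {"video_id": "a", "is_auto_generated": False, "transcript_text": "hi"},
    {"video_id": "b", "is_auto_generated": True, "transcript_text": "hi"},
    {"video_id": "c", "error": "no_captions_available"},
]
keep, queue = triage(sample, accept_auto=False)
```

With accept_auto=False, only the manually captioned video stays in keep and the other two IDs land in the queue. Leaving accept_auto at its default keeps auto-generated captions and queues only the failures.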

How to build a YouTube caption dataset in 4 steps

Step 1: How do I collect video IDs for my corpus?

Start by pulling video IDs from the YouTube Scraper, a channel export, or a curated list. This example uses a predefined list, but in production you would feed IDs from a search or channel crawl:

import os, requests, pandas as pd

ACTOR = "thirdwatch~youtube-transcripts-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

# Video IDs from a prior YouTube search or channel crawl
video_ids = [
    "dQw4w9WgXcQ", "9bZkp7q19f0", "kJQP7kiw5Fk",
    "RgKAFK5djSk", "JGwWNGJdvx8", "OPf0YbXqDm0",
]

For automated pipelines, pull video IDs from the YouTube Scraper's dataset and feed them directly into the transcript extraction step.
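Crawls and exports often mix bare IDs with several URL shapes, so it can help to normalize and deduplicate before extraction. The sketch below uses only the standard library. The to_video_id and normalize_ids helpers are hypothetical names, and the URL shapes handled (watch, youtu.be, shorts, embed, live) are the common ones rather than an exhaustive list.

```python
from urllib.parse import urlparse, parse_qs


def to_video_id(value):
    """Return the video ID from a YouTube URL, or the input if it is already a bare ID.

    Returns None when no ID can be found. The host is not validated.
    """
    value = value.strip()
    if "/" not in value and "." not in value:
        return value or None  # already a bare ID
    parsed = urlparse(value if "://" in value else "https://" + value)
    if parsed.netloc.lower().endswith("youtu.be"):
        return parsed.path.lstrip("/").split("/")[0] or None
    query_id = parse_qs(parsed.query).get("v")
    if query_id:
        return query_id[0]
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 2 and parts[0] in {"shorts", "embed", "live"}:
        return parts[1]
    return None


def normalize_ids(values):
    """Map mixed IDs and URLs to unique IDs, preserving first-seen order."""
    ids = (to_video_id(v) for v in values)
    return list(dict.fromkeys(i for i in ids if i))


mixed = [
    "dQw4w9WgXcQ",
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=10s",
    "https://youtu.be/9bZkp7q19f0",
    "https://www.youtube.com/shorts/kJQP7kiw5Fk",
]
video_ids = normalize_ids(mixed)
```

Deduplicating up front avoids paying for the same transcript twice when a video shows up in both a search result and a channel crawl.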

Step 2: How do I extract transcripts in bulk?

Pass the video IDs to the actor and collect structured transcript records. Set includeTimestamps based on whether you need per-segment timing or just the full text:

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "videoIds": video_ids,
        "languageCode": "en",
        "preferManual": True,
        "includeTimestamps": True,
        "maxResults": 100,
    },
    timeout=600,
)
resp.raise_for_status()  # fail fast on HTTP errors before parsing JSON
records = resp.json()

# Separate successes from errors
transcripts = [r for r in records if "error" not in r]
errors = [r for r in records if "error" in r]
print(f"Success: {len(transcripts)}, Errors: {len(errors)}")

The actor returns structured error records (not exceptions) for videos without captions, private videos, and region-locked content. This lets you track coverage rates across your corpus.
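Coverage tracking can be sketched as a small report over the returned records. This assumes the value stored under the error key is a short reason string (the FAQ mentions codes such as private_video and no_captions_available). The exact shape of error records is an assumption here, and coverage_report is a hypothetical helper.

```python
from collections import Counter


def coverage_report(records):
    """Summarize how many records succeeded and why the rest failed."""
    total = len(records)
    errors = [r for r in records if "error" in r]
    # str() keeps the counter usable even if the error value is not a plain string.
    by_reason = Counter(str(r["error"]) for r in errors)
    coverage = (total - len(errors)) / total if total else 0.0
    return {
        "total": total,
        "coverage": coverage,
        "errors_by_reason": dict(by_reason),
    }


demo_records = [
    {"video_id": "a", "transcript_text": "hello"},
    {"video_id": "b", "error": "no_captions_available"},
    {"video_id": "c", "error": "private_video"},
    {"video_id": "d", "error": "no_captions_available"},
]
report = coverage_report(demo_records)
```

Logging this report per batch gives a running view of how much of the corpus captions alone can cover.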

Step 3: How do I structure the data into a searchable dataset?

Convert the transcript records into a pandas DataFrame. The transcript_text field is your primary search column; metadata fields let you filter and slice:

df = pd.DataFrame(transcripts)

# Add derived columns
df["word_count"] = df["transcript_text"].apply(lambda t: len(t.split()))
df["has_manual_captions"] = ~df["is_auto_generated"]
df["language_count"] = df["available_languages"].apply(len)

# Filter to substantial content (skip videos of two minutes or less)
df = df[df["total_duration_seconds"] > 120]

print(f"Dataset: {len(df)} videos, "
      f"{df.word_count.sum():,} total words, "
      f"{df.word_count.median():.0f} median words/video")
print(df[["video_id", "language_code", "is_auto_generated",
          "word_count", "total_duration_seconds", "segment_count"]].head(10))

At a typical speaking rate of about 150 words per minute, a 500-video corpus of 10-minute tutorials yields roughly 750,000 searchable words, enough to power semantic search, topic clustering, or keyword frequency analysis.
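As a simple illustration of querying the text, here is a standard-library keyword search over transcript records. It assumes records with video_id and transcript_text fields as shown in this article. The search_transcripts helper is hypothetical, and whole-word matching as written suits plain alphanumeric terms.

```python
import re


def search_transcripts(records, term):
    """Return (video_id, hit_count) pairs for a case-insensitive
    whole-word match, ordered by hit count, highest first."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = []
    for rec in records:
        count = len(pattern.findall(rec.get("transcript_text", "")))
        if count:
            hits.append((rec["video_id"], count))
    return sorted(hits, key=lambda h: h[1], reverse=True)


demo = [
    {"video_id": "a", "transcript_text": "You know the rules. Rules are rules."},
    {"video_id": "b", "transcript_text": "Nothing relevant here"},
    {"video_id": "c", "transcript_text": "rules"},
]
results = search_transcripts(demo, "rules")
```

With the DataFrame built in this step, the same function accepts df.to_dict("records") as input.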

Step 4: How do I load transcript text into a vector store for RAG?

The transcript_text field drops directly into any embedding pipeline. For long videos, chunk the text into time windows using the segments array:

# Chunk by 60-second windows using segment timestamps
def chunk_by_time(segments, window_seconds=60):
    chunks, current, start = [], [], 0
    for seg in segments:
        if seg["start"] - start > window_seconds and current:
            chunks.append({
                "text": " ".join(s["text"] for s in current),
                "start": current[0]["start"],
                "end": current[-1]["start"] + current[-1]["duration"],
            })
            current, start = [], seg["start"]
        current.append(seg)
    if current:
        chunks.append({
            "text": " ".join(s["text"] for s in current),
            "start": current[0]["start"],
            "end": current[-1]["start"] + current[-1]["duration"],
        })
    return chunks

# Build chunks for each video
for _, row in df.iterrows():
    chunks = chunk_by_time(row["segments"], window_seconds=60)
    for chunk in chunks:
        # embed chunk["text"] and store with metadata:
        # video_id, start_time, end_time
        pass
    print(f"  {row['video_id']}: {len(chunks)} chunks")

Each chunk carries timestamp metadata, so your RAG system can cite not just the video but the exact time range where the answer appears.
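One way to carry that citation through to the vector store is to attach a deep link to each chunk. The sketch below assumes chunks shaped like the output of chunk_by_time (text, start, end). The chunk_to_document helper and the document layout are hypothetical, and the link relies on YouTube's t query parameter for a start offset in seconds.

```python
def chunk_to_document(video_id, chunk):
    """Build a store-ready document with timestamp metadata and a deep link."""
    start = int(chunk["start"])  # whole seconds for the URL offset
    return {
        "id": f"{video_id}:{start}",
        "text": chunk["text"],
        "metadata": {
            "video_id": video_id,
            "start_time": chunk["start"],
            "end_time": chunk["end"],
            "source_url": f"https://www.youtube.com/watch?v={video_id}&t={start}s",
        },
    }


doc = chunk_to_document(
    "dQw4w9WgXcQ",
    {"text": "We're no strangers to love", "start": 18.8, "end": 25.8},
)
```

Most vector stores accept an ID, a text payload, and a metadata dict in some form, so this layout usually needs only minor renaming to fit a specific client.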

Sample output

A single record from the dataset. The segments array is truncated for readability — real records contain one entry per caption line.

{
    "video_id": "dQw4w9WgXcQ",
    "video_url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "language_code": "en",
    "language_name": "English",
    "is_auto_generated": false,
    "auto_translated": false,
    "available_languages": [
        {"code": "en", "name": "English", "is_auto_generated": false},
        {"code": "es", "name": "Spanish", "is_auto_generated": false}
    ],
    "transcript_text": "We're no strangers to love You know the rules and so do I ...",
    "segment_count": 58,
    "total_duration_seconds": 212.48,
    "segments": [
        {"text": "We're no strangers to love", "start": 18.8, "duration": 7.0},
        {"text": "You know the rules and so do I", "start": 25.8, "duration": 3.5}
    ],
    "data_source": "youtube_innertube"
}

transcript_text is a single joined string — ready for embedding without any post-processing. segment_count and total_duration_seconds help you estimate content density (words per minute) for quality filtering. The available_languages array documents every caption track on the video, so you can build multilingual datasets by re-running with different languageCode values.
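The content-density estimate can be sketched as follows, assuming records with the transcript_text and total_duration_seconds fields shown above. The words_per_minute and filter_by_density helpers are hypothetical, and the 60 to 250 words-per-minute band is an illustrative threshold to tune on your own corpus.

```python
def words_per_minute(record):
    """Estimate spoken-word density from transcript length and duration."""
    minutes = record["total_duration_seconds"] / 60
    if minutes <= 0:
        return 0.0
    return len(record["transcript_text"].split()) / minutes


def filter_by_density(records, low=60, high=250):
    """Keep records whose density falls inside an illustrative band."""
    return [r for r in records if low <= words_per_minute(r) <= high]


demo = [
    {"video_id": "talk", "transcript_text": "word " * 300, "total_duration_seconds": 120},
    {"video_id": "music", "transcript_text": "la la la", "total_duration_seconds": 180},
]
dense = filter_by_density(demo)
```

Very low densities tend to indicate music or mostly silent footage, while very high ones can point to duplicated or malformed caption tracks.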

Common pitfalls

Three issues affect caption-based dataset construction at scale.

Caption coverage gaps: roughly 15-20% of YouTube videos have no captions at all (disabled by the uploader or too new for auto-generation). Plan for this by tracking your error-record ratio and budgeting Whisper transcription for the gap.

Auto-generated noise: auto-captions introduce word-level errors that compound in keyword frequency analysis. Filter on is_auto_generated and weight manual captions higher, or run a spell-check normalization pass on auto-generated text before analysis.

Duplicate videos: the same content often appears across multiple channels (reposts, clips, compilations). Deduplicate by transcript_text similarity or video duration rather than title alone.
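Deduplication by transcript similarity can be sketched with word shingles and Jaccard overlap. This is one possible technique, not something the actor provides. The helper names, the five-word shingle size, and the 0.8 threshold are assumptions to tune. The pairwise loop is quadratic, so larger corpora would need an approximate method such as MinHash.

```python
def shingles(text, size=5):
    """Return the set of overlapping word n-grams in a transcript."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_near_duplicates(records, threshold=0.8):
    """Return (id_a, id_b, score) for every pair at or above the threshold."""
    sets = [(r["video_id"], shingles(r["transcript_text"])) for r in records]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            score = jaccard(sets[i][1], sets[j][1])
            if score >= threshold:
                pairs.append((sets[i][0], sets[j][0], round(score, 2)))
    return pairs


lyrics = "we are no strangers to love you know the rules and so do i"
demo = [
    {"video_id": "original", "transcript_text": lyrics},
    {"video_id": "repost", "transcript_text": lyrics.upper()},
    {"video_id": "other", "transcript_text": "a completely different talk about data pipelines and storage"},
]
dupes = find_near_duplicates(demo)
```

Shingles tolerate small caption differences better than exact string comparison, which matters when one copy has manual captions and the other is auto-generated.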

Thirdwatch's actor returns structured error records for every failure mode — no captions, private video, region lock, age restriction — so your dataset pipeline handles gaps gracefully without crashing or leaving silent holes. The preferManual flag ensures human-written captions take priority when both types exist. For production pipelines, monitor the error-to-success ratio per batch to calibrate your Whisper fallback budget and set realistic coverage expectations for your corpus.

Frequently asked questions

How many transcripts can I extract in one run?

The actor supports up to 10,000 transcripts per run via the maxResults parameter. For larger corpora, split video IDs across multiple runs and merge the datasets downstream. Each run has a 600-second default timeout.
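Splitting a large ID list across runs can be sketched as below. The batch size default follows the 10,000-per-run limit stated above, and the payload fields mirror the input used in Step 2. The batches and build_payloads helpers are hypothetical, and sending each payload is left to the request code shown earlier.

```python
def batches(video_ids, batch_size=10_000):
    """Yield consecutive slices of at most batch_size IDs."""
    for i in range(0, len(video_ids), batch_size):
        yield video_ids[i:i + batch_size]


def build_payloads(video_ids, batch_size=10_000, language="en"):
    """Build one actor input per batch, sized to the batch it carries."""
    return [
        {
            "videoIds": batch,
            "languageCode": language,
            "preferManual": True,
            "includeTimestamps": True,
            "maxResults": len(batch),
        }
        for batch in batches(video_ids, batch_size)
    ]


ids = [f"id{i}" for i in range(5)]
payloads = build_payloads(ids, batch_size=2)
```

Merging downstream is then a matter of concatenating the per-run record lists before building the DataFrame.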

Can I feed transcript text directly into a vector database?

Yes. The transcript_text field is a single joined string designed for embedding. Set includeTimestamps to false to drop the segments array and reduce payload size by roughly 60 percent.

What format does the segments array use for timestamps?

Each segment contains text (the caption line), start (seconds from video start as a float), and duration (length of that segment in seconds). These map directly to subtitle overlay timecodes.
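To make the mapping to subtitle timecodes concrete, here is a minimal sketch that renders segments as SubRip (SRT) text. It assumes segments with text, start, and duration as described above. The to_timecode and segments_to_srt helpers are hypothetical names.

```python
def to_timecode(seconds):
    """Format seconds as an SRT timecode: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render caption segments as numbered SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = to_timecode(seg["start"])
        end = to_timecode(seg["start"] + seg["duration"])
        cues.append(f"{index}\n{start} --> {end}\n{seg['text']}")
    return "\n\n".join(cues) + "\n"


segments = [
    {"text": "We're no strangers to love", "start": 18.8, "duration": 7.0},
    {"text": "You know the rules and so do I", "start": 25.8, "duration": 3.5},
]
srt = segments_to_srt(segments)
```

Each cue's end time is simply start plus duration, so no extra fields are needed from the record.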

Does the actor handle age-restricted or private videos?

Private and age-restricted videos return structured error records with codes like private_video or age_restricted_or_login_required. Your pipeline can filter these without crashing.

Can I get captions from live stream recordings?

Yes, if the live stream has been archived with captions enabled. The actor treats archived live streams identically to regular uploads. Ongoing live streams without published captions return no_captions_available.
