Extract YouTube Podcast Transcripts at Scale for Content

Pull full transcripts from YouTube podcasts and interviews for repurposing into blog posts, newsletters, and social content. Timestamps included. Python.

May 26, 2026 · 6 min read · 1,269 words

Thirdwatch's YouTube Transcripts Scraper extracts full transcripts from YouTube podcasts and long-form interviews with per-segment timestamps. It returns the complete text plus timing data for repurposing into blog posts, newsletters, social threads, and show notes, and it handles both manual and auto-generated captions in whatever languages a video offers. It is built for content teams, podcast producers, and growth marketers who turn video into multi-format content at scale. A 90-minute podcast yields roughly 12,000-15,000 words of structured text in seconds, with timing data for precise clip references.

Why extract YouTube podcast transcripts for repurposing

The podcast-to-YouTube pipeline has become the dominant distribution model for long-form content. According to Edison Research, over 40% of weekly podcast listeners now consume shows on YouTube. Creators like Lex Fridman, Andrew Huberman, and thousands of niche podcasters publish full episodes as YouTube videos, making the platform the largest single archive of spoken long-form content.

The repurposing opportunity is massive but manual. A 90-minute podcast interview contains 12,000-15,000 words of spoken content — enough for 8-10 blog posts, 30 social media snippets, a newsletter series, and an FAQ page. But extracting that text by hand means watching the video, pausing, typing, and rewinding. At scale, a content team monitoring 20 competitor podcasts cannot afford manual transcription. The job-to-be-done is mechanical: pull the transcript text, preserve timestamps so clips can be referenced, and feed the text into a repurposing pipeline. YouTube already has the captions. The actor extracts them and returns structured data ready for downstream content workflows.

How does this compare to the alternatives?

Approach	Reliability	Setup time	Maintenance
Manual transcription (Rev, Descript)	High accuracy, expensive at scale	Minutes per video	Per-video cost adds up
Whisper / AssemblyAI (audio transcription)	High accuracy, requires audio download	4-8 hours pipeline setup	GPU or API costs
Thirdwatch YouTube Transcripts Scraper	Uses existing YouTube captions	5 minutes	Thirdwatch maintains

Manual transcription services like Rev charge per audio minute and take hours to turn around. Audio-to-text AI like Whisper requires downloading the video, extracting the audio, and running inference. For videos that already have captions on YouTube, which includes the vast majority of popular podcasts, reading the existing caption track takes seconds, avoids audio processing costs entirely, and returns the same spoken content. That makes the Thirdwatch actor the fastest path from a YouTube podcast URL to structured transcript text.

How to extract podcast transcripts in 4 steps

Step 1: How do I set up the Apify connection?

Create an account at apify.com and grab your API token from Settings, then Integrations:

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"

Step 2: How do I extract a full podcast transcript with timestamps?

Pass podcast video URLs and keep includeTimestamps enabled. For podcasts, you almost always want timestamps to reference specific moments:

import os

import requests

ACTOR = "thirdwatch~youtube-transcripts-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "videoUrls": [
            "https://www.youtube.com/watch?v=PODCAST_VIDEO_ID_1",
            "https://www.youtube.com/watch?v=PODCAST_VIDEO_ID_2",
        ],
        "languageCode": "en",
        "preferManual": True,
        "includeTimestamps": True,
        "maxResults": 10,
    },
    timeout=600,
)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
episodes = resp.json()

for ep in episodes:
    if "error" not in ep:
        words = len(ep["transcript_text"].split())
        mins = ep["total_duration_seconds"] / 60
        print(f"  {ep['video_id']}: {words:,} words, "
              f"{mins:.0f} min, {ep['segment_count']} segments, "
              f"manual={not ep['is_auto_generated']}")

A typical 60-minute podcast yields 8,000-10,000 words and 400-600 caption segments. The preferManual flag ensures you get the higher-quality human-written captions when the podcaster has uploaded them.
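Batches rarely come back perfectly clean, so it can help to separate usable transcripts from error records before any downstream step. The sketch below assumes, as the loop above does, that a failed video comes back as a record containing an `error` key. The helper name `partition_episodes` and the sample records are hypothetical.

```python
def partition_episodes(episodes):
    """Split actor output into usable transcripts and error records.

    Assumption: a failed video is represented by a record with an
    "error" key, mirroring the check in the loop above.
    """
    ok, failed = [], []
    for ep in episodes:
        (failed if "error" in ep else ok).append(ep)
    return ok, failed


# Illustrative records only, not real actor output.
sample = [
    {"video_id": "a1", "transcript_text": "hello world", "is_auto_generated": False},
    {"video_id": "b2", "error": "no captions available"},
]
ok, failed = partition_episodes(sample)
print(f"{len(ok)} usable, {len(failed)} failed")
```

Keeping the failed records around, instead of silently skipping them, makes it easier to retry them later or to spot channels that never return captions.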

Step 3: How do I split a podcast transcript into repurposable sections?

Use timestamps to segment the transcript into topical chunks. Podcast episodes naturally break at topic transitions, which often align with pauses or timestamp gaps:

def split_into_sections(segments, gap_threshold=5.0, min_words=200):
    """Split transcript at long pauses (likely topic changes)."""
    sections, current = [], []
    for i, seg in enumerate(segments):
        current.append(seg)
        if i < len(segments) - 1:
            gap = segments[i + 1]["start"] - (seg["start"] + seg["duration"])
            if gap > gap_threshold:
                text = " ".join(s["text"] for s in current)
                if len(text.split()) >= min_words:
                    sections.append({
                        "text": text,
                        "start_time": current[0]["start"],
                        "end_time": current[-1]["start"] + current[-1]["duration"],
                        "word_count": len(text.split()),
                    })
                    current = []
    # Flush remaining
    if current:
        text = " ".join(s["text"] for s in current)
        sections.append({
            "text": text,
            "start_time": current[0]["start"],
            "end_time": current[-1]["start"] + current[-1]["duration"],
            "word_count": len(text.split()),
        })
    return sections

for ep in episodes:
    if "error" not in ep:
        sections = split_into_sections(ep["segments"])
        print(f"\n{ep['video_id']}: {len(sections)} sections")
        for i, s in enumerate(sections):
            start_min = s["start_time"] / 60
            print(f"  Section {i+1}: {s['word_count']} words "
                  f"(starts at {start_min:.1f} min)")

Each section becomes a candidate for a standalone blog post, social media thread, or newsletter excerpt — with exact timestamps for linking back to the video moment.
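Linking back to the video moment can be done by adding a `t` parameter, in seconds, to the watch URL. The following is a minimal sketch using only the standard library; the helper names `timestamp_link` and `format_clock` are hypothetical and not part of the actor.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def timestamp_link(video_url, start_seconds):
    """Return the watch URL with a t=<seconds>s parameter added or replaced."""
    parts = urlparse(video_url)
    query = parse_qs(parts.query)
    query["t"] = [f"{int(start_seconds)}s"]  # whole seconds are enough for a deep link
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


def format_clock(seconds):
    """Format seconds as M:SS, or H:MM:SS past the one-hour mark."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


link = timestamp_link("https://www.youtube.com/watch?v=PODCAST_VIDEO_ID_1", 754.6)
print(format_clock(754.6), link)
```

Passing each section's `start_time` through these helpers gives a clock label and a deep link that can sit next to the excerpt in a blog post or in show notes.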

Step 4: How do I schedule weekly extraction for a podcast series?

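Combine with the YouTube Scraper to detect new uploads, then schedule transcript extraction. The videoUrls value below is a placeholder: a fixed URL would re-extract the same episode every week, so it needs to be replaced with the output of your discovery step: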

curl -X POST "https://api.apify.com/v2/schedules?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "weekly-podcast-transcripts",
    "cronExpression": "0 8 * * 1",
    "timezone": "America/New_York",
    "isEnabled": true,
    "actions": [{
      "type": "RUN_ACTOR",
      "actorId": "thirdwatch~youtube-transcripts-scraper",
      "runInput": {
        "videoUrls": [
          "https://www.youtube.com/watch?v=LATEST_EPISODE_ID"
        ],
        "languageCode": "en",
        "preferManual": true,
        "includeTimestamps": true,
        "maxResults": 5
      }
    }]
  }'

Add an ACTOR.RUN.SUCCEEDED webhook to pipe the transcript data into your CMS, Notion database, or content repurposing tool. Every Monday morning, the latest episode transcript lands in your pipeline ready for repurposing.
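On the receiving end, the webhook handler mostly needs to work out where the finished run stored its results. The sketch below assumes the webhook uses Apify's default payload template, in which `resource` holds the run object with a `defaultDatasetId` field. If you have customized the payload template, adjust the lookups accordingly. The function name `dataset_items_url` is hypothetical.

```python
def dataset_items_url(webhook_payload):
    """Return the dataset items URL for a succeeded run, or None.

    Assumption: the payload follows the default webhook template, with
    "eventType" at the top level and the run object under "resource".
    """
    if webhook_payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None
    dataset_id = webhook_payload.get("resource", {}).get("defaultDatasetId")
    if not dataset_id:
        return None
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?format=json"


# Illustrative payload with a made-up dataset ID.
payload = {
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "resource": {"defaultDatasetId": "EXAMPLE_DATASET_ID"},
}
print(dataset_items_url(payload))
```

Your handler can then fetch that URL with your API token and pass the records to whatever CMS or Notion integration you use.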

Sample output

A podcast transcript record. The segments array is truncated — a real 60-minute episode has 400-600 segments.

{
    "video_id": "PODCAST_VIDEO_ID_1",
    "video_url": "https://www.youtube.com/watch?v=PODCAST_VIDEO_ID_1",
    "language_code": "en",
    "language_name": "English",
    "is_auto_generated": false,
    "auto_translated": false,
    "available_languages": [
        {"code": "en", "name": "English", "is_auto_generated": false},
        {"code": "en", "name": "English (auto-generated)", "is_auto_generated": true}
    ],
    "transcript_text": "Welcome back to the show today we have an incredible guest ...",
    "segment_count": 487,
    "total_duration_seconds": 3847.2,
    "segments": [
        {"text": "Welcome back to the show", "start": 0.0, "duration": 2.1},
        {"text": "today we have an incredible guest", "start": 2.1, "duration": 2.8},
        {"text": "who's been building in the AI space for over a decade", "start": 4.9, "duration": 3.2}
    ],
    "data_source": "youtube_innertube"
}

is_auto_generated: false confirms this podcast has human-written captions — publication-ready quality. segment_count: 487 at total_duration_seconds: 3847 means roughly one caption line every 8 seconds, which is standard density for conversational speech. The transcript_text field contains the full joined text — 8,000+ words ready for repurposing.
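The density figure above is simple arithmetic and can be recomputed for any record as a sanity check. This sketch reuses the field names from the sample output; because the sample's `transcript_text` is truncated, the example substitutes 8,000 placeholder words, which is an assumption made purely for illustration. The helper name `caption_stats` is hypothetical.

```python
def caption_stats(record):
    """Derive density figures from one transcript record."""
    duration = record["total_duration_seconds"]
    words = len(record["transcript_text"].split())
    return {
        "words": words,
        "seconds_per_segment": duration / record["segment_count"],
        "words_per_minute": words / (duration / 60),
    }


# segment_count and duration come from the sample record above;
# transcript_text is a stand-in because the sample text is truncated.
record = {
    "segment_count": 487,
    "total_duration_seconds": 3847.2,
    "transcript_text": " ".join(["word"] * 8000),
}
stats = caption_stats(record)
print(f"{stats['seconds_per_segment']:.1f} s per segment, "
      f"{stats['words_per_minute']:.0f} words per minute")
```

A words-per-minute figure far below conversational pace could hint at a partial caption track, so it may be worth flagging such records for a manual look.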

Common pitfalls

Three issues affect podcast transcript extraction at scale.

Speaker attribution: YouTube captions do not include speaker labels. For interview-format podcasts, you need a separate diarization step or manual annotation to mark who said what. The timestamp segments help align turns if you have speaker-change data from another source.

Auto-caption quality on fast speech: podcast conversations with overlapping speakers, rapid back-and-forth, or heavy jargon produce lower-quality auto-captions. Always check is_auto_generated and consider manual editing for publication-critical content.

Episode discovery: the actor extracts transcripts but does not discover new episodes. Pair it with the YouTube Scraper to monitor channels for new uploads and feed video IDs into the transcript pipeline automatically.

Thirdwatch's actor handles rate limiting with automatic backoff and returns structured error records for videos without captions. The preferManual flag ensures you always get the best available caption quality. For high-volume podcast monitoring pipelines, batch video IDs from channel feeds and process them in groups of 50-100. Track your success rate per channel — some podcasters disable captions entirely, and discovering this early saves wasted API calls downstream.
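Batching and per-channel tracking can be sketched in a few lines. The example assumes you keep your own mapping from channel name to submitted video IDs, and that successful records carry a `video_id` while failed ones carry an `error` key, as in the examples above. The helper names and data are hypothetical.

```python
def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def success_rate_by_channel(submitted, results):
    """Share of submitted videos per channel that returned a transcript.

    submitted: dict of channel name -> list of video IDs you sent.
    results:   actor records; a record counts as a success when it has
               no "error" key (assumption carried over from above).
    """
    ok_ids = {r["video_id"] for r in results if "error" not in r}
    return {
        channel: sum(vid in ok_ids for vid in ids) / len(ids)
        for channel, ids in submitted.items()
        if ids
    }


batch_sizes = [len(batch) for batch in chunked(list(range(120)), 50)]
rates = success_rate_by_channel(
    {"channel_a": ["v1", "v2"], "channel_b": ["v3", "v4"]},
    [{"video_id": "v1"}, {"video_id": "v2"}, {"video_id": "v3"}, {"error": "no captions"}],
)
print(batch_sizes, rates)
```

A channel whose rate stays at zero across several runs is a reasonable candidate to drop from the pipeline or to route to audio transcription instead.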

Frequently asked questions

How long does it take to extract a 2-hour podcast transcript?

Seconds, not hours. The actor reads existing caption data from YouTube's servers rather than transcribing audio, so a 2-hour podcast with 800 caption segments returns in roughly the same time as a 3-minute video. The bottleneck is HTTP round-trips, not audio processing.

Are podcast transcripts accurate enough for direct publishing?

Human-written captions are publication-ready. Auto-generated captions are roughly 85-95 percent accurate for clear speech but need editing for proper nouns, technical terms, and speaker attribution. Check the is_auto_generated field to know which you got.

Can I identify different speakers in the transcript?

YouTube captions do not include speaker labels. For multi-speaker podcasts, you will need a downstream diarization step. The timestamp segments help align speaker turns if you have separate audio diarization data.
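If you do have diarization output, one simple way to align it is to give each caption segment the speaker whose turn overlaps it the most. The sketch below assumes a hypothetical diarization format: a list of turns with `start`, `end`, and `speaker` keys, in seconds. Real diarization tools use their own formats, so treat this as an illustration of the alignment idea, not a ready integration.

```python
def assign_speakers(segments, turns):
    """Label each caption segment with the most-overlapping speaker turn.

    segments: caption segments with "start" and "duration" (as in the output).
    turns:    hypothetical diarization turns with "start", "end", "speaker".
    Segments that overlap no turn get speaker=None.
    """
    labeled = []
    for seg in segments:
        seg_start = seg["start"]
        seg_end = seg_start + seg["duration"]
        best_speaker, best_overlap = None, 0.0
        for turn in turns:
            overlap = min(seg_end, turn["end"]) - max(seg_start, turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


# Illustrative data only.
segments = [
    {"text": "so tell us how it started", "start": 0.0, "duration": 2.1},
    {"text": "what was the first version like", "start": 2.1, "duration": 2.8},
    {"text": "honestly it was a weekend project", "start": 4.9, "duration": 3.2},
]
turns = [
    {"start": 0.0, "end": 4.0, "speaker": "host"},
    {"start": 4.0, "end": 10.0, "speaker": "guest"},
]
labeled = assign_speakers(segments, turns)
for seg in labeled:
    print(f"{seg['speaker']}: {seg['text']}")
```

Segments that straddle a speaker change are assigned to whoever holds most of the segment, so expect some errors around rapid exchanges.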

Does this work for major podcasts published on YouTube, like Joe Rogan or Lex Fridman?

Yes, as long as the video is public and has captions enabled. Most major YouTube podcasters enable either manual or auto-generated captions. The actor returns whichever track is available.

Can I schedule daily extraction for a podcast that uploads weekly?

Yes. Pair the YouTube Transcripts Scraper with the YouTube Scraper to detect new uploads, then feed new video IDs into a scheduled transcript extraction run. Apify's scheduling API supports cron expressions for any cadence.
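The glue between discovery and extraction is mostly deduplication: only video IDs you have not processed before should reach the transcript run. A minimal sketch, assuming your discovery step gives you a list of video IDs and you persist the set of already processed IDs yourself. The function name `new_video_urls` is hypothetical.

```python
def new_video_urls(discovered_ids, seen_ids):
    """Return watch URLs for discovered IDs that are not in seen_ids.

    Duplicates in discovered_ids are dropped while keeping first-seen order.
    """
    fresh = [vid for vid in dict.fromkeys(discovered_ids) if vid not in seen_ids]
    return [f"https://www.youtube.com/watch?v={vid}" for vid in fresh]


# Illustrative IDs only.
seen = {"ep_001", "ep_002"}
urls = new_video_urls(["ep_002", "ep_003", "ep_003", "ep_004"], seen)
print(urls)
```

The resulting list can be passed as `videoUrls` in the run input. When nothing new has been published the list is empty and the run can be skipped for that day.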

Scrape YouTube Transcripts for Content Research (2026 Guide)
Build a Video Content Dataset from YouTube Captions (2026)
Find YouTube Content Gaps from Transcript Analysis (2026)
