Scrape IMDb Movies and TV Ratings for Data Research (2026)

Extract IMDb movie and TV show data — ratings, votes, cast, genres, plot — using Thirdwatch's IMDb Scraper. Python recipes for analysts and developers.

May 26, 2026 · 6 min read · 1,268 words

Thirdwatch's IMDb Scraper extracts structured movie and TV show data from IMDb: title, year, rating, votes, genres, director, cast, plot, runtime, poster, and content rating. Search by name or pass IMDb URLs directly. No IMDb login or IMDb API key is required; runs are triggered with an Apify API token. Built for data analysts, app developers, researchers, and content platforms that need IMDb metadata programmatically.

Why scrape IMDb for movie and TV data research

IMDb is the largest entertainment database on the public web. According to IMDb's own press page, the site indexes over 13 million titles and receives hundreds of millions of monthly visitors. For anyone building entertainment analytics, recommendation engines, or content catalogs, IMDb is the canonical source of ratings, cast, and genre data.

The problem: IMDb retired its affiliate API years ago. The official IMDb Non-Commercial Datasets offer bulk TSV dumps refreshed daily, but they lack search, lack plot summaries, and require downloading multi-gigabyte files to extract a handful of titles. OMDb API offers per-title lookup but caps free usage at 1,000 calls per day and omits fields like cast lists and posters at the free tier.

The job-to-be-done is targeted extraction. A streaming analytics team needs ratings and genres for 500 titles in their catalog. A data science student is building a genre-classification model and needs plot summaries plus ratings for training data. A content aggregator wants fresh poster URLs and cast lists for its database. All three reduce to the same operation: pass a search query or a list of URLs and get structured rows back.

How does this compare to the alternatives?

Three paths to structured IMDb data:

| Approach | Reliability | Setup time | Maintenance | Limitations |
| --- | --- | --- | --- | --- |
| IMDb Non-Commercial Datasets (bulk TSV) | High (official) | 1-2 hours | Manual re-download | No search, no plots, no posters, multi-GB files |
| OMDb API (free tier) | Moderate | 10 minutes | API key management | 1,000 calls/day cap, subset of fields |
| Thirdwatch IMDb Scraper | High | 5 minutes | Thirdwatch maintains | Up to 100 titles per run |

The bulk datasets work for academic research that needs the full catalog. OMDb works for one-off lookups. The IMDb Scraper fills the middle ground: on-demand, structured, no daily cap, and returning fields (poster, cast, plot) that bulk dumps omit.

How to scrape IMDb movies and TV shows in 4 steps

Step 1: How do I set up my Apify API token?

Sign up at apify.com (free tier available, no credit card required). Go to Settings, then Integrations, and copy your personal API token. All examples below assume it is stored in an environment variable:

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"
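
A missing token otherwise surfaces later as a bare KeyError or an HTTP 401, so it can help to check for it up front. The following is a minimal sketch; the helper name get_apify_token is illustrative and not part of any Apify or Thirdwatch library.

```python
import os
from typing import Mapping


def get_apify_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Apify token from the environment, or raise a readable error."""
    token = env.get("APIFY_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "APIFY_TOKEN is not set; export it before running the examples."
        )
    return token
```

Calling get_apify_token() at the top of a script in place of os.environ["APIFY_TOKEN"] gives the same value with a clearer failure message.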

Step 2: How do I search IMDb by movie or TV show name?

Pass your search term in searchQuery and set maxResults to control how many titles come back. The actor searches IMDb and returns the top matches.

import os

import pandas as pd
import requests

ACTOR = "thirdwatch~imdb-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "searchQuery": "inception",
        "maxResults": 10,
    },
    timeout=300,
)
resp.raise_for_status()  # fail loudly on an HTTP error instead of parsing an error body
df = pd.DataFrame(resp.json())
print(f"{len(df)} titles returned")
print(df[["title", "year", "rating", "votes", "genres"]].head())

This returns up to 10 titles matching "inception", each with every field IMDb has data for. The actor resolves each search result to its full title page, so you get complete records rather than search-card summaries.
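
If you run several searches and merge the results, the same title can come back more than once. The sketch below de-duplicates merged records on their url field, which appears in the sample output; the function name dedupe_by_url and the example records are illustrative assumptions.

```python
from typing import Dict, Iterable, List


def dedupe_by_url(records: Iterable[Dict]) -> List[Dict]:
    """Keep the first record seen for each IMDb URL, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        # Ignore a trailing slash so both URL spellings count as one title
        key = (record.get("url") or "").rstrip("/")
        if key and key in seen:
            continue  # already collected from an earlier query
        if key:
            seen.add(key)
        unique.append(record)  # records without a URL are kept as-is
    return unique


# Hypothetical merged results from two overlapping searches
merged = [
    {"title": "Inception", "url": "https://www.imdb.com/title/tt1375666/"},
    {"title": "Inception", "url": "https://www.imdb.com/title/tt1375666"},
    {"title": "Untitled row"},
]
print(len(dedupe_by_url(merged)))  # 2
```

Keying on the URL rather than the title avoids merging different titles that happen to share a name.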

Step 3: How do I scrape specific IMDb URLs directly?

If you already have IMDb title URLs (from a spreadsheet, a partner feed, or your own catalog), pass them in the urls field to skip the search step entirely.

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "urls": [
            {"url": "https://www.imdb.com/title/tt0111161/"},
            {"url": "https://www.imdb.com/title/tt0068646/"},
            {"url": "https://www.imdb.com/title/tt0468569/"},
        ],
        "maxResults": 10,
    },
    timeout=300,
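)
resp.raise_for_status()  # surface HTTP errors before parsing
movies = resp.json()
for m in movies:
    # .get() avoids a KeyError when IMDb has no value for a field
    print(f"{m.get('title')} ({m.get('year')}) - {m.get('rating')}/10 ({m.get('votes')} votes)")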

This fetches The Shawshank Redemption, The Godfather, and The Dark Knight directly. No search ambiguity — you get exactly the titles you asked for.
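
Catalog exports rarely hold clean title URLs; they tend to mix bare IDs, URLs with tracking parameters, and duplicates. The sketch below normalizes such values into the {"url": ...} entries used above and splits them into batches. The helper names are illustrative, and the batch size of 100 is an assumption taken from the "up to 100 titles per run" figure in the comparison table.

```python
import re
from typing import Dict, Iterable, Iterator, List

TITLE_ID = re.compile(r"tt\d{7,}")  # IMDb title IDs: "tt" followed by at least 7 digits
BATCH_SIZE = 100  # assumption: mirrors the per-run limit quoted in the comparison table


def to_url_entries(raw_values: Iterable[str]) -> List[Dict[str, str]]:
    """Turn bare IDs or messy title URLs into de-duplicated {"url": ...} entries."""
    entries, seen = [], set()
    for raw in raw_values:
        match = TITLE_ID.search(raw or "")
        if not match or match.group() in seen:
            continue  # skip values without a title ID, and repeats
        seen.add(match.group())
        entries.append({"url": f"https://www.imdb.com/title/{match.group()}/"})
    return entries


def batches(
    entries: List[Dict[str, str]], size: int = BATCH_SIZE
) -> Iterator[List[Dict[str, str]]]:
    """Yield slices of at most `size` entries, one slice per actor run."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


raw = ["tt0111161", "https://www.imdb.com/title/tt0068646/?ref_=chart", "tt0111161", "n/a"]
print(to_url_entries(raw))
```

Each batch can then be passed as the urls value of one request, with maxResults set to at least the batch length.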

Step 4: How do I analyze ratings and genres across a dataset?

Once you have a DataFrame, standard pandas operations give you genre breakdowns, rating distributions, and vote-weighted rankings.

# Explode genres for per-genre analysis
df["genre_list"] = df["genres"].apply(
    lambda g: g if isinstance(g, list) else []
)
exploded = df.explode("genre_list")

genre_stats = exploded.groupby("genre_list").agg(
    count=("title", "count"),
    avg_rating=("rating", "mean"),
    avg_votes=("votes", "mean"),
).sort_values("avg_rating", ascending=False)

print(genre_stats.head(10))

# Top-rated titles weighted by vote count; the multiplier is capped at 5 (500,000+ votes)
df["weighted_score"] = df["rating"] * df["votes"].apply(
    lambda v: min(v / 100000, 5) if pd.notna(v) else 0
)
print(df.nlargest(10, "weighted_score")[["title", "year", "rating", "votes"]])

This gives you genre-level averages and a vote-weighted ranking. The weight is votes / 100,000 capped at 5, so a title needs 500,000 votes to earn full weight and titles with very few votes score near zero. That filters out obscure entries with artificially high ratings.

Sample output

A single record from the dataset looks like this:

{
    "title": "The Shawshank Redemption",
    "year": "1994",
    "rating": 9.3,
    "votes": 2800000,
    "genres": ["Drama"],
    "director": "Frank Darabont",
    "cast": ["Tim Robbins", "Morgan Freeman", "Bob Gunton"],
    "plot": "Over the course of several years, two convicts form a friendship...",
    "runtime": "142 min",
    "poster": "https://m.media-amazon.com/images/M/MV5B...",
    "content_rating": "R",
    "url": "https://www.imdb.com/title/tt0111161/"
}

rating is the familiar 1-10 IMDb score. votes is the raw vote count; use it to weight ratings in aggregations. genres is an array, so a multi-genre title returns ["Action", "Sci-Fi"] rather than the single string "Action, Sci-Fi". cast is an ordered array of main cast members. poster is a direct CDN URL to the title's poster image. content_rating follows the MPA (formerly MPAA) system (G, PG, PG-13, R, NC-17) for US releases. Note that year and runtime arrive as strings ("1994", "142 min"), so convert them before doing arithmetic.
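
Because some values in the sample record are strings, it can be convenient to convert each record into typed fields once, at the edge of your pipeline. This is a minimal sketch; the Title dataclass and parse_title helper are illustrative names, and the runtime parsing assumes the "<minutes> min" format shown in the sample.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Title:
    title: str
    year: Optional[int]
    rating: Optional[float]
    votes: Optional[int]
    runtime_min: Optional[int]
    genres: List[str] = field(default_factory=list)


def _first_int(value: Any) -> Optional[int]:
    """Return the first run of digits in `value` as an int, or None."""
    match = re.search(r"\d+", str(value)) if value is not None else None
    return int(match.group()) if match else None


def parse_title(record: Dict[str, Any]) -> Title:
    """Convert one raw record into typed fields, leaving missing values as None."""
    rating = record.get("rating")
    votes = record.get("votes")
    return Title(
        title=record.get("title") or "",
        year=_first_int(record.get("year")),
        rating=float(rating) if rating not in (None, "") else None,
        votes=int(votes) if votes not in (None, "") else None,
        # Assumes "<minutes> min" as in the sample; other formats need their own parser
        runtime_min=_first_int(record.get("runtime")),
        genres=list(record.get("genres") or []),
    )


record = {
    "title": "The Shawshank Redemption",
    "year": "1994",
    "rating": 9.3,
    "votes": 2800000,
    "genres": ["Drama"],
    "runtime": "142 min",
}
print(parse_title(record))
```

Missing keys come back as None or an empty list instead of raising, which keeps later aggregation code free of per-field checks.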

Common pitfalls

Three issues come up in production IMDb pipelines.

Search ambiguity: searching "the office" returns both the US and UK versions plus related titles. If you need a specific version, pass the IMDb URL directly via the urls field to avoid guesswork.

Missing fields on old titles: pre-1960s films and obscure titles may have empty cast, runtime, or content_rating fields because IMDb itself has incomplete records. Build your downstream pipeline to handle nulls gracefully.

Vote-count bias: popular blockbusters dominate vote counts by orders of magnitude. A title with 2.8 million votes and a 9.3 rating is statistically much more reliable than a title with 200 votes and a 9.5. Apply a Bayesian average or minimum-vote threshold before ranking.
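
One way to apply a Bayesian average is to shrink each rating toward the mean rating of the dataset, with the amount of shrinkage controlled by a vote threshold. The sketch below uses a threshold of 25,000 votes, which is an arbitrary assumption to tune for your data; the function names and sample rows are made up for illustration. Rows with missing ratings or votes are skipped, which also covers the null-handling pitfall.

```python
from typing import Dict, List


def bayesian_average(rating: float, votes: int, prior_mean: float, min_votes: int) -> float:
    """Shrink `rating` toward `prior_mean`; fewer votes means more shrinkage."""
    return (votes * rating + min_votes * prior_mean) / (votes + min_votes)


def rank_titles(records: List[Dict], min_votes: int = 25_000) -> List[Dict]:
    """Sort records by Bayesian average, skipping rows with no rating or votes."""
    rated = [r for r in records if r.get("rating") is not None and r.get("votes")]
    if not rated:
        return []
    # Prior: the plain mean rating of the rows that have data
    prior_mean = sum(r["rating"] for r in rated) / len(rated)
    scored = [
        {**r, "bayes_rating": bayesian_average(r["rating"], r["votes"], prior_mean, min_votes)}
        for r in rated
    ]
    return sorted(scored, key=lambda r: r["bayes_rating"], reverse=True)


# Made-up rows; the first two reuse the numbers from the vote-count bias example
sample = [
    {"title": "Widely rated", "rating": 9.3, "votes": 2_800_000},
    {"title": "Barely rated", "rating": 9.5, "votes": 200},
    {"title": "Mid-tier", "rating": 7.0, "votes": 500_000},
    {"title": "No data", "rating": None, "votes": None},
]
for row in rank_titles(sample):
    print(row["title"], round(row["bayes_rating"], 2))
```

With these inputs the 200-vote title is pulled most of the way toward the dataset mean and drops below the title with 2.8 million votes, even though its raw rating is higher.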

Thirdwatch's actor handles the page-fetching and parsing so you get clean structured JSON without maintaining your own scraper against IMDb's markup changes.

Integration with content and analytics pipelines

IMDb data plugs directly into several high-value workflows. Streaming analytics teams join IMDb ratings against their internal viewership data to identify titles where audience satisfaction (high rating, high votes) diverges from consumption volume — a signal that content is under-promoted or mis-categorized. Recommendation engine developers use genres, cast overlap, and director information as features in collaborative filtering models; the structured arrays from this actor feed directly into feature engineering without manual parsing.
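
As an illustration of feature engineering on the structured arrays, the sketch below computes pairwise overlap features between two records using Jaccard similarity. The feature names, helper functions, and placeholder records are assumptions for this example, not part of the actor's output.

```python
from typing import Dict, Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Return |A intersect B| / |A union B|, or 0.0 when both inputs are empty."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def title_similarity(x: Dict, y: Dict) -> Dict[str, float]:
    """Build simple pairwise features from genres, cast, and director."""
    return {
        "genre_overlap": jaccard(x.get("genres") or [], y.get("genres") or []),
        "cast_overlap": jaccard(x.get("cast") or [], y.get("cast") or []),
        "same_director": float(
            bool(x.get("director")) and x.get("director") == y.get("director")
        ),
    }


# Placeholder records with invented names
title_a = {
    "genres": ["Drama", "Crime"],
    "cast": ["Actor 1", "Actor 2", "Actor 3"],
    "director": "Director X",
}
title_b = {
    "genres": ["Crime", "Thriller"],
    "cast": ["Actor 2", "Actor 4"],
    "director": "Director X",
}
print(title_similarity(title_a, title_b))
```

Jaccard similarity is used here because it is bounded between 0 and 1 and needs no tuning; a production model would likely weight or embed these fields differently.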

Content aggregators building public-facing movie databases use the poster URL, plot summary, and content rating fields to populate frontend cards without licensing IMDb's commercial data feed. Academic researchers studying genre evolution, rating inflation, or cast network effects pull targeted slices of IMDb data through this actor rather than downloading the full 7 GB bulk dataset. For YouTube content strategy, pair IMDb trending-title data with YouTube search data to identify which movies and shows are generating the most search interest on video platforms — a useful signal for review channels and commentary creators planning their content calendar.

Frequently asked questions

Does the IMDb Scraper require an API key or login?

No IMDb credentials or third-party API key are needed. The actor scrapes publicly available IMDb pages, so the only credential involved is the Apify API token used to trigger runs.

Can I scrape every movie on IMDb with this actor?

No. This is an on-demand scraper — you provide search queries or IMDb URLs and the actor returns those titles. It is not a bulk-dump tool for the entire IMDb catalog.

How fresh are IMDb ratings returned by the actor?

Each run fetches the current rating and vote count at run time from the live IMDb page. Ratings reflect whatever IMDb shows at the moment the actor processes the title.
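
Since every run reflects the live page, comparing two runs of the same URLs shows how ratings and vote counts moved in between. The sketch below assumes you have saved the records from an earlier run; the function name rating_changes and the numbers in the example are invented for illustration.

```python
from typing import Dict, List


def rating_changes(previous: List[Dict], current: List[Dict]) -> List[Dict]:
    """Compare two runs keyed by IMDb URL and report rating and vote deltas."""
    before = {r["url"]: r for r in previous if r.get("url")}
    changes = []
    for row in current:
        old = before.get(row.get("url"))
        if not old or None in (old.get("rating"), row.get("rating")):
            continue  # title is new, or a rating is missing on one side
        changes.append({
            "url": row["url"],
            "rating_delta": round(row["rating"] - old["rating"], 2),
            "votes_delta": (row.get("votes") or 0) - (old.get("votes") or 0),
        })
    return changes


# Invented snapshots of one title from two different runs
last_week = [{"url": "https://www.imdb.com/title/tt0111161/", "rating": 9.3, "votes": 2_800_000}]
today = [{"url": "https://www.imdb.com/title/tt0111161/", "rating": 9.2, "votes": 2_810_000}]
print(rating_changes(last_week, today))
```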

What fields are returned for each title?

Title, year, IMDb rating, vote count, genres, director, cast list, plot summary, runtime, poster URL, content rating, and the IMDb URL. Some fields may be empty for obscure or very old titles.

Can I use IMDb data commercially?

You are responsible for complying with IMDb's terms of use. Typical uses — internal analytics, recommendation testing, academic research — are fine. Redistribution of full datasets is restricted by IMDb.
