Build a Movie Recommendation Dataset With IMDb Data (2026)

Create a recommendation engine training dataset from IMDb with ratings, genres, cast, and plot summaries. Full Python pipeline using Thirdwatch IMDb Scraper.

May 26, 2026 · 6 min read · 1,334 words

Thirdwatch's IMDb Scraper returns structured movie and TV metadata — genres, cast, director, plot, rating, votes, runtime, content rating — that feeds directly into recommendation model training pipelines. Search by title or pass IMDb URLs. No login, no API key. Built for ML engineers, data scientists, and app developers building content-based or hybrid recommendation systems.

Why scrape IMDb for a recommendation dataset

Recommendation engines need structured metadata to compute similarity. A content-based recommender uses genre overlap, shared cast, director match, and plot-keyword similarity to suggest titles. A collaborative-filtering model uses rating distributions. Both need clean, per-title feature vectors.

According to a 2024 ACM survey on recommendation systems, content-based filtering using metadata features (genre, cast, keywords) remains the dominant cold-start strategy even as deep-learning approaches improve collaborative filtering. The reason is practical: you can recommend a brand-new title the moment it has metadata, without waiting for user interaction data.

IMDb is the richest public source for these features. The official IMDb Non-Commercial Datasets provide ratings and basic metadata in bulk TSV form, but they omit plot summaries, cast lists, poster URLs, and content ratings — exactly the fields that power content-based models. The Thirdwatch IMDb Scraper fills this gap by returning all twelve fields per title on demand.

How does this compare to the alternatives?

| Approach | Fields for ML | Setup time | Refresh frequency | Limitations |
| --- | --- | --- | --- | --- |
| IMDb Non-Commercial Datasets | Rating, votes, genres, runtime | 1-2 hours | Daily bulk download | No plot, no cast, no poster, no content rating |
| OMDb API (free) | Most fields | 10 minutes | On demand | 1,000 calls/day, missing cast at free tier |
| MovieLens dataset | User ratings, titles, genres | 30 minutes | Static snapshots | No plot, cast, or poster; research-use only |
| Thirdwatch IMDb Scraper | All 12 fields | 5 minutes | On demand | 100 titles per run |

For ML training data, the critical gap in bulk datasets is the absence of text fields (plot, cast names) and image URLs (poster). The IMDb Scraper returns these alongside the standard numeric features, so you can build content-based and hybrid models from a single data source.

How to build a recommendation dataset in 4 steps

Step 1: How do I collect titles across multiple genres?

Start by running genre-specific searches to build a balanced dataset. Recommendation models perform poorly when training data is skewed toward one genre.

import os
import time

import pandas as pd
import requests

ACTOR = "thirdwatch~imdb-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

GENRES_QUERIES = [
    "best action movies",
    "best drama movies",
    "best comedy movies",
    "best horror movies",
    "best sci-fi movies",
    "best thriller movies",
    "best romance movies",
    "best documentary films",
]

all_titles = []
for query in GENRES_QUERIES:
    resp = requests.post(
        f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
        params={"token": TOKEN},
        json={
            "searchQuery": query,
            "maxResults": 15,
        },
        timeout=300,
    )
    resp.raise_for_status()  # fail loudly instead of appending an error payload
    all_titles.extend(resp.json())
    time.sleep(2)

df = pd.DataFrame(all_titles)
df = df.drop_duplicates(subset=["url"])
print(f"{len(df)} unique titles across {len(GENRES_QUERIES)} genre queries")

Eight queries at 15 results each return at most 120 records, which typically leaves 80-100 unique titles after dedup. That is a solid seed dataset for prototyping a recommender.
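Before moving on, it can help to check how balanced the seed set actually is. The sketch below counts genre occurrences over the raw records, assuming each record carries a `genres` list as in the sample output further down. `genre_counts` and `imbalance_ratio` are illustrative helpers, not part of the scraper.

```python
from collections import Counter


def genre_counts(records):
    """Count how many titles carry each genre; records without a genre list are skipped."""
    counts = Counter()
    for rec in records:
        genres = rec.get("genres")
        if isinstance(genres, list):
            counts.update(set(genres))  # set() guards against a genre repeated within one title
    return counts


def imbalance_ratio(counts):
    """Most common genre count divided by least common genre count (1.0 means balanced)."""
    if not counts:
        return 0.0
    return max(counts.values()) / min(counts.values())


# Hypothetical records standing in for the scraped titles
sample = [
    {"genres": ["Action", "Sci-Fi"]},
    {"genres": ["Drama"]},
    {"genres": ["Drama", "Action"]},
    {"genres": None},
]
counts = genre_counts(sample)
print(counts.most_common())
print(imbalance_ratio(counts))
```

In the pipeline above, `genre_counts(all_titles)` would give the same breakdown for the real data. A ratio far above 1.0 suggests adding more queries for the under-represented genres.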

Step 2: How do I enrich with specific IMDb URLs from my catalog?

If your app already has a title catalog with IMDb IDs, pass them directly via urls to fill in the metadata columns.

catalog_urls = [
    {"url": "https://www.imdb.com/title/tt0111161/"},
    {"url": "https://www.imdb.com/title/tt0068646/"},
    {"url": "https://www.imdb.com/title/tt0071562/"},
    {"url": "https://www.imdb.com/title/tt0468569/"},
    {"url": "https://www.imdb.com/title/tt0050083/"},
    {"url": "https://www.imdb.com/title/tt0108052/"},
]

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "urls": catalog_urls,
        "maxResults": 50,
    },
    timeout=300,
)
resp.raise_for_status()  # surface HTTP errors before parsing the body
catalog_df = pd.DataFrame(resp.json())
df = pd.concat([df, catalog_df]).drop_duplicates(subset=["url"]).reset_index(drop=True)
print(f"Dataset now has {len(df)} titles with full metadata")
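If your catalog stores bare IMDb IDs rather than full URLs, a small helper can build the `urls` payload. This is a sketch under two assumptions: IDs look like `tt` followed by at least seven digits, and title pages follow the URL pattern used in the list above. `ids_to_url_entries` is a hypothetical name.

```python
import re

# Assumption: title IDs are "tt" plus seven or more digits
IMDB_ID = re.compile(r"tt\d{7,}")


def ids_to_url_entries(imdb_ids):
    """Turn IMDb title IDs into {"url": ...} entries, skipping invalid and duplicate IDs."""
    entries = []
    seen = set()
    for raw in imdb_ids:
        imdb_id = raw.strip()
        if not IMDB_ID.fullmatch(imdb_id) or imdb_id in seen:
            continue
        seen.add(imdb_id)
        entries.append({"url": f"https://www.imdb.com/title/{imdb_id}/"})
    return entries


print(ids_to_url_entries(["tt0111161", " tt0068646 ", "tt0111161", "nm0000001"]))
```

The result has the same shape as `catalog_urls`, so it can be passed as the `urls` value in the request body.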

Step 3: How do I build feature vectors for content-based similarity?

Genre one-hot encoding plus TF-IDF on plot text gives you a basic but effective similarity matrix.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One-hot encode genres
all_genres = set()
for g in df["genres"]:
    if isinstance(g, list):
        all_genres.update(g)

for genre in sorted(all_genres):
    df[f"genre_{genre}"] = df["genres"].apply(
        lambda g: 1 if isinstance(g, list) and genre in g else 0
    )

# TF-IDF on plot summaries
df["plot_clean"] = df["plot"].fillna("")
tfidf = TfidfVectorizer(max_features=500, stop_words="english")
plot_matrix = tfidf.fit_transform(df["plot_clean"])

# Combine genre features + plot TF-IDF
import numpy as np
from scipy.sparse import hstack, csr_matrix

genre_cols = [c for c in df.columns if c.startswith("genre_")]
genre_matrix = csr_matrix(df[genre_cols].values)
combined = hstack([genre_matrix * 2.0, plot_matrix])  # weight genres higher

similarity = cosine_similarity(combined)
print(f"Similarity matrix shape: {similarity.shape}")

The genre weight multiplier (2.0) scales the genre columns before cosine similarity is computed, so shared genres count for more than plot-keyword overlap. Tune this based on your use case.
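To see what the multiplier does, consider a toy case in plain Python. The vectors below are made up for illustration: title B shares A's genre but not its plot terms, while title C shares A's plot terms but not its genre. The plot parts are unit length, which matches scikit-learn's default L2 normalization of TF-IDF rows.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combine(genre_vec, plot_vec, genre_weight):
    """Concatenate weighted genre features with plot features, like the hstack call above."""
    return [g * genre_weight for g in genre_vec] + list(plot_vec)


for weight in (1.0, 2.0):
    a = combine([1, 0], [1, 0], weight)
    b = combine([1, 0], [0, 1], weight)  # same genre as A, different plot
    c = combine([0, 1], [1, 0], weight)  # different genre, same plot as A
    print(weight, round(cosine(a, b), 3), round(cosine(a, c), 3))
```

At weight 1.0 the genre match and the plot match score the same (0.5 each). At weight 2.0 the genre match rises to 0.8 and the plot match drops to 0.2, which is the trade-off the multiplier controls.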

Step 4: How do I generate recommendations from the similarity matrix?

Given a seed title, pull the top-N most similar titles from the matrix.

def recommend(title_name, n=5):
    # Look up the seed by row position so this works even if df.index is not 0..N-1
    mask = df["title"].str.contains(title_name, case=False, na=False, regex=False)
    matches = mask.to_numpy().nonzero()[0]
    if len(matches) == 0:
        return []  # seed not in dataset; an empty list keeps the caller's loop safe
    pos = matches[0]
    # Rank every title by similarity to the seed, then drop the seed itself
    ranked = [i for i in similarity[pos].argsort()[::-1] if i != pos][:n]
    results = []
    for i in ranked:
        row = df.iloc[i]
        results.append({
            "title": row["title"],
            "year": row["year"],
            "rating": row["rating"],
            "genres": row["genres"],
            "similarity": round(float(similarity[pos, i]), 3),
        })
    return results

recs = recommend("Inception", n=5)
if not recs:
    print("Seed title not found in dataset")
for rec in recs:
    print(f"  {rec['title']} ({rec['year']}) - {rec['rating']}/10 "
          f"[similarity: {rec['similarity']}]")

This is a minimal content-based recommender. For production, add collaborative signals (user-item interactions), cast overlap scoring, and director-match bonuses.
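As one possible shape for the cast and director signals, the sketch below computes a small additive bonus from two raw records. The weights (0.2 and 0.1) and the function names are assumptions chosen for illustration, not values from the scraper or a tuned model.

```python
def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 when both sets are empty."""
    a, b = set(a or []), set(b or [])
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def metadata_bonus(rec_a, rec_b, cast_weight=0.2, director_weight=0.1):
    """Bonus from shared cast members plus a flat bonus for a shared director."""
    cast_score = jaccard(rec_a.get("cast"), rec_b.get("cast"))
    director = rec_a.get("director")
    same_director = bool(director) and director == rec_b.get("director")
    return cast_weight * cast_score + (director_weight if same_director else 0.0)


# Trimmed versions of the two sample records shown in this article
inception = {
    "director": "Christopher Nolan",
    "cast": ["Leonardo DiCaprio", "Joseph Gordon-Levitt", "Elliot Page"],
}
interstellar = {
    "director": "Christopher Nolan",
    "cast": ["Matthew McConaughey", "Anne Hathaway", "Jessica Chastain"],
}
print(metadata_bonus(inception, interstellar))
```

The bonus could be added to the cosine score for each candidate pair before sorting, so that titles with a shared director or overlapping cast move up the ranking.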

Sample output

Two records from the dataset showing the feature richness:

[
    {
        "title": "Inception",
        "year": "2010",
        "rating": 8.8,
        "votes": 2400000,
        "genres": ["Action", "Adventure", "Sci-Fi"],
        "director": "Christopher Nolan",
        "cast": ["Leonardo DiCaprio", "Joseph Gordon-Levitt", "Elliot Page"],
        "plot": "A thief who steals corporate secrets through the use of dream-sharing technology...",
        "runtime": 148,
        "poster_url": "https://m.media-amazon.com/images/M/MV5B...",
        "content_rating": "PG-13",
        "url": "https://www.imdb.com/title/tt1375666/"
    },
    {
        "title": "Interstellar",
        "year": "2014",
        "rating": 8.7,
        "votes": 2100000,
        "genres": ["Adventure", "Drama", "Sci-Fi"],
        "director": "Christopher Nolan",
        "cast": ["Matthew McConaughey", "Anne Hathaway", "Jessica Chastain"],
        "plot": "When Earth becomes uninhabitable in the future, a farmer and ex-NASA pilot...",
        "runtime": 169,
        "poster_url": "https://m.media-amazon.com/images/M/MV5B...",
        "content_rating": "PG-13",
        "url": "https://www.imdb.com/title/tt0816692/"
    }
]

These two titles share a director, two genres, similar ratings, and thematically related plots — a content-based recommender would correctly surface one when the other is the seed. The cast array enables cast-overlap scoring. The plot text feeds TF-IDF or embedding models. The content_rating field can serve as a hard filter (exclude R-rated titles for a family app).

Common pitfalls

Three issues trip up recommendation dataset builders.

Genre imbalance: if you only scrape "best movies" queries, your dataset skews toward Drama and Thriller. Run genre-specific queries and verify the genre distribution before training. A 10:1 Drama-to-Horror ratio will make your model poor at recommending horror.

Missing plot text: obscure titles and some older films have empty or very short plot summaries on IMDb. An empty plot produces an all-zero TF-IDF row, so in the combined matrix those titles are compared on genre alone, and two unrelated titles with the same genres can score as near-identical. Filter to titles with plot length above 20 words before computing similarity.

Stale ratings: IMDb ratings shift over time, especially for recent releases. A title scraped at 7.2 on release week might stabilize at 6.8 after a month. For popularity-weighted models, schedule periodic re-scrapes of your catalog.
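The plot-length filter can be sketched on the raw records as follows. The 20-word threshold comes from the guidance above; `has_usable_plot` is a hypothetical helper name.

```python
def has_usable_plot(record, min_words=20):
    """True when the record has a plot string longer than min_words whitespace-separated words."""
    plot = record.get("plot")
    return isinstance(plot, str) and len(plot.split()) > min_words


# Hypothetical records: one long plot, one short plot, one missing plot
records = [
    {"title": "Long plot", "plot": " ".join(["word"] * 25)},
    {"title": "Short plot", "plot": "A heist goes wrong."},
    {"title": "No plot", "plot": None},
]
usable = [r for r in records if has_usable_plot(r)]
print([r["title"] for r in usable])
```

The pandas equivalent is `df = df[df["plot"].fillna("").str.split().str.len() > 20].reset_index(drop=True)`, applied before building the TF-IDF matrix so that row positions in the similarity matrix still line up with the DataFrame.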

A fourth issue is content rating filtering. The content_rating field (PG-13, R, TV-MA, etc.) is essential for family-safe recommendation apps but missing from bulk IMDb datasets. If your app serves minors or families, use this field as a hard pre-filter before similarity computation rather than a post-filter, which avoids wasting compute on titles that will be excluded anyway. A fifth consideration: poster_url can be null for very obscure titles or titles in pre-production. If your UI requires an image, implement a fallback placeholder and track the null-poster rate per genre to understand coverage gaps.
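Both checks can be sketched on the raw records before any similarity work. The set of ratings treated as family-safe below is an assumption and a product decision; the scraper does not define it. Both function names are illustrative.

```python
from collections import defaultdict

# Assumption: this allow-list is an example, adjust it to your own policy
FAMILY_SAFE = {"G", "PG", "PG-13", "TV-Y", "TV-Y7", "TV-G", "TV-PG"}


def prefilter_by_content_rating(records, allowed=FAMILY_SAFE):
    """Keep records whose content_rating is explicitly allowed; unrated titles are dropped."""
    return [r for r in records if r.get("content_rating") in allowed]


def null_poster_rate_by_genre(records):
    """Fraction of titles per genre that have no poster_url."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for r in records:
        for genre in r.get("genres") or []:
            totals[genre] += 1
            if not r.get("poster_url"):
                missing[genre] += 1
    return {genre: missing[genre] / totals[genre] for genre in totals}


# Hypothetical catalog entries
catalog = [
    {"title": "A", "content_rating": "PG-13", "genres": ["Sci-Fi"],
     "poster_url": "https://example.com/a.jpg"},
    {"title": "B", "content_rating": "R", "genres": ["Horror"], "poster_url": None},
    {"title": "C", "content_rating": None, "genres": ["Horror", "Sci-Fi"], "poster_url": None},
]
print([r["title"] for r in prefilter_by_content_rating(catalog)])
print(null_poster_rate_by_genre(catalog))
```

Dropping unrated titles is the conservative choice for a family app. If that removes too much of the catalog, treat a missing rating as a separate case and review those titles manually.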

Thirdwatch's actor returns all twelve fields in clean JSON, so your pipeline reads directly into pandas without custom parsing or HTML cleanup.

Frequently asked questions

What fields matter most for a content-based recommendation model?

Genres, cast, director, plot, and content_rating carry the strongest signal for content-based filtering. Rating and votes are useful for popularity weighting but not for similarity — two 7.5-rated films can be completely different.
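For the popularity-weighting side, one common option is a Bayesian-average style weighted rating, which pulls titles with few votes toward a catalog-wide mean. The sketch below is an assumption about how you might use `rating` and `votes`; the `min_votes` threshold and the mean of 6.5 are placeholder values, not figures from IMDb or the scraper.

```python
def weighted_rating(rating, votes, mean_rating, min_votes=1000):
    """Blend a title's rating with the catalog mean; more votes means more trust in the title's own rating."""
    total = votes + min_votes
    if total <= 0:
        return mean_rating
    return (votes / total) * rating + (min_votes / total) * mean_rating


# Hypothetical titles: a high rating on few votes vs. a lower rating on many votes
titles = [
    {"title": "Niche favorite", "rating": 9.1, "votes": 200},
    {"title": "Crowd pleaser", "rating": 8.4, "votes": 500000},
]
CATALOG_MEAN = 6.5  # placeholder; in practice compute this from your own dataset
for t in titles:
    t["score"] = weighted_rating(t["rating"], t["votes"], CATALOG_MEAN)

for t in sorted(titles, key=lambda t: t["score"], reverse=True):
    print(t["title"], round(t["score"], 2))
```

A score like this is best used to re-rank or break ties among titles that are already similar, rather than as an input to the similarity computation itself.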

How many titles can I scrape per run?

Up to 100 titles per run via the maxResults parameter. For larger datasets, batch your queries across multiple runs and deduplicate on IMDb URL.
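A batching loop for larger catalogs might look like the sketch below. `run_actor` is a hypothetical callable that wraps the POST request from Step 2 and returns a list of records; it is passed in so that the batching and dedup logic can be tried without network access.

```python
def chunked(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_in_batches(url_entries, run_actor, batch_size=100):
    """Run the actor once per batch and keep the first record seen for each IMDb URL."""
    by_url = {}
    for batch in chunked(url_entries, batch_size):
        for record in run_actor(batch):
            url = record.get("url")
            if url and url not in by_url:
                by_url[url] = record
    return list(by_url.values())


# Stub standing in for the real request; it echoes each URL back as a record
def fake_run_actor(batch):
    return [{"url": entry["url"], "title": "placeholder"} for entry in batch]


entries = [{"url": f"https://www.imdb.com/title/tt{i:07d}/"} for i in range(250)]
records = collect_in_batches(entries, fake_run_actor)
print(len(records))
```

With the real actor, `run_actor` would post `{"urls": batch, "maxResults": len(batch)}` and return `resp.json()`. Adding a short pause between runs, as in Step 1, keeps the request rate modest.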

Does the actor return plot summaries long enough for NLP embedding?

Yes. Plot summaries typically run 50-200 words — enough for TF-IDF or sentence-transformer embeddings. Very obscure titles may have shorter or missing plots.

Can I combine this with other data sources for hybrid recommendations?

Absolutely. Pair IMDb metadata with YouTube trailer engagement data or Reddit sentiment for a hybrid signal. Thirdwatch offers scrapers for both platforms.
