
Build an India Entertainment Database with BookMyShow Data

Create a structured database of movies, concerts, comedy shows, and events across India using BookMyShow data. Scheduled scraping, dedup, and enrichment.

May 26, 2026 · 6 min read · 1,350 words

Thirdwatch's BookMyShow Scraper pulls movies, concerts, comedy shows, sports events, and workshops from BookMyShow across 600+ Indian cities into structured JSON. Scheduled daily, it builds a living database of Indian entertainment with titles, cast, genres, languages, venues, prices, and poster URLs. Built for data teams, aggregator sites, and researchers who need a comprehensive, always-current view of what is showing and what is happening across India.

Why build an entertainment database from BookMyShow

India's entertainment market generates over $27 billion annually according to FICCI's Media and Entertainment report, spanning cinema, live music, comedy, sports, and experiential events. BookMyShow is the dominant platform for discovering and booking these experiences across 600+ Indian cities. Yet no public API exists, and entertainment data in India is fragmented across regional distributors, production houses, and venue operators with no single structured feed.

According to the Internet and Mobile Association of India (IAMAI), online ticketing now accounts for over 70% of movie ticket sales in metro cities. The use cases for a unified entertainment database are concrete: aggregator sites that combine BookMyShow listings with IMDb ratings and Google Maps venue data; market researchers analyzing which genres and languages dominate across metros versus tier-2 cities; advertising teams targeting audiences by event category and city; travel platforms recommending local entertainment to visitors; and recommendation engines that need fresh catalogues of what is available now. All of these need the same foundation: a clean, deduplicated, regularly refreshed database of every movie and event BookMyShow lists, with structured fields for downstream filtering and joining.

How does this compare to the alternatives?

| Approach | Data freshness | Coverage | Setup time | Maintenance |
| --- | --- | --- | --- | --- |
| Manual BookMyShow browsing | Stale within hours | One city at a time | N/A | Continuous manual effort |
| Paytm Insider / District API | Their catalogue only | Partial overlap with BookMyShow | Partnership required | Vendor dependency |
| Thirdwatch BookMyShow Scraper | Real-time per run | Full BookMyShow catalogue, 600+ cities | 5 minutes | Thirdwatch maintains |

No other structured BookMyShow data source exists on the market. Paytm Insider and District cover their own events but miss the majority of BookMyShow's cinema and live-event catalogue. Building a scraper in-house requires handling BookMyShow's request protections, which change periodically. The Thirdwatch actor abstracts this entirely, delivering clean JSON you can pipe directly into a database.

How to build an entertainment database in 4 steps

Step 1: How do I set up scheduled scraping across multiple cities?

Create an Apify schedule that runs the BookMyShow Scraper daily across your target cities. This example covers India's top 8 metros.

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"

curl -X POST "https://api.apify.com/v2/schedules?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "bookmyshow-daily-8-metros",
    "cronExpression": "0 6 * * *",
    "timezone": "Asia/Kolkata",
    "isEnabled": true,
    "actions": [{
      "type": "RUN_ACTOR",
      "actorId": "thirdwatch~bookmyshow-scraper",
      "runInput": {
        "queries": ["Mumbai", "Delhi", "Bengaluru", "Chennai", "Hyderabad", "Pune", "Kolkata", "Ahmedabad"],
        "mode": "auto",
        "includeDetails": true,
        "maxResults": 500
      }
    }]
  }'

This triggers at 6 AM IST daily, pulling up to 500 movies and events across 8 cities with full detail enrichment. The mode: "auto" setting captures both movies and events in a single run.

Step 2: How do I fetch and deduplicate results into a database?

Use the event_code field as a stable unique identifier. The same movie or event has the same event_code across all cities.

import os, requests, sqlite3, json

TOKEN = os.environ["APIFY_TOKEN"]
ACTOR = "thirdwatch~bookmyshow-scraper"

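# Fetch the latest run's dataset
resp = requests.get(
    f"https://api.apify.com/v2/acts/{ACTOR}/runs/last/dataset/items",
    params={"token": TOKEN, "limit": 5000},
    timeout=60,
)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
items = resp.json()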

conn = sqlite3.connect("entertainment.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS items (
        event_code TEXT PRIMARY KEY,
        type TEXT,
        title TEXT,
        slug TEXT,
        url TEXT,
        languages TEXT,
        genres TEXT,
        rating TEXT,
        release_date TEXT,
        runtime_minutes INTEGER,
        synopsis TEXT,
        director TEXT,
        cast_json TEXT,
        poster_url TEXT,
        category TEXT,
        venue_name TEXT,
        event_date TEXT,
        event_time TEXT,
        duration TEXT,
        price_starting_from TEXT,
        tags TEXT,
        first_seen DATE DEFAULT CURRENT_DATE,
        last_seen DATE DEFAULT CURRENT_DATE
    )
""")

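for item in items:
    if not item.get("event_code"):
        continue  # skip records without a usable key (SQLite allows NULL in a TEXT PRIMARY KEY)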
    conn.execute("""
        INSERT INTO items (event_code, type, title, slug, url, languages, genres,
            rating, release_date, runtime_minutes, synopsis, director, cast_json,
            poster_url, category, venue_name, event_date, event_time, duration,
            price_starting_from, tags)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(event_code) DO UPDATE SET last_seen = CURRENT_DATE
    """, (
        item.get("event_code"), item.get("type"), item.get("title"),
        item.get("slug"), item.get("url"),
        json.dumps(item.get("languages", [])), json.dumps(item.get("genres", [])),
        item.get("rating"), item.get("release_date"), item.get("runtime_minutes"),
        item.get("synopsis"), item.get("director"),
        json.dumps(item.get("cast", [])), item.get("poster_url"),
        json.dumps(item.get("category", [])), item.get("venue_name"),
        item.get("event_date"), item.get("event_time"), item.get("duration"),
        item.get("price_starting_from"), json.dumps(item.get("tags", [])),
    ))

conn.commit()
print(f"Upserted {len(items)} items. Total in DB: {conn.execute('SELECT COUNT(*) FROM items').fetchone()[0]}")

New records are inserted as-is. For an event_code that already exists, the ON CONFLICT clause updates only last_seen, so the other columns keep the values from the first sighting. Over weeks, first_seen and last_seen build a timeline of how long each movie or event stays listed.
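If detail fields that were empty at first sight should be backfilled by later runs, one option is to extend the upsert with COALESCE. The following is a minimal sketch, not the article's pipeline: it assumes a trimmed-down version of the items table, and the helper name `upsert_item` and the connection name `demo_conn` are hypothetical.

```python
import sqlite3

def upsert_item(conn, item):
    """Insert a row, or on conflict bump last_seen and backfill a NULL synopsis."""
    conn.execute(
        """
        INSERT INTO items (event_code, title, synopsis)
        VALUES (:event_code, :title, :synopsis)
        ON CONFLICT(event_code) DO UPDATE SET
            last_seen = CURRENT_DATE,
            -- keep the stored value if present, otherwise take the incoming one
            synopsis = COALESCE(items.synopsis, excluded.synopsis)
        """,
        {
            "event_code": item.get("event_code"),
            "title": item.get("title"),
            "synopsis": item.get("synopsis"),
        },
    )

# Trimmed-down table for illustration only (assumption: same key and date columns)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("""
    CREATE TABLE items (
        event_code TEXT PRIMARY KEY,
        title TEXT,
        synopsis TEXT,
        first_seen DATE DEFAULT CURRENT_DATE,
        last_seen DATE DEFAULT CURRENT_DATE
    )
""")

# First run: synopsis not yet published. Second run: synopsis available.
upsert_item(demo_conn, {"event_code": "ET1", "title": "Demo"})
upsert_item(demo_conn, {"event_code": "ET1", "title": "Demo", "synopsis": "Now filled in"})
rows = demo_conn.execute("SELECT event_code, synopsis FROM items").fetchall()
```

The same COALESCE pattern can be repeated for cast_json, director, or any other column that tends to arrive late. It deliberately never overwrites a stored value, which may or may not be the behavior you want.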

Step 3: How do I track city-level availability?

A movie like "Bhooth Bangla" plays in dozens of cities simultaneously. Track which cities each item appears in using a junction table.

conn.execute("""
    CREATE TABLE IF NOT EXISTS item_cities (
        event_code TEXT,
        city TEXT,
        seen_date DATE DEFAULT CURRENT_DATE,
        PRIMARY KEY (event_code, city, seen_date)
    )
""")

for item in items:
    conn.execute("""
        INSERT OR IGNORE INTO item_cities (event_code, city)
        VALUES (?, ?)
    """, (item.get("event_code"), item.get("scraped_city")))

conn.commit()

# Query: which movies are showing in Bengaluru but not Chennai?
# DISTINCT collapses the one-row-per-seen_date duplicates from item_cities
blr_only = conn.execute("""
    SELECT DISTINCT i.title, i.languages, i.genres FROM items i
    JOIN item_cities ic ON i.event_code = ic.event_code
    WHERE ic.city = 'bengaluru'
    AND i.event_code NOT IN (
        SELECT event_code FROM item_cities WHERE city = 'chennai'
    )
    AND i.type = 'movie'
""").fetchall()
print(f"Movies in Bengaluru but not Chennai: {len(blr_only)}")

This reveals regional distribution patterns, such as which films get wide releases versus limited metro runs.
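A related query counts the distinct cities per item, which gives a rough measure of release breadth. This is a sketch under assumptions: it uses an in-memory copy of the item_cities table with made-up rows, and the event codes `ET_WIDE` and `ET_LIMITED` are placeholders, not real BookMyShow codes.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("""
    CREATE TABLE item_cities (
        event_code TEXT,
        city TEXT,
        seen_date TEXT,
        PRIMARY KEY (event_code, city, seen_date)
    )
""")
# Made-up sightings: one title in three cities, one title in a single city
demo.executemany("INSERT INTO item_cities VALUES (?, ?, ?)", [
    ("ET_WIDE", "mumbai", "2026-04-16"),
    ("ET_WIDE", "mumbai", "2026-04-17"),
    ("ET_WIDE", "delhi", "2026-04-16"),
    ("ET_WIDE", "chennai", "2026-04-16"),
    ("ET_LIMITED", "mumbai", "2026-04-16"),
])

# COUNT(DISTINCT city) ignores repeat sightings of the same city on different days
breadth = demo.execute("""
    SELECT event_code, COUNT(DISTINCT city) AS city_count
    FROM item_cities
    GROUP BY event_code
    ORDER BY city_count DESC, event_code
""").fetchall()
```

Joining this result back to items on event_code would attach titles and languages to each count.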

Step 4: How do I enrich with IMDb ratings and venue data?

Join BookMyShow movies with external data sources by title normalization. The IMDb Scraper and Google Maps Scraper from Thirdwatch provide complementary fields.

# After fetching IMDb data for matched titles, join on a normalized title
import re

import pandas as pd

def normalize(title):
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

# Assumes imdb_df is a DataFrame from the IMDb Scraper that already has a
# title_norm column built with the same normalize() function
movies_df = pd.read_sql("SELECT * FROM items WHERE type = 'movie'", conn)
movies_df["title_norm"] = movies_df["title"].apply(normalize)

enriched = movies_df.merge(
    imdb_df[["title_norm", "imdb_rating", "imdb_votes", "box_office"]],
    on="title_norm",
    how="left",
)
print(f"Matched {enriched['imdb_rating'].notna().sum()} of {len(enriched)} movies with IMDb ratings")
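Title matching tends to fail on small differences such as a trailing release year or inconsistent spacing. The sketch below is one possible stricter variant of the normalize helper. It assumes years appear in parentheses at the end of the title (a bare trailing number is left alone so that titles ending in a number are not mangled), and the name `normalize_title` is hypothetical. Note that it replaces punctuation with a space rather than deleting it, so both sides of the join must use the same function.

```python
import re

def normalize_title(title):
    """Lowercase, drop a trailing '(YYYY)', replace punctuation with spaces, collapse spaces."""
    text = title.lower()
    # Remove a parenthesized year at the end, e.g. "Some Film (2026)"
    text = re.sub(r"\s*\((19|20)\d{2}\)\s*$", "", text)
    # Anything that is not a letter, digit, or space becomes a space
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

examples = {
    raw: normalize_title(raw)
    for raw in ["Bhooth Bangla (2026)", "  Spider-Man: No Way Home ", "Blade Runner 2049"]
}
```

Title-only joins can still collide on remakes that share a name, so comparing director or cast on matched rows is a reasonable second check.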

Sample output

Records from a multi-city run showing the fields available for database ingestion:

[
  {
    "type": "movie",
    "title": "Bhooth Bangla",
    "event_code": "ET00411383",
    "url": "https://in.bookmyshow.com/movies/mumbai/bhooth-bangla/ET00411383",
    "languages": ["Hindi"],
    "genres": ["Comedy", "Horror", "Thriller"],
    "rating": "UA16+",
    "release_date": "16 APR, 2026",
    "runtime_minutes": 165,
    "director": "Priyadarshan",
    "cast": [{"name": "Akshay Kumar", "url": "https://in.bookmyshow.com/akshay-kumar/94"}],
    "poster_url": "https://assets-in.bmscdn.com/discovery-catalog/events/et00411383-ufeqwqauqg-portrait.jpg",
    "scraped_city": "mumbai",
    "data_source": "bookmyshow"
  },
  {
    "type": "event",
    "title": "CALVIN HARRIS - Live in Mumbai",
    "event_code": "ET00462236",
    "category": ["music-shows", "concerts"],
    "genre": "edm",
    "event_date": "Sat 18 Apr 2026",
    "event_time": "4:00 PM",
    "venue_name": "Infinity Bay Sewri",
    "price_starting_from": "₹ 3500 onwards",
    "tags": ["outdoor-events", "fast-filling", "must-attend"],
    "scraped_city": "mumbai",
    "data_source": "bookmyshow"
  }
]

The event_code field is your primary key for deduplication. Movies carry languages (array), genres, cast, director, runtime_minutes, and synopsis. Events carry category, venue_name, event_date, event_time, duration, and price_starting_from. Both types include poster_url for display and scraped_city for geographic attribution.

Common pitfalls

Three issues surface when building entertainment databases from BookMyShow.

Duplicate records across cities: A Bollywood film showing in all 8 metros appears 8 times in raw output. Always deduplicate on event_code and store city as a separate dimension.

Missing detail fields for early listings: Movies announced but not yet released may lack cast, director, or synopsis. The actor returns whatever BookMyShow publishes; check for null fields before inserting into non-nullable database columns.

Event date format variation: event_date comes as human-readable text ("Sat 18 Apr 2026") rather than ISO format. Parse it with dateutil.parser (from the third-party python-dateutil package) before storing. Movies use release_date in a different format ("16 APR, 2026"), so handle both patterns.
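For the two date layouts shown in the sample output, the standard library is enough. The following is a minimal sketch that assumes only those two formats occur; the helper name `to_iso_date` is hypothetical, and any other layout returns None rather than raising.

```python
from datetime import datetime

# Formats taken from the sample output; other layouts may exist (assumption)
KNOWN_FORMATS = ("%a %d %b %Y", "%d %b, %Y")

def to_iso_date(text):
    """Return YYYY-MM-DD for a recognized date string, else None."""
    if not text:
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    return None

event_iso = to_iso_date("Sat 18 Apr 2026")   # event_date style
release_iso = to_iso_date("16 APR, 2026")    # release_date style
unknown = to_iso_date("Coming soon")         # unrecognized text
```

Storing the ISO string alongside the raw text keeps the original value available if a new format shows up later.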

Thirdwatch's actor handles BookMyShow's protections and delivers clean JSON so your pipeline code focuses on storage and enrichment, not scraping mechanics.

Frequently asked questions

How many movies and events does BookMyShow list at any given time?

BookMyShow typically lists 30-80 movies per city depending on size, plus hundreds of live events across music, comedy, sports, and workshops. A full sweep of the top 8 metros captures 200-500 unique movies and 1,000+ events.

Can I deduplicate movies that appear in multiple cities?

Yes. Use the event_code field as a stable unique identifier. The same movie has the same event_code across all cities. Deduplicate on event_code and store city as a related dimension.

Does the actor return historical data or only current listings?

Only current listings at the time of each run. To build historical depth, schedule daily runs and accumulate data in your database over time. Movies typically remain listed for 4-8 weeks after release.
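Once daily runs have accumulated, the first_seen and last_seen columns can be turned into a listing duration. This sketch assumes both columns hold ISO date strings and uses invented rows in an in-memory table; the titles and dates are placeholders, not real data.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("""
    CREATE TABLE items (
        event_code TEXT PRIMARY KEY,
        title TEXT,
        first_seen TEXT,
        last_seen TEXT
    )
""")
# Invented rows for illustration
demo.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", [
    ("ET_A", "Long Runner", "2026-04-01", "2026-05-13"),
    ("ET_B", "Short Runner", "2026-04-10", "2026-04-17"),
])

# julianday() converts an ISO date to a day number; +1 counts both end dates
durations = demo.execute("""
    SELECT title,
           CAST(julianday(last_seen) - julianday(first_seen) AS INTEGER) + 1 AS days_listed
    FROM items
    ORDER BY days_listed DESC
""").fetchall()
```

The result is only as precise as the run schedule: a gap in daily runs shortens the measured window for anything that left the catalogue during the gap.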

Can I combine BookMyShow data with IMDb ratings?

Yes. Join on movie title after normalizing casing and removing year suffixes. For higher accuracy, use the BookMyShow synopsis and cast fields to cross-validate matches. The IMDb Scraper from Thirdwatch returns ratings, plot, and box-office data.

What database schema works best for this data?

A star schema with a central items table keyed on event_code, a cities dimension table, and junction tables for languages, genres, and cast. Events need additional columns for venue_name, event_date, event_time, and price_starting_from.
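As a rough illustration of that layout, the sketch below creates a reduced version in SQLite. The table and column names beyond those already used in this article (`cities`, `city_id`, `item_genres`) are assumptions, and this item_cities variant references a cities table instead of storing the city name directly as in Step 3. Languages and cast would follow the same junction-table pattern as genres.

```python
import sqlite3

SCHEMA = """
CREATE TABLE items (
    event_code TEXT PRIMARY KEY,
    type TEXT NOT NULL,
    title TEXT NOT NULL,
    venue_name TEXT,
    event_date TEXT,
    event_time TEXT,
    price_starting_from TEXT
);
CREATE TABLE cities (
    city_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE item_cities (
    event_code TEXT NOT NULL REFERENCES items(event_code),
    city_id INTEGER NOT NULL REFERENCES cities(city_id),
    PRIMARY KEY (event_code, city_id)
);
CREATE TABLE item_genres (
    event_code TEXT NOT NULL REFERENCES items(event_code),
    genre TEXT NOT NULL,
    PRIMARY KEY (event_code, genre)
);
"""

demo = sqlite3.connect(":memory:")
demo.executescript(SCHEMA)

# Load the movie from the sample output, one genre per junction row
demo.execute(
    "INSERT INTO items (event_code, type, title) VALUES (?, ?, ?)",
    ("ET00411383", "movie", "Bhooth Bangla"),
)
demo.executemany(
    "INSERT INTO item_genres (event_code, genre) VALUES (?, ?)",
    [("ET00411383", g) for g in ["Comedy", "Horror", "Thriller"]],
)

# Filtering by genre becomes a plain join instead of parsing a JSON string
horror_titles = [row[0] for row in demo.execute("""
    SELECT i.title FROM items i
    JOIN item_genres g ON g.event_code = i.event_code
    WHERE g.genre = 'Horror'
""")]
```

Compared with the JSON-in-a-column approach from Step 2, this costs more insert code but makes genre and language filters indexable.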
