Track Bollywood Box Office Trends Using BookMyShow Data
Analyze Bollywood movie performance using BookMyShow listings across Indian cities. Track release patterns, genre trends, language distribution, and cast reach.

Thirdwatch's BookMyShow Scraper pulls movie listings from BookMyShow across 600+ Indian cities, returning title, languages, genres, cast, director, runtime, release date, and city-level availability. Tracked over time, this data reveals Bollywood release patterns, genre trends, language distribution, cast dominance, and theatrical longevity. Built for entertainment analysts, film industry researchers, journalists covering Indian cinema, and data teams building box-office intelligence products.
Why track Bollywood box office with BookMyShow data
India's film industry is the world's most prolific by volume. According to UNESCO's Institute for Statistics, India produces over 1,500 feature films annually -- more than any other country. Yet structured box-office data for Indian cinema is notoriously scarce. Unlike the US market, where Box Office Mojo and The Numbers publish daily grosses, Indian box-office figures are fragmented across trade analysts, unverified social media accounts, and paywalled industry reports.
BookMyShow fills this gap indirectly. As India's dominant online ticketing platform, its listings reflect what is actually playing in theaters across 600+ cities in real time. A film's distribution breadth (how many cities it appears in), listing longevity (how many weeks it stays listed), language reach (how many language versions are released), and genre categorization are all visible from listing metadata. These proxy signals power concrete analysis: which Bollywood films got wide versus limited releases, which directors command multi-city premieres, how genre preferences differ between metros and tier-2 cities, and which films had staying power versus quick exits from theaters.
How does this compare to the alternatives?
| Approach | Data granularity | Coverage | Cost | Reliability |
|---|---|---|---|---|
| Trade analyst reports (Bollywood Hungama, Sacnilk) | Daily grosses, approximate | Top 20-30 films per week | Free to paywalled | Estimates, not verified |
| Official studio releases | Milestone announcements only | Their own films | Free | Selective, PR-filtered |
| Thirdwatch BookMyShow Scraper | City-level listings, metadata, cast, genre | Every film on BookMyShow, 600+ cities | Pay per result | Structured, automated |
Trade analyst reports provide revenue estimates for top films but lack granularity below the top 30 and do not cover events or regional cinema comprehensively. Studio announcements are PR-filtered milestones, not systematic data. The BookMyShow Scraper provides the broadest structured view of what is actually showing, where, and for how long, which complements revenue estimates with distribution and metadata intelligence.
How to track Bollywood box office in 4 steps
Step 1: How do I capture the current movie landscape across India?
Run the scraper in movies mode across a representative set of Indian cities to get the current theatrical slate.
import os

import pandas as pd
import requests

ACTOR = "thirdwatch~bookmyshow-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

# Metros plus a sample of tier-2 cities
METROS = ["Mumbai", "Delhi", "Bengaluru", "Chennai", "Hyderabad",
          "Pune", "Kolkata", "Ahmedabad", "Jaipur", "Lucknow",
          "Kochi", "Chandigarh", "Indore", "Bhopal", "Patna"]

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "queries": METROS,
        "mode": "movies",
        "includeDetails": True,
        "maxResults": 1000,
    },
    timeout=600,
)
resp.raise_for_status()  # stop on an HTTP error instead of parsing an error body

df = pd.DataFrame(resp.json())
print(f"{len(df)} listings across {df['scraped_city'].nunique()} cities")
print(f"Unique movies: {df['event_code'].nunique()}")

Running fifteen cities with full detail enrichment captures the theatrical slate for both metros and tier-2 cities. The mode: "movies" setting skips events to focus on cinema only.
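The later steps compare runs over time, so each run needs to be stored with its date. The following is a minimal sketch, assuming the records are plain dicts like the ones returned above. The `save_snapshot` and `load_snapshot` helpers, the `snapshot_date` field, and the one-file-per-day JSON Lines layout are illustrative choices, not part of the actor's output.

```python
import json
from datetime import date
from pathlib import Path


def save_snapshot(records, out_dir, run_date=None):
    """Write one run to <out_dir>/<YYYY-MM-DD>.jsonl, one record per line."""
    run_date = run_date or date.today()
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{run_date.isoformat()}.jsonl"
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            # snapshot_date is a hypothetical field added by this helper
            row = {**record, "snapshot_date": run_date.isoformat()}
            fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path


def load_snapshot(path):
    """Read a snapshot file back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

With the response from Step 1, this could be called as `save_snapshot(resp.json(), "snapshots")` after each scheduled run.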
Step 2: How do I analyze distribution breadth and release patterns?
Count how many cities each movie appears in. Wide distribution signals studio confidence and marketing spend.
# Deduplicate and count city presence
city_counts = df.groupby("event_code").agg(
    title=("title", "first"),
    languages=("languages", "first"),
    genres=("genres", "first"),
    release_date=("release_date", "first"),
    director=("director", "first"),
    city_count=("scraped_city", "nunique"),
    cities=("scraped_city", lambda x: sorted(x.unique().tolist())),
).reset_index()
city_counts = city_counts.sort_values("city_count", ascending=False)

print("TOP 10 WIDEST RELEASES:")
for _, row in city_counts.head(10).iterrows():
    langs = ", ".join(row["languages"]) if isinstance(row["languages"], list) else ""
    print(f"  {row['title']} ({langs}) - {row['city_count']} cities, dir: {row['director']}")

print("\nLIMITED RELEASES (3 or fewer cities):")
limited = city_counts[city_counts["city_count"] <= 3]
print(f"  {len(limited)} films in limited release")

Films in 12+ of 15 tracked cities have wide distribution. Films in 1-3 cities are limited releases -- often regional films, indie productions, or late-run holdovers.
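The two thresholds above can be turned into a label per film. This is a rough sketch: the `release_tier` function and the "moderate" middle bucket are assumptions added for illustration, and the wide threshold is scaled from the 12-of-15 ratio so it still works if you track a different number of cities.

```python
def release_tier(city_count, tracked_cities=15):
    """Label a film by how many of the tracked cities list it."""
    if city_count <= 3:
        return "limited"
    # Same ratio as 12 of 15 cities, compared with integers to avoid float error
    if city_count * 15 >= 12 * tracked_cities:
        return "wide"
    return "moderate"  # hypothetical middle bucket, not defined in the text


# Hypothetical city counts keyed by event_code
counts = {"ET0001": 14, "ET0002": 7, "ET0003": 2}
tiers = {code: release_tier(n) for code, n in counts.items()}
print(tiers)
```

On the DataFrame from this step, the same function could be applied with `city_counts["city_count"].apply(release_tier)`.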
Step 3: How do I track genre and language trends over time?
Schedule weekly runs and aggregate genre and language distributions to identify shifts in the theatrical mix.
# Parse language and genre arrays (one row per unique movie)
movies = df.drop_duplicates(subset="event_code").copy()
movies["lang_list"] = movies["languages"].apply(
    lambda x: x if isinstance(x, list) else []
)
movies["genre_list"] = movies["genres"].apply(
    lambda x: x if isinstance(x, list) else []
)

# Language distribution
all_langs = [lang for langs in movies["lang_list"] for lang in langs]
lang_dist = pd.Series(all_langs).value_counts()
print("LANGUAGE DISTRIBUTION (unique movies):")
print(lang_dist.head(10).to_string())

# Genre distribution
all_genres = [genre for genres in movies["genre_list"] for genre in genres]
genre_dist = pd.Series(all_genres).value_counts()
print("\nGENRE DISTRIBUTION:")
print(genre_dist.head(10).to_string())

# Multi-language releases (pan-India films)
multi_lang = movies[movies["lang_list"].apply(len) >= 3]
print(f"\nPAN-INDIA FILMS (3+ languages): {len(multi_lang)}")
for _, row in multi_lang.iterrows():
    print(f"  {row['title']} - {', '.join(row['lang_list'])}")

Pan-India films releasing in 3+ languages simultaneously (for example Hindi, Tamil, and Telugu) are an accelerating trend in Indian cinema. Tracking this metric weekly shows how the share of multi-language releases changes over time.
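To see a shift rather than a single week's mix, two runs have to be compared. The sketch below assumes each weekly run is available as a list of record dicts with `event_code` and `languages`. The `language_share` and `share_shift` helpers and the sample data are hypothetical.

```python
from collections import Counter


def language_share(movies):
    """Return {language: share of unique movies listing that language}."""
    unique = {m["event_code"]: m for m in movies}  # dedupe across cities
    counts = Counter(
        lang for m in unique.values() for lang in (m.get("languages") or [])
    )
    total = len(unique)
    return {lang: n / total for lang, n in counts.items()} if total else {}


def share_shift(previous, current):
    """Percentage-point change in share per language between two runs."""
    prev, curr = language_share(previous), language_share(current)
    return {
        lang: round(100 * (curr.get(lang, 0.0) - prev.get(lang, 0.0)), 1)
        for lang in sorted(set(prev) | set(curr))
    }


# Hypothetical two-week sample shaped like the scraper's records
week_1 = [
    {"event_code": "A", "languages": ["Hindi"]},
    {"event_code": "B", "languages": ["Hindi", "Tamil"]},
]
week_2 = [
    {"event_code": "B", "languages": ["Hindi", "Tamil"]},
    {"event_code": "C", "languages": ["Telugu", "Tamil"]},
]
print(share_shift(week_1, week_2))
```

Shares can sum to more than 100% because a film with several language versions counts once per language. The same pattern works for genres by swapping the field name.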
Step 4: How do I measure cast and director dominance?
Extract cast and director fields to rank talent by theatrical presence.
from collections import Counter

# Uses the deduplicated `movies` frame built in Step 3

# Director rankings by number of films currently in theaters
directors = movies[movies["director"].notna()]
dir_counts = directors["director"].value_counts()
print("DIRECTORS WITH MOST FILMS CURRENTLY IN THEATERS:")
print(dir_counts.head(10).to_string())

# Actor rankings by film count
actor_counter = Counter()
for cast_list in movies["cast"]:
    if isinstance(cast_list, list):
        for member in cast_list:
            if isinstance(member, dict) and "name" in member:
                actor_counter[member["name"]] += 1

print("\nACTORS WITH MOST FILMS CURRENTLY IN THEATERS:")
for name, count in actor_counter.most_common(15):
    print(f"  {name}: {count} films")

Tracked weekly, this builds a longitudinal view of which talent is most active and which directors have the most concurrent releases -- a rough indicator of star power and commercial demand, not a measure of revenue.
Sample output
A movie record with the fields used for box-office proxy analysis:
{
  "type": "movie",
  "title": "Bhooth Bangla",
  "event_code": "ET00411383",
  "url": "https://in.bookmyshow.com/movies/mumbai/bhooth-bangla/ET00411383",
  "languages": ["Hindi"],
  "genres": ["Comedy", "Horror", "Thriller"],
  "rating": "UA16+",
  "release_date": "16 APR, 2026",
  "runtime_minutes": 165,
  "synopsis": "Arjun Acharya, a financially troubled man in his late 30s, unexpectedly inherits his grandfather's massive ancestral palace...",
  "director": "Priyadarshan",
  "cast": [
    {"name": "Akshay Kumar", "url": "https://in.bookmyshow.com/akshay-kumar/94"}
  ],
  "poster_url": "https://assets-in.bmscdn.com/discovery-catalog/events/et00411383-ufeqwqauqg-portrait.jpg",
  "scraped_city": "mumbai",
  "data_source": "bookmyshow"
}

The fields that matter for box-office analysis: event_code for deduplication and time-series tracking, languages for language-market segmentation, genres for category trends, release_date for release-window analysis, cast and director for talent tracking, runtime_minutes for format analysis, and scraped_city for distribution breadth. The rating field (U, UA, UA16+, A) indicates the film's audience age bracket, which correlates with ticket-price tier and target demographic.
Common pitfalls
Three issues affect box-office analysis built on BookMyShow data.
- Mistaking listing presence for revenue. A film listed in 15 cities is widely distributed, but distribution breadth does not equal ticket sales. Combine BookMyShow data with trade analyst revenue estimates for a complete picture. BookMyShow provides the "where and how long" signals; revenue estimates provide the "how much" signal.
- Release date format parsing. The release_date field uses BookMyShow's format ("16 APR, 2026"), not ISO 8601. Parse with Python's dateutil.parser or a custom strptime pattern before doing date arithmetic. Some films list "Coming Soon" without a concrete date.
- Multi-language deduplication. A pan-India film releasing in Hindi, Tamil, and Telugu may have the same event_code across all language versions, or different codes for each version. Always check before assuming one code equals one film.
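The release-date and deduplication pitfalls can both be checked in a few lines. This is a sketch under stated assumptions: the `%d %b, %Y` pattern is inferred from the single sample value "16 APR, 2026" and relies on English month abbreviations in the active locale, and matching films by lowercased title is a crude heuristic that remakes or same-named films will defeat. Both helper names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def parse_release_date(raw):
    """Parse strings like '16 APR, 2026'; return None when there is no date."""
    if not isinstance(raw, str):
        return None
    try:
        return datetime.strptime(raw.strip(), "%d %b, %Y").date()
    except ValueError:
        return None  # e.g. "Coming Soon"


def titles_with_multiple_codes(records):
    """Return {normalized title: sorted event_codes} for titles with 2+ codes."""
    codes = defaultdict(set)
    for rec in records:
        codes[rec["title"].strip().lower()].add(rec["event_code"])
    return {title: sorted(c) for title, c in codes.items() if len(c) > 1}


# Hypothetical records: one title listed under two codes
sample = [
    {"title": "Film X", "event_code": "ET0001"},
    {"title": "film x ", "event_code": "ET0002"},
    {"title": "Film Y", "event_code": "ET0003"},
]
print(parse_release_date("16 APR, 2026"))
print(titles_with_multiple_codes(sample))
```

Any title the second helper returns is worth inspecting by hand before counting its codes as separate films.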
Thirdwatch's actor handles BookMyShow's request protection and delivers clean JSON so your analysis pipeline operates on structured fields, not raw HTML.
Related use cases
- Scrape BookMyShow movies and events in India
- Build an India entertainment database with BookMyShow
- Monitor BookMyShow ticket availability and pricing
- Scrape IMDb for movie ratings and box office data
- The complete guide to scraping entertainment data
- BookMyShow Scraper on Apify Store
- All Thirdwatch use-case guides
Frequently asked questions
Does BookMyShow provide actual box office revenue numbers?
No. BookMyShow does not publish revenue or ticket-sales figures publicly. The actor captures listing metadata -- city availability, languages, genres, release dates, runtime, and cast -- and repeated runs add listing duration. Together these serve as proxies for distribution reach and, more loosely, audience demand.
Can I track how long a movie stays in theaters?
Yes. Schedule daily runs and record, for each event_code, the first and last dates on which it appears (first_seen and last_seen are values you compute across runs, not fields in the actor's output). Most Bollywood films stay listed for 4-8 weeks. Films that persist beyond 8 weeks are outlier hits. Films that disappear within 1-2 weeks usually had weak box office performance.
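One way to derive first_seen and last_seen is to fold the dated runs into a window per film. The sketch below assumes you already have, for each run date, the set of event_code values seen that day. The `listing_windows` helper and the sample data are hypothetical.

```python
from datetime import date


def listing_windows(snapshots):
    """snapshots: {run_date: iterable of event_codes seen on that date}.

    Returns {event_code: (first_seen, last_seen, days_listed)}.
    """
    windows = {}
    for run_date in sorted(snapshots):
        for code in set(snapshots[run_date]):
            first, _ = windows.get(code, (run_date, run_date))
            windows[code] = (first, run_date)
    return {
        code: (first, last, (last - first).days + 1)
        for code, (first, last) in windows.items()
    }


# Hypothetical weekly runs
runs = {
    date(2026, 4, 17): ["A", "B"],
    date(2026, 4, 24): ["A", "B"],
    date(2026, 5, 1): ["A"],
}
for code, (first, last, days) in sorted(listing_windows(runs).items()):
    print(code, first, last, days)
```

The span only measures the gap between the first and last sighting, so its resolution is the interval between runs, and a film that drops out and returns is counted as continuously listed.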
How do I separate Bollywood from regional cinema?
Filter on the languages array. Hindi-language films are predominantly Bollywood. Tamil, Telugu, Kannada, Malayalam, and Bengali films are regional cinema. Many films release in multiple languages simultaneously, which the languages array captures.
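A small bucketing function makes this filter explicit. The bucket names below are assumptions chosen for illustration. Note that the "non_hindi" bucket also catches English-language Hollywood releases, so a further check against the regional languages you care about is needed to isolate regional cinema.

```python
def language_market(languages):
    """Bucket a film by its languages array."""
    langs = {lang.strip().title() for lang in (languages or [])}
    if not langs:
        return "unknown"
    if langs == {"Hindi"}:
        return "hindi_only"
    if "Hindi" in langs:
        return "hindi_plus_other"  # multi-language release that includes Hindi
    return "non_hindi"


# Hypothetical language arrays
for langs in (["Hindi"], ["Hindi", "Tamil", "Telugu"], ["Malayalam"], None):
    print(langs, "->", language_market(langs))
```

On the `movies` frame from Step 3 this could be applied with `movies["languages"].apply(language_market)`.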
Can I compare movie performance across cities?
Yes. Run the scraper across multiple cities in a single query and track which movies appear where. A film listed in 30+ cities has wide distribution; a film in only 3-5 cities has a limited release. City count over time is a reliable proxy for theatrical reach.
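Per-city comparison comes down to counting, for each film, the distinct cities it appears in within each city group. The sketch below is illustrative: the `METRO` set is a hypothetical grouping rather than an official classification, and city names are lowercased because the sample output shows `scraped_city` as "mumbai" while the input query used "Mumbai".

```python
from collections import defaultdict

# Hypothetical grouping; adjust to your own definition of a metro
METRO = {"mumbai", "delhi", "bengaluru", "chennai", "hyderabad", "kolkata"}


def city_presence(records):
    """Return {event_code: {"metro": n_cities, "tier2": n_cities}}."""
    cities = defaultdict(lambda: {"metro": set(), "tier2": set()})
    for rec in records:
        city = rec["scraped_city"].strip().lower()
        group = "metro" if city in METRO else "tier2"
        cities[rec["event_code"]][group].add(city)
    return {
        code: {group: len(names) for group, names in groups.items()}
        for code, groups in cities.items()
    }


# Hypothetical listings
listings = [
    {"event_code": "A", "scraped_city": "mumbai"},
    {"event_code": "A", "scraped_city": "Delhi"},
    {"event_code": "A", "scraped_city": "indore"},
    {"event_code": "B", "scraped_city": "patna"},
]
print(city_presence(listings))
```

Films with a high metro count and no tier-2 presence, or the reverse, are the ones where city-level distribution differs most.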
How accurate is BookMyShow data as a box office proxy?
BookMyShow is the dominant online ticketing platform in India, processing a significant share of urban movie tickets. Its listing data reflects what theaters are actively showing. It does not capture walk-up ticket sales or non-BookMyShow platforms, but for urban India, it is the most comprehensive single signal source.