Build an India Company Registry Database From MCA in 2026

Extract structured MCA company master data at scale with Thirdwatch. Build a searchable Indian company registry database with directors, capital, and status.

May 26, 2026 · 6 min read · 1,302 words

Thirdwatch's MCA India Scraper extracts structured company master data from India's Ministry of Corporate Affairs registry at scale. Pass CINs in bulk and receive JSON records with company name, status, capital, directors with DIN, registered address, and RoC jurisdiction. It is aimed at developers creating Indian company databases, enrichment pipelines, and compliance data products.

Why build an India company registry database

India has over 2.5 million active registered companies according to the MCA Annual Report 2024-25, with the total registry exceeding 3 million entities when including struck-off and dormant companies. Any data product targeting Indian business intelligence, credit scoring, or compliance automation needs a structured, queryable mirror of this registry.

The raw MCA data is locked behind a CAPTCHA-gated portal that was not designed for programmatic access. Commercial API providers (Tofler, Signzy, Probe42) charge per lookup and gate director data behind premium tiers. Building a local database from MCA data means you control the refresh cadence, pay once for extraction, and query freely without per-lookup costs downstream.

Typical developer use cases include a fintech building KYB verification that needs to pre-populate company profiles for instant lookup, a data analytics firm that needs a company universe for sector analysis, and a compliance SaaS that wants to cross-reference director networks. All start with the same step -- bulk CIN extraction into a structured database.

How does this compare to the alternatives?

Approach	Reliability	Setup time	Maintenance	Bulk access
MCA21 portal (manual)	CAPTCHA-limited	Immediate	Per-lookup effort	Not feasible
Commercial API (Tofler, Signzy)	High	Days (contract)	Vendor-managed	Per-lookup pricing
Open datasets (data.gov.in)	Stale, partial	Hours	No updates	CSV dumps, incomplete
Thirdwatch MCA India Scraper	High	5 minutes	Thirdwatch tracks changes	Pay per result, bulk-friendly

The MCA India Scraper is the developer-friendly middle ground: structured output, no subscription, and bulk-capable.

How to build an India company database in 5 steps

Step 1: How do I set up the Apify client?

Install the Python client and authenticate:

pip install apify-client pandas sqlalchemy psycopg2-binary
export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"

Get your token from apify.com under Settings, then Integrations.

Step 2: How do I extract companies in bulk by CIN?

Pass a batch of CINs. The actor processes them sequentially with polite rate-limiting:

import os

import pandas as pd
from apify_client import ApifyClient

# Read the token exported in Step 1 instead of hardcoding it in source
client = ApifyClient(os.environ["APIFY_TOKEN"])

# Batch of CINs -- source these from data.gov.in company lists,
# industry association directories, or existing CRM records
cins = [
    "L17110MH1973PLC019786",   # Reliance Industries
    "L22210MH1995PLC084781",   # TCS
    "L32102KA1945PLC020800",   # Wipro
    "U72200KA2002PLC030310",   # Infosys BPM
    "L65190GJ1994PLC021012",   # ICICI Bank
]

run = client.actor("thirdwatch/mca-india-scraper").call(
    run_input={
        "queries": cins,
        "maxResults": len(cins),
        "includeDirectors": True,
    }
)

items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
df = pd.DataFrame(items)
print(f"Retrieved {len(df)} company records with {sum(len(r.get('directors', [])) for r in items)} directors")

Five CINs return five structured records in under 30 seconds. Scale to hundreds by expanding the cins list.
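
Because every lookup is rate-limited and billed per result, it can help to screen the input list before starting a run. The sketch below assumes the standard 21-character CIN layout (listing flag, 5-digit industry code, 2-letter state code, 4-digit incorporation year, 3-letter company type, 6-digit registration number). The names parse_cin and clean_cin_batch are illustrative helpers, not part of the actor or the Apify client.

```python
import re

# Assumed CIN layout: L/U + 5-digit NIC code + state + year + type + 6-digit number
CIN_PATTERN = re.compile(
    r"^(?P<listing>[LU])(?P<nic>\d{5})(?P<state>[A-Z]{2})"
    r"(?P<year>\d{4})(?P<type>[A-Z]{3})(?P<number>\d{6})$"
)


def parse_cin(cin):
    """Return the CIN's components as a dict, or None if the shape is wrong."""
    match = CIN_PATTERN.match(cin.strip().upper())
    return match.groupdict() if match else None


def clean_cin_batch(raw_cins):
    """Split a raw list into (valid, rejected), dropping duplicates in order."""
    valid, rejected, seen = [], [], set()
    for raw in raw_cins:
        cin = raw.strip().upper()
        if parse_cin(cin) is None:
            rejected.append(raw)
        elif cin not in seen:
            seen.add(cin)
            valid.append(cin)
    return valid, rejected


valid, rejected = clean_cin_batch([
    "L17110MH1973PLC019786",
    " l17110mh1973plc019786 ",  # duplicate once normalized
    "NOT-A-CIN",
])
print(valid, rejected)
```

A shape check like this only catches malformed strings. It cannot tell whether a well-formed CIN actually exists in the registry.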

Step 3: How do I normalize the data for a relational database?

The actor returns directors as a nested array. Normalize into two tables -- companies and directors:

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost:5432/company_db")

# Companies table -- flat fields
companies_df = df.drop(columns=["directors"], errors="ignore")
companies_df.to_sql("companies", engine, if_exists="append", index=False)

# Directors table -- explode the nested array
directors_rows = []
for _, row in df.iterrows():
    for d in row.get("directors") or []:
        directors_rows.append({
            "cin": row["cin"],
            "director_name": d.get("name"),
            "din": d.get("din"),
            "designation": d.get("designation"),
        })

directors_df = pd.DataFrame(directors_rows)
directors_df.to_sql("directors", engine, if_exists="append", index=False)
print(f"Inserted {len(companies_df)} companies and {len(directors_df)} directors")

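The cin column is the natural primary key for companies, and the directors table uses a composite key of cin + din. Note that to_sql does not create either constraint on its own, and if_exists="append" will insert duplicate rows if the same CINs are loaded again, so declare the keys in the database before loading refreshed data. With the keys in place, this schema supports efficient joins for director interlock queries.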
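
As a minimal sketch of those keys, the DDL below declares cin as the companies primary key and (cin, din) as the directors composite key, then upserts on conflict. It runs against an in-memory SQLite database only so that it is self-contained; the ON CONFLICT ... DO UPDATE clause is also accepted by PostgreSQL, though parameter placeholders differ by driver. The reduced column list, the upsert_company helper, and the example company are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL database
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE companies (
        cin          TEXT PRIMARY KEY,
        company_name TEXT NOT NULL,
        status       TEXT
    );
    CREATE TABLE directors (
        cin           TEXT NOT NULL REFERENCES companies (cin),
        din           TEXT NOT NULL,
        director_name TEXT,
        designation   TEXT,
        PRIMARY KEY (cin, din)
    );
""")


def upsert_company(conn, record):
    """Insert a company, or update name and status if the CIN already exists."""
    conn.execute(
        """
        INSERT INTO companies (cin, company_name, status)
        VALUES (:cin, :company_name, :status)
        ON CONFLICT (cin) DO UPDATE SET
            company_name = excluded.company_name,
            status = excluded.status
        """,
        record,
    )


# Hypothetical company, not real registry data
record = {"cin": "U12345KA2020PTC000001",
          "company_name": "Example Widgets Private Limited",
          "status": "Active"}
upsert_company(conn, record)
upsert_company(conn, {**record, "status": "Struck Off"})  # simulated refresh

rows = conn.execute("SELECT cin, status FROM companies").fetchall()
print(rows)
```

The second call updates the existing row instead of adding a duplicate, which is the behavior a repeated refresh needs.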

Step 4: How do I query director networks across companies?

With normalized data, cross-company director analysis is a standard SQL query:

interlock_query = text("""
    SELECT d.din, d.director_name,
           COUNT(DISTINCT d.cin) AS board_seats,
           ARRAY_AGG(DISTINCT c.company_name) AS companies
    FROM directors d
    JOIN companies c ON d.cin = c.cin
    GROUP BY d.din, d.director_name
    HAVING COUNT(DISTINCT d.cin) > 1
    ORDER BY board_seats DESC
""")

with engine.connect() as conn:
    interlocks = pd.read_sql(interlock_query, conn)
    print(f"{len(interlocks)} directors serve on multiple boards:")
    print(interlocks.head(20))

Director interlocks surface related-party relationships, group company structures, and potential conflicts of interest -- critical for compliance and credit risk systems.
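
For a quick check before the data reaches PostgreSQL, the same grouping can be sketched in plain Python over the actor's items. The records below are made-up placeholders in the shape of the actor output, and find_interlocks is an illustrative helper name.

```python
from collections import defaultdict


def find_interlocks(items):
    """Map each DIN that appears on more than one company to its sorted CINs."""
    boards = defaultdict(set)
    for item in items:
        for director in item.get("directors") or []:
            din = director.get("din")
            if din:  # skip directors without a DIN
                boards[din].add(item["cin"])
    return {din: sorted(cins) for din, cins in boards.items() if len(cins) > 1}


# Hypothetical records, not real registry data
sample_items = [
    {"cin": "U11111KA2020PTC000001",
     "directors": [{"name": "A", "din": "00000001"}]},
    {"cin": "U22222MH2021PTC000002",
     "directors": [{"name": "A", "din": "00000001"},
                   {"name": "B", "din": "00000002"}]},
    {"cin": "U33333DL2022PTC000003", "directors": []},
]
print(find_interlocks(sample_items))
```

Grouping on DIN rather than on name avoids merging different people who share a common name.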

Step 5: How do I schedule incremental refreshes?

Set up a weekly refresh to keep the database current:

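import json

# Apify scheduled run -- refresh all CINs weekly.
# Assumptions to verify against the current Apify docs: the schedule API takes
# the run input as a JSON string under runInput.body, and actorId may need the
# actor's ID from the Apify console rather than its name.
schedule = client.schedules().create(
    name="mca-weekly-refresh",
    cron_expression="0 3 * * 0",  # Sunday 3 AM (UTC unless a timezone is set)
    is_enabled=True,
    is_exclusive=True,  # do not start a run while the previous one is active
    actions=[{
        "type": "RUN_ACTOR",
        "actorId": "thirdwatch/mca-india-scraper",
        "runInput": {
            "body": json.dumps({
                "queries": cins,  # Your full CIN watchlist
                "maxResults": 10000,
                "includeDirectors": True,
            }),
            "contentType": "application/json; charset=utf-8",
        },
    }],
)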

# In your ETL pipeline, upsert on CIN to detect changes:
# - Status changes (Active -> Struck Off)
# - Director additions or resignations
# - Capital structure changes
# - Address updates

Weekly cadence catches material changes (status transitions, director resignations) with minimal extraction cost. For actively monitored entities, increase to daily.
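
The change detection listed in the comments above could be sketched as a comparison between the stored record and the freshly extracted one for the same CIN. This is one possible approach, not the actor's own logic; diff_company is a hypothetical helper, and the tracked field list is an assumption based on the fields in the sample output.

```python
TRACKED_FIELDS = ("status", "authorized_capital", "paid_up_capital",
                  "registered_address")


def diff_company(old, new):
    """Return a dict describing what changed between two records for one CIN."""
    changes = {}
    for field in TRACKED_FIELDS:
        if old.get(field) != new.get(field):
            changes[field] = (old.get(field), new.get(field))

    # Compare director sets by DIN, ignoring entries without one
    old_dins = {d["din"] for d in old.get("directors") or [] if d.get("din")}
    new_dins = {d["din"] for d in new.get("directors") or [] if d.get("din")}
    if new_dins - old_dins:
        changes["directors_added"] = sorted(new_dins - old_dins)
    if old_dins - new_dins:
        changes["directors_removed"] = sorted(old_dins - new_dins)
    return changes


# Hypothetical before/after snapshots of the same company
previous = {"cin": "U12345KA2020PTC000001", "status": "Active",
            "directors": [{"din": "00000001"}, {"din": "00000002"}]}
current = {"cin": "U12345KA2020PTC000001", "status": "Struck Off",
           "directors": [{"din": "00000002"}, {"din": "00000003"}]}
changes = diff_company(previous, current)
print(changes)
```

An empty result means nothing tracked has changed, so the upsert for that CIN can be skipped or logged as a no-op.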

Sample output

Two records from a bulk extraction:

[
  {
    "cin": "L17110MH1973PLC019786",
    "company_name": "Reliance Industries Limited",
    "status": "Active",
    "incorporation_date": "08 May, 1973",
    "company_type": "Public Limited",
    "authorized_capital": "50,000 Cr",
    "paid_up_capital": "13,532.5 Cr",
    "registered_address": "3rd Floor, Maker Chambers IV, 222, Nariman Point, Mumbai - 400021",
    "roc": "RoC-Mumbai",
    "listing_status": "Listed",
    "state": "Maharashtra",
    "state_code": "MH",
    "nic_industry_code": "17110",
    "directors": [
      {"name": "Mukesh Dhirubhai Ambani", "din": "00001695", "designation": "Managing Director"}
    ],
    "query": "L17110MH1973PLC019786"
  },
  {
    "cin": "L65190GJ1994PLC021012",
    "company_name": "ICICI Bank Limited",
    "status": "Active",
    "incorporation_date": "05 Jan, 1994",
    "company_type": "Public Limited",
    "authorized_capital": "2,500 Cr",
    "paid_up_capital": "1,396 Cr",
    "registered_address": "ICICI Bank Tower, Near Chakli Circle, Old Padra Road, Vadodara - 390007",
    "roc": "RoC-Ahmedabad",
    "listing_status": "Listed",
    "state": "Gujarat",
    "state_code": "GJ",
    "nic_industry_code": "65190",
    "directors": [
      {"name": "Sandeep Bakhshi", "din": "00109206", "designation": "Managing Director & CEO"}
    ],
    "query": "L65190GJ1994PLC021012"
  }
]

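Every record uses the same field names, so downstream code can rely on a consistent shape. Capital and date values arrive as formatted strings rather than numbers or ISO dates, so convert them before loading into typed columns. The cin field serves as a stable primary key. directors is always an array (empty if includeDirectors is false). query maps each result to the input CIN for batch reconciliation.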
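
Since incorporation_date in the sample arrives as text such as "08 May, 1973", a small conversion step before loading is one option. The sketch assumes that this exact day, abbreviated English month, and year layout holds for every record, which the sample suggests but does not guarantee. The name parse_incorporation_date is illustrative.

```python
from datetime import datetime


def parse_incorporation_date(value):
    """Convert '08 May, 1973' to a date; return None for missing or odd values."""
    if not value:
        return None
    try:
        return datetime.strptime(value.strip(), "%d %b, %Y").date()
    except ValueError:
        return None  # leave unexpected formats for manual review


print(parse_incorporation_date("08 May, 1973"))
```

Returning None instead of raising keeps a single odd value from failing an entire batch load, at the cost of needing a later check for unparsed rows.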

Common pitfalls

Four issues developers hit when building MCA databases.

CIN sourcing -- the actor works best with CINs as input. For bulk database builds, source CIN lists from data.gov.in company datasets, industry association directories, stock exchange listings (for listed companies), or existing CRM records. The National Stock Exchange publishes CIN lists for all listed companies, which makes it a reliable seed source for an initial database. Company-name lookups work but are approximate and unreliable for common names.

Schema drift in capital fields -- authorized and paid-up capital come as formatted strings (e.g., "50,000 Cr") rather than raw integers. Parse these before inserting into numeric database columns. Handle edge cases: some older companies report capital in lakhs, others in crores.
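
A minimal sketch of that parsing step follows, using the units named above (1 crore = 10,000,000 and 1 lakh = 100,000). The set of unit spellings and the decision to treat a bare number as rupees are assumptions; parse_capital is an illustrative helper, and real output may need more cases.

```python
import re
from decimal import Decimal

# Rupee multipliers; the accepted spellings are an assumption
UNIT_MULTIPLIERS = {
    "cr": Decimal(10_000_000),
    "crore": Decimal(10_000_000),
    "crores": Decimal(10_000_000),
    "lakh": Decimal(100_000),
    "lakhs": Decimal(100_000),
}

# Digits with optional thousands separators and decimals, then an optional unit
CAPITAL_PATTERN = re.compile(r"^\s*(\d[\d,]*(?:\.\d+)?)\s*([A-Za-z]+)?\.?\s*$")


def parse_capital(value):
    """Convert a string like '50,000 Cr' to rupees as a Decimal, else None."""
    if not value:
        return None
    match = CAPITAL_PATTERN.match(value)
    if not match:
        return None
    amount = Decimal(match.group(1).replace(",", ""))
    unit = (match.group(2) or "").lower()
    if not unit:
        return amount  # assumption: a bare number is already in rupees
    multiplier = UNIT_MULTIPLIERS.get(unit)
    return amount * multiplier if multiplier else None


for raw in ("50,000 Cr", "13,532.5 Cr", "25 Lakh", "n/a"):
    print(raw, "->", parse_capital(raw))
```

Decimal is used instead of float so that large rupee amounts are stored exactly in a NUMERIC column.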

Incomplete records for dormant companies -- companies that have not filed annual returns may have missing fields (email, PAN, last AGM date). Design your schema with nullable columns and treat missing data as a signal, not an error.
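
One way to treat missing data as a signal is to record which expected fields are absent for each company rather than rejecting the row. The field names email, pan, and last_agm_date below are guesses at how the actor might name those values, so adjust them to the real output.

```python
# Hypothetical field names; confirm against actual actor output
OPTIONAL_FIELDS = ("email", "pan", "last_agm_date")


def missing_fields(record, fields=OPTIONAL_FIELDS):
    """List the expected fields that are absent, None, or blank in a record."""
    return [f for f in fields if record.get(f) in (None, "")]


def completeness(record, fields=OPTIONAL_FIELDS):
    """Fraction of expected fields that are populated, from 0.0 to 1.0."""
    return 1 - len(missing_fields(record, fields)) / len(fields)


# Hypothetical dormant company with nothing filled in
dormant = {"cin": "U12345KA2020PTC000001", "email": None, "pan": ""}
print(missing_fields(dormant), completeness(dormant))
```

Storing the completeness value alongside each company makes it possible to filter or rank on it later, for example to flag entities that stopped filing.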

Rate-limiting at scale -- the actor enforces polite 2-second delays between lookups to avoid detection. A batch of 1,000 CINs takes approximately 35 minutes. Plan your ETL windows accordingly, and split very large batches across multiple runs.
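
The planning arithmetic can be sketched as follows: split the watchlist into fixed-size batches and estimate each run from the quoted 2-second delay. The estimate is a lower bound that ignores request time and startup overhead, and the helper names and placeholder CINs are illustrative.

```python
SECONDS_PER_LOOKUP = 2  # delay quoted above; actual throughput may differ


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def estimate_minutes(n_lookups, seconds_per_lookup=SECONDS_PER_LOOKUP):
    """Lower-bound run time in minutes from the per-lookup delay alone."""
    return n_lookups * seconds_per_lookup / 60


# Hypothetical watchlist of 2,500 placeholder CINs
watchlist = [f"U00000KA2020PTC{n:06d}" for n in range(2500)]
batches = list(chunked(watchlist, 1000))
for i, batch in enumerate(batches, start=1):
    minutes = estimate_minutes(len(batch))
    print(f"run {i}: {len(batch)} CINs, at least {minutes:.0f} min")
```

Each batch can then be passed as the queries input of a separate actor run.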

Thirdwatch handles the registry access and extraction logic so you can focus on the database and application layer. For bankruptcy cross-referencing, pair with IBBI Insolvency India.

Related use cases

MCA India Scraper actor page
Scrape MCA India company data for due diligence -- compliance-focused single-company lookups
Look up MCA directors for KYC compliance -- director identity verification
Track India company incorporations by state -- geographic incorporation trends
Build an India bankruptcy database from IBBI
The complete guide to scraping compliance data
All Thirdwatch use-case guides

Frequently asked questions

How many companies can I look up in a single run?

The actor accepts up to 10,000 CINs per run via the maxResults parameter. For larger batches, split across multiple runs. Each CIN lookup takes roughly 2 seconds due to polite rate-limiting, so a batch of 500 CINs completes in about 15-20 minutes.

What database format works best for MCA company data?

PostgreSQL with a companies table keyed on CIN and a directors junction table keyed on CIN plus DIN. This normalizes the one-to-many company-to-director relationship and enables efficient cross-company director queries for interlock analysis.
