Skip to main content
Thirdwatchthirdwatch
Compliance & registries

Build an India Bankruptcy Database From IBBI Registry Data

Build a structured India insolvency and bankruptcy database from IBBI registry data. Complete Python recipes for ingestion, dedup, schema design, and sync.

May 26, 2026 · 7 min read · 1,532 words
See the scraper →

Thirdwatch's IBBI Scraper returns structured corporate debtor and insolvency professional records from India's IBBI registry — CIN, case type, status, timelines, IP assignments, notice URLs. This guide walks through building a production database: schema design, ingestion pipeline, dedup logic, incremental sync, and downstream queries.

Why build a structured IBBI database

India's Insolvency and Bankruptcy Code (IBC) has processed over 7,700 CIRP cases since 2016, making the IBBI registry one of the most data-rich regulatory sources in the Indian compliance ecosystem. According to IBBI's annual report for FY2024-25, the cumulative claims admitted under CIRP exceed INR 10 lakh crore, and the registry adds 150-200 new case records per quarter. Yet the IBBI website offers no bulk export, no API, and no structured download — every data point requires manual lookup.

The job-to-be-done is engineering. The Reserve Bank of India's Financial Stability Report cites IBC resolution data as a key indicator for banking sector health. A fintech building a credit-risk scoring product needs IBBI data as a feature in their model pipeline. A legal-tech startup building an IBC case tracker needs structured case records with status change detection. A compliance SaaS serving NBFCs needs automated counterparty screening against the insolvency registry. An academic research group studying IBC resolution efficiency needs a clean longitudinal dataset. All require a structured database with reliable ingestion, dedup, and incremental sync — not one-off manual lookups.

How does this compare to the alternatives?

Three paths to a structured IBBI database:

Approach Time to first query Incremental sync Schema control Maintenance
Manual CSV from IBBI website Hours per batch Manual re-download None (flat file) Per-session manual work
Commercial compliance APIs (Probe42, SignalX) Days (contract + onboarding) Vendor-managed Vendor schema Annual subscription
Thirdwatch IBBI Scraper + your database 30 minutes Automated via watermark Full control Thirdwatch tracks IBBI changes

Commercial compliance APIs bundle IBBI data with other registry sources but at enterprise pricing with vendor-locked schemas. The IBBI Scraper actor page gives you raw records that you ingest into your own schema.

How to build the database in 6 steps

Step 1: How do I authenticate against Apify?

Sign in at apify.com (free tier, no credit card), open Settings and Integrations, and copy your personal API token:

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"
pip install apify-client pandas sqlalchemy

Step 2: How do I design the database schema?

Two core tables map to the IBBI actor's two entity types.

from sqlalchemy import create_engine, Column, String, Date, DateTime, Text
from sqlalchemy.orm import declarative_base, sessionmaker
from datetime import datetime

Base = declarative_base()

class CorporateDebtor(Base):
    __tablename__ = "corporate_debtors"
    id = Column(String, primary_key=True)  # cin + case_type + announcement_type_full
    cin = Column(String, index=True)
    company_name = Column(String)
    case_type = Column(String)
    announcement_type_full = Column(Text)
    status = Column(String, index=True)
    announcement_date = Column(Date)
    claim_submission_deadline = Column(Date)
    applicant_name = Column(String)
    insolvency_professional_name = Column(String)
    remarks = Column(Text)
    notice_url = Column(String)
    source_url = Column(String)
    data_source = Column(String)
    first_seen = Column(DateTime, default=datetime.utcnow)
    last_updated = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

class InsolvencyProfessional(Base):
    __tablename__ = "insolvency_professionals"
    registration_number = Column(String, primary_key=True)
    name = Column(String)
    ipa = Column(String)
    enrolment_date = Column(Date)
    city = Column(String)
    state = Column(String)
    status = Column(String, index=True)
    source_url = Column(String)
    data_source = Column(String)
    first_seen = Column(DateTime, default=datetime.utcnow)
    last_updated = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

engine = create_engine("sqlite:///ibbi_registry.db")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
print("Schema created: corporate_debtors + insolvency_professionals")

The composite primary key on corporate_debtors (cin + case_type + announcement_type_full) handles the fact that a single CIN can appear in multiple announcements across its CIRP lifecycle.

Step 3: How do I run the initial bulk ingestion?

Fetch corporate debtor records for a seed list of companies or broad queries.

from apify_client import ApifyClient
import hashlib

client = ApifyClient(os.environ["APIFY_TOKEN"])
session = Session()

SEED_QUERIES = [
    "Essar", "Bhushan", "Jaypee", "Videocon", "Amtek",
    "Alok Industries", "Lanco", "ABG Shipyard", "Ruchi Soya",
    "Electrosteel", "Monnet Ispat", "Jyoti Structures"
]

run = client.actor("thirdwatch/ibbi-scraper").call(
    run_input={
        "queries": SEED_QUERIES,
        "searchType": "corporate_debtor",
        "maxResults": 200
    }
)

items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
inserted, updated = 0, 0

for item in items:
    if item.get("entity_type") != "corporate_debtor":
        continue
    record_id = hashlib.md5(
        f"{item.get('cin', '')}{item.get('case_type', '')}{item.get('announcement_type_full', '')}".encode()
    ).hexdigest()

    existing = session.query(CorporateDebtor).get(record_id)
    if existing:
        if existing.status != item.get("status"):
            existing.status = item.get("status")
            existing.last_updated = datetime.utcnow()
            updated += 1
    else:
        session.add(CorporateDebtor(
            id=record_id,
            cin=item.get("cin"),
            company_name=item.get("company_name"),
            case_type=item.get("case_type"),
            announcement_type_full=item.get("announcement_type_full"),
            status=item.get("status"),
            announcement_date=item.get("announcement_date"),
            claim_submission_deadline=item.get("claim_submission_deadline"),
            applicant_name=item.get("applicant_name"),
            insolvency_professional_name=item.get("insolvency_professional_name"),
            remarks=item.get("remarks"),
            notice_url=item.get("notice_url"),
            source_url=item.get("source_url"),
            data_source=item.get("data_source"),
        ))
        inserted += 1

session.commit()
print(f"Bulk ingestion complete: {inserted} inserted, {updated} updated")

Twelve seed queries covering India's largest IBC cases should return 200-500 records — enough to validate the pipeline before scaling.

Step 4: How do I ingest insolvency professional records?

Run a separate query with searchType set to insolvency_professional.

ip_run = client.actor("thirdwatch/ibbi-scraper").call(
    run_input={
        "queries": ["ICSI", "ICAI", "ICMAI", "IIIPI"],
        "searchType": "insolvency_professional",
        "maxResults": 500
    }
)

ip_items = list(client.dataset(ip_run["defaultDatasetId"]).iterate_items())
ip_inserted = 0

for item in ip_items:
    if item.get("entity_type") != "insolvency_professional":
        continue
    reg_num = item.get("registration_number")
    if not reg_num:
        continue

    existing = session.query(InsolvencyProfessional).get(reg_num)
    if not existing:
        session.add(InsolvencyProfessional(
            registration_number=reg_num,
            name=item.get("name"),
            ipa=item.get("ipa"),
            enrolment_date=item.get("enrolment_date"),
            city=item.get("city"),
            state=item.get("state"),
            status=item.get("status"),
            source_url=item.get("source_url"),
            data_source=item.get("data_source"),
        ))
        ip_inserted += 1

session.commit()
print(f"IP ingestion complete: {ip_inserted} professionals inserted")

Querying by IPA names surfaces IPs across all four registered agencies.

Step 5: How do I implement incremental sync?

Use announcement_date as a watermark to fetch only new records on subsequent runs.

from sqlalchemy import func

def incremental_sync(session, client):
    latest_date = session.query(
        func.max(CorporateDebtor.announcement_date)
    ).scalar()

    run = client.actor("thirdwatch/ibbi-scraper").call(
        run_input={
            "queries": ["CIRP", "Liquidation"],
            "searchType": "corporate_debtor",
            "maxResults": 500
        }
    )

    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    new_records = [
        item for item in items
        if item.get("announcement_date") and item["announcement_date"] > str(latest_date)
    ]

    inserted, status_changed = 0, 0
    for item in new_records:
        record_id = hashlib.md5(
            f"{item.get('cin', '')}{item.get('case_type', '')}{item.get('announcement_type_full', '')}".encode()
        ).hexdigest()

        existing = session.query(CorporateDebtor).get(record_id)
        if existing and existing.status != item.get("status"):
            existing.status = item.get("status")
            existing.last_updated = datetime.utcnow()
            status_changed += 1
        elif not existing:
            session.add(CorporateDebtor(
                id=record_id, cin=item.get("cin"),
                company_name=item.get("company_name"),
                case_type=item.get("case_type"),
                announcement_type_full=item.get("announcement_type_full"),
                status=item.get("status"),
                announcement_date=item.get("announcement_date"),
                claim_submission_deadline=item.get("claim_submission_deadline"),
                applicant_name=item.get("applicant_name"),
                insolvency_professional_name=item.get("insolvency_professional_name"),
                remarks=item.get("remarks"),
                notice_url=item.get("notice_url"),
                source_url=item.get("source_url"),
                data_source=item.get("data_source"),
            ))
            inserted += 1

    session.commit()
    return inserted, status_changed

new, changed = incremental_sync(session, client)
print(f"Incremental sync: {new} new records, {changed} status changes")

Status changes (Admitted to Resolved, Admitted to Liquidation) are the highest-signal events in the dataset. Log these transitions to a separate audit table for downstream alerting.

Step 6: How do I query the database for downstream use?

Run analytical queries against the structured database.

from sqlalchemy import text

# Top insolvency professionals by case count
top_ips = session.execute(text("""
    SELECT insolvency_professional_name, COUNT(*) as case_count,
           COUNT(DISTINCT cin) as unique_companies
    FROM corporate_debtors
    WHERE insolvency_professional_name IS NOT NULL
    GROUP BY insolvency_professional_name
    ORDER BY case_count DESC
    LIMIT 15
""")).fetchall()

print("Top IPs by case volume:")
for ip_name, cases, companies in top_ips:
    print(f"  {ip_name}: {cases} cases across {companies} companies")

# Resolution rate by case type
resolution = session.execute(text("""
    SELECT case_type, status, COUNT(*) as count
    FROM corporate_debtors
    GROUP BY case_type, status
    ORDER BY case_type, count DESC
""")).fetchall()

print("\nCase outcomes by type:")
for case_type, status, count in resolution:
    print(f"  {case_type} / {status}: {count}")

These queries surface operational intelligence: which IPs handle the most cases, and what the resolution vs. liquidation ratio looks like across case types.

Sample output

A corporate debtor record and an insolvency professional record side by side.

[
  {
    "entity_type": "corporate_debtor",
    "company_name": "Alok Industries Limited",
    "cin": "L17110MH1986PLC038463",
    "case_type": "CIRP",
    "announcement_type_full": "Public Announcement under Section 15 of IBC",
    "status": "Resolved",
    "announcement_date": "2017-07-18",
    "claim_submission_deadline": "2017-08-01",
    "applicant_name": "State Bank of India",
    "insolvency_professional_name": "Ajay Joshi",
    "remarks": "Resolution plan by Reliance Industries approved",
    "notice_url": "https://ibbi.gov.in/uploads/notice/alok.pdf",
    "source_url": "https://ibbi.gov.in/corporate-debtor",
    "data_source": "IBBI"
  },
  {
    "entity_type": "insolvency_professional",
    "name": "Ajay Joshi",
    "registration_number": "IBBI/IPA-001/IP-P00123/2017-18/10456",
    "ipa": "ICSI IIP",
    "enrolment_date": "2017-01-15",
    "city": "Mumbai",
    "state": "Maharashtra",
    "status": "Active",
    "source_url": "https://ibbi.gov.in/insolvency-professional",
    "data_source": "IBBI"
  }
]

The registration_number on insolvency professional records is the canonical join key to cross-reference IP assignments across corporate debtor cases.

Common pitfalls

Six things go wrong in IBBI database builds. The National Company Law Tribunal publishes case lists that complement IBBI data for litigation tracking. Composite key collisions — using CIN alone as a primary key fails because a single company can have multiple CIRP announcements (initial, revised deadline, resolution approval). Always use a composite of cin + case_type + announcement_type_full. Date parsing inconsistency — IBBI dates appear in multiple formats (DD-MM-YYYY, YYYY-MM-DD, DD/MM/YYYY). Normalize all dates at ingestion time with a permissive parser. Status enumeration drift — IBBI has introduced new status values over the years (e.g., "Withdrawn u/s 12A" was added after the 2018 amendment). Treat status as a free-text field, not a strict enum, and add new values to your lookup table as they appear.

IP name normalization — the same insolvency professional may appear as "Mr. Satish Kumar Gupta", "Satish K. Gupta", or "S. K. Gupta" across different case records. Use registration_number as the canonical identifier; fall back to fuzzy name matching. Missing CIN on older records — pre-2018 IBBI records occasionally have blank CINs. Implement a fallback join on company_name + announcement_date. Schedule the incremental sync as a daily cron job for active monitoring or weekly for research datasets.

Related use cases

Frequently asked questions

What schema should I use for an IBBI insolvency database?

Two core tables: corporate_debtors (cin PRIMARY KEY, company_name, case_type, announcement_type_full, status, announcement_date, claim_submission_deadline, applicant_name, insolvency_professional_name, remarks, notice_url, source_url, first_seen, last_updated) and insolvency_professionals (registration_number PRIMARY KEY, name, ipa, enrolment_date, city, state, status, source_url, first_seen, last_updated). Use cin as the join key to MCA and CIBIL systems. Add a case_status_history table (cin, status, observed_date) for time-series tracking of status transitions.

How do I handle incremental sync with the IBBI registry?

Store a high-water mark (latest announcement_date seen per search query) in your database. On each sync run, fetch records and compare announcement_date against the watermark. Insert new records, update status on existing CINs where status has changed, and advance the watermark. For corporate debtors, dedup on cin + case_type + announcement_type_full (a single CIN can have multiple announcements). For insolvency professionals, dedup on registration_number. Run daily for active monitoring or weekly for research datasets.

Related

Try it yourself

100 free credits, no credit card.

About 30 real searches. Add the MCP to Claude or Cursor in two minutes.