Build an India Property Research Pipeline with 99acres (2026)

Build a production India property-research pipeline from 99acres using Thirdwatch. Multi-city + per-locality + recipes.

Apr 28, 2026 · 6 min read · 1,300 words

Thirdwatch's 99acres Scraper makes India property-research-pipeline development a structured workflow — multi-city ingestion, locality-tier enrichment, longitudinal data warehouse seeding. Built for India proptech SaaS startups, India real-estate-investment platforms, India HR-relocation services, and India PE-research SaaS founders.

Why build a 99acres research pipeline

99acres is the canonical India tier-1 metro foundation source. According to 99acres' 2024 IRIS quarterly report, the platform indexes 1.5M+ active India listings with 90%+ tier-1 metro broker representation — material foundation for India proptech products. For India proptech + real-estate-investment teams, 99acres provides the canonical multi-source India property pipeline starting point.

The job-to-be-done is structured. An India proptech SaaS startup builds a 100-locality data warehouse for customer-facing comparison tools. An India real-estate-investment platform powers per-locality investment scoring with weekly 99acres data. An India HR-relocation service offers tier-1 metro relocation briefings. An India PE-research SaaS provides society-level yield benchmarks. All reduce to multi-city ingestion + cross-snapshot enrichment + downstream-product API exposure.

How does this compare to the alternatives?

Three options for India property-research pipelines:

| Approach | Cost per 100-locality weekly | Reliability | Setup time | Maintenance |
|---|---|---|---|---|
| Knight Frank / JLL India | $20K-$100K/year | Authoritative, lagged | Weeks | Annual contract |
| Manual locality-research | Free, time-intensive | Slow | Hours/locality | Daily manual work |
| Thirdwatch 99acres Scraper | Pay per result | HTTP + structured data | 5 minutes | Thirdwatch tracks 99acres |

The 99acres Scraper actor page gives you raw real-time tier-1 data at materially lower per-record cost.

How to build the pipeline in 4 steps

Step 1: Authenticate

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"

Step 2: Ingest tier-1 metro per-locality batches

import os, requests, datetime, json, pathlib

ACTOR = "thirdwatch~acres99-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

INDIA_TIER_1 = {
    "Mumbai": ["Powai", "Bandra-West", "Andheri-West", "Lower-Parel"],
    "Delhi-NCR": ["Gurgaon", "Noida", "Saket", "Vasant-Kunj"],
    "Bangalore": ["Indiranagar", "Whitefield", "HSR-Layout", "Koramangala"],
    "Hyderabad": ["Hitech-City", "Gachibowli", "Banjara-Hills"],
    "Pune": ["Koregaon-Park", "Hinjewadi", "Aundh"],
    "Chennai": ["OMR", "Velachery", "Anna-Nagar"],
}

queries = []
for city, localities in INDIA_TIER_1.items():
    for loc in localities:
        for bhk in ["2BHK", "3BHK"]:
            for listing in ["rent", "buy"]:
                queries.append({"city": city, "locality": loc,
                                "property_type": "apartment",
                                "bhk": bhk, "listing": listing})

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={"queries": queries, "maxResults": 50},
    timeout=3600,
)
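resp.raise_for_status()  # surface HTTP errors instead of parsing an error payload
records = resp.json()
# utcnow() is deprecated since Python 3.12; use a timezone-aware UTC timestamp
ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%d")
snapshot_dir = pathlib.Path("snapshots")
snapshot_dir.mkdir(exist_ok=True)  # write_text fails if the directory is missing
(snapshot_dir / f"99acres-pipeline-{ts}.json").write_text(json.dumps(records))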
print(f"{ts}: {len(records)} listings across {len(queries)} queries")

21 city-localities × 2 BHK × 2 listing-types = 84 queries × 50 = up to 4,200 records per snapshot.
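A quick sanity check that derives the totals from the locality map, rather than hard-coding them, helps keep the snapshot-size estimate honest as localities are added. The sketch below repeats the `INDIA_TIER_1` map from Step 2 so it runs on its own; the helper name `expected_volume` is illustrative, and the record total is treated as an upper bound on the assumption that `maxResults` caps each query.

```python
# Locality map repeated from Step 2 so the check runs on its own.
INDIA_TIER_1 = {
    "Mumbai": ["Powai", "Bandra-West", "Andheri-West", "Lower-Parel"],
    "Delhi-NCR": ["Gurgaon", "Noida", "Saket", "Vasant-Kunj"],
    "Bangalore": ["Indiranagar", "Whitefield", "HSR-Layout", "Koramangala"],
    "Hyderabad": ["Hitech-City", "Gachibowli", "Banjara-Hills"],
    "Pune": ["Koregaon-Park", "Hinjewadi", "Aundh"],
    "Chennai": ["OMR", "Velachery", "Anna-Nagar"],
}
BHK_TIERS = ["2BHK", "3BHK"]
LISTING_TYPES = ["rent", "buy"]
MAX_RESULTS = 50


def expected_volume(city_map, bhk_tiers, listing_types, max_results):
    """Return (locality count, query count, upper bound on records)."""
    localities = sum(len(names) for names in city_map.values())
    queries = localities * len(bhk_tiers) * len(listing_types)
    return localities, queries, queries * max_results


n_localities, n_queries, max_records = expected_volume(
    INDIA_TIER_1, BHK_TIERS, LISTING_TYPES, MAX_RESULTS
)
print(f"{n_localities} localities -> {n_queries} queries -> at most {max_records} records")
```

Comparing `len(records)` from the actual run against `max_records` is a cheap way to spot queries that returned far fewer listings than expected.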

Step 3: Enrich + persist to PostgreSQL

import re, pandas as pd, psycopg2

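def parse_inr(s):
    """Normalize price strings such as '₹3.5 Cr', '₹35 Lac', or '₹35K' to rupees."""
    if not isinstance(s, str): return None
    s = s.lower().replace("₹", "").replace(",", "").strip()
    num = re.search(r"[\d.]+", s)
    try:
        # Test "cr" and "lac"/"lakh" before "k": the word "lakh" itself contains a "k"
        if "cr" in s: return float(num.group(0)) * 10_000_000
        if "lac" in s or "lakh" in s: return float(num.group(0)) * 100_000
        if "k" in s: return float(num.group(0)) * 1_000
        return float(s)
    except (AttributeError, ValueError):  # no number found, or a malformed number
        return None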

df = pd.DataFrame(records)
df["price_inr"] = df.price.apply(parse_inr)
df["area_sqft"] = pd.to_numeric(df.area_sqft, errors="coerce")
df["price_per_sqft"] = df.price_inr / df.area_sqft

# Locality-name normalization
LOCALITY_CANONICAL = {
    "Indira Nagar": "Indiranagar",
    "HSR": "HSR-Layout",
    # ... extend mapping
}
df["locality"] = df.locality.replace(LOCALITY_CANONICAL)

# Persist to PostgreSQL
with psycopg2.connect(...) as conn, conn.cursor() as cur:
    for _, row in df.iterrows():
        cur.execute(
            """INSERT INTO india_listings
                  (listing_id, city, locality, bhk, listing_type, price_inr,
                   area_sqft, price_per_sqft, snapshot_date)
               VALUES (%s,%s,%s,%s,%s,%s,%s,%s, current_date)
               ON CONFLICT (listing_id) DO UPDATE SET
                 price_inr = EXCLUDED.price_inr,
                 snapshot_date = EXCLUDED.snapshot_date""",
            (row.listing_id, row.city, row.locality, row.bhk, row.listing,
             row.price_inr, row.area_sqft, row.price_per_sqft),
        )
print(f"Persisted {len(df)} listings")

Step 4: Compute per-locality benchmarks + expose via API

# Per-locality benchmarks (refreshed weekly)
benchmarks = (
    df.dropna(subset=["price_per_sqft"])
    .groupby(["city", "locality", "bhk", "listing"])
    .agg(median_psf=("price_per_sqft", "median"),
         p25_psf=("price_per_sqft", lambda x: x.quantile(0.25)),
         p75_psf=("price_per_sqft", lambda x: x.quantile(0.75)),
         listing_count=("listing_id", "count"))
    .reset_index()
)
benchmarks = benchmarks[benchmarks.listing_count >= 5]
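# `engine` is a SQLAlchemy engine created elsewhere with sqlalchemy.create_engine(...)
benchmarks.to_sql("india_benchmarks", con=engine, if_exists="replace", index=False)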

# Expose via REST API (FastAPI example)
# @app.get("/api/locality/{city}/{locality}/benchmarks")
# def get_benchmarks(city: str, locality: str):
#     return query("SELECT * FROM india_benchmarks WHERE city=%s AND locality=%s",
#                  (city, locality))

print(f"{len(benchmarks)} locality-tier benchmarks ready for API")

Sample output

{
  "listing_id": "99acres-12345",
  "title": "3 BHK Apartment for Sale in Indiranagar",
  "city": "Bangalore",
  "locality": "Indiranagar",
  "price": "₹3.5 Cr",
  "price_inr": 35000000,
  "area_sqft": 1850,
  "price_per_sqft": 18919,
  "bedrooms": 3,
  "furnishing_status": "Semi-Furnished",
  "tenure": "Freehold",
  "url": "https://www.99acres.com/property-99acres-12345"
}

Common pitfalls

Three things commonly go wrong in property-pipeline development. First, locality-name variance: Indiranagar, Indira Nagar, and Indiranagara all refer to the same locality, so build a canonical-name mapping for clean longitudinal research (50-100 entries cover 99% of cases). Second, mixed price formats: listings mix Crores (₹3.5 Cr), Lakhs (₹35 Lac), and Thousands (₹35K) depending on BHK and listing type, so always normalize to base INR before benchmarking. Third, cross-platform duplicates: the same listing is often posted on both 99acres and MagicBricks, so cluster on (locality, area_sqft, bedrooms) before benchmarking for accurate inventory research.
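The cross-platform dedup step can be sketched roughly as follows. This is a minimal illustration, not the actor's own logic: it assumes a merged frame with a hypothetical `source` column naming the platform, and it buckets `area_sqft` into 25 sqft bands (an arbitrary tolerance) so that small measurement differences still land in the same cluster. Only rows from secondary sources are dropped, so distinct 99acres listings that happen to share a key are kept.

```python
import pandas as pd


def drop_cross_platform_duplicates(df, primary="99acres", area_bucket=25):
    """Drop secondary-source rows whose (locality, bedrooms, area) key matches a primary row."""
    keys = []
    for loc, beds, area in zip(df["locality"], df["bedrooms"], df["area_sqft"]):
        if pd.isna(area) or pd.isna(beds):
            keys.append(None)  # cannot cluster without area and bedroom count
        else:
            bucket = int(round(float(area) / area_bucket))
            keys.append((str(loc).strip().lower(), int(beds), bucket))

    primary_keys = {
        key for key, src in zip(keys, df["source"]) if src == primary and key is not None
    }
    # Keep every primary row; keep a secondary row only if no primary row shares its key.
    keep = [src == primary or key not in primary_keys for key, src in zip(keys, df["source"])]
    return df.loc[pd.Series(keep, index=df.index)].reset_index(drop=True)
```

Fixed buckets can still split near-identical areas that straddle a bucket boundary, so treat the result as a first pass and review borderline clusters if inventory counts matter.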

Thirdwatch's actor uses a lightweight HTTP path so you pay only for the data, not for proxy or compute overhead. Pair 99acres with MagicBricks Scraper for tier-2/3 cross-validation + NoBroker Scraper for owner-listed cross-reference. A fourth subtle issue worth flagging: India tier-1 cycles tightly correlate with tech-hiring cycles — Bangalore pricing dropped 8-12% during 2022-2023 tech layoffs, recovered 12-15% during 2024 hiring rebound; for accurate trend research, segment per tech-cycle phase. A fifth pattern unique to India proptech: society-level data isn't available on 99acres directly — for society-tier research, supplement with CommonFloor (society-skewed). A sixth and final pitfall: India fiscal-year-start (April 1) drives 30-40% of annual real-estate transaction activity; for accurate base-rate research, deseasonalize against fiscal-year cycle.
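One possible first step toward deseasonalizing against the fiscal-year cycle is to label every snapshot with its Indian fiscal year and quarter, so that comparisons are made between like-for-like periods. The sketch below relies only on the April 1 fiscal-year start mentioned above; the function name and label format are illustrative choices.

```python
import datetime


def fiscal_period(d):
    """Return (fiscal-year label, fiscal quarter) for a date, with the year starting April 1."""
    start_year = d.year if d.month >= 4 else d.year - 1
    quarter = ((d.month - 4) % 12) // 3 + 1  # Apr-Jun is Q1, Jan-Mar is Q4
    label = f"FY{start_year}-{(start_year + 1) % 100:02d}"
    return label, quarter


print(fiscal_period(datetime.date(2026, 4, 28)))
print(fiscal_period(datetime.date(2026, 3, 31)))
```

Grouping benchmarks by the returned quarter and comparing each locality against the same quarter of the previous fiscal year avoids mistaking the seasonal peak for a trend.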

Operational best practices for production pipelines

Tier the cadence: Tier 1 (active investor-research watchlist, weekly), Tier 2 (broader tier-1 coverage, monthly), Tier 3 (long-tail localities, quarterly). A properly tiered watchlist cuts cost by 60-80% with negligible signal loss.
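A minimal sketch of the tiering logic, assuming the watchlist is a plain mapping from locality to tier and that the last successful run date per locality is tracked somewhere (both structures are hypothetical). The 7/30/90-day intervals approximate the weekly, monthly, and quarterly cadences above.

```python
import datetime

# Approximate day counts for weekly, monthly, and quarterly cadences.
CADENCE_DAYS = {"tier1": 7, "tier2": 30, "tier3": 90}


def due_localities(watchlist, last_run, today):
    """Return the localities whose tier interval has elapsed since their last run."""
    due = []
    for locality, tier in watchlist.items():
        previous = last_run.get(locality)
        if previous is None or (today - previous).days >= CADENCE_DAYS[tier]:
            due.append(locality)  # never scraped, or interval elapsed
    return due
```

A scheduler can call `due_localities` once a day and build queries only for the returned localities, which is where the cost reduction comes from.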

Snapshot raw payloads with gzip compression. Re-derive per-locality benchmarks from raw JSON as your locality-name + BHK-classification logic evolves. Cross-snapshot diff alerts on per-locality price-velocity catch India real-estate-cycle inflection points.
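Compressed raw snapshots need nothing beyond the standard library. The sketch below shows one way to pack and unpack a snapshot; the helper names are illustrative, and the file path in the usage note is an assumption that mirrors the Step 2 naming.

```python
import gzip
import json


def pack_snapshot(records):
    """Serialize records to gzip-compressed UTF-8 JSON bytes."""
    raw = json.dumps(records, ensure_ascii=False).encode("utf-8")
    return gzip.compress(raw)


def unpack_snapshot(blob):
    """Inverse of pack_snapshot: return the original list of records."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

Writing is then a one-liner such as `pathlib.Path("snapshots/99acres-pipeline-20260428.json.gz").write_bytes(pack_snapshot(records))`, and benchmarks can be re-derived later by passing the file's bytes to `unpack_snapshot`.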

Validate the schema. Run a daily validation suite asserting that expected core fields keep non-null rates above 80% (required fields) and 50% (optional fields). The 99acres schema occasionally changes during platform UI revisions, so catch drift early. A seventh pattern at scale: cross-snapshot diff alerts for material price shifts (>5% Q/Q at locality level) catch market-cycle inflection points before broader market awareness. An eighth pattern for cost-controlled teams: implement an incremental-diff pipeline that only re-processes records whose hash changed since the previous snapshot. For watchlists where 90%+ of records are unchanged between snapshots, hash-comparison-driven incremental processing reduces downstream compute by 80-90% while preserving full data fidelity.
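The hash-driven incremental step might look like the sketch below. It assumes each record carries a stable `listing_id` and that the previous snapshot's hashes are stored as a dict; the `scraped_at` field in the ignore list is a hypothetical example of a volatile field that should not count as a change.

```python
import hashlib
import json


def record_hash(record, ignore=("scraped_at",)):
    """Hash a record's content, skipping volatile fields that change on every scrape."""
    payload = {k: v for k, v in record.items() if k not in ignore}
    canonical = json.dumps(payload, sort_keys=True, ensure_ascii=False, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_records(records, previous_hashes, id_field="listing_id"):
    """Return (records that are new or changed, hashes to store for the next run)."""
    changed, current_hashes = [], {}
    for rec in records:
        digest = record_hash(rec)
        current_hashes[rec[id_field]] = digest
        if previous_hashes.get(rec[id_field]) != digest:
            changed.append(rec)
    return changed, current_hashes
```

Only the `changed` list needs to go through enrichment and upsert; `current_hashes` is persisted and passed back in as `previous_hashes` on the next snapshot.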

A ninth pattern unique to research-grade data work: schema validation should run continuously, not just at pipeline build-time. Alert on schema breakage the same day so consumers don't degrade silently.
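A non-null-rate check of this kind can be sketched in a few lines. The split of fields into required and optional below is an assumption based on the sample output, not a documented contract of the actor; adjust both lists to the fields your consumers depend on.

```python
# Assumed field split, based on the sample output; adjust to your own schema.
REQUIRED_FIELDS = ["listing_id", "city", "locality", "price"]
OPTIONAL_FIELDS = ["area_sqft", "furnishing_status"]


def non_null_rate(records, field):
    """Share of records where the field is present and neither None nor empty."""
    if not records:
        return 0.0
    present = sum(1 for rec in records if rec.get(field) not in (None, ""))
    return present / len(records)


def validate_snapshot(records, required=REQUIRED_FIELDS, optional=OPTIONAL_FIELDS,
                      required_min=0.80, optional_min=0.50):
    """Return {field: observed rate} for every field below its threshold."""
    failures = {}
    for fields, minimum in ((required, required_min), (optional, optional_min)):
        for field in fields:
            rate = non_null_rate(records, field)
            if rate < minimum:
                failures[field] = rate
    return failures
```

An empty result means the snapshot passed; anything else is a candidate for a same-day alert.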

A tenth pattern around alert-fatigue management: tune alert thresholds quarterly based on actual analyst-action rates. If analysts ignore 80%+ of alerts at a given threshold, raise the threshold. If they manually surface signals the alerts missed, lower the threshold.
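The quarterly tuning rule can be written down as a small function. This is only a sketch: the 10% adjustment step and the choice to prioritize missed signals over ignored alerts when both occur are assumptions, while the 80% ignore limit comes from the rule of thumb above.

```python
def tune_threshold(threshold, fired, actioned, missed, step=0.10, ignore_limit=0.80):
    """Suggest next quarter's alert threshold from this quarter's analyst behavior."""
    if missed > 0:
        # Analysts surfaced signals the alerts missed: make the alert more sensitive.
        return threshold * (1 - step)
    if fired > 0 and (fired - actioned) / fired >= ignore_limit:
        # Most alerts were ignored: make the alert less sensitive.
        return threshold * (1 + step)
    return threshold
```

For example, a 5% price-shift threshold with 100 alerts fired, 10 acted on, and no missed signals would be raised to roughly 5.5%.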

An eleventh and final pattern at production scale: cross-snapshot diff alerts. Beyond detecting individual changes, build alerts on cross-snapshot field-level diffs — name changes, category re-classifications, status changes. These structural changes precede or follow material events and are leading indicators of organization-level disruption. Persist a structured-diff log alongside aggregate snapshots: for each entity, persist (field, old_value, new_value) tuples per scrape. Surface high-leverage diffs to human reviewers; low-leverage diffs stay in the audit log.
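A structured-diff log of (field, old_value, new_value) tuples could be produced along these lines. The sketch assumes records are keyed by `listing_id` and that the caller chooses which fields to track; the output row layout is illustrative.

```python
def field_diffs(old, new, fields):
    """Return (field, old_value, new_value) tuples for tracked fields that changed."""
    return [(f, old.get(f), new.get(f)) for f in fields if old.get(f) != new.get(f)]


def diff_snapshots(previous, current, fields, id_field="listing_id"):
    """Build a flat diff log for entities present in both snapshots."""
    previous_by_id = {rec[id_field]: rec for rec in previous}
    log = []
    for rec in current:
        before = previous_by_id.get(rec[id_field])
        if before is None:
            continue  # new entity: nothing to diff against
        for field, old_value, new_value in field_diffs(before, rec, fields):
            log.append({"id": rec[id_field], "field": field,
                        "old_value": old_value, "new_value": new_value})
    return log
```

Each log row can be appended to an audit table per scrape; a later filter decides which fields count as high-leverage and go to a reviewer.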

A twelfth pattern: cost attribution per consumer. Tag every API call with a downstream-consumer identifier (team, product, feature) so you can attribute compute spend back to the workflow that drove it. When a downstream consumer's spend exceeds projected budget, you can have a precise conversation with them about the queries driving cost.
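As a rough sketch of consumer-level attribution, assume each run is logged with a `consumer` tag and a `result_count` (both hypothetical field names). Because pricing is pay-per-result, result counts serve as the spend proxy here and no price per result is assumed.

```python
from collections import defaultdict


def results_by_consumer(run_log):
    """Total results pulled per downstream consumer tag."""
    totals = defaultdict(int)
    for run in run_log:
        totals[run["consumer"]] += run["result_count"]
    return dict(totals)


def over_budget(totals, budgets):
    """Consumers whose usage exceeds their budget; untagged budgets default to zero."""
    return sorted(c for c, used in totals.items() if used > budgets.get(c, 0))
```

Running `over_budget` after each snapshot gives a short list of consumers to talk to, along with the exact runs in the log that drove their usage.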

Frequently asked questions

Why build a 99acres property research pipeline?

99acres (InfoEdge) is the canonical India tier-1 metro real-estate aggregator with 1.5M+ active listings + 90%+ Mumbai/Delhi/Bangalore broker representation. According to 99acres' 2024 IRIS report, the platform powers India real-estate transaction-data feeding into RBI Housing Price Index. For India proptech platforms, real-estate-investment SaaS, and India PE-research functions, 99acres provides the canonical foundation feed.

What does a production property pipeline architecture look like?

Three-stage pipeline: (1) ingestion (weekly per-city per-locality scrapes); (2) enrichment (locality-name normalization, BHK-tier classification, capital-rental yield computation); (3) persistence (PostgreSQL with cross-snapshot history). Output: per-locality longitudinal data warehouse with rental + capital + yield benchmarks updated weekly.
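The capital-rental yield computation in the enrichment stage can be sketched from the Step 4 benchmark table. The example assumes rent listings quote a monthly price (so the rent-side `median_psf` is rupees per sqft per month) and that the frame has the `city`, `locality`, `bhk`, `listing`, and `median_psf` columns produced in Step 4; the output column name is illustrative.

```python
import pandas as pd


def gross_rental_yield(benchmarks):
    """Gross yield (%) per city/locality/BHK: annualized rent per sqft over sale price per sqft."""
    rent = benchmarks[benchmarks["listing"] == "rent"]
    buy = benchmarks[benchmarks["listing"] == "buy"]
    merged = rent.merge(buy, on=["city", "locality", "bhk"], suffixes=("_rent", "_buy"))
    # Assumes rent is monthly, hence the factor of 12.
    merged["gross_yield_pct"] = merged["median_psf_rent"] * 12 / merged["median_psf_buy"] * 100
    return merged[["city", "locality", "bhk", "gross_yield_pct"]]
```

This compares medians of two different listing populations, so it is a coarse locality-level signal rather than a yield for any specific property.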

How fresh do pipeline snapshots need to be?

Weekly cadence catches meaningful India tier-1 shifts. Monthly cadence captures faster-moving Bangalore + Hyderabad markets (post-tech-cycle). For active investor-research, weekly snapshots produce stable trend data. India tier-1 metros move materially faster than tier-2/3: Bangalore can shift 5-10% within a quarter after major tech layoffs or hiring waves.

How do I scale to 100+ tier-1 localities?

100 localities × 4 BHK-tiers × 2 listing-types (rent+sale) = 800 queries × 50 records = 40K records weekly. Compute: ~30 min run-time on Apify. For 10K-locality scale (full India coverage including tier-2/3), partition into geographic batches + parallelize 4-8 actor instances.
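Partitioning and parallelizing can be sketched with the standard library. Here `run_batch` is a hypothetical callable supplied by the caller (for instance, a wrapper around the `requests.post` call from Step 2 that returns a list of records), and the batch size and worker count are illustrative defaults within the 4-8 instance range mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_batches(queries, run_batch, batch_size=100, workers=4):
    """Run each query batch through run_batch in parallel and flatten the results."""
    batches = chunk(queries, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves batch order, so results line up with the input queries.
        results = list(pool.map(run_batch, batches))
    return [record for batch in results for record in batch]
```

Threads are a reasonable fit because each batch spends its time waiting on an HTTP response; grouping queries by city before chunking gives the geographic partitioning.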

Can I integrate with proptech downstream products?

Yes. Production pipeline pattern: (1) actor pulls weekly; (2) enrichment Lambda normalizes data; (3) PostgreSQL upsert with snapshot-history; (4) downstream products query via REST API + Snowflake. Most India proptech SaaS startups (Squareyards, NoBroker analytics) use this pattern. Build time: 2-3 weeks for full pipeline + downstream integration.

How does this compare to Knight Frank + JLL India research?

[Knight Frank India](https://www.knightfrank.com/india) + [JLL India](https://www.jll.in/) bundle India real-estate research at $20K-$100K/year, lagged 30-90 days. The actor delivers raw real-time per-locality 99acres data on pay-per-result pricing. For programmatic India property pipelines (auto-scoring + auto-categorization), the actor at scale is materially cheaper. For curated qualitative India trend-narratives, consultancies still add value.
