Build an India Property Price Database From Square Yards
Build a normalized India property price database using Thirdwatch's Square Yards Scraper. Schema design, dedup logic, and incremental refresh recipes.

Thirdwatch's Square Yards Scraper is the extraction layer for building a normalized India property price database. This guide covers schema design, incremental refresh logic, deduplication, price normalization, and multi-city ingestion pipelines. Built for developers building proptech products, property analytics dashboards, or internal tools that need structured India residential data at scale.
Why build a property database from Square Yards
India's residential real estate market is valued at over $300 billion according to IBEF's 2025 real estate sector report, yet no public API provides structured listing-level data. Government indices like NHB RESIDEX publish quarterly aggregates for 50 cities, not individual listings. Paid data providers charge enterprise rates and deliver static exports, not live feeds.
For developers building proptech applications -- price comparison tools, investment calculators, market intelligence dashboards, or property recommendation engines -- the gap between aggregate indices and listing-level data is the entire product. You need individual records with price, area, coordinates, bedroom count, developer name, and furnishing status, refreshed on a cadence that tracks market movement.
Square Yards covers 40+ Indian cities with developer-verified listings. The scraper returns 22 structured fields per listing, including latitude and longitude for spatial queries and developer_name for supply-side analysis. The data pipeline described below turns raw scraper output into a queryable, deduplicated, time-series-capable property database.
How does this compare to the alternatives?
| Approach | Listing-level data | API access | Incremental refresh | Schema control |
|---|---|---|---|---|
| NHB RESIDEX | No (aggregate index only) | CSV download | Quarterly | Fixed |
| PropEquity / Liases Foras | Yes (paid) | Bulk export | Monthly | Vendor-defined |
| Manual scraping (custom scripts) | Yes | Self-maintained | On-demand | Full |
| Thirdwatch Square Yards Scraper | Yes (22 fields) | Apify API + webhooks | On-demand | Full |
Custom scraping scripts require ongoing maintenance when the site redesigns. The Square Yards Scraper abstracts that maintenance and provides a stable API contract for your ingestion pipeline. For a broader India real estate view, combine with 99acres or MagicBricks data from other Thirdwatch actors.
How to build the database in 6 steps
Step 1: How do I design the database schema?
Two tables: a current-state listings table and a time-series snapshots table.
CREATE TABLE listings (
url TEXT PRIMARY KEY,
title TEXT,
property_for TEXT, -- 'sale' or 'rent'
property_type TEXT,
bedrooms INTEGER,
bathrooms INTEGER,
area REAL,
floor TEXT,
address TEXT,
locality TEXT,
city TEXT,
latitude REAL,
longitude REAL,
developer_name TEXT,
project_name TEXT,
price BIGINT,
rent_monthly INTEGER,
price_per_sqft REAL,
listed_by TEXT,
available_from TEXT,
furnishing TEXT,
first_seen DATE,
last_seen DATE,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE snapshots (
id SERIAL PRIMARY KEY,
url TEXT REFERENCES listings(url),
price BIGINT,
rent_monthly INTEGER,
snapshot_date DATE,
UNIQUE(url, snapshot_date)
);
CREATE INDEX idx_listings_city_locality ON listings(city, locality);
CREATE INDEX idx_listings_coords ON listings USING GIST (
ST_SetSRID(ST_MakePoint(longitude, latitude), 4326)
);
CREATE INDEX idx_snapshots_date ON snapshots(snapshot_date);The url field is the natural key -- stable across snapshots. The spatial index enables PostGIS radius queries for location-based applications.
Step 2: How do I authenticate and install dependencies?
pip install apify-client psycopg2-binary pandas
export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"
export DATABASE_URL="postgresql://user:pass@localhost:5432/propdb"Step 3: How do I run the scraper across multiple cities?
import os
from apify_client import ApifyClient
client = ApifyClient(os.environ["APIFY_TOKEN"])
CITIES = [
"Mumbai", "Delhi", "Bangalore", "Hyderabad", "Pune",
"Chennai", "Kolkata", "Ahmedabad", "Gurgaon", "Noida",
]
all_items = []
for city_batch in [CITIES[i:i+3] for i in range(0, len(CITIES), 3)]:
run = client.actor("thirdwatch/squareyards-scraper").call(run_input={
"queries": city_batch,
"maxResults": 500,
"propertyFor": "sale",
})
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
all_items.extend(items)
print(f"Batch {city_batch}: {len(items)} listings")
print(f"Total: {len(all_items)} listings across {len(CITIES)} cities")Batching 3 cities per run keeps individual run times predictable. For rental data, run a parallel pass with "propertyFor": "rent".
Step 4: How do I normalize and deduplicate before ingestion?
import pandas as pd
from datetime import date
df = pd.DataFrame(all_items)
# Deduplicate within batch (same URL can appear in overlapping locality searches)
df = df.drop_duplicates(subset=["url"], keep="first")
# Compute price per sqft
df["price_per_sqft"] = df.apply(
lambda r: r["price"] / r["area"] if r.get("area") and r["area"] > 0 else None,
axis=1,
)
# Add snapshot metadata
df["snapshot_date"] = date.today().isoformat()
# Validate critical fields
df = df[df["url"].notna() & (df["url"].str.len() > 10)]
df = df[df["city"].notna()]
print(f"Clean records: {len(df)}")
print(f"Cities: {df['city'].nunique()}")
print(f"Records with coordinates: {df['latitude'].notna().sum()}")Step 5: How do I upsert into the database?
import psycopg2
from psycopg2.extras import execute_values
conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()
# Upsert listings (update price + last_seen on conflict)
listing_rows = [
(r["url"], r.get("title"), r.get("property_for"), r.get("property_type"),
r.get("bedrooms"), r.get("bathrooms"), r.get("area"), r.get("floor"),
r.get("address"), r.get("locality"), r.get("city"),
r.get("latitude"), r.get("longitude"), r.get("developer_name"),
r.get("project_name"), r.get("price"), r.get("rent_monthly"),
r.get("price_per_sqft"), r.get("listed_by"), r.get("available_from"),
r.get("furnishing"), r["snapshot_date"], r["snapshot_date"])
for _, r in df.iterrows()
]
execute_values(cur, """
INSERT INTO listings (url, title, property_for, property_type,
bedrooms, bathrooms, area, floor, address, locality, city,
latitude, longitude, developer_name, project_name,
price, rent_monthly, price_per_sqft, listed_by,
available_from, furnishing, first_seen, last_seen)
VALUES %s
ON CONFLICT (url) DO UPDATE SET
price = EXCLUDED.price,
rent_monthly = EXCLUDED.rent_monthly,
price_per_sqft = EXCLUDED.price_per_sqft,
last_seen = EXCLUDED.last_seen,
updated_at = CURRENT_TIMESTAMP
""", listing_rows)
# Insert snapshot records
snapshot_rows = [
(r["url"], r.get("price"), r.get("rent_monthly"), r["snapshot_date"])
for _, r in df.iterrows()
]
execute_values(cur, """
INSERT INTO snapshots (url, price, rent_monthly, snapshot_date)
VALUES %s
ON CONFLICT (url, snapshot_date) DO NOTHING
""", snapshot_rows)
conn.commit()
print(f"Upserted {len(listing_rows)} listings, {len(snapshot_rows)} snapshots")Step 6: How do I schedule incremental refreshes?
Use Apify's scheduling to trigger weekly runs, then poll for completion and ingest:
import time
def refresh_city(city: str, property_for: str = "sale"):
run = client.actor("thirdwatch/squareyards-scraper").call(
run_input={
"queries": [city],
"maxResults": 500,
"propertyFor": property_for,
},
wait_secs=300,
)
return list(client.dataset(run["defaultDatasetId"]).iterate_items())
# Weekly refresh across all cities
for city in CITIES:
items = refresh_city(city)
# ... normalize and upsert as above
print(f"{city}: {len(items)} listings refreshed")Schedule this as a cron job or use Apify's built-in scheduler to trigger runs at a fixed weekly cadence.
Sample output
Two records from a Hyderabad query:
[
{
"title": "3 BHK Apartment in My Home Bhooja",
"property_for": "sale",
"property_type": "Apartment",
"bedrooms": 3,
"bathrooms": 3,
"area": 2150,
"floor": "8th of 30",
"address": "Madhapur, Hyderabad",
"locality": "Madhapur",
"city": "Hyderabad",
"latitude": 17.4400,
"longitude": 78.3900,
"developer_name": "My Home Group",
"project_name": "My Home Bhooja",
"price": 24500000,
"rent_monthly": null,
"listed_by": "Developer",
"furnishing": "Semi-Furnished"
},
{
"title": "2 BHK Apartment in Aparna Sarovar Grande",
"property_for": "sale",
"property_type": "Apartment",
"bedrooms": 2,
"bathrooms": 2,
"area": 1350,
"floor": "3rd of 14",
"address": "Nallagandla, Hyderabad",
"locality": "Nallagandla",
"city": "Hyderabad",
"latitude": 17.4580,
"longitude": 78.3150,
"developer_name": "Aparna Constructions",
"project_name": "Aparna Sarovar Grande",
"price": 9200000,
"rent_monthly": null,
"listed_by": "Developer",
"furnishing": "Unfurnished"
}
]Common pitfalls
Four things break in production property database pipelines. URL instability -- Square Yards occasionally changes URL slugs during site redesigns; store a hash of (city + locality + project_name + bedrooms + area) as a fallback dedup key alongside the URL. Price outliers -- a small fraction of listings have placeholder prices (1 rupee or 999999999); filter with a reasonable floor and ceiling per city before computing aggregates. Coordinate drift -- some listings share project-level coordinates rather than unit-level; when building spatial features, cluster by project_name first to avoid treating one project as multiple data points. Schema drift -- if Square Yards adds or renames fields, downstream code that hard-codes column names will break; validate the schema of each batch against the expected field list before ingestion and log warnings on mismatches.
Related use cases
Frequently asked questions
What schema should I use for an India property price database?
A listings table with url as the primary natural key, plus city, locality, property_type, bedrooms, area, price, price_per_sqft (computed), latitude, longitude, developer_name, furnishing, listed_by, and a snapshot_date timestamp. Add a separate snapshots table keyed on (url, snapshot_date) for time-series tracking. Partition by city for query performance once you exceed a few hundred thousand rows.
How do I handle deduplication across snapshots?
Use the url field as the natural key. On each snapshot ingest, upsert into the listings table (update price, available_from, and snapshot_date if the url already exists) and insert into the snapshots table unconditionally. This gives you a current-state view in listings and a full history in snapshots. Dedup before insert by dropping rows with duplicate urls within the same snapshot batch.
Related
100 free credits, no credit card.
About 30 real searches. Add the MCP to Claude or Cursor in two minutes.