The Complete Guide to Scraping Business Data (2026)
Pick the right Thirdwatch scraper for any local-business or B2B data use case — Google Maps, JustDial, IndiaMart, Yelp. Decision tree + lead-gen recipes.

Thirdwatch publishes 4 business-data scrapers covering Google Maps (global), JustDial (India local), IndiaMart (India B2B suppliers), and Yelp (US restaurants/services). This guide is the decision tree for picking the right one (or combination) for your use case — B2B lead-gen, local-business directories, supplier discovery, territory mapping.
The business-data scraping landscape
Local business data is fragmented by geography. According to Google's 2024 Maps Business statistics, Google Maps indexes 200M+ business listings globally with phone-number coverage above 75% for active businesses. But India's long tail (Tier 2/3 cities, local services) lives more on JustDial than Google Maps; B2B-supplier data lives on IndiaMart; restaurants and consumer services have richer Yelp coverage in the US.
For a B2B sales-prospecting team, the right answer is usually Google Maps as primary + 1 regional supplement. For India-specific lead-gen, all three (Google Maps + JustDial + IndiaMart). For US restaurant/service-based operations, Google Maps + Yelp.
Compare Thirdwatch business-data scrapers
| Scraper | Coverage | Approach | Cost/1K | Best for |
|---|---|---|---|---|
| Google Maps | Global, 200M businesses | HTTP + internal API | Pay per result | Global lead-gen |
| JustDial | India local services | TLS-level fingerprinting + Next.js | Pay per result | India lead-gen |
| IndiaMart | India B2B suppliers | httpx + residential | Pay per result | B2B sourcing India |
| Yelp | US restaurants/services | production-grade anti-bot tooling + cookies | Pay per result | US restaurant research |
Decision tree
"I'm doing US/UK/EU B2B local lead-gen." Google Maps alone covers 85%+ of viable contacts. 200-query category × neighborhood batches per metro produce 5K-record cohorts — well within budget for a recurring outbound refresh. Dedupe on place_id, normalize phone to E.164, route to outbound dialing.
"I'm doing India local-services lead-gen." Google Maps (metros) + JustDial (Tier 2/3 cities + service businesses Google Maps under-indexes). For comprehensive India coverage, run both. JustDial is materially better for healthcare/wellness, education, and home services in non-metro India.
"I'm sourcing B2B suppliers in India." IndiaMart is canonical (manufacturers, wholesalers, GST-verified suppliers with MOQ + price-band data). For category-level supplier discovery (machinery, textiles, packaging, chemicals), IndiaMart covers the broadest supplier universe at the lowest cost.
"I'm building a restaurant directory in the US." Google Maps (canonical, fresh data) + Yelp (review depth). Google Maps for discovery + filtering by rating/cuisine; Yelp for review-text mining. Cross-reference on (name, lat, lng) for canonical-business matching.
"I'm building a global B2B contact-enrichment pipeline." Google Maps for local-business surface + ZoomInfo/Apollo for enterprise B2B. Google Maps catches the long-tail SMB segment those tools under-index.
"I'm a field-sales team mapping territory." Google Maps with category × neighborhood × radius queries. The lat/lng + place_id fields enable PostGIS-based territory assignment + clustering.
Cross-source recipe: India full-coverage business pipeline
import os, requests, pandas as pd
TOKEN = os.environ["APIFY_TOKEN"]
def run(actor, payload, timeout=3600):
r = requests.post(
f"https://api.apify.com/v2/acts/{actor}/run-sync-get-dataset-items",
params={"token": TOKEN}, json=payload, timeout=timeout
)
return r.json()
CATEGORIES = ["dentist", "physiotherapist", "diagnostic center"]
CITIES = ["Mumbai", "Pune", "Hyderabad", "Bangalore"]
queries = [f"{c} {city}" for c in CATEGORIES for city in CITIES]
gmaps = run("thirdwatch~google-maps-scraper",
{"queries": queries, "maxResults": 100})
jd = run("thirdwatch~justdial-business-scraper",
{"queries": queries, "maxResults": 100})
g = pd.DataFrame(gmaps).assign(source="google_maps")
j = pd.DataFrame(jd).assign(source="justdial")
# Normalize: phone to E.164, lower-case name + city
import re
def normalize_phone(p, cc="91"):
if not isinstance(p, str): return None
digits = re.sub(r"\D", "", p)
if not digits: return None
if not digits.startswith(cc) and len(digits) == 10:
digits = cc + digits
return f"+{digits}"
for d in [g, j]:
d["phone_e164"] = d.phone.apply(normalize_phone)
d["name_norm"] = d.name.str.lower().str.strip()
combined = pd.concat([g, j], ignore_index=True)
combined = combined.drop_duplicates(subset=["phone_e164"], keep="first")
print(f"{len(combined)} unique businesses across {combined.source.nunique()} sources")
print(combined.source.value_counts())Cross-source dedup on E.164 phone yields 30-50% larger pool than either source alone — JustDial catches mid/Tier-2 city operators Google Maps under-indexes.
All use-case guides for business-data scrapers
Google Maps
- Scrape Google Maps businesses for lead-gen
- Build local-business database with Google Maps
- Find restaurants by cuisine and rating
- Scrape business phone and website from Google Maps
JustDial
- Scrape JustDial for local business leads
- Build India local services directory from JustDial
- Monitor JustDial reviews for reputation
- Find clinics and doctors in India with JustDial
IndiaMart
- Find B2B suppliers in India with IndiaMart
- Build procurement shortlist from IndiaMart
- Monitor wholesale prices in India with IndiaMart
- Verify Indian suppliers GST and MOQ at scale
(Yelp use-case guides in Wave 3.)
Common patterns across business-data scrapers
Canonical natural keys.
- Google Maps:
place_id - JustDial:
business_idper listing - IndiaMart:
supplier_idper supplier - Yelp:
(business-slug, location)
Phone normalization. All sources return phone in human-readable formats; normalize to E.164 (+13125550123) before storage and downstream use.
Closed-business handling. Google Maps occasionally retains records for permanently-closed businesses; filter on is_temporarily_closed: false and is_permanently_closed: false to exclude.
Compliance considerations. Phone numbers from Google Maps are public records but pre-call notification requirements vary by jurisdiction (TCPA in US, GDPR ePrivacy in EU). Layer compliance-validation on top of raw extraction.
Source-coverage drift. Google's category-list changes over time; if a category you used to scrape now returns sparse results, that's Google's indexing decision rather than scraper failure.
Operational best practices for production pipelines
A handful of patterns matter more than the per-actor specifics once you're running these scrapers in production at scale.
Tier the cadence to match signal half-life. Daily polling is canonical for monitoring use cases (price drift, hiring velocity, brand mentions), but most teams over-poll long-tail watchlist items. Tier the watchlist into Tier 1 (high-stakes, hourly), Tier 2 (active monitoring, daily), Tier 3 (research-only, weekly). Typical 60-80% cost reduction with negligible signal loss.
Snapshot raw payloads alongside derived fields. Pipeline cost is dominated by scrape volume, not storage. Persisting the raw JSON response per snapshot lets you re-derive metrics without re-scraping when your sentiment model improves, your category-classifier evolves, or you discover a previously-ignored field. Compress with gzip at write-time (4-8x size reduction).
Three-tier retention. Most production pipelines run: 90 days of raw snapshots at full fidelity + 12 months of derived per-record aggregates + indefinite retention of derived metric time-series. Storage costs stay flat at scale; query patterns map cleanly to the right retention tier.
Cross-source dedup via the canonical 4-tuple. Across-source dedup (LinkedIn vs Indeed vs Google Jobs; Talabat vs Deliveroo vs Noon Food; Trustpilot vs G2) typically uses (name-norm, location-norm, identifier-norm, key-numeric). Within-source dedup uses each platform's stable natural key (place_id, asin, videoId, shortcode, etc.). Both are essential — get either wrong and metrics become noisy.
Validate live before declaring fields stable. Schemas drift. Run a daily validation suite that asserts each scraper returns the expected core fields with non-null rates above 80% (for required fields) and 50% (for optional). Alert on schema breakage same-day.
Tag scrape timestamps in every record. Platform-displayed timestamps lag actual events by minutes-to-hours. For accurate freshness analysis, treat (platform_timestamp, scrape_timestamp) as a tuple — the larger of the two is your "actively-listed since" anchor, the smaller is your "first-detected" anchor.
Frequently asked questions
Which scraper for global B2B lead-gen?
Google Maps is the canonical global source — 200M+ business listings with phone, website, hours, lat/lng, place_id. About 75-85% have phone, 60-70% have website. For US/UK/EU lead-gen, Google Maps alone covers 85%+ of viable contacts on a pay-per-result basis. For India-specific, layer in JustDial; for B2B suppliers, IndiaMart.
What's the right query strategy for lead-gen?
Three patterns: (1) `category + city` (`dentist Chicago`); (2) `niche + city` (`pediatric dentist Chicago`); (3) `category + neighborhood` (`dentist Lincoln Park Chicago`). Combine all three for comprehensive coverage. For metro-level lead-gen, 200+ queries spanning categories × neighborhoods captures most local businesses. Dedupe on `place_id`.
How fresh is the data?
Google Maps updates phone/website fields within 48 hours of business owner edits. For active outbound campaigns, refresh quarterly — that's the cadence at which a meaningful share of small businesses change phone numbers or websites. For static directory builds, annual refresh is sufficient.
Is the phone field reliable for outbound dialing?
Yes, with caveats. Numbers on Google Maps are submitted by business owners or auto-discovered from website crawls; for active operators they're 95%+ accurate. About 5-10% return wrong-number / voicemail-only / out-of-service. For cold-call sales pipelines, validate with a phone-number-validation service before mass dialing. Normalize to E.164 before storing.
What about India-specific business data?
JustDial for local-services lead-gen (clinics, restaurants, services in 200+ Indian cities). IndiaMart for B2B suppliers (manufacturers, wholesalers, GST-verified). Both are India-focused and Google Maps misses meaningful India long-tail. For comprehensive India coverage: Google Maps (international + metros) + JustDial (Tier 2/3 cities + local services) + IndiaMart (B2B suppliers). Together cover 95%+ of India business landscape.
How does this compare to ZoomInfo or Apollo?
ZoomInfo and Apollo bundle business data with company-firmographic enrichment + email/phone validation at $5K-$15K/year per seat. They're stronger on enterprise B2B contact data; weaker on local-business coverage. The actor delivers raw business data on a pay-per-result basis — for high-volume local lead-gen, the actor is materially cheaper. For full-stack B2B sales operations, ZoomInfo/Apollo win.
Related
100 free credits, no credit card.
About 30 real searches. Add the MCP to Claude or Cursor in two minutes.