Build US Local Services Directory from Yelp (2026)

Published April 28, 2026 · 1506 words · For developers

Thirdwatch's Yelp Scraper lets US directory-builders, content-aggregators, and local-services lead-gen platforms ingest 100K+ businesses at $0.008 per record — name, phone, website, address, hours, reviews, photos, categories, price range. Built for US local-services directory products, hyperlocal lead-gen, and content-aggregator platforms.

Why build a US directory from Yelp

US local-services discovery happens largely on Yelp + Google. According to Yelp's 2024 Local Search report, the platform indexes 6M+ active US businesses across 22 service categories — legal (200K+ lawyers), medical (500K+ providers), home services (1M+ contractors), automotive, beauty, restaurants. For US directory-builders + content-aggregator platforms competing in local-services SEO, Yelp data is the canonical source.

The job-to-be-done is structured. A US legal-services directory startup ingests 100K+ lawyer profiles for SEO-driven content (Boston Personal Injury Lawyers, Chicago Family Law, etc.). A medical-services aggregator surfaces 500K+ providers for patient-search products. A home-services lead-gen platform builds per-metro contractor databases. A travel + lifestyle content publisher mines Yelp for editorial city-guide content. All reduce to category + metro queries + per-business detail aggregation.

How does this compare to the alternatives?

Three options for US directory data:

Approach	Cost per 100K records monthly	Reliability	Setup time	Maintenance
Yelp Fusion API	Free (5K/day cap)	Official	Days (use-case approval)	Strict TOS + rate limits
Manual aggregation from multiple sources	Effectively unbounded analyst time	Patchy	Continuous	Doesn't scale
Thirdwatch Yelp Scraper	$800 ($0.008 × 100K)	Camoufox + cookie pool	5 minutes	Thirdwatch tracks Yelp changes

The Yelp Fusion API is rate-limited to 5K/day. The Yelp Scraper actor gives you raw directory data at scale without API gatekeeping.

How to build a directory in 4 steps

Step 1: How do I authenticate against Apify?

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"

Step 2: How do I pull a category × metro batch?

Pass category + city queries.

import os, requests, pandas as pd
from itertools import product

ACTOR = "thirdwatch~yelp-business-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

CATEGORIES = ["personal injury lawyer", "pediatrician",
              "plumber", "electrician", "yoga studio",
              "dentist", "chiropractor", "veterinarian"]
METROS = ["New York", "Los Angeles", "Chicago", "Houston",
          "Phoenix", "Philadelphia", "San Antonio", "San Diego"]

queries = [f"{c} {m}" for c, m in product(CATEGORIES, METROS)]

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={"queries": queries, "maxResults": 100},
    timeout=3600,
)
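resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
df = pd.DataFrame(resp.json())
df = df.drop_duplicates(subset=["business_id"])
# City is the second-to-last comma-separated part of "street, city, state zip"
df["city"] = df["address"].str.split(",").str[-2].str.strip()
print(f"{len(df)} unique businesses across {df['city'].nunique()} cities")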

8 categories × 8 metros = 64 queries × 100 results = up to 6,400 records, costing at most about $51 at $0.008 per record.
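The same arithmetic generalizes to any category × metro grid. The sketch below assumes the $0.008 per-record price quoted in this article and that every query returns its full maxResults, so it gives an upper bound rather than an exact bill. The helper name estimate_batch is hypothetical and not part of the actor.

```python
PRICE_PER_RECORD = 0.008  # USD per record, as quoted in this article


def estimate_batch(n_categories, n_metros, max_results, price=PRICE_PER_RECORD):
    """Return (query_count, max_records, max_cost_usd) for a category x metro grid."""
    queries = n_categories * n_metros
    max_records = queries * max_results  # upper bound: every query fills maxResults
    return queries, max_records, round(max_records * price, 2)


queries, records, cost = estimate_batch(8, 8, 100)
print(f"{queries} queries, up to {records} records, up to ${cost}")
```

Because overlapping queries return duplicates, the number of unique businesses after deduplication will usually be lower than the record count billed.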

Step 3: How do I extract structured directory schema?

Build per-business schema for SEO-driven content pages.

def build_directory_schema(row):
    # Missing list fields arrive as NaN in a DataFrame row, so guard before slicing
    photos = row.get("photos")
    return {
        "business_id": row.get("business_id"),
        "name": row.get("name"),
        "category": row.get("category"),
        "all_categories": row.get("all_categories"),
        "address": row.get("address"),
        "phone": row.get("phone"),
        "website": row.get("website"),
        "rating": row.get("rating"),
        "review_count": row.get("review_count"),
        "price_range": row.get("price_range"),
        "hours": row.get("hours"),
        "lat": row.get("lat"),
        "lng": row.get("lng"),
        "photos": photos[:3] if isinstance(photos, list) else [],  # first 3 photos
        "yelp_url": row.get("url"),
        "is_open": row.get("is_open", True),
    }

directory = [build_directory_schema(r) for _, r in df.iterrows() if r.get("rating", 0) >= 3.5]
print(f"{len(directory)} businesses in directory (3.5+ rating)")

The 3.5+ rating threshold keeps viable directory-listing candidates. Sub-3.5 ratings typically signal poor-quality businesses that hurt user trust in the directory.

Step 4: How do I push to Postgres + build SEO pages?

Upsert per business_id + generate static-site directory pages.

import pathlib, psycopg2

with psycopg2.connect(...) as conn, conn.cursor() as cur:
    for biz in directory:
        cur.execute(
            """INSERT INTO local_services
                  (business_id, name, category, address, phone, website,
                   rating, review_count, lat, lng, last_scraped)
               VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s, current_date)
               ON CONFLICT (business_id) DO UPDATE SET
                 rating = EXCLUDED.rating,
                 review_count = EXCLUDED.review_count,
                 last_scraped = current_date""",
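            # business_id is the URL slug; drop any query string or trailing slash first
            (biz["yelp_url"].split("?")[0].rstrip("/").split("/")[-1], biz["name"], biz["category"],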
             biz["address"], biz["phone"], biz["website"], biz["rating"],
             biz["review_count"], biz["lat"], biz["lng"])
        )

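# Generate one directory page per (category, city)
# Records carry no "city" field, so derive it from "street, city, state zip"
pages = df.assign(city=df["address"].str.split(",").str[-2].str.strip())
for (cat, city), grp in pages.groupby(["category", "city"]):
    fname = f"{cat}_{city}.md".replace(" ", "_").replace("/", "-")
    out = pathlib.Path("directory") / fname
    out.parent.mkdir(parents=True, exist_ok=True)
    lines = [f"# Best {cat.title()} in {city}\n"]
    # Highest-rated first, so "Best" matches what the page lists
    for _, b in grp.sort_values("rating", ascending=False).head(15).iterrows():
        lines.append(f"## {b['name']}\n- Rating: {b.rating} ({b.review_count} reviews)\n"
                     f"- {b.address}\n- Phone: {b.phone}\n")
    out.write_text("\n".join(lines), encoding="utf-8")
print(f"Generated {len(list(pathlib.Path('directory').glob('*.md')))} directory pages")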

Static-site-generator-ready directory pages enable SEO-driven traffic acquisition for local-services directory products.

Sample output

A single Yelp business record looks like this. Five rows weigh ~10 KB.

{
  "business_id": "smith-associates-personal-injury-boston",
  "name": "Smith & Associates Personal Injury Attorneys",
  "category": "Personal Injury Law",
  "all_categories": ["Personal Injury Law", "Lawyers", "Legal Services"],
  "address": "100 State St, Boston, MA 02109",
  "phone": "+1-617-555-0100",
  "website": "https://smithpersonalinjury.com",
  "rating": 4.8,
  "review_count": 245,
  "price_range": "$$$",
  "lat": 42.3597,
  "lng": -71.0567,
  "hours": ["Mon-Fri: 9 AM-6 PM", "Sat-Sun: Closed"],
  "photos": ["https://s3-media2.fl.yelpcdn.com/..."],
  "url": "https://www.yelp.com/biz/smith-associates-personal-injury-boston"
}

business_id (URL slug) is the canonical natural key. all_categories (vs primary category) catches multi-category businesses critical for directory cross-categorization. price_range ($-$$$$) enables price-band filtering useful for service-comparison content.
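To make the cross-categorization point concrete, here is a minimal sketch that indexes records under every entry in all_categories and turns price_range into a numeric band. It assumes records shaped like the sample above; index_by_category and price_band are hypothetical helper names.

```python
from collections import defaultdict


def index_by_category(records):
    """Map each category to the business_ids listed under it."""
    index = defaultdict(list)
    for rec in records:
        # Fall back to the primary category when all_categories is missing
        cats = rec.get("all_categories") or [rec.get("category")]
        # dict.fromkeys drops repeated categories while keeping their order
        for cat in dict.fromkeys(c for c in cats if c):
            index[cat].append(rec["business_id"])
    return dict(index)


def price_band(rec):
    """Convert '$' .. '$$$$' into 1 .. 4, or None when no price is listed."""
    price = rec.get("price_range")
    return len(price) if price else None
```

With this index, the sample record would appear on the Personal Injury Law, Lawyers, and Legal Services pages rather than only under its primary category.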

Common pitfalls

Three things go wrong in directory pipelines. Closed-business retention — Yelp shows permanently-closed businesses with is_closed: true flag; filter to is_closed: false strictly. Multi-location chain-confusion — chain businesses (LA Fitness, Starbucks) have separate business_id per location; for chain-aware research, group by name + filter by lat/lng cluster. Review-text licensing — Yelp's TOS restricts wholesale republication of review text; for directory products, link to Yelp business pages rather than republishing full reviews.
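The first two pitfalls can be handled with a small pre-processing pass. This is a sketch under stated assumptions: it relies on the is_closed flag described above being present on each record, and it groups chains by normalized name only, leaving the lat/lng clustering step out. The function names are hypothetical.

```python
from collections import defaultdict


def drop_closed(records):
    """Keep only records that are not flagged as permanently closed."""
    return [r for r in records if not r.get("is_closed", False)]


def group_chains(records, min_locations=2):
    """Group business_ids by normalized name; names with several ids are likely chains."""
    by_name = defaultdict(list)
    for r in records:
        # Lowercase and collapse whitespace so "Starbucks " matches "starbucks"
        key = " ".join((r.get("name") or "").lower().split())
        by_name[key].append(r["business_id"])
    return {name: ids for name, ids in by_name.items()
            if name and len(ids) >= min_locations}
```

Name matching alone will also group unrelated businesses that happen to share a name, which is why the text suggests adding a lat/lng filter for chain-aware research.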

Thirdwatch's actor uses Camoufox + cookie preservation at $5/1K, ~40% margin. Pair Yelp with Google Maps Scraper for cross-source coverage. Three further issues are worth flagging. Attribution: the Yelp Fusion API terms of service require attribution and a link back to yelp.com for any commercial use, so make sure directory products credit Yelp as the data source. Page density: SEO-driven content pages need at least 15-20 businesses per (category, city) page to rank well, and sparse pages (under 5 businesses) tend not to rank; for low-density rural areas, widen the page to a regional scope or aggregate at county level. Review volume: review counts vary dramatically by category. Restaurants and retail get many reviews per business (50-500+), while medical and legal get fewer (10-50) due to confidentiality concerns, so segment review-count thresholds by category rather than applying a uniform threshold.
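Two of these checks are easy to automate. The sketch below is illustrative only: the per-category floors in MIN_REVIEWS are made-up starting values to tune against your own data, the city field is assumed to have been derived from the address beforehand, and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical per-category minimum review counts, not values published by Yelp
MIN_REVIEWS = {"Restaurants": 50, "Personal Injury Law": 10}
DEFAULT_MIN_REVIEWS = 20


def passes_review_floor(rec):
    """Apply a category-specific review-count floor instead of a uniform one."""
    floor = MIN_REVIEWS.get(rec.get("category"), DEFAULT_MIN_REVIEWS)
    return (rec.get("review_count") or 0) >= floor


def sparse_pages(records, min_per_page=15):
    """Return (category, city) pages that have fewer businesses than the minimum."""
    counts = Counter((r.get("category"), r.get("city")) for r in records)
    return {page: n for page, n in counts.items() if n < min_per_page}
```

Pages returned by sparse_pages are candidates for merging into a regional or county-level page before publishing.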

Operational best practices for production pipelines

Tier the cadence to match signal half-life. US business data changes slowly, so weekly polling on top categories plus monthly polling on the long tail covers most use cases. A properly tiered watchlist cuts cost by 60-80% with negligible signal loss.
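A tiered schedule can be as simple as a per-tier interval and a due check. This sketch assumes two tiers matching the weekly and monthly cadences above; the tier names, the watchlist shape, and the function names are all hypothetical.

```python
from datetime import date, timedelta

# Weekly for top categories, monthly for the long tail, as described above
CADENCE_DAYS = {"top": 7, "long_tail": 30}


def is_due(tier, last_scraped, today):
    """True when at least the tier's interval has passed since the last scrape."""
    return today - last_scraped >= timedelta(days=CADENCE_DAYS[tier])


def due_queries(watchlist, today):
    """Return the queries whose tier interval has elapsed."""
    return [w["query"] for w in watchlist
            if is_due(w["tier"], w["last_scraped"], today)]
```

Running due_queries once a day and sending only its result to the actor keeps long-tail queries from being re-scraped more often than they change.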

Snapshot raw payloads. Pipeline cost is dominated by scrape volume, not storage. Persisting raw JSON snapshots lets you re-derive metrics — particularly useful as your category-classifier evolves with new Yelp taxonomy releases.
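One way to persist raw payloads is gzip-compressed JSON Lines, one file per query per day. This is a sketch using only the standard library; the directory layout and the function names are assumptions, not something the actor prescribes.

```python
import gzip
import json
from datetime import date
from pathlib import Path


def snapshot_path(root, query, day):
    """Build <root>/<YYYY-MM-DD>/<query_slug>.jsonl.gz for a query and date."""
    slug = "_".join(query.lower().split())
    return Path(root) / day.isoformat() / f"{slug}.jsonl.gz"


def write_snapshot(path, records):
    """Write raw records as gzip-compressed JSON Lines, one record per line."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_snapshot(path):
    """Load a snapshot back into a list of dicts for re-deriving metrics."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Keeping the payload untouched means a later change to the category classifier only requires re-reading snapshots, not re-scraping.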

Schema validation. Run a daily validation suite asserting expected core fields with non-null rates above 80% (required) and 50% (optional). Yelp schema occasionally changes during platform UI revisions — catch drift early. Cross-snapshot diff alerts on business-status changes (active → closed) catch market-velocity signals.
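The validation suite and the status diff could look roughly like this. The split of fields into REQUIRED and OPTIONAL is an assumption made for illustration, as are the function names; the 80% and 50% floors come from the text, and is_closed is the flag described under common pitfalls.

```python
# Assumed field split for illustration; adjust to the fields your product depends on
REQUIRED = ["business_id", "name", "address"]
OPTIONAL = ["phone", "website", "hours"]


def non_null_rate(records, field):
    """Share of records where the field is present and not empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, "", []))
    return filled / len(records)


def validate(records, req_min=0.8, opt_min=0.5):
    """Return {field: rate} for every field whose non-null rate is below its floor."""
    failures = {}
    for fields, floor in ((REQUIRED, req_min), (OPTIONAL, opt_min)):
        for field in fields:
            rate = non_null_rate(records, field)
            if rate < floor:
                failures[field] = rate
    return failures


def newly_closed(previous, current):
    """business_ids that were open in the previous snapshot and are closed now."""
    was_open = {r["business_id"] for r in previous if not r.get("is_closed")}
    return sorted(r["business_id"] for r in current
                  if r.get("is_closed") and r["business_id"] in was_open)
```

An empty result from validate means the snapshot passed; any entry is a field to investigate for schema drift.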

Frequently asked questions

Why build a US local-services directory from Yelp?

Yelp dominates US local-services discovery (legal, medical, home services, automotive, beauty). According to Yelp's 2024 report, the platform indexes 6M+ US businesses across 22 service categories. For US directory-builder products, content-aggregator platforms, and local-services lead-gen tools, Yelp is the canonical content source. Fresh data and review depth make Yelp materially deeper than Google Maps for US service-business research.

What's the right query strategy?

Three patterns: (1) category + city (personal injury lawyer Boston, pediatrician Chicago); (2) niche + neighborhood (yoga studio Williamsburg); (3) category + zip-code for hyperlocal coverage. For metro-level directory coverage, target 100+ category-city pairs across top 50 US metros = 5,000+ queries returning 100K+ businesses.
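The three patterns can share one query builder. In this sketch each pattern is a (terms, places) pair, and repeated queries are dropped so the same search is not billed twice. build_queries is a hypothetical helper, and the zip code in the usage note is only an example value.

```python
from itertools import product


def build_queries(patterns):
    """Expand (terms, places) pairs into unique '<term> <place>' query strings."""
    queries = []
    for terms, places in patterns:
        queries.extend(f"{term} {place}" for term, place in product(terms, places))
    # dict.fromkeys removes repeats while preserving first-seen order
    return list(dict.fromkeys(queries))
```

For example, build_queries([(["pediatrician"], ["Chicago"]), (["yoga studio"], ["Williamsburg"]), (["pediatrician"], ["60614"])]) yields one query per pattern, ready to pass as the queries input.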

How do I dedupe across overlapping queries?

Yelp's business_id (URL slug) is the canonical natural key per business. Cross-query overlap is typically 20-30% (especially for businesses in multiple categories). Dedupe on business_id before treating as unique inventory. For chain-businesses, each location has its own business_id.
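To check the overlap rate on your own batches, a dedupe pass can report it alongside the unique records. A minimal sketch, assuming every record carries business_id; the function name is hypothetical.

```python
def dedupe(records, key="business_id"):
    """Return (unique_records, overlap_rate), keeping the first record per key."""
    seen = {}
    for rec in records:
        seen.setdefault(rec[key], rec)  # first occurrence wins
    unique = list(seen.values())
    overlap = 1 - len(unique) / len(records) if records else 0.0
    return unique, overlap
```

An overlap well above the typical 20-30% suggests the query grid is too dense, for example neighborhood and zip-code queries covering the same area.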

How fresh do directory snapshots need to be?

For active directory products serving real-time consumer queries, weekly cadence catches new listings within 7 days. For SEO-driven content directories (long-form pages indexed by Google), monthly cadence suffices. For hyperlocal-services lead-gen (legal, medical), daily snapshots capture new providers + closed businesses for accurate availability.

Can I monetize directory products legally?

Yes. Yelp data is publicly accessible. Many US local-services directory products (Healthgrades, Avvo, Houzz) compete with Yelp using scraped + curated data. For commercial products: (1) attribute Yelp as data source; (2) avoid wholesale republication of review text; (3) link out to Yelp business pages for full reviews; (4) layer your own value-add (better filtering, AI-summaries, lead-routing).

How does this compare to Yelp Fusion API?

Yelp Fusion API is gated behind use-case approval + 5K/day rate limit on free tier. The actor delivers similar coverage at $0.008/record without rate-limit ceiling. For low-volume one-off research (under 5K/day), Yelp Fusion API is cheapest. For high-volume directory-builder products, the actor scales without API gatekeeping.

Run the Yelp Scraper on Apify Store — pay-per-record, free to try, no credit card to test.