Build a Multi-Source Jobs Feed with Google Jobs (2026)

Build a 20+ board jobs aggregator using Thirdwatch's Google Jobs Scraper. Postgres ingestion + source-priority dedupe + Meilisearch recipes.

Apr 27, 2026 · 5 min read · 1,075 words

Thirdwatch's Google Jobs Scraper anchors a multi-source jobs aggregator — one query covers 20+ boards (Indeed, LinkedIn, Glassdoor, ZipRecruiter, Monster, CareerBuilder, specialist boards). This guide is the canonical recipe for building a jobs feed on Google Jobs as the discovery layer with optional direct-source enrichment, Postgres ingestion, source-priority dedupe, and Meilisearch faceted search.

Why anchor on Google Jobs

Google Jobs is the most efficient single jobs-discovery surface. According to Google's 2024 Search statistics, Google fields more than 800 million job-related searches monthly with the Google Jobs panel driving 30%+ of click-throughs. For aggregator builders, that means anchoring on Google Jobs collapses what would otherwise be a 5-10 source integration project into one — and the resulting dataset attributes each listing to its underlying board.

The use cases share one shape. An aggregator developer wants comprehensive cross-board US coverage from one ingestion pipeline. A staffing-tech SaaS needs a jobs feed embedded in its product covering as many sources as possible. A workforce-analytics platform requires a longitudinal jobs dataset spanning all major boards. A recruiting agency builds an internal aggregator to consolidate sourcing across multiple boards. All four reduce to the same pipeline: Google Jobs pull → cross-source dedupe → Postgres + search-index ingestion.

How does this compare to the alternatives?

Three options for building a multi-source jobs feed:

Approach	Cost per 100K jobs/day	Reliability	Setup time	Maintenance
Per-source scrapers (LinkedIn + Indeed + Monster + ZipRecruiter + ...)	Pay per result × 4-8 sources	Production-tested per source	Half a day per source	Per-source maintenance
Curated jobs-feed licence (LinkedIn Talent Insights, Indeed Hiring Insights)	$50K–$300K/year	High	Months	Vendor lock-in
Thirdwatch Google Jobs Scraper anchor	Pay per result	Production-tested, 20+ source aggregation	One day	Thirdwatch tracks Google changes

Per-source scraping gives the deepest coverage but requires maintaining each scraper. The Google Jobs Scraper actor collapses discovery into one call.

How to build a multi-source jobs feed in 4 steps

Step 1: How do I authenticate against Apify?

Sign in at apify.com (free tier, no credit card), open Settings → Integrations, and copy your personal API token. Every example below assumes the token is in APIFY_TOKEN:

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"

Step 2: How do I pull a multi-keyword multi-metro Google Jobs feed?

Spawn parallel runs across role × metro combinations.

import os, requests, time

ACTOR = "thirdwatch~google-jobs-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

ROLES = ["software engineer", "data scientist", "product manager",
         "registered nurse", "accountant"]
METROS = ["New York NY", "San Francisco CA", "Austin TX",
          "Chicago IL", "Boston MA"]

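run_ids = []
for role in ROLES:
    for metro in METROS:
        # Start one asynchronous actor run per role × metro query.
        r = requests.post(
            f"https://api.apify.com/v2/acts/{ACTOR}/runs",
            params={"token": TOKEN},
            json={"queries": [f"{role} {metro}"], "country": "us",
                  "maxResults": 50},
            timeout=30,
        )
        r.raise_for_status()  # fail fast on a bad token or actor name
        run_ids.append((role, metro, r.json()["data"]["id"]))
        time.sleep(0.5)  # brief pause between run starts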

print(f"Spawned {len(run_ids)} role × metro runs")

5 roles × 5 metros = 25 runs × 50 jobs = up to 1,250 listings.
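The fan-out arithmetic can be checked before any runs are started. The following is a minimal sketch that reuses the ROLES and METROS lists from the example above; `plan_queries` is a hypothetical helper introduced here for illustration, not part of the actor or the Apify client.

```python
from itertools import product

ROLES = ["software engineer", "data scientist", "product manager",
         "registered nurse", "accountant"]
METROS = ["New York NY", "San Francisco CA", "Austin TX",
          "Chicago IL", "Boston MA"]


def plan_queries(roles, metros, max_results=50):
    """Return the query strings and the listing ceiling for a fan-out."""
    queries = [f"{role} {metro}" for role, metro in product(roles, metros)]
    # This is a ceiling only: a run can return fewer rows than max_results.
    return queries, len(queries) * max_results


queries, ceiling = plan_queries(ROLES, METROS)
print(f"{len(queries)} runs, up to {ceiling} listings")
```

Doubling either list doubles the run count, so it is worth printing the ceiling before scheduling a larger matrix.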

Step 3: How do I dedupe across source attributions?

Build a canonical dedupe key from the normalised title, company, and location. Within Google Jobs alone the same posting can appear under multiple sources (LinkedIn + Glassdoor + Indeed for the same Microsoft job).

import re, time, requests
import pandas as pd

# Reuses TOKEN and run_ids from step 2.

def normalise(s):
    return re.sub(r"\W+", " ", (s or "").lower()).strip()

all_jobs = []
for role, metro, run_id in run_ids:
    while True:
        s = requests.get(f"https://api.apify.com/v2/actor-runs/{run_id}",
                         params={"token": TOKEN}).json()["data"]["status"]
        # TIMED-OUT is also a terminal status; without it the loop never exits.
        if s in ("SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"):
            break
        time.sleep(20)
    if s != "SUCCEEDED":
        continue
    items = requests.get(f"https://api.apify.com/v2/actor-runs/{run_id}/dataset/items",
                         params={"token": TOKEN}).json()
    all_jobs.extend(items)

df = pd.DataFrame(all_jobs)
df["dedupe_key"] = (
    df.title.fillna("").apply(normalise) + "|"
    + df.company_name.fillna("").apply(normalise) + "|"
    + df.location.fillna("").apply(normalise)
)
SOURCE_PRIORITY = {"LinkedIn": 0, "Indeed": 1, "Glassdoor": 2,
                    "ZipRecruiter": 3, "Monster": 4}
df["priority"] = df.source.map(SOURCE_PRIORITY).fillna(99)
unique = (df.sort_values(["dedupe_key", "priority"])
            .drop_duplicates(subset=["dedupe_key"], keep="first"))
print(f"Deduped: {len(df)} → {len(unique)} unique ({len(unique)/len(df):.0%})")

Source priority matters because LinkedIn descriptions are typically richer than Glassdoor descriptions for the same job; preferring LinkedIn rows in dedupe gives downstream consumers better data.
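To see the priority rule at work without pandas, here is a small standard-library sketch of the same idea on three toy rows. The rows are made up for illustration, and `dedupe` is a hypothetical helper that mirrors the sort-then-drop-duplicates logic above rather than replacing it.

```python
import re

SOURCE_PRIORITY = {"LinkedIn": 0, "Indeed": 1, "Glassdoor": 2,
                   "ZipRecruiter": 3, "Monster": 4}


def normalise(s):
    return re.sub(r"\W+", " ", (s or "").lower()).strip()


def dedupe(rows):
    """Keep one row per (title, company, location), preferring better-ranked sources."""
    best = {}
    for row in rows:
        key = "|".join(normalise(row.get(f))
                       for f in ("title", "company_name", "location"))
        rank = SOURCE_PRIORITY.get(row.get("source"), 99)  # unknown boards rank last
        if key not in best or rank < best[key][0]:
            best[key] = (rank, row)
    return [row for _, row in best.values()]


rows = [
    {"title": "Data Analyst", "company_name": "Amazon",
     "location": "Seattle, WA", "source": "Glassdoor"},
    {"title": "data analyst ", "company_name": "AMAZON",
     "location": "Seattle WA", "source": "LinkedIn"},
    {"title": "Data Analyst", "company_name": "Amazon",
     "location": "Austin, TX", "source": "Dice"},
]
kept = dedupe(rows)
print([r["source"] for r in kept])
```

The two Seattle rows collapse into one because normalisation strips case and punctuation, and the LinkedIn copy wins on priority. The Austin row survives because its location differs.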

Step 4: How do I serve faceted search via Postgres + Meilisearch?

For under 100K active listings, Postgres handles this at sub-100ms: a GIN index serves full-text search and a B-tree index serves the facet filters. Past that, push to Meilisearch.

CREATE TABLE jobs_feed (
  dedupe_key       text PRIMARY KEY,
  title            text NOT NULL,
  company_name     text NOT NULL,
  location         text,
  salary           text,
  description      text,
  job_type         text,
  source           text,
  apply_url        text,
  posted_date      text,
  scraped_at       timestamptz DEFAULT now()
);
CREATE INDEX jobs_search_idx ON jobs_feed USING gin (
  to_tsvector('english', coalesce(title,'') || ' ' || coalesce(description,''))
);
CREATE INDEX jobs_filter_idx ON jobs_feed (source, job_type, posted_date DESC);
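The schema above can be loaded with an upsert keyed on dedupe_key, so re-scraped listings refresh in place instead of failing on the primary key. The following is one possible sketch: `build_upsert` is a hypothetical helper, and the `%s` placeholder style assumes a DB-API driver such as psycopg, which the original does not specify.

```python
COLUMNS = ["dedupe_key", "title", "company_name", "location", "salary",
           "description", "job_type", "source", "apply_url", "posted_date"]


def build_upsert(row):
    """Return (sql, params) for a DB-API driver that uses %s placeholders."""
    cols = ", ".join(COLUMNS)
    marks = ", ".join(["%s"] * len(COLUMNS))
    # Refresh every column except the key, and bump scraped_at on each re-sighting.
    updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in COLUMNS[1:])
    sql = (f"INSERT INTO jobs_feed ({cols}) VALUES ({marks}) "
           f"ON CONFLICT (dedupe_key) DO UPDATE SET {updates}, scraped_at = now()")
    # Missing fields become None, which the driver sends as NULL.
    params = tuple(row.get(c) for c in COLUMNS)
    return sql, params


sql, params = build_upsert({"dedupe_key": "data analyst|amazon|seattle wa",
                            "title": "Data Analyst", "company_name": "Amazon",
                            "source": "LinkedIn"})
# Assumed usage with an open cursor: cur.execute(sql, params)
```

Because the update overwrites with the newest row, run the source-priority dedupe before ingestion so a lower-priority copy does not replace a richer one.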

Push to Meilisearch for typo-tolerant faceted search:

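import hashlib, math, os
import meilisearch

client = meilisearch.Client("http://meilisearch:7700", os.environ["MEILI_KEY"])
index = client.index("jobs")
index.update_settings({
    "filterableAttributes": ["source", "job_type", "location"],
    "sortableAttributes": ["posted_date"],
    "searchableAttributes": ["title", "company_name", "description"],
})

docs = []
for rec in unique.to_dict("records"):
    # NaN is not valid JSON, so send missing values as null.
    doc = {k: (None if isinstance(v, float) and math.isnan(v) else v)
           for k, v in rec.items()}
    # Meilisearch ids allow only a-z, A-Z, 0-9, hyphens and underscores,
    # so hash the dedupe key instead of using it directly.
    doc["id"] = hashlib.sha1(rec["dedupe_key"].encode("utf-8")).hexdigest()
    docs.append(doc)
index.add_documents(docs, primary_key="id")
print(f"Indexed {len(docs)} jobs in Meilisearch")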

Pair with a Next.js or Astro frontend; users get sub-100ms typo-tolerant search across daily-fresh aggregated jobs.
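On the query side, the frontend needs to turn selected facets into a Meilisearch filter expression. Here is a minimal sketch of that step; `build_filter` is a hypothetical helper, and the attribute names are the filterable attributes configured above.

```python
def build_filter(**facets):
    """Build a Meilisearch filter expression from attribute=value pairs."""
    clauses = []
    for attr, value in facets.items():
        if value is None:
            continue  # skip facets the user did not select
        # Escape backslashes and double quotes inside the quoted value.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{attr} = "{escaped}"')
    return " AND ".join(clauses)


expr = build_filter(source="LinkedIn", job_type="Full-time", location=None)
search_params = {"filter": expr, "facets": ["source", "job_type"], "limit": 20}
print(expr)
# Assumed usage with the index from above:
# index.search("data analyst", search_params)
```

Requesting `facets` alongside the filter returns per-value counts, which is what a source or job-type sidebar would typically display.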

Sample output

A single deduped record looks like this. Five rows weigh ~10 KB.

{
  "title": "Data Analyst",
  "company_name": "Amazon",
  "location": "Seattle, WA",
  "salary": "$85,000 - $120,000",
  "description": "We are looking for a Data Analyst to join our team and drive decisions across supply chain operations...",
  "job_type": "Full-time",
  "source": "LinkedIn",
  "apply_url": "https://www.linkedin.com/jobs/view/123456789",
  "posted_date": "2 days ago"
}

source distinguishes which board originated each listing — useful for quality-tier filtering and for source-attribution UX in the aggregator. apply_url points to the original board's apply page (not a Google redirect), preserving recruiter-tracking parameters.
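Note that posted_date in the sample is a relative string ("2 days ago"), which does not sort chronologically as text. One way to handle this, sketched below under assumptions, is to convert it to an absolute timestamp at ingestion time. `parse_posted` is a hypothetical helper, and only the "N units ago" shape is handled because that is the only format the sample shows; anything else is left as None.

```python
import re
from datetime import datetime, timedelta, timezone

UNITS = {"minute": "minutes", "hour": "hours", "day": "days", "week": "weeks"}


def parse_posted(text, scraped_at):
    """Convert '2 days ago' style strings to an absolute datetime, or None."""
    m = re.match(r"\s*(\d+)\+?\s*(minute|hour|day|week)s?\s+ago",
                 (text or "").lower())
    if not m:
        return None  # unknown format: leave null rather than guess
    amount, unit = int(m.group(1)), m.group(2)
    return scraped_at - timedelta(**{UNITS[unit]: amount})


scraped = datetime(2026, 4, 27, 12, 0, tzinfo=timezone.utc)
posted = parse_posted("2 days ago", scraped)
print(posted.isoformat())
```

Storing the result in a timestamptz column (or as a Unix timestamp in the search index) makes date sorting and time-window filters behave as expected.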

Common pitfalls

Three things go wrong in production multi-source aggregators. Description truncation — Google Jobs sometimes returns shorter descriptions than original source postings; for full content, follow up with the source-board scraper using apply_url. Source-coverage drift — Google's set of indexed boards changes over time; if a particular source you used to see (e.g., Dice for tech) starts disappearing, that's Google's indexing decision, not a scraper failure. Salary-format inconsistency — salary is a free-text string varying by source; for analytics, parse to numerics with regex but tolerate ~30% null rate.

Thirdwatch's actor uses production-grade handling for Google's anti-bot defenses. The 4096 MB max memory and 3,600-second timeout headroom mean even 25-run batch fan-outs complete cleanly. Pair Google Jobs with LinkedIn Jobs Scraper, Indeed Scraper, and Monster Scraper for source-board enrichment after Google Jobs discovery.
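For the salary pitfall, a regex pass along the following lines is a reasonable starting point. This is a sketch, not the actor's behaviour: `parse_salary` is a hypothetical helper, it assumes dollar amounts with optional thousands separators or a K suffix, and it does not distinguish hourly from annual figures.

```python
import re


def parse_salary(text):
    """Return (min, max) as ints from strings like '$85,000 - $120,000'."""
    amounts = []
    pattern = r"\$?\s*(\d[\d,]*(?:\.\d+)?)\s*([kK])?"
    for num, suffix in re.findall(pattern, text or ""):
        value = float(num.replace(",", ""))
        if suffix:
            value *= 1000  # "90K" means 90,000
        amounts.append(int(value))
    if not amounts:
        return None, None  # free text such as "Competitive" stays null
    return min(amounts), max(amounts)


print(parse_salary("$85,000 - $120,000"))
print(parse_salary("$90K-$110K a year"))
print(parse_salary("Competitive"))
```

Rows that return (None, None) are the null share mentioned above; keep the original salary string alongside the parsed values so nothing is lost.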

Related use cases

Frequently asked questions

Why anchor an aggregator on Google Jobs vs scraping each board separately?

Google Jobs aggregates listings from 20+ boards (Indeed, LinkedIn, Glassdoor, ZipRecruiter, Monster, CareerBuilder, and dozens of specialist boards) into a single search response. For aggregator builders this collapses 5-10 source integrations into one — and the resulting dataset includes source attribution per row, so you still know which board originated each listing. The trade-off: Google Jobs descriptions are sometimes shorter than the originals, and Google's coverage lags some boards by hours.

How much does it cost?

Thirdwatch's Google Jobs Scraper uses pay-per-result pricing with volume tiers, so cost scales with how many jobs you pull and drops at higher tiers. A focused aggregator (10 keywords × 5 metros × 100 jobs daily) runs cheaply enough to schedule daily — competitive with running 5+ direct-source scrapers separately.

How do I dedupe Google Jobs against direct-source scrapers?

Google Jobs returns its own consolidated record per job, but the same posting will also appear in your direct LinkedIn or Indeed scrapes if you run those. Dedupe on (title-norm, company-norm, location-norm, salary_min) when merging. About 50-60% of Google Jobs rows overlap with direct-source rows; the unique 40-50% are listings on smaller boards your direct scrapers don't cover.
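A merge along these lines might look like the sketch below. It assumes each row already carries a numeric salary_min field (the sample output only has a free-text salary, so that field would have to be derived first), and `merge_key` and `merge_feeds` are hypothetical helpers.

```python
import re


def normalise(s):
    return re.sub(r"\W+", " ", (s or "").lower()).strip()


def merge_key(row):
    """4-part key: normalised title, company, location, plus salary_min."""
    return (normalise(row.get("title")), normalise(row.get("company_name")),
            normalise(row.get("location")), row.get("salary_min"))


def merge_feeds(direct_rows, google_rows):
    """Keep every direct-source row; add Google rows only when the key is new."""
    merged = {merge_key(r): r for r in direct_rows}
    added = 0
    for row in google_rows:
        key = merge_key(row)
        if key not in merged:
            merged[key] = row
            added += 1
    return list(merged.values()), added


direct = [{"title": "Data Analyst", "company_name": "Amazon",
           "location": "Seattle, WA", "salary_min": 85000, "source": "LinkedIn"}]
google = [{"title": "data analyst", "company_name": "Amazon",
           "location": "Seattle WA", "salary_min": 85000, "source": "Glassdoor"},
          {"title": "Payroll Clerk", "company_name": "Acme Freight",
           "location": "Austin, TX", "salary_min": None, "source": "Dice"}]
merged, added = merge_feeds(direct, google)
print(len(merged), added)
```

One caveat of including salary_min in the key: if one source omits the salary and another states it, the two copies will not match, so a fallback to the three text fields may be worth considering.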

What database schema works for a Google-Jobs-anchored aggregator?

Postgres with GIN full-text on title + description handles up to 1 million active listings comfortably at sub-100ms search. Past 1M, push to Meilisearch or Typesense. Store source attribution per record so faceted search can offer per-source filtering. Index posted_date for time-based filtering and salary fields for band-based filtering.

What's the right strategy for source enrichment?

Use Google Jobs as discovery (one query covers 20+ boards) and direct-source scrapers (LinkedIn, Indeed, Monster) for deep enrichment on top-priority listings — Google Jobs descriptions are sometimes shorter than originals, so deepening with the source-board scraper backfills full content for promotional-tier rows. This hybrid approach gives the broadest coverage at the lowest unit cost.

How fresh do Google Jobs results need to be?

Six-hourly is the sweet spot. Google Jobs indexes new postings within 2-12 hours of source-board appearance, so cadences faster than six-hourly yield diminishing freshness gains. For most aggregator use cases, six-hourly catches new postings within 6-18 hours of original publication, which is good enough to compete with single-source aggregators on freshness.

Scrape Google Jobs Aggregated Listings (2026 Guide)
Find Jobs with Direct Apply URLs (2026 Google Jobs Guide)
Track Job Posting Velocity on Google Jobs (2026)
