Build a Competitor Product Database from Shopify Stores

Create a structured competitor product catalog from Shopify stores with pricing, variants, and sale data. Python pipeline with scheduled weekly refresh.

May 26, 2026 · 5 min read · 1,130 words
Thirdwatch's Shopify Store Scraper pulls structured product catalogs from any public Shopify store — title, vendor, product type, tags, variant-level pricing with compare-at prices, SKUs, and inventory status. No API keys or merchant credentials needed. Feed multiple competitor store URLs in one run and get back a unified dataset ready for side-by-side analysis. Built for brand strategists, ecommerce operators, and DTC growth teams who track competitor catalogs systematically.

Why build a competitor product database from Shopify

The DTC market has consolidated around Shopify. According to Shopify's 2024 annual report, merchants on the platform processed over $235 billion in gross merchandise volume. For any brand selling direct-to-consumer, the competitive set almost certainly includes multiple Shopify stores. Knowing what competitors sell, at what price points, with which variants, and when they run sales is the foundation of competitive product strategy.

The job-to-be-done is catalog comparison. A DTC shoe brand wants to know every sneaker SKU, price, and color variant across five Shopify competitors — updated weekly. A holding company managing a portfolio of DTC brands needs a single database that catalogs product overlap, pricing gaps, and launch cadence across all portfolio brands plus their direct competitors. A growth marketer planning a sale event wants to see which competitors are currently running compare-at-price discounts and on which product types. All of these start with a structured, multi-store product database that refreshes automatically.

How does this compare to the alternatives?

Three approaches to building a multi-store competitor catalog:

| Approach | Cost | Reliability | Setup time | Maintenance |
| --- | --- | --- | --- | --- |
| Manual browsing + spreadsheet | Free | Misses variants, stale within days | Hours per store | Full re-entry on each refresh |
| DIY script per store | Free compute | Breaks on layout changes, rate limits | 1-2 days | Per-store debugging |
| Thirdwatch Shopify Store Scraper | Pay per result | Unified schema across all stores | 5 minutes | Thirdwatch maintains |

Manual catalog comparison breaks the moment you need variant-level granularity or weekly refresh. DIY scripts work for one store but fragment when you add a second — each store may have different collection structures, pagination behavior, and rate limits. The Shopify Store Scraper normalizes everything into a single schema with store_domain as the partition key, so your downstream comparison logic never changes when you add or remove a competitor.

How to build a competitor database in 4 steps

Step 1: How do I set up the API token?

Create a free account at apify.com, navigate to Settings, and copy your API token. Store it as an environment variable:

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"
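
If the variable is missing, the Python code in the next step fails with a bare KeyError. A minimal sketch of a friendlier lookup follows; `load_apify_token` is a hypothetical helper name, not part of any Apify library.

```python
import os


def load_apify_token(env=None):
    """Return the Apify token from the environment or raise a readable error.

    Hypothetical helper for this tutorial, not an Apify API.
    """
    env = os.environ if env is None else env
    token = env.get("APIFY_TOKEN", "").strip()
    if not token:
        raise RuntimeError("APIFY_TOKEN is not set; export it before running the pipeline")
    return token
```

Passing a plain dict as `env` makes the helper easy to exercise in tests without touching the real environment.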

Step 2: How do I pull catalogs from multiple competitors at once?

Pass all competitor store URLs in a single storeUrls array. The actor tags each product with store_domain and store_url for easy grouping.

import os

import pandas as pd
import requests

ACTOR = "thirdwatch~shopify-store-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

COMPETITORS = [
    "https://www.allbirds.com",
    "https://www.bombas.com",
    "https://www.everlane.com",
    "https://www.koio.co",
    "https://www.greats.com",
]

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "storeUrls": COMPETITORS,
        "maxProductsPerStore": 500,
        "includeVariants": True,
    },
    timeout=1200,
)
resp.raise_for_status()  # stop here on HTTP errors instead of parsing an error body
products = resp.json()
df = pd.DataFrame(products)
print(f"{len(df)} products across {df.store_domain.nunique()} stores")

Five stores at 500 products each yield up to 2,500 rows: a comprehensive competitor catalog from one API call.
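
Before analyzing the result, it can help to confirm that every requested store actually returned rows. The sketch below works on the raw `products` list; it assumes `store_domain` is the host without a leading "www." (as in the sample output further down), and both function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def bare_domain(url):
    """Reduce a store URL to its host without a leading 'www.' (assumed to match store_domain)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def coverage_report(products, store_urls):
    """Count returned rows per requested store; a zero means the store needs a closer look."""
    counts = Counter(p.get("store_domain") for p in products)
    return {bare_domain(u): counts.get(bare_domain(u), 0) for u in store_urls}


# Illustrative rows only, not real scraper output.
rows = [{"store_domain": "allbirds.com"}, {"store_domain": "allbirds.com"}]
report = coverage_report(rows, ["https://www.allbirds.com", "https://www.koio.co"])
missing = [domain for domain, n in report.items() if n == 0]
```

In the pipeline you would call `coverage_report(products, COMPETITORS)` and alert on any domain in `missing`.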

Step 3: How do I compare pricing and catalog structure across stores?

Group by store_domain and product_type to build a cross-competitor pricing matrix.

# Price distribution by store and product type
pricing = df.groupby(["store_domain", "product_type"]).agg(
    product_count=("product_id", "count"),
    avg_price=("min_price", "mean"),
    median_price=("min_price", "median"),
    pct_on_sale=("on_sale", "mean"),
    avg_variants=("variant_count", "mean"),
).round(2).reset_index()

print(pricing.sort_values(["product_type", "store_domain"]).to_string(index=False))

# Summarize catalog size, category breadth, availability, and sale share per store
catalog_depth = df.groupby("store_domain").agg(
    total_products=("product_id", "count"),
    unique_types=("product_type", "nunique"),
    pct_available=("available", "mean"),
    pct_on_sale=("on_sale", "mean"),
).round(2)
print(catalog_depth)

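pct_on_sale across stores reveals who is running promotions aggressively versus holding full price. pct_available flags inventory health: a store at 60% availability has a large part of its catalog out of stock, which may point to sell-through on popular SKUs or slow restocking. Both columns are fractions between 0 and 1, so 0.6 means 60%.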
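
For a true side-by-side view, the grouped table can be pivoted so that each store becomes a column. The sketch below uses a small illustrative DataFrame in place of the `df` from Step 2; the `price_spread` column name is an assumption introduced here.

```python
import pandas as pd

# Illustrative rows only; in the pipeline, use the df built in Step 2.
sample = pd.DataFrame([
    {"store_domain": "a.com", "product_type": "Shoes", "min_price": 100.0},
    {"store_domain": "a.com", "product_type": "Shoes", "min_price": 120.0},
    {"store_domain": "b.com", "product_type": "Shoes", "min_price": 90.0},
    {"store_domain": "b.com", "product_type": "Socks", "min_price": 15.0},
])

# Rows are product types, columns are stores, cells are median entry prices.
matrix = sample.pivot_table(
    index="product_type", columns="store_domain", values="min_price", aggfunc="median"
)

# Gap between the most and least expensive store for each product type.
# Stores that do not carry a type show up as NaN and are skipped by max/min.
matrix["price_spread"] = matrix.max(axis=1) - matrix.min(axis=1)
```

A spread of zero usually means only one store carries that product type, so filter those rows out before reading the gaps as pricing signals.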

Step 4: How do I schedule weekly refreshes to keep the database current?

Use Apify's scheduling API to re-run the scraper on a cron schedule. Each run stores a fresh snapshot in its own default dataset, so earlier snapshots are not overwritten. The schedule API expects the actor input as a JSON string in runInput.body, alongside a contentType.

curl -X POST "https://api.apify.com/v2/schedules?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "competitor-catalog-weekly",
    "cronExpression": "0 6 * * 1",
    "timezone": "America/New_York",
    "isEnabled": true,
    "actions": [{
      "type": "RUN_ACTOR",
      "actorId": "thirdwatch~shopify-store-scraper",
      "runInput": {
        "contentType": "application/json; charset=utf-8",
        "body": "{\"storeUrls\": [\"https://www.allbirds.com\", \"https://www.bombas.com\", \"https://www.everlane.com\"], \"maxProductsPerStore\": 500, \"includeVariants\": true}"
      }
    }]
  }'

This fires every Monday at 6 AM ET. Add an ACTOR.RUN.SUCCEEDED webhook to push the fresh dataset to your data warehouse, Google Sheets, or BI tool automatically.
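
On the receiving side, the webhook handler needs to find the dataset that belongs to the finished run. A minimal sketch, assuming the webhook uses Apify's default payload template, where `resource` holds the run object including `defaultDatasetId`; `dataset_items_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(payload, fmt="json"):
    """Build the dataset-items URL for the run described by a webhook payload."""
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        return None  # ignore failed, aborted, or timed-out runs
    dataset_id = payload["resource"]["defaultDatasetId"]
    query = urlencode({"format": fmt, "clean": "true"})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# Hypothetical payload trimmed to the two fields used here.
example = {"eventType": "ACTOR.RUN.SUCCEEDED", "resource": {"defaultDatasetId": "abc123"}}
url = dataset_items_url(example)
```

Fetch the resulting URL with your API token, then load the rows into the warehouse or sheet of your choice.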

Sample output

Two products from different stores in a single run. The store_domain field partitions the data for cross-store comparison.

[
  {
    "store_domain": "allbirds.com",
    "store_url": "https://www.allbirds.com",
    "url": "https://www.allbirds.com/products/mens-wool-runners",
    "product_id": 4029431775334,
    "title": "Men's Wool Runners",
    "vendor": "Allbirds",
    "product_type": "Shoes",
    "tags": ["mens", "runner", "wool"],
    "min_price": 110.0,
    "max_price": 110.0,
    "on_sale": false,
    "variant_count": 24,
    "available": true,
    "updated_at": "2026-04-10T08:32:11Z"
  },
  {
    "store_domain": "bombas.com",
    "store_url": "https://www.bombas.com",
    "url": "https://www.bombas.com/products/womens-ankle-sock-6-pack",
    "product_id": 7812345678,
    "title": "Women's Ankle Sock 6-Pack",
    "vendor": "Bombas",
    "product_type": "Socks",
    "tags": ["womens", "ankle", "6-pack"],
    "min_price": 62.80,
    "max_price": 62.80,
    "on_sale": true,
    "variant_count": 8,
    "available": true,
    "updated_at": "2026-05-01T14:20:00Z"
  }
]

on_sale: true on the Bombas product means at least one variant has a compare_at_price higher than its current price — the merchant is running a markdown. The updated_at timestamp tells you when the merchant last touched the product listing, useful for detecting price changes between refreshes.
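
Detecting changes between refreshes comes down to comparing two snapshots keyed by store and product. The sketch below assumes `(store_domain, product_id)` is unique within a snapshot and compares only `min_price`; the function names are hypothetical.

```python
def index_by_product(products):
    """Key each row by (store_domain, product_id), assumed unique per snapshot."""
    return {(p["store_domain"], p["product_id"]): p for p in products}


def diff_snapshots(previous, current):
    """Compare two snapshots and report launches, removals, and min_price changes."""
    old, new = index_by_product(previous), index_by_product(current)
    price_changes = {
        key: (old[key]["min_price"], new[key]["min_price"])
        for key in old.keys() & new.keys()
        if old[key]["min_price"] != new[key]["min_price"]
    }
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "price_changes": price_changes,
    }


# Illustrative snapshots only, not real scraper output.
last_week = [
    {"store_domain": "a.com", "product_id": 1, "min_price": 110.0},
    {"store_domain": "a.com", "product_id": 2, "min_price": 60.0},
]
this_week = [
    {"store_domain": "a.com", "product_id": 1, "min_price": 99.0},
    {"store_domain": "a.com", "product_id": 3, "min_price": 45.0},
]
changes = diff_snapshots(last_week, this_week)
```

Here product 1 dropped from 110.0 to 99.0, product 2 disappeared, and product 3 is a new launch. The same pattern extends to `on_sale` or `available` if you want to track promotion starts and stockouts.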

Common pitfalls

Three issues recur in competitor database pipelines. According to BuiltWith's Shopify usage statistics, over 4.8 million live sites run on Shopify as of 2026, making it the dominant DTC platform globally.

Cross-currency comparison: Shopify product feeds do not include currency codes. A $110 product from allbirds.com (USD) and a product priced at 110 in a UK store (GBP) are not comparable without mapping store_domain to currency. Build a domain-to-currency lookup table and apply FX rates before any cross-store aggregation.

Collection handle inconsistency: if you use collectionHandles to target specific categories, remember that handles are store-specific. "mens-shoes" exists on Allbirds but might be "shoes-men" or "men" on another store. Check each store's navigation or sitemap before assuming a handle exists.

Stale snapshots: a weekly refresh is fine for catalog structure but too slow for pricing intelligence during sale events like Black Friday. Switch to daily cadence during high-change periods.
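
The currency fix can be sketched as a lookup applied to each row before aggregation. Everything named below is an assumption for illustration: the domain mapping, the placeholder FX rate, and the `min_price_usd` field are not part of the scraper output.

```python
# Hypothetical lookup tables: maintain the domain mapping by hand and load
# real FX rates from a source you trust. The values below are placeholders.
DOMAIN_CURRENCY = {"allbirds.com": "USD", "example-uk-store.co.uk": "GBP"}
RATES_TO_USD = {"USD": 1.0, "GBP": 1.25}


def add_usd_prices(products):
    """Return copies of the rows with currency and min_price_usd fields added."""
    converted = []
    for row in products:
        currency = DOMAIN_CURRENCY.get(row["store_domain"])
        if currency is None:
            # Refuse to guess: an unmapped store would silently skew averages.
            raise KeyError(f"no currency mapped for {row['store_domain']}")
        usd = round(row["min_price"] * RATES_TO_USD[currency], 2)
        converted.append({**row, "currency": currency, "min_price_usd": usd})
    return converted


rows = add_usd_prices([
    {"store_domain": "allbirds.com", "min_price": 110.0},
    {"store_domain": "example-uk-store.co.uk", "min_price": 110.0},
])
```

Raising on an unmapped domain is a deliberate choice: adding a competitor then forces you to record its currency before its prices reach any cross-store average.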

The actor handles pagination, variant expansion, and rate limiting internally, so adding a new competitor is a one-line change to the storeUrls array.

Frequently asked questions

How often should I refresh a competitor product database?

Weekly for catalog changes like new launches and discontinued products. Daily if you are tracking pricing or sale events. Apify schedules handle both cadences with a single cron expression and no additional infrastructure.

Can I compare more than two stores at once?

Yes. Pass any number of store URLs in the storeUrls array. The actor processes them sequentially and tags each product with its store_domain, so downstream grouping and comparison are straightforward in pandas or SQL.

Does the scraper capture out-of-stock products?

Yes. Products remain in the Shopify feed even when all variants are out of stock. The available field is false and each variant's available field is also false. This lets you track inventory patterns and restock cadence.

What if a competitor uses a headless Shopify storefront?

Headless storefronts that still expose the standard Shopify products.json endpoint work normally. Fully custom frontends that disable the public product feed are skipped automatically.
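
You can check a store yourself before adding it to the list. A rough sketch using only the standard library follows; the function names and the User-Agent string are made up for this example, and a store that blocks automated requests will also report False.

```python
import json
import urllib.request
from urllib.parse import urlparse


def products_feed_url(store_url, limit=1):
    """Build the public product-feed URL for a store from any URL on its domain."""
    parsed = urlparse(store_url)
    return f"{parsed.scheme}://{parsed.netloc}/products.json?limit={limit}"


def exposes_product_feed(store_url, timeout=10):
    """Return True if the store answers with JSON containing a 'products' key."""
    request = urllib.request.Request(
        products_feed_url(store_url), headers={"User-Agent": "catalog-check/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return "products" in json.load(response)
    except (OSError, ValueError):
        return False  # unreachable, blocked, timed out, or not JSON
```

Run `exposes_product_feed("https://www.allbirds.com")` for each candidate store and keep only the ones that return True.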

Can I export the data to Google Sheets or a database?

Apify datasets export to CSV, JSON, and Excel natively. For Google Sheets, use Apify's built-in Google Sheets integration or the API to push rows directly after each run completes.
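
If you post-process rows in Python before loading them elsewhere, list fields such as tags need flattening first. A small sketch with the standard csv module; the "|" separator and the `to_csv_text` name are arbitrary choices for this example.

```python
import csv
import io


def to_csv_text(products, columns):
    """Serialize rows to CSV text, joining list fields such as tags with '|'."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in products:
        flat = {
            key: "|".join(map(str, value)) if isinstance(value, list) else value
            for key, value in row.items()
        }
        writer.writerow(flat)
    return buffer.getvalue()


# Illustrative row based on the sample output above.
csv_text = to_csv_text(
    [{"store_domain": "allbirds.com", "title": "Men's Wool Runners", "tags": ["mens", "wool"]}],
    columns=["store_domain", "title", "tags"],
)
```

Columns not listed in `columns` are dropped, which keeps the export stable if the scraper adds fields later.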
