
Scrape Any Shopify Store for Product Research (2026 Guide)

Pull products, prices, variants, and inventory from any public Shopify store. No login or API key needed. Python code examples for DTC brand research.

May 26, 2026 · 6 min read · 1,275 words

Thirdwatch's Shopify Store Scraper extracts products from any public Shopify storefront — title, vendor, product type, tags, images, full variant data (SKU, price, compare-at price, availability), and options. No login, no API key, no merchant permission. Works on custom domains and myshopify.com subdomains. Built for DTC researchers, brand analysts, and ecommerce operators who need structured Shopify catalog data at scale.

Why scrape Shopify stores for product research

Shopify powers over 4.6 million live stores globally, according to BuiltWith's 2025 Shopify usage statistics. That makes Shopify the single largest source of DTC product data on the web — from niche artisan brands to publicly traded companies like Allbirds and Gymshark. The problem: Shopify's Admin API requires merchant credentials, and the Storefront API requires an access token issued by the store owner. Neither is available to outside researchers.

The job-to-be-done is straightforward. A brand strategist at a DTC holding company needs to catalog every product, price point, and variant across 30 portfolio competitors. A venture analyst evaluating a Shopify-native brand wants to quantify catalog depth, price distribution, and sale frequency. An ecommerce consultant building a market map needs structured data from dozens of stores without manual copy-paste. A product manager benchmarking against three direct competitors wants variant-level pricing with SKU granularity. All of these reduce to: give me every product from these Shopify stores as structured rows.

How does this compare to the alternatives?

Three paths to getting Shopify product data into a spreadsheet or database:

| Approach | Cost | Reliability | Setup time | Maintenance |
|---|---|---|---|---|
| DIY Python + products.json pagination | Free compute | Breaks on rate limits, anti-bot stores | 2-4 hours | You maintain pagination, error handling, proxy logic |
| Generic scraping API (ScrapingBee, Apify generic) | Per-request pricing | Works but no Shopify-specific parsing | 30 minutes | You write and maintain the parser |
| Thirdwatch Shopify Store Scraper | Pay per result | Handles pagination, variants, collections | 5 minutes | Thirdwatch maintains |

The DIY route works for a single store but breaks at scale — Shopify rate-limits aggressive crawlers and some stores use Cloudflare. Generic scraping APIs return raw HTML that you still need to parse. The Shopify Store Scraper returns 25+ structured fields per product including variant arrays, compare-at prices, and sale flags out of the box. Shopify's own Admin API returns richer data but requires merchant credentials, making it inaccessible for competitive research. The public products.json endpoint that most DIY scrapers use is the same source this actor consumes, but with proper pagination handling, rate-limit management, and structured output normalization already built in.
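For comparison, the DIY route from the table can be sketched in a few lines. This is a minimal sketch, not the actor's implementation: it assumes the store exposes the public /products.json feed, that the feed accepts `limit` and `page` query parameters (with 250 as the commonly used page size), and that the response has a top-level `products` array. The names `fetch_page` and `iter_products` are hypothetical, and the sketch has no retry, backoff, or proxy logic.

```python
import json
import time
import urllib.parse
import urllib.request


def fetch_page(store_url, page, limit=250):
    """Fetch one page of a store's public products.json feed (assumed endpoint)."""
    query = urllib.parse.urlencode({"limit": limit, "page": page})
    url = f"{store_url.rstrip('/')}/products.json?{query}"
    req = urllib.request.Request(url, headers={"User-Agent": "catalog-research/0.1"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp).get("products", [])


def iter_products(store_url, fetch=fetch_page, max_pages=50, delay=1.0):
    """Yield products page by page until an empty page or max_pages is reached."""
    for page in range(1, max_pages + 1):
        products = fetch(store_url, page)
        if not products:
            return
        yield from products
        time.sleep(delay)  # crude politeness delay between page requests
```

Passing the fetcher in as a parameter keeps the pagination loop testable without network access. Everything the table lists under maintenance (error handling, rate limits, proxies) would still need to be added on top.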

How to scrape Shopify products in 4 steps

Step 1: How do I authenticate against Apify?

Sign up at apify.com (free tier, no credit card required), open Settings, and copy your API token. Every example below assumes the token is in APIFY_TOKEN:

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"

Step 2: How do I pull all products from a Shopify store?

Pass one or more store URLs in storeUrls. The actor accepts homepages, myshopify.com subdomains, or specific collection URLs.

import os, requests, pandas as pd

ACTOR = "thirdwatch~shopify-store-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "storeUrls": [
            "https://www.allbirds.com",
            "https://www.gymshark.com"
        ],
        "maxProductsPerStore": 100,
        "includeVariants": True,
    },
    timeout=600,
)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error payload
df = pd.DataFrame(resp.json())
print(f"{len(df)} products across {df.store_domain.nunique()} stores")
print(df[["store_domain", "title", "vendor", "product_type",
          "min_price", "max_price", "on_sale", "variant_count"]].head(10))

Two stores at 100 products each return up to 200 rows, which is small enough for an initial research sweep. Raise maxProductsPerStore to 10000 for full catalog pulls.

Step 3: How do I scrape a specific collection or category?

Use collectionHandles to target specific Shopify collections within each store. Collections are Shopify's per-store category system — handles like new-arrivals, sale, or mens-shoes.

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "storeUrls": ["https://www.allbirds.com"],
        "collectionHandles": ["mens-shoes", "womens-shoes"],
        "sortBy": "newest",
        "maxProductsPerStore": 50,
        "includeVariants": True,
    },
    timeout=600,
)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error payload
df = pd.DataFrame(resp.json())
print(f"{len(df)} products from targeted collections")

You can also pass the full collection URL directly in storeUrls — for example, https://www.allbirds.com/collections/new-arrivals — and the actor scopes to that collection automatically.
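If you would rather build those collection URLs programmatically than type them, a small helper is enough. This is a hedged sketch: `collection_url` is a hypothetical name, and it assumes only the /collections/<handle> path pattern shown in the example above.

```python
from urllib.parse import urlsplit


def collection_url(store_url, handle):
    """Build a collection URL from a store URL and a collection handle."""
    parts = urlsplit(store_url)
    # Keep only scheme and host, so a store URL with a path or trailing slash still works.
    return f"{parts.scheme}://{parts.netloc}/collections/{handle.strip('/')}"


store_urls = [
    collection_url("https://www.allbirds.com", handle)
    for handle in ("new-arrivals", "mens-shoes")
]
```

The resulting list can be passed as the storeUrls value in the request body.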

Step 4: How do I filter by price range?

Use minPrice and maxPrice to narrow results. These are post-filters applied after fetching — a product is kept if any of its variants falls within the range.

resp = requests.post(
    f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
    params={"token": TOKEN},
    json={
        "storeUrls": ["https://www.gymshark.com"],
        "minPrice": 30,
        "maxPrice": 80,
        "maxProductsPerStore": 200,
        "includeVariants": False,
    },
    timeout=600,
)
resp.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error payload
df = pd.DataFrame(resp.json())
print(f"{len(df)} products in $30-$80 range")
print(df[["title", "product_type", "min_price", "max_price", "on_sale"]].head(10))

Setting includeVariants to false produces smaller output when you only need price ranges and product metadata, not individual SKU-level data. This is useful for high-level competitive scans where you need catalog breadth and price positioning but not the full SKU matrix. For pricing research that requires size-by-color granularity (e.g., tracking whether specific variants sell out faster), keep includeVariants enabled and flatten the variants array into a separate dataframe joined on product_id.
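The flattening step mentioned above might look like the following. It is a sketch that assumes the field names shown in the sample output (product_id, store_domain, title, and a variants array with id, sku, price, and so on); `flatten_variants` is a hypothetical helper, not part of the actor.

```python
def flatten_variants(products):
    """Turn product records into one row per variant, keyed by product_id."""
    rows = []
    for product in products:
        # Products fetched with includeVariants disabled have no variants array.
        for variant in product.get("variants") or []:
            rows.append({
                "product_id": product["product_id"],
                "store_domain": product.get("store_domain"),
                "product_title": product.get("title"),
                "variant_id": variant.get("id"),
                "variant_title": variant.get("title"),
                "sku": variant.get("sku"),
                "price": variant.get("price"),
                "compare_at_price": variant.get("compare_at_price"),
                "available": variant.get("available"),
                "option1": variant.get("option1"),
                "option2": variant.get("option2"),
                "option3": variant.get("option3"),
            })
    return rows
```

Calling `pd.DataFrame(flatten_variants(resp.json()))` gives a variant-level dataframe that joins back to the product-level dataframe on product_id.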

Sample output

A single product record from Allbirds looks like this. Each row is one Shopify product with full variant detail.

{
  "store_domain": "allbirds.com",
  "store_url": "https://www.allbirds.com",
  "url": "https://www.allbirds.com/products/mens-wool-runners",
  "product_id": 4029431775334,
  "handle": "mens-wool-runners",
  "title": "Men's Wool Runners",
  "vendor": "Allbirds",
  "product_type": "Shoes",
  "tags": ["mens", "runner", "wool"],
  "description": "Our iconic everyday shoe, spun from superfine merino wool...",
  "image": "https://cdn.shopify.com/s/files/1/0023/...",
  "images": ["https://cdn.shopify.com/s/files/1/0023/..."],
  "options": ["Size", "Color"],
  "min_price": 110.0,
  "max_price": 110.0,
  "min_compare_at_price": null,
  "max_compare_at_price": null,
  "on_sale": false,
  "variants": [
    {
      "id": 39274,
      "title": "10 / Natural Grey",
      "sku": "WR-NG-10",
      "price": 110.0,
      "compare_at_price": null,
      "available": true,
      "option1": "10",
      "option2": "Natural Grey",
      "option3": null
    }
  ],
  "variant_count": 24,
  "available": true,
  "created_at": "2021-05-12T10:00:00Z",
  "updated_at": "2026-04-10T08:32:11Z",
  "published_at": "2021-05-12T10:05:00Z"
}

Key fields: on_sale is true when any variant has a compare_at_price higher than its price — a reliable indicator of active discounts without parsing description text. variant_count tells you the SKU breadth at a glance. created_at and updated_at let you track catalog freshness and launch cadence.
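The on_sale rule described above is simple enough to re-derive yourself, which is useful for sanity-checking the flag or computing discount depth. The sketch below only mirrors the rule as stated in the text and is not the actor's code; `max_discount_pct` is an illustrative extra metric, not a field the actor returns.

```python
def is_on_sale(variants):
    """True when any variant has a compare_at_price higher than its price."""
    return any(
        v.get("compare_at_price") is not None and v["compare_at_price"] > v["price"]
        for v in variants
    )


def max_discount_pct(variants):
    """Largest percentage discount across variants, or 0.0 when none is discounted."""
    discounts = [
        (v["compare_at_price"] - v["price"]) / v["compare_at_price"] * 100
        for v in variants
        if v.get("compare_at_price") is not None and v["compare_at_price"] > v["price"]
    ]
    return round(max(discounts), 1) if discounts else 0.0
```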

Common pitfalls

Three things to watch for in Shopify product research pipelines.

Currency ambiguity: Shopify's product feed does not include currency codes. Prices are floats in the store's default currency. If you are comparing across stores in different countries, you need to map store_domain to currency and apply conversion rates downstream.

Collection handle guessing: there is no global Shopify taxonomy. Collection handles like sale, new-arrivals, or best-sellers are common but not universal. Check a store's sitemap or navigation before assuming handles exist.

Compare-at price nulls: compare_at_price is only populated when the merchant sets a sale price. A null value means no discount is configured, not that the product is full-priced by default.
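For the currency pitfall, the downstream mapping can be as simple as two lookup tables. This is a sketch under stated assumptions: the domain-to-currency map and the conversion rates are inputs you maintain yourself, `add_usd_prices` is a hypothetical helper, and it relies on the store_domain, min_price, and max_price fields from the sample output.

```python
def add_usd_prices(products, domain_currency, usd_rates):
    """Annotate each product with its currency and USD-converted price range.

    domain_currency maps store_domain -> currency code (maintained by you).
    usd_rates maps currency code -> USD per one unit of that currency.
    Products with an unknown currency or rate get None for the converted fields.
    """
    converted = []
    for product in products:
        currency = domain_currency.get(product["store_domain"])
        rate = usd_rates.get(currency)
        row = dict(product, currency=currency)
        for field in ("min_price", "max_price"):
            value = product.get(field)
            usable = rate is not None and value is not None
            row[f"{field}_usd"] = round(value * rate, 2) if usable else None
        converted.append(row)
    return converted
```

Returning None for unmapped stores, rather than assuming USD, keeps mixed-currency rows from silently skewing price comparisons.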

Thirdwatch's actor handles pagination, rate limiting, and proxy rotation internally so you get clean structured data without managing the scraping infrastructure yourself.

A fourth consideration specific to competitive research: catalog freshness tracking. The created_at and updated_at timestamps on each product reveal a store's launch cadence. A competitor launching 15 new products per month is in growth mode; one with all products last updated six months ago is in maintenance mode. Track created_at distributions across weekly snapshots to build a new-product velocity metric per competitor. Similarly, updated_at spikes across many products often indicate a coordinated price change or seasonal sale — actionable timing intelligence for pricing teams. These temporal signals are invisible on the storefront but fully exposed in the structured data the actor returns.
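A new-product velocity metric of this kind can be approximated by bucketing created_at by month per store. The sketch below assumes ISO 8601 timestamps like those in the sample output; `launches_per_month` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime


def launches_per_month(products):
    """Count products per (store_domain, 'YYYY-MM') bucket using created_at."""
    counts = Counter()
    for product in products:
        # fromisoformat rejects a trailing "Z" before Python 3.11, so normalize it.
        created = datetime.fromisoformat(product["created_at"].replace("Z", "+00:00"))
        counts[(product["store_domain"], created.strftime("%Y-%m"))] += 1
    return dict(counts)
```

Running this on each weekly snapshot and comparing the latest month's count per store gives a rough launch cadence. The same bucketing applied to updated_at would surface the update spikes described above.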

Frequently asked questions

Do I need the store owner's permission to scrape a Shopify store?

No. The actor reads publicly accessible product data from Shopify's standard storefront endpoints. Any store with a public catalog works. Password-protected or private stores are automatically skipped.

Does this work on custom domains or only myshopify.com?

Both. Pass the brand's primary domain like allbirds.com or gymshark.com. The actor detects the underlying Shopify storefront automatically and resolves products.json from any valid Shopify domain.

What about Shopify Plus stores?

Fully supported. Shopify Plus uses the same public product feed as standard Shopify. Enterprise stores like Gymshark, Allbirds, and Bombas all work without any configuration changes.

How many products can I pull from one store?

Up to 10,000 per run via the maxProductsPerStore input. Most DTC brands have 50 to 500 products. Large multi-brand stores may have several thousand. Set maxProductsPerStore to 10000 for a full catalog pull.

Is variant-level data included?

Yes, when includeVariants is true (the default). Each product record includes an array of variants with SKU, price, compare-at price, availability, and option values like size and color.
