Build a Fynd Product Catalog Database With Python (2026)

Build a structured product catalog database from Fynd D2C storefronts — automated ingestion, dedup, schema design. Python recipes with Thirdwatch actor.

May 26, 2026 · 6 min read · 1,316 words

Thirdwatch's Fynd Platform Scraper returns structured product data from any Fynd-powered D2C storefront — product IDs, slugs, pricing, discounts, ratings, stock status, and brand attribution. Built for developers building product catalog databases, comparison engines, price-tracking systems, or D2C analytics platforms.

Why build a catalog database from Fynd storefronts

Fynd Platform powers a growing segment of India's D2C storefronts. According to Inc42's D2C report, India had over 800 funded D2C brands by end of 2025, and a significant share run their storefronts on Fynd's commerce infrastructure. Unlike marketplace listings where product data is standardized by the marketplace, each Fynd storefront is a standalone product catalog with its own pricing logic, collection taxonomy, and inventory rules.

For developers building product comparison tools, price trackers, or D2C market intelligence platforms, this means you need a reliable ingestion pipeline that can pull structured data from arbitrary Fynd storefronts and normalize it into a consistent schema. The alternative, writing a custom scraper per storefront, fails at scale: every Fynd store shares the same underlying platform, so all of your scrapers break at the same time when Fynd updates its frontend.

The Fynd Scraper returns a consistent schema across all Fynd storefronts: store_domain, product_id, slug, product_name, brand, price, price_max, original_price, discount_percent, currency, rating, rating_count, image_url, url, in_stock, item_type, and source_query. One API call per store, consistent output, ready for database ingestion.

How does this compare to alternatives?

Three paths to a Fynd product catalog database:

Approach	Reliability	Setup time	Maintenance
Custom BeautifulSoup scraper per store	Medium; breaks on platform updates	2-4 hours per store	You fix every breakage
Headless browser automation (Playwright)	Higher; handles JS rendering	1-2 days for robust setup	Browser + anti-bot drift
Thirdwatch Fynd Scraper API + your DB	Production-grade, platform-change resilient	30 minutes	Thirdwatch tracks Fynd changes

The Fynd Scraper actor abstracts the extraction layer entirely. Your code handles only what it should: calling the API, transforming the output, and writing to your database.

How to build a Fynd product catalog database

Step 1: How do I authenticate and install dependencies?

Get a free Apify API token at apify.com, install the Python client, and set up your database.

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"
pip install apify-client duckdb pandas

Step 2: How do I design the catalog schema?

Define a schema that maps directly to the actor's output fields. Use DuckDB for local development and PostgreSQL for production.

import duckdb

db = duckdb.connect("fynd_catalog.duckdb")
db.execute("""
    CREATE TABLE IF NOT EXISTS products (
        store_domain   VARCHAR,
        product_id     VARCHAR,
        slug           VARCHAR,
        product_name   VARCHAR,
        brand          VARCHAR,
        price          DOUBLE,
        price_max      DOUBLE,
        original_price DOUBLE,
        discount_percent DOUBLE,
        currency       VARCHAR,
        rating         DOUBLE,
        rating_count   INTEGER,
        image_url      VARCHAR,
        url            VARCHAR,
        in_stock       BOOLEAN,
        item_type      VARCHAR,
        source_query   VARCHAR,
        pulled_at      DATE,
        PRIMARY KEY (store_domain, product_id, pulled_at)
    )
""")
print("Schema ready")

The composite primary key (store_domain, product_id, pulled_at) lets you store historical snapshots without dedup conflicts. Each weekly pull creates new rows; querying the latest snapshot per product uses a simple window function.
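The effect of the composite key can be seen in a small sketch. To keep it runnable without extra installs, it uses the standard library's sqlite3 as a stand-in for DuckDB (an assumption made for illustration), with the table trimmed to four columns and made-up prices and dates. A pull on a new date adds a row, while a re-run on the same date replaces the existing row.

```python
import sqlite3

# In-memory SQLite stands in for DuckDB here; the table is trimmed to four columns.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        store_domain TEXT,
        product_id   TEXT,
        price        REAL,
        pulled_at    TEXT,
        PRIMARY KEY (store_domain, product_id, pulled_at)
    )
""")

rows = [
    ("www.store-alpha.com", "5432198", 1599.0, "2026-05-18"),  # first weekly pull
    ("www.store-alpha.com", "5432198", 1499.0, "2026-05-25"),  # next pull: new row
    ("www.store-alpha.com", "5432198", 1449.0, "2026-05-25"),  # same-day re-run: replaces
]
for row in rows:
    conn.execute("INSERT OR REPLACE INTO products VALUES (?, ?, ?, ?)", row)

snapshots = conn.execute(
    "SELECT pulled_at, price FROM products ORDER BY pulled_at"
).fetchall()
print(snapshots)
```

Two rows remain: one per pull date, with the later same-day price winning.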

Step 3: How do I ingest products from multiple storefronts?

Use the apify-client SDK: call() blocks until the actor run finishes, and iterate_items() pages through the resulting dataset for you.

import datetime
import os

from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])
today = datetime.date.today().isoformat()

STORES = [
    "https://www.store-alpha.com",
    "https://www.store-beta.com",
    "https://www.store-gamma.com",
]

all_products = []

for store_url in STORES:
    run = client.actor("thirdwatch/fynd-scraper").call(
        run_input={
            "storeUrls": [store_url],
            "maxResultsPerTarget": 1000,
        },
        timeout_secs=900,
    )
    items = list(
        client.dataset(run["defaultDatasetId"]).iterate_items()
    )
    for item in items:
        item["pulled_at"] = today
    all_products.extend(items)
    print(f"{store_url}: {len(items)} products")

print(f"Total: {len(all_products)} products from {len(STORES)} stores")
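In the loop above, one failed actor run stops the whole ingestion. A minimal sketch of a more tolerant loop follows, assuming you wrap the actor call in a function of your own. The names `ingest_stores` and `fake_fetch` are hypothetical, and `fake_fetch` is only a stub that simulates one store timing out.

```python
from typing import Callable, Iterable


def ingest_stores(
    stores: Iterable[str],
    fetch: Callable[[str], list],
    pulled_at: str,
) -> tuple:
    """Collect items store by store, recording failures instead of aborting."""
    products = []
    failures = {}
    for store_url in stores:
        try:
            items = fetch(store_url)
        except Exception as exc:  # broad on purpose: a failure skips only this store
            failures[store_url] = repr(exc)
            continue
        for item in items:
            # Copy each item and stamp it with the snapshot date.
            products.append({**item, "pulled_at": pulled_at})
    return products, failures


# Stub standing in for the actor call, for illustration only.
def fake_fetch(store_url: str) -> list:
    if "beta" in store_url:
        raise TimeoutError("run timed out")
    return [{"store_domain": store_url.split("//", 1)[1], "product_id": "1"}]


products, failures = ingest_stores(
    ["https://www.store-alpha.com", "https://www.store-beta.com"],
    fake_fetch,
    "2026-05-25",
)
print(len(products), list(failures))
```

In a real pipeline, `fetch` would contain the `client.actor(...).call(...)` and `iterate_items()` lines from this step, and the `failures` dict gives you a retry list for the next run.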

Step 4: How do I load into the database with upsert logic?

Insert new rows and handle the composite key constraint for idempotent re-runs.

import pandas as pd

df = pd.DataFrame(all_products)

# Select only the columns matching our schema
cols = [
    "store_domain", "product_id", "slug", "product_name", "brand",
    "price", "price_max", "original_price", "discount_percent",
    "currency", "rating", "rating_count", "image_url", "url",
    "in_stock", "item_type", "source_query", "pulled_at",
]
df = df.reindex(columns=cols)

db.execute("INSERT OR REPLACE INTO products SELECT * FROM df")

count = db.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(f"Catalog database now has {count} total rows")
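Before loading, it can help to normalize records in plain Python so that type surprises and duplicate keys are caught early. The sketch below is one way to do that. The helper names `normalize_record` and `dedupe` are hypothetical, and the idea that a store might return a price as a string or an extra field is an assumption used for illustration.

```python
SCHEMA_COLS = [
    "store_domain", "product_id", "slug", "product_name", "brand",
    "price", "price_max", "original_price", "discount_percent",
    "currency", "rating", "rating_count", "image_url", "url",
    "in_stock", "item_type", "source_query", "pulled_at",
]
FLOAT_COLS = ["price", "price_max", "original_price", "discount_percent", "rating"]


def normalize_record(item: dict, pulled_at: str) -> dict:
    """Keep only schema columns, default missing ones to None, coerce numerics."""
    row = {col: item.get(col) for col in SCHEMA_COLS}
    row["pulled_at"] = pulled_at
    for col in FLOAT_COLS:
        if row[col] is not None:
            row[col] = float(row[col])
    if row["rating_count"] is not None:
        row["rating_count"] = int(row["rating_count"])
    return row


def dedupe(rows: list) -> list:
    """Keep the last row seen for each (store_domain, product_id, pulled_at)."""
    by_key = {}
    for row in rows:
        by_key[(row["store_domain"], row["product_id"], row["pulled_at"])] = row
    return list(by_key.values())


raw = [
    {"store_domain": "www.store-alpha.com", "product_id": "5432198",
     "price": "1599", "unexpected_field": "ignored"},
    {"store_domain": "www.store-alpha.com", "product_id": "5432198",
     "price": 1549},
]
clean = dedupe([normalize_record(item, "2026-05-25") for item in raw])
print(clean[0]["price"], clean[0]["rating"])
```

The resulting list of dicts can be passed to `pd.DataFrame(clean)` in place of the raw items.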

Step 5: How do I query the catalog for the latest snapshot?

Use a window function to get the most recent data per product across all stores.

latest = db.execute("""
    WITH ranked AS (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY store_domain, product_id
                ORDER BY pulled_at DESC
            ) AS rn
        FROM products
    )
    SELECT store_domain, product_id, product_name, brand,
           price, original_price, discount_percent,
           rating, rating_count, in_stock, pulled_at
    FROM ranked
    WHERE rn = 1
    ORDER BY store_domain, price DESC
""").df()
print(latest.head(20))
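Once several snapshots are stored, the same window-function approach extends to price tracking. The sketch below uses LAG to compare each snapshot with the previous one for the same product. It runs on the standard library's sqlite3 as a stand-in for DuckDB (an assumption made so it works without extra installs), with a trimmed table and made-up rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products "
    "(store_domain TEXT, product_id TEXT, price REAL, pulled_at TEXT)"
)
conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", [
    ("www.store-alpha.com", "5432198", 1599.0, "2026-05-18"),
    ("www.store-alpha.com", "5432198", 1499.0, "2026-05-25"),
    ("www.store-beta.com", "6789012", 699.0, "2026-05-18"),
    ("www.store-beta.com", "6789012", 699.0, "2026-05-25"),
])

# LAG looks one snapshot back within each (store_domain, product_id) group.
price_changes = conn.execute("""
    WITH with_prev AS (
        SELECT store_domain, product_id, pulled_at, price,
               LAG(price) OVER (
                   PARTITION BY store_domain, product_id
                   ORDER BY pulled_at
               ) AS prev_price
        FROM products
    )
    SELECT store_domain, product_id, pulled_at, prev_price, price
    FROM with_prev
    WHERE prev_price IS NOT NULL AND price <> prev_price
    ORDER BY store_domain, product_id, pulled_at
""").fetchall()
print(price_changes)
```

Only the store-alpha product is returned, since the store-beta price did not change between pulls.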

Step 6: How do I schedule automated weekly ingestion?

Wrap the ingestion in a script and trigger it via cron, Airflow, or any scheduler.

#!/usr/bin/env python3
"""fynd_ingest.py — weekly Fynd catalog ingestion."""
import os, datetime, duckdb, pandas as pd
from apify_client import ApifyClient

STORES = [
    "https://www.store-alpha.com",
    "https://www.store-beta.com",
]

client = ApifyClient(os.environ["APIFY_TOKEN"])
db = duckdb.connect("fynd_catalog.duckdb")
today = datetime.date.today().isoformat()

for store_url in STORES:
    run = client.actor("thirdwatch/fynd-scraper").call(
        run_input={
            "storeUrls": [store_url],
            "maxResultsPerTarget": 1000,
        },
        timeout_secs=900,
    )
    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    for item in items:
        item["pulled_at"] = today
    df = pd.DataFrame(items).reindex(columns=[
        "store_domain", "product_id", "slug", "product_name", "brand",
        "price", "price_max", "original_price", "discount_percent",
        "currency", "rating", "rating_count", "image_url", "url",
        "in_stock", "item_type", "source_query", "pulled_at",
    ])
    db.execute("INSERT OR REPLACE INTO products SELECT * FROM df")
    print(f"{today} | {store_url}: {len(items)} products ingested")

db.close()

Add to cron: 0 6 * * 1 python3 fynd_ingest.py for Monday-morning refreshes.

Sample output

Two records from a Fynd-powered storefront. Production pipelines typically ingest 200-5,000 products per store per run.

[
  {
    "store_domain": "www.store-alpha.com",
    "product_id": "5432198",
    "slug": "relaxed-fit-cargo-joggers-charcoal",
    "product_name": "Relaxed Fit Cargo Joggers - Charcoal",
    "brand": "Store Alpha",
    "price": 1599,
    "price_max": 1599,
    "original_price": 2199,
    "discount_percent": 27,
    "currency": "INR",
    "rating": 4.4,
    "rating_count": 519,
    "image_url": "https://cdn.fynd.com/v2/falling-surf-7c8bb8/fyprod/...",
    "url": "https://www.store-alpha.com/product/relaxed-fit-cargo-joggers-charcoal",
    "in_stock": true,
    "item_type": "standard",
    "source_query": null
  },
  {
    "store_domain": "www.store-beta.com",
    "product_id": "6789012",
    "slug": "vitamin-c-serum-30ml",
    "product_name": "Vitamin C Brightening Serum 30ml",
    "brand": "Store Beta",
    "price": 699,
    "price_max": 699,
    "original_price": 999,
    "discount_percent": 30,
    "currency": "INR",
    "rating": 4.6,
    "rating_count": 1203,
    "image_url": "https://cdn.fynd.com/v2/falling-surf-7c8bb8/fyprod/...",
    "url": "https://www.store-beta.com/product/vitamin-c-serum-30ml",
    "in_stock": true,
    "item_type": "standard",
    "source_query": null
  }
]

For database ingestion the critical fields are product_id (dedup key), store_domain (partition key), price and original_price (pricing analysis), and in_stock (inventory tracking). The slug field is useful for building URL-based lookups without storing full URLs.
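As a sketch of how these fields combine, the helpers below rebuild a product URL from store_domain and slug and recompute the discount from the two price fields. Both helper names are hypothetical. The `/product/<slug>` pattern is inferred from the two sample records only, and rounding to the nearest whole percent is an assumption about how discount_percent is derived.

```python
def product_url(store_domain: str, slug: str) -> str:
    """Rebuild a product URL, assuming the /product/<slug> pattern in the sample."""
    return f"https://{store_domain}/product/{slug}"


def discount_from_prices(price: float, original_price: float) -> int:
    """Recompute the discount as a whole-number percentage of original_price."""
    if not original_price or original_price <= 0:
        return 0
    return round((original_price - price) / original_price * 100)


url = product_url("www.store-alpha.com", "relaxed-fit-cargo-joggers-charcoal")
print(url)
print(discount_from_prices(1599, 2199), discount_from_prices(699, 999))
```

For the two sample records this reproduces the stored url and the discount_percent values of 27 and 30, which makes it usable as a sanity check during ingestion.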

Common pitfalls

Four issues surface in production Fynd catalog pipelines. They matter because India's D2C market is projected to reach $100 billion by 2027 according to Bain & Company's India Venture Capital Report, which makes platform-level data infrastructure critical for competitive intelligence.

Missing composite key on upserts: using product_id alone as a primary key fails when you ingest the same product from different stores or want to keep historical snapshots. Always use (store_domain, product_id, pulled_at) as your composite key.

Null handling in ratings: newly listed products may return null for rating and rating_count. Cast these to 0 in your schema or use COALESCE in queries; otherwise aggregates such as AVG silently skip those rows.

Schema drift across stores: while all Fynd stores share the same platform, some may return additional fields or omit optional ones. Always use explicit column selection (reindex(columns=...)) rather than bulk-inserting the raw JSON.

Timeout on large catalogs: stores with 5,000+ products may exceed the default API timeout. Set timeout_secs=900 or higher and use the async client for parallel store ingestion.
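The null-handling point is easiest to see with numbers. The sketch below uses the standard library's sqlite3 as a stand-in for DuckDB (an assumption made for illustration) and three made-up rows, one of which has no ratings yet. It shows that an average over a nullable column skips the null row, while coalescing to 0 counts it as a zero rating.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id TEXT, rating REAL, rating_count INTEGER)"
)
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
    ("a", 4.4, 519),
    ("b", 4.6, 1203),
    ("c", None, None),  # newly listed product with no ratings yet
])

avg_skipping_nulls, avg_nulls_as_zero, total_reviews = conn.execute("""
    SELECT AVG(rating),
           AVG(COALESCE(rating, 0)),
           SUM(COALESCE(rating_count, 0))
    FROM products
""").fetchone()
print(avg_skipping_nulls, avg_nulls_as_zero, total_reviews)
```

Which behavior you want depends on the metric: skipping nulls suits an average rating, while treating a missing rating_count as 0 suits a total review count.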

Thirdwatch's actor handles Fynd platform changes and extraction logic so your pipeline code stays focused on ingestion and analytics. You maintain the store list and database schema; the actor returns clean structured rows. Pair with our Flipkart Scraper or Myntra Scraper to build a cross-platform catalog that covers both D2C and marketplace channels for the same brands.

Frequently asked questions

What database should I use for Fynd product catalogs?

For small-scale projects (under 100K products), SQLite or DuckDB work well and require zero infrastructure. For production pipelines ingesting multiple storefronts weekly, PostgreSQL with a product_id + store_domain composite key handles dedup and upserts cleanly. If you need full-text search across product names, PostgreSQL's tsvector or a dedicated search index like Typesense is the standard approach.
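For a latest-state table keyed on (store_domain, product_id), the upsert can be sketched with INSERT ... ON CONFLICT ... DO UPDATE, a form PostgreSQL supports. SQLite accepts the same statement, so the example runs on the standard library's sqlite3 as a stand-in. The table name `products_latest` and the trimmed column set are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products_latest (
        store_domain TEXT,
        product_id   TEXT,
        price        REAL,
        in_stock     INTEGER,
        pulled_at    TEXT,
        PRIMARY KEY (store_domain, product_id)
    )
""")

# With a PostgreSQL driver such as psycopg, the placeholders would be %s, not ?.
UPSERT = """
    INSERT INTO products_latest
        (store_domain, product_id, price, in_stock, pulled_at)
    VALUES (?, ?, ?, ?, ?)
    ON CONFLICT (store_domain, product_id) DO UPDATE SET
        price     = excluded.price,
        in_stock  = excluded.in_stock,
        pulled_at = excluded.pulled_at
"""

conn.execute(UPSERT, ("www.store-alpha.com", "5432198", 1599.0, 1, "2026-05-18"))
conn.execute(UPSERT, ("www.store-alpha.com", "5432198", 1499.0, 0, "2026-05-25"))

latest_rows = conn.execute("SELECT * FROM products_latest").fetchall()
print(latest_rows)
```

The second statement updates the existing row in place, so the table holds one row per product with the most recent price and stock status.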

How do I handle product variants in the Fynd output?

The actor returns one row per product, not per variant. The price field reflects the current selling price and price_max reflects the highest variant price. If a product has size-dependent pricing, price and price_max will differ. For variant-level analysis you would need to follow the product URL and parse the variant data from the product detail page, which is outside the actor's current scope.
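A small sketch of how to flag such products for follow-up, based only on the price and price_max fields described above. The function name `has_variant_pricing` is hypothetical and the second record is made up.

```python
def has_variant_pricing(item: dict) -> bool:
    """True when price_max exceeds price, which suggests variant-dependent pricing."""
    price, price_max = item.get("price"), item.get("price_max")
    if price is None or price_max is None:
        return False
    return price_max > price


items = [
    {"product_id": "5432198", "price": 1599, "price_max": 1599},
    {"product_id": "7000001", "price": 499, "price_max": 899},  # illustrative record
]
needs_variant_check = [i["product_id"] for i in items if has_variant_pricing(i)]
print(needs_variant_check)
```

The flagged IDs are the ones where a product detail page would need to be parsed to get per-variant prices.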
