Build a Fynd Product Catalog Database With Python (2026)
Build a structured product catalog database from Fynd D2C storefronts — automated ingestion, dedup, schema design. Python recipes with Thirdwatch actor.

Thirdwatch's Fynd Platform Scraper returns structured product data from any Fynd-powered D2C storefront: product IDs, slugs, pricing, discounts, ratings, stock status, and brand attribution. It is aimed at developers building product catalog databases, comparison engines, price-tracking systems, or D2C analytics platforms.
Why build a catalog database from Fynd storefronts
Fynd Platform powers a growing segment of India's D2C storefronts. According to Inc42's D2C report, India had over 800 funded D2C brands by end of 2025, and a significant share run their storefronts on Fynd's commerce infrastructure. Unlike marketplace listings where product data is standardized by the marketplace, each Fynd storefront is a standalone product catalog with its own pricing logic, collection taxonomy, and inventory rules.
For developers building product comparison tools, price trackers, or D2C market intelligence platforms, this means you need a reliable ingestion pipeline that can pull structured data from arbitrary Fynd storefronts and normalize it into a consistent schema. The alternative, writing a custom scraper per storefront, does not scale: every Fynd store shares the same underlying platform, so all of your scrapers break at the same time whenever Fynd updates its frontend.
The Fynd Scraper returns a consistent schema across all Fynd storefronts: store_domain, product_id, slug, product_name, brand, price, price_max, original_price, discount_percent, currency, rating, rating_count, image_url, url, in_stock, item_type, and source_query. One API call per store, consistent output, ready for database ingestion.
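As a rough sketch of how that field list can be used on the Python side, the snippet below projects a raw item onto the documented fields. `CATALOG_FIELDS` and `normalize_item` are illustrative names, not part of the actor or the Apify client, and `pulled_at` is a pipeline-side addition rather than actor output.

```python
from typing import Any

# Field names as listed in the paragraph above.
CATALOG_FIELDS = (
    "store_domain", "product_id", "slug", "product_name", "brand",
    "price", "price_max", "original_price", "discount_percent",
    "currency", "rating", "rating_count", "image_url", "url",
    "in_stock", "item_type", "source_query",
)


def normalize_item(raw: dict[str, Any], pulled_at: str) -> dict[str, Any]:
    """Project a raw item onto the known fields (hypothetical helper).

    Unknown keys are dropped and missing keys become None, so every row
    has the same shape regardless of which store it came from.
    """
    row = {field: raw.get(field) for field in CATALOG_FIELDS}
    row["pulled_at"] = pulled_at  # snapshot date added by the pipeline
    return row


# Illustrative input: a partial item with one unexpected key.
example = normalize_item(
    {
        "store_domain": "www.store-alpha.com",
        "product_id": "5432198",
        "price": 1599,
        "unexpected_field": "ignored",
    },
    pulled_at="2026-01-05",
)
print(example["price"], example["rating"], "unexpected_field" in example)
```

Keeping the field list in one place means the table definition, the column selection, and any validation can all refer to the same tuple.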
How does this compare to alternatives?
Three paths to a Fynd product catalog database:
| Approach | Reliability | Setup time | Maintenance |
|---|---|---|---|
| Custom BeautifulSoup scraper per store | Medium; breaks on platform updates | 2-4 hours per store | You fix every breakage |
| Headless browser automation (Playwright) | Higher; handles JS rendering | 1-2 days for robust setup | Browser + anti-bot drift |
| Thirdwatch Fynd Scraper API + your DB | Production-grade, platform-change resilient | 30 minutes | Thirdwatch tracks Fynd changes |
The Fynd Scraper actor abstracts the extraction layer entirely. Your code handles only what it should: calling the API, transforming the output, and writing to your database.
How to build a Fynd product catalog database
Step 1: How do I authenticate and install dependencies?
Get a free Apify API token at apify.com, install the Python client, and set up your database.
export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"
pip install apify-client duckdb pandas
Step 2: How do I design the catalog schema?
Define a schema that maps directly to the actor's output fields. Use DuckDB for local development and PostgreSQL for production.
import duckdb

db = duckdb.connect("fynd_catalog.duckdb")
db.execute("""
    CREATE TABLE IF NOT EXISTS products (
        store_domain     VARCHAR,
        product_id       VARCHAR,
        slug             VARCHAR,
        product_name     VARCHAR,
        brand            VARCHAR,
        price            DOUBLE,
        price_max        DOUBLE,
        original_price   DOUBLE,
        discount_percent DOUBLE,
        currency         VARCHAR,
        rating           DOUBLE,
        rating_count     INTEGER,
        image_url        VARCHAR,
        url              VARCHAR,
        in_stock         BOOLEAN,
        item_type        VARCHAR,
        source_query     VARCHAR,
        pulled_at        DATE,
        PRIMARY KEY (store_domain, product_id, pulled_at)
    )
""")
print("Schema ready")
The composite primary key (store_domain, product_id, pulled_at) lets you store historical snapshots without dedup conflicts. Each weekly pull creates new rows; querying the latest snapshot per product uses a simple window function.
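To illustrate how the composite key behaves, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for DuckDB so it runs on the standard library alone. The table is trimmed to four columns, and the prices and dates are made-up values for illustration.

```python
import sqlite3

# In-memory SQLite stands in for DuckDB/PostgreSQL here (an assumption
# made so the sketch needs no third-party packages).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        store_domain TEXT,
        product_id   TEXT,
        price        REAL,
        pulled_at    TEXT,
        PRIMARY KEY (store_domain, product_id, pulled_at)
    )
""")

UPSERT = "INSERT OR REPLACE INTO products VALUES (?, ?, ?, ?)"

# First pull of the week.
conn.execute(UPSERT, ("www.store-alpha.com", "5432198", 1599.0, "2026-01-05"))
# Re-running the same pull replaces the row instead of duplicating it.
conn.execute(UPSERT, ("www.store-alpha.com", "5432198", 1599.0, "2026-01-05"))
# The next weekly pull has a new pulled_at, so it is stored as a new snapshot.
conn.execute(UPSERT, ("www.store-alpha.com", "5432198", 1499.0, "2026-01-12"))

snapshots = conn.execute(
    "SELECT pulled_at, price FROM products ORDER BY pulled_at"
).fetchall()
print(snapshots)
```

Three inserts leave two rows: the repeated pull is idempotent, while the later pull adds history for the same product.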
Step 3: How do I ingest products from multiple storefronts?
Use the apify-client SDK: call() starts the actor and waits for the run to finish, and iterate_items() pages through the resulting dataset for you.
import datetime
import os

from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])
today = datetime.date.today().isoformat()

STORES = [
    "https://www.store-alpha.com",
    "https://www.store-beta.com",
    "https://www.store-gamma.com",
]

all_products = []
for store_url in STORES:
    # One actor run per store; call() blocks until the run finishes.
    run = client.actor("thirdwatch/fynd-scraper").call(
        run_input={
            "storeUrls": [store_url],
            "maxResultsPerTarget": 1000,
        },
        timeout_secs=900,
    )
    items = list(
        client.dataset(run["defaultDatasetId"]).iterate_items()
    )
    # Stamp every row with the snapshot date used in the primary key.
    for item in items:
        item["pulled_at"] = today
    all_products.extend(items)
    print(f"{store_url}: {len(items)} products")

print(f"Total: {len(all_products)} products from {len(STORES)} stores")
Step 4: How do I load into the database with upsert logic?
Insert new rows and handle the composite key constraint for idempotent re-runs.
import pandas as pd

df = pd.DataFrame(all_products)

# Select only the columns matching our schema
cols = [
    "store_domain", "product_id", "slug", "product_name", "brand",
    "price", "price_max", "original_price", "discount_percent",
    "currency", "rating", "rating_count", "image_url", "url",
    "in_stock", "item_type", "source_query", "pulled_at",
]
df = df.reindex(columns=cols)

# Duplicate keys inside a single batch can make the upsert fail,
# so keep one row per (store_domain, product_id, pulled_at).
df = df.drop_duplicates(subset=["store_domain", "product_id", "pulled_at"], keep="last")

db.execute("INSERT OR REPLACE INTO products SELECT * FROM df")
count = db.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(f"Catalog database now has {count} total rows")
Step 5: How do I query the catalog for the latest snapshot?
Use a window function to get the most recent data per product across all stores.
latest = db.execute("""
    WITH ranked AS (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY store_domain, product_id
                   ORDER BY pulled_at DESC
               ) AS rn
        FROM products
    )
    SELECT store_domain, product_id, product_name, brand,
           price, original_price, discount_percent,
           rating, rating_count, in_stock, pulled_at
    FROM ranked
    WHERE rn = 1
    ORDER BY store_domain, price DESC
""").df()
print(latest.head(20))
Step 6: How do I schedule automated weekly ingestion?
Wrap the ingestion in a script and trigger it via cron, Airflow, or any scheduler.
#!/usr/bin/env python3
"""fynd_ingest.py: weekly Fynd catalog ingestion."""
import datetime
import os

import duckdb
import pandas as pd
from apify_client import ApifyClient

STORES = [
    "https://www.store-alpha.com",
    "https://www.store-beta.com",
]
COLS = [
    "store_domain", "product_id", "slug", "product_name", "brand",
    "price", "price_max", "original_price", "discount_percent",
    "currency", "rating", "rating_count", "image_url", "url",
    "in_stock", "item_type", "source_query", "pulled_at",
]

client = ApifyClient(os.environ["APIFY_TOKEN"])
# Assumes the products table from Step 2 already exists in this file.
db = duckdb.connect("fynd_catalog.duckdb")
today = datetime.date.today().isoformat()

for store_url in STORES:
    run = client.actor("thirdwatch/fynd-scraper").call(
        run_input={
            "storeUrls": [store_url],
            "maxResultsPerTarget": 1000,
        },
        timeout_secs=900,
    )
    items = list(client.dataset(run["defaultDatasetId"]).iterate_items())
    for item in items:
        item["pulled_at"] = today
    df = pd.DataFrame(items).reindex(columns=COLS)
    # Keep one row per key so the upsert never sees duplicates in a batch.
    df = df.drop_duplicates(subset=["store_domain", "product_id", "pulled_at"], keep="last")
    db.execute("INSERT OR REPLACE INTO products SELECT * FROM df")
    print(f"{today} | {store_url}: {len(items)} products ingested")

db.close()
Add to cron for Monday-morning refreshes:
0 6 * * 1 python3 fynd_ingest.py
Cron runs jobs with a minimal environment, so make sure APIFY_TOKEN is set for the job and that the script runs from the directory that holds fynd_catalog.duckdb.
Sample output
Two records from a Fynd-powered storefront. Production pipelines typically ingest 200-5,000 products per store per run.
[
  {
    "store_domain": "www.store-alpha.com",
    "product_id": "5432198",
    "slug": "relaxed-fit-cargo-joggers-charcoal",
    "product_name": "Relaxed Fit Cargo Joggers - Charcoal",
    "brand": "Store Alpha",
    "price": 1599,
    "price_max": 1599,
    "original_price": 2199,
    "discount_percent": 27,
    "currency": "INR",
    "rating": 4.4,
    "rating_count": 519,
    "image_url": "https://cdn.fynd.com/v2/falling-surf-7c8bb8/fyprod/...",
    "url": "https://www.store-alpha.com/product/relaxed-fit-cargo-joggers-charcoal",
    "in_stock": true,
    "item_type": "standard",
    "source_query": null
  },
  {
    "store_domain": "www.store-beta.com",
    "product_id": "6789012",
    "slug": "vitamin-c-serum-30ml",
    "product_name": "Vitamin C Brightening Serum 30ml",
    "brand": "Store Beta",
    "price": 699,
    "price_max": 699,
    "original_price": 999,
    "discount_percent": 30,
    "currency": "INR",
    "rating": 4.6,
    "rating_count": 1203,
    "image_url": "https://cdn.fynd.com/v2/falling-surf-7c8bb8/fyprod/...",
    "url": "https://www.store-beta.com/product/vitamin-c-serum-30ml",
    "in_stock": true,
    "item_type": "standard",
    "source_query": null
  }
]
For database ingestion the critical fields are product_id (dedup key), store_domain (partition key), price and original_price (pricing analysis), and in_stock (inventory tracking). The slug field is useful for building URL-based lookups without storing full URLs.
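The slug-based lookup mentioned above could look something like the sketch below. `product_url` is a hypothetical helper, and the `https://<store_domain>/product/<slug>` pattern is inferred from the two sample records only, so check it against your own stores before relying on it.

```python
def product_url(store_domain: str, slug: str) -> str:
    """Rebuild a product URL from its domain and slug (hypothetical helper).

    Assumes the https://<store_domain>/product/<slug> pattern seen in the
    sample records; this is an inference, not a documented guarantee.
    """
    return f"https://{store_domain}/product/{slug}"


# Fields copied from the second sample record above.
record = {
    "store_domain": "www.store-beta.com",
    "slug": "vitamin-c-serum-30ml",
    "url": "https://www.store-beta.com/product/vitamin-c-serum-30ml",
}
rebuilt = product_url(record["store_domain"], record["slug"])
print(rebuilt == record["url"])
```

If the rebuilt URL matches the stored one for every row in a store, the url column can be derived on demand for that store instead of being indexed.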
Common pitfalls
Four issues surface in production Fynd catalog pipelines. India's D2C market is projected to reach $100 billion by 2027 according to Bain & Company's India Venture Capital Report, making platform-level data infrastructure critical for competitive intelligence.
Missing composite key on upserts: using product_id alone as a primary key fails when you ingest the same product from different stores or want to keep historical snapshots. Always use (store_domain, product_id, pulled_at) as your composite key.
Null handling in ratings: newly listed products may return null for rating and rating_count. SQL aggregates such as AVG and SUM skip nulls without warning, so handle them explicitly. COALESCE(rating_count, 0) is safe for counts, but zero-filling rating pulls averages down, so leave unrated products null or filter them out when averaging.
Schema drift across stores: while all Fynd stores share the same platform, some may return additional fields or omit optional ones. Always use explicit column selection (reindex(columns=...)) rather than bulk-inserting the raw JSON.
Timeout on large catalogs: stores with 5,000+ products may exceed the default API timeout. Set timeout_secs=900 or higher and use the async client for parallel store ingestion.
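To make the null-handling pitfall concrete, the sketch below compares an average that skips nulls with one that zero-fills them. It uses an in-memory SQLite table as a stand-in for the catalog database, and the three rows are made-up values for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id TEXT, rating REAL, rating_count INTEGER)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("a", 4.0, 100),
        ("b", 5.0, 50),
        ("c", None, None),  # newly listed product with no ratings yet
    ],
)

avg_skipping_nulls, avg_zero_filled, total_ratings = conn.execute("""
    SELECT AVG(rating),                    -- ignores the unrated product
           AVG(COALESCE(rating, 0)),       -- counts it as a 0-star rating
           SUM(COALESCE(rating_count, 0))  -- zero-fill is harmless for counts
    FROM products
""").fetchone()

print(avg_skipping_nulls, avg_zero_filled, total_ratings)
```

The two averages differ because the unrated product is either excluded or treated as a zero, so the choice should be deliberate rather than a side effect of how nulls were loaded.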
Thirdwatch's actor handles Fynd platform changes and extraction logic so your pipeline code stays focused on ingestion and analytics. You maintain the store list and database schema; the actor returns clean structured rows. Pair with our Flipkart Scraper or Myntra Scraper to build a cross-platform catalog that covers both D2C and marketplace channels for the same brands.
Frequently asked questions
What database should I use for Fynd product catalogs?
For small-scale projects (under 100K products), SQLite or DuckDB work well and require zero infrastructure. For production pipelines ingesting multiple storefronts weekly, PostgreSQL with a composite key handles dedup and upserts cleanly: use (store_domain, product_id) if you only keep the latest state, and add pulled_at if you keep historical snapshots as in the schema above. If you need full-text search across product names, PostgreSQL's tsvector or a dedicated search index like Typesense is the standard approach.
How do I handle product variants in the Fynd output?
The actor returns one row per product, not per variant. The price field reflects the current selling price and price_max reflects the highest variant price. If a product has size-dependent pricing, price and price_max will differ. For variant-level analysis you would need to follow the product URL and parse the variant data from the product detail page, which is outside the actor's current scope.
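A minimal sketch of how the price and price_max fields could be used to flag products with variant-dependent pricing. `price_spread` and `has_variant_pricing` are hypothetical helpers, and the second example record uses made-up prices.

```python
from typing import Any, Optional


def price_spread(item: dict[str, Any]) -> Optional[float]:
    """Return price_max - price, or None if either field is missing."""
    price, price_max = item.get("price"), item.get("price_max")
    if price is None or price_max is None:
        return None
    return price_max - price


def has_variant_pricing(item: dict[str, Any]) -> bool:
    """True when price and price_max differ, per the explanation above."""
    spread = price_spread(item)
    return spread is not None and spread > 0


flat = {"price": 699, "price_max": 699}      # same price for every variant
varied = {"price": 1599, "price_max": 1899}  # illustrative values only

print(has_variant_pricing(flat), has_variant_pricing(varied), price_spread(varied))
```

Flagged products are the ones worth a follow-up fetch of the product detail page if variant-level prices matter for your analysis.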