Build a Continuous AJIO Product Research Pipeline (2026)
Architect a continuous AJIO product research pipeline with Thirdwatch — daily snapshots, Postgres or Snowflake warehouse, dbt models, and BI dashboards.

Thirdwatch's AJIO Scraper is the data-source layer for a continuous India fashion product-research pipeline — daily snapshots, raw-JSON object storage, Postgres or Snowflake warehouse, dbt transformations, and BI dashboards on top. Built for engineering teams operationalising India fashion ecom intelligence and founders bootstrapping a price-comparison or trend-research product.
Why build an AJIO product research pipeline
A one-off AJIO pull is research. A continuous snapshot stream is a data product. For India fashion intelligence — pricing analytics, brand-launch tracking, EOSS rhythm modelling — the value compounds with longitudinal history. A 30-day window tells you about a sale event; a 12-month window tells you about Reliance Retail's strategic positioning and the seasonality of Indian online fashion.
According to the IBEF 2024 report on Indian retail and ecom, India's online fashion market is projected to grow above 20% CAGR through 2028, and AJIO is one of the two pure-play marketplaces driving that. For engineering teams at consumer-fashion startups, brand teams at heritage retailers, and founders building India-specific price-comparison or trend-research products, a stable AJIO data layer is foundational.
The job-to-be-done is structured: a consumer fashion app needs a daily AJIO catalog feed for its UI; a brand ops team wants 12 months of competitor-brand history in BI; a founder building price comparison needs an SLA-grade upstream feed. All reduce to scheduled runs, object-storage landing, warehouse ingestion, transformation, and consumption.
How does this compare to the alternatives?
| Approach | Reliability | Setup time | Maintenance |
|---|---|---|---|
| Build your own AJIO scraper + pipeline | Brittle — AJIO's anti-bot changes break the scraper periodically | 2 to 4 engineer weeks for v1 | Ongoing per AJIO change |
| Buy India fashion data SaaS | High — vendor SLA — but the underlying data is the same public catalog | Days for procurement | Vendor lock-in, paid tier |
| Thirdwatch AJIO Scraper + your own pipeline | Production-tested data layer + your warehouse stack | Half a day for v1, days for full dbt project | Thirdwatch handles AJIO drift |
The "buy SaaS" path makes sense for non-technical teams. For engineering teams that already run a warehouse, the AJIO Scraper actor plus your existing dbt + Postgres or Snowflake stack is meaningfully cheaper and gives you raw control over schema, modelling, and retention.
How to build the pipeline in 5 steps
Step 1: How do I schedule the AJIO Scraper on Apify?
Create an Apify schedule that fires the actor with the production input every day at a fixed IST time. Capture the run via webhook.
import json
import os

from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])
actor_id = "thirdwatch/ajio-scraper"

# Brand watchlist passed to the actor on every scheduled run.
input_payload = {
    "queries": ["AJIO Own", "Netplay", "Performax",
                "Levi's", "Allen Solly", "Biba", "HRX"],
    "category": "all",
    "sortBy": "newest",
    "maxResults": 1000,
}

# The Python client's create() takes keyword arguments, and the schedule API
# expects the run input body as a JSON string. Check both against the client
# version you have installed.
schedule = client.schedules().create(
    name="ajio-daily-brand-watchlist",
    cron_expression="30 0 * * *",  # 6:00 AM IST = 00:30 UTC
    is_enabled=True,
    is_exclusive=True,  # do not start a run while the previous one is still going
    actions=[{
        "type": "RUN_ACTOR",
        "actorId": actor_id,
        "runInput": {"body": json.dumps(input_payload),
                     "contentType": "application/json"},
    }],
)
print(schedule["id"])

Attach a webhook on the actor that fires on ACTOR.RUN.SUCCEEDED and points at your loader endpoint. Apify will POST the run metadata, from which you can fetch the dataset items.
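To make the webhook step concrete, here is a minimal sketch of the parsing a loader endpoint might do before fetching items. It assumes Apify's default webhook payload, where `eventType` names the event and `resource` carries the run object. The function name `extract_run` and the sample IDs are hypothetical; if you configure a custom payload template, the key lookups will need to change.

```python
import json


def extract_run(payload_body: str) -> dict:
    """Parse a webhook body and return the run id and dataset id.

    Assumes the default payload template: a top-level "eventType" plus a
    "resource" object holding the run. Raises ValueError for anything that
    is not a successful actor run.
    """
    payload = json.loads(payload_body)
    if payload.get("eventType") != "ACTOR.RUN.SUCCEEDED":
        raise ValueError(f"unexpected event: {payload.get('eventType')!r}")
    resource = payload.get("resource") or {}
    run_id = resource.get("id")
    dataset_id = resource.get("defaultDatasetId")
    if not run_id or not dataset_id:
        raise ValueError("payload is missing run id or defaultDatasetId")
    return {"run_id": run_id, "dataset_id": dataset_id}


# Hypothetical payload, trimmed to the fields used above.
sample_body = json.dumps({
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "resource": {"id": "run-123", "defaultDatasetId": "ds-456"},
})
print(extract_run(sample_body))
```

Rejecting non-success events up front keeps failed or aborted runs from producing empty snapshots further down the pipeline.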
Step 2: How do I land raw JSON into object storage?
The webhook handler fetches the dataset items by defaultDatasetId and writes a single Parquet file to S3 keyed by snapshot_date.
import datetime
import io
import os

import boto3
import pandas as pd
from apify_client import ApifyClient

client = ApifyClient(os.environ["APIFY_TOKEN"])
s3 = boto3.client("s3")
BUCKET = "ajio-snapshots"

def handle_webhook(run_id: str) -> str:
    # Look up the finished run and pull every item from its default dataset.
    run = client.run(run_id).get()
    dataset_id = run["defaultDatasetId"]
    items = list(client.dataset(dataset_id).iterate_items())
    if not items:
        raise RuntimeError(f"run {run_id} returned no items; refusing to write an empty snapshot")

    # Stamp every row with the snapshot date and load time, both in UTC.
    now_utc = datetime.datetime.now(datetime.timezone.utc)
    snapshot_date = now_utc.date().isoformat()
    df = pd.DataFrame(items)
    df["snapshot_date"] = snapshot_date
    df["load_ts_utc"] = now_utc.isoformat()

    # Serialise to Parquet in memory and write one object per snapshot date.
    buf = io.BytesIO()
    df.to_parquet(buf, index=False)
    key = f"raw/ajio/snapshot_date={snapshot_date}/products.parquet"
    s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue())
    return key

Parquet keeps the raw layer compact (typically 5 to 20 MB per daily snapshot at 5,000 SKUs) and gives you DuckDB / Athena access for ad-hoc queries without warehouse spend.
Step 3: How do I load into Postgres?
A small loader script reads today's Parquet from S3, conforms types, and writes to the warehouse.

import os

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(os.environ["WAREHOUSE_URL"])

# Reading s3:// paths with pandas requires the s3fs package.
df = pd.read_parquet("s3://ajio-snapshots/raw/ajio/snapshot_date=2026-05-12/products.parquet")

# Pin the column list so upstream field additions do not leak into the warehouse.
keep = ["sku", "snapshot_date", "product_name", "brand", "price", "list_price",
        "original_price", "discount_percent", "currency", "category",
        "catalog_name", "color_group", "coupon_applicable", "planning_category",
        "image_url", "url", "load_ts_utc"]
df = df[keep]

with engine.begin() as conn:
    conn.execute(sqlalchemy.text("""
        CREATE TABLE IF NOT EXISTS raw_ajio_products (
            sku TEXT, snapshot_date DATE,
            product_name TEXT, brand TEXT,
            price NUMERIC, list_price NUMERIC, original_price NUMERIC,
            discount_percent NUMERIC, currency TEXT,
            category TEXT, catalog_name TEXT, color_group TEXT,
            coupon_applicable BOOLEAN, planning_category TEXT,
            image_url TEXT, url TEXT, load_ts_utc TIMESTAMP,
            PRIMARY KEY (sku, snapshot_date)
        )
    """))
    # Plain append: a second run for the same day will hit the primary key.
    df.to_sql("raw_ajio_products", conn, if_exists="append", index=False,
              method="multi", chunksize=500)

The (sku, snapshot_date) primary key is the natural grain. As written, the loader appends, so re-running it for the same day fails on the primary key. It becomes idempotent once you replace the append with an INSERT ... ON CONFLICT DO UPDATE upsert.
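One way to get the ON CONFLICT behaviour is to generate the upsert statement from the column list. The sketch below is illustrative rather than part of the original loader: `build_upsert_sql` is a hypothetical helper, it targets Postgres syntax, and it emits named bind parameters (`:col`) on the assumption that the statement will be run through `sqlalchemy.text()`.

```python
def build_upsert_sql(table: str, columns: list[str], key_columns: list[str]) -> str:
    """Return a Postgres INSERT ... ON CONFLICT statement for the given columns.

    Non-key columns are overwritten with the incoming values; if every column
    is part of the key, the statement falls back to DO NOTHING.
    """
    non_key = [c for c in columns if c not in key_columns]
    col_list = ", ".join(columns)
    bind_list = ", ".join(f":{c}" for c in columns)
    key_list = ", ".join(key_columns)
    if not non_key:
        action = "DO NOTHING"
    else:
        # EXCLUDED refers to the row that was proposed for insertion.
        updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in non_key)
        action = f"DO UPDATE SET {updates}"
    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({bind_list}) "
        f"ON CONFLICT ({key_list}) {action}"
    )


sql = build_upsert_sql(
    "raw_ajio_products",
    ["sku", "snapshot_date", "price", "discount_percent"],
    ["sku", "snapshot_date"],
)
print(sql)
```

Inside the `engine.begin()` block, the statement could then be executed as `conn.execute(sqlalchemy.text(sql), df.to_dict("records"))` instead of calling `df.to_sql`. Missing values may need converting to `None` first, since pandas represents them as NaN.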
Step 4: How do I model this in dbt?
Three layers: staging (cleaned types and null handling), marts (business-grade facts and dimensions), analytics (derived metrics for BI).
-- models/staging/stg_ajio_products.sql
select
    sku,
    snapshot_date,
    product_name,
    initcap(trim(brand)) as brand,
    case when lower(trim(brand)) in ('ajio own', 'netplay', 'teamspirit', 'performax')
         then true else false end as is_private_label,
    price::numeric as price,
    list_price::numeric as list_price,
    original_price::numeric as original_price,
    discount_percent::numeric as discount_percent,
    split_part(category, '>', 1) as segment,
    split_part(category, '>', 2) as vertical,
    split_part(category, '>', 3) as brick,
    planning_category,
    url
from {{ source('raw', 'raw_ajio_products') }}

-- models/marts/fct_ajio_price_history.sql
{{ config(materialized='incremental', unique_key=['sku', 'snapshot_date']) }}
select * from {{ ref('stg_ajio_products') }}
{% if is_incremental() %}
where snapshot_date > (select coalesce(max(snapshot_date), '2000-01-01') from {{ this }})
{% endif %}

-- models/analytics/agg_brand_discount_daily.sql
select
    snapshot_date, brand, is_private_label,
    count(*) as sku_count,
    avg(discount_percent) as avg_discount,
    avg(price) as avg_price,
    sum(case when discount_percent >= 50 then 1 else 0 end) as deep_discount_skus
from {{ ref('fct_ajio_price_history') }}
group by 1, 2, 3

The agg_brand_discount_daily mart powers the daily brand-velocity dashboard. The is_private_label flag separates AJIO's Reliance house brands from third-party brands, which have different operational discount patterns.
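When reviewing the aggregate, it can help to have a tiny reference implementation to compare the model's output against on a handful of rows. The following is a sketch of that idea in plain Python, not part of the dbt project: `brand_discount_daily` and the sample rows are hypothetical, and it assumes price and discount_percent are never null (SQL's avg skips nulls, this code does not).

```python
from collections import defaultdict


def brand_discount_daily(rows: list[dict]) -> list[dict]:
    """Aggregate product rows per (snapshot_date, brand, is_private_label)."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["snapshot_date"], row["brand"], row["is_private_label"])
        groups[key].append(row)

    out = []
    for (snapshot_date, brand, is_private_label), members in sorted(groups.items()):
        discounts = [m["discount_percent"] for m in members]
        prices = [m["price"] for m in members]
        out.append({
            "snapshot_date": snapshot_date,
            "brand": brand,
            "is_private_label": is_private_label,
            "sku_count": len(members),
            "avg_discount": sum(discounts) / len(discounts),
            "avg_price": sum(prices) / len(prices),
            # Same threshold as the SQL model: 50% off or more.
            "deep_discount_skus": sum(1 for d in discounts if d >= 50),
        })
    return out


# Hypothetical rows shaped like fct_ajio_price_history.
rows = [
    {"snapshot_date": "2026-05-12", "brand": "Netplay", "is_private_label": True,
     "price": 600, "discount_percent": 60},
    {"snapshot_date": "2026-05-12", "brand": "Netplay", "is_private_label": True,
     "price": 800, "discount_percent": 40},
    {"snapshot_date": "2026-05-12", "brand": "Biba", "is_private_label": False,
     "price": 1500, "discount_percent": 50},
]
for record in brand_discount_daily(rows):
    print(record)
```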
Step 5: How do I expose this to BI and alerts?
Connect Metabase / Superset / Looker to the warehouse and build dashboards on the analytics schema. For Slack alerts, run a small materialised view check on a schedule:
create materialized view mv_new_launches_today as
select today.sku, today.brand, today.product_name, today.price, today.discount_percent, today.url
from fct_ajio_price_history today
left join fct_ajio_price_history yest
    on today.sku = yest.sku and yest.snapshot_date = today.snapshot_date - 1
where yest.sku is null
    and today.snapshot_date = current_date;

A cron-scheduled Python script refreshes the view (REFRESH MATERIALIZED VIEW mv_new_launches_today, since current_date is fixed at the last refresh), reads from it, and POSTs a daily digest to Slack. Note that a missed snapshot on the previous day makes every SKU look new, so gate the alert on yesterday's partition being present. The whole stack is under 100 lines of Python plus a small dbt project.
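The digest script itself can be small. Below is a minimal sketch using only the standard library, assuming a Slack incoming webhook that accepts a JSON body with a `text` field. The function names, the message layout, and the sample row are illustrative assumptions; fetching rows from mv_new_launches_today is left to whatever database client you already use.

```python
import json
import urllib.request


def format_digest(rows: list[dict], snapshot_date: str) -> str:
    """Render new-launch rows as a plain-text Slack message."""
    if not rows:
        return f"AJIO new launches for {snapshot_date}: none detected."
    lines = [f"AJIO new launches for {snapshot_date} ({len(rows)}):"]
    for row in rows:
        lines.append(
            f"- {row['brand']}: {row['product_name']} at {row['price']} "
            f"({row['discount_percent']}% off) {row['url']}"
        )
    return "\n".join(lines)


def post_to_slack(webhook_url: str, text: str) -> int:
    """POST the message to a Slack incoming webhook and return the HTTP status."""
    body = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Hypothetical row shaped like mv_new_launches_today.
sample = [{"brand": "Netplay", "product_name": "Slim Fit Shirt", "price": 599,
           "discount_percent": 60, "url": "https://example.com/p/1"}]
message = format_digest(sample, "2026-05-12")
print(message)
```

Keeping the formatting separate from the HTTP call means the message layout can be unit-tested without touching Slack.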
Sample output
Three rows from the agg_brand_discount_daily mart show how the pipeline aggregates raw snapshots into operational metrics:
[
  {
    "snapshot_date": "2026-05-12",
    "brand": "AJIO Own",
    "is_private_label": true,
    "sku_count": 4123,
    "avg_discount": 62.4,
    "avg_price": 612,
    "deep_discount_skus": 3401
  },
  {
    "snapshot_date": "2026-05-12",
    "brand": "Levi's",
    "is_private_label": false,
    "sku_count": 287,
    "avg_discount": 38.1,
    "avg_price": 2241,
    "deep_discount_skus": 51
  },
  {
    "snapshot_date": "2026-05-12",
    "brand": "Biba",
    "is_private_label": false,
    "sku_count": 412,
    "avg_discount": 44.7,
    "avg_price": 1893,
    "deep_discount_skus": 102
  }
]

Private-label discount averages run materially higher than third-party averages, the operational signal you couldn't extract without a structured pipeline.
Common pitfalls
Three things go wrong in production AJIO pipelines. Schema drift: AJIO occasionally renames or adds fields, so pin your staging layer's column list explicitly rather than using select *. Non-idempotent loads: re-running the loader without an upsert fails on the (sku, snapshot_date) primary key or, if the key is missing, produces duplicate rows that corrupt time-series aggregates. Backfill gaps: there's no API for historical AJIO snapshots, so you can only fill forward. Alert on missing snapshot_date partitions and treat gaps as data-quality incidents.
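The missing-partition alert reduces to a date-range comparison. Here is a minimal sketch, assuming you can list the snapshot dates that exist (for example from the S3 key prefixes or a select distinct snapshot_date query); `missing_snapshot_dates` and the sample dates are hypothetical.

```python
import datetime


def missing_snapshot_dates(present: list[str], start: str, end: str) -> list[str]:
    """Return ISO dates in [start, end] that have no snapshot partition."""
    have = set(present)
    day = datetime.date.fromisoformat(start)
    last = datetime.date.fromisoformat(end)
    missing = []
    while day <= last:
        if day.isoformat() not in have:
            missing.append(day.isoformat())
        day += datetime.timedelta(days=1)
    return missing


# Hypothetical partition listing with two gaps in a five-day window.
present = ["2026-05-10", "2026-05-11", "2026-05-13"]
gaps = missing_snapshot_dates(present, "2026-05-10", "2026-05-14")
print(gaps)
```

A non-empty result is the trigger for the data-quality incident. Because gaps cannot be backfilled, the sooner the check runs after the scheduled load, the better the chance of re-running the actor on the same day.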
Thirdwatch's actor handles AJIO's anti-bot defences transparently — production-tested at daily watchlist volumes. The upstream layer of your pipeline becomes the most stable part; engineering effort goes into schema, modelling, and consumption rather than fighting anti-bot.
Frequently asked questions
What does a production AJIO research pipeline look like?
Four layers: an Apify schedule that fires the AJIO Scraper daily, a webhook that lands raw JSON in S3 or GCS, a small Python loader that upserts into a Postgres warehouse keyed on sku plus snapshot_date, and a dbt project that derives staging, marts, and analytics views on top. BI dashboards consume the marts; Slack alerts fire from materialised views.
What's the right schema for AJIO product snapshots?
A wide fact table keyed on (sku, snapshot_date), which is raw_ajio_products in the loader above, carrying price, list_price, original_price, discount_percent, brand, category path, planning_category, image_url, coupon_applicable, and a load timestamp. Alongside it, a dim_brand table with normalised brand identities and a private-label flag; this post derives that flag inline in the staging model instead. The fact table grows by roughly the size of one daily snapshot per day.
How do I keep the warehouse cheap on cloud storage?
Partition the fact table by snapshot_date and apply retention — 90 days hot in Postgres, the full history in S3 Parquet. Most BI questions need only the last 30 to 90 days. For backfill or long-window analytics, query Parquet directly via DuckDB or Athena. This keeps Postgres under 5 GB even for a multi-brand watchlist.
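As a rough illustration of the retention step, the sketch below builds the DELETE statement for a 90-day hot window. It is an assumption-laden example: `retention_delete_sql` is a hypothetical helper, the table name is taken from the dbt models in this post, and on a natively partitioned table dropping old partitions would usually be preferable to a row-level delete.

```python
import datetime


def retention_delete_sql(table: str, today: datetime.date, hot_days: int = 90) -> str:
    """Build a DELETE that removes rows older than the hot window."""
    cutoff = today - datetime.timedelta(days=hot_days)
    return f"DELETE FROM {table} WHERE snapshot_date < DATE '{cutoff.isoformat()}'"


statement = retention_delete_sql("fct_ajio_price_history", datetime.date(2026, 5, 12))
print(statement)
```

Run it only after confirming the matching Parquet snapshots exist in S3, since object storage is the only copy of the older history.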
How should I orchestrate the daily run?
Two viable options: Apify's built-in schedule plus webhooks (simplest, no extra infra), or Airflow / Prefect / Dagster calling the Apify API on a cron (better if you already have an orchestrator). Both push raw JSON to object storage; the loader DAG runs after the actor finishes. Apify webhooks fire reliably on run completion.
Can I plug AJIO data into an existing dbt + Snowflake stack?
Yes. Replace the Postgres loader with a Snowflake COPY INTO or external table over the S3 Parquet landing zone, then point your dbt project at it. Schema and SQL examples in this post translate directly — only the load mechanic changes. Snowflake's clustering on snapshot_date makes daily snapshot queries trivially cheap.
How do I version-control the pipeline?
Treat it like any data project. Git-track the Apify actor input JSON, the loader script, the dbt project, the warehouse DDL, and the BI dashboard JSON. Run dbt build in CI on every PR to catch schema breaks before they reach production. Apify input JSON is small enough to live alongside your loader code.