Build an India Beauty Trend Pipeline With Nykaa Scraper Data

Operational blueprint for an India beauty trend pipeline — Nykaa SKU data, ingredient tagging, growth scoring, and influencer collab shortlists for marketers.

May 12, 2026 · 6 min read · 1,442 words

Thirdwatch's Nykaa Scraper is the structured data backbone for an India beauty trend pipeline — SKU-level brand, price, rating, rating_count, and category data refreshable on any cadence. This guide is the operational blueprint: pipeline shape, ingredient tagging, growth scoring, and how to convert the signal into an influencer collab shortlist.

Why build a Nykaa-powered beauty trend pipeline

India's beauty and personal-care market is the fastest growing globally. Per RedSeer and IMARC, the BPC market crossed $20B in 2024 and is projected to reach $30B+ by 2027, with online taking share through 2028. Nykaa addresses the premium and aspirational mid-market — roughly half that TAM at the GMV level. The Nykaa FY24 annual report cites ~6,800 active brands across roughly 4.8K SKUs in just makeup. For trend forecasting, ingredient adoption tracking, and influencer-campaign targeting, no other Indian beauty surface offers comparable depth or freshness.

The job-to-be-done is operational, not exploratory. A category strategist at a CPG major needs a weekly trend digest, not a one-time research artifact. A growth marketer at an indie brand needs a daily shortlist of rising SKUs to reverse-engineer for messaging. An agency planning influencer collab campaigns needs a refreshable shortlist of products in active discovery mode. The pipeline shape matters more than any one snapshot — repeatability, ingredient tagging, growth scoring, and brand-set diff are the four components that turn raw rows into a marketing asset.

How does this compare to alternatives?

Three approaches for building the pipeline backbone:

Approach	Reliability	Setup time	Maintenance
In-house scraper (DIY Python + browser automation)	Medium; breaks on Nykaa redesigns	2-4 weeks engineering	Continuous — anti-bot moves regularly
Trend SaaS feed (Trendalytics, Edited, Spate)	High; pricey enterprise contracts	4-8 weeks	Vendor-managed
Thirdwatch Nykaa Scraper	Production-tested with production-grade anti-bot tooling	30 minutes	Thirdwatch tracks Nykaa changes

A DIY scraper is feasible, but the maintenance burden compounds: Nykaa redesigns 2-4 times a year, and each redesign means another round of debugging. Trend SaaS is reliable but priced for enterprise. The Nykaa Scraper is the middle path: production-tested, with transparent per-result pricing.

How to build a Nykaa trend pipeline in 5 steps

Step 1: How do I authenticate against Apify?

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"

Step 2: How do I structure the daily ingestion job?

Run a fixed category sweep daily and persist the raw JSON so any day can be replayed later.

import os, requests, json, datetime, pathlib

ACTOR = "thirdwatch~nykaa-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

CATEGORIES_LEAF = ["lipstick", "foundation", "eye-makeup", "nail",
                   "face-wash", "moisturizer", "serum",
                   "shampoo", "conditioner"]
CATEGORIES_TOP = ["makeup", "skin", "hair", "fragrance", "men"]

today = datetime.date.today().isoformat()
out_dir = pathlib.Path(f"snapshots/{today}")
out_dir.mkdir(parents=True, exist_ok=True)

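def pull(cat, sort, n):
    # One synchronous actor run per (category, sort) pair; returns the dataset items.
    resp = requests.post(
        f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
        params={"token": TOKEN},
        json={"queries": [], "category": cat, "sortBy": sort, "maxResults": n},
        timeout=900,
    )
    resp.raise_for_status()  # fail loudly instead of persisting an error payload
    return resp.json()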

for cat in CATEGORIES_LEAF:
    pop = pull(cat, "popularity", 100)
    new = pull(cat, "newest", 100)
    (out_dir / f"{cat}-pop.json").write_text(json.dumps(pop))
    (out_dir / f"{cat}-new.json").write_text(json.dumps(new))

for cat in CATEGORIES_TOP:
    new = pull(cat, "newest", 200)
    (out_dir / f"{cat}-top-new.json").write_text(json.dumps(new))

print(f"{today}: ingestion complete")

This produces a stable directory layout that any downstream consumer can reason about: snapshots/YYYY-MM-DD/{category}-pop.json and {category}-new.json for the leaf sweeps, plus {category}-top-new.json for the top-level sweeps.
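Downstream code has to recover the date, category, and sort view from each file path. A minimal sketch of that parsing, assuming the file names written by the ingestion job above; `SnapshotFile` and `parse_snapshot_path` are hypothetical names introduced here for illustration, and POSIX-style paths are assumed:

```python
from datetime import date
from pathlib import PurePosixPath
from typing import NamedTuple


class SnapshotFile(NamedTuple):
    snapshot_date: date
    category: str
    level: str      # "leaf" or "top"
    sort_view: str  # "popularity" or "newest"


def parse_snapshot_path(path: str) -> SnapshotFile:
    """Recover (date, category, level, sort view) from a snapshot file path."""
    p = PurePosixPath(path)
    snapshot_date = date.fromisoformat(p.parent.name)  # directory is YYYY-MM-DD
    stem = p.stem
    # Check the longest suffix first so "-top-new" is not mistaken for "-new".
    if stem.endswith("-top-new"):
        return SnapshotFile(snapshot_date, stem[: -len("-top-new")], "top", "newest")
    if stem.endswith("-new"):
        return SnapshotFile(snapshot_date, stem[: -len("-new")], "leaf", "newest")
    if stem.endswith("-pop"):
        return SnapshotFile(snapshot_date, stem[: -len("-pop")], "leaf", "popularity")
    raise ValueError(f"unrecognized snapshot file name: {p.name}")


print(parse_snapshot_path("snapshots/2026-05-12/eye-makeup-pop.json"))
```

Matching on explicit suffixes, rather than testing whether "new" occurs anywhere in the name, keeps the parser correct if a category slug ever contains that substring.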

Step 3: How do I normalize and tag ingredients at ingest?

Brand normalization plus ingredient regex tagging is the highest-leverage step. Do it once, on ingest, not at query time.

import glob, json, pathlib, re
import pandas as pd

INGREDIENTS = {
    "niacinamide": r"niacinamid",
    "retinol": r"retinol|retinal|retinoid",
    "hyaluronic_acid": r"hyaluronic|hyaluron",
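    "salicylic_acid": r"salicylic|\bbha\b",  # word boundaries stop "bha" matching inside longer words
    "glycolic_acid": r"glycolic",
    "mandelic_acid": r"mandelic",
    "peptides": r"peptide",
    "ceramides": r"ceramide",
    "vitamin_c": r"\bvitamin c\b|ascorbic|kakadu",  # \b avoids tagging "vitamin complex"
    "alpha_arbutin": r"arbutin",  # also covers "alpha arbutin"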
    "kojic_acid": r"kojic",
    "tranexamic": r"tranexamic",
    "spf": r"\bspf\b|sunscreen",
    "vegan": r"\bvegan\b",
    "clean": r"\bclean\b|paraben.?free|sulphate.?free|sulfate.?free",
}

BRAND_NORM = {
    "mac": "MAC", "m.a.c.": "MAC", "mac cosmetics": "MAC",
    "sugar": "Sugar Cosmetics", "sugar cosmetics": "Sugar Cosmetics",
    "the ordinary": "The Ordinary", "ordinary": "The Ordinary",
    "loreal": "L'Oreal", "l'oreal": "L'Oreal", "loreal paris": "L'Oreal",
}

def tag_ingredients(text):
    text = (text or "").lower()
    return [k for k, pat in INGREDIENTS.items() if re.search(pat, text)]

def norm_brand(b):
    if not b: return None
    key = b.strip().lower()
    return BRAND_NORM.get(key, b.strip())

rows = []
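for f in glob.glob("snapshots/*/*.json"):
    path = pathlib.Path(f)
    snapshot_date = path.parent.name  # directory name is the YYYY-MM-DD snapshot date
    # Files end in -pop, -new, or -top-new; only the -new variants are "newest" pulls.
    sort = "newest" if path.stem.endswith("-new") else "popularity"
    for j in json.loads(path.read_text()):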
        j["snapshot_date"] = snapshot_date
        j["sort_view"] = sort
        j["brand_norm"] = norm_brand(j.get("brand"))
        j["ingredient_tags"] = tag_ingredients(j.get("product_name", ""))
        rows.append(j)

df = pd.DataFrame(rows)
df["snapshot_date"] = pd.to_datetime(df["snapshot_date"])
df.to_parquet("nykaa_history.parquet")

DuckDB can query the Parquet file directly, with no infrastructure to operate, and typically answers in well under a second even at multi-million-row scale.

Step 4: How do I compute SKU growth scores?

A rolling 4-week rating_count delta is the cleanest growth proxy. Aggregate to one row per SKU per week, and carry the ingredient tags along for Step 5.

import duckdb

con = duckdb.connect()
con.execute("CREATE VIEW nykaa AS SELECT * FROM 'nykaa_history.parquet'")

growth = con.execute("""
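WITH weekly AS (
    -- One row per SKU per week. Name, brand, and tags are picked with ANY_VALUE
    -- rather than grouped on, so a renamed listing cannot split a SKU's history.
    SELECT
        sku,
        ANY_VALUE(product_name) AS product_name,
        ANY_VALUE(brand_norm) AS brand_norm,
        date_trunc('week', snapshot_date) AS week,
        MAX(rating_count) AS rc,
        AVG(rating) AS rating,
        AVG(price) AS price,
        ANY_VALUE(ingredient_tags) AS tags
    FROM nykaa
    WHERE sku IS NOT NULL AND rating_count IS NOT NULL
    GROUP BY 1, 4
),
deltas AS (
    -- LAG(rc, 4) looks back four rows, which is four weeks only if the SKU
    -- appears in every weekly snapshot.
    SELECT
        sku, product_name, brand_norm, week, rc, rating, price, tags,
        rc - LAG(rc, 4) OVER (PARTITION BY sku ORDER BY week) AS rc_4w_delta,
        LAG(rc, 4) OVER (PARTITION BY sku ORDER BY week) AS rc_4w_ago
    FROM weekly
)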
SELECT
    product_name, brand_norm, rc, rc_4w_delta,
    ROUND(100.0 * rc_4w_delta / NULLIF(rc_4w_ago, 0), 1) AS growth_pct,
    rating, price, tags
FROM deltas
WHERE week = (SELECT MAX(week) FROM deltas)
  AND rc_4w_delta IS NOT NULL
ORDER BY rc_4w_delta DESC
LIMIT 100
""").df()

print(growth.head(30))

The top 100 rows are your rising-SKU shortlist for the week.
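A row-based lag equals a 4-week lag only when a SKU shows up in every weekly snapshot. The sketch below shows a calendar-aware alternative in plain Python, under the assumption that each SKU's history is available as a mapping from week-start date to rating_count; `four_week_delta` and the sample history are hypothetical and not part of the scraper or the query above:

```python
from datetime import date, timedelta


def four_week_delta(weekly_rc, week, lookback_weeks=4):
    """Return (delta, growth_pct) versus the week exactly `lookback_weeks` earlier.

    Returns None when either week is missing, so a gap in coverage never
    silently stretches the comparison window.
    """
    prior_week = week - timedelta(weeks=lookback_weeks)
    if week not in weekly_rc or prior_week not in weekly_rc:
        return None
    base = weekly_rc[prior_week]
    delta = weekly_rc[week] - base
    # Mirror NULLIF(rc_4w_ago, 0): no percentage when the base is zero.
    growth_pct = round(100.0 * delta / base, 1) if base else None
    return delta, growth_pct


# Hypothetical rating_count history for one SKU, keyed by week start (Monday).
history = {
    date(2026, 4, 13): 1000,
    date(2026, 4, 20): 1100,
    date(2026, 4, 27): 1250,
    date(2026, 5, 4): 1400,
    date(2026, 5, 11): 1600,
}
print(four_week_delta(history, date(2026, 5, 11)))  # (600, 60.0)
```

Returning None for a gap is a deliberate choice: a SKU that dropped out of the sweep for a week is excluded from that week's ranking instead of being compared against a five- or six-week-old count.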

Step 5: How do I convert the shortlist into an influencer collab brief?

Group rising SKUs by ingredient and brand-newness. A real trend has 5+ peer brands; a campaign push has one.

import collections, json, pathlib

def as_list(ts):
    # DuckDB's .df() can hand back list columns as NumPy arrays, and `ts or []`
    # raises ValueError on a multi-element array, so convert explicitly.
    return [] if ts is None else list(ts)

# Trend signal: count of rising SKUs per ingredient tag
tag_counter = collections.Counter()
for _, row in growth.iterrows():
    for t in as_list(row.tags):
        tag_counter[t] += 1

print("Ingredient trend signal (count of rising SKUs):")
for tag, n in tag_counter.most_common(10):
    print(f"  {tag}: {n} rising SKUs")

# Brand new-entry signal: brands appearing for the first time
known_brands_path = pathlib.Path("known_brands.json")
known = set(json.loads(known_brands_path.read_text())) if known_brands_path.exists() else set()
current_brands = set(df.brand_norm.dropna().unique())
new_brands = current_brands - known

# Collab shortlist: rising SKUs from new brands tagged with trending ingredients
trending_tags = {t for t, n in tag_counter.most_common(5)}
shortlist = growth[
    growth.brand_norm.isin(new_brands)
    & growth.tags.apply(lambda ts: bool(set(as_list(ts)) & trending_tags))
].head(20)
print(shortlist[["product_name", "brand_norm", "tags", "rc_4w_delta"]])

# Persist the brand set so the next run can diff against it
known_brands_path.write_text(json.dumps(sorted(known | current_brands)))

A rising SKU from a brand new to Nykaa, tagged with a trending ingredient, is the canonical "indie brand worth onboarding to a creator campaign" signal.
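The "5+ peer brands" rule of thumb depends on distinct brands per ingredient, not on the number of rising SKUs. A minimal sketch of that check, assuming each rising SKU is a dict with `brand_norm` and `tags` keys as in the growth table; `classify_tags`, the "unclear" label, and the sample rows are illustrative assumptions:

```python
from collections import defaultdict

MIN_PEER_BRANDS = 5  # threshold taken from the "5+ peer brands" rule of thumb


def classify_tags(rising_skus, min_brands=MIN_PEER_BRANDS):
    """Label each ingredient tag by how many distinct brands carry it."""
    brands_by_tag = defaultdict(set)
    for sku in rising_skus:
        for tag in sku["tags"]:
            brands_by_tag[tag].add(sku["brand_norm"])
    labels = {}
    for tag, brands in brands_by_tag.items():
        if len(brands) >= min_brands:
            labels[tag] = "trend"      # many brands rising together
        elif len(brands) == 1:
            labels[tag] = "campaign"   # one brand spiking, no peers
        else:
            labels[tag] = "unclear"    # 2-4 brands: keep watching
    return labels


# Hypothetical rising SKUs: peptides across five brands, kojic acid from one.
rising = [{"brand_norm": f"Brand {i}", "tags": ["peptides"]} for i in range(5)]
rising += [{"brand_norm": "Brand X", "tags": ["kojic_acid"]} for _ in range(8)]
print(classify_tags(rising))
```

With the Step 4 result, `classify_tags(growth.to_dict("records"))` would be one way to feed it, assuming `tags` holds an iterable of strings on every row. Note that the eight kojic acid SKUs still read as a campaign, because they all come from a single brand.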

Sample output

The pipeline emits a weekly digest that looks like this. The underlying actor records remain the structured rows from the Nykaa Scraper with the schema documented in the market-research guide.

{
  "week": "2026-05-12",
  "trending_ingredients": [
    {"tag": "niacinamide", "rising_skus": 47},
    {"tag": "peptides", "rising_skus": 38},
    {"tag": "mandelic_acid", "rising_skus": 21}
  ],
  "new_brands_on_nykaa": ["Earth Rhythm", "Conscious Chemist", "Foxtale"],
  "collab_shortlist": [
    {
      "product_name": "Foxtale Mighty Peptide Cream",
      "brand": "Foxtale",
      "rc_4w_delta": 12450,
      "tags": ["peptides", "clean"],
      "url": "https://www.nykaa.com/foxtale-mighty-peptide-cream/p/..."
    }
  ]
}

This digest plugs directly into a creator-campaign brief: a shortlist of products in active discovery mode, the brands behind them, and the ingredient story that connects them.

Common pitfalls

Three things go wrong building trend pipelines. Sale-window pollution — rating_count jumps disproportionately during Pink Friday and similar events because reviewers churn through orders; exclude sale weeks from growth-delta computation or you'll over-rank legacy SKUs. Ingredient false-positives — regex tags pick up "vitamin C-free" as "vitamin_c"; use negative lookahead patterns or weight by full-text proximity. Brand-history bootstrapping — your "new brand" detector only works after 4+ weeks of history; treat the first month as warmup and don't publish brand-new alerts until then.
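The ingredient false-positive can be handled with a negative lookahead, as in this rough sketch. The `not_free` helper is a hypothetical wrapper, and the pattern covers only the "X-free" and "X free" phrasings; other negations (for example "without vitamin C") would still slip through:

```python
import re


def not_free(pattern):
    """Compile `pattern` so it does not match when followed by '-free' or ' free'."""
    return re.compile(rf"(?:{pattern})(?![\s-]*free)", re.IGNORECASE)


# \b keeps "vitamin c" from matching the start of "vitamin complex".
VITAMIN_C = not_free(r"\bvitamin\s*c\b|ascorbic")


def has_vitamin_c(product_name):
    return bool(VITAMIN_C.search(product_name or ""))


print(has_vitamin_c("10% Vitamin C Face Serum"))        # True
print(has_vitamin_c("Vitamin C-Free Gentle Cleanser"))  # False
```

Tags that are themselves "free" claims, such as paraben-free under the clean tag, should stay on plain patterns, since the "-free" suffix is the signal there.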

A fourth subtle issue: Nykaa Luxe SKUs (prestige brands like Charlotte Tilbury, La Mer, Dior) trade at MRP almost always; segment prestige from mass before computing discount-depth trends, otherwise the mass-market promotional cycle dominates the signal.
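One way to keep the two segments apart is to label each row before any discount-depth aggregation. The sketch below assumes a hand-maintained prestige list seeded with the three brands named above; `PRESTIGE_BRANDS`, `segment`, and `split_by_segment` are hypothetical names, and a real list would need to be far longer:

```python
# Hypothetical prestige list, seeded from the brands named in the text.
PRESTIGE_BRANDS = {"charlotte tilbury", "la mer", "dior"}


def segment(brand):
    """Return 'prestige' or 'mass' for a brand name (case-insensitive)."""
    return "prestige" if (brand or "").strip().lower() in PRESTIGE_BRANDS else "mass"


def split_by_segment(rows):
    """Partition SKU rows so discount trends can be computed per segment."""
    out = {"prestige": [], "mass": []}
    for row in rows:
        out[segment(row.get("brand"))].append(row)
    return out


# Hypothetical rows for illustration only.
rows = [
    {"brand": "Dior", "price": 3900},
    {"brand": "Sugar Cosmetics", "price": 499},
    {"brand": " La Mer ", "price": 8500},
]
parts = split_by_segment(rows)
print(len(parts["prestige"]), len(parts["mass"]))  # 2 1
```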

Thirdwatch's actor handles Nykaa's production-grade anti-bot tooling by intercepting the page's embedded JSON payload, with a DOM fallback when the JSON shape shifts. A daily 14-category sweep at 100-200 results each finishes in under ten minutes — cheap enough to run via cron without a dedicated worker. Pair this pipeline with our Myntra Scraper (Tira beauty crossover), AJIO Scraper (AJIO Luxe), and Amazon Scraper for full multi-channel India beauty trend coverage.

Frequently asked questions

What's the simplest pipeline architecture?

Daily cron triggers the actor with 10-15 category sweeps, results land in JSON, a Python script normalizes brands and tags ingredients via regex, results write to a Postgres or DuckDB table. A nightly view computes growth scores by SKU. A weekly script produces the trend digest. Total moving parts: cron, actor, Python, DB, view. No Kafka, no Airflow needed for under 100K rows.

Which ingredients should I tag for trend tracking?

The 2026 actives short list: niacinamide, retinol, hyaluronic acid, salicylic acid, glycolic acid, mandelic acid, peptides, ceramides, vitamin C, kojic acid, alpha arbutin, tranexamic acid. Add brand-specific marketing terms separately (clean, vegan, paraben-free, fragrance-free). Regex tag each SKU against the list at ingest time.

How do I generate an influencer collab shortlist from this data?

Rank SKUs by rolling rating_count growth over 4-8 weeks. The top 50 are products in active discovery mode — the right brief for influencer campaigns. Cross-reference with brand-set diff to find SKUs from brands that are also new to Nykaa, which signal a brand willing to spend on awareness. Match to creator categories (lipstick reviewers, K-beauty enthusiasts, clean-beauty advocates).

Do I need a data warehouse for this?

Not at indie or mid-market scale. A 10-category × 100-SKU pull is about 1K rows per run: roughly 52K rows a year on a weekly cadence, or about 365K on a daily one. DuckDB on a single laptop handles either volume comfortably. SaaS warehouses (BigQuery, Snowflake) start mattering past ~10M rows or if multiple analysts query simultaneously.
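A back-of-envelope check of row volume, assuming a fixed sweep of 10 categories at 100 SKUs each and one stored row per SKU per run; `yearly_rows` is a hypothetical helper:

```python
def yearly_rows(categories, skus_per_category, runs_per_year):
    """Rows accumulated in a year for a fixed sweep size and cadence."""
    return categories * skus_per_category * runs_per_year


per_run = 10 * 100                 # rows in a single sweep
weekly = yearly_rows(10, 100, 52)  # weekly cadence
daily = yearly_rows(10, 100, 365)  # daily cadence
print(per_run, weekly, daily)      # 1000 52000 365000
```

Even the daily cadence stays well under the ~10M-row mark where a hosted warehouse starts to pay off.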

How do I distinguish a real trend from a brand marketing push?

Real trends show up across multiple brands simultaneously over 4-8 weeks. A marketing push shows up as one brand spiking, no peers. Group your growth-scored SKU set by ingredient tag and look for ingredients where 5+ distinct brands are gaining concurrently. That's a category trend; a single-brand spike is a campaign.
