
Build an India Beauty Trend Pipeline With Nykaa Scraper Data

Operational blueprint for an India beauty trend pipeline — Nykaa SKU data, ingredient tagging, growth scoring, and influencer collab shortlists for marketers.

May 12, 2026 · 6 min read · 1,414 words

Thirdwatch's Nykaa Scraper is the structured data backbone for an India beauty trend pipeline — SKU-level brand, price, rating, rating_count, and category data refreshable on any cadence. This guide is the operational blueprint: pipeline shape, ingredient tagging, growth scoring, and how to convert the signal into an influencer collab shortlist.

Why build a Nykaa-powered beauty trend pipeline

India's beauty and personal-care market is the fastest growing globally. Per RedSeer and IMARC, the BPC market crossed $20B in 2024 and is projected to reach $30B+ by 2027, with online taking share through 2028. Nykaa addresses the premium and aspirational mid-market — roughly half that TAM at the GMV level. The Nykaa FY24 annual report cites ~6,800 active brands across roughly 4.8K SKUs in just makeup. For trend forecasting, ingredient adoption tracking, and influencer-campaign targeting, no other Indian beauty surface offers comparable depth or freshness.

The job-to-be-done is operational, not exploratory. A category strategist at a CPG major needs a weekly trend digest, not a one-time research artifact. A growth marketer at an indie brand needs a daily shortlist of rising SKUs to reverse-engineer for messaging. An agency planning influencer collab campaigns needs a refreshable shortlist of products in active discovery mode. The pipeline shape matters more than any one snapshot — repeatability, ingredient tagging, growth scoring, and brand-set diff are the four components that turn raw rows into a marketing asset.

How does this compare to alternatives?

Three approaches for building the pipeline backbone:

Approach | Reliability | Setup time | Maintenance
In-house scraper (DIY Python + browser automation) | Medium; breaks on Nykaa redesigns | 2-4 weeks engineering | Continuous; anti-bot moves regularly
Trend SaaS feed (Trendalytics, Edited, Spate) | High; pricey enterprise contracts | 4-8 weeks | Vendor-managed
Thirdwatch Nykaa Scraper | Production-tested with production-grade anti-bot tooling | 30 minutes | Thirdwatch tracks Nykaa changes

A DIY scraper is feasible but the maintenance burden compounds: Nykaa redesigns 2-4 times a year, each requiring debugging. Trend SaaS is reliable but priced for enterprise. The Nykaa Scraper actor page is the middle path — production-tested with transparent per-result pricing.

How to build a Nykaa trend pipeline in 5 steps

Step 1: How do I authenticate against Apify?

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"
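A fail-fast check at the top of the job can make a missing token obvious before any actor run is attempted. The sketch below is one way to do that; the helper name load_apify_token is illustrative and not part of any Apify library.

```python
import os

def load_apify_token(env=os.environ):
    """Return the Apify token from the environment, or raise with a clear message."""
    token = (env.get("APIFY_TOKEN") or "").strip()
    if not token:
        raise RuntimeError("APIFY_TOKEN is not set; export it before running the pipeline")
    return token
```

Calling load_apify_token() once at startup keeps the error close to its cause instead of surfacing later as a failed HTTP request.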

Step 2: How do I structure the daily ingestion job?

Run a fixed category sweep daily. Persist raw JSON for replay-ability.

import os, requests, json, datetime, pathlib

ACTOR = "thirdwatch~nykaa-scraper"
TOKEN = os.environ["APIFY_TOKEN"]

CATEGORIES_LEAF = ["lipstick", "foundation", "eye-makeup", "nail",
                   "face-wash", "moisturizer", "serum",
                   "shampoo", "conditioner"]
CATEGORIES_TOP = ["makeup", "skin", "hair", "fragrance", "men"]

today = datetime.date.today().isoformat()
out_dir = pathlib.Path(f"snapshots/{today}")
out_dir.mkdir(parents=True, exist_ok=True)

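def pull(cat, sort, n):
    # Run the actor synchronously and return its dataset items as a list of dicts.
    resp = requests.post(
        f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
        params={"token": TOKEN},
        json={"queries": [], "category": cat, "sortBy": sort, "maxResults": n},
        timeout=900,
    )
    resp.raise_for_status()  # fail loudly rather than persisting an error payload
    return resp.json()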

for cat in CATEGORIES_LEAF:
    pop = pull(cat, "popularity", 100)
    new = pull(cat, "newest", 100)
    (out_dir / f"{cat}-pop.json").write_text(json.dumps(pop))
    (out_dir / f"{cat}-new.json").write_text(json.dumps(new))

for cat in CATEGORIES_TOP:
    new = pull(cat, "newest", 200)
    (out_dir / f"{cat}-top-new.json").write_text(json.dumps(new))

print(f"{today}: ingestion complete")

This produces a stable directory layout — snapshots/YYYY-MM-DD/{category}-{sort}.json — that any downstream consumer can reason about.
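Because the layout is fixed, a downstream consumer can recover the date, category, and sort view from the path alone. The following is a minimal sketch of such a parser, assuming forward-slash paths and the three file-name suffixes written by the ingestion job; the function name parse_snapshot_path is hypothetical.

```python
from pathlib import PurePosixPath

# Suffixes written by the ingestion job, longest first so that
# "-top-new" is matched before "-new".
SUFFIXES = [("-top-new", "newest"), ("-pop", "popularity"), ("-new", "newest")]

def parse_snapshot_path(path):
    """Split 'snapshots/2026-05-12/face-wash-pop.json' into its parts."""
    p = PurePosixPath(path)
    stem = p.stem  # e.g. "face-wash-pop"
    for suffix, sort_view in SUFFIXES:
        if stem.endswith(suffix):
            return {
                "snapshot_date": p.parent.name,
                "category": stem[: -len(suffix)],
                "sort_view": sort_view,
            }
    raise ValueError(f"unrecognised snapshot file name: {path}")
```

Matching on the suffix rather than a substring keeps the sort view correct even if a category name happens to contain "new" or "pop".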

Step 3: How do I normalize and tag ingredients at ingest?

Brand normalization plus ingredient regex tagging is the highest-leverage step. Do it once, on ingest, not at query time.

import glob, json, pathlib, re
import pandas as pd

INGREDIENTS = {
    "niacinamide": r"niacinamid",
    "retinol": r"retinol|retinal|retinoid",
    "hyaluronic_acid": r"hyaluron",
    "salicylic_acid": r"salicylic|\bbha\b",  # \b stops "bha" matching inside longer words
    "glycolic_acid": r"glycolic",
    "mandelic_acid": r"mandelic",
    "peptides": r"peptide",
    "ceramides": r"ceramide",
    "vitamin_c": r"vitamin c|ascorbic|kakadu",
    "alpha_arbutin": r"arbutin",
    "kojic_acid": r"kojic",
    "tranexamic": r"tranexamic",
    "spf": r"\bspf|sunscreen",  # no trailing \b, so "SPF50" also matches
    "vegan": r"\bvegan\b",
    "clean": r"\bclean\b|paraben.?free|sulphate.?free|sulfate.?free",
}

BRAND_NORM = {
    "mac": "MAC", "m.a.c.": "MAC", "mac cosmetics": "MAC",
    "sugar": "Sugar Cosmetics", "sugar cosmetics": "Sugar Cosmetics",
    "the ordinary": "The Ordinary", "ordinary": "The Ordinary",
    "loreal": "L'Oreal", "l'oreal": "L'Oreal", "loreal paris": "L'Oreal",
}

def tag_ingredients(text):
    text = (text or "").lower()
    return [k for k, pat in INGREDIENTS.items() if re.search(pat, text)]

def norm_brand(b):
    if not b: return None
    key = b.strip().lower()
    return BRAND_NORM.get(key, b.strip())

rows = []
for f in glob.glob("snapshots/*/*.json"):
    path = pathlib.Path(f)
    snapshot_date = path.parent.name  # the YYYY-MM-DD directory
    # File stems end in "-pop" or "-new"; test the suffix, not a substring.
    sort = "newest" if path.stem.endswith("-new") else "popularity"
    for j in json.loads(path.read_text()):
        j["snapshot_date"] = snapshot_date
        j["sort_view"] = sort
        j["brand_norm"] = norm_brand(j.get("brand"))
        j["ingredient_tags"] = tag_ingredients(j.get("product_name", ""))
        rows.append(j)

df = pd.DataFrame(rows)
df["snapshot_date"] = pd.to_datetime(df["snapshot_date"])
df.to_parquet("nykaa_history.parquet")

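DuckDB queries Parquet files directly and typically stays sub-second into the multi-million-row range, with zero infra to operate. SQLite does not read Parquet natively, so if you prefer it, load the rows into a table first.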
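The hand-written brand map above only catches the exact spellings it lists, so "M.A.C" (no trailing dot) or "L'Oréal" (accented) would slip through. A minimal sketch of a looser lookup key follows, assuming it is acceptable to ignore case, accents, and punctuation when comparing brands. The names brand_key, CANONICAL, and norm_brand_loose are illustrative.

```python
import re
import unicodedata

def brand_key(raw):
    """Reduce a brand string to a lookup key: lowercase, no accents, no punctuation."""
    if not raw:
        return None
    # Decompose accented characters, then drop the combining marks (é -> e).
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Keep only letters and digits so "M.A.C", "M.A.C." and "MAC" collapse together.
    return re.sub(r"[^a-z0-9]+", "", text.lower()) or None

# Hypothetical canonical map keyed by brand_key() output.
CANONICAL = {
    "mac": "MAC", "maccosmetics": "MAC",
    "loreal": "L'Oreal", "lorealparis": "L'Oreal",
}

def norm_brand_loose(raw):
    key = brand_key(raw)
    if key is None:
        return None
    # Unknown brands pass through trimmed, as in the original norm_brand.
    return CANONICAL.get(key, raw.strip())
```

The trade-off is that aggressive key folding can merge two genuinely different brands with near-identical names, so it is worth reviewing the distinct keys after the first few ingests.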

Step 4: How do I compute SKU growth scores?

Rolling 4-week rating_count delta is the cleanest growth proxy. Group by SKU and ingredient tag.

import duckdb

con = duckdb.connect()
con.execute("CREATE VIEW nykaa AS SELECT * FROM 'nykaa_history.parquet'")

growth = con.execute("""
WITH weekly AS (
    SELECT
        sku,
        product_name,
        brand_norm,
        date_trunc('week', snapshot_date) AS week,
        MAX(rating_count) AS rc,
        AVG(rating) AS rating,
        AVG(price) AS price,
        ANY_VALUE(ingredient_tags) AS tags
    FROM nykaa
    WHERE sku IS NOT NULL AND rating_count IS NOT NULL
    GROUP BY 1, 2, 3, 4
),
deltas AS (
    SELECT
        sku, product_name, brand_norm, week, rc, rating, price, tags,
        rc - LAG(rc, 4) OVER (PARTITION BY sku ORDER BY week) AS rc_4w_delta,
        LAG(rc, 4) OVER (PARTITION BY sku ORDER BY week) AS rc_4w_ago
    FROM weekly
)
SELECT
    product_name, brand_norm, rc, rc_4w_delta,
    ROUND(100.0 * rc_4w_delta / NULLIF(rc_4w_ago, 0), 1) AS growth_pct,
    rating, price, tags
FROM deltas
WHERE week = (SELECT MAX(week) FROM deltas)
  AND rc_4w_delta IS NOT NULL
ORDER BY rc_4w_delta DESC
LIMIT 100
""").df()

print(growth.head(30))

The top 100 rows are your rising-SKU shortlist for the week.
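One caveat with a row-offset window: LAG(rc, 4) looks back four rows, which equals four weeks only when a SKU appears in every weekly snapshot. If a product drops out of the sweep for a week, the comparison silently reaches further back. The sketch below shows a calendar-aware alternative in plain Python, assuming one rating_count per week-start date per SKU; the function name four_week_growth is hypothetical.

```python
from datetime import date, timedelta

def four_week_growth(history, weeks_back=4):
    """history: dict mapping week-start date -> rating_count for one SKU.

    Returns (delta, growth_pct) against the snapshot exactly `weeks_back`
    weeks before the latest one, or None if that week is missing.
    """
    if not history:
        return None
    latest = max(history)
    baseline_week = latest - timedelta(weeks=weeks_back)
    if baseline_week not in history:
        return None  # gap in coverage: do not compare against an older row
    base = history[baseline_week]
    delta = history[latest] - base
    pct = round(100.0 * delta / base, 1) if base else None
    return delta, pct
```

Returning None for gaps errs on the side of dropping a SKU from the ranking rather than overstating its growth. The same idea can be expressed in SQL with a self-join on week minus 28 days.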

Step 5: How do I convert the shortlist into an influencer collab brief?

Group rising SKUs by ingredient and brand-newness. A real trend has 5+ peer brands; a campaign push has one.

import collections

# Trend signal: ingredient gaining traction across multiple brands
tag_counter = collections.Counter()
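for _, row in growth.iterrows():
    # tags may come back as a list or a NumPy array; avoid truth-testing it
    tags = row.tags if row.tags is not None else []
    for t in tags:
        tag_counter[t] += 1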

print("Ingredient trend signal (count of rising SKUs):")
for tag, n in tag_counter.most_common(10):
    print(f"  {tag}: {n} rising SKUs")

# Brand new-entry signal: brands appearing for the first time
known_brands_path = pathlib.Path("known_brands.json")
known = set(json.loads(known_brands_path.read_text())) if known_brands_path.exists() else set()
current_brands = set(df.brand_norm.dropna().unique())
new_brands = current_brands - known

# Collab shortlist: rising SKUs from new brands tagged with trending ingredients
trending_tags = {t for t, n in tag_counter.most_common(5)}
shortlist = growth[
    growth.brand_norm.isin(new_brands)
    # explicit None check: truth-testing a NumPy array of tags would raise
    & growth.tags.apply(lambda ts: ts is not None and bool(set(ts) & trending_tags))
].head(20)
print(shortlist[["product_name", "brand_norm", "tags", "rc_4w_delta"]])

known_brands_path.write_text(json.dumps(sorted(known | current_brands)))

A rising SKU from a brand new to Nykaa, tagged with a trending ingredient, is the canonical "indie brand worth onboarding to a creator campaign" signal.

Sample output

The pipeline emits a weekly digest that looks like this. The underlying actor records remain the structured rows from the Nykaa Scraper with the schema documented in the market-research guide.

{
  "week": "2026-05-12",
  "trending_ingredients": [
    {"tag": "niacinamide", "rising_skus": 47},
    {"tag": "peptides", "rising_skus": 38},
    {"tag": "mandelic_acid", "rising_skus": 21}
  ],
  "new_brands_on_nykaa": ["Earth Rhythm", "Conscious Chemist", "Foxtale"],
  "collab_shortlist": [
    {
      "product_name": "Foxtale Mighty Peptide Cream",
      "brand": "Foxtale",
      "rc_4w_delta": 12450,
      "tags": ["peptides", "clean"],
      "url": "https://www.nykaa.com/foxtale-mighty-peptide-cream/p/..."
    }
  ]
}

This digest plugs directly into a creator-campaign brief: a shortlist of products in active discovery mode, the brands behind them, and the ingredient story that connects them.
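The digest itself can be assembled from the three intermediate results of Step 5. The following is a rough sketch, assuming tag counts arrive as a plain dict, new brands as a set, and shortlist rows as dicts; build_digest and its parameters are illustrative names, not part of the actor.

```python
import json

def build_digest(week, tag_counts, new_brands, shortlist, top_n=3):
    """Assemble the weekly digest dict in the shape of the sample output."""
    # Highest counts first; ties broken alphabetically for stable output.
    trending = sorted(tag_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    keep = ("product_name", "brand", "rc_4w_delta", "tags", "url")
    return {
        "week": week,
        "trending_ingredients": [{"tag": t, "rising_skus": n} for t, n in trending],
        "new_brands_on_nykaa": sorted(new_brands),
        "collab_shortlist": [
            {k: row[k] for k in keep if k in row} for row in shortlist
        ],
    }

digest = build_digest(
    "2026-05-12",
    {"niacinamide": 47, "peptides": 38, "mandelic_acid": 21, "spf": 5},
    {"Foxtale", "Earth Rhythm"},
    [{"product_name": "Example Cream", "brand": "Foxtale",
      "rc_4w_delta": 10, "tags": ["peptides"], "internal_note": "dropped"}],
)
print(json.dumps(digest, indent=2))
```

Whitelisting the shortlist keys keeps internal columns out of a document that may be forwarded to an agency or creator.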

Common pitfalls

Three things go wrong when building trend pipelines. Sale-window pollution: rating_count jumps disproportionately during Pink Friday and similar events because a surge of orders produces a surge of reviews; exclude sale weeks from the growth-delta computation or you'll over-rank legacy SKUs. Ingredient false positives: regex tags pick up "vitamin C-free" as "vitamin_c"; use negative lookahead patterns or weight by full-text proximity. Brand-history bootstrapping: the "new brand" detector only works after 4+ weeks of history; treat the first month as warmup and don't publish new-brand alerts until then.
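The false-positive case can be handled with a negative lookahead that rejects an ingredient name when "free" follows it. The sketch below covers only the "X-free" and "X free" forms for two tags, as an illustration; the names NEGATION_AWARE and tag_ingredients_strict are hypothetical, and other negations such as "without retinol" would need their own patterns.

```python
import re

# Match the ingredient only when it is not immediately followed by "free"
# ("vitamin C-free", "vitamin c free"). The \b after "c" also stops
# "vitamin complex" from matching.
NEGATION_AWARE = {
    "vitamin_c": re.compile(r"\bvitamin c\b(?![\s-]*free)|\bascorbic", re.I),
    "retinol": re.compile(r"\bretin(?:ol|al|oid)s?\b(?![\s-]*free)", re.I),
}

def tag_ingredients_strict(text):
    """Return the tags whose pattern matches, ignoring 'X-free' mentions."""
    text = text or ""
    return [tag for tag, pat in NEGATION_AWARE.items() if pat.search(text)]

print(tag_ingredients_strict("Vitamin C-Free Gentle Toner"))
print(tag_ingredients_strict("10% Vitamin C Face Serum"))
```

The first call yields no tags and the second yields the vitamin_c tag, which is the behavior the pitfall asks for.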

A fourth subtle issue: Nykaa Luxe SKUs (prestige brands like Charlotte Tilbury, La Mer, Dior) trade at MRP almost always; segment prestige from mass before computing discount-depth trends, otherwise the mass-market promotional cycle dominates the signal.
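Segmenting before aggregating might look like the sketch below. It assumes each row carries a selling price and an MRP; the field name mrp is an assumption about the data rather than a documented actor field, and the prestige set contains only the brands named above.

```python
from statistics import mean

# Prestige brands named in the text; extend with your own Luxe list.
PRESTIGE_BRANDS = {"Charlotte Tilbury", "La Mer", "Dior"}

def discount_depth_by_segment(rows):
    """rows: dicts with brand_norm, price and mrp (mrp is an assumed field name).

    Returns the mean discount percentage per segment, or None for an empty segment.
    """
    buckets = {"prestige": [], "mass": []}
    for r in rows:
        mrp, price = r.get("mrp"), r.get("price")
        if not mrp or price is None:
            continue  # cannot compute a discount without both values
        segment = "prestige" if r.get("brand_norm") in PRESTIGE_BRANDS else "mass"
        buckets[segment].append(100.0 * (mrp - price) / mrp)
    return {seg: round(mean(vals), 1) if vals else None for seg, vals in buckets.items()}
```

Reporting the two segments side by side makes it visible when a swing in the overall average comes from the mass-market promotional cycle alone.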

Thirdwatch's actor handles Nykaa's production-grade anti-bot tooling by intercepting the page's embedded JSON payload, with a DOM fallback when the JSON shape shifts. A daily 14-category sweep at 100-200 results each finishes in under ten minutes — cheap enough to run via cron without a dedicated worker. Pair this pipeline with our Myntra Scraper (Tira beauty crossover), AJIO Scraper (AJIO Luxe), and Amazon Scraper for full multi-channel India beauty trend coverage.

Frequently asked questions

What's the simplest pipeline architecture?

A daily cron job triggers the actor with 10-15 category sweeps and the results land as JSON. A Python script normalizes brands, tags ingredients via regex, and writes to a Postgres or DuckDB table. A nightly view computes growth scores by SKU. A weekly script produces the trend digest. Total moving parts: cron, actor, Python, DB, view. No Kafka, no Airflow needed for under 100K rows.

Which ingredients should I tag for trend tracking?

The 2026 actives short list: niacinamide, retinol, hyaluronic acid, salicylic acid, glycolic acid, mandelic acid, peptides, ceramides, vitamin C, kojic acid, alpha arbutin, tranexamic acid. Add brand-specific marketing terms separately (clean, vegan, paraben-free, fragrance-free). Regex tag each SKU against the list at ingest time.

How do I generate an influencer collab shortlist from this data?

Rank SKUs by rolling rating_count growth over 4-8 weeks. The top 50 are products in active discovery mode — the right brief for influencer campaigns. Cross-reference with brand-set diff to find SKUs from brands that are also new to Nykaa, which signal a brand willing to spend on awareness. Match to creator categories (lipstick reviewers, K-beauty enthusiasts, clean-beauty advocates).

Do I need a data warehouse for this?

Not at indie or mid-market scale. A 10-category × 100-SKU pull is ~1K rows per run: about 52K rows a year on a weekly cadence, or about 365K on a daily one. DuckDB on a single laptop handles either comfortably. SaaS warehouses (BigQuery, Snowflake) start mattering past ~10M rows or if multiple analysts query simultaneously.

How do I distinguish a real trend from a brand marketing push?

Real trends show up across multiple brands simultaneously over 4-8 weeks. A marketing push shows up as one brand spiking, no peers. Group your growth-scored SKU set by ingredient tag and look for ingredients where 5+ distinct brands are gaining concurrently. That's a category trend; a single-brand spike is a campaign.
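The distinct-brand test can be sketched in a few lines. This assumes the rising-SKU set is available as plain dicts with a brand_norm string and a tags list. The function name classify_ingredient_signals is hypothetical, and the threshold of 5 is the rule of thumb from the answer above.

```python
from collections import defaultdict

def classify_ingredient_signals(rising_skus, min_brands=5):
    """rising_skus: iterable of dicts with 'brand_norm' and 'tags' (list of str).

    Returns {tag: ("trend" | "campaign", distinct_brand_count)}.
    """
    brands_by_tag = defaultdict(set)
    for sku in rising_skus:
        brand = sku.get("brand_norm")
        if not brand:
            continue
        for tag in sku.get("tags") or []:
            brands_by_tag[tag].add(brand)  # a set, so repeat SKUs from one brand count once
    return {
        tag: ("trend" if len(brands) >= min_brands else "campaign", len(brands))
        for tag, brands in brands_by_tag.items()
    }
```

Counting distinct brands rather than rising SKUs is what separates the two cases: one brand launching ten peptide products still registers as a single-brand campaign.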
