Build an India Beauty Trend Pipeline With Nykaa Scraper Data
Operational blueprint for an India beauty trend pipeline — Nykaa SKU data, ingredient tagging, growth scoring, and influencer collab shortlists for marketers.

Thirdwatch's Nykaa Scraper is the structured data backbone for an India beauty trend pipeline — SKU-level brand, price, rating, rating_count, and category data refreshable on any cadence. This guide is the operational blueprint: pipeline shape, ingredient tagging, growth scoring, and how to convert the signal into an influencer collab shortlist.
Why build a Nykaa-powered beauty trend pipeline
India's beauty and personal-care market is the fastest growing globally. Per RedSeer and IMARC, the BPC market crossed $20B in 2024 and is projected to reach $30B+ by 2027, with online taking share through 2028. Nykaa addresses the premium and aspirational mid-market — roughly half that TAM at the GMV level. The Nykaa FY24 annual report cites ~6,800 active brands across roughly 4.8K SKUs in just makeup. For trend forecasting, ingredient adoption tracking, and influencer-campaign targeting, no other Indian beauty surface offers comparable depth or freshness.
The job-to-be-done is operational, not exploratory. A category strategist at a CPG major needs a weekly trend digest, not a one-time research artifact. A growth marketer at an indie brand needs a daily shortlist of rising SKUs to reverse-engineer for messaging. An agency planning influencer collab campaigns needs a refreshable shortlist of products in active discovery mode. The pipeline shape matters more than any one snapshot — repeatability, ingredient tagging, growth scoring, and brand-set diff are the four components that turn raw rows into a marketing asset.
How does this compare to alternatives?
Three approaches for building the pipeline backbone:
| Approach | Reliability | Setup time | Maintenance |
|---|---|---|---|
| In-house scraper (DIY Python + browser automation) | Medium; breaks on Nykaa redesigns | 2-4 weeks engineering | Continuous — anti-bot moves regularly |
| Trend SaaS feed (Trendalytics, Edited, Spate) | High; pricey enterprise contracts | 4-8 weeks | Vendor-managed |
| Thirdwatch Nykaa Scraper | Production-tested with production-grade anti-bot tooling | 30 minutes | Thirdwatch tracks Nykaa changes |
A DIY scraper is feasible, but the maintenance burden compounds: Nykaa redesigns 2-4 times a year, and each redesign requires debugging. Trend SaaS is reliable but priced for enterprise. The Nykaa Scraper actor is the middle path: production-tested, with transparent per-result pricing.
How to build a Nykaa trend pipeline in 5 steps
Step 1: How do I authenticate against Apify?
export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"
Step 2: How do I structure the daily ingestion job?
Run a fixed category sweep daily. Persist raw JSON for replay-ability.
import os, requests, json, datetime, pathlib

ACTOR = "thirdwatch~nykaa-scraper"
TOKEN = os.environ["APIFY_TOKEN"]
CATEGORIES_LEAF = ["lipstick", "foundation", "eye-makeup", "nail",
                   "face-wash", "moisturizer", "serum",
                   "shampoo", "conditioner"]
CATEGORIES_TOP = ["makeup", "skin", "hair", "fragrance", "men"]

today = datetime.date.today().isoformat()
out_dir = pathlib.Path(f"snapshots/{today}")
out_dir.mkdir(parents=True, exist_ok=True)

def pull(cat, sort, n):
    """Run the actor synchronously and return its dataset items."""
    resp = requests.post(
        f"https://api.apify.com/v2/acts/{ACTOR}/run-sync-get-dataset-items",
        params={"token": TOKEN},
        json={"queries": [], "category": cat, "sortBy": sort, "maxResults": n},
        timeout=900,
    )
    resp.raise_for_status()  # fail loudly rather than persist an error body
    return resp.json()

# Leaf categories: keep both the popularity and the newest views.
for cat in CATEGORIES_LEAF:
    pop = pull(cat, "popularity", 100)
    new = pull(cat, "newest", 100)
    (out_dir / f"{cat}-pop.json").write_text(json.dumps(pop))
    (out_dir / f"{cat}-new.json").write_text(json.dumps(new))

# Top-level categories: newest view only, with a deeper pull.
for cat in CATEGORIES_TOP:
    new = pull(cat, "newest", 200)
    (out_dir / f"{cat}-top-new.json").write_text(json.dumps(new))

print(f"{today}: ingestion complete")

This produces a stable directory layout, snapshots/YYYY-MM-DD/{category}-{view}.json (where the view suffix is pop, new, or top-new), that any downstream consumer can reason about.
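Because the layout is fixed, a consumer can recover the date, category, and sort view from the path alone. The following is a minimal sketch, assuming the -pop, -new, and -top-new file suffixes written by the ingestion loop; the helper name parse_snapshot_path is hypothetical and not part of the actor or the original scripts.

```python
import pathlib

# Suffix -> sort view. Longest suffix first so "-top-new" wins over "-new".
SUFFIXES = [("-top-new", "newest"), ("-new", "newest"), ("-pop", "popularity")]

def parse_snapshot_path(path):
    """Split snapshots/YYYY-MM-DD/<category><suffix>.json into its parts."""
    p = pathlib.PurePosixPath(str(path).replace("\\", "/"))
    stem = p.stem
    for suffix, view in SUFFIXES:
        if stem.endswith(suffix):
            return {
                "snapshot_date": p.parent.name,
                "category": stem[: -len(suffix)],
                "sort_view": view,
                "top_level": suffix == "-top-new",
            }
    raise ValueError(f"unrecognized snapshot file name: {p.name}")

print(parse_snapshot_path("snapshots/2026-05-12/eye-makeup-pop.json"))
```

Matching on the suffix rather than splitting on hyphens keeps hyphenated categories such as eye-makeup and face-wash intact.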
Step 3: How do I normalize and tag ingredients at ingest?
Brand normalization plus ingredient regex tagging is the highest-leverage step. Do it once, on ingest, not at query time.
import glob, json, pathlib, re
import pandas as pd

INGREDIENTS = {
    "niacinamide": r"niacinamid",
    "retinol": r"retinol|retinal|retinoid",
    "hyaluronic_acid": r"hyaluronic|hyaluron",
    "salicylic_acid": r"salicylic|\bbha\b",  # word boundary: avoid matching "bha" inside other words
    "glycolic_acid": r"glycolic",
    "mandelic_acid": r"mandelic",
    "peptides": r"peptide",
    "ceramides": r"ceramide",
    "vitamin_c": r"vitamin c|ascorbic|kakadu",
    "alpha_arbutin": r"alpha arbutin|arbutin",
    "kojic_acid": r"kojic",
    "tranexamic": r"tranexamic",
    "spf": r"\bspf\d*\b|sunscreen",  # \d* so "SPF50" matches as well as "SPF 50"
    "vegan": r"\bvegan\b",
    "clean": r"\bclean\b|paraben.?free|sulphate.?free|sulfate.?free",
}
BRAND_NORM = {
    "mac": "MAC", "m.a.c.": "MAC", "mac cosmetics": "MAC",
    "sugar": "Sugar Cosmetics", "sugar cosmetics": "Sugar Cosmetics",
    "the ordinary": "The Ordinary", "ordinary": "The Ordinary",
    "loreal": "L'Oreal", "l'oreal": "L'Oreal", "loreal paris": "L'Oreal",
}

def tag_ingredients(text):
    """Return every ingredient tag whose pattern appears in the text."""
    text = (text or "").lower()
    return [k for k, pat in INGREDIENTS.items() if re.search(pat, text)]

def norm_brand(b):
    """Map known brand variants to one spelling; pass unknown brands through."""
    if not b:
        return None
    key = b.strip().lower()
    return BRAND_NORM.get(key, b.strip())

rows = []
for f in glob.glob("snapshots/*/*.json"):
    p = pathlib.Path(f)
    snapshot_date = p.parent.name  # snapshots/<date>/<file>.json
    # Both "-new" and "-top-new" files hold the newest view.
    sort = "newest" if p.stem.endswith("-new") else "popularity"
    for j in json.loads(p.read_text()):
        j["snapshot_date"] = snapshot_date
        j["sort_view"] = sort
        j["brand_norm"] = norm_brand(j.get("brand"))
        j["ingredient_tags"] = tag_ingredients(j.get("product_name", ""))
        rows.append(j)

df = pd.DataFrame(rows)
df["snapshot_date"] = pd.to_datetime(df["snapshot_date"])
df.to_parquet("nykaa_history.parquet")

DuckDB queries the Parquet file directly, with sub-second query times up to multi-million-row scale and zero infra to operate. SQLite is an option too, but only after loading the rows into a table, since it does not read Parquet natively.
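Exact-match lookups in BRAND_NORM miss variants that differ only in accents, punctuation, or spacing. One possible extension, which is not part of the original pipeline, is to fold each brand string to a canonical key before the lookup. The helper names brand_key and norm_brand_folded and the sample mapping below are illustrative assumptions.

```python
import re
import unicodedata

def brand_key(raw):
    """Fold a brand string to lowercase ASCII letters and digits only."""
    # NFKD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", raw or "")
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", "", no_marks.lower())

# Illustrative mapping keyed by folded form (a subset, not the full table).
BRAND_NORM_FOLDED = {
    "mac": "MAC",
    "maccosmetics": "MAC",
    "loreal": "L'Oreal",
    "lorealparis": "L'Oreal",
}

def norm_brand_folded(raw):
    """Look up the folded key; unknown brands pass through trimmed."""
    if not raw:
        return None
    return BRAND_NORM_FOLDED.get(brand_key(raw), raw.strip())

for sample in ["M.A.C.", "L'Oréal Paris", "L\u2019OREAL", "Minimalist"]:
    print(sample, "->", norm_brand_folded(sample))
```

The trade-off is that folding removes spaces as well, so mapping keys have to be written in folded form ("theordinary" rather than "the ordinary").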
Step 4: How do I compute SKU growth scores?
A rolling 4-week rating_count delta is the cleanest growth proxy. Compute it per SKU and carry the ingredient tags along so Step 5 can group by them.
import duckdb

con = duckdb.connect()
con.execute("CREATE VIEW nykaa AS SELECT * FROM 'nykaa_history.parquet'")

growth = con.execute("""
    WITH weekly AS (
        SELECT
            sku,
            product_name,
            brand_norm,
            date_trunc('week', snapshot_date) AS week,
            MAX(rating_count) AS rc,
            AVG(rating) AS rating,
            AVG(price) AS price,
            ANY_VALUE(ingredient_tags) AS tags
        FROM nykaa
        WHERE sku IS NOT NULL AND rating_count IS NOT NULL
        GROUP BY 1, 2, 3, 4
    ),
    deltas AS (
        SELECT
            sku, product_name, brand_norm, week, rc, rating, price, tags,
            rc - LAG(rc, 4) OVER (PARTITION BY sku ORDER BY week) AS rc_4w_delta,
            LAG(rc, 4) OVER (PARTITION BY sku ORDER BY week) AS rc_4w_ago
        FROM weekly
    )
    SELECT
        product_name, brand_norm, rc, rc_4w_delta,
        ROUND(100.0 * rc_4w_delta / NULLIF(rc_4w_ago, 0), 1) AS growth_pct,
        rating, price, tags
    FROM deltas
    WHERE week = (SELECT MAX(week) FROM deltas)
      AND rc_4w_delta IS NOT NULL
    ORDER BY rc_4w_delta DESC
    LIMIT 100
""").df()

print(growth.head(30))

The top 100 rows are your rising-SKU shortlist for the week.
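To make the window arithmetic concrete, here is the same calculation in plain Python for a single SKU. The weekly counts are made-up illustration values, not Nykaa data, and the function name four_week_growth is hypothetical.

```python
def four_week_growth(weekly_counts, lag=4):
    """Mirror LAG(rc, 4): compare the latest week with the row `lag` positions earlier.

    weekly_counts holds one rating_count value per weekly row, oldest first.
    Returns (delta, growth_pct) for the latest week, or None when there is
    not enough history yet.
    """
    if len(weekly_counts) <= lag:
        return None
    latest = weekly_counts[-1]
    baseline = weekly_counts[-1 - lag]
    delta = latest - baseline
    # Same role as NULLIF(rc_4w_ago, 0): growth is undefined for a zero baseline.
    pct = round(100.0 * delta / baseline, 1) if baseline else None
    return delta, pct

counts = [1200, 1260, 1350, 1500, 1800]  # illustrative values only
print(four_week_growth(counts))      # (600, 50.0)
print(four_week_growth(counts[:4]))  # None: fewer than five weekly rows
```

Like LAG, this counts rows rather than calendar weeks, so a SKU that drops out of the sweep for a week is compared against an older baseline than intended.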
Step 5: How do I convert the shortlist into an influencer collab brief?
Group rising SKUs by ingredient and brand-newness. A real trend has 5+ peer brands; a campaign push has one.
import collections, json, pathlib

def as_list(tags):
    # List columns can come back as Python lists or NumPy arrays; truth-testing
    # an array with `or` raises, so convert explicitly and treat nulls as empty.
    return list(tags) if hasattr(tags, "__iter__") else []

# Trend signal: ingredient gaining traction across multiple rising SKUs
tag_counter = collections.Counter()
for _, row in growth.iterrows():
    for t in as_list(row["tags"]):
        tag_counter[t] += 1

print("Ingredient trend signal (count of rising SKUs):")
for tag, n in tag_counter.most_common(10):
    print(f"  {tag}: {n} rising SKUs")

# Brand new-entry signal: brands appearing for the first time
known_brands_path = pathlib.Path("known_brands.json")
known = set(json.loads(known_brands_path.read_text())) if known_brands_path.exists() else set()
current_brands = set(df.brand_norm.dropna().unique())
new_brands = current_brands - known

# Collab shortlist: rising SKUs from new brands tagged with trending ingredients
trending_tags = {t for t, n in tag_counter.most_common(5)}
shortlist = growth[
    growth.brand_norm.isin(new_brands)
    & growth.tags.apply(lambda ts: bool(set(as_list(ts)) & trending_tags))
].head(20)
print(shortlist[["product_name", "brand_norm", "tags", "rc_4w_delta"]])

# Persist the brand set so the next run only flags genuinely new entries.
known_brands_path.write_text(json.dumps(sorted(known | current_brands)))

A rising SKU from a brand new to Nykaa, tagged with a trending ingredient, is the canonical "indie brand worth onboarding to a creator campaign" signal.
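The tag counter in this step counts rising SKUs, so one brand launching five niacinamide products looks the same as five brands launching one each. Below is a minimal sketch of a distinct-brand check, using plain dictionaries in place of the growth DataFrame; the sample rows, the label strings, and the function name classify_tags are illustrative assumptions.

```python
from collections import defaultdict

MIN_PEER_BRANDS = 5  # threshold from the "5+ peer brands" rule of thumb

def classify_tags(rising_rows, min_brands=MIN_PEER_BRANDS):
    """Label each ingredient tag by how many distinct brands carry it."""
    brands_by_tag = defaultdict(set)
    for row in rising_rows:
        for tag in row["tags"]:
            brands_by_tag[tag].add(row["brand_norm"])
    return {
        tag: ("trend" if len(brands) >= min_brands else "campaign_or_unclear")
        for tag, brands in brands_by_tag.items()
    }

# Made-up rows for illustration, not real Nykaa results.
rows = [{"brand_norm": f"Brand {i}", "tags": ["peptides"]} for i in range(5)]
rows += [{"brand_norm": "Brand 0", "tags": ["mandelic_acid"]} for _ in range(5)]
print(classify_tags(rows))
```

With the real DataFrame, something like growth.to_dict("records") should yield rows in this shape, though that depends on how the tags column round-trips through Parquet.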
Sample output
The pipeline emits a weekly digest that looks like this. The underlying actor records remain the structured rows from the Nykaa Scraper with the schema documented in the market-research guide.
{
  "week": "2026-05-12",
  "trending_ingredients": [
    {"tag": "niacinamide", "rising_skus": 47},
    {"tag": "peptides", "rising_skus": 38},
    {"tag": "mandelic_acid", "rising_skus": 21}
  ],
  "new_brands_on_nykaa": ["Earth Rhythm", "Conscious Chemist", "Foxtale"],
  "collab_shortlist": [
    {
      "product_name": "Foxtale Mighty Peptide Cream",
      "brand": "Foxtale",
      "rc_4w_delta": 12450,
      "tags": ["peptides", "clean"],
      "url": "https://www.nykaa.com/foxtale-mighty-peptide-cream/p/..."
    }
  ]
}

This digest plugs directly into a creator-campaign brief: a shortlist of products in active discovery mode, the brands behind them, and the ingredient story that connects them.
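A minimal sketch of how the weekly script might assemble this digest from the Step 5 outputs follows. The function name build_digest and its argument shapes are assumptions for illustration; the keys match the sample above, and the inputs are made-up values.

```python
import json
from collections import Counter

def build_digest(week, tag_counter, new_brands, shortlist_rows, top_n=3):
    """Assemble the weekly digest dict from already-computed pieces."""
    return {
        "week": week,
        "trending_ingredients": [
            {"tag": tag, "rising_skus": n}
            for tag, n in tag_counter.most_common(top_n)
        ],
        "new_brands_on_nykaa": sorted(new_brands),
        "collab_shortlist": [
            {
                "product_name": r["product_name"],
                "brand": r["brand_norm"],
                # int()/list() so NumPy scalars and arrays serialize cleanly
                "rc_4w_delta": int(r["rc_4w_delta"]),
                "tags": list(r["tags"]),
            }
            for r in shortlist_rows
        ],
    }

# Illustrative inputs only.
digest = build_digest(
    "2026-05-12",
    Counter({"niacinamide": 47, "peptides": 38, "mandelic_acid": 21, "spf": 9}),
    {"Foxtale", "Earth Rhythm"},
    [{"product_name": "Example Peptide Cream", "brand_norm": "Foxtale",
      "rc_4w_delta": 12450, "tags": ["peptides", "clean"]}],
)
print(json.dumps(digest, indent=2))
```

The url key is left out here because the Step 4 growth query does not select it; adding it would mean carrying the field through that query first.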
Common pitfalls
Three things go wrong when building trend pipelines. Sale-window pollution: rating_count jumps disproportionately during Pink Friday and similar events because a burst of orders produces a burst of reviews; exclude sale weeks from the growth-delta computation or you'll over-rank legacy SKUs. Ingredient false positives: regex tags pick up "vitamin C-free" as "vitamin_c"; use negative-lookahead patterns or weight by full-text proximity. Brand-history bootstrapping: your "new brand" detector only works after 4+ weeks of history; treat the first month as warmup and don't publish new-brand alerts until then.
A fourth subtle issue: Nykaa Luxe SKUs (prestige brands like Charlotte Tilbury, La Mer, Dior) trade at MRP almost always; segment prestige from mass before computing discount-depth trends, otherwise the mass-market promotional cycle dominates the signal.
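For the ingredient false-positive case, a negative lookahead can reject a match that is immediately followed by "free". The sketch below covers two tags and assumes "-free" and " free" are the only negations worth handling; real product copy will need more cases. The names NOT_FREE, SAFE_PATTERNS, and tag_ingredients_safe are hypothetical.

```python
import re

# Reject a match when optional whitespace or hyphens and then "free" follow it.
NOT_FREE = r"(?![\s-]*free\b)"

SAFE_PATTERNS = {
    "vitamin_c": re.compile(r"(?:vitamin c\b|ascorbic)" + NOT_FREE),
    "retinol": re.compile(r"(?:retinol|retinal|retinoid)" + NOT_FREE),
}

def tag_ingredients_safe(text):
    """Tag ingredients, skipping mentions phrased as '<ingredient>-free'."""
    text = (text or "").lower()
    return [tag for tag, pat in SAFE_PATTERNS.items() if pat.search(text)]

print(tag_ingredients_safe("10% Vitamin C Face Serum"))                 # ['vitamin_c']
print(tag_ingredients_safe("Vitamin C-free Gentle Moisturizer"))        # []
print(tag_ingredients_safe("Retinol Free Night Cream with Vitamin C"))  # ['vitamin_c']
```

The added \b after "vitamin c" also stops the pattern from firing on phrases such as "vitamin complex", which a bare substring match would tag.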
Thirdwatch's actor handles Nykaa's production-grade anti-bot tooling by intercepting the page's embedded JSON payload, with a DOM fallback when the JSON shape shifts. A daily 14-category sweep at 100-200 results each finishes in under ten minutes — cheap enough to run via cron without a dedicated worker. Pair this pipeline with our Myntra Scraper (Tira beauty crossover), AJIO Scraper (AJIO Luxe), and Amazon Scraper for full multi-channel India beauty trend coverage.
Frequently asked questions
What's the simplest pipeline architecture?
Daily cron triggers the actor with 10-15 category sweeps, results land in JSON, a Python script normalizes brands and tags ingredients via regex, results write to a Postgres or DuckDB table. A nightly view computes growth scores by SKU. A weekly script produces the trend digest. Total moving parts: cron, actor, Python, DB, view. No Kafka, no Airflow needed for under 100K rows.
Which ingredients should I tag for trend tracking?
The 2026 actives short list: niacinamide, retinol, hyaluronic acid, salicylic acid, glycolic acid, mandelic acid, peptides, ceramides, vitamin C, kojic acid, alpha arbutin, tranexamic acid. Add brand-specific marketing terms separately (clean, vegan, paraben-free, fragrance-free). Regex tag each SKU against the list at ingest time.
How do I generate an influencer collab shortlist from this data?
Rank SKUs by rolling rating_count growth over 4-8 weeks. The top 50 are products in active discovery mode — the right brief for influencer campaigns. Cross-reference with brand-set diff to find SKUs from brands that are also new to Nykaa, which signal a brand willing to spend on awareness. Match to creator categories (lipstick reviewers, K-beauty enthusiasts, clean-beauty advocates).
Do I need a data warehouse for this?
Not at indie or mid-market scale. A daily 10-category × 100-SKU pull is ~1K rows, which works out to ~7K rows a week and ~365K rows a year. DuckDB on a single laptop handles this in milliseconds. SaaS warehouses (BigQuery, Snowflake) start mattering past ~10M rows or if multiple analysts query simultaneously.
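As a back-of-envelope check, the row volume of the Step 2 sweep can be computed directly. This assumes every call returns its full maxResults and the job runs every day, so treat it as an upper bound.

```python
LEAF_CATEGORIES = 9      # lipstick through conditioner in Step 2
TOP_CATEGORIES = 5       # makeup, skin, hair, fragrance, men
LEAF_VIEWS = 2           # popularity + newest
LEAF_MAX_RESULTS = 100
TOP_MAX_RESULTS = 200

rows_per_day = (LEAF_CATEGORIES * LEAF_VIEWS * LEAF_MAX_RESULTS
                + TOP_CATEGORIES * TOP_MAX_RESULTS)
rows_per_year = rows_per_day * 365

print(f"{rows_per_day:,} rows/day, {rows_per_year:,} rows/year")
# 2,800 rows/day, 1,022,000 rows/year
```

Even at this upper bound, a year of history is about a tenth of the ~10M-row point where a hosted warehouse starts to matter.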
How do I distinguish a real trend from a brand marketing push?
Real trends show up across multiple brands simultaneously over 4-8 weeks. A marketing push shows up as one brand spiking, no peers. Group your growth-scored SKU set by ingredient tag and look for ingredients where 5+ distinct brands are gaining concurrently. That's a category trend; a single-brand spike is a campaign.