Skip to main content
Thirdwatchthirdwatch
Compliance & registries

Build a SEC Filing Database for Investment Analysis (Python)

Create a structured SEC filing database with 10-K and 10-Q financials for investment analysis. Python pipeline with Thirdwatch SEC EDGAR Scraper and pandas.

May 26, 2026 · 6 min read · 1,287 words
See the scraper →

Thirdwatch's SEC EDGAR Scraper extracts structured financials from US public company filings -- revenue, net income, EPS, total assets, R&D expense, and multi-year history -- into JSON you can pipe directly into PostgreSQL, BigQuery, or a DataFrame. Pay per result, no terminal subscription. Built for developers building investment tools, founders creating fintech products, and analysts who want a private financial database without a Bloomberg dependency.

Why build your own SEC filing database

Every institutional investor and fintech product relies on SEC filing data. The difference between a $200K/year Refinitiv subscription and a self-built database is control over schema, refresh cadence, and cost. According to McKinsey's 2025 report on alternative data in asset management, alternative data spending by buy-side firms exceeded $7 billion in 2025 -- yet the single largest source of alpha-generating financial data remains free, public SEC filings that most firms access through expensive intermediaries.

The engineering challenge is extraction. According to the SEC's EDGAR filing statistics, EDGAR stores over 21 million filings as HTML and XBRL documents with inconsistent formatting across companies and time periods. Building a reliable parser that handles GAAP taxonomy variations, fiscal year misalignment, and amended filings is a 3-6 month engineering project. The Thirdwatch actor collapses that to an API call -- pass tickers, get structured JSON with revenue, net_income, total_assets, eps, operating_income, gross_profit, rd_expense, stockholders_equity, cash, and historical trend arrays.

For developers and founders, this unlocks three product patterns: a screening database (filter companies by financial metrics), a monitoring feed (alert on new filings or metric changes), and an analytics layer (compute derived ratios and sector benchmarks from raw financial data).

How does this compare to the alternatives?

Four paths to a structured SEC financial database:

Approach Time to first query Schema control Refresh automation Ongoing engineering
Build XBRL parser from scratch 3-6 months Full You build it High -- taxonomy changes annually
SEC EDGAR XBRL companion API 2-4 weeks Partial -- limited to API schema You build polling Medium -- API versioning
Commercial data feed (Refinitiv, Bloomberg) 1-2 weeks Vendor schema Vendor handles Low but vendor lock-in
Thirdwatch SEC EDGAR Scraper 1 day Full -- raw JSON, your schema Apify scheduling Low -- Thirdwatch maintains extraction

The SEC EDGAR Scraper sits between full DIY and full vendor lock-in: you own the data schema and refresh cadence but outsource the XBRL parsing and extraction logic.

How to build a SEC filing database in 6 steps

Step 1: How do I set up the project?

Install dependencies and configure your Apify token.

export APIFY_TOKEN="apify_api_xxxxxxxxxxxxxxxx"
pip install apify-client pandas sqlalchemy psycopg2-binary

Step 2: How do I define the company universe?

Start with a focused universe. S&P 500 tickers are freely available; a sector slice is easier to validate.

from apify_client import ApifyClient
import os

client = ApifyClient(os.environ["APIFY_TOKEN"])

# Start with a sector -- 15 large-cap tech companies
UNIVERSE = ["AAPL", "MSFT", "NVDA", "GOOGL", "META",
            "AMZN", "TSLA", "CRM", "ORCL", "ADBE",
            "INTC", "AMD", "AVGO", "QCOM", "NOW"]

Step 3: How do I extract annual filings with financials?

Pull the latest 10-K for each company with structured financial data enabled.

all_filings = []

# Batch in groups of 5 to keep run times manageable
for i in range(0, len(UNIVERSE), 5):
    batch = UNIVERSE[i:i+5]
    run = client.actor("thirdwatch/sec-edgar-scraper").call(run_input={
        "queries": batch,
        "filingType": "10-K",
        "includeFinancials": True,
        "maxResults": 3,  # Last 3 annual filings per company
    })
    items = client.dataset(run["defaultDatasetId"]).list_items().items
    all_filings.extend(items)
    print(f"Batch {i//5 + 1}: {len(items)} filings extracted")

print(f"Total: {len(all_filings)} annual filings")

Step 4: How do I normalize and load into a database?

Transform the raw JSON into a relational schema with computed financial ratios.

import pandas as pd
from sqlalchemy import create_engine

df = pd.DataFrame(all_filings)

# Compute derived metrics
df["gross_margin"] = df["gross_profit"] / df["revenue"]
df["operating_margin"] = df["operating_income"] / df["revenue"]
df["net_margin"] = df["net_income"] / df["revenue"]
df["rd_intensity"] = df["rd_expense"] / df["revenue"]
df["roe"] = df["net_income"] / df["stockholders_equity"]
df["asset_turnover"] = df["revenue"] / df["total_assets"]

# Dedup on (cik, filing_type, report_date) -- handles amended filings
df = df.sort_values("filing_date", ascending=False)
df = df.drop_duplicates(subset=["cik", "filing_type", "report_date"], keep="first")

engine = create_engine("postgresql://user:pass@localhost/filings")
df.to_sql("sec_filings", engine, if_exists="append", index=False)
print(f"Loaded {len(df)} unique filings into database")

Step 5: How do I add quarterly filings for granular tracking?

Layer in 10-Q filings to get quarterly resolution.

quarterly_filings = []

for i in range(0, len(UNIVERSE), 5):
    batch = UNIVERSE[i:i+5]
    run = client.actor("thirdwatch/sec-edgar-scraper").call(run_input={
        "queries": batch,
        "filingType": "10-Q",
        "includeFinancials": True,
        "dateFrom": "2024-01-01",
        "maxResults": 10,
    })
    items = client.dataset(run["defaultDatasetId"]).list_items().items
    quarterly_filings.extend(items)

q_df = pd.DataFrame(quarterly_filings)
q_df["gross_margin"] = q_df["gross_profit"] / q_df["revenue"]
q_df["operating_margin"] = q_df["operating_income"] / q_df["revenue"]
q_df = q_df.drop_duplicates(subset=["cik", "filing_type", "report_date"], keep="first")
q_df.to_sql("sec_filings", engine, if_exists="append", index=False)
print(f"Added {len(q_df)} quarterly filings")

Step 6: How do I schedule automatic updates?

Use Apify's scheduling to keep the database current. Set dateFrom to the last sync date.

import datetime

last_sync = datetime.date.today() - datetime.timedelta(days=7)

run = client.actor("thirdwatch/sec-edgar-scraper").call(run_input={
    "queries": UNIVERSE,
    "filingType": "all",
    "includeFinancials": True,
    "dateFrom": last_sync.isoformat(),
    "maxResults": 50,
})

new_items = client.dataset(run["defaultDatasetId"]).list_items().items
if new_items:
    new_df = pd.DataFrame(new_items)
    new_df = new_df.drop_duplicates(
        subset=["cik", "filing_type", "report_date"], keep="first")
    new_df.to_sql("sec_filings", engine, if_exists="append", index=False)
    print(f"Synced {len(new_df)} new filings")
else:
    print("No new filings since last sync")

Sample output

Two records from a mixed 10-K/10-Q extraction. Each record weighs approximately 2 KB.

[
  {
    "company_name": "Salesforce, Inc.",
    "cik": "1108524",
    "ticker": "CRM",
    "filing_type": "10-K",
    "filing_date": "2026-03-07",
    "report_date": "2026-01-31",
    "filing_url": "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=1108524",
    "revenue": 37900000000,
    "net_income": 5830000000,
    "total_assets": 99823000000,
    "eps": 6.01,
    "operating_income": 7130000000,
    "gross_profit": 28425000000,
    "rd_expense": 5685000000,
    "stockholders_equity": 59850000000,
    "cash": 12643000000,
    "revenue_history": [
      {"period": "FY2026", "value": 37900000000},
      {"period": "FY2025", "value": 34857000000},
      {"period": "FY2024", "value": 31352000000}
    ],
    "net_income_history": [
      {"period": "FY2026", "value": 5830000000},
      {"period": "FY2025", "value": 4136000000},
      {"period": "FY2024", "value": 208000000}
    ]
  },
  {
    "company_name": "Advanced Micro Devices, Inc.",
    "cik": "2488",
    "ticker": "AMD",
    "filing_type": "10-Q",
    "filing_date": "2026-05-01",
    "report_date": "2026-03-29",
    "filing_url": "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2488",
    "revenue": 7440000000,
    "net_income": 1120000000,
    "total_assets": 69421000000,
    "eps": 0.69,
    "operating_income": 1350000000,
    "gross_profit": 3720000000,
    "rd_expense": 1674000000,
    "stockholders_equity": 56830000000,
    "cash": 4543000000,
    "revenue_history": [
      {"period": "Q1 FY2026", "value": 7440000000},
      {"period": "Q4 FY2025", "value": 7658000000}
    ],
    "net_income_history": [
      {"period": "Q1 FY2026", "value": 1120000000},
      {"period": "Q4 FY2025", "value": 1340000000}
    ]
  }
]

cik is the permanent SEC identifier -- use it as the primary key for company resolution since tickers change. report_date is the fiscal period end, not the filing date. revenue_history gives you multi-year comparatives in a single record without needing to pull older filings separately.

Common pitfalls

Four issues that break SEC filing databases in production. Amended filings create duplicates -- a 10-K/A supersedes the original 10-K for the same period. Always dedup on (cik, filing_type, report_date) and keep only the most recent filing_date. Fiscal year boundaries vary -- comparing Apple's September year-end with Microsoft's June year-end as if they cover the same period produces misleading cross-sectional analysis. Normalize to calendar quarters using report_date. Null financials on older filings -- XBRL tagging became mandatory for large accelerated filers in 2009 and smaller filers later. Pre-XBRL filings return null for structured financial fields. Revenue recognition changes -- ASC 606 adoption (2018-2019) restated revenue figures for many companies, creating apparent jumps or drops in revenue_history that reflect accounting changes, not business performance.

For production databases, add a source_updated_at timestamp column and implement upsert logic rather than blind appends. Pair with our Google News Scraper to correlate filing events with news coverage, or use Indeed Scraper headcount data for fundamental analysis that combines financial and operational signals.

Related use cases

Frequently asked questions

How many companies can I track in a single filing database?

There is no hard limit. The S&P 500 is the most common starting scope -- 500 companies with quarterly 10-Q and annual 10-K filings means roughly 2,500 filing records per year. Russell 3000 coverage (the broad US equity universe) produces around 15,000 records per year. For a focused sector database (e.g., 50 biotech companies), expect 250 records per year. Batch queries in groups of 20-30 companies per actor run to keep individual run times under 10 minutes.

Can I automate this database to stay current with new filings?

Yes. Schedule the actor on Apify to run daily or weekly with dateFrom set to the last run date. New filings get appended to your database automatically. For 8-K material event monitoring, daily runs catch filings within 24 hours of acceptance. For quarterly financial tracking, weekly runs are sufficient since 10-Q filings have a 40-day window after quarter-end.

Related

Try it yourself

100 free credits, no credit card.

About 30 real searches. Add the MCP to Claude or Cursor in two minutes.