Case study

SecOps-Platform

Internal Security Data Foundation

PostgreSQL-backed security operations platform that ingests AWS inventory, CrowdStrike EDR telemetry, AWS VPC topology, and Cloudflare DNS data into snapshot-versioned tables, reconciles AWS instances against Falcon sensors on every cycle, and serves as the data layer that the rest of the security stack — dashboards, agents, automated triage — reads from.

Problem

A small security team running multi-region AWS workloads can’t answer basic operational questions without a lot of manual work: “which EC2 instances don’t have EDR installed?”, “which production EC2 hosts have HIGH-severity Inspector findings AND no live CrowdStrike sensor?”, “which RDS databases missing encryption sit behind Cloudflare-fronted DNS records?” These aren’t difficult questions. They’re slow questions, because the data lives in five different consoles with five different APIs, five different export formats, and no native way to join across them.

But the deeper problem isn’t slowness. It’s that at this cost, most of these questions are hard to justify asking regularly. Cross-referencing CrowdStrike sensor coverage against AWS instance inventory, broken down by role / EKS cluster / snowflake classification. Auditing encryption posture across every S3 bucket in production. Walking attack paths through flattened security-group chains to identify which internet-exposed instances reach the most internal resources. None of these are difficult tasks per se; they’re tasks that require a data pipeline that doesn’t exist out of the box.

Existing tools occupy adjacent slots: AWS Security Hub aggregates AWS-native findings but doesn’t reconcile across vendors. Prowler and ScoutSuite produce point-in-time reports without persisting state. Wiz and Lacework do all of this commercially, but at price points that aren’t justified for a small team that wants its own queryable database the rest of its stack can read from.

SecOps-Platform was built to be that database.

System

Collection runs on APScheduler for recurring jobs and an RQ/Redis worker pool for async batch processing. Each cycle walks ten-plus AWS APIs (EC2, RDS, S3, Route53, GuardDuty, Inspector V2 including ECR container scanning, ELB/ALB/NLB, CloudFront), pulls hosts and detections from CrowdStrike, collects AWS VPC topology (security groups, NACLs, route tables, ENIs), and resolves Cloudflare DNS records. Everything writes into PostgreSQL with lossless JSONB storage.
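
A minimal sketch of that loop, assuming APScheduler and RQ as described; collector names, module paths, and the cadence are illustrative:

    # Scheduler fires a collection cycle; RQ fans out per-collector jobs
    # to the Redis-backed worker pool. Phase 3 reconciliation runs after
    # ingestion completes (not shown here).
    from apscheduler.schedulers.blocking import BlockingScheduler
    from redis import Redis
    from rq import Queue

    queue = Queue("collection", connection=Redis())

    COLLECTORS = [
        "ec2", "rds", "s3", "route53", "guardduty", "inspector2",
        "elb", "cloudfront", "crowdstrike", "vpc_topology", "cloudflare_dns",
    ]

    def run_cycle() -> None:
        for name in COLLECTORS:
            # String job paths keep the scheduler decoupled from collector code.
            queue.enqueue("collectors.collect_one", name)

    scheduler = BlockingScheduler()
    scheduler.add_job(run_cycle, "interval", hours=1)  # cadence is illustrative
    scheduler.start()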

The default auth path is unattended: a non-MFA, read-only collection role scoped to inventory APIs only, suitable for a scheduled service that should never sit on long-lived human credentials. Sessions that need elevated multi-region access escalate to Okta OIDC federation, requesting temporary STS credentials in parallel across regions.
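
A sketch of the elevated path using boto3; the role ARN, session name, and region list are placeholders, and acquiring the Okta OIDC token is out of scope here:

    # Exchange an Okta-issued OIDC token for temporary STS credentials
    # in every region in parallel. No long-lived keys are stored anywhere.
    from concurrent.futures import ThreadPoolExecutor
    import boto3

    REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # illustrative

    def assume_in_region(region: str, oidc_token: str) -> dict:
        sts = boto3.client("sts", region_name=region)
        resp = sts.assume_role_with_web_identity(
            RoleArn="arn:aws:iam::123456789012:role/secops-collector",  # placeholder
            RoleSessionName="secops-collection",
            WebIdentityToken=oidc_token,
            DurationSeconds=3600,  # short-lived by design
        )
        return resp["Credentials"]

    def assume_all_regions(oidc_token: str) -> dict[str, dict]:
        with ThreadPoolExecutor(max_workers=len(REGIONS)) as pool:
            futures = {r: pool.submit(assume_in_region, r, oidc_token) for r in REGIONS}
            return {region: f.result() for region, f in futures.items()}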

After ingestion, every cycle automatically re-runs reconciliation as Phase 3 — matching EC2 against Falcon sensors by instance ID, falling back to private IP, and emitting matched / unprotected / stale classifications. This isn’t a one-shot match because CrowdStrike retains records of long-dead hosts: a fleet running heavy autoscaling can have tens of thousands of Falcon host records but only a few hundred actually live at any point in time. Any query that uses CrowdStrike numbers directly — “what’s our endpoint coverage percentage?”, “which endpoints don’t have a sensor?”, “who’s running an out-of-date sensor version?” — returns nonsense without reconciliation, because the denominator is dominated by stale records. Phase 3 tags every Falcon record as live-in-current-snapshot or stale, and the downstream views read against the live set only. The same module runs S3 posture analysis (public access, encryption, versioning) and RDS posture analysis (encryption, public access, backup, multi-AZ).
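
In outline, the match logic looks something like this; field names are illustrative, not the actual schema:

    def reconcile(ec2_instances: list[dict], falcon_hosts: list[dict]):
        """Classify rows as matched, unprotected (live EC2, no sensor),
        or stale (Falcon record with no live EC2 counterpart)."""
        by_instance_id = {h["instance_id"]: h for h in falcon_hosts if h.get("instance_id")}
        by_private_ip = {h["local_ip"]: h for h in falcon_hosts if h.get("local_ip")}

        matched, unprotected = [], []
        seen_devices = set()

        for inst in ec2_instances:
            # Primary key: instance ID. Fallback: private IP, because
            # CrowdStrike doesn't always populate AWS metadata cleanly.
            host = (by_instance_id.get(inst["instance_id"])
                    or by_private_ip.get(inst.get("private_ip")))
            if host:
                matched.append((inst["instance_id"], host["device_id"]))
                seen_devices.add(host["device_id"])
            else:
                unprotected.append(inst["instance_id"])

        # Ghosts: sensor records whose instance is gone from the current snapshot.
        stale = [h["device_id"] for h in falcon_hosts if h["device_id"] not in seen_devices]
        return matched, unprotected, stale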

Every collection run produces an immutable snapshot. Drift between snapshots is queryable — “which buckets had public access enabled in the last 30 days?” is one SQL query, not a manual diff. Snapshots can be backed up to S3 and restored, so the recovery story isn’t just “Postgres is up.”
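
For instance, the 30-day public-access question reduces to something shaped like the query below; the table and column names here are hypothetical, since the real schema is out of scope for this writeup:

    # Buckets whose snapshot rows show public ACLs unblocked at any point
    # in the last 30 days, with the cycle timestamp they appeared in.
    DRIFT_SQL = """
    SELECT DISTINCT b.bucket_name, s.collected_at
    FROM s3_buckets b
    JOIN snapshots s ON s.id = b.snapshot_id
    WHERE s.collected_at >= now() - interval '30 days'
      AND (b.raw -> 'PublicAccessBlock' ->> 'BlockPublicAcls')::boolean IS NOT TRUE
    ORDER BY s.collected_at DESC;
    """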

Downstream consumers query the database four ways:

  • Direct SQL for ad-hoc analysis
  • SQL view layer (SECURITY_VIEWS/) for canonical questions: unprotected hosts, open ingress on production, active threats, S3 bucket exposure, sensor versions broken down by role / EKS cluster / snowflake classification
  • REST API (40-plus endpoints, JWT auth) for programmatic consumption and four dashboards: infrastructure, security, patching, and WARP
  • MCP PostgreSQL tools for Eleanor-AI agents and LLM-Daily-Summary finding triage containers

The database is the boundary. Collectors only write; consumers only read; nothing in the system reasons about security data without going through PostgreSQL first. That separation is what makes the whole stack composable.

Architecture

                    Okta OIDC federation
                    (3 regions, parallel)
                            │
                            ▼
   ┌────────────────────────────────────────────────────┐
   │           COLLECTION LAYER (10+ collectors)        │
   │                                                    │
   │  AWS APIs                  CrowdStrike API         │
   │  ─────────                 ───────────────         │
   │  EC2, RDS, S3, Route53,    OAuth2, batch API       │
   │  CloudFront, ELB/ALB/NLB,  hosts, detections,      │
   │  Inspector V2 (incl. ECR), alerts, incidents       │
   │  GuardDuty, SGs, NACLs,                            │
   │  route tables, ENIs +                              │
   │  Cloudflare DNS resolver                           │
   └────────────────────────┬───────────────────────────┘
                            │ raw JSONB writes
                            ▼
   ┌────────────────────────────────────────────────────┐
   │           POSTGRESQL (source of truth)             │
   │                                                    │
   │  31 tables · 66 Alembic migrations · GIN indexes   │
   │  Immutable snapshots, CASCADE-deletable            │
   └────────────────────────┬───────────────────────────┘
                            │
                            ▼
   ┌────────────────────────────────────────────────────┐
   │           ANALYSIS LAYER (post-collection)         │
   │                                                    │
   │  • AWS ↔ CrowdStrike reconciliation (every cycle)  │
   │      (instance ID match → private IP fallback)     │
   │  • S3 / RDS posture analysis                       │
   │  • AWS VPC topology flattening                     │
   │      (SGs + NACLs + routes → reachability graph)   │
   │  • BFS attack-path traversal (per snapshot)        │
   │  • Cloudflare → AWS DNS bridging                   │
   │  • RDS PII sanitizer (pre-storage)                 │
   │  • Security posture views (sensor versions by      │
   │      role / EKS / snowflake, S3 exposure, RDS,     │
   │      open ingress, EDR coverage, patching)         │
   └────────────────────────┬───────────────────────────┘
                            │
                            ▼
   ┌────────────────────────────────────────────────────┐
   │                  CONSUMERS                         │
   │                                                    │
   │  Dashboards     Eleanor-AI       LLM-Daily-Summary │
   │  REST API       MCP PostgreSQL   Finding triage    │
   │  SQL ad-hoc     tool access      → JIRA tickets    │
   └────────────────────────────────────────────────────┘

Key design decisions

Database-first, lossless storage

Every AWS and CrowdStrike API response goes into JSONB columns alongside the normalized fields the platform actually queries. The reason isn’t just future flexibility — it’s data integrity for the analysis layer. Reconciliation, posture checks, and custom views all read against raw vendor responses, not against a lossy normalization. If a vendor changes a field shape, adds a new attribute, or starts populating a previously-empty key, the platform can pick that up by adding a view; the underlying records were never thrown away.

That has two downstream effects worth making explicit:

  • Custom logic is built on source-of-truth data. The reconciliation matcher, the network-flattening pass, the S3/RDS posture checks — none of them are reasoning about a summary. They’re reasoning about exactly what the vendor returned. If a future audit question asks “what did CrowdStrike actually report for this host on this date?”, the answer is one row away.
  • Views can be added that nobody thought of at collection time. Several have been: sensor-version breakdowns by role / EKS / snowflake classification, attack-path queries, Inspector ECR coverage views — all written months after the underlying data was collected, querying historical snapshots that a strict-schema design would have lost.

GIN indexes on the JSONB columns keep ad-hoc queries fast enough for interactive dashboards. Storage cost is real but small at this scale and not the design driver.
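
The pattern, sketched with hypothetical table and column names: one GIN index over the raw JSONB column, and containment queries that stay interactive without any schema change:

    # jsonb_path_ops builds a smaller index that serves containment (@>) queries.
    GIN_INDEX_SQL = """
    CREATE INDEX IF NOT EXISTS idx_falcon_hosts_raw
        ON falcon_hosts USING GIN (raw jsonb_path_ops);
    """

    # A view nobody planned at collection time: hosts by reported platform,
    # read straight out of the raw vendor payload.
    PLATFORM_SQL = """
    SELECT raw ->> 'hostname' AS hostname,
           raw ->> 'agent_version' AS agent_version
    FROM falcon_hosts
    WHERE raw @> '{"platform_name": "Linux"}';
    """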

AWS ↔ CrowdStrike reconciliation

The single most-used analysis pass in the platform, and the one that makes everything downstream of it trustworthy. The matcher uses instance ID as the primary match key and private IP as a fallback (CrowdStrike doesn’t always populate AWS metadata cleanly), classifying every EC2/Falcon row as matched, unprotected, or stale.

The reason it’s load-bearing: CrowdStrike’s host inventory is append-mostly. Hosts that scale up, run for a few hours, and terminate leave their Falcon records behind. Over a year of autoscaling, a fleet with a few hundred live hosts at any given moment can accumulate tens of thousands of Falcon host records — most of them ghosts. A naive query like “how many of our Falcon hosts have an outdated sensor?” returns a number dominated by dead hosts that haven’t checked in for months. The number is technically correct and operationally useless.

Reconciliation fixes this by joining Falcon records against the current AWS instance snapshot: a Falcon host whose corresponding EC2 instance is gone gets tagged stale, regardless of what its sensor version says. Every downstream view (no_sensor.sql, sensor_versions_by_role.sql, the patching dashboard, Eleanor-AI’s coverage queries) reads against the live set, not the raw Falcon table. That’s what made:

  • The endpoint coverage metric on the security dashboard (the one that moved from 90.69% to 97.44%) actually mean what it says
  • Eleanor-AI’s coverage queries return numbers that match what an engineer sees in the AWS console
  • The certificate rotation readiness check possible at all (against the raw Falcon inventory, the question couldn’t have been answered in any reasonable time, because most of the rows were ghosts)
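
Concretely, the coverage metric is one aggregate over the reconciled live set; a sketch with hypothetical table and column names:

    # Denominator is live rows only: matched + unprotected. Stale Falcon
    # ghosts never enter the calculation.
    COVERAGE_SQL = """
    SELECT round(
             100.0 * count(*) FILTER (WHERE r.status = 'matched')
                   / count(*), 2) AS edr_coverage_pct
    FROM reconciliation r
    WHERE r.snapshot_id = :snapshot_id
      AND r.status IN ('matched', 'unprotected');
    """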

The reconciliation module also runs separate posture-analysis passes alongside the AWS↔CrowdStrike join — different algorithm, same code location. S3 buckets are checked for public access, encryption, and versioning; RDS instances for encryption, public access, backup configuration, and multi-AZ. Those are per-resource checks against multiple criteria on a single inventory, not cross-inventory joins, so they don’t have a “live vs. stale” axis the way the EDR reconciliation does.

AWS VPC topology flattening

VPC network rules are nested: a security group can reference another security group, which references a prefix list, which contains CIDRs, plus NACLs at the subnet level, plus route tables, plus ENI-to-SG mappings. Asking “can this instance talk to that bucket?” requires walking that graph by hand. (To be precise: this is AWS VPC topology only — the platform does not collect on-prem, SDN, or third-party network appliance topology.)

The collector flattens all of it into a reachability table where every row is “resource A can reach resource B over port C via mechanism D.” Once that table exists, the analysis layer runs BFS up to a configurable hop count over a SQL reachability query (joining SG rules + route tables + optional NACL validation), scoped to a snapshot. Two analysis paths share the same primitives but answer different questions; a sketch of the shared traversal follows the list:

  • Internet-exposure paths — starting from any internet-reachable instance, what’s reachable inward in N hops?
  • East-west / lateral movement — starting from any compromised internal instance, what’s reachable inside the VPC?
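
Both paths run the same bounded traversal as a recursive CTE over the flattened table; in this sketch the table and parameter names are hypothetical, and only the seed set and hop limit differ per path:

    ATTACK_PATH_SQL = """
    WITH RECURSIVE paths AS (
        SELECT r.src_id, r.dst_id, r.port, 1 AS hops,
               ARRAY[r.src_id, r.dst_id] AS path
        FROM reachability r
        WHERE r.snapshot_id = :snapshot_id
          AND r.src_id = ANY(:seed_ids)      -- internet-exposed or compromised seeds
      UNION ALL
        SELECT p.src_id, r.dst_id, r.port, p.hops + 1,
               p.path || r.dst_id
        FROM paths p
        JOIN reachability r
          ON r.src_id = p.dst_id
         AND r.snapshot_id = :snapshot_id
        WHERE p.hops < :max_hops             -- configurable hop count
          AND NOT r.dst_id = ANY(p.path)     -- no cycles
    )
    SELECT * FROM paths;
    """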

Snapshot time-travel

Every collection run is an immutable snapshot, joined to all related rows via foreign key with ON DELETE CASCADE. Old snapshots can be dropped wholesale to manage storage; recent snapshots stay for drift detection and compliance evidence. “What was our S3 public-access posture on the day of the SOC 2 audit?” is a SQL query against the snapshot from that date, not a panicked attempt to reconstruct it.
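
Pinning a posture question to a date is a join through the snapshot table; a sketch, again with hypothetical names:

    # S3 public-access posture as recorded on the audit date.
    AUDIT_DAY_SQL = """
    SELECT b.bucket_name,
           (b.raw -> 'PublicAccessBlock' ->> 'BlockPublicAcls')::boolean
               AS block_public_acls
    FROM s3_buckets b
    JOIN snapshots s ON s.id = b.snapshot_id
    WHERE s.collected_at::date = DATE '2025-06-15'  -- illustrative audit date
    ORDER BY b.bucket_name;
    """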

Okta OIDC instead of long-lived AWS keys

Multi-region collection authenticates via Okta federation, requesting temporary STS credentials in parallel for all three regions. No long-lived IAM access keys exist anywhere in the platform — not in env vars, not in .aws/credentials, not in secrets manager. If the platform is compromised, an attacker gets short-lived session credentials that have to be refreshed through Okta-protected flows.

This isn’t a unique pattern, but it’s worth noting because it removes a class of credential-leak incidents entirely.

Plugin architecture (in progress)

Adding a new data source under the original monolithic-collector pattern took 39–48 hours: schema design, migration, collector code, tests, integration. Most of that was rewriting the same collector skeleton with different API calls.

The plugin framework (src/plugins/ with core/, integrations/, resilience/, validation/, testing/ subtrees) separates collector logic from the platform skeleton. New plugins are ~40 lines of declarative configuration plus per-source transform functions, targeting 4–8 hours per integration. The framework is built and ready; migrating the existing monolithic collectors is paid down opportunistically, so most of them are still in their original form. New collectors land as plugins.
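
Roughly the shape of a plugin declaration; the class and field names here are illustrative, not the framework's actual API:

    from dataclasses import dataclass, field
    from typing import Callable

    @dataclass
    class PluginSpec:
        name: str
        schedule: str                  # cron expression
        endpoints: list[str]           # vendor API surfaces to walk
        table: str                     # destination table (raw JSONB + keyed fields)
        transforms: dict[str, Callable] = field(default_factory=dict)

    cloudflare_dns = PluginSpec(
        name="cloudflare_dns",
        schedule="0 * * * *",
        endpoints=["zones", "dns_records"],
        table="cloudflare_dns_records",
        transforms={
            # Per-source transform: extract the fields views key on;
            # the full vendor response still lands in JSONB untouched.
            "dns_records": lambda rec: {"fqdn": rec["name"], "type": rec["type"]},
        },
    )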

Impact

The platform is the data layer; most of its impact shows up in what other systems can do because it exists.

  Capability                          Without the platform                      With the platform
  ──────────────────────────────────  ────────────────────────────────────────  ──────────────────────────────────────────
  EDR coverage cross-reference        Manual cross-checking between the AWS     One SQL query
                                      console and the CrowdStrike portal
  Attack path analysis                Case-by-case, manual                      Recursive CTE on flattened topology
  S3 encryption audit (multiple       Periodic manual sweep                     Continuous, snapshot-tracked
  regions, hundreds of buckets)
  Q1 2026 Eleanor-AI workload         Would require this data layer            145 queries against the database
  LLM-Daily-Summary triage            Would require this data layer            2,259 findings auto-triaged since Aug 2025
  Certificate rotation readiness      1–2 weeks manual                          ~2 minutes via Eleanor-AI

A representative number: the certificate rotation assessment across the EC2 fleet ran end-to-end in about two minutes, classifying every instance by sensor version and flagging the one host needing upgrade. The same question by hand is 1–2 weeks of cross-console work — expensive enough that running it on every CrowdStrike rotation would be hard to justify against everything else competing for the same hours. The platform’s job is to make that class of question cheap enough to be routine.

Tradeoffs and what I’d do differently

No RBAC. Currently every authenticated user has full read access to every endpoint. For a small team this is fine; if the platform gets shared with developers or auditors, it hits that ceiling fast. RBAC is the first item on the to-do list.

The reconciliation logic is brittle in the face of vendor schema changes. AWS occasionally changes the shape of API responses (e.g. a new field nested differently). The lossless JSONB protects the platform from breaking; it doesn’t protect the views from going stale. Migrations to new schema versions are manual, infrequent, and sometimes painful. If I were rebuilding, I’d add a schema-drift detection layer that flags JSONB shape changes between snapshots automatically.
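
The detector could be small; a sketch of the approach, diffing JSONB key shapes between two snapshots, with hypothetical table names:

    # Top-level key set for one table in one snapshot.
    SHAPE_SQL = """
    SELECT DISTINCT jsonb_object_keys(raw) AS key
    FROM falcon_hosts
    WHERE snapshot_id = %(snapshot_id)s;
    """

    def shape_drift(conn, old_snap: int, new_snap: int) -> dict[str, set[str]]:
        """Flag keys that appeared or vanished between snapshots; a non-empty
        diff is the signal that downstream views may have gone stale."""
        with conn.cursor() as cur:
            cur.execute(SHAPE_SQL, {"snapshot_id": old_snap})
            old_keys = {row[0] for row in cur.fetchall()}
            cur.execute(SHAPE_SQL, {"snapshot_id": new_snap})
            new_keys = {row[0] for row in cur.fetchall()}
        return {"added": new_keys - old_keys, "removed": old_keys - new_keys}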

The biggest single lesson: building the data layer first, before building the agents and automation that read from it, was the right call by a wide margin. Most security tooling tries to do the analysis and the data collection at the same time, and ends up with bespoke pipelines per use case. Putting the database in the middle as a real boundary made every downstream system smaller, more reliable, and faster to build. Eleanor-AI is ~30% of the size it would have been if it had to do its own data collection.


Architecture and aggregate impact only. Schemas, queries, and reconciliation rules are not part of this writeup.