September 1, 2023 · 4 min read · Updated Apr 05, 2026

Scaling the Ingest: Architectural Lessons from TraxinteL

Ingestion is a state machine, not a scraper. Lessons learned from building high-scale distributed collection pipelines.

Written by
Ben Moataz

Systems Architect, Consultant, and Product Builder

Independent systems architect helping teams turn intelligence, evidence, and automation workflows into reliable products and clearer operating decisions.

Why I'm qualified to write this

This article is grounded in hands-on work across Collection and orchestration, Correlation and scoring, and Evidence and forensics, including systems such as TraxinteL, SOVRINT, and WingAgent.

I write from hands-on work across product systems, evidence pipelines, ranking layers, monitoring surfaces, and automation runtimes that have to stay reliable under operational pressure.

  • Years spent building product systems, automation infrastructure, and operator-facing platforms.
  • Project records and case studies tied directly to the same capability lanes discussed in the writing.
  • A public archive designed to connect essays back to real systems, delivery constraints, and consulting work.

When we set out to build the core architecture for TraxinteL, we were faced with a daunting requirement: ingest millions of disparate data points from thousands of varied sources, enrich them in near-real-time, and preserve every byte of evidence with cryptographic integrity.

In the early prototypes, we used what I call the “Naïve Scraper” approach. You write a script, it fetches a page, parses it, and writes the result to a database. This works for a hundred targets. It collapses for a million.

The fundamental lesson we learned at TraxinteL is that ingestion is not a scraper problem—it is a state-machine problem.

In this essay, I’ll break down the architectural shifts we made to move from brittle scripts to a high-scale, event-driven intelligence core.


1. From Monoliths to Event-Driven Fleets

The first major mistake in scaling ingestion is the “Large Job” pattern. You create a single process that handles the entire lifecycle of a capture: authentication, navigation, extraction, and storage. If any part of this process fails (a timeout, a network hiccup, a change in the DOM), the entire job dies and you lose the state.

At TraxinteL, we decomposed this monolith into an Event-Driven Fleet.

Instead of a “Scraper,” we built discrete Task Workers. Each worker is a stateless, short-lived microservice responsible for exactly one transition in the intelligence pipeline. For example:

  • Task A: Fetch the HTML and upload the raw asset to S3.
  • Task B: Extract entities from the cached S3 asset.
  • Task C: Correlate entities against the global identity graph.

By decoupling these steps, we gained immense reliability. If the extraction logic in Task B failed, we didn’t need to re-fetch the data in Task A (saving proxy costs and preserving stealth). We simply retried Task B with updated code.
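The decomposition above can be sketched as three stateless handlers. This is an illustrative model, not TraxinteL code: a dict stands in for the S3 bucket, the fetch step takes the response body as an argument instead of making a real HTTP call, and the entity extractor is deliberately trivial.

```python
import hashlib

# Stand-in for the immutable asset store (S3 in the real pipeline).
BUCKET = {}

def task_a_fetch(uri: str, html: str) -> str:
    """Task A: store the raw asset and return its key.
    (A real worker would perform the HTTP fetch itself.)"""
    key = "raw/" + hashlib.sha256(uri.encode()).hexdigest()
    BUCKET[key] = html
    return key

def task_b_extract(raw_key: str) -> list:
    """Task B: extract entities from the cached asset, never from the network."""
    html = BUCKET[raw_key]
    return [word for word in html.split() if word.istitle()]

def task_c_correlate(entities: list, graph: set) -> list:
    """Task C: keep only entities already present in the identity graph."""
    return [e for e in entities if e in graph]
```

Because Task B reads from `BUCKET` rather than the network, a buggy extractor can be fixed and re-run against the same key, which is exactly the retry property described above.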


2. Ingestion as a State Machine

The true breakthrough came when we stopped viewing ingestion as a “flow” and started viewing it as a State Machine orchestrated by AWS Step Functions.

Every intelligence capture is a state machine. It has an initial state (Pending), several transitional states (Fetched, Parsed, Enriched), and a terminal state (Completed or Failed).

Using Step Functions allowed us to manage these transitions with “System Memory.” If a worker fleet was temporarily overwhelmed or a proxy provider went down, the state machine would simply “pause” and retry with exponential backoff. We no longer had “lost data” because the state of every single ingestion task was persisted in a resilient orchestration layer.

This moved our engineering focus from “writing better scrapers” to “defining better transitions.”
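The lifecycle can be modeled as an explicit transition table with per-transition retries. In production, Step Functions persists the state and handles the backoff; the minimal sketch below (state names from the article, everything else illustrative) only shows the transition logic.

```python
import time

# Pending -> Fetched -> Parsed -> Enriched -> Completed
TRANSITIONS = {
    "Pending":  "Fetched",
    "Fetched":  "Parsed",
    "Parsed":   "Enriched",
    "Enriched": "Completed",
}

def run_capture(do_step, max_attempts=4, base_delay=0.01):
    """Advance a capture through its states, retrying each transition
    with exponential backoff instead of losing the whole job."""
    state = "Pending"
    while state != "Completed":
        nxt = TRANSITIONS[state]
        for attempt in range(max_attempts):
            try:
                do_step(state, nxt)  # the worker owning this one transition
                state = nxt
                break
            except Exception:
                time.sleep(base_delay * 2 ** attempt)  # back off, then retry
        else:
            return "Failed"  # terminal state: retries exhausted
    return state
```

The key property is that a failure mid-pipeline (say, a proxy dropping during Fetched → Parsed) costs one retried transition, not the whole capture.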


3. The Idempotency Problem

At scale, you will inevitably run the same task twice. A worker might crash after completing its work but before reporting success. A message queue might deliver a duplicate event.

If your system is not idempotent, you will end up with duplicate entities, corrupted logs, and massive storage overhead.

We solved this at the foundation. Every ingestion task has a unique Correlation-ID derived from the source URI and the timestamp. Before any worker writes to the database, it performs a “Conditional Put.” If the ID already exists and the state is “Completed,” the worker simply stops.

This idempotency allowed us to blast the system with retries without fear of data corruption. Reliability is impossible without idempotency.
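A minimal sketch of that guard, assuming a dict in place of the database (DynamoDB's conditional write would play the same role in production). The key derivation mirrors the Correlation-ID described above: source URI plus capture timestamp.

```python
import hashlib

STORE = {}  # stand-in for the entity database

def correlation_id(source_uri: str, timestamp: str) -> str:
    """Derive a stable ID from the source URI and capture timestamp."""
    return hashlib.sha256(f"{source_uri}|{timestamp}".encode()).hexdigest()

def conditional_put(cid: str, payload: dict) -> bool:
    """Write only if this capture has not already completed.
    Returns True if we wrote, False if a duplicate was skipped."""
    existing = STORE.get(cid)
    if existing is not None and existing["state"] == "Completed":
        return False  # duplicate delivery: stop with no side effects
    STORE[cid] = {"state": "Completed", "payload": payload}
    return True
```

Running the same task twice now produces exactly one record, which is what makes aggressive retries safe.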


4. Evidence Preservation: The “Raw First” Rule

In most data pipelines, the goal is to transform data as quickly as possible. You fetch a page, extract the JSON, and discard the HTML.

In an Intelligence Core, this is a fatal error. The raw asset is the evidence.

At TraxinteL, we implemented the “Raw First” rule: no processing happens until the raw, unmodified response from the target is stored in an immutable S3 bucket.

  • If we find a bug in our parser six months from now, we can re-run the extraction against the raw evidence.
  • If a client questions the validity of a piece of intelligence, we can present the raw original asset.
  • If an adversary deletes the source post, we still possess the “Scanned Archive.”

Storage is cheap. Trust is expensive. Save everything.
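The rule reduces to a simple invariant: persist first, parse later, and always parse from the stored copy. A hypothetical sketch, with a dict standing in for the write-once S3 bucket and a content-addressed key so duplicate captures of the same body collapse to one evidence record:

```python
import hashlib

EVIDENCE = {}  # stand-in for the immutable evidence bucket

def store_raw(source_uri: str, body: bytes) -> str:
    """Persist the unmodified response under a content-addressed key
    before any processing is allowed to run."""
    key = hashlib.sha256(body).hexdigest()
    EVIDENCE.setdefault(key, {"source": source_uri, "body": body})
    return key

def extract(key: str, parser) -> list:
    """Run (or re-run) extraction against stored evidence, never the network."""
    return parser(EVIDENCE[key]["body"])
```

If the parser turns out to be buggy six months later, `extract` can simply be called again with the fixed parser and the same key; the evidence never moves.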


5. Summary: Operational Survivability

Scaling ingestion isn’t about how fast you can scrape; it’s about how gracefully you can fail.

The TraxinteL architecture succeeded because it assumed the web was broken, proxies were unreliable, and code was buggy. We didn’t build a “Perfect Scraper”; we built a Resilient Orchestrator.

By moving from scripts to state machines, focusing on idempotency, and treating raw data as immutable evidence, we built a system that could handle the chaos of the internet at scale.

This is the shift that separates the hobbyist from the operator. Stop building scrapers. Start building state machines.
