When we set out to build the core architecture for TraxinteL, we were faced with a daunting requirement: ingest millions of disparate data points from thousands of varied sources, enrich them in near-real-time, and preserve every byte of evidence with cryptographic integrity.
In the early prototypes, we used what I call the “Naïve Scraper” approach. You write a script, it fetches a page, parses it, and writes the result to a database. This works for a hundred targets. It collapses for a million.
The fundamental lesson we learned at TraxinteL is that ingestion is not a scraper problem—it is a state-machine problem.
In this essay, I’ll break down the architectural shifts we made to move from brittle scripts to a high-scale, event-driven intelligence core.
1. From Monoliths to Event-Driven Fleets
The first major mistake in scaling ingestion is the “Large Job” pattern. You create a single process that handles the entire lifecycle of a capture: authentication, navigation, extraction, and storage. If any part of this process fails (a timeout, a network hiccup, a change in the DOM), the entire job dies and you lose the state.
At TraxinteL, we decomposed this monolith into an Event-Driven Fleet.
Instead of a “Scraper,” we built discrete Task Workers. Each worker is a stateless, short-lived microservice responsible for exactly one transition in the intelligence pipeline. For example:
- Task A: Fetch the HTML and upload the raw asset to S3.
- Task B: Extract entities from the cached S3 asset.
- Task C: Correlate entities against the global identity graph.
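The three tasks above can be sketched as stateless handlers that pass an event from one transition to the next. This is a minimal illustration, not TraxinteL's actual code; the function names and event fields are hypothetical, and the fetch/parse/correlate bodies are stubbed.

```python
# Each Task Worker is a pure function: it receives an event, performs
# exactly one transition, and emits the event for the next stage. No
# worker holds state, so any stage can be retried independently.

def fetch_raw(event):
    # Task A: fetch the page and store the raw bytes (stubbed here).
    # Downstream tasks reference the cached asset by key, never the live site.
    asset_key = f"raw/{event['capture_id']}.html"
    return {**event, "asset_key": asset_key, "state": "Fetched"}

def extract_entities(event):
    # Task B: parse the cached asset. A failure here is retried with
    # fixed code against the same asset -- no re-fetch, no proxy cost.
    return {**event, "entities": ["example-entity"], "state": "Parsed"}

def correlate(event):
    # Task C: match entities against the identity graph (stubbed).
    return {**event, "state": "Enriched"}

event = {"capture_id": "abc123", "uri": "https://example.com/post/1"}
for task in (fetch_raw, extract_entities, correlate):
    event = task(event)

print(event["state"])  # the event itself carries all pipeline state
```

Because each handler's only input is the event and its only output is the next event, the orchestration layer can replay any single step without touching the others.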
By decoupling these steps, we gained immense reliability. If the extraction logic in Task B failed, we didn’t need to re-fetch the data in Task A (saving proxy costs and preserving stealth). We simply retried Task B with updated code.
2. Ingestion as a State Machine
The true breakthrough came when we stopped viewing ingestion as a “flow” and started viewing it as a State Machine orchestrated by AWS Step Functions.
Every intelligence capture is a state machine. It has an initial state (Pending), several transitional states (Fetched, Parsed, Enriched), and a terminal state (Completed or Failed).
Using Step Functions allowed us to manage these transitions with “System Memory.” If a worker fleet was temporarily overwhelmed or a proxy provider went down, the state machine would simply “pause” and retry with exponential backoff. We no longer had “lost data” because the state of every single ingestion task was persisted in a resilient orchestration layer.
This moved our engineering focus from “writing better scrapers” to “defining better transitions.”
3. The Idempotency Problem
At scale, you will inevitably run the same task twice. A worker might crash after completing its work but before reporting success. A message queue might deliver a duplicate event.
If your system is not idempotent, you will end up with duplicate entities, corrupted logs, and massive storage overhead.
We solved this at the foundation. Every ingestion task has a deterministic Correlation-ID derived from the source URI and the scheduled capture time (not the attempt time; a retry must map to the same ID). Before any worker writes to the database, it performs a “Conditional Put.” If the ID already exists and the state is “Completed,” the worker simply stops.
This idempotency allowed us to blast the system with retries without fear of data corruption. Reliability is impossible without idempotency.
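The Correlation-ID plus Conditional Put pattern can be sketched in a few lines. Here an in-memory dict stands in for the real store (in DynamoDB this would be a `ConditionExpression` on the key); the function names are illustrative.

```python
import hashlib

def correlation_id(source_uri, capture_time):
    # Deterministic ID: the same capture always hashes to the same key,
    # so a retried or duplicate task maps onto the same record.
    return hashlib.sha256(f"{source_uri}|{capture_time}".encode()).hexdigest()

STORE = {}

def conditional_put(cid, record):
    # The idempotency gate: write only if no completed record exists.
    existing = STORE.get(cid)
    if existing and existing["state"] == "Completed":
        return False  # duplicate delivery; the worker simply stops
    STORE[cid] = record
    return True

cid = correlation_id("https://example.com/post/1", "2024-06-01T00:00:00Z")
first = conditional_put(cid, {"state": "Completed", "entities": ["x"]})
second = conditional_put(cid, {"state": "Completed", "entities": ["x", "x"]})
print(first, second)  # True False: the duplicate write is a no-op
```

The duplicate attempt produces no second entity and no corrupted log line, which is exactly what makes aggressive retries safe.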
4. Evidence Preservation: The “Raw First” Rule
In most data pipelines, the goal is to transform data as quickly as possible. You fetch a page, extract the JSON, and discard the HTML.
In an Intelligence Core, this is a fatal error. The raw asset is the evidence.
At TraxinteL, we implemented the “Raw First” rule: no processing happens until the raw, unmodified response from the target is stored in an immutable S3 bucket.
- If we find a bug in our parser six months from now, we can re-run the extraction against the raw evidence.
- If a client questions the validity of a piece of intelligence, we can present the raw original asset.
- If an adversary deletes the source post, we still possess the “Scanned Archive.”
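The Raw First rule reduces to a strict ordering: preservation succeeds before parsing begins. The sketch below writes the untouched response bytes under a content-hash key; a local directory stands in for the immutable S3 bucket (which would also use versioning or Object Lock in production), and all names are illustrative.

```python
import hashlib
import pathlib

ARCHIVE = pathlib.Path("raw-archive")

def preserve_raw(raw_bytes):
    # Content-addressed key: the SHA-256 digest doubles as an integrity
    # check -- anyone can re-hash the asset and verify it is unmodified.
    digest = hashlib.sha256(raw_bytes).hexdigest()
    ARCHIVE.mkdir(exist_ok=True)
    path = ARCHIVE / f"{digest}.html"
    if not path.exists():  # never overwrite evidence
        path.write_bytes(raw_bytes)
    return digest

def parse(raw_bytes):
    # Transformation runs only AFTER preservation has succeeded.
    return raw_bytes.decode().strip()

raw = b"<html>original evidence</html>"
digest = preserve_raw(raw)   # step 1: the evidence is on disk
parsed = parse(raw)          # step 2: the lossy transformation
print(digest[:12], parsed)
```

If the parser turns out to be buggy, the archive lets us re-run step 2 against the original bytes at any time; the digest gives us something verifiable to hand a client who questions the intelligence.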
Storage is cheap. Trust is expensive. Save everything.
5. Summary: Operational Survivability
Scaling ingestion isn’t about how fast you can scrape; it’s about how gracefully you can fail.
The TraxinteL architecture succeeded because it assumed the web was broken, proxies were unreliable, and code was buggy. We didn’t build a “Perfect Scraper”; we built a Resilient Orchestrator.
By moving from scripts to state machines, focusing on idempotency, and treating raw data as immutable evidence, we built a system that could handle the chaos of the internet at scale.
This is the shift that separates the hobbyist from the operator. Stop building scrapers. Start building state machines.