
How Do You Process Years of Tick Data Cheaply? Single-Pass ETL with Discard

By TickDistill — order-flow microstructure signals. Educational content, not financial advice.

The short answer

Processing years of tick data cheaply comes down to one principle: read each day of raw data exactly once, run every signal processor over that single read, write the derived outputs, and discard the raw as soon as those outputs are verified. Storage cost is a real reason to discard, but not the only one. Because our tick sources are free, re-deriving from scratch later costs time, not money, so even with free storage there would be no reason to keep data you have already metabolized.

What is single-pass streaming ETL, and why does it matter?

Single-pass ETL is a data-engineering pattern where each unit of raw input is read precisely once, transformed into all required outputs in that one pass, and then released. The alternative — one read per output — multiplies IO proportionally to the number of downstream consumers and keeps raw data resident indefinitely.

For high-frequency market data the practical difference is large. A day of aggregated-trade records for a single perpetual contract fits comfortably in memory as a Polars DataFrame. A full history across years does not. Single-pass processing keeps the working set bounded: load one day, emit all signals for that day, release the memory, move to the next.
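The bounded working set described above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: `load_day`, the tuple event shape, and the three toy processors are hypothetical stand-ins, and plain Python lists replace the Polars DataFrame.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical event shape for illustration: (timestamp_ms, price, size).
Event = Tuple[int, float, float]


def load_day(day: str, read_log: List[str]) -> List[Event]:
    """Stand-in for downloading and parsing one day of raw ticks."""
    read_log.append(day)  # record every raw read so reads can be counted
    return [(1, 100.0, 2.0), (2, 100.5, 1.0), (3, 99.5, 4.0)]


def total_volume(events: List[Event]) -> float:
    return sum(size for _, _, size in events)


def trade_count(events: List[Event]) -> int:
    return len(events)


def price_range(events: List[Event]) -> float:
    prices = [price for _, price, _ in events]
    return max(prices) - min(prices)


# Toy processors; the real signal processors are not shown in this article.
PROCESSORS: Dict[str, Callable[[List[Event]], float]] = {
    "total_volume": total_volume,
    "trade_count": trade_count,
    "price_range": price_range,
}


def run_single_pass(
    days: Iterable[str],
    processors: Dict[str, Callable[[List[Event]], float]],
    read_log: List[str],
) -> Dict[str, Dict[str, float]]:
    results: Dict[str, Dict[str, float]] = {}
    for day in days:
        events = load_day(day, read_log)  # the only read of this day
        # Every processor consumes the same in-memory events.
        results[day] = {name: fn(events) for name, fn in processors.items()}
        del events  # drop the reference so only one day is ever resident
    return results
```

With three processors and two days, `read_log` ends up with two entries, not six: the number of reads tracks the number of days, not the number of processors.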

What does “ingest → metabolize → discard” mean in practice?

The pipeline has three logically separate stages:

Ingestion. Download one day of raw tick data, validate it through quality gates (see How We Validate Market Data Before It Becomes a Signal), and normalize it into a canonical internal event schema — TradeEvent fields for trade data, BookEvent fields for order-book data. Processors never see the raw exchange format.
Metabolization. Every registered signal processor for that day’s market reads the same normalized event stream in a single pass. Each processor emits its derived outputs — z-scored metrics, flags, bucketed levels — to its own columnar Parquet partition. Crucially, no processor triggers a second read of the raw file.
Discard. Once all processors have checkpointed their outputs and the QA gate confirms the derived partitions are intact, the raw file for that day is deleted. It is not archived; it is removed.

The three stages are kept strictly separate: ingestion does not overlap with metabolization, and discard only happens after the derived outputs are verified.
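As a rough sketch of how the three stages can stay separate, the example below runs them as distinct functions and deletes the raw file only after verification passes. Everything here is an assumption for illustration: the `ts,price,size` raw format, the function names, the dict-based event shape, and JSON files standing in for Parquet partitions.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List


def ingest(raw_path: Path) -> List[dict]:
    """Parse raw 'ts,price,size' lines into canonical events; reject bad rows."""
    events = []
    for line in raw_path.read_text().splitlines():
        ts, price, size = line.split(",")
        event = {"ts": int(ts), "price": float(price), "size": float(size)}
        if event["price"] <= 0 or event["size"] <= 0:
            raise ValueError(f"quality gate failed: {line!r}")
        events.append(event)
    return events


def metabolize(
    events: List[dict],
    processors: Dict[str, Callable[[List[dict]], object]],
    out_dir: Path,
    day: str,
) -> List[Path]:
    """Run every processor over the same events; one output file per processor."""
    written = []
    for name, fn in processors.items():
        path = out_dir / f"{name}-{day}.json"
        path.write_text(json.dumps(fn(events)))
        written.append(path)
    return written


def verify(paths: List[Path]) -> bool:
    """Toy QA gate: every derived file must exist and parse."""
    try:
        for path in paths:
            json.loads(path.read_text())
    except (OSError, ValueError):
        return False
    return True


def process_day(
    raw_path: Path,
    out_dir: Path,
    processors: Dict[str, Callable[[List[dict]], object]],
    day: str,
) -> bool:
    events = ingest(raw_path)
    outputs = metabolize(events, processors, out_dir, day)
    if not verify(outputs):
        return False  # keep the raw file so the day can be retried
    raw_path.unlink()  # discard: removed, not archived
    return True
```

If ingestion raises or verification fails, `process_day` never reaches `unlink`, so the raw file survives until a later run succeeds.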

Why is discarding the raw data the right choice?

The decision to discard rests on two reinforcing reasons: storage cost and source economics. Storage cost is a genuine driver, since years of raw tick history grow without bound. It is not the only one, though: even if storage were free, the source is free too, so anything discarded can be re-derived for the price of a pipeline run.

Approach	Storage cost	Re-derive cost	Risk
Keep raw + re-derive on demand	High (full tick history)	Low	Storage grows indefinitely
Keep raw + keep derived	Very high	None	Doubles footprint; still grows
Discard raw, keep derived only	Minimal	Time (no money)	Must re-run pipeline to add a new signal

When tick sources are free, the “re-derive cost” in the discard-raw row is simply another pipeline run: CPU and bandwidth, not a purchase. The derived outputs, z-scored signals stored as compressed columnar Parquet, occupy a small fraction of the raw tick volume. The derived representation is also not reverse-engineerable into the original tape, which is a legal requirement: we sell a derived work, not a redistribution of market data.

A new signal added after the initial backfill requires a fresh pipeline pass over the same free source. This is a deliberate trade-off: the pipeline is built to be cheap to re-run, so the cost of that trade-off stays low.

What are the invariants that make this pipeline trustworthy?

A single-pass pipeline with discard is only reliable if three properties hold:

Idempotency. Reprocessing a given calendar day must produce identical (logically reproducible) output. If reprocessing day D once gives result R and reprocessing it again gives a different derived result R′, the pipeline has a correctness bug. Idempotency means any day can be safely reprocessed after a failure or a logic update — the old partition is overwritten cleanly, no duplicates accumulate.
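One common way to get clean overwrites is to key each partition by signal and day, write to a temporary file, and atomically swap it into place. The sketch below assumes that layout; the `signal=.../day=...` path scheme and the `write_partition` helper are hypothetical, and JSON stands in for Parquet.

```python
import json
import os
from pathlib import Path
from typing import List


def write_partition(out_dir: Path, signal: str, day: str, rows: List[dict]) -> Path:
    """Write one (signal, day) partition so that a re-run replaces it exactly."""
    final = out_dir / f"signal={signal}" / f"day={day}.json"
    final.parent.mkdir(parents=True, exist_ok=True)
    tmp = final.with_suffix(".tmp")
    # Deterministic serialization: same rows in, same bytes out.
    tmp.write_text(json.dumps(rows, sort_keys=True))
    # Atomic rename: readers see either the old partition or the new one.
    os.replace(tmp, final)
    return final
```

Because the path is a pure function of `(signal, day)`, reprocessing a day cannot append a second copy. It can only replace the first.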

Resumability. The pipeline maintains a manifest — a checkpointed record of which days have been processed and verified. An interruption mid-run (network drop, process crash) means the next run picks up at the last verified checkpoint rather than restarting from day one. Days are the atomic unit of work: a day is either fully processed and checkpointed, or it is not recorded at all.
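A manifest with day-level atomicity might look like the following sketch. The `Manifest` class, its JSON file format, and `run_backfill` are illustrative assumptions; the point is only that a day is recorded after its work completes, never before.

```python
import json
import os
from pathlib import Path
from typing import Callable, Iterable, List


class Manifest:
    """Checkpoint log of fully processed days, persisted as a JSON list."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.done = set(json.loads(path.read_text())) if path.exists() else set()

    def mark_done(self, day: str) -> None:
        self.done.add(day)
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(self.done)))
        os.replace(tmp, self.path)  # atomic swap: never a half-written manifest


def run_backfill(
    days: Iterable[str],
    manifest: Manifest,
    process_day: Callable[[str], None],
) -> List[str]:
    """Process only days missing from the manifest; return the days handled."""
    processed = []
    for day in days:
        if day in manifest.done:
            continue  # already verified in an earlier run
        process_day(day)  # if this raises, no checkpoint is written for the day
        manifest.mark_done(day)
        processed.append(day)
    return processed
```

If `process_day` crashes on some day, the next call to `run_backfill` with a freshly loaded `Manifest` skips the days before it and starts again at the interrupted one.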

One read, N processors. The raw event stream for a given day flows through all registered processors exactly once. Adding a new processor does not change the cost of processing existing days; it only affects re-runs needed to back-fill the new signal’s history.

These three invariants together mean the pipeline behaves predictably under failure, under extension, and under audit.

How does this connect to point-in-time correctness?

It is tempting to credit the single read for point-in-time correctness, but that would be wrong. Single-pass is an IO discipline: it bounds how many times the raw file is read, nothing more. A day’s events are loaded into memory as a DataFrame, which gives random access to every row of the day, including rows that are “in the future” relative to any given timestamp. So single-pass does not, by itself, prevent look-ahead.

Point-in-time correctness is a separate, deliberate property: every baseline is computed as a causal rolling window over past observations only, so a processor never uses data from after timestamp T. The two work together — single-pass keeps IO bounded, causal windows keep computation honest — but they are not the same thing.

These baselines supply the normalization: the σ denominator that converts a raw measurement into a z-score. The pipeline excludes recurring mechanical windows (such as perpetual funding settlements) from the rolling baseline, because those windows produce structurally elevated volume that would otherwise distort the normalization for every surrounding period. This is what point-in-time correctness means at the pipeline level: the derived output for timestamp T is a function only of data available no later than T.
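To make the causal-window idea concrete, here is a minimal sketch of a rolling z-score whose baseline uses only earlier observations and skips flagged samples. The function name, the fixed-length window, the use of population standard deviation, and the choice to still score excluded samples are all assumptions for illustration, not the pipeline's actual normalization.

```python
from collections import deque
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def causal_zscores(
    values: Sequence[float],
    excluded: Sequence[bool],
    window: int,
) -> List[Optional[float]]:
    """Z-score each value against a baseline built from strictly earlier values.

    Samples flagged in `excluded` (for example a funding-settlement window)
    are scored but never added to the baseline.
    """
    history: deque = deque(maxlen=window)
    scores: List[Optional[float]] = []
    for x, skip in zip(values, excluded):
        if len(history) >= 2:
            sigma = pstdev(history)
            scores.append((x - mean(history)) / sigma if sigma > 0 else None)
        else:
            scores.append(None)  # not enough past data for a baseline yet
        if not skip:
            history.append(x)  # update only after scoring: no look-ahead
    return scores
```

Because the baseline is updated after the score is emitted, changing any later value cannot alter an earlier score, even though the whole day sits in memory.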

For a deeper treatment of why look-ahead bias is the most common way backtests flatter themselves, see What Is Point-in-Time Correctness and Why Does It Prevent Look-Ahead Bias?.

What is the floor calibration step, and why does it come first?

Before the main backfill pass runs, the pipeline performs a short calibration phase over a small sample of recent data. The purpose is to determine a conservative global floor — the minimum signal magnitude below which events are not stored in the primitive feature store. This floor is set once per market and then locked for the entire backfill.

The floor serves two functions. First, it keeps the primitive store free of noise: events smaller than the floor carry little structural information and would inflate storage and downstream query costs without adding signal value. Second, the floor defines the lower bound for user-configurable signal thresholds (the “knobs”): a user cannot set a sensitivity below the floor, because events below the floor were never recorded.

The exact calibration procedure is proprietary. What matters architecturally is that the floor is a constant global parameter, not a rolling value — it does not change day by day during the backfill. The causal rolling baselines (the σ denominators) do evolve during the pass; the floor does not.
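The architectural role of the floor, as opposed to how it is calibrated, can be sketched as an immutable per-market constant. The names below are hypothetical, the floor value is supplied from outside because the calibration procedure is not public, and rejecting a below-floor knob with an error (rather than clamping it) is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MarketFloor:
    """Per-market floor: set once by calibration (not shown), then locked."""

    market: str
    value: float


def store_above_floor(magnitudes: Iterable[float], floor: MarketFloor) -> List[float]:
    """Keep only events whose magnitude reaches the floor.

    Treating magnitude == floor as stored is an assumption of this sketch.
    """
    return [m for m in magnitudes if abs(m) >= floor.value]


def resolve_threshold(requested: float, floor: MarketFloor) -> float:
    """Validate a user knob: it cannot go below what was ever recorded."""
    if requested < floor.value:
        raise ValueError(
            f"threshold {requested} is below the {floor.market} floor {floor.value}"
        )
    return requested
```

Freezing the dataclass mirrors the locking described above: any attempt to reassign `value` mid-backfill raises instead of silently changing the threshold.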

How does the backfill sequence work in practice?

The backfill has two phases that must not be confused:

Calibration phase (short, one-time). Run over a compact sample of recent data to lock the per-market floor. The raw data from this phase is also discarded.
Main pass (single-pass + discard). For each calendar day in the target range: download → validate (QA gate) → normalize to canonical events → run all processors in one pass → append derived partitions → checkpoint manifest → discard raw. Repeat.

The main pass works in day-sized units because a single day of trade data fits in memory. If a single day were too large, the unit would shrink to intra-day chunks — the architecture does not assume any particular file size.

The pipeline starts from a recent anchor date and extends backward. Adding historical depth later (deeper back-history for paid tiers) is a resumable extension of the same pass, not a fresh restart. Adding a new signal that needs the raw ticks requires a fresh pass over the desired date range only: because the raw was discarded, those days are re-downloaded from the free source and re-processed.
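The backward extension can be illustrated with a small day planner. This is a sketch under assumptions: `backfill_days` and its parameters are hypothetical, and `done` stands for whatever set of days the manifest already records.

```python
from datetime import date, timedelta
from typing import Iterator, Set


def backfill_days(anchor: date, depth_days: int, done: Set[date]) -> Iterator[date]:
    """Yield unprocessed days from the anchor date backward, newest first."""
    for offset in range(depth_days):
        day = anchor - timedelta(days=offset)
        if day not in done:
            yield day
```

Asking for more depth later reuses the same anchor: days already in `done` are skipped, so only the newly requested older days are yielded.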

What is stored, and what is permanently gone?

Layer	Kept	Discarded
Raw tick records (prices, sizes, timestamps)	Never	After metabolization
Normalized canonical events	Never (in-memory only)	After the day’s pass
Derived signal outputs (z-scores, flags, levels)	Yes — Parquet, per-signal, per-day	—
Primitive records (above-floor events, σ-normalized)	Yes — PrimitiveStore, per-market	—
Pipeline manifest (checkpoint log)	Yes	—

The derived outputs and primitive records are the only durable artifacts. Both are expressed in terms that cannot reconstruct the original tape: exact prices and sizes are gone. This is simultaneously the legal requirement for operating as a derived-data vendor and the design that keeps storage costs proportional to the number of signals rather than to the volume of raw history.

For why we keep derived outputs and discard raw data, see Why Sell the Measurement, Not the Alpha?.

FAQ

If the raw data is free, why not just keep it? Storing years of raw tick data introduces ongoing storage costs, maintenance overhead, and legal exposure as a potential redistribution of market data. Because the source is free to re-download, the only cost of discarding is the time to re-run the pipeline — which is acceptable. The derived outputs are what the product actually sells, so keeping the raw is waste by definition.

What happens if the pipeline crashes in the middle of processing a day? The manifest records only fully verified days. A partial run leaves no checkpoint entry for the interrupted day. On the next run, the pipeline re-downloads and reprocesses that day from scratch. Because the pipeline is idempotent, the result is identical to what a clean first run would have produced.

Can I add a new signal without re-running the entire history? Yes, with a constraint. A signal that needs the raw ticks requires a fresh pipeline pass over the desired date range: the source is re-downloaded and re-processed for those days only. A signal that can be built from the primitive feature store (events captured above the floor during the original pass) can be computed from the stored primitives without re-downloading, for the date range already in the store. The architecture separates these two cases explicitly.

What does “idempotent” mean for a pipeline that discards its inputs? It means reprocessing a given day always produces the same derived output, regardless of how many times it runs. The day’s raw data is re-fetched from the free source, run through the same normalization and the same processors, and the result overwrites the previous partition. Because the source data and the processing logic are both deterministic, the output is identical. See also How Do Reproducible Backtests and Permalink Hashes Work?.

Why is the floor calibration done separately from the main pass? The floor must be fixed before the main pass begins so that every day of the backfill applies the same threshold. If the floor were recalibrated during the pass, early days and late days would use different thresholds, making the primitive store internally inconsistent. A locked floor is what makes the primitive records comparable across the full history.

TickDistill sells clean, computed order-flow inputs — not trading advice or guaranteed alpha. Backtests are illustrative and not a promise of future results.
