By TickDistill — order-flow microstructure signals. Educational content, not financial advice.
Processing years of tick data cheaply comes down to one principle: read each day of raw data exactly once, run every signal processor over that single read, write the derived outputs, and immediately discard the raw data. Storage cost is a real reason to discard, but it is not the only one. Because our tick sources are free, re-deriving from scratch later costs time, not money, so even if storage were free there would be no reason to keep data you have already metabolized.
Single-pass ETL is a data-engineering pattern where each unit of raw input is read precisely once, transformed into all required outputs in that one pass, and then released. The alternative — one read per output — multiplies IO proportionally to the number of downstream consumers and keeps raw data resident indefinitely.
For high-frequency market data the practical difference is large. A day of aggregated-trade records for a single perpetual contract fits comfortably in memory as a Polars DataFrame. A full history across years does not. Single-pass processing keeps the working set bounded: load one day, emit all signals for that day, release the memory, move to the next.
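The load, emit, release loop can be sketched in a few lines. This is a minimal illustration, not the production pipeline: `run_single_pass`, `fake_load`, and the two toy processors are hypothetical names, events are plain dicts rather than a Polars DataFrame, and the read counter exists only to show that each day is read once no matter how many processors are registered.

```python
from datetime import date, timedelta
from typing import Callable, Iterable

# Hypothetical types: an event is a plain dict, a processor maps a day's events to one number.
Event = dict
Processor = Callable[[list[Event]], float]


def run_single_pass(
    days: Iterable[date],
    load_day: Callable[[date], list[Event]],
    processors: dict[str, Processor],
) -> dict[date, dict[str, float]]:
    """Read each day once, run every processor over that read, then drop the raw events."""
    derived: dict[date, dict[str, float]] = {}
    for day in days:
        events = load_day(day)  # the only read of this day's raw data
        derived[day] = {name: proc(events) for name, proc in processors.items()}
        del events  # release the raw events before the next day is loaded
    return derived


# Toy usage: an in-memory source that counts how often each day is read.
reads: dict[date, int] = {}


def fake_load(day: date) -> list[Event]:
    reads[day] = reads.get(day, 0) + 1
    return [{"price": 100.0, "size": 2.0}, {"price": 101.0, "size": 3.0}]


processors = {
    "volume": lambda evs: sum(e["size"] for e in evs),
    "trade_count": lambda evs: float(len(evs)),
}
days = [date(2024, 1, 1) + timedelta(days=i) for i in range(3)]
result = run_single_pass(days, fake_load, processors)
```

Registering a third processor would add one entry per day to `result` but leave every count in `reads` at 1, which is the IO property the pattern is meant to guarantee.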
The pipeline has three logically separate stages: ingestion, metabolization, and discard. Ingestion normalizes raw exchange records into canonical events, with TradeEvent fields for trade data and BookEvent fields for order-book data, so processors never see the raw exchange format.
The three stages are kept strictly separate: ingestion does not overlap with metabolization, and discard only happens after the derived outputs are verified.
The decision to discard is driven by two reinforcing reasons: source economics and the storage moat. Storage cost is a genuine driver — years of raw tick history grow without bound — but it is not the only one: even if storage were free, the source being free means re-deriving is cheap.
| Approach | Storage cost | Re-derive cost | Risk |
|---|---|---|---|
| Keep raw + re-derive on demand | High (full tick history) | Low | Storage grows indefinitely |
| Keep raw + keep derived | Very high | None | Doubles footprint; still grows |
| Discard raw, keep derived only | Minimal | Time (no money) | Must re-run pipeline to add a new signal |
When tick sources are free, the “Re-derive cost” entry in the discard-raw row is simply another pipeline run: CPU and bandwidth, not a purchase. The derived outputs, z-scored signals stored as compressed columnar Parquet, occupy a small fraction of the raw tick volume. The derived representation also cannot be reverse-engineered into the original tape, which is a legal requirement: we sell a derived work, not a redistribution of market data.
A new signal added after the initial backfill requires a fresh pipeline pass over the same free source. This is a deliberate trade-off: the pipeline is built to be cheap to re-run, so the cost of that trade-off stays low.
A single-pass pipeline with discard is only reliable if three properties hold:
Idempotency. Reprocessing a given calendar day must produce identical (logically reproducible) output. If reprocessing day D once gives result R and reprocessing it again gives a different derived result R′, the pipeline has a correctness bug. Idempotency means any day can be safely reprocessed after a failure or a logic update — the old partition is overwritten cleanly, no duplicates accumulate.
Resumability. The pipeline maintains a manifest — a checkpointed record of which days have been processed and verified. An interruption mid-run (network drop, process crash) means the next run picks up at the last verified checkpoint rather than restarting from day one. Days are the atomic unit of work: a day is either fully processed and checkpointed, or it is not recorded at all.
One read, N processors. The raw event stream for a given day flows through all registered processors exactly once. Adding a new processor does not change the cost of processing existing days; it only affects re-runs needed to back-fill the new signal’s history.
These three invariants together mean the pipeline behaves predictably under failure, under extension, and under audit.
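The idempotency and resumability invariants can be illustrated together. The sketch below makes several assumptions that are not in the original: partitions are written as JSON rather than Parquet (to stay within the standard library), the manifest is a JSON list of day strings, and `atomic_write_json`, `load_manifest`, `run`, and `derive` are hypothetical names. The one technique it leans on is writing to a temporary file and renaming it over the target, so a rerun replaces a partition instead of appending to it.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, payload) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(payload, fh, sort_keys=True)
    os.replace(tmp_name, path)  # replaces any earlier partition in one step


def load_manifest(path: Path) -> set[str]:
    return set(json.loads(path.read_text())) if path.exists() else set()


def run(root: Path, days: list[str], derive) -> list[str]:
    """Process unfinished days; a day enters the manifest only after its partition is written."""
    manifest_path = root / "manifest.json"
    done = load_manifest(manifest_path)
    processed = []
    for day in days:
        if day in done:
            continue  # resumability: checkpointed days are skipped
        rows = derive(day)  # may raise; in that case nothing is recorded for this day
        atomic_write_json(root / "signals" / f"{day}.json", rows)  # idempotent overwrite
        done.add(day)
        atomic_write_json(manifest_path, sorted(done))
        processed.append(day)
    return processed


# Toy run: the second day fails once, then the run is resumed.
root = Path(tempfile.mkdtemp())
days = ["2024-01-03", "2024-01-02", "2024-01-01"]
failing = {"2024-01-02"}


def derive(day: str) -> list[dict]:
    if day in failing:
        raise RuntimeError("simulated crash")
    return [{"day": day, "z": 1.5}]


try:
    run(root, days, derive)
except RuntimeError:
    pass  # the interrupted day has no manifest entry
failing.clear()
resumed = run(root, days, derive)
```

After the simulated crash, the second call to `run` skips the day that was already checkpointed and processes only the remaining two. Writing the same day again yields a byte-identical file because the toy `derive` is deterministic and keys are sorted on output.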
It is tempting to credit the single read for preventing look-ahead bias, but that would be wrong. Single-pass is an IO discipline: it bounds how many times the raw file is read, nothing more. A day’s events are loaded into memory as a DataFrame, which gives random access to every row of the day, including rows that are “in the future” relative to any given timestamp. So single-pass does not, by itself, prevent look-ahead.
Point-in-time correctness is a separate, deliberate property: every baseline is computed as a causal rolling window over past observations only, so a processor never uses data from after timestamp T. The two work together — single-pass keeps IO bounded, causal windows keep computation honest — but they are not the same thing.
Every baseline normalization — the σ denominator that converts a raw measurement into a z-score — is computed as a causal rolling window over past observations only. The pipeline excludes recurring mechanical windows (such as perpetual funding settlements) from the rolling baseline, because those windows produce structurally elevated volume that would otherwise distort the normalization for every surrounding period. This is what point-in-time correctness means at the pipeline level: the derived output for timestamp T is a function only of data available before T.
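A causal baseline with excluded windows can be sketched as follows. This is an illustrative simplification, not the pipeline's actual normalization: `causal_zscores`, the fixed-length window, the `min_obs` warm-up, and the use of a population standard deviation are all assumptions. It also assumes that an excluded observation still receives a z-score and is merely kept out of the baseline; the original does not say how such observations are scored.

```python
import math
from collections import deque


def causal_zscores(values, excluded, window, min_obs=3):
    """z-score each value against a baseline of strictly earlier, non-excluded values."""
    baseline = deque(maxlen=window)  # holds only past observations
    out = []
    for x, skip in zip(values, excluded):
        if len(baseline) >= min_obs:
            mean = sum(baseline) / len(baseline)
            var = sum((b - mean) ** 2 for b in baseline) / len(baseline)
            sigma = math.sqrt(var)
            out.append((x - mean) / sigma if sigma > 0 else None)
        else:
            out.append(None)  # not enough history yet
        if not skip:
            baseline.append(x)  # x joins the baseline only after it has been scored
    return out


values = [1.0, 2.0, 3.0, 100.0, 2.0, 4.0]
funding = [False, False, False, True, False, False]  # hypothetical mechanical window
z = causal_zscores(values, funding, window=3)

# Changing the last value must not alter any earlier output.
z_changed = causal_zscores(values[:-1] + [999.0], funding, window=3)

# Without the exclusion, the spike at index 3 contaminates the baseline at index 4.
z_no_exclusion = causal_zscores(values, [False] * len(values), window=3)
```

Because each value is appended to the baseline only after its own z-score is emitted, the output at index `i` depends on indices `0..i-1` alone. Comparing `z` with `z_no_exclusion` at index 4 shows how a single excluded spike would otherwise shift the normalization of the period that follows it.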
For a deeper treatment of why look-ahead bias is the most common way backtests flatter themselves, see What Is Point-in-Time Correctness and Why Does It Prevent Look-Ahead Bias?.
Before the main backfill pass runs, the pipeline performs a short calibration phase over a small sample of recent data. The purpose is to determine a conservative global floor — the minimum signal magnitude below which events are not stored in the primitive feature store. This floor is set once per market and then locked for the entire backfill.
The floor serves two functions. First, it keeps the primitive store free of noise: events smaller than the floor carry little structural information and would inflate storage and downstream query costs without adding signal value. Second, the floor defines the lower bound for user-configurable signal thresholds (the “knobs”): a user cannot set a sensitivity below the floor, because events below the floor were never recorded.
The exact calibration procedure is proprietary. What matters architecturally is that the floor is a constant global parameter, not a rolling value — it does not change day by day during the backfill. The causal rolling baselines (the σ denominators) do evolve during the pass; the floor does not.
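The architectural role of the floor, as opposed to its calibration, can be shown with a small sketch. Everything here is hypothetical: `MarketFloor`, `store_above_floor`, `effective_threshold`, the market name, and the floor value of 2.0 are made up for illustration, and clamping a too-low user threshold up to the floor is one possible policy (rejecting it would be another).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketFloor:
    """A per-market floor; frozen so it cannot be reassigned during the backfill."""
    market: str
    floor: float


def store_above_floor(magnitudes, cfg):
    """Keep only events whose magnitude reaches the floor; the rest are never stored."""
    return [m for m in magnitudes if abs(m) >= cfg.floor]


def effective_threshold(user_threshold, cfg):
    """A user knob cannot go below the floor, because nothing below it was recorded."""
    return max(user_threshold, cfg.floor)


cfg = MarketFloor(market="EXAMPLE-PERP", floor=2.0)  # hypothetical market and value
kept = store_above_floor([0.5, -2.5, 2.0, 3.1], cfg)
low_knob = effective_threshold(1.0, cfg)
high_knob = effective_threshold(3.0, cfg)
```

The same `cfg` object is applied to every day of the pass, which is what keeps stored primitives comparable across the full history.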
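The backfill has two phases that must not be confused: the calibration phase, which fixes the floor once per market, and the main pass, which processes the history in day-sized units.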
The main pass works in day-sized units because a single day of trade data fits in memory. If a single day were too large, the unit would shrink to intra-day chunks — the architecture does not assume any particular file size.
The pipeline starts from a recent anchor date and extends backward. Adding historical depth later (deeper back-history for paid tiers) is a resumable extension of the same pass, not a fresh restart. Adding a new signal requires re-running the pipeline only for the desired date range, re-fetching those days from the free source, rather than reprocessing the entire history.
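Extending backward from an anchor date reduces to a small scheduling function. The sketch below is an assumption about how such a schedule could be computed; `pending_days` and its parameters are hypothetical names, and the set of finished days stands in for whatever the manifest records.

```python
from datetime import date, timedelta


def pending_days(anchor: date, depth_days: int, done: set[date]) -> list[date]:
    """Days from the anchor backward, newest first, skipping those already checkpointed."""
    wanted = (anchor - timedelta(days=i) for i in range(depth_days))
    return [d for d in wanted if d not in done]


anchor = date(2024, 3, 10)
first_run = pending_days(anchor, 3, done=set())
deeper_run = pending_days(anchor, 5, done=set(first_run))
```

Asking for more depth with the same anchor schedules only the older days that are still missing, so deepening the history resumes the existing pass rather than restarting it.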
| Layer | Kept | Discarded |
|---|---|---|
| Raw tick records (prices, sizes, timestamps) | Never | After metabolization |
| Normalized canonical events | Never (in-memory only) | After the day’s pass |
| Derived signal outputs (z-scores, flags, levels) | Yes — Parquet, per-signal, per-day | — |
| Primitive records (above-floor events, σ-normalized) | Yes — PrimitiveStore, per-market | — |
| Pipeline manifest (checkpoint log) | Yes | — |
The derived outputs and primitive records are the only durable artifacts. Both are expressed in terms that cannot reconstruct the original tape: exact prices and sizes are gone. This is simultaneously the legal requirement for operating as a derived-data vendor and the design that keeps storage costs proportional to the number of signals rather than to the volume of raw history.
For why we keep derived outputs and discard raw data, see Why Sell the Measurement, Not the Alpha?.
If the raw data is free, why not just keep it? Storing years of raw tick data introduces ongoing storage costs, maintenance overhead, and legal exposure as a potential redistribution of market data. Because the source is free to re-download, the only cost of discarding is the time to re-run the pipeline — which is acceptable. The derived outputs are what the product actually sells, so keeping the raw is waste by definition.
What happens if the pipeline crashes in the middle of processing a day? The manifest records only fully verified days. A partial run leaves no checkpoint entry for the interrupted day. On the next run, the pipeline re-downloads and reprocesses that day from scratch. Because the pipeline is idempotent, the result is identical to what a clean first run would have produced.
Can I add a new signal without re-running the entire history? Yes, with a constraint. Signals that depend only on free source data require a fresh pipeline pass over the desired date range — the source is re-downloaded and re-processed. Signals that depend on the primitive feature store (events captured above the floor during the original pass) can be computed from the stored primitives without re-downloading, for the date range already in the store. The architecture separates these two cases explicitly.
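The two cases can be expressed as a small planning step. This is a sketch under assumptions: `plan_new_signal` and its arguments are hypothetical, and sending primitive-based signals to a re-fetch for days outside the store is an inferred fallback, not something the answer above states.

```python
def plan_new_signal(needs_raw, wanted_days, days_in_store):
    """Split the wanted days into those served from stored primitives and those needing a re-fetch."""
    if needs_raw:
        # Signals built on raw source data always need a fresh pass over the range.
        return {"from_store": [], "refetch": list(wanted_days)}
    from_store = [d for d in wanted_days if d in days_in_store]
    refetch = [d for d in wanted_days if d not in days_in_store]  # assumed fallback
    return {"from_store": from_store, "refetch": refetch}


wanted = ["2024-01-01", "2024-01-02", "2024-01-03"]
stored = {"2024-01-02", "2024-01-03"}
raw_plan = plan_new_signal(True, wanted, stored)
primitive_plan = plan_new_signal(False, wanted, stored)
```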
What does “idempotent” mean for a pipeline that discards its inputs? It means reprocessing a given day always produces the same derived output, regardless of how many times it runs. The day’s raw data is re-fetched from the free source, run through the same normalized pipeline, and the result overwrites the previous partition. Because the source data and the processing logic are both deterministic, the output is identical. See also How Do Reproducible Backtests and Permalink Hashes Work?.
Why is the floor calibration done separately from the main pass? The floor must be fixed before the main pass begins so that every day of the backfill applies the same threshold. If the floor were recalibrated during the pass, early days and late days would use different thresholds, making the primitive store internally inconsistent. A locked floor is what makes the primitive records comparable across the full history.
TickDistill sells clean, computed order-flow inputs — not trading advice or guaranteed alpha. Backtests are illustrative and not a promise of future results.