tickdistill-learn

What Is Point-in-Time Correctness? Why No-Look-Ahead Makes or Breaks a Backtest

By TickDistill — order-flow microstructure signals. Educational content, not financial advice.

The short answer

Point-in-time correctness is the guarantee that every computation at time t uses only data from strictly before t. Violating this constraint — accidentally or structurally — is called look-ahead bias, and it is the most common way order-flow research produces backtest results that cannot be reproduced in live trading. TickDistill treats point-in-time correctness as a hard engineering invariant: every baseline, every normalization, every mask is causal by construction, and every backtest result can be independently reproduced from the same inputs.

What does “point-in-time correct” mean, exactly?

Point-in-time correctness means that the value of any signal emitted at timestamp t is a deterministic function of the measurement being scored at t and of reference data with timestamps t' < t only. No observation from t' ≥ t enters the statistics that measurement is compared against: not the current bucket, not a future bucket, not in the normalization denominator, not in the exclusion mask calibration.

The strict inequality matters. Including the current observation (t' ≤ t instead of t' < t) is still a form of contamination: the baseline that normalizes a measurement must not include that same measurement, or the z-score becomes self-referential.

Why look-ahead bias is so easy to introduce by accident

Look-ahead bias does not require deliberate cheating. It emerges from common implementation shortcuts:

Source	How it happens
In-sample normalization	Computing the mean/std over the entire history, then using it to normalize each historical point
Rolling window off-by-one	The pandas default for `Series.rolling(N).mean()` includes the current row in the window; without a `.shift(1)`, each point helps normalize itself
Global volatility estimate	Using the full-period σ as the denominator for a z-score computed at each past point
Classifier training	Training a trade-side classifier on the same period you backtest the signal
Mask calibration	Identifying “noisy” windows after the fact and masking them retroactively

Each of these makes a past computation depend on future information. The backtest looks cleaner than it is; live performance does not benefit from knowledge of the future.
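The rolling-window off-by-one is the easiest of these to see with numbers. The sketch below is a minimal pure-Python illustration, not TickDistill's implementation; the function names, the toy series, and the window length are all assumptions made for the example. It scores the same spike twice: once against a window that contains the spike, once against a window that ends strictly before it.

```python
from statistics import mean, pstdev

def zscore_inclusive(xs, t, n):
    """Leaky: the window xs[t-n+1 .. t] contains x_t itself."""
    window = xs[t - n + 1 : t + 1]
    return (xs[t] - mean(window)) / pstdev(window)

def zscore_causal(xs, t, n):
    """Causal: the window xs[t-n .. t-1] ends strictly before t."""
    window = xs[t - n : t]
    return (xs[t] - mean(window)) / pstdev(window)

# Toy series (hypothetical values); the last observation is a spike.
xs = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 30.0]
t, n = len(xs) - 1, 5

print(round(zscore_inclusive(xs, t, n), 2))  # approx. 1.98
print(round(zscore_causal(xs, t, n), 2))     # approx. 15.23
```

With the spike inside its own window, the mean and standard deviation both move toward it, so the z-score stays below 2 no matter how large the spike is (for a population standard deviation over 5 points, the bound is sqrt(5 − 1) = 2). The causal window reports the spike at its real size relative to what came before.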

The causal baseline: `t' < t` strictly

A causal baseline is a rolling statistic — mean, standard deviation, or exponentially weighted equivalent — computed at each point using only the observations that were available at that point in history.

The public z-score formula is:

z_t = ( x_t − μ_t ) / σ_t

where μ_t and σ_t are estimated from { x_{t'} : t' < t } exclusively. This is standard practice for normalizing order-flow quantities against a causal baseline (Easley, López de Prado, O’Hara 2012).

Two choices of baseline are common in practice:

Baseline type	Formula	Property
Rolling window of N observations	`μ_t = (1/N) Σ_{i=t-N}^{t-1} x_i`	Equal weight, sharp cutoff
Exponentially weighted (EWM)	`μ_t = (1−λ) Σ_{k=0}^{∞} λ^k x_{t-1-k}`	Smooth decay, infinite memory

The decay parameter λ corresponds to a half-life h via λ = exp(−ln2/h). A longer half-life makes the baseline more stable across regime changes; a shorter half-life makes it more adaptive. The calibration of this parameter is proprietary — what matters for correctness is that whichever estimator is used, it uses t' < t only. TickDistill uses a causal EWM baseline, and the current observation never enters the estimate that normalizes it.
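One way to make the t' < t rule hard to break is to separate scoring from updating. The following is a minimal sketch under stated assumptions: the class name `CausalEwmBaseline`, its methods, and the choice to seed the mean with the first observation are all hypothetical, and the half-life used is arbitrary rather than a calibrated value. The variance recursion is the standard incremental exponentially weighted form.

```python
import math

class CausalEwmBaseline:
    """EWM mean/variance whose state reflects only already-scored observations."""

    def __init__(self, half_life):
        # lambda = exp(-ln2 / h), so lambda ** h == 0.5
        self.lam = math.exp(-math.log(2.0) / half_life)
        self.mean = None   # no observations yet
        self.var = 0.0

    def score(self, x):
        """z-score of x against the baseline built from strictly earlier points."""
        if self.mean is None or self.var <= 0.0:
            return None  # no usable baseline yet
        return (x - self.mean) / math.sqrt(self.var)

    def update(self, x):
        """Fold x into the baseline. Call only after x has been scored."""
        if self.mean is None:
            self.mean = x  # assumption: seed the mean with the first observation
            return
        delta = x - self.mean
        self.mean += (1.0 - self.lam) * delta
        self.var = self.lam * (self.var + (1.0 - self.lam) * delta * delta)

# Usage: score first, then update, so x_t never normalizes itself.
stream = [5.0, 6.0, 4.0, 5.0, 9.0]  # hypothetical values
baseline = CausalEwmBaseline(half_life=3)
for x in stream:
    z = baseline.score(x)
    baseline.update(x)
```

Because `score` never mutates state, the only way the current observation can reach the baseline is through an explicit `update` call after it has been scored.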

Mechanical windows: why some events must be excluded from the baseline

Even a perfectly causal baseline can be distorted by recurring mechanical events — moments when volume or imbalance is large for structural reasons rather than informational ones.

A clear example is the perpetual futures funding settlement at 00:00, 08:00, and 16:00 UTC, the public schedule used by Binance and most major venues. At these moments, the funding payment triggers predictable positioning activity that is unrelated to informed order flow. If funding spikes enter the baseline, they inflate the baseline σ, which then suppresses the z-scores of genuine order-flow events in surrounding windows.

The solution is an exclusion mask: data within a mechanical window is excluded from updating the baseline. The mask is applied causally — it defines which observations are allowed to enter the rolling statistic. Observations inside the mask are not deleted; the signal may still be computed over them, but the baseline parameters are not updated from them.

μ_t = EWM over { x_{t'} : t' < t  AND  t' ∉ mask }
σ_t = EWM-std over the same filtered set
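The filtered update can be sketched in a few lines. This is an illustration only: the funding hours come from the public schedule mentioned above, but the 5-minute half-width, the function names, and the decay value are hypothetical and are not the production mask. Only the mean is shown; a σ estimate would skip the same observations.

```python
from datetime import datetime, timedelta, timezone

FUNDING_HOURS_UTC = (0, 8, 16)       # public schedule cited in the text
HALF_WIDTH = timedelta(minutes=5)    # hypothetical width, for illustration only

def in_funding_window(ts):
    """True if the UTC timestamp ts lies within HALF_WIDTH of a settlement."""
    day = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    # Hour 24 is the next day's 00:00, so 23:58 is matched as well.
    settlements = [day + timedelta(hours=h) for h in (*FUNDING_HOURS_UTC, 24)]
    return any(abs(ts - s) <= HALF_WIDTH for s in settlements)

def masked_ewm_mean(observations, lam):
    """Yield (ts, x, mu), where mu uses only earlier, unmasked observations."""
    mu = None
    for ts, x in observations:
        yield ts, x, mu                  # baseline as it stood before x arrived
        if not in_funding_window(ts):    # masked points never update the baseline
            mu = x if mu is None else lam * mu + (1.0 - lam) * x

utc = timezone.utc
obs = [
    (datetime(2024, 1, 1, 7, 0, tzinfo=utc), 1.0),
    (datetime(2024, 1, 1, 8, 0, tzinfo=utc), 100.0),  # funding spike, masked
    (datetime(2024, 1, 1, 9, 0, tzinfo=utc), 1.0),
]
for ts, x, mu in masked_ewm_mean(obs, lam=0.9):
    print(ts.time(), x, mu)
```

The spike at 08:00 is still yielded, so a signal can be computed over it, but the baseline seen at 09:00 is unchanged from the one seen at 08:00.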

Which windows to mask, and at what granularity, is a calibration decision that depends on the instrument, the signal, and the empirical effect of the mechanical event on the signal’s distribution. The general principle — exclude mechanical events from the normalization baseline — is textbook practice; the specific calendar is proprietary.

Warm-up periods: when a causal baseline is not yet reliable

A rolling or exponentially weighted estimator requires a minimum number of observations before its estimates are stable. Emitting z-scores before the warm-up completes produces values with high estimation error, which corrupt any downstream comparison.

TickDistill enforces two distinct warm-up criteria before emitting any signal value:

Signal window warm-up. A signal that is itself a rolling statistic (e.g., VPIN, a moving imbalance) requires its own window to be filled before it produces a meaningful value.
Baseline warm-up. The causal baseline (μ_t, σ_t) requires a sufficient number of non-masked observations before its estimates stabilize. For an EWM baseline with half-life h, stability is reached after approximately 5h observations, the point at which the weight of the initialization has decayed to 2^−5, roughly 3%.

No signal point is emitted until both criteria are satisfied. Skipping warm-up fails in much the same way as look-ahead: the estimator behaves as if it has more historical information than it does.
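The two warm-up criteria reduce to a simple gate. The sketch below assumes hypothetical function and parameter names, and treats the multiple of five half-lives as a parameter rather than a fixed rule; it also shows where the 3% figure comes from, since the weight left on the initial state after n updates is λ^n = 2^(−n/h).

```python
def initialization_weight(n_obs, half_life):
    """Weight still carried by the initial state after n_obs updates: 2^(-n/h)."""
    return 0.5 ** (n_obs / half_life)

def is_warm(n_signal_obs, signal_window, n_baseline_obs, half_life, multiple=5):
    """True only when both the signal window and the baseline are warmed up.

    n_baseline_obs should count non-masked observations only.
    """
    signal_ready = n_signal_obs >= signal_window
    baseline_ready = n_baseline_obs >= multiple * half_life
    return signal_ready and baseline_ready

print(initialization_weight(100, 20))   # 0.03125, i.e. 2 ** -5
print(is_warm(50, 50, 99, 20))          # False: baseline one observation short
print(is_warm(50, 50, 100, 20))         # True
```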

Anti-look-ahead: the test that verifies the guarantee (Test 5)

The claim of point-in-time correctness is verifiable. The test is direct: compute signal values over a stream, then modify trades at timestamps t' > t, and confirm that the signal value at t is identical.

Formally, for any t and any perturbation of { x_{t'} : t' > t }:

signal(t | history up to t)  =  signal(t | history up to t, perturbed future)

If this equality fails, the computation has a look-ahead dependency. This test is mandatory in TickDistill’s test suite and covers every path: the signal window, the baseline estimator, the mask exclusion, and the price-change estimator σ_dP used by bulk volume classification (BVC), which uses its own causal window over past price differences between sub-bars, never the full sample.
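A perturbation test of this kind fits in a few lines. The following is a generic sketch, not the actual Test 5: the helper `has_lookahead`, the two toy signal functions, and the random inputs are all assumptions chosen to show one signal that passes and one that fails.

```python
import random
from statistics import mean, pstdev

def causal_z(xs, t):
    """z-score of xs[t] against all strictly earlier observations."""
    past = xs[:t]
    return (xs[t] - mean(past)) / pstdev(past)

def full_sample_z(xs, t):
    """Leaky: mean and std are taken over the whole series, future included."""
    return (xs[t] - mean(xs)) / pstdev(xs)

def has_lookahead(signal_fn, xs, t, trials=20, seed=0):
    """Perturb every observation after t; report whether signal(t) changes."""
    rng = random.Random(seed)
    reference = signal_fn(xs, t)
    for _ in range(trials):
        # Keep xs[0..t] intact, replace everything after t with noise.
        perturbed = xs[: t + 1] + [rng.gauss(0.0, 10.0) for _ in xs[t + 1 :]]
        if signal_fn(perturbed, t) != reference:
            return True
    return False

rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]

print(has_lookahead(causal_z, xs, t=50))       # False
print(has_lookahead(full_sample_z, xs, t=50))  # True
```

The comparison uses exact equality on purpose: a causal computation sees bit-identical inputs in both runs, so any difference at all, however small, indicates a dependency on the future.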

Reproducibility: why point-in-time correctness enables version-pinned backtests

Point-in-time correctness is a prerequisite for reproducibility. A backtest result from a point-in-time-correct pipeline is a deterministic function of four inputs: (signal, params, range, version) — because each signal is itself a pure parametric function f(primitive, params). Given the same four inputs, the same output must emerge, regardless of when the query runs.

This enables two capabilities:

Permalink/content-hash. Every backtest result can be identified by a hash of its inputs. The result is shareable and reproducible indefinitely.
Version pinning. When a signal formula is updated (v1 → v2), backtest queries pinned to v1 continue to reproduce the v1 result exactly. Code and data definitions are frozen together.

Neither capability is possible if the computation is contaminated by look-ahead, because future data would make the output depend on when the query runs, not only on the declared inputs. See also What makes a backtest reproducible? Permalinks and version pinning.
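A content hash over the four declared inputs can be sketched as follows. The function name, the field names, and the choice of SHA-256 over canonical JSON are assumptions for illustration; the text above does not specify how TickDistill builds its permalinks.

```python
import hashlib
import json

def backtest_key(signal, params, date_range, version):
    """Deterministic identifier for a (signal, params, range, version) query."""
    payload = {
        "signal": signal,
        "params": params,
        "range": list(date_range),
        "version": version,
    }
    # sort_keys and fixed separators give one canonical string per input set.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical signal name and parameters.
k1 = backtest_key("vpin", {"buckets": 50, "half_life": 20},
                  ("2024-01-01", "2024-01-31"), "v1")
k2 = backtest_key("vpin", {"half_life": 20, "buckets": 50},
                  ("2024-01-01", "2024-01-31"), "v1")
k3 = backtest_key("vpin", {"buckets": 50, "half_life": 20},
                  ("2024-01-01", "2024-01-31"), "v2")

print(k1 == k2)  # True: parameter order does not matter
print(k1 == k3)  # False: a different version is a different result
```

A key like this only identifies a result if the computation behind it is a pure function of the same four inputs, which is exactly what point-in-time correctness provides.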

How this connects to sigma-normalization and signal quality

Sigma-normalization — expressing a signal in units of standard deviations from its own rolling baseline — is only honest if the baseline is causal. An in-sample standard deviation is not a yardstick; it is a measurement taken with a ruler that was calibrated using the answer.

The practical consequence is that live signals and historically backfilled signals use the same code path: the causal baseline estimator, the same mask, the same warm-up logic. There is no separate “backtest mode” that uses full-sample statistics. The backtest is the same computation run over historical data. See Why Order-Flow Signals Should Be Measured in Standard Deviations.

How the pipeline enforces these guarantees

Three architectural properties enforce point-in-time correctness end-to-end:

Single-pass streaming. Each day of data is processed in order, one observation at a time. The state at time t is built from the stream up to t; no random access to future records is possible. See Single-pass streaming ETL and discard.
Immutable daily partitions. Processed outputs are stored as daily Parquet partitions that are never modified in place. Reprocessing a day replaces that day’s partition as a whole and produces the identical result (idempotence). This is verified by the QA gate. See How We Validate Market Data Before It Becomes a Signal.
Causal baseline module. The baseline estimator is a shared module used by every signal processor. It carries its own state forward per stream and enforces the t' < t constraint at the interface level, so individual signal implementations cannot accidentally access current or future baseline values.

FAQ

What is look-ahead bias, in one sentence? Look-ahead bias is the use of information from time t' ≥ t when computing a signal value for time t, causing backtest results to reflect knowledge the strategy could not have had.

Why does an in-sample standard deviation cause look-ahead bias? An in-sample standard deviation is computed over the entire historical period. Using it to normalize a point in the middle of that period means the denominator includes observations that occurred after that point — information the model would not have had in real time.

What is an exclusion mask and why does it not create look-ahead bias? An exclusion mask is a set of timestamp intervals whose observations are not allowed to update the rolling baseline. The mask must itself be defined causally — based on a public, fixed event schedule (like exchange funding times), not identified from the data after the fact. A mask derived by examining data is a form of look-ahead; a mask derived from a published schedule is not.

Does warm-up affect live trading or only backtests? Both. In a live deployment, a signal cannot emit values until its baseline has accumulated the required number of non-masked observations. In a historical backtest, the same warm-up logic applies: signal points are absent from the first segment of history until both the signal window and the baseline window are filled.

How can I verify that a signal is point-in-time correct? The direct test: compute signal values on a stream, modify observations after a target time t, and confirm the signal value at t is unchanged. Any dependency on future data will cause the value to change. This test must cover the signal formula, the baseline estimator, the mask, and any classifier sub-component that has its own rolling estimate.

TickDistill sells clean, computed order-flow inputs — not trading advice or guaranteed alpha. Backtests are illustrative and not a promise of future results.
