Question. Both v1 (deployed Oracle) and M5 (v2 candidate) reject Berkowitz on the OOS 2023+ panel, but for different reasons. Where does each rejection live? If a partition can be localized, that partition is a v3-forecaster lead and the disclosure can move from "per-anchor only" to "per-anchor uniformly except in partition X."
For each methodology, the joint Berkowitz LR is decomposed into the marginal contribution of three nested restrictions: mean=0, var=1, AR(1)=0. The share columns indicate which restriction is doing the rejecting.
| methodology | n | lr_full | p_full | rho_hat | mean_z | var_z | lr_mean_only | lr_var_only | lr_ar1_only | share_mean | share_var | share_ar1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| v1_deployed_oracle | 1730 | 37.601 | 0.000 | -0.043 | 0.074 | 0.839 | 9.354 | 23.347 | 1.881 | 0.270 | 0.675 | 0.054 |
| m5_v2_candidate | 1730 | 173.088 | 0.000 | 0.308 | 0.018 | 0.990 | 0.569 | 0.086 | 162.728 | 0.003 | 0.001 | 0.996 |
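
The decomposition above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes each marginal LR frees one parameter at a time from the joint null (mean=0, var=1, AR(1)=0), uses the conditional Gaussian likelihood on lag-1 pairs, and normalises shares by the sum of the three marginal LRs. That reading is roughly consistent with the table (for v1, n · mean_z² ≈ 1730 × 0.074² ≈ 9.5 against a reported lr_mean_only of 9.354), but the source does not confirm it. The function names and the `eps` clipping are hypothetical.

```python
import math
import random
from statistics import NormalDist


def chi2_sf_3df(x):
    """Survival function of a chi-square with 3 degrees of freedom (closed form)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def berkowitz_decomposition(pit, eps=1e-6):
    """Joint Berkowitz LR and one-parameter-at-a-time LRs on lag-1 pairs in row order."""
    inv = NormalDist().inv_cdf
    z = [inv(min(max(u, eps), 1 - eps)) for u in pit]  # probit transform, clipped
    x, y = z[:-1], z[1:]                               # (z_{t-1}, z_t) pairs
    n = len(y)

    # Unrestricted fit: z_t = c + rho * z_{t-1} + e_t, with e_t ~ N(0, sigma^2)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    rho = sxy / sxx
    c = my - rho * mx
    rss = sum((b - c - rho * a) ** 2 for a, b in zip(x, y))

    ss0 = sum(b * b for b in y)  # residual sum of squares under the joint null
    s2 = ss0 / n                 # second moment about zero

    lr_full = ss0 - n - n * math.log(rss / n)  # all three parameters freed
    lr_mean = n * my ** 2                      # only the mean freed
    lr_var = n * (s2 - 1 - math.log(s2))       # only the variance freed
    lr_ar1 = sum(a * b for a, b in zip(x, y)) ** 2 / sum(a * a for a in x)  # only rho freed
    total = lr_mean + lr_var + lr_ar1
    return {
        "n": n, "rho_hat": rho, "lr_full": lr_full, "p_full": chi2_sf_3df(lr_full),
        "lr_mean_only": lr_mean, "lr_var_only": lr_var, "lr_ar1_only": lr_ar1,
        "share_mean": lr_mean / total, "share_var": lr_var / total,
        "share_ar1": lr_ar1 / total,
    }


# Synthetic check: PITs whose probits follow an AR(1) with unit marginal variance,
# so the AR(1) share should dominate, as it does for M5 in the table.
rng = random.Random(0)
z_t, pits = 0.0, []
for _ in range(2000):
    z_t = 0.5 * z_t + math.sqrt(1 - 0.5 ** 2) * rng.gauss(0, 1)
    pits.append(NormalDist().cdf(z_t))
result = berkowitz_decomposition(pits)
```

Because each one-parameter model is nested inside the unrestricted model, every marginal LR is bounded above by `lr_full`, but the three need not sum to it. The same gap is visible in the table (34.58 against 37.60 for v1).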
Berkowitz's joint LR uses lag-1 pairs in panel-row order. The lag-1 alternative captures different structure depending on how the panel is sorted:
- `cross_sectional_within_weekend`: pairs are (symbol_i, symbol_{i+1}) on the same Friday. Captures common-mode residual that the methodology's factor adjustment didn't fully partial out.
- `temporal_within_symbol`: pairs are (fri_ts_t, fri_ts_{t+1}) for the same symbol. Captures persistent per-symbol mis-calibration over time.
| methodology | label | n_pairs | rho | p_value |
|---|---|---|---|---|
| v1_deployed_oracle | cross_sectional_within_weekend | 1557 | 0.354 | 0.000 |
| v1_deployed_oracle | temporal_within_symbol | 1720 | -0.041 | 0.087 |
| m5_v2_candidate | cross_sectional_within_weekend | 1557 | 0.353 | 0.000 |
| m5_v2_candidate | temporal_within_symbol | 1720 | -0.032 | 0.184 |
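
The two orderings can be illustrated on a synthetic panel. This is a sketch under stated assumptions, not the code behind the table: rows are plain dicts with hypothetical keys `symbol`, `fri_ts`, and `z` (the probit-transformed PIT), and pairs are formed only between adjacent rows inside the same group. With 10 symbols and 173 weekends that pairing rule gives 173 × 9 = 1557 cross-sectional pairs and 10 × 172 = 1720 temporal pairs, which matches the `n_pairs` column.

```python
import math
import random
from itertools import groupby
from operator import itemgetter


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def lag1_within(rows, group_key, order_key, value_key="z"):
    """Lag-1 correlation over adjacent rows that share `group_key`."""
    ordered = sorted(rows, key=itemgetter(group_key, order_key))
    prev, curr = [], []
    for _, grp in groupby(ordered, key=itemgetter(group_key)):
        vals = [r[value_key] for r in grp]
        prev.extend(vals[:-1])  # pairs never straddle a group boundary
        curr.extend(vals[1:])
    return pearson(prev, curr), len(prev)


# Synthetic panel: 10 symbols x 173 weekends sharing a weekend-level shock.
rng = random.Random(1)
symbols = [f"S{i}" for i in range(10)]
panel = []
for week in range(173):
    common = rng.gauss(0, 1)
    for sym in symbols:
        z = math.sqrt(0.35) * common + math.sqrt(0.65) * rng.gauss(0, 1)
        panel.append({"symbol": sym, "fri_ts": week, "z": z})

rho_cs, n_cs = lag1_within(panel, group_key="fri_ts", order_key="symbol")
rho_t, n_t = lag1_within(panel, group_key="symbol", order_key="fri_ts")
```

In this toy panel the shared weekend shock shows up only in the cross-sectional pairing, while the temporal pairing sees independent weeks. That is the qualitative pattern the table reports for both methodologies.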
Berkowitz and DQ at τ=0.95 re-run within each partition. Look for partitions where p-values are non-rejecting — those are the locally-uniform PIT regions. Look for partitions with the largest LR per row — those are the localized rejection sources.
Top 20 most-rejecting (Berkowitz LR descending):
| methodology | partition_col | partition | n | berkowitz_lr | berkowitz_p | rho_hat | var_z | dq_95 | dq_95_p |
|---|---|---|---|---|---|---|---|---|---|
| m5_v2_candidate | earnings_adjacent | no_earnings | 1648 | 182.459 | 0.000 | 0.323 | 0.970 | 33.810 | 0.000 |
| m5_v2_candidate | pooled | all | 1730 | 173.088 | 0.000 | 0.308 | 0.990 | 32.082 | 0.000 |
| m5_v2_candidate | vix_bucket | high | 580 | 95.831 | 0.000 | 0.389 | 1.017 | 13.960 | 0.016 |
| m5_v2_candidate | regime_pub | normal | 1160 | 90.223 | 0.000 | 0.268 | 1.057 | 18.704 | 0.002 |
| m5_v2_candidate | regime_pub | high_vol | 380 | 87.599 | 0.000 | 0.431 | 0.825 | 24.739 | 0.000 |
| m5_v2_candidate | symbol | SPY | 173 | 85.828 | 0.000 | -0.005 | 0.302 | nan | nan |
| m5_v2_candidate | symbol | MSTR | 173 | 64.776 | 0.000 | -0.072 | 2.103 | 44.344 | 0.000 |
| m5_v2_candidate | vix_bucket | mid | 570 | 61.779 | 0.000 | 0.317 | 1.041 | 16.180 | 0.006 |
| m5_v2_candidate | symbol | TLT | 173 | 51.008 | 0.000 | -0.015 | 0.428 | nan | nan |
| m5_v2_candidate | symbol | QQQ | 173 | 45.407 | 0.000 | -0.023 | 0.438 | 6.917 | 0.227 |
| m5_v2_candidate | symbol | GLD | 173 | 45.061 | 0.000 | 0.037 | 0.443 | 6.917 | 0.227 |
| m5_v2_candidate | regime_pub | long_weekend | 190 | 38.047 | 0.000 | 0.309 | 0.809 | 4.563 | 0.472 |
| v1_deployed_oracle | pooled | all | 1730 | 37.601 | 0.000 | -0.043 | 0.839 | 48.629 | 0.000 |
| v1_deployed_oracle | regime_pub | normal | 1160 | 36.931 | 0.000 | 0.019 | 0.819 | 38.496 | 0.000 |
| m5_v2_candidate | symbol | TSLA | 173 | 34.062 | 0.000 | 0.039 | 1.740 | 33.504 | 0.000 |
| v1_deployed_oracle | earnings_adjacent | no_earnings | 1648 | 32.827 | 0.000 | -0.029 | 0.839 | 46.258 | 0.000 |
| m5_v2_candidate | vix_bucket | low | 580 | 30.792 | 0.000 | 0.216 | 0.909 | 33.044 | 0.000 |
| m5_v2_candidate | symbol | HOOD | 173 | 30.471 | 0.000 | -0.124 | 1.671 | 52.026 | 0.000 |
| v1_deployed_oracle | symbol | MSTR | 173 | 26.078 | 0.000 | -0.040 | 0.551 | 8.552 | 0.128 |
| v1_deployed_oracle | vix_bucket | mid | 570 | 23.564 | 0.000 | -0.053 | 0.785 | 17.601 | 0.003 |
Non-rejecting partitions (Berkowitz p ≥ 0.05, n ≥ 50):
| methodology | partition_col | partition | n | berkowitz_lr | berkowitz_p | rho_hat | var_z |
|---|---|---|---|---|---|---|---|
| m5_v2_candidate | symbol | NVDA | 173 | 4.023 | 0.259 | -0.069 | 1.188 |
| m5_v2_candidate | earnings_adjacent | with_earnings | 82 | 6.338 | 0.096 | 0.016 | 1.354 |
| m5_v2_candidate | symbol | GOOGL | 173 | 7.143 | 0.067 | 0.064 | 0.772 |
| v1_deployed_oracle | symbol | AAPL | 173 | 1.298 | 0.730 | -0.069 | 0.939 |
| v1_deployed_oracle | symbol | GOOGL | 173 | 1.846 | 0.605 | 0.098 | 1.035 |
| v1_deployed_oracle | symbol | SPY | 173 | 2.586 | 0.460 | -0.038 | 1.079 |
| v1_deployed_oracle | vix_bucket | low | 580 | 4.142 | 0.247 | -0.055 | 0.930 |
| v1_deployed_oracle | symbol | QQQ | 173 | 4.160 | 0.245 | -0.102 | 0.949 |
| v1_deployed_oracle | earnings_adjacent | with_earnings | 82 | 6.148 | 0.105 | -0.031 | 0.772 |
| v1_deployed_oracle | symbol | GLD | 173 | 6.623 | 0.085 | -0.002 | 0.765 |
Four findings surface from this analysis:
1. v1 and M5 fail Berkowitz for different reasons (pooled, methodology-side ordering). v1's pooled rejection is 68% variance compression (var_z ≈ 0.84), 27% mean shift, and 5% AR(1): the deployed band at τ=0.95 plus the 0.020 buffer is slightly too wide, so PITs cluster toward 0.5 instead of spanning U(0,1). M5's pooled rejection is 99.6% AR(1) (rho ≈ 0.31, var_z ≈ 0.99): per-row magnitude is calibrated, but consecutive-row PITs are correlated.
2. The cross-sectional AR(1) is identical across methodologies (~0.35) and is a data property, not a methodology artefact. Re-ordering both v1's and M5's PITs by `(fri_ts, symbol)` and computing lag-1 within-weekend gives ρ ≈ 0.35 for both (0.354 for v1, 0.353 for M5). Within-symbol temporal lag-1 is ≈ 0 for both (ρ = -0.041 and -0.032, neither significant at the 5% level). The AR(1) component of the pooled Berkowitz LR differs between the methodologies purely because their default panel orderings probe different lag structures: `run_reviewer_diagnostics.py` orders v1 by `(symbol, fri_ts)` (temporal-first; misses the real autocorrelation), while `density_tests_m5` orders M5 by `(fri_ts, symbol)` (cross-sectional-first; picks it up). Both methodologies fail to absorb the common-mode weekend residual after their respective factor-adjusted points.
3. Per-symbol M5 reveals heterogeneous variance, which is the second v3 lead. Single-symbol Berkowitz on M5 (within-symbol ordering, so AR(1) is near-zero per finding 2) shows wildly different `var_z`: SPY (0.30), QQQ (0.44), GLD (0.44), TLT (0.43) all have compressed PIT distributions (M5's bands too wide for these), while MSTR (2.10), TSLA (1.74), HOOD (1.67) have inflated distributions (M5's bands too narrow). M5's per-regime conformal quantile pools across all symbols within a regime; per-symbol residual scale is not uniform within a regime. NVDA (var_z=1.19, p=0.26) and GOOGL (var_z=0.77, p=0.07) are the locally-uniform exceptions.
4. Non-rejecting partitions exist and have a clean shape. Partitions where Berkowitz p ≥ 0.05 with n ≥ 50 are: M5/NVDA, M5/GOOGL, M5/with_earnings (n=82, p=0.096); v1/AAPL, v1/GOOGL, v1/SPY, v1/QQQ, v1/GLD, v1/vix_low, v1/with_earnings. Within-symbol calibration is locally uniform for v1 in five of the ten symbols (AAPL, GOOGL, SPY, QQQ, GLD; no AR(1) within-symbol, mean and variance close to Gaussian). The pooled rejection is entirely a cross-sectional phenomenon for both methodologies.
- Disclosure. Paper 1 §6 / §9 can update from "per-anchor calibration only" to: per-anchor calibration is uniform within-symbol across the panel; the pooled Berkowitz rejection is fully attributable to (a) common-mode residual autocorrelation across symbols within a weekend (cross-sectional ρ ≈ 0.35) and (b) heterogeneous per-symbol residual variance under M5's per-regime quantile pooling. Both are isolated v3 leads.
- v3 lead 1: common-mode residual partial-out. Regress per-row residual on the cross-sectional weekend mean residual (pseudo factor-2); refit the per-regime conformal quantile on the doubly-residualised score. Expected to remove the cross-sectional ρ ≈ 0.35 and tighten the band by ~10–15% at matched coverage.
- v3 lead 2: per-symbol Mondrian. Move from `Mondrian(regime)` to `Mondrian(regime × {symbol-class})`, where symbol-class is one of {equity_index, single_stock_meta, equity_high_beta, gold, bond}. This specifically tightens SPY/QQQ/TLT/GLD bands and widens MSTR/TSLA/HOOD bands: it re-allocates width across the universe rather than reducing total width.
- Not a methodology change for v1 or M5. This analysis strengthens the disclosure and supplies two cleanly-scoped v3 leads. It does not justify reverting M5 or modifying v1.
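
The two v3 leads can be sketched together. Everything here is illustrative and assumed rather than taken from the codebase: the row keys (`symbol`, `fri_ts`, `regime`, `resid`), the function names, the use of absolute residuals as the conformity score, and especially the symbol-to-class assignment, since the source names the classes but not which symbol belongs to which. The partial-out step regresses on the leave-one-out weekend mean rather than the plain weekend mean. That is a deliberate substitution: with the plain mean, each row would be regressed partly on itself.

```python
import math
from collections import defaultdict

# Hypothetical symbol -> class assignment (the source lists the classes, not the mapping).
SYMBOL_CLASS = {
    "SPY": "equity_index", "QQQ": "equity_index",
    "GLD": "gold", "TLT": "bond",
    "MSTR": "equity_high_beta", "TSLA": "equity_high_beta", "HOOD": "equity_high_beta",
    "AAPL": "single_stock_meta", "GOOGL": "single_stock_meta", "NVDA": "single_stock_meta",
}


def conformal_quantile(scores, tau):
    """Split-conformal quantile: the ceil((n + 1) * tau)-th smallest score.

    When that rank exceeds n the exact answer is +inf; this sketch clamps to the max.
    """
    s = sorted(scores)
    k = math.ceil((len(s) + 1) * tau)
    return s[min(k, len(s)) - 1]


def partial_out_common_mode(rows):
    """Lead 1: subtract beta * (leave-one-out weekend mean residual) from each row."""
    by_week = defaultdict(list)
    for r in rows:
        by_week[r["fri_ts"]].append(r["resid"])
    totals = {w: (sum(v), len(v)) for w, v in by_week.items()}
    loo = []
    for r in rows:
        total, k = totals[r["fri_ts"]]
        loo.append((total - r["resid"]) / (k - 1) if k > 1 else 0.0)
    sxx = sum(m * m for m in loo)
    # No-intercept OLS slope of the residual on the common-mode proxy
    beta = sum(r["resid"] * m for r, m in zip(rows, loo)) / sxx if sxx > 0 else 0.0
    cleaned = [dict(r, resid=r["resid"] - beta * m) for r, m in zip(rows, loo)]
    return cleaned, beta


def mondrian_quantiles(rows, tau=0.95):
    """Lead 2: one conformal quantile per (regime, symbol_class) cell."""
    cells = defaultdict(list)
    for r in rows:
        cells[(r["regime"], SYMBOL_CLASS[r["symbol"]])].append(abs(r["resid"]))
    return {cell: conformal_quantile(v, tau) for cell, v in cells.items()}
```

A typical call chain would be `cleaned, beta = partial_out_common_mode(rows)` followed by `mondrian_quantiles(cleaned)`. Whether the weekend mean is available at forecast time, and whether the leave-one-out variant is what v3 should use, are open design questions the source does not settle.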
Source data:
- `reports/tables/v1b_density_rejection_pit_m5.csv`: per-row M5 PITs + violation flags
- `reports/tables/v1b_density_rejection_per_partition.csv`: Berkowitz + DQ per partition
- `reports/tables/v1b_density_rejection_lag1_decomposition.csv`: cross-sectional vs temporal lag-1
- `reports/tables/v1b_density_rejection_berkowitz_decomposed.csv`: pooled LR decomposition

Reproducible via `scripts/run_density_rejection_diagnostics.py`.