Section 1
Bit-Plane Recursion — Decay Profile
Recursive bit-plane decomposition: each layer operates on flag positions from the previous. Halts when flags = 0.
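The recursion this profile visualizes can be sketched in a few lines of Python. This is a minimal illustration that always clears the LSB, not the shipped code (the real encoder uses adaptive bit selection per layer):

```python
def decay_profile(data):
    """Recursive LSB-clearing decomposition: each layer operates on the
    flag positions (indices of odd values) from the previous layer."""
    layers = []
    layer = list(data)
    while layer:
        flags = [i for i, v in enumerate(layer) if v & 1]  # odd positions
        layers.append({
            "size": len(layer),
            "odd_ratio": len(flags) / len(layer),
        })
        if not flags:          # halt condition: no flags left
            break
        layer = flags          # next layer recurses on the flag positions
    return layers

# A ramp 0..255 starts at odd ratio 0.5 and decays to a flag-free layer.
profile = decay_profile(range(256))
```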
Layer-by-layer breakdown
Odd ratio per layer
Layer size cascade
Structural Complexity Map
opt-in — runs chunked analysis across the file
Splits the data into fixed-size chunks and runs per-chunk structural analysis. Reveals where structure lives inside heterogeneous files — different regions of a GGUF model, video container, or mixed binary will show distinct fingerprint classes. For homogeneous files the map is uniform.
Cost: ~16–25 analysis passes on the loaded data. Takes 1–3 seconds.
Section 1a — Encoder feature
Stride Detection
Conditional entropy H(X_i | X_{i−k}) for k ∈ {1, 2, 3, 4, 6, 8, 12, 16}. The encoder selects the k with the lowest conditional entropy before each remap.
What this measures
The encoder measures inter-symbol periodicity at lag k. If knowing X_{i−k} significantly reduces uncertainty about X_i, the data has stride-k structure the encoder can exploit. Stride detection is independent of bit-plane structure — a file can score well on bit-plane metrics but also have strong stride correlation.
Stride confidence = (H(k=1) − H(k_winner)) / H(k=1). Near 0% means no useful stride structure. 20%+ means the encoder will find meaningful stride correlation.
Note: Analysis is performed on up to 200,000 bytes (the file analysis cap). The encoder applies its own 1 MB cap internally; for files larger than 200 KB, stride results may not reflect full-file periodicity.
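The conditional-entropy computation can be sketched with a joint frequency table, the same method the C encoder is described as using (helper names here are illustrative, not the encoder's):

```python
import math
from collections import Counter

def cond_entropy(data, k):
    """H(X_i | X_{i-k}) in bits/symbol, from a joint frequency table."""
    pairs = Counter(zip(data[:-k], data[k:]))   # (X_{i-k}, X_i) counts
    ctx = Counter(data[:-k])                    # context counts
    n = len(data) - k
    h = 0.0
    for (a, b), c in pairs.items():
        h -= (c / n) * math.log2(c / ctx[a])    # -p(a,b) log2 p(b|a)
    return h

def best_stride(data, ks=(1, 2, 3, 4, 6, 8, 12, 16)):
    """Winner = lowest conditional entropy; confidence as defined above."""
    hs = {k: cond_entropy(data, k) for k in ks}
    k_win = min(hs, key=hs.get)
    confidence = (hs[1] - hs[k_win]) / hs[1] if hs[1] > 0 else 0.0
    return k_win, confidence
```

A period-4 pattern like `[0, 0, 1, 1]` repeated gives near-zero entropy at k = 2 and confidence near 100%.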
Conditional entropy by stride k — lower = more periodicity
Stride table
Section 1b
Bit Distribution
Fraction of values with each bit set, per layer. Bit 0 = LSB. Uniform = ~0.5 for random data.
Reading this view
The swept bit starts near its natural set-frequency. After the L0 bit-clearing pass, it collapses to 0.0 in the aligned layer — that's the structural claim. The Bit column in the Decay Profile shows which bit was targeted at each layer.
Section 4
Structure Probe — mod-N Alignment
Fraction of values already aligned to mod-N, before and after a single even-alignment pass.
What this measures
Even-alignment increases mod-N alignment because rounding odd values down promotes divisibility. This quantifies how much latent mod-N structure is unlocked for free by a single bit-plane pass.
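The probe can be sketched as follows (illustrative code; `even_pass` here rounds odd values down by clearing the LSB, mirroring the even-alignment pass described above):

```python
def even_pass(data):
    """Even-alignment: round odd values down by clearing bit 0."""
    return [v & ~1 for v in data]

def mod_alignment(data, n):
    """Fraction of values already divisible by n."""
    return sum(1 for v in data if v % n == 0) / len(data)

data = list(range(1, 101))                 # 1..100
before = mod_alignment(data, 4)            # only multiples of 4 count
after = mod_alignment(even_pass(data), 4)  # 4k and 4k+1 both align now
```

For this ramp, a single even-alignment pass doubles the mod-4 alignment: values of the form 4k+1 round down onto the 4k boundary.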
mod-N structure: raw vs post-even-alignment
Spec §2.1 — The central open question
Entropy Analysis
Does AIM bit-plane recursion genuinely reduce total entropy, or does it relocate it? Shannon entropy in vs. out across all streams.
How this works
Raw data has H bits/byte of Shannon entropy. After bit-plane decomposition, the aligned stream and each flag layer's bitset carry their own entropy. If H(aligned) + H(flags) < H(raw), entropy was genuinely reduced — the data had structure the bit-plane pass could separate. If it equals or exceeds the input, the decomposition relocated entropy without reducing it. Both are real findings.
Flag entropy model: each flag layer is treated as a bitset of N bits with K set. Entropy = N × H(K/N), where H is the binary entropy function. This measures the information content of knowing which positions were flagged.
This page's entropy model uses fixed bit 0 (LSB clearing) as a controlled target to isolate its entropy contribution — one of eight possible targets. The Decay Profile and encoder use adaptive bit selection (sparsest bit per layer), which is why halt depths there may differ. The table below extends the fixed-bit-0 analysis to all 8 positions for comparison.
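The flag-layer cost under this model can be sketched directly from the formula N × H(K/N):

```python
import math

def binary_entropy(p):
    """H(p) in bits for a Bernoulli(p) source."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def flag_layer_bits(n, k):
    """Model cost of a flag layer: N bits with K set -> N * H(K/N) bits."""
    return n * binary_entropy(k / n)

# 1000 positions with only 50 flagged cost far less than a flat
# 1000-bit bitset; at K = N/2 the model charges the full N bits.
sparse_cost = flag_layer_bits(1000, 50)
dense_cost = flag_layer_bits(1000, 500)
```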
All-bit AIM target comparison — entropy outcome (lower output = better)
Sweep winner = fewest total flags · Entropy winner = lowest total output entropy
Bit 0 / LSB — stream entropy breakdown
Bit 0 / LSB — per-layer flag entropy
Spec §2.1b — Mechanistic explanation
Per-Bit Entropy Profile
Binary entropy for each bit position 0–7, before and after the L0 bit-plane pass. Reveals whether LSB was genuinely carrying less information than higher bits.
The core hypothesis
In structured data, the LSB (bit 0) carries disproportionately less entropy than higher-order bits because value clustering means the LSB is quantization noise on top of a smoother signal. The bit-plane pass strips that noise and encodes it separately. For uniform random data, every bit carries identical entropy (~1.0 bit) — the decomposition can only relocate, not reduce. The asymmetry in this profile is the mechanistic explanation of AIM's structural claim.
Reading the chart: A bit near 1.0 is carrying maximum entropy (set in ~50% of values). A bit near 0.0 is nearly deterministic — almost always set or almost always clear. Bits far from 1.0 are the cheap ones: they cost little to encode separately, and separating them exposes the smoother structure underneath.
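The per-bit profile itself is cheap to compute. A sketch (the entropy of bit b is the binary entropy of its set fraction, as described above):

```python
import math

def per_bit_entropy(data):
    """Binary entropy of each bit position 0-7 (bit 0 = LSB)."""
    n = len(data)
    out = []
    for b in range(8):
        p = sum((v >> b) & 1 for v in data) / n   # set fraction of bit b
        if p in (0.0, 1.0):
            h = 0.0                               # deterministic bit
        else:
            h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
        out.append(h)
    return out

# A ramp 0..255 sets every bit in exactly half the values: all bits
# carry the maximum 1.0 bit. A constant byte carries zero everywhere.
profile = per_bit_entropy(list(range(256)))
```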
Cross-reference: The Bit Clearing Sweep shows which bit position produces the fewest total flags. The Entropy Analysis shows which produces the lowest total output entropy. This page explains why they answer slightly different questions — fewer flags ≠ lower entropy if the aligned stream picks up entropy from the clearing operation.
Raw data — bit entropy by position
After L0 even-alignment — aligned stream
Delta — entropy change per bit after L0
Negative = bit became cheaper to encode. Positive = bit became more expensive.
LSB analysis — the structural claim
Spec §1.2 / §2.4 — Practical compression test
Compression Benchmark
Does AIM decomposition make data more compressible under gzip? Raw gzip vs AIM+gzip across all 8 bit targets. Five analysis modes: blob, generic split, targeted split, targeted + final gzip, and predicted halt output.
Five modes, one question: does AIM decomposition make data more compressible?
Note: modes 1–4 simulate full-depth recursion. The actual encoder halts early via HALT_ANS_STRIDE for structured data — see the Predicted Halt section below and the Decay Profile page for the estimated cutoff.
Blob — aligned bytes + all flag layers concatenated, single gzip. Baseline: forces gzip to handle heterogeneous content in one pass. (Full depth.)
Generic split — each stream independently gzip'd, sizes summed. Better, but applies gzip uniformly to every layer regardless of size — tiny layers get hit with header overhead that exceeds their content. (Full depth.)
Targeted split — per-layer optimal selection: bitset-raw, delta-raw, bitset+gzip, delta+gzip, picks smallest. Skips gzip for layers below ~32 bytes. This is what the three-stream architecture delivers. (Full depth.)
Targeted + final gzip — outer gzip applied to the full targeted-encoded concatenation. (Full depth.)
Predicted halt (Mode 5) — estimated output size if HALT_ANS_STRIDE fires at the predicted depth. Uses the HALT predictor from the Decay Profile page. This is the most realistic estimate of what the actual encoder would produce for structured data.
Ratio < 1.0 = AIM wins.
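Modes 1 and 2 can be sketched with stdlib gzip. The function names below are illustrative, and the real benchmark additionally models the targeted modes and a ~32-byte per-member overhead constant:

```python
import gzip

def mode1_blob(aligned: bytes, flag_layers: list) -> int:
    """Mode 1: single gzip over aligned bytes + all flag bitsets."""
    blob = aligned + b"".join(flag_layers)
    return len(gzip.compress(blob, compresslevel=9))

def mode2_split(aligned: bytes, flag_layers: list) -> int:
    """Mode 2: each stream gzipped independently, sizes summed. Tiny
    layers pay per-member gzip header overhead, which the targeted
    mode (mode 3) avoids by skipping gzip below ~32 bytes."""
    streams = [aligned] + flag_layers
    return sum(len(gzip.compress(s, compresslevel=9)) for s in streams)
```

With several tiny flag layers, mode 2's per-stream headers can exceed the content they wrap, which is exactly the overhead the targeted split is designed to avoid.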
All-bit results — gzip ratio vs raw (lower = better, <1.0 = AIM wins)
Single blob ratio — all bits
aligned + flags concatenated → gzip
Generic split ratio — all bits
gzip(aligned) + gzip(flags) independently, summed
Targeted split ratio — all bits ★
per-layer: bitset-raw / delta-raw / bitset+gzip / delta+gzip — picks smallest
Targeted + final gzip — all bits
outer gzip over all optimally-encoded streams concatenated — tests for residual cross-stream redundancy
Mode 5 — Predicted halt output size ★
Estimated encoder output if HALT_ANS_STRIDE fires at predicted depth
Stream size breakdown — best bit target
Reading these results
Spec: Byte-Alignment Invariant
Invariant Check
For datasets whose length is a multiple of 8, the parity of byte 0 deterministically predicts the terminal halt value.
The invariant
When dataset length is a multiple of 8, the bit-plane recursion tree is perfectly symmetric at every depth — no remainder elements to break parity propagation. This means: even byte 0 → terminal halt at 0 (complete structural collapse); odd byte 0 → terminal halt at 1 (irreducible LSB). This is deterministic, not probabilistic. It is a zero-cost integrity check: compute the prediction before reconstruction, compare after. If they disagree, reconstruction failed.
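The prediction side of the check is a one-liner (a sketch; the "actual" terminal value comes from running the decoder, which is not reproduced here):

```python
def predict_terminal(data):
    """Byte-alignment invariant: for factor-of-8 lengths, the parity of
    byte 0 predicts the terminal halt value (0 = complete collapse,
    1 = irreducible LSB). Returns None when not applicable."""
    if len(data) % 8 != 0:
        return None          # invariant only defined for factor-of-8 data
    return data[0] & 1       # even byte 0 -> 0, odd byte 0 -> 1
```

Usage: compute `predict_terminal(data)` before reconstruction, compare with the decoder's terminal value after; a mismatch signals a failed reconstruction (within the reliability scope described below).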
Why it matters beyond integrity: It tells you something real about the data. A dataset that predicts and delivers terminal 0 has had its arithmetic structure completely resolved — every layer found structure all the way down. Terminal 1 means an irreducible identity remains — the last LSB the transform couldn't absorb.
Reliability scope (clarification from aim_core_v3.py SPEC DELTA §8):
The invariant holds deterministically only when the full symmetric recursion tree plays out — primarily deep-decay datasets with an even byte 0 (e.g. Gradient: byte_0=0, always predicts and delivers terminal 0). For other cases:
- Random data: prediction holds ~50% of the time (coin flip, not deterministic).
- Rapid-collapse data (Prime Gaps: halts at depth 1): terminal value is 0 regardless of byte_0 parity because the deep symmetric tree never develops.
Spec §3.2 — Reference profile library
Structural Fingerprint
Your data's decay profile scored against six canonical reference classes. The fingerprint is two-dimensional: halt depth + bit distribution shape.
What the fingerprint captures
Two datasets with identical Shannon entropy can have completely different fingerprints. Random noise and natural language both run 13 layers — but their bit distributions at L0 cleanly separate them (random: flat ~0.5; language: clustered at bits 5–6 from ASCII 32–127). The fingerprint is the structural class signature: not what the data means, but what kind of mathematical object it is at the byte level.
Practical use beyond compression: An unknown binary file that fingerprints as "structured / rapid collapse" is likely a numerical sequence or table. "Language / ASCII" means text-like encoding. "Uniform noise" means the data is either truly random or already compressed/encrypted (which looks the same to this instrument). "Oscillating deep" is a gradient or ramp. These classifications work without knowing the file's domain, format, or meaning.
Your fingerprint
What it suggests
Reference library — similarity scores
Scored by halt depth (60%) + L0 odd ratio (40%)
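The two-dimensional score can be sketched as a weighted similarity matching the 60/40 split above. The distance normalizations (dividing depth distance by a maximum depth, taking odd-ratio distance directly) are assumptions for illustration, not the tool's exact formula:

```python
def fingerprint_similarity(halt_depth, l0_odd_ratio,
                           ref_depth, ref_ratio, max_depth=13):
    """Similarity in [0, 1]: 60% halt-depth proximity + 40% L0
    odd-ratio proximity. Normalizations are illustrative assumptions."""
    depth_sim = 1.0 - abs(halt_depth - ref_depth) / max_depth
    ratio_sim = 1.0 - abs(l0_odd_ratio - ref_ratio)
    return 0.6 * depth_sim + 0.4 * ratio_sim

# Identical profiles score 1.0; a rapid-collapse profile scores low
# against a deep noise reference.
score = fingerprint_similarity(13, 0.5, 13, 0.5)
```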
Section 2
Linear Chain Sweep
Alternates even_pass and mod-N on flag positions. Measures total flags across all layers — lower is better.
Interpretation
Single even-alignment is the baseline. Each mod-N column shows the total flag count when mod-N is interleaved. A value lower than the single-pass baseline would be a genuine win; in practice, chaining consistently produces more flags than single even-alignment alone.
Why chaining fails: Flag positions from an even-alignment pass are indices in a position space — they carry no reason to exhibit modular structure. A mod-N pass on those positions generates flag ratios of 0.75–0.93 at every layer, spreading entropy across more layers without reducing it. The mod-N operation is looking for periodic structure in a list of index values that have none. This is not a failure of mod-N as an operation — it's a failure of target selection. The Seer step (Part 1 of the AIM formula) would reject this application: studying position indices through a modular lens reveals no coherent relationship to the target.
Total flag positions across all layers (lower = better)
Section 1c
Bit Clearing Sweep
Runs the AIM bit-plane operation independently for each of the 8 bit positions. Reveals which bit has the most latent structure — and whether even-alignment (bit 0) was actually the best target for this data.
How to read this
Each bit position defines a different structural target. Clearing bit b means: if a value has that bit set, subtract 2^b and record the position as a flag. This is identical to what AIM does for bit 0 (subtract 1 from odd values), generalized to any power of two.
Fewer total flags = more structure. If a bit is already clear in most values, few positions get flagged and the recursion terminates quickly. That means the data is naturally aligned to that bit's boundary — a genuine structural property. The sweep finds which power-of-two boundary the data is most aligned to, without assuming it's always 2 (even-alignment).
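A sketch of the generalized clearing pass and the sweep's flag total (illustrative code; the sweep runs this recursion once per bit and compares totals):

```python
def clear_bit_pass(data, b):
    """Clear bit b: subtract 2**b from values with that bit set,
    recording the flagged positions."""
    mask = 1 << b
    flags = [i for i, v in enumerate(data) if v & mask]
    aligned = [v & ~mask for v in data]
    return aligned, flags

def total_flags(data, b):
    """Total flag count across the full recursion on flag positions."""
    total = 0
    layer = list(data)
    while layer:
        _aligned, flags = clear_bit_pass(layer, b)
        total += len(flags)
        if not flags:
            break
        layer = flags            # recurse on the flag positions
    return total
```

Data already aligned to a bit's boundary produces zero flags for that target, which is exactly what the sweep is looking for.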
What each bit targets
Total flags per target bit — full recursive decay (lower = more structure)
Flag count comparison — all 8 bits
Per-bit decay profile — L0 flag ratio
L0 = fraction of raw values with target bit set
Section 3
Disconnected Chain — AIM Tree
Applies mod-N to even-aligned values (not flag positions). Both branches documented independently for lossless reconstruction.
AIM Tree vs linear chain
A disconnected chain splits the decomposition into two branches: Branch A recurses on even-pass flag positions; Branch B recurses on mod-N positions from aligned values. Both are lossless. The cost is the sum of both branch totals.
Experiment V1 — JS↔Python discrepancy
Entropy Models: Browser vs. Python
Two ways to price the flag layer. The browser tool uses Model A (L0-only flat bitset). Python's
compute_all_bit_entropies() uses Model B (full bit-plane recursion on flag positions). They can disagree on verdict.
Key discrepancy: The browser always uses the L0 flat-bitset flag cost — H(K/N)×N bits for one layer. Python recurses on flag positions across all layers, which is cheaper whenever flags have positional structure. For Prime Gaps bit 4, Python's Model B confirms genuine reduction; the browser's Model A also agrees here. For random noise bit 0, the two models should now agree since v11 fixed the random noise generator to use Mulberry32 (see JS↔C Encoder Delta page).
Experiment N1 — O(N) entropy winner prediction
Entropy Winner Predictor
Can the entropy winner (bit with lowest total output entropy) be predicted from a cheap O(N) pre-pass without running the full decomposition?
Predictor formula (from §N1): For each bit b, split the data into two sub-distributions: values with bit b set (cleared to aligned values) and values with bit b clear (unchanged). Predicted net = p×H_set + (1−p)×H_clear + H(p) − H(raw). If predicted_net < −raw_H×0.01, predict "reduction". The winner is the most negative predicted_net.
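The pre-pass can be sketched directly from the formula (helper names are illustrative; each call is a single linear scan plus histogram entropies):

```python
import math
from collections import Counter

def shannon(values):
    """Shannon entropy in bits/symbol of a value sequence."""
    n = len(values)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def predicted_net(data, b):
    """p*H_set + (1-p)*H_clear + H(p) - H(raw) for target bit b."""
    mask = 1 << b
    set_vals = [v & ~mask for v in data if v & mask]   # cleared to aligned
    clear_vals = [v for v in data if not v & mask]     # unchanged
    p = len(set_vals) / len(data)
    return (p * shannon(set_vals) + (1 - p) * shannon(clear_vals)
            + binary_entropy(p) - shannon(data))

def predict_winner(data):
    """Winner = most negative predicted net across all 8 bits."""
    nets = {b: predicted_net(data, b) for b in range(8)}
    return min(nets, key=nets.get)
```

For a uniform ramp 0..255 every bit predicts net zero (splitting a flat distribution buys nothing), which matches the relocation verdict expected for structureless data.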
Reference — Implementation transparency
JS Analyzer ↔ C Encoder (v34) Delta
Every documented difference between this browser analysis tool and the shipped C encoder (aim_v34.c). Python (aim_core_v3.py) is a secondary reference that predates the C implementation. These entries are not bugs — they are deliberate or unavoidable divergences with known impact on results.
RESOLVED
Random Noise: LCG vs. Mulberry32 (fixed in v11)
v9 (historical): genRandomNoise used an LCG: s = (s×1664525 + 1013904223) >>> 0, seed=42. The increment 1013904223 is odd, so every output alternated parity. All odd values landed at even-indexed positions, meaning L1 received all-even position indices and halted immediately — halt depth ≈ 1 instead of the correct ~13.
v11 (current): genRandomNoise now uses Mulberry32 seeded at 42: a small, fast PRNG producing statistically uniform output with no parity artifact. Halt depth now matches Python's Mersenne Twister at ~13 layers.
Status: Fixed. The Random Noise demo now produces the correct noise fingerprint (~13 layers). This entry is retained for historical transparency.
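The parity artifact is easy to demonstrate with a Python port of the v9 LCG (a sketch for illustration, not the analyzer's actual generator code):

```python
def lcg_bytes(n, seed=42):
    """The v9 LCG: with an odd multiplier and odd increment, the 32-bit
    state (and therefore its low byte) alternates parity every step."""
    s = seed
    out = []
    for _ in range(n):
        s = (s * 1664525 + 1013904223) & 0xFFFFFFFF
        out.append(s & 0xFF)
    return out

vals = lcg_bytes(16)
parities = [v & 1 for v in vals]
# Parities strictly alternate, so all odd values land at positions of one
# fixed parity and the recursion on flag positions halts almost at once.
```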
MODERATE
Entropy Flag Model: L0-only flat bitset vs. full recursion
JS Model A (browser): Flag layer cost = H(K/N) × N bits. This is a single-layer flat bitset — it treats flag storage as one flat N-bit structure with K bits set.
Python Model B: compute_all_bit_entropies() recurses on flag positions using REAL, summing bitset entropy across all recursive layers: Σ H(K_l/N_l) × N_l. This is cheaper whenever flag positions have positional structure.
Impact on results: Model B always produces equal or lower total output bits than Model A. Verdict (reduction/relocation/increase) can differ for borderline cases. For Prime Gaps bit 4, both agree on "reduction". For Fibonacci bit 0, both agree on "increase". Near-threshold datasets may show different verdicts.
Threshold: Both use ±1% of raw bits as the reduction/increase boundary (explicit in JS, confirmed in the Python port).
MINOR
max_depth: C encoder max_depth=8 is a mathematical consequence, not a tunable parameter
C encoder (v34): max_depth = 8, hardcoded. Recursion terminates naturally when the symbol range reaches [0,1] (only two distinct values possible). Since each byte has 8 bits, the recursive even/odd split can produce at most 8 meaningful layers before the symbol space collapses. This is a mathematical property of the transform, not a configurable limit.
JS Analyzer & Python v3: Use max_depth = 40 as a safety limit on the analysis loop. In practice the encoder's natural halt at depth 8 means layers beyond 8 are never produced; the analysis loop exits on empty flag lists long before 40.
Impact: No behavioral difference for well-structured data. The demo UI now labels the max_depth=8 constraint as a C encoder property, not a parameter. Noise-class data can produce spurious layers 9–13 in the JS simulator because it does not replicate the exact encoder halt conditions; treat layers beyond 8 as a simulator artifact for noise inputs.
MINOR
Gzip compression level: browser ≈ 6, Python defaults 9
Browser (CompressionStream): Uses the browser's built-in CompressionStream API, which typically applies gzip level 6. Results vary by browser engine.
Python stdlib gzip: Defaults to compresslevel=9 (maximum). Pass gz_level=6 to compute_compression_benchmark() or run_all() to reproduce browser numbers more closely.
Impact: Python compression ratios will be equal or slightly better than browser ratios. The ~32-byte GZIP_OVERHEAD constant is shared and correct for both.
MODERATE
Byte-alignment invariant: narrower reliability than spec implies
Spec claim: For factor-of-8 datasets, parity of byte 0 deterministically predicts the terminal halt value.
Empirical clarification: The invariant holds reliably only when the terminal value is structurally determined by initial parity — primarily datasets that halt cleanly (odd_count=0) with an even byte 0 (e.g. Gradient: always byte_0=0). For random data it holds ~50% of the time (a coin flip, not deterministic). For datasets that collapse very quickly (Prime Gaps halts at depth 1 with a single odd value), the terminal value is 0 regardless of byte_0 parity, because the deep symmetric-tree propagation never occurs.
Impact: The invariant check in the browser UI correctly shows "not applicable" for non-factor-of-8 data and "prediction held" for qualifying cases. The claimed determinism is real for the subset of datasets where the full symmetric recursion tree plays out (deep decay, even byte_0). Not a reliable integrity check for noise or rapid-collapse data.
MINOR
Entropy sampling: browser samples ≤500 aligned values for H estimation at L1+
Browser: For layers beyond L0, Shannon entropy of the aligned stream is estimated from the first 500 values to maintain UI responsiveness. The flag-layer bitset entropy (the compression-relevant quantity) is computed exactly.
Python: Computes Shannon entropy of aligned streams exactly at all layers, with no sampling limit.
Impact: The aln_h value shown per layer may be an approximation in the browser for large inputs. The compression verdict (reduction/relocation/increase) is not affected — it uses bitset entropy, not aligned-stream entropy.
MINOR
mod_pass() absent from spec Python; present in JS and Python v3
Spec v19: Mentions modular alignment in chaining experiments but never provides a standalone Python implementation.
JS & Python v3: modPass(data, N) / mod_pass(data, N) is a first-class function used by the linear chain sweep and disconnected chain analysis.
Impact: No behavioral difference — the chaining experiments work correctly. This is a documentation gap, not a discrepancy in results.
MINOR
Powers of 2: JS uses Math.pow() float; Python uses int **
JS: Math.pow(2, i%8) — returns a float (1.0, 2.0, 4.0, …, 128.0). Masking with & 0xFF is applied to extract the byte value.
Python: 2 ** (i % 8) — integer result (1, 2, 4, …, 128).
Impact: No difference. Both produce the same 8-cycle sequence: [1, 2, 4, 8, 16, 32, 64, 128]. The spec notes this explicitly (SPEC DELTA in gen_powers_of_two).
MODERATE
HALT_ANS_STRIDE: encoder halts early; analyzer models full depth
C encoder (v34): At each recursion depth, the encoder computes the cost of rANS-encoding the aligned stream directly (the "ANS stride" path) and compares it to the cost of continuing recursion. If ANS encoding wins, recursion halts early. This is HALT_ANS_STRIDE in aim_v34.c. For structured data, the encoder typically halts 2–4 layers before the natural flag-empty terminus.
JS Analyzer (v11): The Decay Profile now shows a predicted HALT_ANS_STRIDE depth (annotated on the chart), computed from an approximate cost model: ANS stride cost ≈ H(k_winner) × N_d bits vs. subtree cost. Layers beyond the predicted halt are shown dimmed as structural information only — the encoder would not process them.
Impact on results: Compression size estimates for modes 1–4 in the Compression Benchmark assume full-depth recursion. The predicted halt mode (Mode 5) corrects this for structured data. The predictor may be off by ±1 depth for borderline cases. See the Compression Benchmark page for the predicted-halt estimate.
MODERATE
Stride detection: encoder measures H(X_i | X_{i-k}); analyzer now replicates this
C encoder (v34): For each stride k in {1, 2, 3, 4, 6, 8, 12, 16}, the encoder computes conditional entropy H(X_i | X_{i-k}) using a joint frequency table, and selects the k that minimizes this entropy. This stride value informs both the HALT_ANS_STRIDE cost estimate and the ANS encoding schedule.
JS Analyzer (v11+): The Stride Detection page replicates this computation using the same joint-frequency-table method, capped at min(1 MB, file size) to match encoder behavior. v12 adds the C encoder's STRIDE_GAIN_THRESH = 0.05 bits/byte: if the gain is below this threshold, stride is suppressed (kWinner reset to 1, confidence 0). Previously any gain was reported, over-reporting stride significance for marginal cases.
Status: Threshold applied in v12. Stride detection now matches C encoder behavior, including the suppression threshold.
RESOLVED
caim mode: byte-accurate estimate added in v12
C encoder (v34): The encoder runs two passes in parallel: (a) the recursive bit-plane decomposition path, and (b) a caim mode that sweeps all 8 bits (selecting the least-set bit each step), concatenates all bitsets into one blob, then gzips the blob and the aligned stream. It takes whichever produces the smaller output.
JS Analyzer (v12): computeCaimEstimate() now replicates the C caim_encode() algorithm exactly: 8-bit adaptive sweep, bitset concatenation, gzip of the flags blob plus gzip of the aligned stream. The byte-accurate comparison appears on the Structural Fingerprint page (caim vs. recursive verdict card) and the Compression Benchmark page.
Residual gap: Browser CompressionStream uses gzip level ~6; C uses level 9. The JS caim estimate is slightly conservative — actual caim output will be marginally smaller. This is noted in the display. All size comparisons in the UI account for this bias direction.
RESOLVED
Bit sweep: encoder selects sparsest bit; analyzer now replicates this (new in v12)
C encoder (v34): At each recursion depth d, sweep() counts set values for bits 0 through (7−d) and selects the bit with the fewest set values. This adaptive selection minimizes flag count per layer and is the core of the AIM recursive algorithm.
JS Analyzer (v11 — historical): Always used evenPass(), which cleared bit 0 (LSB) at every depth. This caused halt depth to be massively overestimated for structured data (e.g., Prime Gaps would show 13 layers instead of 1–2).
JS Analyzer (v12 — current): realTransform() now uses sweepPass() with the same adaptive selection as the C encoder: at depth d, it considers bits 0..(7−d) and picks the sparsest non-zero bit. Maximum depth is correctly capped at 8.
Status: Fixed in v12. Decay profiles, fingerprint classification, and the halt predictor all now reflect the correct sweep-based behavior. The evenPass() function is retained only for the chain experiments and entropy models, which are separate analytical paths.
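The adaptive selection can be sketched as follows (a simplified model of the described sweep() behavior, not the C source):

```python
def sweep_pass(layer, depth):
    """At depth d, consider bits 0..(7-d); pick the sparsest non-zero
    bit, clear it, and return (aligned, flags, chosen_bit)."""
    candidates = range(8 - depth)
    counts = {b: sum(1 for v in layer if v & (1 << b)) for b in candidates}
    live = {b: c for b, c in counts.items() if c > 0}
    if not live:
        return layer, [], None          # nothing to clear: natural halt
    b = min(live, key=live.get)         # sparsest set bit wins
    mask = 1 << b
    flags = [i for i, v in enumerate(layer) if v & mask]
    aligned = [v & ~mask for v in layer]
    return aligned, flags, b
```

For a layer like `[1, 1, 1, 4]`, bit 2 is set in only one value while bit 0 is set in three, so the sweep targets bit 2 first; fixed LSB clearing would have flagged three positions instead of one.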
INFORMATIONAL
Decode overflow bug (historical, resolved in v25): GGUF ~87.8% result affected
Prior to v25: File offset arithmetic used 32-bit unsigned integers (u32). For files larger than ~4 GB, offset values silently truncated on overflow. The encoder appeared to succeed, but the decoder read incorrect byte ranges, producing a decoded file that was structurally wrong — not a valid decompression of the input.
v34 (current): All file offsets use 64-bit unsigned integers (u64). Large-file handling is correct.
Impact on published results: The ~87.8% compression ratio reported for a large GGUF file was produced by a v16 encoder/decoder pair subject to this overflow bug. The decoded output reflected data loss from the offset truncation, not genuine compression. This result should not be cited as a validated AIM benchmark. Current encoder versions have not been re-benchmarked on the same GGUF at this time.
LOW
DR3L / symbol-mapping: ⊕ operator is more general than bit-clear
Paper formulation: The original AIM paper's ⊕ operator is defined as a general symbol-space mapping — not specifically bit-clear (even-alignment). Any invertible mapping from a symbol alphabet to a sub-alphabet qualifies.
DR3L experiment: dreal_minimal_Bits.py explored mapping decimal digit sequences (0–9) into a reduced symbol space using arbitrary bijections. DR3L demonstrated that the symbol space is a free variable: one can choose a mapping that exploits domain-specific structure rather than always using bit-clear. The experiment found 2.17× expansion for natural decimal sequences — the mapping did not exploit structure efficiently enough to beat the raw representation.
Relevance: The shipped encoder uses bit-clear as its ⊕ operator because it is fast, invertible, and works universally on byte data. DR3L showed that alternate symbol mappings are possible but require domain knowledge to be beneficial. Not a current gap in the analyzer — included for research context.