An empirical investigation · 2026
Every diffusion image generator takes a seed — a number tutorials call “just for reproducibility.” We measured what it actually does. In some architectures the seed deterministically controls roughly half of where things go in the picture: the framing, the layout, the vertical placement of the subject. In others that grip is almost gone, and the prompt takes over. The role is not fixed across models — it is a graded property of the architecture and how the model was trained.
The problem is not that a seed exists. Every generative process needs randomness. The problem is that this number is one of the heaviest weights on the output — co-equal with the prompt you actually wrote — and at the same time completely opaque to the author: labelled “random,” surfaced as a throwaway integer, with no readout of what it is deciding. A dominant input you cannot see or reason about is not a tool; it is an uncredited co-author. The aim here is to make its weight visible and controllable — not to remove it.
NoobAI · point 54.4% · 95% CI [42.8 – 66.0]
The seed does about half the work of choosing where the subject lands.
SD3.5 · point 30.2% · 95% CI [16.7 – 50.5]
Same rectified-flow MMDiT as Flux, but un-distilled (real CFG) and smaller — keeps part of the seed’s grip.
Flux · point 5.0% · 95% CI [1.0 – 16.0]
The compositional role of the seed nearly vanishes; the prompt takes over.
Figures are the seed’s share of variance in centroid_y — the
vertical placement of the Otsu-thresholded foreground mass — the cleanest single
composition axis. Bootstrap means with 95% block-bootstrap CIs (B = 5000,
blocked on prompts; 64 seeds × 10 prompts per model). Source:
MATRIX_RESULTS.md §1.
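As a rough illustration of the feature described above, the sketch below computes an Otsu threshold on a small grayscale image and returns the normalised vertical centre of mass of the foreground. It is a minimal sketch, not the project's extractor: the function names are hypothetical, the image is a plain list of rows with integer grey levels 0–255, and treating pixels brighter than the threshold as foreground is an assumption.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level that maximises between-class variance (Otsu)."""
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]              # pixels at or below t form the background
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def centroid_y(image):
    """Vertical centre of mass of the foreground, 0 = top edge, 1 = bottom edge."""
    flat = [v for row in image for v in row]
    t = otsu_threshold(flat)
    # ASSUMPTION: foreground = pixels strictly brighter than the threshold.
    row_mass = [sum(1 for v in row if v > t) for row in image]
    mass = sum(row_mass)
    if mass == 0:
        return 0.5                   # no foreground found: report the centre
    height = len(image)
    return sum((y + 0.5) * m for y, m in enumerate(row_mass)) / (mass * height)


# A bright block in the lower half of a dark 4x4 frame.
demo = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [10, 200, 200, 10],
    [10, 200, 200, 10],
]
print(centroid_y(demo))
```

A subject sitting low in the frame yields a value above 0.5; the variance of this one number across seeds and prompts is what the shares in this section describe.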
Five cells, one matched protocol
| Model | Family | Params | Training objective | Res. | CFG | centroid_y seed % (95% CI) | Regime vs Flux |
|---|---|---|---|---|---|---|---|
| NoobAI | SDXL U-Net | 2.6 B | DDPM | 1024 | 4.0 | 54.4 [42.8–66.0] | inverted (strong) |
| Animagine | SDXL U-Net | 2.6 B | DDPM | 1024 | 5.0 | 48.1 [36.1–60.1] | inverted (strong) |
| SD1.5 | SD U-Net | 0.9 B | DDPM | 512 | 7.0 | 35.9 [24.6–50.8] | inverted (weaker) |
| SD3.5 | MMDiT | 8 B | rectified flow | 1024 | 4.0 | 30.2 [16.7–50.5] | intermediate (graded) |
| Flux | MMDiT | 12 B | rectified flow | 1024 | 4.0 | 5.0 [1.0–16.0] | baseline (reference) |
| PixArt-Sigma | DiT + cross-attn | 0.6 B | DDPM | 1024 | 4.5 | pending | in flight |
S03 · Variance decomposition
Each model's centroid_y variance splits three ways: the share fixed by the seed, the share fixed by the prompt, and the seed×prompt interaction. Reading top to bottom — NoobAI, Animagine, SD1.5, SD3.5, Flux — the orange seed band shrinks from roughly half to almost nothing while the cyan prompt band grows to dominate. That handoff is the finding.
centroid_y from
variance.json (64 seeds × 10 prompts per model). Seed band in
orange (--accent-unet), prompt band in cyan
(--accent-mmdit), seed×prompt interaction in muted grey.
The seed share collapses from 49.8% on NoobAI to 2.6% on Flux while the
prompt share rises from 23.2% to 62.0%. These are point estimates from the
per-feature variance file and differ slightly from the headline figures in S01/S02
(NoobAI 54.4, SD3.5 30.2, Flux 5.0), which are block-bootstrap means with 95%
intervals (B = 5000); the small offset between point estimate and bootstrap
mean is expected.
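The three-way split can be sketched as a plain sum-of-squares decomposition over the seed × prompt grid. This is a minimal sketch under stated assumptions: the function name is hypothetical, there is one image per (seed, prompt) cell so the interaction term also absorbs any residual noise, and the project's variance.json may use a different estimator.

```python
def variance_shares(grid):
    """grid[s][p] holds centroid_y for seed s and prompt p (one image per cell).

    Returns (seed_share, prompt_share, interaction_share), each a fraction of
    the total sum of squares. With a single image per cell the interaction
    term cannot be separated from residual noise.
    """
    n_seeds, n_prompts = len(grid), len(grid[0])
    grand = sum(sum(row) for row in grid) / (n_seeds * n_prompts)
    seed_means = [sum(row) / n_prompts for row in grid]
    prompt_means = [
        sum(grid[s][p] for s in range(n_seeds)) / n_seeds for p in range(n_prompts)
    ]
    ss_seed = n_prompts * sum((m - grand) ** 2 for m in seed_means)
    ss_prompt = n_seeds * sum((m - grand) ** 2 for m in prompt_means)
    ss_total = sum((v - grand) ** 2 for row in grid for v in row)
    if ss_total == 0:
        return 0.0, 0.0, 0.0         # constant grid: nothing to explain
    ss_inter = ss_total - ss_seed - ss_prompt
    return ss_seed / ss_total, ss_prompt / ss_total, ss_inter / ss_total


# Toy grid: each cell is (seed effect) + (prompt effect), no interaction.
toy = [[0.0, 2.0], [1.0, 3.0]]
print(variance_shares(toy))
```

In the toy grid the prompt effect is twice as large as the seed effect, so the prompt takes four fifths of the variance and the interaction share is zero.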
The graded inversion
The first framing of this work was binary: the seed governs vertical
composition in U-Net models and surrenders
that control in MMDiT models. SD3.5 broke
the binary. It lands at 30% seed control of
centroid_y — squarely between
NoobAI’s 54% and
Flux’s 5%. The inversion is not a
switch that one design choice flips. It is a slope that two independent
choices descend, roughly half each.
NoobAI and Flux differ on five confounded axes at once: backbone (U-Net vs MMDiT), training objective (DDPM vs rectified flow), parameter count, text encoder, and training data. With only those two models, every causal story is observationally identical. You cannot say whether the seed lost its grip because the architecture changed or because the loss changed — the two moved together.
SD3.5 was the critical experiment because it breaks the confound. It shares its backbone with Flux (MMDiT) but shares its training objective with NoobAI (DDPM-family, not rectified flow). It also matches Flux on text encoder (T5 + CLIP) and parameter scale class (8 B vs 12 B). So SD3.5 holds the training axis at NoobAI’s setting while flipping the architecture axis to Flux’s. Whatever it does, it isolates one axis at a time.
Reading the three filled cells of a 2×2 design
(architecture × training objective) gives an additive partition
on centroid_y:
Each design choice removes roughly half the seed’s compositional control, and neither alone explains the full inversion. Changing architecture while holding the objective costs about 24 percentage points (54 → 30); changing the objective while holding the architecture costs about 25 more (30 → 5). The permutation tests confirm both steps independently: SD3.5 sits significantly below its U-Net neighbour Animagine (Δ ≈ −20 pp, BH-adjusted p = 0.0002) and significantly above Flux (Δ = +21.5 pp, BH-adjusted p = 0.0002). SD3.5 carves out statistically distinct ground from both families.
Seed share of centroid_y variance, at matched 64 × 10 protocol. The slope is read in two independent steps: swap the backbone (NoobAI → SD3.5), then swap the training objective (SD3.5 → Flux). Each removes about half the control.
Before the SD3.5 sweep ran, three outcomes were written down with probabilities: Scenario A (looks like Flux, p = 0.55), Scenario B (looks like NoobAI, p = 0.15), and Scenario C (intermediate, seed ≈ 20–35%, p = 0.30). The observed 30.2% with CI [16.7 – 50.5] landed inside Scenario C’s point-estimate band, although the interval itself is wider than that band. The graded reading was one of the outcomes committed to in a frozen forecast, not a narrative assembled after seeing the number; it was not the outcome assigned the highest prior probability.
Three of the four cells are now filled. The missing cell is a U-Net trained with rectified flow, the opposite corner from SD3.5. The additive partition predicts it should land near 30%, reached from the other direction: U-Net’s +24 pp of architecture offsetting rectified flow’s −25 pp of objective. A clean ~30% there would support the reading that the two axes are additive rather than interacting.
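The partition arithmetic can be checked in a few lines. This is a sketch that takes the text's reading of the three filled cells at face value and uses the headline bootstrap means; the function name is hypothetical.

```python
def additive_partition(noobai, sd35, flux):
    """Split the NoobAI -> Flux drop into two steps and predict the empty cell.

    Follows the 2x2 reading in the text: NoobAI -> SD3.5 is the architecture
    step, SD3.5 -> Flux is the objective step. Under pure additivity the
    missing corner equals NoobAI minus the objective step.
    """
    architecture_step = noobai - sd35
    objective_step = sd35 - flux
    predicted_missing = noobai - objective_step
    return architecture_step, objective_step, predicted_missing


arch, obj, missing = additive_partition(54.4, 30.2, 5.0)
print(round(arch, 1), round(obj, 1), round(missing, 1))
```

Under additivity the prediction is the same whether it is reached from NoobAI (subtract the objective step) or from Flux (add the architecture step), since both reduce to NoobAI − SD3.5 + Flux.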
A second axis is under test now. PixArt-Sigma is a DiT: a transformer diffusion model, but not MMDiT. It injects text through cross-attention, more like a U-Net at the conditioning level. If it sits near the U-Net family (~40–50%), the operative split is the multimodal-attention pattern specific to MMDiT, not transformers in general. The pre-registered prediction is 25–40% (“matches transformer, not MMDiT”, p = 0.55).
Section 05 — the panel
The matrix is not a convenience sample. Every model earns its place by holding some variables fixed while one changes, so that any single observed difference can always be cross-checked against another pair. The headline number on each card is the vertical-centroid seed contribution — the share of variance in where the subject sits in the frame that is set by the seed rather than the prompt. It runs from 54.4% on NoobAI down to 5.0% on Flux.
Four twin pairs hold the panel honest, each isolating one confound:
- SD1.5 · centroid_y seed share 35.9% (measured). Tests whether smaller, older U-Nets show the effect; they do.
- NoobAI · centroid_y seed share 54.4% (measured). The headline cell, measured first.
- Animagine · centroid_y seed share 48.1% (measured). Independent-lineage replication of NoobAI.
- Optional second replication · centroid_y seed share: not run.
- PixArt-Sigma · centroid_y seed share: in flight. DiT but not MMDiT: isolates transformer-vs-MMDiT.
- Flux · centroid_y seed share 5.0% (measured). The inversion cell.
- SD3.5 · centroid_y seed share 30.2% (measured). The disentanglement cell: same arch as Flux, training like NoobAI.
Reference
The open text-to-image ecosystem is a handful of architectural lineages, each with a different way of injecting text, a different training objective, and a different age. The seed-composition behaviour we measure tracks two of these axes — the backbone (how text and image interact) and the training objective (how the model learns to denoise). This table lays the families side by side so you can see where each measured model sits.
| Family | Backbone | Text enters via | Training objective | Text encoder | Params | Released | Reception & adoption | Seed→centroid_y |
|---|---|---|---|---|---|---|---|---|
| Stable Diffusion 1.x (SD 1.4 / 1.5 · ToonYou, etc.) | U-Net | cross-attention | DDPM · ε-prediction | CLIP ViT-L/14 | ~0.9 B | Aug–Oct 2022 | Foundational: launched the open ecosystem; thousands of fine-tunes/LoRAs, still used | 35.9% |
| Stable Diffusion 2.x (SD 2.0 / 2.1) | U-Net | cross-attention | DDPM · ε/v-prediction | OpenCLIP ViT-H | ~0.9 B | Nov 2022–23 | Rejected: encoder swap + dataset filtering broke prompts; community stayed on 1.5 | not run |
| SDXL (SDXL 1.0 · NoobAI · Animagine · Pony · Illustrious) | U-Net | cross-attention | DDPM · ε/v-prediction | CLIP-L + OpenCLIP bigG | 2.6 B | Jul 2023 | Dominant: the open standard; ~70–80% of open-weight T2I use, vast fine-tune base | 48–54% |
| PixArt (PixArt-α / PixArt-Σ) | DiT | cross-attention | DDPM-style · often step-distilled | T5-XXL | 0.6 B | 2024 | Niche: research-respected (efficient, T5); small consumer/fine-tune community | 10.0% |
| Stable Diffusion 3 / 3.5 (SD3.5-Large) | MMDiT | text+image tokens, every block | rectified flow · real CFG | T5-XXL + dual CLIP | 8 B | 2024 | Rocky: SD3 launch hit license + quality backlash; 3.5 recovered some, lost ground to Flux | 30.2% |
| Flux (Flux.1-dev / schnell) | MMDiT | text+image tokens, every block | rectified flow · guidance-distilled | T5-XXL + CLIP-L | 12 B | Aug 2024 | Ascendant: new high-end open standard; fast-growing fine-tune momentum | 5.0% |
| Video, Phase 2 (Wan 2.2 · HunyuanVideo) | MMDiT (video) | text+image+time tokens | rectified flow | T5 / umT5 | | 2024–25 | Rising: leading open video models; adoption climbing fast | planned |
Training objective colour key: DDPM / score-matching vs rectified flow / flow-matching. Seed→centroid_y is the measured share of vertical-composition variance the seed explains (bootstrap mean).
The research program
Every experiment, grouped by what it does for the argument — with what it measures,
where it runs on the GH200, how long it takes, its priority, and what each outcome would imply and
where it points next. Generated from the live registry at tools/experiments.json;
run tools/inquisition for live status.
The measured cells that establish the effect and its graded shape across architectures.
| Pri | What | What it measures | Where to run | Duration | Importance · implication · direction |
|---|---|---|---|---|---|
| P0 | done · NoobAI XL: headline cell | SDXL U-Net + DDPM (baseline of the seed-dominant family) | GH200 · ~11 GB | 27 min | Proof the effect exists: the seed fixes ~half of vertical composition in SDXL. Anchors the whole program. |
| P0 | done · Animagine XL: fine-tune replication | holds architecture, varies training corpus | GH200 · ~11 GB | 58 min | Rules out a single-checkpoint artifact → lets us say 'SDXL family', not 'one model'. |
| P0 | done · SD1.5: scale/vintage cell | holds U-Net+DDPM, varies scale (0.9B) and base training | GH200 · ~5 GB | 28 min | Effect predates SDXL → it is a U-Net+DDPM property, not SDXL-specific or scale-specific. Pushes the cause toward backbone/objective. |
| P0 | done · Flux.1-dev: inversion cell | MMDiT + rectified flow + guidance distillation | GH200 · ~24 GB | 1.4 h | The inversion. Opens the central question: is it the MMDiT backbone or the rectified-flow objective? |
| P0 | done · SD3.5-Large: disentanglement cell | MMDiT + DDPM; breaks the architecture/training confound | GH200 · ~20 GB | 1.9 h | Graded middle (30%) → the inversion is NOT binary; both axes contribute. Forces the 2×2 design. (Note: SD3.5 is rectified-flow, like Flux, so vs Flux it isolates distillation+scale, not objective.) |
The decisive cells that separate backbone from training objective from distillation. Highest research leverage.
| Pri | What | What it measures | Where to run | Duration | Importance · implication · direction |
|---|---|---|---|---|---|
| P1 | queued · SDXL-Lightning: distilled U-Net (breaks distillation confound) | distilled U-Net; if it collapses to single digits, distillation dominates even architecture | GH200 · ~11 GB | 6 min | Distilled U-Net. Collapses ⇒ distillation alone suppresses the seed even on a U-Net (distillation is the dominant cause). Stays ~50% ⇒ distillation is not the cause. Isolates distillation from architecture. Prediction: if distillation drives it: ~5-15%; if architecture drives it: stays ~50% |
| P1 | queued · PixArt-alpha @ real CFG: non-distilled DiT-x-attn | non-distilled DiT cross-attention; the other half of the distillation test | GH200 · ~18 GB | 20 min | Non-distilled DiT at real CFG. Rises toward U-Net ⇒ PixArt's collapse was distillation. Stays ~10% ⇒ it is the DiT backbone. Removes the distillation confound from the PixArt result. |
| P1 | done · InstaFlow-0.9B: the missing 2x2 cell | U-Net + RECTIFIED FLOW; the empty quadrant of the 2x2 | GH200 · ~6 GB | 10 min | THE missing corner: U-Net + rectified flow. ~30% ⇒ objective is ~half the effect (additive model confirmed); ~50% ⇒ objective irrelevant, backbone rules; ~5% ⇒ rectified flow alone kills it. This single cell decides the causal decomposition. Prediction: centroid_y seed ~30% if architecture and training contribute additively |
Extra SDXL fine-tunes that tighten the generalization claim. Low new information, high robustness.
| Pri | What | What it measures | Where to run | Duration | Importance · implication · direction |
|---|---|---|---|---|---|
| P2 | queued · Pony Diffusion XL: SDXL replication #3 | third independent SDXL fine-tune lineage | GH200 · ~11 GB | 58 min | Third independent SDXL fine-tune. Tightens within-family CIs; strengthens 'SDXL family' generalization. Little new direction. |
| P2 | queued · Illustrious XL: retry | SDXL U-Net replication | GH200 · ~11 GB | 58 min | Fourth SDXL fine-tune (blocked on a diffusers config). Same role as Pony; unblock then run. |
Referee defenses: show the effect is not an artifact of sampler, prompt set, or guidance regime.
| Pri | What | What it measures | Where to run | Duration | Importance · implication · direction |
|---|---|---|---|---|---|
| P1 | queued · Sampler ablation (NoobAI: Euler-a vs DDIM) | does stochastic (ancestral) sampling dissolve the seed-composition coupling? | GH200 · ~11 GB | 27 min | Euler-a vs DDIM on NoobAI. Holds ⇒ the effect is not a DDIM-determinism artifact. A standard reviewer challenge; answer it pre-emptively. |
| P1 | queued · Prompt-set sensitivity (2 alt sets x NoobAI,Flux) | is the effect specific to our 10 prompts? | GH200 · ~11 GB | 27 min | Two alternate prompt sets × NoobAI, Flux. Split stays ⇒ the seed%/prompt% partition is not an artifact of one prompt set. Controls prompt-set sampling uncertainty. |
| P2 | queued · CFG dose-response on Flux + SD3.5 | can ANY CFG on MMDiT recover seed control? (U-Net showed monotonic decline) | GH200 · ~24 GB | 41 min | CFG dose-response on MMDiT, mirroring the NoobAI sweep. MMDiT seed% staying low across all CFG ⇒ the inversion is not a guidance-regime artifact. Completes the CFG story across architectures. |
Orthogonal evidence and statistical backbone: inversion arm, learned features, Bayesian model.
| Pri | What | What it measures | Where to run | Duration | Importance · implication · direction |
|---|---|---|---|---|---|
| - | done · PixArt-Sigma: DiT-not-MMDiT control | transformer with CROSS-ATTENTION text injection (not multimodal mixing) | GH200 · ~18 GB | 20 min | REFUTED prediction: landed 10% (Flux regime), not 25-40%. Raises guidance-distillation as the driver. PRE-REG 25-40% prob 0.55: REFUTED; actual 10.0 [4.0-22.3] |
| P2 | done · DINOv2 patch-token features (re-measure existing sweeps) | richer composition features than Otsu centroid | GH200 GPU | ~min | Learned DINOv2 patch features vs hand-built centroid/palette. Inversion holding in DINOv2 space ⇒ not an artifact of our feature choice. Generalizes the feature basis. (Done on SD3.5 + PixArt.) |
| P1 | queued · DDIM inversion arm (bug fixed): NoobAI + Flux | independent evidence: invert real image -> noise -> regen, measure composition recovery | GH200 · ~11 GB | 43 min | Recover the seed from a real image, regenerate, measure. Forward and inverse agreeing ⇒ composition really lives in the noise (U-Net) or really does not (MMDiT). Orthogonal evidence for the causal claim. |
| P2 | queued · Beta GLMM across all cells (PyMC, NUTS) | bounded-[0,1] hierarchical model of seed fraction | CPU · this box | ~min | Bounded-proportion Bayesian model across all cells → one coherent posterior on the seed fraction with partial pooling. The statistical backbone reviewers will ask for. |
Where the program goes if the matrix holds: the fastest-growing modality.
| Pri | What | What it measures | Where to run | Duration | Importance · implication · direction |
|---|---|---|---|---|---|
| P3 | future · Phase 2: video models (Wan I2V + Hunyuan) | does the inversion propagate to video models that inherit T2I backbones? | GH200 · ~30 GB | 1.3 h | Do video models inherit the seed-composition behaviour of their T2I backbones? Extends the finding to the fastest-growing modality (Wan I2V, HunyuanVideo). Phase 2 of the program. |
Priority: P0 done · P1 referee-critical / resolves the
core causal question · P2 strengthens rigor or replication · P3 expansion.
Where-to-run is the GH200 at this box; VRAM is observed or estimated. Duration uses measured
seconds-per-image where a sweep has completed, else a documented estimate.
S06 · Statistical rigor
The headline finding is a claim about variance: the seed explains roughly half of NoobAI’s vertical composition and almost none of Flux’s. A variance-fraction is a ratio of estimated quantities, bounded to [0, 1], computed over a small, hand-curated prompt set, and then compared across many model pairs. Each of those four properties is a way the claim could be an artefact rather than a fact — sampling noise, the wrong noise model, no formal test, or sheer multiplicity of comparisons. The four methods below each close one of those gaps. Together they convert “the numbers look very different” into “the difference is +47.1 percentage points, permutation p ≤ 0.0002, surviving FDR correction across the whole family of tests.”
METHOD 01 · Block bootstrap
What it does
Resamples the 10 prompts with replacement 5000 times, recomputing the variance decomposition each time, and reads the 2.5th / 97.5th percentiles as a 95% confidence interval on every fraction.
Why we used it
A point estimate needs error bars; blocking on prompts respects the seed×prompt grid structure rather than treating cells as exchangeable.
Result it produced
NoobAI centroid_y seed% = 49.8 [46–53]; Flux = 2.7 [1.4–4.6].
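The resampling step can be sketched as follows. This is a minimal sketch, assuming the seed share is a simple sum-of-squares ratio and that the interval is read from plain percentiles of the resampled statistic; the function names are hypothetical and the project's own estimator may differ.

```python
import random


def seed_share(grid):
    """Seed main-effect share of the total sum of squares; grid[s][p]."""
    n_seeds, n_prompts = len(grid), len(grid[0])
    grand = sum(sum(row) for row in grid) / (n_seeds * n_prompts)
    ss_seed = n_prompts * sum((sum(row) / n_prompts - grand) ** 2 for row in grid)
    ss_total = sum((v - grand) ** 2 for row in grid for v in row)
    return ss_seed / ss_total if ss_total > 0 else 0.0


def block_bootstrap_ci(grid, n_boot=5000, alpha=0.05, rng=None):
    """Resample whole prompt columns with replacement; return (mean, lo, hi)."""
    rng = rng or random.Random(0)    # fixed default seed so runs repeat
    n_prompts = len(grid[0])
    stats = []
    for _ in range(n_boot):
        # One block = one prompt: every seed keeps its value for that prompt.
        cols = [rng.randrange(n_prompts) for _ in range(n_prompts)]
        resampled = [[row[c] for c in cols] for row in grid]
        stats.append(seed_share(resampled))
    stats.sort()
    lo = stats[int((alpha / 2) * (n_boot - 1))]
    hi = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return sum(stats) / n_boot, lo, hi
```

Called on a 64 × 10 grid of centroid_y values, `block_bootstrap_ci(grid)` returns the bootstrap mean and a percentile interval. Resampling columns rather than individual cells is what keeps each seed's row intact.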
METHOD 02 · Permutation test
What it does
Shuffles the model labels within each prompt 5000 times to build a null distribution for the between-model difference, then asks how often a shuffle matches or beats the observed gap.
Why we used it
It yields a non-parametric p-value for “is NoobAI’s seed% really different from Flux’s?” with no distributional assumptions on a bounded ratio.
Result it produced
NoobAI vs Flux centroid_y: Δ = 0.471, p ≤ 0.0002 (Monte Carlo floor).
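One way to sketch the label shuffle is shown below. It rests on an assumption about the protocol: "within each prompt" is read here as exchanging the two models' images cell by cell, at the same seed index and prompt, with probability one half. The function names are hypothetical, and the p-value uses the common add-one Monte Carlo estimator, which gives a floor of about 1/B.

```python
import random


def seed_share(grid):
    """Seed main-effect share of the total sum of squares; grid[s][p]."""
    n_seeds, n_prompts = len(grid), len(grid[0])
    grand = sum(sum(row) for row in grid) / (n_seeds * n_prompts)
    ss_seed = n_prompts * sum((sum(row) / n_prompts - grand) ** 2 for row in grid)
    ss_total = sum((v - grand) ** 2 for row in grid for v in row)
    return ss_seed / ss_total if ss_total > 0 else 0.0


def permutation_test(grid_a, grid_b, n_perm=5000, rng=None):
    """Return (observed gap, p-value) for the difference in seed share."""
    rng = rng or random.Random(0)
    observed = abs(seed_share(grid_a) - seed_share(grid_b))
    hits = 0
    for _ in range(n_perm):
        perm_a, perm_b = [], []
        for row_a, row_b in zip(grid_a, grid_b):
            new_a, new_b = [], []
            for a, b in zip(row_a, row_b):
                # ASSUMPTION: labels are exchanged per (seed, prompt) cell.
                if rng.random() < 0.5:
                    a, b = b, a
                new_a.append(a)
                new_b.append(b)
            perm_a.append(new_a)
            perm_b.append(new_b)
        if abs(seed_share(perm_a) - seed_share(perm_b)) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

With B = 5000 and no shuffle matching the observed gap, this estimator returns 1/5001, which is the Monte Carlo floor quoted above.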
METHOD 03 · Beta GLMM
What it does
A Bayesian generalized linear mixed model with a Beta likelihood and a logit link, with crossed random effects for seed, prompt, and their interaction; NUTS draws the posterior over each variance component.
Why we used it
Variance fractions are bounded proportions in [0, 1]; a Beta model respects that support — where Gaussian ANOVA would not — and returns posterior credible intervals directly.
Role
Cross-checks the bootstrap CIs with partial pooling, robust to noisy small cells (S = 64, P = 10).
METHOD 04 · Benjamini–Hochberg (BH) FDR correction
What it does
Sorts the raw p-values from the ~18–30 pairwise comparisons and adjusts each by its rank, rejecting only those that clear the stepped threshold so the expected false-discovery share stays at 5%.
Why we used it
With dozens of comparisons some will look significant by chance; BH controls the false-discovery rate without the brutal power loss of Bonferroni on correlated tests.
Result it produced
12 of 18 composition comparisons reject at the 0.001 level after BH.
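The rank-based adjustment can be sketched in a few lines of plain Python. This is the standard Benjamini–Hochberg step-up procedure, written as a sketch with a hypothetical function name; the p-values in the usage line are illustrative, not the project's.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is capped by
    # the ones above it, which keeps the adjusted sequence monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative raw p-values only.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A comparison is rejected at false-discovery level q when its adjusted value is at most q, so applying a 0.001 cut-off to the adjusted values corresponds to the "reject at the 0.001 level after BH" statement.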
Why all four, not one. The bootstrap puts error bars on the estimate but cannot test a difference; the permutation test tests the difference but assumes nothing about the estimate’s shape; the Beta GLMM honours the bounded support that both of those approximate; and BH keeps the whole family of comparisons honest. Remove any one and the claim reverts to “the numbers look different,” which is exactly the gap this layer was built to close. Full protocol in 02-method/STATISTICAL_RIGOR.md and the pairwise table in MATRIX_RESULTS.md §7.
S07 · NoobAI · CFG dose-response
How much of the U-Net seed-dominance is a guidance-regime effect rather than an irreducible property of the architecture? We sweep NoobAI across five classifier-free-guidance scales and re-measure how much of each composition feature’s variance the seed explains. The seed’s compositional grip declines monotonically as guidance climbs — but even at CFG = 10, vertical placement stays ~43% seed-driven, roughly nine times Flux’s 5%. The inversion is therefore not just “MMDiT operates at high effective CFG.” Half the gap survives any guidance choice.
Each plotted value is a seed_frac point
estimate from a two-way variance decomposition; the percentage tells you
how much of where the subject lands is dictated by the random latent
rather than the prompt. The dashed amber guide marks CFG = 4,
the setting most practitioners use; the dashed cyan line marks the Flux
MMDiT reference (centroid_y ≈ 5%). At CFG = 1 — no
guidance, pure prompt-conditional sampling — the seed governs
82.3% of vertical and 83.2% of horizontal placement: the
folklore “the seed is the composition” is most true here.
As guidance pulls harder toward the prompt the seed’s grip falls
monotonically across all three features. The takeaway:
NoobAI at the strongest feasible guidance (CFG = 10) still sits at
43.0% on centroid_y — with a 95% bootstrap lower bound of
30.8%, roughly twice Flux’s upper bound of 16.0%. The
guidance regime erases about half of the U-Net → Flux gap;
the other half is irreducible architecture. “MMDiT is just high
effective CFG” is ruled out — if it were true, NoobAI at CFG = 10
would reach Flux, and it does not. Values traced to NoobAI
cfg-20260524-231325/cfg_*/bootstrap_variance.json
(MATRIX_RESULTS.md §5).
S08 · complementary test
The variance decomposition is a forward test: vary the seed across a fixed prompt, then measure how much the output composition spreads. It answers “does changing the noise move the subject?” The DDIM-inversion arm asks the mirror-image question. Instead of starting from random noise, it takes a real image, runs the sampler in reverse to recover the latent noise that would have produced it, and then checks whether that recovered noise still carries the compositional fingerprint. Forward: fix the prompt, vary the seed, watch the output. Inversion: fix the output, recover the seed, ask whether composition was ever in the noise to begin with.
This is a supporting arm, not the headline. The headline rests on the variance decomposition: the NoobAI-vs-Flux composition difference is significant at p ≤ 0.0002 (permutation, B = 5000; this is the Monte Carlo floor, meaning zero of 5000 label shuffles beat the observed gap) on all three compositional axes, centroid_y, centroid_x, and fg_fraction. Inversion does not add to that p-value. It corroborates the same finding from an orthogonal direction: a positive result would mean the seed basins are intrinsic to the model rather than an artifact of where Gaussian sampling happens to start.
A skeptic can dismiss the forward result as an artifact of sampling from a Gaussian: maybe random integer seeds happen to land in structured regions, and the “seed controls composition” effect is a quirk of how we draw noise rather than a property of the model. Inversion removes that escape hatch. The starting latents are no longer random draws — they are reconstructed from genuine photographs by reversing the deterministic DDIM trajectory at a null prompt.
If the same per-feature pattern holds for inverted seeds the way it holds for random ones — composition seed-driven on U-Net, prompt-driven on MMDiT — then the basin structure is intrinsic to the model, not an accident of Gaussian sampling. If it dissolves, the forward effect was a sampling artifact. That is the whole point: the two arms can disagree, and a clean replication under inversion is a much stronger claim than either arm alone.
The arm builds an N × P
grid: 32 inverted real images crossed with the same
curated 32-prompt set (1024
generations), each combination generated forward
at cfg = 4.0 over 28 DDIM steps. From every
generated image it extracts the 18-dimensional feature vector and
decomposes variance over the
(inverted_seed × prompt) design,
exactly mirroring the forward sweep. The N is deliberately small —
inversion is expensive — and acceptable only because the effect, if
present, is expected to be large.
centroid_x, centroid_y, fg_fraction: expected seed-driven on U-Net.
NoobAI inversion arm · dtype bug fixed, sweep re-queued
DDIM inversion is far more sensitive to numerical precision than
forward sampling, because it integrates the trajectory backwards and
accumulates error over every step. The SDXL VAE in particular
produces unstable encodings in fp16 — small
rounding errors in the latent feed directly into the recovered
ε and corrupt the fingerprint we are trying to measure. The
encode step therefore casts the VAE to float32 before
touching the image, so the latent that anchors the whole inversion is
computed at full precision.
That fix introduced a second, opposite trap. The VAE was promoted to
fp32 for encoding but never restored, so the forward
generation pass then mixed an fp32 VAE with
fp16 latents and crashed with
Input type (c10::Half) and bias type (float) should be the same.
The arm now explicitly restores the VAE to the pipeline’s compute
dtype after inversion and before any forward pass. We flag this not as
trivia but as a guard rail: a precision bug in the encoder would have
silently degraded the inverted seeds and biased the comparison against
the very effect the arm exists to test.
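The promote-then-restore pattern described above can be sketched as a small context manager. This is a sketch, not the arm's actual code: `temporarily_cast` and `FakeVAE` are hypothetical names, and the helper assumes the wrapped object exposes a `.dtype` attribute and an in-place `.to(dtype)` method. The stand-in class exists only so the example runs without a GPU stack.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_cast(module, dtype):
    """Cast `module` to `dtype` inside the block, then restore its old dtype.

    ASSUMPTION: `module.dtype` is readable and `module.to(dtype)` changes the
    module in place.
    """
    original = module.dtype
    module.to(dtype)
    try:
        yield module
    finally:
        # Runs even if encoding raises, so later passes never see a stale dtype.
        module.to(original)


class FakeVAE:
    """Hypothetical stand-in that only tracks its dtype."""

    def __init__(self, dtype):
        self.dtype = dtype

    def to(self, dtype):
        self.dtype = dtype
        return self


vae = FakeVAE("float16")
with temporarily_cast(vae, "float32"):
    print("encode runs in", vae.dtype)
print("forward pass runs in", vae.dtype)
```

Putting the restore in a `finally` clause is the design point: the crash described above came from a code path that promoted the VAE and never put it back.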
Frequently asked questions
Drawn from the project's working FAQ, calibrated for researchers, practitioners, and curious readers alike. Every number here traces to a measured cell or a stated confidence interval.
The seed is an integer that selects the specific pattern of random noise the model starts from. Same seed, same prompt, same settings produces the same final image, bit for bit. A different seed means different starting static, and so a different output even with an identical prompt.
Practitioners use seeds two ways: as a reproducibility tool (regenerate exactly this image later) and, on SDXL, as a hidden compositional control (“seed 47 gives a tight portrait; seed 109 gives a wider shot”). The randomness lives at the step where the seed is chosen, not in what the seed then does.
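The determinism described above can be illustrated with a seeded Gaussian generator. This is a toy sketch using Python's standard library as a stand-in: real pipelines draw a full latent tensor from their framework's seeded generator, and `initial_noise` is a hypothetical name.

```python
import random


def initial_noise(seed, n=8):
    """Draw n standard-normal values from a generator fixed by `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


# The same seed reproduces the same starting noise; a different seed does not.
print(initial_noise(47) == initial_noise(47))
print(initial_noise(47) == initial_noise(109))
```

Everything downstream of this draw is deterministic given fixed settings, which is why the choice of integer is the only place the randomness enters.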
Because the number has been hiding a control surface that was never labeled as one. Every tutorial since 2022 described the seed as “a number to fix for reproducibility — random otherwise.” We measured that on SDXL the seed is doing about half the work of placing the subject in the frame (NoobAI centroid_y seed fraction 54.4%, 95% CI [42.8–66.0]) — comparable in influence to the prompt itself.
That makes the seed a covert compositional knob practitioners have been intuitively exploiting (“seed banks”, “lucky seeds”) for years without anyone formally measuring it. On Flux the same knob has moved — from the seed to the prompt — which has real consequences for any workflow that depended on it.
Not yet. It is currently a preprint — a public manuscript released before formal peer review. The predictions are pre-registered (outcome probabilities committed before the data lands), and the headline inversion is defensible at workshop scope today: p ≤ 0.0002 on the permutation test, replicated across two independent feature extractors, with non-overlapping 95% confidence intervals on the two architectures.
The plan is a workshop submission soon, then a main-track submission after the disentanglement experiments complete. It has not yet been submitted. The preprint format lets the result circulate and get critiqued in the open; everything reported is backed by public data, so anyone can reproduce or contest the numbers.
Yes, on SDXL-family models. Fix the seed to lock composition — seed banks are exactly the right practice on SDXL, Illustrious, NoobAI, Animagine, and Pony, which are all the same U-Net architecture. Our measurement is the explanation for why that workflow works: the seed carries roughly half the compositional variance.
On Flux the seed will not do that. If you move to a Flux pipeline, the seed bank stops carrying composition and you switch to a canonical prompt prefix (encode the framing in the first tokens), IPAdapter at higher weight for character lock, and image-to-video for animation. The finding is descriptive, not prescriptive — if your current pipeline ships images that look right, it validates that choice. It becomes relevant when you migrate between model families.
MMDiT (multimodal diffusion transformer) is the architecture used by Flux and SD3/SD3.5, where text tokens and image tokens mix at every transformer block. A U-Net (SD1.5, SDXL and its fine-tunes) instead injects text at specific resolutions via cross-attention. That structural difference changes how compositional information is routed.
In our measurement, the locus of compositional control moves: on the U-Net it lives largely on the seed (the noise fiber); on MMDiT it lives on the prompt path. We hypothesize MMDiT's text-everywhere attention is what routes composition through the prompt, but we have not yet measured attention attribution — so we say “the relationship inverts between these two models,” not “MMDiT causes the inversion.” The wording matters and we hold it until SD3.5 lands.
Indirectly. Many video models inherit a text-to-image backbone: MMDiT video models (Wan 2.2, HunyuanVideo) extend the MMDiT family, and older U-Net-based video models extend the U-Net family. The architectural prediction carries over. A video model built on an MMDiT backbone should sit in the prompt-dominant regime, and one built on a U-Net backbone closer to the seed-dominant regime.
We have not directly measured video variance decomposition. In practice the recommended pattern is to set composition once in a still keyframe — where the seed gives you that control on SDXL — then animate forward with an image-to-video model that takes the composition as given.
Li et al. documented initial-seed effects on object placement within a single architecture family (Stable Diffusion and PixArt-α). That work establishes that the seed influences where things go, but it does not decompose the effect into a variance fraction or compare U-Net against MMDiT.
Our contribution is the cross-architecture contrast: we measure the variance fraction explicitly and show that the locus of compositional control moves — from the seed (U-Net, ~50%) to the prompt (MMDiT, ~5%). Single-architecture findings show the seed matters in SDXL; ours is the first to show that whether it matters is itself an architectural property.
Yes. The headline inversion on centroid_y is backed by a permutation test with B = 5000 shuffles of the model label. Observed Δ = 0.472 (NoobAI 0.498 − Flux 0.026); the 99th percentile of the null was 0.040, and none of the 5000 shuffles matched or exceeded the observed Δ, which puts the result at the Monte Carlo floor, p ≤ 0.0002.
Bootstrap CIs alone are necessary but not sufficient for a difference claim. The permutation test is the formal test, it is distributional-assumption-free, and it survives Bonferroni correction for ~50 comparable tests.
No, because we replicated it with an independent feature extractor. DINOv2-large semantic features from the same NoobAI sweep give the top principal components as 75–91% prompt-driven, while centroid_y is 54% seed-driven. That dissociation — “where the subject is” is seed-driven, “what the subject is” is prompt-driven — comes from two extractors trained on different objectives.
On Flux the same DINOv2 measure gives 1.4% seed contribution, agreeing with the 5% centroid number. If centroid were an Otsu artifact, DINOv2 would not have to agree — and it does.
Yes — and we say so explicitly. NoobAI and Flux differ on six axes (architecture, training objective, parameter count, training data, text encoder, distillation). Until the disentanglement cells land we report “the relationship inverts between these two models,” not “the architecture causes it.”
The critical control is SD3.5-Large (MMDiT like Flux, but DDPM-style training like SDXL), which separates architecture from training objective. It is pre-registered with three outcome probabilities: architecture-wins 0.55, intermediate 0.30, training-wins 0.15. An intermediate result is one we committed to interpreting in advance.
Yes. All raw artifacts are already public at
github.com/quivent/lambda — image grids, feature tensors,
h-space activations, manifests, variance decompositions, and bootstrap CIs, at
the specific commit hashes that produced each reported number.
The lambda topology CLI provides one-command reproduction; total
wall time from scratch on a single 96 GB GH200 is about three hours. The
companion documents are public at
github.com/quivent/anime.productions and updated as data lands.
There is no gatekept release.
Five experiments are queued in priority order:
S10 / Document index
The full Inquisition corpus, organized by directory in reading
order. Every entry links to its source on GitHub. Path prefix:
research/inquisition/<dir>/<file>.