Reference-Image Conditioning — Show, Don't Tell, the AI Aesthetic

TL;DR: Reference-image conditioning means controlling what an AI model generates by showing it an example image instead of describing the look in words. You supply a reference for composition, palette, structure, or overall style, and the model matches it. For anyone who thinks in examples rather than adjectives this works far better than prose, and it is the only reliable route to a “look” you can recognize but can’t articulate. Different tools expose different reference slots (Midjourney’s --sref style and --cref character references, with a --cw weight dial; Flux Kontext’s context of up to 10 images; Soul guided generation), and the slots can be layered. The highest-leverage business application is brand coherence: feed the client’s own assets (store photos, prior creative, glossary/distinctive-assets) so the output inherits their visual world instead of a generic AI aesthetic. Reference beats prose for matching a known look; prose still wins for novel scenes with no good reference.

What it means

Every AI image or video generation is conditioned on something. The default is a text prompt — you describe the look in words and the model interprets. Reference-image conditioning swaps (or supplements) that text with an actual image: “make it look like this,” not “make it warm, editorial, with soft side-light and a muted palette.”

The reason this matters is a gap most prompt guides skip: a lot of people can recognize a good look instantly but cannot describe it. A technical, non-visual founder knows the competitor’s ad “looks stylish” but can’t name the lighting direction, the palette weights, or the composition grid that make it so. Asked to write a prose prompt, they produce vague adjectives the model can’t act on. Handed the reference route, they point at the image and the model extracts what they couldn’t put into words. Reference conditioning is show-don’t-tell for AI aesthetics — it routes around the description bottleneck.

What each reference slot controls

References aren’t monolithic — different slots control different layers, and the layers can be combined:

Composition reference — where elements sit in the frame, the spatial layout, negative space. Controls arrangement, not content.
Palette / color reference — dominant, secondary, and accent colors; warmth; mood. Controls the color story.
Structure / silhouette reference — the shape or form of a specific element (a curve, an object’s outline, an architectural line). Controls form — borrow a structure, render it in a new material.
Style reference — the overall aesthetic: rendering, texture, lighting feel, “house style.” Controls how it’s drawn, independent of what’s drawn.
Character / subject reference — a specific recurring person, mascot, or product identity held consistent across generations.

Layering multiple references is where control gets precise: a composition reference + a palette reference + a style reference together pin three independent axes, leaving the model to fill only the content. This maps directly onto the deconstruction layers in glossary/creative-reverse-engineering — you can reverse-engineer a winner into composition + palette + lighting + framing, then feed each back as a reference.
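A minimal sketch of the layering idea, assuming a hypothetical request object (the `Slot` enum, `GenerationRequest` class, and file paths are illustrative names, not any tool's real API). It shows three slots pinned by images while the content stays in prose, and reports which axes the model is still free to interpret:

```python
from dataclasses import dataclass, field
from enum import Enum


class Slot(Enum):
    """The reference layers listed above."""
    COMPOSITION = "composition"
    PALETTE = "palette"
    STRUCTURE = "structure"
    STYLE = "style"
    CHARACTER = "character"


@dataclass
class GenerationRequest:
    """Hypothetical request: prose carries content, images carry the look."""
    content_prompt: str
    references: dict = field(default_factory=dict)  # Slot -> image path

    def pin(self, slot: Slot, image_path: str) -> "GenerationRequest":
        # One image per slot keeps each axis independent of the others.
        self.references[slot] = image_path
        return self

    def free_axes(self) -> list:
        # Axes with no reference: the model decides these on its own.
        return [s for s in Slot if s not in self.references]


request = (
    GenerationRequest("a ceramic mug on a cafe table")
    .pin(Slot.COMPOSITION, "refs/winner_layout.jpg")
    .pin(Slot.PALETTE, "refs/brand_palette.jpg")
    .pin(Slot.STYLE, "refs/house_style.jpg")
)
print([s.value for s in request.free_axes()])  # ['structure', 'character']
```

How a given tool accepts several references differs (separate flags, an ordered image list, or named inputs), so treat this as a way to plan which image fills which slot rather than as a payload format.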

The main tools (June 2026)

Tool	Reference mechanism	Notes
Midjourney	`--sref` (style), `--cref` (character), `--cw` weight 0–100	The industry-standard consistency controls. `--sref` extracts the aesthetic from one image; `--cref` pulls identity from another; `--cw` sets how strictly the character reference is followed. Style and character refs combine in one prompt.
Flux Kontext	Context-aware editing, up to ~10 reference images	Natural-language local edits, character consistency, style transfer. More technical (often ComfyUI), strong multi-reference control.
Soul / Soul guided generation (Higgsfield)	Reference image + Soul HEX	Show-don’t-tell aesthetic control inside a production hub; pairs with placement/keyframe steps in the same tool.
Nano Banana Pro	Up to 14 reference images	Blends elements, holds product/person identity; strong for brand-consistent variant generation.
Seedance 2.0	Up to ~9 image references (+ video/audio)	Reference assets feed video generation directly — feed the product as a reference to lock appearance.

The general principle is shared across all of them: a reference image carries information no prompt of reasonable length can. A 30-word style description is a lossy compression of what one reference image specifies exactly.
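As an illustration of the Midjourney row in the table, here is a small sketch that assembles prompt text with the `--sref`, `--cref`, and `--cw` flags. The helper name `build_midjourney_prompt` and the example.com URLs are made up for the example, and flag behavior depends on the Midjourney version, so check the current docs before relying on it:

```python
from typing import Optional


def build_midjourney_prompt(
    content: str,
    style_ref_url: Optional[str] = None,
    character_ref_url: Optional[str] = None,
    character_weight: Optional[int] = None,
) -> str:
    """Assemble prompt text: prose content first, reference flags after."""
    parts = [content.strip()]
    if style_ref_url:
        parts.append(f"--sref {style_ref_url}")
    if character_ref_url:
        parts.append(f"--cref {character_ref_url}")
        # --cw only makes sense alongside a character reference.
        if character_weight is not None:
            if not 0 <= character_weight <= 100:
                raise ValueError("--cw is expected in the range 0-100")
            parts.append(f"--cw {character_weight}")
    return " ".join(parts)


prompt = build_midjourney_prompt(
    "a barista pouring latte art",
    style_ref_url="https://example.com/style.jpg",       # placeholder URL
    character_ref_url="https://example.com/mascot.jpg",  # placeholder URL
    character_weight=60,
)
print(prompt)
```

The prose part says what is in the frame; the two reference URLs carry the look and the identity that the prose never has to describe.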

The brand-coherence application (the high-leverage move)

The most valuable business use isn’t matching a generic “stylish” look — it’s matching the client’s own look. Feed the brand’s actual assets as references and the output inherits their visual world:

A store-decor photo → the AI scene carries the brand’s real environment, materials, and palette.
Prior creative → new variants sit consistently alongside the existing campaign.
The brand’s glossary/distinctive-assets → the recognizable cues (color, shape language, texture) survive into AI-generated work, so it still reads as them.

This connects the AI-production layer to brand strategy: glossary/distinctive-assets (per Sharp / Ehrenberg-Bass) are exactly the cues that should be fed as references, because they’re what trigger recognition. AI generation without reference conditioning tends toward a generic, recognizable-as-AI aesthetic that quietly erodes distinctiveness; reference conditioning on the brand’s own assets is how you keep AI-scaled creative on-brand.
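Before feeding a brand asset as a palette reference, it can help to check which colors it would actually contribute. The sketch below ranks coarse color buckets by frequency, a rough stand-in for the dominant, secondary, and accent colors of the palette slot. It is an assumption-laden simplification: the function name and the toy pixel list are invented for the example, and with a real photo you would first read the pixels with an image library (not shown here).

```python
from collections import Counter


def dominant_colors(pixels, top_n=3, bucket=32):
    """Rank coarse color buckets by how often they occur.

    pixels: iterable of (r, g, b) tuples with 0-255 channels.
    bucket: quantization step; larger values merge more near-identical shades.
    Returns a list of ((r, g, b), share_of_pixels), most frequent first.
    """
    def quantize(channel):
        # Snap to the centre of the bucket so similar shades count together.
        return (channel // bucket) * bucket + bucket // 2

    counts = Counter(tuple(quantize(c) for c in px) for px in pixels)
    total = sum(counts.values())
    return [(color, n / total) for color, n in counts.most_common(top_n)]


# Toy "store photo": 6 warm-beige pixels, 3 deep-green, 1 gold accent.
pixels = [(230, 215, 190)] * 6 + [(20, 70, 50)] * 3 + [(212, 175, 55)]
for color, share in dominant_colors(pixels):
    print(color, f"{share:.0%}")
# (240, 208, 176) 60%
# (16, 80, 48) 30%
# (208, 176, 48) 10%
```

If the ranking does not match the colors the brand is recognized for, the photo is probably a weak palette reference, whatever else it is good for.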

Brand coherence via structure, not literal copy (pattern to watch — N=1)

A sharper version of the coherence move, observed once and flagged for codification if it recurs: “make it look like the client” usually means carry their DNA through structure, not by copying decorative elements.

In a luxury-jewelry production, the client’s in-store decoration was a wave built from fabric and flowers; the video needed a literal water wave. Copying the flowers would have looked wrong; the bridge was the wave’s sculptural form — borrow the structure (the curl, the silhouette) and render it in the new material (water). The store photo became the structure reference; the material changed. The distinctive asset that transferred was the form, not the literal decoration.

The general rule this points at: when adapting a brand’s physical/offline assets into AI creative, identify which layer actually carries the brand DNA — often it’s the structure or silhouette, not the surface decoration — and feed that as the reference. This is N=1; it earns its own treatment only if a second client session validates the same move.

When reference beats prose — and when prose still wins

Reference wins when:

You’re matching a known look — a competitor’s aesthetic, a brand’s existing style, a mood you can show but not describe.
The user is non-visual and can’t articulate the look in actionable terms.
You need consistency across many generations (same style, same character, same product).
The look depends on subtle, hard-to-verbalize qualities (specific lighting feel, palette weighting, texture).

Prose still wins when:

The scene is novel — there’s no good reference for the specific thing you want (an unusual combination, a never-shot scenario).
You need precise semantic content — exact objects, counts, spatial relationships, text — which prose specifies more reliably than an image implies.
You’re deliberately avoiding an existing look (a reference would anchor you to it).
The reference would over-constrain, pulling in unwanted elements along with the wanted ones. (Reference conditioning often copies more than you intend; a too-strong reference weight imports composition you didn’t want along with the palette you did. Weight dials exist precisely to manage this: in Midjourney, --cw for character references and --sw for style references.)

The practical default for product/brand work: reference for the look, prose for the content. Show the model the aesthetic and the structure; tell it what’s actually in the frame.
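The two checklists above can be condensed into a rough decision helper. This is only a sketch: the `Brief` fields and the `plan_conditioning` function are hypothetical names that encode the conditions listed in this section, not a tested rule set.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    """Hypothetical flags describing one creative brief."""
    has_good_reference: bool         # an image of the target look exists
    avoiding_existing_look: bool     # deliberately steering away from a known style
    needs_exact_content: bool        # exact objects, counts, layout, or text
    needs_consistency: bool = False  # same look across many generations


def plan_conditioning(brief: Brief) -> dict:
    """Decide which channel carries the look; content always goes in prose."""
    use_reference = brief.has_good_reference and not brief.avoiding_existing_look
    notes = []
    if brief.needs_exact_content:
        notes.append("spell out objects, counts, and on-image text in the prompt")
    if brief.needs_consistency and not use_reference:
        notes.append("consistency across generations usually needs a reference")
    return {
        "look": "reference" if use_reference else "prose",
        "content": "prose",
        "notes": notes,
    }


brand_ad = Brief(has_good_reference=True, avoiding_existing_look=False,
                 needs_exact_content=True, needs_consistency=True)
print(plan_conditioning(brand_ad)["look"])     # reference

novel_scene = Brief(has_good_reference=False, avoiding_existing_look=False,
                    needs_exact_content=False)
print(plan_conditioning(novel_scene)["look"])  # prose
```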

Honest limits

Over-constraint is real. Strong references import composition, lighting, and incidental elements along with the intended quality. Use weight controls; isolate the slot you actually want (style-only, structure-only).
Reference ≠ guaranteed fidelity (but the line moved in mid-2026). For static images of ordinary products, reference / image-to-image editing (Nano Banana Pro, Seedream 4.5, Flux 2) now preserves a real product across generated scenes reliably enough to be the default: feed the real product photo, let the model regenerate only the environment, and verify each output. Hand-compositing the real photo is now the fallback, reserved for the classes that still break (reflective metal/glass/jewellery, fine on-pack text, exact-colour SKUs, electronics screens, regulated/hallmarked goods) and for all video (see marketing/ai-product-video-fidelity). The model still can’t invent a product it hasn’t seen; it re-stages one you supply. Reference conditioning therefore controls aesthetic and staging, and now static product fidelity for the easy majority, but it never guarantees exactness without a human check.
Tool-specific behavior varies and shifts. sref/cref behavior changes across Midjourney versions; Kontext behaves differently in ComfyUI than in hosted versions. The slots and weights named here reflect June 2026 and are worth re-checking before use.
The brand-structure pattern is N=1. The “carry structure not decoration” move has been observed on one production only; treat it as a hypothesis, not a law.
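
The fidelity rule of thumb above (reference / image-to-image for ordinary static products, hand-compositing for the risky classes and for all video) can be written down as a small routing function. The trait labels, the `Shot` class, and the return strings are illustrative assumptions; the categories themselves come from the limits listed above.

```python
from dataclasses import dataclass, field

# Product classes this section lists as still breaking (labels are illustrative).
HIGH_RISK_TRAITS = {
    "reflective_surface",       # metal, glass, jewellery
    "fine_on_pack_text",
    "exact_colour_sku",
    "electronics_screen",
    "regulated_or_hallmarked",
}


@dataclass
class Shot:
    medium: str                               # "image" or "video"
    traits: set = field(default_factory=set)  # subset of HIGH_RISK_TRAITS


def fidelity_route(shot: Shot) -> str:
    """Pick how to keep the real product intact in the final output."""
    if shot.medium == "video":
        return "hand_composite"               # all video stays on the fallback
    if shot.traits & HIGH_RISK_TRAITS:
        return "hand_composite"               # a class that still breaks
    # The easy majority: reference / i2i, but every output is still checked.
    return "reference_i2i_then_human_check"


print(fidelity_route(Shot("image")))                            # reference_i2i_then_human_check
print(fidelity_route(Shot("image", {"reflective_surface"})))    # hand_composite
print(fidelity_route(Shot("video")))                            # hand_composite
```

Note that even the default route ends in a human check; the function only decides where the product pixels come from, not whether someone looks at the result.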

Key Takeaways

Reference-image conditioning = controlling AI output by showing an example, not describing it — show-don’t-tell for AI aesthetics.
It routes around the description bottleneck for non-visual users and for looks you can recognize but can’t articulate.
Different slots control different layers (composition / palette / structure / style / character); layer them for precision.
Brand coherence is the high-leverage use — feed the client’s own assets and glossary/distinctive-assets so output reads as them, not as generic AI.
Reference for the look, prose for the content; weight controls manage over-constraint.

Related

marketing/ai-product-image-generation — the static-image application: re-stage a real product across scenes with reference/i2i (the default for ordinary static products as of mid-2026)
marketing/ai-product-video-fidelity — where reference conditioning controls aesthetic but real-product compositing handles fidelity (the wave-structure move lives in its worked example)
tools/ai-video-production-stack — the tools and their reference slots, in capability-map form
glossary/distinctive-assets — what to feed as brand references; the cues that should survive into AI creative
glossary/creative-reverse-engineering — deconstruct a winner into layers, then feed each layer back as a reference
glossary/focal-hierarchy — a composition reference is one way to control focal order
glossary/framing-archetype — a reference image often encodes the framing archetype directly
marketing/ai-video-marketing — the strategy context
glossary/creative-formula-vs-creative-skin — references can carry formula (safe) or skin (risky); know which you’re feeding
glossary/human-anchored-ai-multiplication — the framework-level case for conditioning on your own human-shot assets rather than generating from prose

Sources

Style Reference (Midjourney docs) — --sref behavior
Character Reference (Midjourney docs) — --cref and --cw weight
How to Create Consistent Characters in Midjourney 2026 (PromptsEra) — sref/cref as the 2026 consistency standard
Midjourney vs Flux 2026 (Neuronad) — Flux Kontext multi-reference (up to 10 images), comparison
Nano Banana Pro (Google blog) — up to 14 reference images, identity consistency
Seedance 2.0 Complete Guide (WaveSpeed) — multi-reference video input
Primores client production (anonymized luxury-jewelry session, June 2026) — the brand-coherence-via-structure pattern

By Andrej Ruckij