Reference-Image Conditioning — Show, Don't Tell, the AI Aesthetic

TL;DR: Reference-image conditioning is controlling what AI generates by showing it an example image rather than describing the look in words. You feed a reference for composition, palette, structure, or overall style, and the model matches it — which works far better than prose for anyone who thinks in examples rather than adjectives, and is the only reliable route for a “look” you can recognize but can’t articulate. Different tools expose different reference slots (Midjourney --sref style / --cref character with a --cw weight dial; Flux Kontext’s up-to-10-image context; Soul guided generation), and they can be layered. The highest-leverage business application: brand coherence — feed the client’s own assets (store photos, prior creative, glossary/distinctive-assets) so the output inherits their visual world instead of a generic AI aesthetic. Reference beats prose for matching a known look; prose still wins for novel scenes with no good reference.

What it means

Every AI image or video generation is conditioned on something. The default is a text prompt — you describe the look in words and the model interprets. Reference-image conditioning swaps (or supplements) that text with an actual image: “make it look like this,” not “make it warm, editorial, with soft side-light and a muted palette.”

The reason this matters is a gap most prompt guides skip: a lot of people can recognize a good look instantly but cannot describe it. A technical, non-visual founder knows the competitor’s ad “looks stylish” but can’t name the lighting direction, the palette weights, or the composition grid that make it so. Asked to write a prose prompt, they produce vague adjectives the model can’t act on. Handed the reference route, they point at the image and the model extracts what they couldn’t put into words. Reference conditioning is show-don’t-tell for AI aesthetics — it routes around the description bottleneck.

What each reference slot controls

References aren’t monolithic — different slots control different layers, and the layers can be combined:

  • Composition reference — where elements sit in the frame, the spatial layout, negative space. Controls arrangement, not content.
  • Palette / color reference — dominant, secondary, and accent colors; warmth; mood. Controls the color story.
  • Structure / silhouette reference — the shape or form of a specific element (a curve, an object’s outline, an architectural line). Controls form — borrow a structure, render it in a new material.
  • Style reference — the overall aesthetic: rendering, texture, lighting feel, “house style.” Controls how it’s drawn, independent of what’s drawn.
  • Character / subject reference — a specific recurring person, mascot, or product identity held consistent across generations.

Layering multiple references is where control gets precise: a composition reference + a palette reference + a style reference together pin three independent axes, leaving the model to fill only the content. This maps directly onto the deconstruction layers in glossary/creative-reverse-engineering — you can reverse-engineer a winner into composition + palette + lighting + framing, then feed each back as a reference.
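The layering idea can be sketched as a small, tool-agnostic planning helper. This is only an illustration: the `Slot`, `Reference`, and `plan_layers` names and the file paths are hypothetical, and nothing here calls a real generation API. It simply reports which axes a set of references pins and which are left for the prompt or the model to decide.

```python
from dataclasses import dataclass
from enum import Enum


class Slot(Enum):
    """The five reference layers described above."""
    COMPOSITION = "composition"
    PALETTE = "palette"
    STRUCTURE = "structure"
    STYLE = "style"
    CHARACTER = "character"


@dataclass(frozen=True)
class Reference:
    slot: Slot
    image_path: str  # hypothetical path or URL, for illustration only


def plan_layers(refs):
    """Return which slots are pinned by a reference and which stay free."""
    pinned = {ref.slot for ref in refs}
    return {
        "pinned": [slot.value for slot in Slot if slot in pinned],
        "free": [slot.value for slot in Slot if slot not in pinned],
    }


plan = plan_layers([
    Reference(Slot.COMPOSITION, "refs/winner_layout.jpg"),
    Reference(Slot.PALETTE, "refs/winner_palette.jpg"),
    Reference(Slot.STYLE, "refs/house_style.jpg"),
])
print(plan)
# {'pinned': ['composition', 'palette', 'style'], 'free': ['structure', 'character']}
```

Writing the plan down before generating makes it obvious which axes are still open, which is usually where unexpected drift comes from.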

The main tools (June 2026)

Tool | Reference mechanism | Notes
--- | --- | ---
Midjourney | --sref (style), --cref (character), --cw weight 0–100 | The industry-standard consistency controls. --sref extracts the aesthetic from one image; --cref pulls identity from another; --cw dials how strictly the character reference is followed. Style and character refs combine in one prompt.
Flux Kontext | Context-aware editing, up to ~10 reference images | Natural-language local edits, character consistency, style transfer. More technical (often run through ComfyUI), with strong multi-reference control.
Soul / Soul guided generation (Higgsfield) | Reference image + Soul HEX | Show-don’t-tell aesthetic control inside a production hub; pairs with placement/keyframe steps in the same tool.
Nano Banana Pro | Up to 14 reference images | Blends elements and holds product/person identity; strong for brand-consistent variant generation.
Seedance 2.0 | Up to ~9 image references (+ video/audio) | Reference assets feed video generation directly: supply the product as a reference to lock its appearance.

The general principle is shared across all of them: a reference image carries information no prompt of reasonable length can. A 30-word style description is a lossy compression of what one reference image specifies exactly.
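As a concrete illustration of the Midjourney row, here is a minimal sketch of assembling a prompt string that combines a style and a character reference. The `build_midjourney_prompt` helper and the example URLs are hypothetical; only the --sref, --cref, and --cw flags and the 0–100 range come from the table above, and their exact behavior should be checked against the Midjourney version in use.

```python
def build_midjourney_prompt(content, style_ref=None, character_ref=None,
                            character_weight=None):
    """Compose prose content plus reference flags into one prompt string.

    Hypothetical helper: it only builds text, it does not call Midjourney.
    """
    parts = [content.strip()]
    if style_ref:
        parts.append(f"--sref {style_ref}")
    if character_ref:
        parts.append(f"--cref {character_ref}")
        if character_weight is not None:
            # The 0-100 range is the one stated in the table above.
            if not 0 <= character_weight <= 100:
                raise ValueError("--cw is expected to be between 0 and 100")
            parts.append(f"--cw {character_weight}")
    elif character_weight is not None:
        raise ValueError("--cw only makes sense alongside --cref")
    return " ".join(parts)


prompt = build_midjourney_prompt(
    "ceramic mug on a linen tablecloth, morning light",
    style_ref="https://example.com/style.jpg",        # placeholder URL
    character_ref="https://example.com/mascot.jpg",   # placeholder URL
    character_weight=60,
)
print(prompt)
# ceramic mug on a linen tablecloth, morning light --sref https://example.com/style.jpg --cref https://example.com/mascot.jpg --cw 60
```

The prose part carries the content and the flags carry the look, so either side can be swapped without touching the other.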

The brand-coherence application (the high-leverage move)

The most valuable business use isn’t matching a generic “stylish” look — it’s matching the client’s own look. Feed the brand’s actual assets as references and the output inherits their visual world:

  • A store-decor photo → the AI scene carries the brand’s real environment, materials, and palette.
  • Prior creative → new variants sit consistently alongside the existing campaign.
  • The brand’s glossary/distinctive-assets → the recognizable cues (color, shape language, texture) survive into AI-generated work, so it still reads as them.

This connects the AI-production layer to brand strategy: glossary/distinctive-assets (per Sharp / Ehrenberg-Bass) are exactly the cues that should be fed as references, because they’re what trigger recognition. AI generation without reference conditioning tends toward a generic, recognizable-as-AI aesthetic that quietly erodes distinctiveness; reference conditioning on the brand’s own assets is how you keep AI-scaled creative on-brand.

Brand coherence via structure, not literal copy (pattern to watch — N=1)

A sharper version of the coherence move, observed once and flagged for codification if it recurs: “make it look like the client” usually means carry their DNA through structure, not by copying decorative elements.

In a luxury-jewelry production, the client’s in-store decoration was a wave built from fabric and flowers; the video needed a literal water wave. Copying the flowers would have looked wrong; the bridge was the wave’s sculptural form — borrow the structure (the curl, the silhouette) and render it in the new material (water). The store photo became the structure reference; the material changed. The distinctive asset that transferred was the form, not the literal decoration.

The general rule this points at: when adapting a brand’s physical/offline assets into AI creative, identify which layer actually carries the brand DNA — often it’s the structure or silhouette, not the surface decoration — and feed that as the reference. This is N=1; it earns its own treatment only if a second client session validates the same move.

When reference beats prose — and when prose still wins

Reference wins when:

  • You’re matching a known look — a competitor’s aesthetic, a brand’s existing style, a mood you can show but not describe.
  • The user is non-visual and can’t articulate the look in actionable terms.
  • You need consistency across many generations (same style, same character, same product).
  • The look depends on subtle, hard-to-verbalize qualities (specific lighting feel, palette weighting, texture).

Prose still wins when:

  • The scene is novel — there’s no good reference for the specific thing you want (an unusual combination, a never-shot scenario).
  • You need precise semantic content — exact objects, counts, spatial relationships, text — which prose specifies more reliably than an image implies.
  • You’re deliberately avoiding an existing look (a reference would anchor you to it).
  • The reference would over-constrain, pulling in unwanted elements along with the wanted ones. (Reference conditioning often copies more than you intend: too strong a reference weight imports composition you didn’t want along with the palette you did. Weight dials such as Midjourney’s --sw for style references and --cw for character references exist precisely to manage this.)

The practical default for product/brand work: reference for the look, prose for the content. Show the model the aesthetic and the structure; tell it what’s actually in the frame.
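One way to manage the over-constraint risk in practice is to hold the prose content fixed and sweep the reference weight, then compare the results side by side. The sketch below assumes Midjourney-style --cref and --cw flags as named earlier; the `weight_sweep` helper, the default weights, and the URL are hypothetical, and the code only builds prompt strings.

```python
def weight_sweep(content, character_ref, weights=(0, 50, 100)):
    """Build one prompt per weight so the outputs can be compared.

    Hypothetical helper: the content stays constant and only --cw varies.
    """
    prompts = []
    for w in weights:
        if not 0 <= w <= 100:
            raise ValueError("weights are expected to be between 0 and 100")
        prompts.append(f"{content} --cref {character_ref} --cw {w}")
    return prompts


variants = weight_sweep(
    "mascot waving in front of a storefront",
    "https://example.com/mascot.jpg",  # placeholder URL
)
for line in variants:
    print(line)
# mascot waving in front of a storefront --cref https://example.com/mascot.jpg --cw 0
# mascot waving in front of a storefront --cref https://example.com/mascot.jpg --cw 50
# mascot waving in front of a storefront --cref https://example.com/mascot.jpg --cw 100
```

Picking the lowest weight that still reads as the reference leaves the most room for the prose to control content.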

Honest limits

  • Over-constraint is real. Strong references import composition, lighting, and incidental elements along with the intended quality. Use weight controls; isolate the slot you actually want (style-only, structure-only).
  • Reference ≠ guaranteed fidelity. A product fed as a reference image still gets re-rendered by generative models — for fidelity-critical products you composite the real photo rather than relying on reference conditioning (see marketing/ai-product-video-fidelity). Reference conditioning controls aesthetic, not exact product fidelity.
  • Tool-specific behavior varies and shifts. --sref/--cref behavior changes across Midjourney versions; Kontext behaves differently in ComfyUI than in hosted services. The slots and weights named here reflect June 2026 and are worth re-checking.
  • The brand-structure pattern is N=1. The “carry structure not decoration” move has been observed on one production only; treat it as a hypothesis, not a law.

Key Takeaways

  • Reference-image conditioning = controlling AI output by showing an example, not describing it — show-don’t-tell for AI aesthetics.
  • It routes around the description bottleneck for non-visual users and for looks you can recognize but can’t articulate.
  • Different slots control different layers (composition / palette / structure / style / character); layer them for precision.
  • Brand coherence is the high-leverage use — feed the client’s own assets and glossary/distinctive-assets so output reads as them, not as generic AI.
  • Reference for the look, prose for the content; weight controls manage over-constraint.

Sources