§01 What Multimodal Actually Is
The temptation is to define multimodal AI as "an LLM that can see." That's not what it is. And building on that definition is why most multimodal implementations fail in production. A real multimodal system has three distinct components working in coordination — and each one can be the failure point.
You take a language model, add a vision encoder as an input adapter, fine-tune on image-caption pairs, and call it multimodal. The transformer still fundamentally processes text tokens. Images get tokenized and prepended to the prompt. The model wasn't structurally designed to reason about spatial relationships, temporal sequences in video, or the interaction between what something looks like and what it means. It can describe images because it's seen descriptions of images. That's different from visual understanding.
text-first · vision as afterthought

Modality-specific encoders handle perception: a vision transformer for images, an audio encoder for speech, a document parser for PDFs. Each encoder produces a learned representation in a shared embedding space. A fusion mechanism — cross-attention or late-fusion concatenation — integrates the modalities into a coherent representation. The LLM reasons over that integrated representation, not raw pixels. Specialized perception models, LLMs for language and orchestration, plain code for rules. Each doing what it does best.

specialized encoders · designed for fusion
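Here's a minimal sketch of that fusion pattern in PyTorch. The dimensions, module names, and the choice to have text tokens attend over perception tokens are illustrative assumptions; real systems vary in where and how often fusion happens.

```python
# Minimal sketch, not any specific model's architecture: per-modality
# projections into a shared embedding space, fused via cross-attention.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Project each modality into a shared space, then let the
    language-side tokens attend over the perception tokens."""

    def __init__(self, text_dim=768, vision_dim=1024, audio_dim=512,
                 shared_dim=768, heads=8):
        super().__init__()
        # Modality-specific projections into the shared embedding space.
        self.text_proj = nn.Linear(text_dim, shared_dim)
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.audio_proj = nn.Linear(audio_dim, shared_dim)
        # Fusion mechanism: text queries attend over perception tokens.
        self.cross_attn = nn.MultiheadAttention(shared_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(shared_dim)

    def forward(self, text_feats, vision_feats, audio_feats):
        # Each encoder has already produced a feature sequence:
        #   text_feats:   (batch, T_text, text_dim)
        #   vision_feats: (batch, T_img,  vision_dim)  e.g. ViT patch embeddings
        #   audio_feats:  (batch, T_aud,  audio_dim)
        q = self.text_proj(text_feats)
        kv = torch.cat([self.vision_proj(vision_feats),
                        self.audio_proj(audio_feats)], dim=1)
        fused, _ = self.cross_attn(q, kv, kv)
        # Downstream, the LLM reasons over this fused representation,
        # not over raw pixels or waveforms.
        return self.norm(q + fused)

if __name__ == "__main__":
    fusion = CrossAttentionFusion()
    out = fusion(torch.randn(2, 16, 768),
                 torch.randn(2, 256, 1024),
                 torch.randn(2, 50, 512))
    print(out.shape)  # torch.Size([2, 16, 768])
```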
§02 The Contact Sheet
In darkroom photography, a contact sheet shows every frame from a roll — the ones that worked, the ones overexposed, the ones still blank. Here's the multimodal contact sheet as of early 2026:
Frames Developed — Sharp and Ready
Charts, tables, diagrams embedded in PDFs. Before multimodal, you extracted text and lost all visual context. Now you can ask nuanced questions that require synthesizing the text and the visuals together. A 50-page technical report with embedded charts — the model reasons across all of it.
Upload a screenshot of a broken UI and ask the model to diagnose what's wrong. It reads layout, parses element positions, understands visual hierarchy. Developers now screenshot error states and ask the model directly. This works.
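The workflow is just a chat call with an image attached. A hedged sketch against an OpenAI-style API; the model name and file path are placeholders, and any vision-capable endpoint that accepts the same message format works the same way.

```python
# Hedged sketch: diagnose a broken UI from a screenshot via an
# OpenAI-style chat API. Model name and file path are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("broken_ui.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "This layout is broken. What is misaligned, and what is the likely cause?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```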
Upload an image, find similar products. Concept understood by VLMs since mid-2024, production-reliable by late 2025. Dropbox implemented embedding-based visual search and saw a 17% reduction in empty search sessions. The image → embedding → retrieval pipeline is now a solved pattern.
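A sketch of that pipeline using a CLIP checkpoint from sentence-transformers; the model name, file paths, and brute-force in-memory ranking are assumptions, and a production system would swap the flat list for a vector index (FAISS, pgvector, etc.).

```python
# Hedged sketch of image -> embedding -> retrieval. The CLIP checkpoint
# and the in-memory brute-force ranking are illustrative assumptions.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")  # joint image/text embedding space

# Index: embed the catalog images once, keep the vectors around.
catalog_paths = ["shoe_01.jpg", "shoe_02.jpg", "lamp_07.jpg"]  # placeholder files
catalog_embs = model.encode([Image.open(p) for p in catalog_paths],
                            convert_to_tensor=True, normalize_embeddings=True)

# Query: embed the uploaded photo, rank catalog items by cosine similarity.
query_emb = model.encode(Image.open("user_upload.jpg"),
                         convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(query_emb, catalog_embs)[0]
for path, score in sorted(zip(catalog_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.3f}")
```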
Radiology reports, pathology slides, clinical documentation. VLMs assist with image analysis and documentation generation. Not replacing radiologists — augmenting workflow. Human-in-the-loop is non-negotiable here, but the model handles routine documentation reliably enough to deploy.
Screenshot a Grafana dashboard or a business chart and ask "what's the anomaly here" or "when did this metric peak." Works with modern VLMs. Useful for ops teams that want to describe what they're seeing without manually extracting data. Reliable in 2025.
Upload photos of a property, get a structured condition report. Water damage, foundation cracks, roof deterioration — VLMs trained on real estate imagery can flag issues and generate documentation. Still needs human verification for legal disclosure purposes, but the draft is reliable.
Frames Partially Developed — Usable with Caution
15–30 second clips, clearly shot. Models can describe what happens, identify objects and actions, summarize scenes. Reliability drops sharply with motion blur, overlapping events, or ambiguous framing. Long-form video remains unreliable at the reasoning layer — not a production use case yet.
Models that watch a screen and take actions. Works for well-defined, structured tasks on predictable UIs. Breaks on dynamic pages, pop-ups, CAPTCHA, non-standard layouts. The hit rate is high enough to demo, not yet high enough to run unsupervised on production workflows.
Give the model a before/after pair or a floorplan + photos and ask it to reason across them. Works when the relationship is explicit. Fails when the task requires building an implicit spatial model — understanding 3D from 2D projections, inferring occlusion, or reasoning about relative scale.
Unexposed Frames — Not Ready
A 30-minute video with complex interleaved events, speaker changes, and temporal causation. Models can process frames. They cannot reliably build coherent temporal narratives across long video. The architecture is fundamentally not designed for this yet.
Count exactly how many objects are in a dense scene. Count fasteners on a circuit board. Measure pixel-precise distances. VLMs hallucinate counts. Computer vision models designed for detection (YOLO, SAM) do this reliably. Use the right tool.
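Concretely, counting belongs to a detector, not a VLM. A hedged sketch with Ultralytics YOLO; the checkpoint, confidence threshold, and target class are assumptions, and dense small parts like fasteners usually need a custom-trained model rather than a generic COCO checkpoint.

```python
# Hedged sketch: count objects with a detection model instead of a VLM.
# Checkpoint, confidence threshold, and target class are assumptions.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder checkpoint
results = model("warehouse_shelf.jpg", conf=0.4)

boxes = results[0].boxes
count = sum(1 for c in boxes.cls.tolist() if model.names[int(c)] == "bottle")
print(f"bottles detected: {count}")
# A VLM can then reason about the count ("is the shelf understocked?"),
# but the number itself comes from the detector.
```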
Understanding where a sound comes from in a scene, reconciling contradictions between what is said and what is shown, multi-speaker diarization with visual face-tracking. Each modality separately: getting there. True audio-visual joint reasoning: research territory.
§03 Infrastructure Reality
The biggest operational challenge in multimodal production is not model quality — it's resource variance. Text requests are predictable. Image requests are not. Processing a single high-resolution image can consume as much GPU memory as thousands of text tokens. Your serving infrastructure needs to handle this (a resize-on-ingest sketch follows the table):
| Challenge | Specifics | Mitigation |
|---|---|---|
| Memory variance | Single hi-res image = 10× text memory | Image resolution caps, resize-on-ingest, request-level memory budgets |
| Vision encoder overhead | Time-to-first-token (TTFT) spikes on large images — prefill dominates | Resize to standard dims before encode (e.g. 896×896); async encode pipeline |
| Workload unpredictability | Image requests can carry ~10× the resource footprint of text requests | Request routing by modality, separate GPU pools, per-modality autoscaling |
| Evaluation gaps | No single headline score predicts production behavior | Slice-wise eval: by device type, image quality, domain, modality combo |
| Hallucination in vision | VLMs confabulate image content with more confidence than text | Grounding + retrieval over known image databases; human-in-loop for high-stakes |
| Use-case–model mismatch | Using a VLM for counting/measurement tasks it can't do reliably | Use specialized CV models (YOLO, SAM) for perception; VLM for reasoning only |
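A minimal sketch of the resize-on-ingest mitigation from the table, assuming an 896×896 cap and a per-request byte budget; both numbers are illustrative, not any specific model's requirement.

```python
# Hedged sketch of resize-on-ingest: cap image resolution before it ever
# reaches the vision encoder, so one hi-res upload cannot blow the
# request-level memory budget. Cap and budget values are assumptions.
import io
from PIL import Image

MAX_SIDE = 896          # assumed cap matching a common vision-encoder input size
MAX_BYTES = 2_000_000   # assumed per-request budget for encoded image payloads

def ingest_image(raw: bytes) -> bytes:
    img = Image.open(io.BytesIO(raw)).convert("RGB")
    # Downscale so the longest side is at most MAX_SIDE, preserving aspect ratio.
    img.thumbnail((MAX_SIDE, MAX_SIDE), Image.Resampling.LANCZOS)
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85)
    out = buf.getvalue()
    if len(out) > MAX_BYTES:
        raise ValueError("image exceeds per-request payload budget after resize")
    return out
```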
The teams that succeed combine specialized tools — vision encoders for perception, LLMs for reasoning, retrieval for context, plain code for rules — into coherent systems that solve specific problems end-to-end. The mistake is treating multimodal as a modality upgrade. It's an architecture question. The contact sheet shows which frames developed. Don't force the unexposed ones.