ROLL 41 · MULTIMODAL · CONTACT SHEET

Multimodal in Production // what actually works — the contact sheet

Multimodal AI became production-ready in 2025. Not all of it. Some frames came out sharp. Some are overexposed. Some are still blank. This is the contact sheet — every use case developed, every one that's still developing, and why "gluing vision onto an LLM" is not a multimodal system.

2025
Year multimodal went from demo to deployable production systems
10×
Approximate memory overhead of a single high-res image request vs a text-only request
230K
Real user–VLM conversations on VisionArena in 2025, across 138 languages
3
Components in a real multimodal system: perception encoder, LLM, fusion layer
// contact sheet · roll 41 · multimodal use cases · processed in darkroom · DEVELOPED

§01 What Multimodal Actually Is

The temptation is to define multimodal AI as "an LLM that can see." That's not what it is. And building on that definition is why most multimodal implementations fail in production. A real multimodal system has three distinct components working in coordination — and each one can be the failure point.

Wrong model
Text-first with vision bolted on

You take a language model, add a vision encoder as an input adapter, fine-tune on image-caption pairs, and call it multimodal. The transformer still fundamentally processes text tokens. Images get tokenized and prepended to the prompt. The model wasn't structurally designed to reason about spatial relationships, temporal sequences in video, or the interaction between what something looks like and what it means. It can describe images because it's seen descriptions of images. That's different from visual understanding.

text-first · vision as afterthought
Right model
Specialized encoders + fusion + LLM reasoning

Modality-specific encoders handle perception: a vision transformer for images, an audio encoder for speech, a document parser for PDFs. Each encoder produces a learned representation in a shared embedding space. A fusion mechanism — cross-attention or late-fusion concatenation — integrates the modalities into a coherent representation. The LLM reasons over that integrated representation, not raw pixels. Specialized perception models, LLMs for language and orchestration, plain code for rules. Each doing what it does best.

specialized encoders · designed for fusion
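A minimal sketch of that separation, in TypeScript. Every name here is illustrative (there is no real SDK with these interfaces); the point is the boundaries between perception, fusion, and reasoning:

```typescript
// Hypothetical interfaces: the names are illustrative, not a real SDK.
type Embedding = Float32Array;

interface PerceptionEncoders {
  image: (bytes: ArrayBuffer) => Promise<Embedding>;    // vision transformer
  audio: (bytes: ArrayBuffer) => Promise<Embedding>;    // speech encoder
  document: (bytes: ArrayBuffer) => Promise<Embedding>; // PDF / layout parser
}

interface FusionLayer {
  // Cross-attention or late-fusion concatenation into one shared representation.
  fuse(parts: Embedding[]): Embedding;
}

interface Reasoner {
  // The LLM reasons over the fused representation plus the text prompt.
  answer(fused: Embedding, prompt: string): Promise<string>;
}

// The multimodal "system" is the coordination of three parts,
// not a single model that happens to accept images.
async function answerAboutImage(
  enc: PerceptionEncoders,
  fusion: FusionLayer,
  llm: Reasoner,
  imageBytes: ArrayBuffer,
  question: string,
): Promise<string> {
  const visual = await enc.image(imageBytes); // perception
  const fused = fusion.fuse([visual]);        // fusion
  return llm.answer(fused, question);         // reasoning
}
```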
The infrastructure consequence: Vision encoders add substantial memory overhead. A single high-resolution image can consume as much GPU memory as thousands of tokens of text. This creates wildly variable resource consumption per request — a text query might cost 0.1 GPU-seconds, a high-res image analysis might cost 1.2 GPU-seconds. Your autoscaling needs to handle this variance. Static capacity planning based on text-only request profiles will leave you either overprovisioned or broken during traffic spikes.
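A toy illustration of that variance, using the GPU-second figures above as placeholder costs (assumed, not benchmarked). The takeaway: budget and autoscale on estimated footprint, not request count.

```typescript
// Placeholder costs from the figures above; real numbers depend on model and hardware.
const GPU_SECONDS = { textQuery: 0.1, perImage: 1.2 };

interface InferenceRequest {
  promptTokens: number; // token count matters far less than image count here
  imageCount: number;
}

// Budget by estimated footprint, not by request count: a single-image request
// weighs roughly 10x a text-only one, so request-count autoscaling misleads.
function estimateGpuSeconds(req: InferenceRequest): number {
  return GPU_SECONDS.textQuery + req.imageCount * GPU_SECONDS.perImage;
}

estimateGpuSeconds({ promptTokens: 800, imageCount: 0 }); // ≈ 0.1
estimateGpuSeconds({ promptTokens: 800, imageCount: 1 }); // ≈ 1.3
```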

§02 The Contact Sheet

In darkroom photography, a contact sheet shows every frame from a roll — the ones that worked, the ones overexposed, the ones still blank. Here's the multimodal contact sheet as of early 2026:

Frames Developed — Sharp and Ready

01
DEVELOPED
Document Analysis with Visuals

Charts, tables, diagrams embedded in PDFs. Before multimodal, you extracted text and lost all visual context. Now you can ask nuanced questions requiring synthesis of both. 50-page technical report with embedded charts — the model reasons across all of it.

→ Harvey AI serves 97% of Am Law 100 using visual+text RAG on legal documents
02
DEVELOPED
UI Debugging from Screenshots

Upload a screenshot of a broken UI and ask the model to diagnose what's wrong. It reads layout, parses element positions, understands visual hierarchy. Developers now screenshot error states and ask the model directly. This works.

→ "Here's a screenshot of the broken component — what's wrong?" now a viable workflow
03
DEVELOPED
Visual Search in E-Commerce

Upload an image, find similar products. VLMs have handled the concept since mid-2024; it became production-reliable by late 2025. Dropbox implemented embedding-based visual search and saw a 17% reduction in empty search sessions. Image → embedding → retrieval is now a solved pattern (sketched after this set of frames).

→ Dropbox: 17% reduction in empty searches. Zendesk: 7% MRR improvement via visual search
04
DEVELOPED
Medical Imaging Documentation

Radiology reports, pathology slides, clinical documentation. VLMs assist with image analysis and documentation generation. Not replacing radiologists — augmenting workflow. Human-in-the-loop is non-negotiable here, but the model handles routine documentation reliably enough to deploy.

→ Clinical documentation from imaging + notes: reliable enough for production with human review
05
DEVELOPED
Visual QA on Charts and Dashboards

Screenshot a Grafana dashboard or a business chart and ask "what's the anomaly here" or "when did this metric peak." Works with modern VLMs. Useful for ops teams that want to describe what they're seeing without manually extracting data. Reliable in 2025.

→ "What does this chart show and when did the spike occur?" — works reliably in production
06
DEVELOPED
Property Condition Assessment

Upload photos of a property, get a structured condition report. Water damage, foundation cracks, roof deterioration — VLMs trained on real estate imagery can flag issues and generate documentation. Still needs human verification for legal disclosure purposes, but the draft is reliable.

→ Direct application: PropTechUSA.ai pipeline — photos in, condition report out, human reviews
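The visual-search pattern from frame 03, sketched minimally. The embedImage encoder call and the in-memory scan are stand-ins; a production system would use a real vision encoder and a vector database:

```typescript
// Minimal image -> embedding -> retrieval sketch. embedImage() is an assumed
// vision-encoder call; production systems use a vector DB, not an array scan.
type Vector = number[];

interface CatalogItem {
  productId: string;
  embedding: Vector;
}

function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Embed the query photo, rank the catalog by similarity, return the top matches.
async function visualSearch(
  queryImage: ArrayBuffer,
  embedImage: (bytes: ArrayBuffer) => Promise<Vector>, // assumed encoder
  catalog: CatalogItem[],
  topK = 10,
): Promise<CatalogItem[]> {
  const query = await embedImage(queryImage);
  return [...catalog]
    .sort((a, b) => cosineSimilarity(query, b.embedding) - cosineSimilarity(query, a.embedding))
    .slice(0, topK);
}
```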

Frames Partially Developed — Usable with Caution

07
PARTIAL
Short Video Understanding

15–30 second clips, clearly shot. Models can describe what happens, identify objects and actions, summarize scenes. Reliability drops sharply with motion blur, overlapping events, or ambiguous framing. Long-form video remains unreliable at the reasoning layer — not a production use case yet.

⚡ works for clean short clips · fails on complex temporal sequences
08
PARTIAL
Real-Time Visual Agents (Browser/Computer Use)

Models that watch a screen and take actions. Works for well-defined, structured tasks on predictable UIs. Breaks on dynamic pages, pop-ups, CAPTCHA, non-standard layouts. The hit rate is high enough to demo, not yet high enough to run unsupervised on production workflows.

⚡ structured tasks: reliable · open-ended navigation: still needs supervision
09
PARTIAL
Multi-Image Spatial Reasoning

Give the model a before/after pair or a floorplan + photos and ask it to reason across them. Works when the relationship is explicit. Fails when the task requires building an implicit spatial model — understanding 3D from 2D projections, inferring occlusion, or reasoning about relative scale.

⚡ explicit relationships: good · implicit 3D spatial reasoning: inconsistent

Unexposed Frames — Not Ready

10
UNEXPOSED
Long-Form Video Analysis

A 30-minute video with complex interleaved events, speaker changes, and temporal causation. Models can process frames. They cannot reliably build coherent temporal narratives across long video. The architecture is fundamentally not designed for this yet.

✗ not production-ready · architectural research problem, not deployment problem
11
UNEXPOSED
Fine-Grained Spatial Counting

Count exactly how many objects are in a dense scene. Count fasteners on a circuit board. Measure pixel-precise distances. VLMs hallucinate counts. Computer vision models built for detection and segmentation (YOLO, SAM) do this reliably. Use the right tool.

✗ use object detection models (YOLO/SAM) — not VLMs — for counting/measurement tasks
12
UNEXPOSED
Audio-Visual Joint Reasoning

Understanding where a sound comes from in a scene, reconciling contradictions between what is said and what is shown, multi-speaker diarization with visual face-tracking. Each modality separately: getting there. True audio-visual joint reasoning: research territory.

✗ each modality separately is production-ready · joint audio-visual reasoning is not

§03 Infrastructure Reality

The biggest operational challenge in multimodal production is not model quality — it's resource variance. Text requests are predictable. Image requests are not. High-resolution image processing can consume the GPU memory of thousands of tokens of text. Your serving infrastructure needs to handle this:

// Multimodal Infrastructure Challenges · Production Characteristics
Challenge | Specifics | Mitigation
Memory variance | Single hi-res image = 10× text memory | Image resolution caps, resize-on-ingest, request-level memory budgets
Vision encoder overhead | TTFT spikes on large images — prefill dominates | Resize to standard dims before encode (e.g. 896×896); async encode pipeline
Workload unpredictability | Text + image requests have 10× different resource footprints | Request routing by modality, separate GPU pools, per-modality autoscaling
Evaluation gaps | No single headline score predicts production behavior | Slice-wise eval: by device type, image quality, domain, modality combo
Hallucination in vision | VLMs confabulate image content with more confidence than text | Grounding + retrieval over known image databases; human-in-loop for high-stakes
Use-case–model mismatch | Using a VLM for counting/measurement tasks it can't do reliably | Use specialized CV models (YOLO, SAM) for perception; VLM for reasoning only
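Two of those mitigations, sketched: resize-on-ingest with the 896×896 cap from the table, and routing by modality. The dimensions and pool names are illustrative, not prescriptive:

```typescript
// Assumed limits: 896x896 comes from the table above; pool names are illustrative.
const MAX_DIM = 896;

interface IncomingRequest {
  text: string;
  image?: { width: number; height: number; bytes: ArrayBuffer };
}

// Resize-on-ingest: compute target dimensions so no oversized image
// ever reaches the vision encoder.
function clampDimensions(width: number, height: number): { width: number; height: number } {
  const scale = Math.min(1, MAX_DIM / Math.max(width, height));
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

// Request routing by modality: text-only and image traffic have ~10x different
// footprints, so they go to separate GPU pools with separate autoscaling.
function routeByModality(req: IncomingRequest): "text-pool" | "vision-pool" {
  return req.image ? "vision-pool" : "text-pool";
}
```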
The best generative AI products don't rely on one giant model for everything.

They combine specialized tools — vision encoders for perception, LLMs for reasoning, retrieval for context, plain code for rules — into coherent systems that solve specific problems end-to-end. The mistake is treating multimodal as a modality upgrade. It's an architecture question. The contact sheet shows which frames developed. Don't force the unexposed ones.

Where this connects to PropTechUSA.ai: Frame 06 — property condition assessment from photos — is in the build queue. The architecture: phone photo uploads to Cloudflare Worker, resized and encoded, sent to VLM with structured prompt requesting condition report, output reviewed by Eric or Donneal before disclosure. Specialized perception (vision encoder), LLM for report generation, human review for compliance. Not one giant multimodal endpoint. Three components doing their jobs.
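A hedged sketch of that Worker, for shape only: the endpoint, environment bindings, prompt, and response handling are assumptions, not the deployed code.

```typescript
// Sketch of the described flow: photo in -> resize/encode -> VLM with a structured
// prompt -> draft report held for human review. The endpoint, env bindings, prompt,
// and response shape are assumptions, not the shipped implementation.
interface Env {
  VLM_API_URL: string;
  VLM_API_KEY: string;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const photo = new Uint8Array(await request.arrayBuffer());

    // Resize-on-ingest would happen here (e.g. via an image-processing binding)
    // before the bytes ever reach the vision encoder.

    // Base64-encode the (resized) image for the VLM request body.
    let binary = "";
    for (const byte of photo) binary += String.fromCharCode(byte);
    const imageBase64 = btoa(binary);

    const vlmResponse = await fetch(env.VLM_API_URL, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${env.VLM_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        prompt:
          "Produce a structured property condition report: water damage, foundation, roof, with a confidence note per finding.",
        image: imageBase64,
      }),
    });
    const draftReport = await vlmResponse.json();

    // Draft only: the report is queued for human review before any disclosure use.
    return Response.json({ status: "pending_review", draftReport });
  },
};
```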
Justin Erickson · PropTechUSA.ai
87 CF Workers · GED (juvenile detention) · Self-taught · March 2026 · Series 3 Finale