Multimodal AI refers to models that can process and reason across multiple types of input simultaneously — typically text and images, but increasingly audio, video, documents, and structured data. Models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 are multimodal — they can analyze images, describe visual content, solve visual problems, and reason about information across different modalities.
How multimodal models work
Multimodal models use modality-specific encoders that map each input type into a shared vector embedding space, then merge everything for joint Transformer processing:
- Image encoder: A Vision Transformer (ViT) divides the image into 16×16 or 32×32 pixel patches. Each patch → linear embedding → positional encoding → Transformer. Output: a sequence of patch embeddings with the same dimension as text tokens.
- Audio encoder: A log-mel spectrogram is computed from the raw waveform, then processed by a Transformer encoder (e.g., Whisper's encoder). Output: a sequence of audio embeddings.
- Fusion: Image/audio embeddings are concatenated or interleaved with text token embeddings. The combined sequence is passed through the main language model.
- Training: Models are trained on paired multimodal data (image-caption pairs, video-transcript pairs) using a contrastive or generative objective to align representations.
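The encode-then-fuse pipeline above is mostly shape arithmetic: a 224×224 image cut into 16×16 patches yields a 14×14 grid of 196 patch embeddings, which are concatenated with the text token embeddings into one sequence for the joint Transformer. A minimal sketch (shapes only, no real model; the 768-dimension constant is an illustrative assumption):

```python
D_MODEL = 768  # shared embedding dimension (assumption for illustration)

def num_patches(image_size: int, patch_size: int) -> int:
    """A ViT splits an image_size x image_size image into a square grid of patches."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    return (image_size // patch_size) ** 2

def fused_sequence_length(image_size: int, patch_size: int, n_text_tokens: int) -> int:
    """Patch embeddings are concatenated with text token embeddings
    into a single sequence for the joint Transformer."""
    return num_patches(image_size, patch_size) + n_text_tokens

patches = num_patches(224, 16)  # 14 x 14 grid -> 196 patches
seq_len = fused_sequence_length(224, 16, n_text_tokens=50)
print(patches, seq_len)  # 196 246
```

Note how image cost scales quadratically with resolution: halving the patch size quadruples the number of patch embeddings the language model must attend over.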
| Model | Modalities | Vision encoder | Context |
|---|---|---|---|
| GPT-4o | Text, image, audio | Custom ViT | 128K tokens |
| Claude 3.5 Sonnet | Text, image | Custom ViT | 200K tokens |
| Gemini 1.5 Pro | Text, image, audio, video | Native multimodal | 1M tokens |
| LLaMA 3.2 Vision | Text, image | ViT-L/14 | 128K tokens |
Practical uses for students and developers
Multimodal capabilities unlock entirely new workflows that pure text models cannot support:
| Use case | Input | What the model does |
|---|---|---|
| Solve handwritten math | Photo of notebook | OCR + parse + solve equation step-by-step |
| Explain textbook diagram | Photo of figure in book | Identify, describe, and explain the visual concept |
| Debug screenshot errors | Screenshot of error/terminal | Read error text + suggest fixes |
| Summarize handwritten notes | Photo of notes page | Transcribe + organize into structured summary |
| Analyze research charts | Image of chart/graph | Read axes, values, trends, and interpret findings |
| Extract table data | Photo of printed table | Convert to CSV/JSON format for further processing |
| UI feedback for developers | Screenshot of UI | Identify layout issues, accessibility, UX suggestions |
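For the "extract table data" row above, a vision model typically returns a markdown table in its reply; a small helper (illustrative, and assuming that output format) can convert it to CSV for downstream processing:

```python
import csv
import io

def markdown_table_to_csv(md: str) -> str:
    """Convert a markdown table (as a vision model might return it) to CSV.
    Assumes a simple pipe-delimited table with a |---| separator row."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip any prose the model adds around the table
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(set(c) <= set("-: ") for c in cells):
            continue  # skip the |---|---| separator row
        rows.append(cells)
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

md = """| Item | Qty |
|---|---|
| Pencil | 12 |
| Eraser | 3 |"""
print(markdown_table_to_csv(md))
```

Asking the model explicitly for "a markdown table and nothing else" makes this kind of post-processing far more reliable.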
Sending an image to Claude with a text question
```python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def ask_about_image(image_path: str, question: str) -> str:
    image_data = base64.standard_b64encode(
        Path(image_path).read_bytes()
    ).decode("utf-8")
    # Detect format from extension; fail clearly on unsupported types
    ext = Path(image_path).suffix.lower()
    media_types = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
                   ".png": "image/png", ".gif": "image/gif",
                   ".webp": "image/webp"}
    if ext not in media_types:
        raise ValueError(f"Unsupported image format: {ext}")
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_types[ext],
                        "data": image_data,
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    )
    return response.content[0].text

# Example use
answer = ask_about_image(
    "textbook_page.jpg",
    "Explain the diagram on this page and summarize the key concept.",
)
print(answer)
```
Image resolution and token costs
Vision models process images by dividing them into tiles or patches. Higher resolution means more tiles, more tokens, and higher cost and latency. Understanding this tradeoff helps you optimize for your use case:
| Model | Low-detail mode | High-detail mode | Max resolution | Notes |
|---|---|---|---|---|
| GPT-4o | ~85 tokens (512px tile) | Up to ~2,000 tokens (multiple tiles) | 2048×2048 | Each 512×512 tile = ~170 tokens |
| Claude 3.5 Sonnet | ~1,600 tokens | Up to ~4,000 tokens | 8000×8000 (downscaled) | Scales based on image dimensions |
| Gemini 1.5 Pro | ~258 tokens | Up to ~1,024 tokens per frame | No fixed limit | Video: ~263 tokens per frame |
Optimize image quality for your task
For reading text from documents, use high-detail mode and scan at ≥300 DPI. For general diagram description, low-detail mode is often sufficient and 4–8× cheaper. Never send unnecessarily large images — resize to the minimum required for your task before sending to the API.
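To make the tradeoff concrete, you can estimate GPT-4o image cost before sending. The sketch below approximates OpenAI's published tile rule using the figures from the table above (flat ~85 tokens in low-detail mode; in high-detail mode the image is rescaled, then charged ~170 tokens per 512×512 tile plus the 85-token base); verify against current pricing docs before relying on it:

```python
import math

def estimate_gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4o vision token cost from the published tile rule."""
    if detail == "low":
        return 85  # flat cost regardless of resolution
    # Scale so the longest side is at most 2048 px
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Then scale so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # ~170 tokens per 512x512 tile, plus the ~85-token base
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85

print(estimate_gpt4o_image_tokens(1024, 1024))          # -> 765 (four tiles)
print(estimate_gpt4o_image_tokens(1024, 1024, "low"))   # -> 85
```

Running the estimate on both modes before choosing makes the "4-8x cheaper" claim above easy to check for your own images.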
Text in images (OCR)
Reading text from images (Optical Character Recognition) is one of the highest-value multimodal use cases. Vision LLMs dramatically outperform traditional OCR systems on complex real-world inputs:
| Input type | Traditional OCR (Tesseract) | Vision LLM (Claude/GPT-4o) |
|---|---|---|
| Printed text, clean scan | Excellent (>99% accuracy) | Excellent (>99%) |
| Handwritten text | Poor (20–60% accuracy) | Good–Excellent (70–95%) |
| Mixed layout (columns, tables) | Struggles with layout | Understands structure natively |
| Mathematical notation | Cannot parse | Good for most notation; verify complex LaTeX |
| Non-Latin scripts | Requires language packs | Excellent (trained on multilingual data) |
| Low quality / degraded scan | Fails badly | Degrades gracefully; better than OCR |
| Text overlaid on complex background | Poor | Good (understands foreground/background) |
Always verify critical OCR output
Vision LLMs can misread very small text (<8pt equivalent in the image), certain handwriting styles, and very degraded scans. For medical records, legal documents, or financial data, always have a human verify extracted text before acting on it.
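One cheap safeguard for that verification step is to run a second, independent pass (a different model, or a traditional OCR engine) and flag documents where the two transcriptions disagree, so only disputed cases need human review. A sketch using the standard library's difflib (the example strings are invented):

```python
import difflib

def ocr_agreement(text_a: str, text_b: str) -> float:
    """Similarity ratio (0..1) between two independent transcriptions."""
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()

def flag_for_review(text_a: str, text_b: str, threshold: float = 1.0) -> bool:
    """Route to a human when the transcriptions disagree.
    Default threshold 1.0 flags ANY disagreement, which is appropriate
    for medical/legal/financial data; relax it for noisy casual inputs."""
    return ocr_agreement(text_a, text_b) < threshold

vision_llm = "Patient ID 48213, dosage 50 mg twice daily"
ocr_pass = "Patient ID 48218, dosage 50 mg twice daily"
print(flag_for_review(vision_llm, ocr_pass))  # True -> send to a human
```

Agreement between two independent readers is not proof of correctness, but disagreement is a reliable signal that something needs a second look.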
Emerging modalities: audio, video, 3D
The frontier is moving rapidly toward truly universal models that process any combination of inputs. Here is where each modality stands in 2025:
| Modality | Capability level | Leading model | Key limitation |
|---|---|---|---|
| Image understanding | Mature | Claude 3.5, GPT-4o, Gemini 1.5 | Very small text; fine-grained counting |
| Speech recognition | Mature | Whisper (OpenAI), Gemini Audio | Noisy environments; rare accents |
| Native audio understanding | Emerging | GPT-4o Audio, Gemini 1.5 Pro | Latency for real-time; cost |
| Short video (<1 min) | Developing | Gemini 1.5 Pro, GPT-4o | Temporal reasoning; object tracking |
| Long video (hours) | Early stage | Gemini 1.5 Pro (1M ctx) | Very expensive; limited availability |
| 3D / point clouds | Research stage | GPT-4o with 3D rendering tricks | No native 3D understanding yet |
Video: frames as images
Most video-capable models sample frames at a fixed rate (e.g., 1 frame/second for Gemini 1.5) and process them as a sequence of image embeddings. True native video understanding — tracking objects and events across frames — is still an active research problem. Sora's video generation uses a different architecture: video diffusion transformers (DiT).
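The frame-sampling approach makes video cost easy to estimate: at 1 frame/second and roughly 263 tokens per frame (the Gemini 1.5 figure quoted in the table above), token count grows linearly with duration. Illustrative arithmetic only; real per-frame costs vary by model and settings:

```python
def estimate_video_tokens(duration_s: float, fps_sampled: float = 1.0,
                          tokens_per_frame: int = 263) -> int:
    """Rough token cost for frame-sampled video understanding."""
    return int(duration_s * fps_sampled * tokens_per_frame)

# A 3-minute clip at 1 frame/second
print(estimate_video_tokens(180))    # 180 frames * 263 tokens = 47340
# An hour of video approaches long-context territory
print(estimate_video_tokens(3600))   # 946800 tokens -> needs a ~1M context
```

This is why hour-scale video currently requires the million-token context window noted in the table, and why it remains expensive.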
Practice questions
- What is the difference between late fusion and early fusion in multimodal AI systems? (Answer: Late fusion: each modality is processed independently through its own encoder; outputs are combined (concatenated, averaged, or learned weighted sum) at the final decision layer. Simple, modular, each encoder can be optimized separately. Misses cross-modal interactions during processing. Early fusion: raw inputs from multiple modalities are combined before or early in the neural network. Enables cross-modal feature learning. Harder to train (requires paired multimodal data), but learns richer joint representations.)
- What is cross-modal attention and why is it central to transformers like BLIP-2? (Answer: Cross-modal attention: queries from one modality attend to keys/values from another. In BLIP-2: the Querying Transformer (Q-Former) has learnable query tokens that attend (via cross-attention) to frozen image encoder outputs — extracting visual information relevant to the text. The visual tokens produced by Q-Former bridge the frozen image encoder and frozen LLM. Cross-modal attention is what allows the LLM to condition text generation on image content without retraining either component.)
- What is the grounding problem in multimodal AI? (Answer: Grounding: connecting language symbols to perceptual content. A model that understands 'red apple' should link the word 'red' to specific wavelengths of reflected light and the visual appearance of that color. Without grounding, language models manipulate symbols without genuine perceptual reference. Multimodal models achieve partial grounding by training on image-text pairs — the model learns that 'red apple' co-occurs with images containing certain visual patterns. CLIP and similar contrastive models achieve strong perceptual grounding.)
- What are the evaluation challenges specific to multimodal generation (image + text)? (Answer: Text evaluation: BLEU, ROUGE, BERTScore — standard NLP metrics. Image-text alignment: CLIPScore measures cosine similarity between generated image and text prompt embeddings. Visual quality: FID (Fréchet Inception Distance) measures distributional similarity to real images. Compositional accuracy: does the image correctly show 'a red ball to the LEFT of a blue cube'? Hard to measure automatically. Human evaluation: costly gold standard. Current models score well on CLIPScore but still struggle with precise spatial relationships and accurate text rendering.)
- What is the difference between vision-language models (VLMs) for understanding vs generation? (Answer: VLMs for understanding: image → model → text. Tasks: visual QA, image captioning, optical character recognition, chart understanding. Examples: LLaVA, Idefics, PaliGemma, Claude Vision. Architecture: vision encoder + LLM. VLMs for generation: text → model → image. Tasks: text-to-image synthesis. Examples: DALL-E 3, Stable Diffusion, Midjourney. Architecture: text encoder + diffusion model or autoregressive image generator. Unified models (GPT-4o, Gemini 2.5): both input and output images, bridging understanding and generation in one system.)
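The cross-modal attention described in the BLIP-2 question above can be sketched in a few lines: a text-side query scores every image-patch key, the scores are softmaxed into weights, and the values are mixed by those weights. A pure-Python toy (one query, one head, 2-d embeddings, learned projections omitted):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """One text query attends over image-patch keys/values
    (single head, no learned projections, for clarity)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted mix of the patch values
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The text query resembles patch 0's key, so the output leans toward value 0
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]      # two image patches
values = [[10.0, 0.0], [0.0, 10.0]]
out = cross_attention(query, keys, values)
print(out)  # first component larger: the query "looked at" patch 0
```

In BLIP-2 the queries are the Q-Former's learnable tokens rather than raw text embeddings, but the mechanism is the same: whichever keys a query resembles most dominate the mixed output.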
On LumiChats
LumiChats supports multi-image attachments in chat. Images are sent to vision-capable models (GPT-4o, Gemini, Claude) via OpenRouter's multimodal API. You can attach multiple images to a single message for comparison analysis.