Multimodal AI refers to models that can process and reason across multiple types of input simultaneously — typically text and images, but increasingly audio, video, documents, and structured data. Models like GPT-4o, Gemini 1.5 Pro, and Claude 3.5 are multimodal — they can analyze images, describe visual content, solve visual problems, and reason about information across different modalities.
How multimodal models work
Multimodal models use modality-specific encoders that map each input type into a shared vector embedding space, then merge everything for joint Transformer processing:
- Image encoder: A Vision Transformer (ViT) divides the image into 16×16 or 32×32 pixel patches. Each patch → linear embedding → positional encoding → Transformer. Output: a sequence of patch embeddings with the same dimension as text tokens.
- Audio encoder: A log-mel spectrogram is computed from the raw waveform, then processed by a Transformer encoder (e.g., Whisper's encoder). Output: a sequence of audio embeddings.
- Fusion: Image/audio embeddings are concatenated or interleaved with text token embeddings. The combined sequence is passed through the main language model.
- Training: Models are trained on paired multimodal data (image-caption pairs, video-transcript pairs) using a contrastive or generative objective to align representations.
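The encode-then-fuse pipeline above is mostly shape arithmetic: a 224×224 image cut into 16×16 patches yields a 14×14 grid of 196 patch embeddings, which are concatenated with the text token embeddings into one sequence for the joint Transformer. A minimal sketch (shapes only, no real model; the 768-dimension constant is an illustrative assumption):

```python
D_MODEL = 768  # shared embedding dimension (assumption for illustration)

def num_patches(image_size: int, patch_size: int) -> int:
    """A ViT splits an image_size x image_size image into a square grid of patches."""
    assert image_size % patch_size == 0, "image must divide evenly into patches"
    return (image_size // patch_size) ** 2

def fused_sequence_length(image_size: int, patch_size: int, n_text_tokens: int) -> int:
    """Patch embeddings are concatenated with text token embeddings
    into a single sequence for the joint Transformer."""
    return num_patches(image_size, patch_size) + n_text_tokens

patches = num_patches(224, 16)  # 14 x 14 grid -> 196 patches
seq_len = fused_sequence_length(224, 16, n_text_tokens=50)
print(patches, seq_len)  # 196 246
```

Note how image cost scales quadratically with resolution: halving the patch size quadruples the number of patch embeddings the language model must attend over.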
| Model | Modalities | Vision encoder | Context |
|---|---|---|---|
| GPT-4o | Text, image, audio | Custom ViT | 128K tokens |
| Claude 3.5 Sonnet | Text, image | Custom ViT | 200K tokens |
| Gemini 1.5 Pro | Text, image, audio, video | Native multimodal | 1M tokens |
| LLaMA 3.2 Vision | Text, image | ViT-L/14 | 128K tokens |
Practical uses for students and developers
Multimodal capabilities unlock entirely new workflows that pure text models cannot support:
| Use case | Input | What the model does |
|---|---|---|
| Solve handwritten math | Photo of notebook | OCR + parse + solve equation step-by-step |
| Explain textbook diagram | Photo of figure in book | Identify, describe, and explain the visual concept |
| Debug screenshot errors | Screenshot of error/terminal | Read error text + suggest fixes |
| Summarize handwritten notes | Photo of notes page | Transcribe + organize into structured summary |
| Analyze research charts | Image of chart/graph | Read axes, values, trends, and interpret findings |
| Extract table data | Photo of printed table | Convert to CSV/JSON format for further processing |
| UI feedback for developers | Screenshot of UI | Identify layout issues, accessibility, UX suggestions |
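For the "extract table data" row above, a vision model typically returns a markdown table in its reply; a small helper (illustrative, and assuming that output format) can convert it to CSV for downstream processing:

```python
import csv
import io

def markdown_table_to_csv(md: str) -> str:
    """Convert a markdown table (as a vision model might return it) to CSV.
    Assumes a simple pipe-delimited table with a |---| separator row."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip any prose the model adds around the table
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(set(c) <= set("-: ") for c in cells):
            continue  # skip the |---|---| separator row
        rows.append(cells)
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

md = """| Item | Qty |
|---|---|
| Pencil | 12 |
| Eraser | 3 |"""
print(markdown_table_to_csv(md))
```

Asking the model explicitly for "a markdown table and nothing else" makes this kind of post-processing far more reliable.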
Sending an image to Claude with a text question
```python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def ask_about_image(image_path: str, question: str) -> str:
    image_data = base64.standard_b64encode(
        Path(image_path).read_bytes()
    ).decode("utf-8")
    # Detect format from extension; fail clearly on unsupported types
    ext = Path(image_path).suffix.lower()
    media_types = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
                   ".png": "image/png", ".gif": "image/gif",
                   ".webp": "image/webp"}
    if ext not in media_types:
        raise ValueError(f"Unsupported image format: {ext}")
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": media_types[ext],
                        "data": image_data,
                    },
                },
                {"type": "text", "text": question},
            ],
        }],
    )
    return response.content[0].text

# Example use
answer = ask_about_image(
    "textbook_page.jpg",
    "Explain the diagram on this page and summarize the key concept.",
)
print(answer)
```
Image resolution and token costs
Vision models process images by dividing them into tiles or patches. Higher resolution means more tiles, more tokens, and higher cost and latency. Understanding this tradeoff helps you optimize for your use case:
| Model | Low-detail mode | High-detail mode | Max resolution | Notes |
|---|---|---|---|---|
| GPT-4o | ~85 tokens (512px tile) | Up to ~2,000 tokens (multiple tiles) | 2048×2048 | Each 512×512 tile = ~170 tokens |
| Claude 3.5 Sonnet | ~1,600 tokens | Up to ~4,000 tokens | 8000×8000 (downscaled) | Scales based on image dimensions |
| Gemini 1.5 Pro | ~258 tokens | Up to ~1,024 tokens per frame | No fixed limit | Video: ~263 tokens per frame |
Optimize image quality for your task
For reading text from documents, use high-detail mode and scan at ≥300 DPI. For general diagram description, low-detail mode is often sufficient and 4–8× cheaper. Never send unnecessarily large images — resize to the minimum required for your task before sending to the API.
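To make the tradeoff concrete, you can estimate GPT-4o image cost before sending. The sketch below approximates OpenAI's published tile rule using the figures from the table above (flat ~85 tokens in low-detail mode; in high-detail mode the image is rescaled, then charged ~170 tokens per 512×512 tile plus the 85-token base); verify against current pricing docs before relying on it:

```python
import math

def estimate_gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4o vision token cost from the published tile rule."""
    if detail == "low":
        return 85  # flat cost regardless of resolution
    # Scale so the longest side is at most 2048 px
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Then scale so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # ~170 tokens per 512x512 tile, plus the ~85-token base
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85

print(estimate_gpt4o_image_tokens(1024, 1024))          # -> 765 (four tiles)
print(estimate_gpt4o_image_tokens(1024, 1024, "low"))   # -> 85
```

Running the estimate on both modes before choosing makes the "4-8x cheaper" claim above easy to check for your own images.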
Text in images (OCR)
Reading text from images (Optical Character Recognition) is one of the highest-value multimodal use cases. Vision LLMs dramatically outperform traditional OCR systems on complex real-world inputs:
| Input type | Traditional OCR (Tesseract) | Vision LLM (Claude/GPT-4o) |
|---|---|---|
| Printed text, clean scan | Excellent (>99% accuracy) | Excellent (>99%) |
| Handwritten text | Poor (20–60% accuracy) | Good–Excellent (70–95%) |
| Mixed layout (columns, tables) | Struggles with layout | Understands structure natively |
| Mathematical notation | Cannot parse | Good for most notation; verify complex LaTeX |
| Non-Latin scripts | Requires language packs | Excellent (trained on multilingual data) |
| Low quality / degraded scan | Fails badly | Degrades gracefully; better than OCR |
| Text overlaid on complex background | Poor | Good (understands foreground/background) |
Always verify critical OCR output
Vision LLMs can misread very small text (<8pt equivalent in the image), certain handwriting styles, and very degraded scans. For medical records, legal documents, or financial data, always have a human verify extracted text before acting on it.
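One cheap safeguard for that verification step is to run a second, independent pass (a different model, or a traditional OCR engine) and flag documents where the two transcriptions disagree, so only disputed cases need human review. A sketch using the standard library's difflib (the example strings are invented):

```python
import difflib

def ocr_agreement(text_a: str, text_b: str) -> float:
    """Similarity ratio (0..1) between two independent transcriptions."""
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()

def flag_for_review(text_a: str, text_b: str, threshold: float = 1.0) -> bool:
    """Route to a human when the transcriptions disagree.
    Default threshold 1.0 flags ANY disagreement, which is appropriate
    for medical/legal/financial data; relax it for noisy casual inputs."""
    return ocr_agreement(text_a, text_b) < threshold

vision_llm = "Patient ID 48213, dosage 50 mg twice daily"
ocr_pass = "Patient ID 48218, dosage 50 mg twice daily"
print(flag_for_review(vision_llm, ocr_pass))  # True -> send to a human
```

Agreement between two independent readers is not proof of correctness, but disagreement is a reliable signal that something needs a second look.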
Emerging modalities: audio, video, 3D
The frontier is moving rapidly toward truly universal models that process any combination of inputs. Here is where each modality stands in 2025:
| Modality | Capability level | Leading model | Key limitation |
|---|---|---|---|
| Image understanding | Mature | Claude 3.5, GPT-4o, Gemini 1.5 | Very small text; fine-grained counting |
| Speech recognition | Mature | Whisper (OpenAI), Gemini Audio | Noisy environments; rare accents |
| Native audio understanding | Emerging | GPT-4o Audio, Gemini 1.5 Pro | Latency for real-time; cost |
| Short video (<1 min) | Developing | Gemini 1.5 Pro, GPT-4o | Temporal reasoning; object tracking |
| Long video (hours) | Early stage | Gemini 1.5 Pro (1M ctx) | Very expensive; limited availability |
| 3D / point clouds | Research stage | GPT-4o with 3D rendering tricks | No native 3D understanding yet |
Video: frames as images
Most video-capable models sample frames at a fixed rate (e.g., 1 frame/second for Gemini 1.5) and process them as a sequence of image embeddings. True native video understanding — tracking objects and events across frames — is still an active research problem. Sora's video generation uses a different architecture: video diffusion transformers (DiT).
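The frame-sampling approach makes video cost easy to estimate: at 1 frame/second and roughly 263 tokens per frame (the Gemini 1.5 figure quoted in the table above), token count grows linearly with duration. Illustrative arithmetic only; real per-frame costs vary by model and settings:

```python
def estimate_video_tokens(duration_s: float, fps_sampled: float = 1.0,
                          tokens_per_frame: int = 263) -> int:
    """Rough token cost for frame-sampled video understanding."""
    return int(duration_s * fps_sampled * tokens_per_frame)

# A 3-minute clip at 1 frame/second
print(estimate_video_tokens(180))    # 180 frames * 263 tokens = 47340
# An hour of video approaches long-context territory
print(estimate_video_tokens(3600))   # 946800 tokens -> needs a ~1M context
```

This is why hour-scale video currently requires the million-token context window noted in the table, and why it remains expensive.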
Practice questions
- What is the difference between late fusion and early fusion in multimodal AI systems? (Answer: Late fusion: each modality is processed independently through its own encoder; outputs are combined (concatenated, averaged, or learned weighted sum) at the final decision layer. Simple, modular, each encoder can be optimized separately. Misses cross-modal interactions during processing. Early fusion: raw inputs from multiple modalities are combined before or early in the neural network. Enables cross-modal feature learning. Harder to train (requires paired multimodal data), but learns richer joint representations.)
- What is cross-modal attention and why is it central to transformers like BLIP-2? (Answer: Cross-modal attention: queries from one modality attend to keys/values from another. In BLIP-2: the Querying Transformer (Q-Former) has learnable query tokens that attend (via cross-attention) to frozen image encoder outputs — extracting visual information relevant to the text. The visual tokens produced by Q-Former bridge the frozen image encoder and frozen LLM. Cross-modal attention is what allows the LLM to condition text generation on image content without retraining either component.)
- What is the grounding problem in multimodal AI? (Answer: Grounding: connecting language symbols to perceptual content. A model that understands 'red apple' should link the word 'red' to specific wavelengths of reflected light and the visual appearance of that color. Without grounding, language models manipulate symbols without genuine perceptual reference. Multimodal models achieve partial grounding by training on image-text pairs — the model learns that 'red apple' co-occurs with images containing certain visual patterns. CLIP and similar contrastive models achieve strong perceptual grounding.)
- What are the evaluation challenges specific to multimodal generation (image + text)? (Answer: Text evaluation: BLEU, ROUGE, BERTScore — standard NLP metrics. Image-text alignment: CLIPScore measures cosine similarity between generated image and text prompt embeddings. Visual quality: FID (Fréchet Inception Distance) measures distributional similarity to real images. Compositional accuracy: does the image correctly show 'a red ball to the LEFT of a blue cube'? Hard to measure automatically. Human evaluation: costly gold standard. Current models score well on CLIPScore but still struggle with precise spatial relationships and accurate text rendering.)
- What is the difference between vision-language models (VLMs) for understanding vs generation? (Answer: VLMs for understanding: image → model → text. Tasks: visual QA, image captioning, optical character recognition, chart understanding. Examples: LLaVA, Idefics, PaliGemma, Claude Vision. Architecture: vision encoder + LLM. VLMs for generation: text → model → image. Tasks: text-to-image synthesis. Examples: DALL-E 3, Stable Diffusion, Midjourney. Architecture: text encoder + diffusion model or autoregressive image generator. Unified models (GPT-4o, Gemini 2.5): both input and output images, bridging understanding and generation in one system.)
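The cross-modal attention described in the BLIP-2 question above can be sketched in a few lines: a text-side query scores every image-patch key, the scores are softmaxed into weights, and the values are mixed by those weights. A pure-Python toy (one query, one head, 2-d embeddings, learned projections omitted):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, keys, values):
    """One text query attends over image-patch keys/values
    (single head, no learned projections, for clarity)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted mix of the patch values
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# The text query resembles patch 0's key, so the output leans toward value 0
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]      # two image patches
values = [[10.0, 0.0], [0.0, 10.0]]
out = cross_attention(query, keys, values)
print(out)  # first component larger: the query "looked at" patch 0
```

In BLIP-2 the queries are the Q-Former's learnable tokens rather than raw text embeddings, but the mechanism is the same: whichever keys a query resembles most dominate the mixed output.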
On LumiChats
LumiChats supports multi-image attachments in chat. Images are sent to vision-capable models (GPT-4o, Gemini, Claude) via OpenRouter's multimodal API. You can attach multiple images to a single message for comparison analysis.