GPT-4 is OpenAI's fourth-generation large language model, released in March 2023. According to OpenAI's technical report, it passed a simulated bar exam with a score in the top 10% of test takers, scored around the 90th percentile on the SAT, and reached expert-level performance on dozens of professional and academic benchmarks. GPT-4o ('o' for omni), released in May 2024, extended GPT-4 to natively process text, images, and audio in a single unified model, removing the need for separate vision and speech components.
Architecture: what makes GPT-4 different from GPT-3
OpenAI has not published GPT-4's architecture, but reverse engineering, leaks, and official hints suggest a Mixture of Experts (MoE) design with approximately 1.8 trillion total parameters split across 16 expert sub-networks of roughly 111B parameters each, with 2 experts routed per token. If those figures are accurate, each forward pass activates only a fraction of the total parameters, keeping per-token compute in the same ballpark as a dense GPT-3-scale model while total capacity is roughly an order of magnitude larger.
| Model | Release | Params (est.) | Architecture | Context Window |
|---|---|---|---|---|
| GPT-3 | 2020 | 175B dense | Dense transformer | 2K tokens |
| GPT-3.5 | 2022 | ~175B | Dense + RLHF | 16K tokens |
| GPT-4 | 2023 | ~1.8T MoE | Mixture of Experts | 8K–32K (128K in Turbo) |
| GPT-4o | 2024 | ~200B active | Omnimodal MoE | 128K tokens |
| GPT-4.5 | 2025 | Undisclosed | MoE + extended pretraining | 128K tokens |
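A back-of-envelope check on the leaked figures above (these are unconfirmed estimates, not OpenAI-published numbers; the calculation counts expert parameters only and ignores shared attention layers):

```python
total_params = 1.8e12    # leaked total-parameter estimate, unconfirmed
num_experts = 16
active_experts = 2       # experts routed per token, per the leaks

per_expert = total_params / num_experts
active = per_expert * active_experts

print(f"~{per_expert / 1e9:.0f}B params per expert")              # ~112B
print(f"~{active / 1e9:.0f}B expert params active per token")     # ~225B
```

This is why the per-token inference cost lands near a dense GPT-3-scale model even though total capacity is ten times larger.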
Why MoE matters
A Mixture of Experts model routes each token to a small subset of specialized sub-networks. The model gets the capacity of its full 1.8T parameters, but each token only pays the compute cost of the two experts it is routed to. Many frontier models released since 2023, including Gemini 1.5, Mixtral, Grok-1, and (reportedly) GPT-4, use MoE for exactly this reason.
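The routing idea can be sketched in a few lines. This is a generic top-k gating scheme, not GPT-4's actual implementation; the layer sizes are toy values, and real MoE layers use full feed-forward experts rather than single weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

num_experts, top_k = 16, 2   # figures from the leaked GPT-4 estimates
d_model = 64                 # toy hidden size for illustration

# Each "expert" is a sub-network; here, a single weight matrix stands in.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts)) * 0.02  # gating weights

def moe_forward(x):
    """Route a single token vector x to its top-k experts."""
    logits = x @ router                      # one routing score per expert
    top = np.argsort(logits)[-top_k:]        # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the chosen experts only
    # Weighted sum of the selected experts' outputs; the other 14 experts
    # are never evaluated, which is where the compute savings come from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (64,)
```

Only 2 of the 16 expert matrices are multiplied per token, so inference FLOPs scale with the active experts rather than the total parameter count.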
GPT-4o: the omnimodal leap
GPT-4o processes text, images, and audio through a single end-to-end neural network rather than chaining separate specialist models. Earlier voice pipelines transcribed speech with a separate model before feeding plain text to the language model, and vision was handled by a bolted-on image encoder; each hand-off discarded information. Because GPT-4o is trained end to end across modalities, it can pick up on visual structure, tone of voice, and emotion without lossy intermediate representations.
Calling GPT-4o with vision via the OpenAI API
```python
import base64

import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the image as base64 so it can be sent inline as a data URL
with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the trend in this chart."},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{image_data}"
                }},
            ],
        }
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
```

GPT-4o vs GPT-4 Turbo
GPT-4o is 2x faster and 50% cheaper than GPT-4 Turbo while matching or exceeding it on most benchmarks. For most production use cases in 2024–2025, GPT-4o is the correct default choice.
Training: RLHF and the alignment stack
GPT-4 was trained in three stages: (1) supervised fine-tuning on high-quality human demonstrations, (2) reward model training where human raters compared model outputs and labeled preferences, and (3) Proximal Policy Optimization (PPO) to maximize the reward model's score. This Reinforcement Learning from Human Feedback (RLHF) pipeline is what transforms a raw language model into a helpful, harmless assistant.
- Pretraining: next-token prediction on a reported ~13 trillion tokens of web, book, and code data
- SFT (Supervised Fine-Tuning): 10K–100K high-quality human-written demonstrations
- Reward Model: trained on human preference comparisons (which output A or B is better?)
- PPO: reinforcement learning to maximize reward model score while staying close to SFT policy
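The reward-model stage above trains on pairwise preference labels with a Bradley-Terry style loss: minimize -log sigmoid(r_chosen - r_rejected). A minimal sketch with a toy linear reward model and synthetic "rater" labels (illustrative only, not OpenAI's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8                             # toy feature dimension for a "response"
w = np.zeros(d)                   # reward model parameters, learned below
true_w = rng.standard_normal(d)   # hidden preference direction used to generate labels

def reward(x, params):
    return x @ params

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for _ in range(500):
    a, b = rng.standard_normal(d), rng.standard_normal(d)
    # Synthetic human label: which of the two responses the rater prefers.
    chosen, rejected = (a, b) if reward(a, true_w) > reward(b, true_w) else (b, a)
    # Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected); gradient step on w.
    margin = reward(chosen, w) - reward(rejected, w)
    grad = -(1.0 - sigmoid(margin)) * (chosen - rejected)
    w -= lr * grad

# After training, the learned reward model should rank pairs like the rater does.
agree = sum(
    (reward(a, w) > reward(b, w)) == (reward(a, true_w) > reward(b, true_w))
    for a, b in ((rng.standard_normal(d), rng.standard_normal(d)) for _ in range(200))
)
print(f"agreement: {agree}/200")
```

The PPO stage then optimizes the policy against this learned reward while a KL penalty keeps it close to the SFT model.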
On LumiChats
LumiChats gives you access to GPT-4o alongside Claude, Gemini, and 40+ other models — switch between them in one click to compare outputs on the same prompt.