GPT-4 is OpenAI's fourth-generation large language model, released in March 2023. According to OpenAI's technical report, it passed a simulated bar exam with a score in the top 10% of test takers, scored around the 90th percentile on the SAT, and reached expert-level performance on dozens of professional and academic benchmarks. GPT-4o ('o' for omni), released in May 2024, extended GPT-4 to natively process text, images, and audio in a single unified model, removing the need for separate vision and speech components.
Architecture: what makes GPT-4 different from GPT-3
OpenAI has not published GPT-4's architecture, but reverse engineering, leaks, and official hints suggest a Mixture of Experts (MoE) design with approximately 1.8 trillion total parameters split across 16 expert sub-networks of roughly 111B parameters each, with 2 experts routed per token. If those figures are accurate, each forward pass activates only a fraction of the total parameters, keeping per-token compute in the same ballpark as a dense GPT-3-scale model while total capacity is roughly an order of magnitude larger.
| Model | Release | Params (est.) | Architecture | Context Window |
|---|---|---|---|---|
| GPT-3 | 2020 | 175B dense | Dense transformer | 2K tokens |
| GPT-3.5 | 2022 | ~175B | Dense + RLHF | 16K tokens |
| GPT-4 | 2023 | ~1.8T MoE | Mixture of Experts | 8K–32K (128K in Turbo) |
| GPT-4o | 2024 | ~200B active | Omnimodal MoE | 128K tokens |
| GPT-4.5 | 2025 | Undisclosed | MoE + extended pretraining | 128K tokens |
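A back-of-envelope check on the leaked figures above (these are unconfirmed estimates, not OpenAI-published numbers; the calculation counts expert parameters only and ignores shared attention layers):

```python
total_params = 1.8e12    # leaked total-parameter estimate, unconfirmed
num_experts = 16
active_experts = 2       # experts routed per token, per the leaks

per_expert = total_params / num_experts
active = per_expert * active_experts

print(f"~{per_expert / 1e9:.0f}B params per expert")              # ~112B
print(f"~{active / 1e9:.0f}B expert params active per token")     # ~225B
```

This is why the per-token inference cost lands near a dense GPT-3-scale model even though total capacity is ten times larger.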
Why MoE matters
A Mixture of Experts model routes each token to a small subset of specialized sub-networks. The model gets the capacity of its full 1.8T parameters, but each token only pays the compute cost of the two experts it is routed to. Many frontier models released since 2023, including Gemini 1.5, Mixtral, Grok-1, and (reportedly) GPT-4, use MoE for exactly this reason.
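The routing idea can be sketched in a few lines. This is a generic top-k gating scheme, not GPT-4's actual implementation; the layer sizes are toy values, and real MoE layers use full feed-forward experts rather than single weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)

num_experts, top_k = 16, 2   # figures from the leaked GPT-4 estimates
d_model = 64                 # toy hidden size for illustration

# Each "expert" is a sub-network; here, a single weight matrix stands in.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts)) * 0.02  # gating weights

def moe_forward(x):
    """Route a single token vector x to its top-k experts."""
    logits = x @ router                      # one routing score per expert
    top = np.argsort(logits)[-top_k:]        # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the chosen experts only
    # Weighted sum of the selected experts' outputs; the other 14 experts
    # are never evaluated, which is where the compute savings come from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)  # (64,)
```

Only 2 of the 16 expert matrices are multiplied per token, so inference FLOPs scale with the active experts rather than the total parameter count.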
GPT-4o: the omnimodal leap
GPT-4o processes text, images, and audio through a single end-to-end neural network rather than chaining separate specialist models. Earlier voice pipelines transcribed speech with a separate model before feeding plain text to the language model, and vision was handled by a bolted-on image encoder; each hand-off discarded information. Because GPT-4o is trained end to end across modalities, it can pick up on visual structure, tone of voice, and emotion without lossy intermediate representations.
Calling GPT-4o with vision via the OpenAI API
```python
import base64

import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the image as base64 so it can be sent inline as a data URL
with open("chart.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the trend in this chart."},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{image_data}"
                }},
            ],
        }
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)
```

GPT-4o vs GPT-4 Turbo
GPT-4o is 2x faster and 50% cheaper than GPT-4 Turbo while matching or exceeding it on most benchmarks. For most production use cases in 2024–2025, GPT-4o is the correct default choice.
Training: RLHF and the alignment stack
GPT-4 was trained in three stages: (1) supervised fine-tuning on high-quality human demonstrations, (2) reward model training where human raters compared model outputs and labeled preferences, and (3) Proximal Policy Optimization (PPO) to maximize the reward model's score. This Reinforcement Learning from Human Feedback (RLHF) pipeline is what transforms a raw language model into a helpful, harmless assistant.
- Pretraining: next-token prediction on a reported ~13 trillion tokens of web, book, and code data
- SFT (Supervised Fine-Tuning): 10K–100K high-quality human-written demonstrations
- Reward Model: trained on human preference comparisons (which output A or B is better?)
- PPO: reinforcement learning to maximize reward model score while staying close to SFT policy
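The reward-model stage above trains on pairwise preference labels with a Bradley-Terry style loss: minimize -log sigmoid(r_chosen - r_rejected). A minimal sketch with a toy linear reward model and synthetic "rater" labels (illustrative only, not OpenAI's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8                             # toy feature dimension for a "response"
w = np.zeros(d)                   # reward model parameters, learned below
true_w = rng.standard_normal(d)   # hidden preference direction used to generate labels

def reward(x, params):
    return x @ params

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for _ in range(500):
    a, b = rng.standard_normal(d), rng.standard_normal(d)
    # Synthetic human label: which of the two responses the rater prefers.
    chosen, rejected = (a, b) if reward(a, true_w) > reward(b, true_w) else (b, a)
    # Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected); gradient step on w.
    margin = reward(chosen, w) - reward(rejected, w)
    grad = -(1.0 - sigmoid(margin)) * (chosen - rejected)
    w -= lr * grad

# After training, the learned reward model should rank pairs like the rater does.
agree = sum(
    (reward(a, w) > reward(b, w)) == (reward(a, true_w) > reward(b, true_w))
    for a, b in ((rng.standard_normal(d), rng.standard_normal(d)) for _ in range(200))
)
print(f"agreement: {agree}/200")
```

The PPO stage then optimizes the policy against this learned reward while a KL penalty keeps it close to the SFT model.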
On LumiChats
LumiChats gives you access to GPT-4o alongside Claude, Gemini, and 40+ other models — switch between them in one click to compare outputs on the same prompt.