OpenAI’s Codex Spark Puts Cerebras on the Inference Map

OpenAI just shipped its first model running on non-Nvidia silicon, and the implications extend well beyond raw speed. GPT-5.3-Codex-Spark, a compact coding model optimized for real-time interaction, delivers over 1,000 tokens per second by running on Cerebras’ Wafer Scale Engine 3 instead of the GPU clusters that power every other model in OpenAI’s fleet. For infrastructure leaders and platform strategists tracking the AI compute landscape, this is the first tangible proof point from a $10 billion partnership that could reshape how inference workloads are served at scale.

[Image: Close-up of a silicon wafer showing processor die patterns]

Why Speed Is the Product, Not a Feature

Codex Spark is not simply a smaller model running on faster hardware. It represents a deliberate architectural bet that interactive coding requires a fundamentally different compute profile than the long-running agentic tasks that GPT-5.3-Codex handles. OpenAI is now splitting Codex into two complementary tracks: a deep-reasoning mode for multi-file refactors and autonomous execution, and a real-time mode where developers can interrupt, redirect, and iterate with near-instant feedback.

This dual-mode approach addresses a growing tension in agentic coding. Tools that run autonomously for minutes or hours can leave developers feeling disconnected from the process. Spark’s 1,000+ tokens-per-second throughput is designed to keep developers in flow, making targeted edits, revising logic, and testing interface changes without the cognitive friction of waiting.
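To make the flow argument concrete, a quick back-of-envelope comparison shows how throughput translates into wait time for a typical edit. The 300-token patch size and the 60 tokens-per-second baseline are illustrative assumptions, not figures from OpenAI:

```python
# Back-of-envelope: how long a developer waits for a typical patch.
# The edit size and the slower baseline are assumed, illustrative numbers.
edit_tokens = 300

for name, tokens_per_second in [("Spark-class (~1,000 tok/s)", 1000),
                                ("Typical GPU serving (~60 tok/s, assumed)", 60)]:
    wait_seconds = edit_tokens / tokens_per_second
    print(f"{name}: {wait_seconds:.1f} s to stream the full edit")

# Spark-class (~1,000 tok/s): 0.3 s
# Typical GPU serving (~60 tok/s, assumed): 5.0 s
```

Sub-second versus several seconds is roughly the difference between staying in the editor and tabbing away.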

The speed gains required more than just swapping in faster chips. OpenAI rewrote key parts of its inference stack, introducing persistent WebSocket connections that reduced per-roundtrip overhead by 80%, per-token overhead by 30%, and time-to-first-token by 50%. These infrastructure improvements will roll out to all OpenAI models, meaning Spark’s influence on the platform extends beyond a single model release.
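OpenAI has not published the details of its revised serving stack, but the general pattern is familiar: instead of paying connection setup and HTTP framing on every request, the client holds one long-lived WebSocket open and streams prompts and tokens over it. Here is a minimal sketch using the open-source `websockets` library; the `wss://example.invalid/stream` endpoint and the JSON message shapes are hypothetical illustrations, not OpenAI's protocol:

```python
# Sketch of token streaming over a persistent WebSocket connection.
# Endpoint and message formats below are hypothetical, not OpenAI's API.
import asyncio
import json

import websockets  # pip install websockets


async def interactive_session(prompts):
    # The connection is established once and reused for every prompt,
    # avoiding per-request TCP/TLS handshakes and HTTP header overhead.
    async with websockets.connect("wss://example.invalid/stream") as ws:
        for prompt in prompts:
            await ws.send(json.dumps({"type": "prompt", "text": prompt}))
            async for raw in ws:
                msg = json.loads(raw)
                if msg.get("type") == "token":
                    print(msg["text"], end="", flush=True)
                elif msg.get("type") == "done":
                    print()
                    break  # socket stays open for the next prompt


# asyncio.run(interactive_session(["rename this function", "now add a docstring"]))
```

The design point is amortization: connection and framing costs are paid once per session rather than once per turn, which is exactly where the quoted per-roundtrip savings come from.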

What Cerebras Brings to the Table

The WSE-3 is not a GPU competitor in the traditional sense. It is a single wafer-scale chip containing roughly 4 trillion transistors and 44 GB of on-chip SRAM, fabricated on TSMC’s 5nm process. Where GPU-based inference spreads workloads across clusters of discrete processors connected by external links, the WSE-3 keeps compute and memory on one tightly connected fabric. For compact models generating short bursts of code, this architecture eliminates much of the communication overhead that adds latency in distributed GPU deployments.
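A rough way to see why on-chip memory matters: autoregressive decoding is largely memory-bandwidth-bound, so the floor on per-token time is roughly the bytes of weights read divided by memory bandwidth. The sketch below uses illustrative assumptions throughout; the model footprint and both bandwidth figures are chosen for the arithmetic, not published specs for Codex Spark or any specific deployment:

```python
# Back-of-envelope decode ceiling: bytes of weights touched per token
# divided by memory bandwidth. All numbers are illustrative assumptions.
def tokens_per_second_ceiling(model_gigabytes: float, bandwidth_tb_per_s: float) -> float:
    bytes_per_token = model_gigabytes * 1e9      # weights read once per generated token
    bandwidth_bytes = bandwidth_tb_per_s * 1e12
    return bandwidth_bytes / bytes_per_token

model_gb = 20  # hypothetical compact-model memory footprint

print(f"HBM-class (~3 TB/s, assumed): {tokens_per_second_ceiling(model_gb, 3):,.0f} tok/s ceiling")
print(f"On-wafer SRAM-class (~1,000 TB/s, assumed): {tokens_per_second_ceiling(model_gb, 1000):,.0f} tok/s ceiling")

# Real systems land well below these ceilings (batching, compute limits,
# interconnect), but the gap shows why keeping weights on-chip removes
# a major single-stream latency bottleneck.
```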

The practical result is that Cerebras excels at workloads where low latency matters more than peak throughput. OpenAI has been explicit about the division of labor: GPUs remain foundational for training and cost-effective broad inference, while Cerebras handles the latency-sensitive tier. This is not a GPU replacement story. It is a portfolio diversification play.

The Benchmark Tradeoff

Spark does not match its larger sibling on raw accuracy. On Terminal-Bench 2.0, Codex Spark scores 58.4% accuracy compared to 77.3% for GPT-5.3-Codex. But the time dimension changes the calculus entirely. On SWE-Bench Pro, Spark achieves comparable accuracy in two to three minutes for tasks that take the larger model 15 to 17 minutes.
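Putting the time figures side by side makes the interactive case plain. The calculation below just takes the midpoints of the ranges quoted above; the attempts-per-hour framing is illustrative, not an OpenAI metric:

```python
# Rough iteration cadence, using midpoints of the SWE-Bench Pro task times above.
spark_minutes = (2 + 3) / 2        # ~2.5 min per task
codex_minutes = (15 + 17) / 2      # ~16 min per task

print(f"Spark: ~{60 / spark_minutes:.0f} attempts per hour")
print(f"GPT-5.3-Codex: ~{60 / codex_minutes:.0f} attempts per hour")

# ~24 vs ~4: at comparable accuracy, the faster loop lets a developer
# try, inspect, and redirect several times within one long autonomous run.
```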

[Chart: Terminal-Bench 2.0 accuracy — GPT-5.1-Codex-mini 46.1%, GPT-5.3-Codex-Spark 58.4%, GPT-5.3-Codex 77.3%]

For the types of tasks Spark targets, this is an acceptable trade. A developer reshaping a UI component or refining a function signature cares more about immediacy than exhaustive correctness. The model defaults to a conservative style, making minimal targeted edits and skipping automatic test execution unless explicitly asked. It works within a 128K context window and handles text only, with multimodal input and larger context lengths planned for future releases.
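For teams wiring Spark into their own tooling, the 128K window is the main budget to manage. A minimal sketch of trimming older turns to stay under it follows; the `o200k_base` encoding and the reserve size are assumptions, since OpenAI has not said which tokenizer Spark uses:

```python
# Sketch: keep an interactive session inside a 128K-token context window
# by dropping the oldest turns first. The encoding choice is an assumption.
import tiktoken  # pip install tiktoken

CONTEXT_WINDOW = 128_000
RESPONSE_RESERVE = 4_000  # room left for the model's reply (assumed figure)

enc = tiktoken.get_encoding("o200k_base")

def trim_to_window(turns: list[str]) -> list[str]:
    budget = CONTEXT_WINDOW - RESPONSE_RESERVE
    kept, used = [], 0
    for turn in reversed(turns):          # keep the newest turns first
        n = len(enc.encode(turn))
        if used + n > budget:
            break
        kept.append(turn)
        used += n
    return list(reversed(kept))
```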

Inference Economics and Platform Strategy

The Cerebras partnership arrives at a moment when inference costs are becoming a strategic concern for every AI platform company. As AI coding assistants move into mainstream developer workflows, the providers that can sustain faster and steadier responses at scale will hold retention advantages. OpenAI is clearly thinking about this. The $10 billion, multi-year agreement with Cerebras signals a commitment to building a heterogeneous compute portfolio rather than depending on a single chip supplier.

Cerebras, which recently raised $1 billion at a $23 billion valuation and has discussed IPO ambitions, now has a flagship reference customer that validates its inference thesis. For the broader chip market, this deal reinforces a pattern: Google builds TPUs, Amazon deploys Trainium and Inferentia, Microsoft develops Maia, and now OpenAI adds wafer-scale silicon to its mix. The GPU monoculture in inference is quietly fragmenting.

What Comes Next

Codex Spark is currently available as a research preview for ChatGPT Pro users through the Codex app, CLI, and VS Code extension, with API access limited to select design partners. OpenAI describes it as the first in a planned family of ultra-fast models, with larger variants, extended context windows, and multimodal capabilities on the roadmap.

The deeper signal here is strategic. OpenAI is building toward a compute architecture where the right silicon handles the right workload: GPUs for training and heavy inference, wafer-scale chips for latency-critical interactive tasks. As Cerebras brings more capacity online, OpenAI has indicated it will deploy larger frontier models on the platform as well. For CXOs evaluating AI infrastructure investments, the takeaway is clear: the era of single-vendor compute stacks in AI is ending, and the companies that build flexible, workload-aware infrastructure will hold the competitive advantage.