GPT-5.4: Is the "Agent-Native" Large Model Here?

OpenAI finally figured it out.

Just two days after the rumors, on March 5th local time, OpenAI officially launched GPT-5.4. The update targets the hottest area in AI right now: agents.

Before GPT-5.4, the capabilities of large models could be summarized in one sentence: they can tell you “how to do it,” but they can’t do it themselves.

If you ask it to analyze competitors, it will give you a lengthy report; if you ask it to organize an Excel sheet, it will write some Python code for you to run; if you ask it to book a flight, it will tell you step-by-step which website to go to and which buttons to click.

The wall in the middle is called “computer operation.”

GPT-5.4 is OpenAI’s first general model to break down this wall.

GPT-5.4 compared to previous models|Image source: OpenAI

It can recognize screen content through screenshots, send mouse and keyboard commands, and execute multi-step workflows across different applications. In OpenAI’s own words, this is their “most powerful and efficient frontier model for professional work to date.”

More technically, GPT-5.4 supports up to 1 million tokens in the context window and can call libraries like Playwright to directly control browsers and desktop applications.
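The loop described above (look at the screen, pick a mouse or keyboard action, repeat) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OpenAI's implementation: `FakeDesktop`, `Action`, `scripted_policy`, and `run_agent` are hypothetical names invented here, the "screenshot" is a text description rather than pixels, and a hard-coded policy stands in for the model. In a real setup the environment might be a browser driven through a library such as Playwright.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    """One low-level command: "click", "type", or "done" (hypothetical schema)."""
    kind: str
    target: str = ""
    text: str = ""


@dataclass
class FakeDesktop:
    """Toy stand-in for a real screen with one form field and a submit button."""
    fields: Dict[str, str] = field(default_factory=dict)
    focused: Optional[str] = None
    submitted: bool = False

    def screenshot(self) -> str:
        # A real agent would receive an image; a text summary keeps this runnable.
        return f"focused={self.focused} fields={self.fields} submitted={self.submitted}"

    def execute(self, action: Action) -> None:
        if action.kind == "click":
            if action.target == "submit":
                self.submitted = True
            else:
                self.focused = action.target
        elif action.kind == "type" and self.focused is not None:
            self.fields[self.focused] = action.text


def scripted_policy(observation: str) -> Action:
    """Hard-coded stand-in for the model: fill in the city, then submit."""
    if "'city': 'Paris'" not in observation:
        if "focused=city" not in observation:
            return Action("click", target="city")
        return Action("type", text="Paris")
    if "submitted=True" not in observation:
        return Action("click", target="submit")
    return Action("done")


def run_agent(
    desktop: FakeDesktop,
    policy: Callable[[str], Action],
    max_steps: int = 10,
) -> List[Action]:
    """Observe, act, repeat until the policy says "done" or the step budget runs out."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = policy(desktop.screenshot())
        if action.kind == "done":
            break
        desktop.execute(action)
        history.append(action)
    return history


if __name__ == "__main__":
    desktop = FakeDesktop()
    for step in run_agent(desktop, scripted_policy):
        print(step)
```

The `max_steps` budget is a design choice worth noting: an agent that acts on its own observations needs a hard stop so that a confused policy cannot loop forever.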

This means the model no longer just talks about tasks; it carries them out.

01 OpenAI’s groundwork

If you’ve been following OpenAI over the past few months, you’ll see that GPT-5.4 isn’t an abrupt new product but a clear step along a strategic path.

Just two weeks ago, OpenAI released GPT-5.3-Codex, upgrading Codex from a “code-writing agent” to an “agent capable of almost everything a developer does on a computer,” setting new industry benchmarks on SWE-Bench Pro and Terminal-Bench.

Meanwhile, OpenAI launched the enterprise-focused “Frontier” platform, with HP, Intuit, and Uber as early users.

GPT-5.4 clearly outperforms GPT-5.2 in spreadsheet filling|Image source: OpenAI

Earlier, on March 2nd, OpenAI and AWS expanded their existing $3.8 billion partnership to over $100 billion, lasting 8 years, with AWS becoming the exclusive third-party cloud provider for the OpenAI Frontier platform. The scale of this investment itself is a signal.

The latest $110 billion funding round, supported by Amazon, SoftBank, and Nvidia, also closed around the same time.

This isn’t a company just “developing good products”; it’s a company sprinting to “win the enterprise AI agent market.”

GPT-5.4’s native computer operation capabilities are the key weapon in this sprint.

02 Is it really useful?

Demo videos at launch events always look impressive, but the real test is actual performance.

Financial tech company Walleye Capital reported in internal testing that GPT-5.4 improved accuracy in Excel financial modeling assessments by 30 percentage points, significantly speeding up automated scenario analysis.

The CEO of talent assessment platform Mercor called it “the best model we’ve tested,” citing outstanding performance on long-horizon tasks such as slide creation, financial modeling, and legal analysis.

An independent developer who uses Codex daily gave a more down-to-earth review: “GPT-5.4 is my new daily driver for Codex. Its thinking is closer to humans, and it’s not as obsessed with technical details as 5.3.” But he also added a caution: “Be careful, I’ve encountered several cases where the model misexecuted tasks and concealed it.”

GPT-5.4’s improvements in operation and visual capabilities|Image source: OpenAI

This detail is worth noting.

Benchmark data also confirms this capability boost. Reports indicate that GPT-5.4 outperforms 83% of average office workers on the GDPval benchmark. The number sounds impressive, but the real question is less “how many people it can surpass” than “which tasks it can take over from humans.”

However, Dr. Jeff Dalton from the University of Edinburgh’s School of Informatics pointed out a practical issue: the current demos offer too little detailed evidence to support such grand claims. The capabilities are real, but the boundaries still need more independent validation.

03 The Agent battlefield has no safe zone

If GPT-5.4 represents OpenAI’s ambition for Agents, competitors are not idle.

Anthropic’s Claude 3.7 Sonnet launched the “Computer Use” feature as early as February this year, positioning it as a hybrid reasoning model designed for complex tasks.

Google’s Gemini 2.0 series continues to develop “Agentic” capabilities, with Project Mariner already able to perform multi-step operations autonomously within Chrome.

But the fundamental difference between GPT-5.4 and its competitors is that it is OpenAI’s first product to embed computer operation capabilities directly into a general model — not a separate tool, not an API that needs to be called, but a built-in feature of the model itself.

This “native” aspect, in engineering terms, means lower latency, more natural task transitions, and less “glue code.” For enterprises eager to deploy Agent applications quickly, this difference directly impacts deployment costs.

OpenAI also announced that GPT-5.4 can directly connect to Microsoft Excel and Google Sheets, performing granular analysis and automation at the cell level. This step clearly targets the core of enterprise decision-making processes.
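The article does not say how the spreadsheet connection works, so the following is only a hypothetical sketch of what "cell-level" automation with an audit trail could look like. It uses neither the Excel nor the Google Sheets API: the sheet is a plain dictionary keyed by A1-style references, and `parse_cell` and `apply_edits` are names invented for this illustration.

```python
import re
from typing import Dict, List, Optional, Tuple

Sheet = Dict[str, float]                                   # e.g. {"B2": 120.0}
Change = Tuple[str, Optional[float], float]                # (cell, old value, new value)


def parse_cell(ref: str) -> Tuple[int, int]:
    """Convert an A1-style reference such as "B3" to zero-based (row, column)."""
    match = re.fullmatch(r"([A-Z]+)([1-9][0-9]*)", ref)
    if match is None:
        raise ValueError(f"not an A1-style cell reference: {ref!r}")
    letters, digits = match.groups()
    col = 0
    for ch in letters:
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(digits) - 1, col - 1


def apply_edits(
    sheet: Sheet, edits: List[Tuple[str, float]]
) -> Tuple[Sheet, List[Change]]:
    """Apply edits to a copy of the sheet and record every change that was made."""
    updated = dict(sheet)
    log: List[Change] = []
    for ref, new_value in edits:
        parse_cell(ref)  # validates the reference; raises ValueError if malformed
        log.append((ref, updated.get(ref), new_value))
        updated[ref] = new_value
    return updated, log


if __name__ == "__main__":
    sheet = {"B2": 120.0, "B3": 80.0}
    updated, log = apply_edits(sheet, [("B3", 95.0), ("B4", 215.0)])
    for change in log:
        print(change)
```

Returning a change log next to the edited copy, rather than mutating the sheet in place, leaves a record a human can review before the result is accepted.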

In the Agent arena, it’s never about who runs faster, but who can embed themselves into enterprise workflows first, becoming an indispensable part.

Tech launches are always passionate, but the real test comes on day 91 — when the hype fades, and users start applying this tool in real work scenarios. Will it reliably handle screenshots, accurately click buttons, quietly complete tasks, and deliver results?

The developer’s comment about “concealed errors” is the most cautionary note I’ve seen in this report so far.

The ceiling of AI Agent capabilities has never been “what it can do,” but “whether you dare to trust it to do it.”

Trust is the real currency in this Agent war.
