
Jensen Says AGI. The Benchmark Says 0.26%.

3 min · March 2026
Originally published on LinkedIn

Jensen Huang told Lex Fridman this week that he thinks we've achieved AGI.

The same week, ARC-AGI-3, a new interactive benchmark, launched. Humans score 100%. The best frontier model scored 0.26%.

The CEO who sells the GPUs says we're there. The benchmark that tests it says we're not close.

Both are right. And that's the interesting part.

Jensen is right that current models perform at roughly high-human level across language, reasoning, and knowledge — and work thousands of times faster. If your definition of AGI is "can do most knowledge work as well as a human," then yes, we crossed that line somewhere in the last six months.

ARC-AGI-3 is right that current models can't do things any human can do trivially — novel spatial reasoning, pattern abstraction, tasks that require genuine understanding rather than pattern matching. If your definition of AGI is "can do anything a human can do," we're nowhere near it.

The gap between these two definitions is where every enterprise AI decision gets made.

I've spent my career in that gap. The models are extraordinary at well-defined, repeated tasks — classification, extraction, routing, summarization. They fail on novel reasoning, ambiguous judgment calls, and anything that requires understanding context the training data didn't cover.

This is exactly why the SLM Flywheel works. You don't need AGI. You need a model that's world-class at your specific domain. Checkr doesn't need a model that can reason about philosophy. It needs one that classifies background checks at 90%+ accuracy. Intercom doesn't need a model that solves ARC puzzles. It needs one that resolves customer issues better than GPT-5.4.

The AGI debate is fascinating. It's also irrelevant to what you should build this quarter.

The question that actually matters for your enterprise: which of your tasks are in the "current AI is superhuman" bucket, and which are in the "current AI fails trivially" bucket?