the paradox of enterprise AI

if you look at how ai is evolving inside businesses, it doesn’t look much like the lab story

today’s frontier models are great at benchmarks. arc, mmlu, math olympiads. the problem is those benchmarks don’t look anything like work. real employees don’t sit around answering trivia.

humans working look very different from chatgpt.

humans get better over time, task after task
humans learn through back and forth, not input → output
humans take individual lessons and generalize them to other domains

so the work of making ai useful in business runs perpendicular to the pursuit of agi. even if you get to gpt-10, a chatbot interface still can’t do most of what a junior analyst does. agents need context. they need rubrics to know if they’re doing well. they need continuous feedback loops, not one-off inputs.

the evolution here is clear:

first we had rlhf, where humans rated outputs
then rlaif, where models judged each other
then rlvr, where rubrics started to define success
next we’ll have ai generating the rubrics, reasoning through them in language, training on both the output and the logic behind the output

that last step matters. because it mirrors how humans think: you don’t just grade the answer, you also grade the reasoning that got there.

what does this world look like? a model dropped into a live environment, constantly updated with context the same way an employee gets emails, dashboards, new data. someone has to build the pipes — access, permissioning, monitoring — so agents can soak that context.

the agent then tackles small tasks. it reasons, evaluates itself against a rubric, improves over time, generalizes across similar tasks. the hardest part is converting messy business inputs into natural language the model can actually see. web pages, files, spreadsheets → all need to be translated into something a language model can reason over.

the result is a system that looks like a new hire: dropped into an environment, given tasks, evaluated, and trained to get better. except this new hire doesn’t quit, doesn’t forget, and scales infinitely once you get the set-up right.

so where do humans fit? that’s the wrong first question. the simpler one is: where would an agi fit inside a company at all?

labs train models to reason at a phd level. businesses want ai to do the job of a junior analyst. paradoxically, models today are better at zero-shot executive reasoning than they are at repetitive analyst workflows. which is why so much enterprise adoption has looked like automation. rules-based systems wrapping llms. coding assistants that can be evaluated true/false. because code is already a natural language.

enterprises don’t actually want agi. they want competitiveness. that usually means two things:

lower costs → more pricing flexibility, more cash for r&d
higher revenue → stealing share from competitors

and they want to do it without blowing up the core business that generates today’s cash flows. that’s why incumbents miss platform shifts. microsoft didn’t catch the iphone wave. blackberry couldn’t adapt. they were protecting the core.

when it comes to ai, enterprises care about five things:

services to help them deploy
change management to get people to use it
specificity to their context
low risk
fast, high impact

that matrix looks nothing like the labs’ matrix. labs are building models that could replace the ceo. enterprises want to replace bpo’d functions and junior analysts. they want quick wins. not fundamental reorganizations.

that means the evals for enterprise ai look different too. you’re not grading a model on olympiad-level proofs. you’re asking: can this agent learn sequentially, improve with context, and zero-shot onto a slightly different task tomorrow?

which is why code took off first. because it fit the lab evals and the enterprise evals at the same time. it’s natural language. it’s deterministic. correctness can be tested instantly. there is no equivalent eval today for “is this dcf built correctly?”

so what has to be built:

pipelines that convert enterprise context into natural language
pretraining models on that context
reasoning chains that get to outputs
reinforcement on both the reasoning and the outputs

this is where humans come in. not as the agent itself, but as the scaffolding. deciding what context matters. converting that context into something the model can see. reviewing the outputs. rewarding good behavior. correcting bad behavior.

it’s ironic: llms were built to pass the turing test. they feel like talking to a smart human. that illusion makes it easy to believe they can run an entire business. but they weren’t built to navigate systems, chase down context, or manage workflows. that’s why the next wave isn’t going to be horizontal chatbots. it’s going to be vertical agents, trained in specific domains, evaluated on real tasks, improving through the same feedback loops humans use.

the business of ai isn’t just building smarter models. it’s building the infrastructure that turns them into workers.