Agentic AI: what it actually is
The AI debate in software development has split into two camps that have been yelling at each other for two years. On one side, the Twitter clips where an agent builds a complete app in 90 seconds, epic music in the background. On the other, the experienced dev who’ll tell you AI will never truly understand Rust’s borrow checker, so it’s fundamentally limited.
Both are partially right. Both are missing the point.
Two camps, both wrong
We’ve all seen the two-sentence demo that “builds a SaaS in 10 minutes.” The prompt is carefully crafted, the task is bounded, the context is clean, and it cuts before the edge cases. That’s not lying, it’s selection. The problem is that social media amplifies the perfect demo and buries the ten previous attempts that produced broken code.
The absolute skeptic has a different reasoning error: he’s asking the wrong question. “Does AI really understand what it’s doing?” isn’t an engineering question, it’s almost philosophy. The right question is: “Does it work well enough, often enough, to be useful?” That answer is a lot more interesting.
What both camps avoid looking at is the actual mechanics of these systems. Not magic, not fundamental limitation. Just what it is.
What an AI agent actually is
Strip away the marketing, and an AI agent is a language model, some tools, and a loop.
The model receives a context (the conversation, the current state, the results of previous actions) and produces either a final response or a tool call. The tool runs, its result comes back into the context, and it repeats until the model decides it’s done, or a limit is hit.
graph TD
A[User request] --> B[LLM]
B -->|Tool call| C[Tool execution]
C -->|Result| B
B -->|Final response| D[User]
B -->|New call| CThat’s it. No mysterious reasoning, no emergent consciousness. A while loop with an LLM doing the condition check on each iteration.
This simple pattern produces behaviors that look like planning, deduction. But it’s sophisticated statistical prediction: very powerful within the cases covered by training, much less robust off the beaten path.
A system that never plays the same score twice
This is the point devs misunderstand first, and it changes everything: given the same prompt, an LLM produces different results.
That’s not a bug. It’s a fundamental property of these models — temperature and stochastic sampling are part of the design. But for a dev used to deterministic systems, it’s a cultural shock. You can’t write a classic unit test for an agent. assert output == expected doesn’t work when output varies on every run.
graph LR
A[Same prompt] --> B[Run 1]
A --> C[Run 2]
A --> D[Run 3]
B --> E["Result A"]
C --> F["Result B"]
D --> G["Result A'"]In practice:
- Agent tests need to evaluate properties (is the response in the right format? does it contain the expected elements?) rather than exact values
- An agent that works 95% of the time will fail the other 5% unpredictably, not always on the same inputs
- Debugging erratic behavior means thinking in distributions, not reproducible cases
Non-determinism isn’t a flaw to fix. It’s a constraint to build around.
There’s a mathematical consequence that often gets underestimated: error rates compound at every step. An agent at 95% accuracy per action has a 60% chance of completing a 10-step task correctly. Over 100 steps, under 1%. That’s not a question of model intelligence. It’s conditional probability. Long agents fail first because errors accumulate, not because the model “doesn’t understand.”
Why demos look like magic, and why they lie a little
An AI agent demo looks like a magic trick because it assembles the perfect conditions: clearly defined task, clean data, unambiguous context, visible result.
The moment you step outside that corridor, complexity explodes. An /api/orders endpoint that returns 200 sometimes and 422 others depending on implicit session state: the agent will hallucinate a coherent interpretation and keep going. A codebase with undocumented conventions: the model applies the most frequent patterns from its training data, not necessarily yours. A filesystem with circular dependencies: good luck having the agent detect those without explicit tooling.
The perfect demo consistently hides the question of reliability over time. An agent that succeeds at a task in a demo is not an agent that succeeds in production every time.
The demo was recorded after the run that worked. The five previous runs that produced buggy code, nobody shows those.
That’s not a reason to reject the tool. It’s a reason not to build expectations on demos.
Why the Rust dev is partially right
The skeptic isn’t wrong on the fundamentals. He’s just asking the wrong question.
AI doesn’t have understanding in the human sense. It has no mental model of the borrow checker, no internal representation of ownership rules. What it has is a fine statistical ability to recognize patterns of valid Rust code and reproduce them in similar contexts.
On standard code and popular libraries, that works surprisingly well often enough. On code that breaks conventions, on constraints implicit to your system, on interactions between components that were never in public training data, the model will produce something plausible that doesn’t compile. Or worse: that compiles, but is wrong.
The real problem is the invisible error surface. When a junior dev makes a mistake, there’s usually a signal: a visible misunderstanding, a question asked, behavior that betrays the confusion. An LLM produces confident code regardless of the quality of its response. It doesn’t express doubt.
The danger isn’t the AI that says “I don’t know.” It’s the AI that says “here’s the solution” with the same confidence on a common pattern as on an edge case it doesn’t have a handle on.
What this actually changes for a dev in 2026
Agentic AI is an amplifier. What it amplifies is your ability to explore and iterate in areas where you already have a solid mental model.
Where it makes sense: repetitive tasks with bounded scope. Generating unit tests on well-typed code, writing config boilerplate, converting data structures. The model stays in its lane, reliability is high.
Fast exploration of an unfamiliar codebase works too. You land on a project you don’t know, you want to understand the payment module. An agent with file access gives you a starting point in seconds — not infallible, but enough to orient your reading.
Where it gets risky without guardrails: critical code with no human review. A 200-line diff generated by an agent deserves the same level of scrutiny as code written by a junior dev, maybe more, because syntactic plausibility can put you to sleep.
And systems with strong implicit constraints. If your domain rules aren’t in the context you gave the agent, it will improvise. Confidently.
Understanding these properties is what lets you decide where to integrate AI and where not to. Not out of ideology, out of pragmatism.
One concrete answer to non-determinism: harness engineering. Not testing the agent, building the environment it operates in. Which tools it can call, which constraints bound its decisions, how it validates its own work before handing it to you. The OpenAI Codex team put it clearly: the engineering work shifts from writing code to designing the system around the agent. I wrote a short piece on it here, with links to what Anthropic and OpenAI published on the subject.