If you shipped an LLM feature in 2024, you probably wrote a while loop around a chat completion and called it an agent. That worked. It does not work anymore.
By mid-2026, “agent” means tool calls, branching, retries, streaming, durable state, human-in-the-loop pauses, and an evaluation pipeline you can show to a skeptical platform team. The framework you pick is now load-bearing. Pick wrong and you are six months from a rewrite — I’ve watched two teams do it this year alone, both starting from CrewAI, both ending on something else.
The market consolidated. Twelve months ago there were twenty-plus frameworks worth a look. Today there are about six that show up in real production stacks: LangGraph, OpenAI Agents SDK, CrewAI, LlamaIndex Workflows, Pydantic AI, and Mastra. This is the honest version of how they compare when you actually run them at 2am.
The mental model matters more than the feature list
Every framework you’ll evaluate is selling you the same nouns — agents, tools, memory, handoffs — but the underlying abstraction is wildly different, and that’s what decides whether your codebase rots in nine months.
- Graph-based (LangGraph): explicit nodes and edges. You draw the state machine, the framework runs it. Verbose, but every transition is inspectable.
- Event-driven (LlamaIndex Workflows): steps emit events, other steps subscribe. Feels natural if you’ve written reactive code, awkward if you haven’t.
- Role-based (CrewAI): you define “agents” with personas and a “crew” that coordinates them. Reads like fiction. Debugs like fiction too.
- Typed-flow (Pydantic AI, Mastra): functions with typed inputs and outputs, composed manually. The agent is just
await agent.run(...). Honest about what an LLM call is. - SDK-style (OpenAI Agents SDK): a thin orchestration layer with handoffs, guardrails, and tracing. Opinionated and tied to OpenAI by default.
The pattern I keep seeing: teams pick a framework because of a demo video, then they hit a wall the abstraction wasn’t designed for. A graph framework with a 12-node loop that nobody wants to refactor. A role-based framework where two “agents” talk past each other and there’s no clean way to inspect the message queue. Pick the abstraction that matches the shape of your problem, not the abstraction with the prettiest README.
LangGraph: still the default, still the most painful to learn
LangGraph is the gravity well of agent frameworks. If your team has been on LangChain since 2023, you’re probably on LangGraph now whether you chose it or not. It deserves its position — the graph model is genuinely the right primitive for non-trivial agents with branching and retries, and the checkpointing story (Postgres-backed, resumable) is the most mature in the field.
What people don’t tell you: the learning curve is brutal. The StateGraph API has more knobs than you’ll touch in your first six months, and the docs still bounce between LangChain-classic style and pure LangGraph style depending on which page you land on. New engineers on the team will write code that runs but accidentally re-executes a node on every tick because they didn’t grasp the reducer pattern.
The payoff is real once you’re past that. Human-in-the-loop pause/resume with persistent state is a one-line interrupt() call. Streaming partial state out of a graph is built in. LangSmith integration is automatic and the trace view is the best in the category — by a margin, not a hair. If you’re building a customer support agent or anything that needs to survive a process restart mid-conversation, LangGraph’s durability story is hard to beat.
The lock-in worry is real but overstated. LangGraph itself is provider-agnostic; you can plug Anthropic, OpenAI, Bedrock, or anything via LiteLLM. The lock-in you should worry about is to LangSmith — once your team gets used to that trace UI, you will not want to leave it, and LangSmith is not cheap at scale.
OpenAI Agents SDK: the quiet winner for OpenAI-only stacks
When OpenAI shipped the Agents SDK in 2025, most of us shrugged — yet another framework in a crowded field. A year later it’s eating LangGraph’s lunch for one specific use case: teams that were always going to be on gpt-* models anyway and don’t want to think about it.
The primitives are tight. Agent, Runner, handoff, guardrails, tracing. That’s basically the whole surface area. Handoffs in particular are the right primitive for multi-agent flows — instead of CrewAI’s persona-driven hand-waving, you literally say “if condition X, hand off to agent Y” and the framework manages the state transfer. It feels less magical, which is exactly the point.
The trace dashboard ships in the OpenAI console, free. No separate observability vendor to evaluate. For a small team shipping an internal agent or a niche SaaS feature, that’s enormous — three SaaS evaluations you don’t have to do.
The cost is obvious: you are tied to OpenAI models. The SDK technically supports plugging in other providers via custom model classes, but you can feel that it wasn’t the design target. If your CFO comes to you in nine months asking why you can’t just route to Haiku for the cheap stuff and gpt-* for the hard stuff, you’ll be doing surgery on adapter layers. For a lot of teams that bet is fine. Just make it explicitly, not by accident.
CrewAI: the framework production teams keep removing
I want to be careful here because CrewAI has real adopters and a real community, but the pattern I keep seeing — and that several teams have written about publicly — is the same: ship to prod on CrewAI, hit a debugging wall around month four, rewrite onto LangGraph or OpenAI Agents SDK.
The pitch is great. You define “agents” with roles (“you are a senior research analyst”), a “crew” that orchestrates them, and “tasks” they execute. It reads like a screenplay. The early demo is delightful — your researcher agent talks to your writer agent talks to your editor agent. Beautiful.
The problem is that the abstraction hides exactly the thing you need to see when something goes wrong. When agent A produces garbage and agent B then dutifully refines that garbage into more garbage, the framework gives you a transcript that looks like an HR transcript and not a debuggable state machine. Token costs balloon because every “agent” re-establishes context. The role-based mental model also makes it hard to express “if X happens, skip the next two agents and go straight to the cleanup step” — you fight the abstraction.
CrewAI Flows (the newer, more imperative API) addresses some of this and is honestly fine. If you’re picking up CrewAI in 2026, use Flows, not the classic Crew API. But at that point you’re using a more verbose flavor of what LangGraph or Pydantic AI does natively, with a smaller ecosystem.
LlamaIndex Workflows: the right answer if your product is mostly RAG
LlamaIndex has been quietly excellent for a long time. Workflows — their event-driven orchestration layer — is the right tool if your agent is fundamentally a RAG pipeline with a few branches: retrieve, rerank, synthesize, optionally call a tool, return.
The event model is elegant. Each step is an async function that takes an event and emits one or more events. The framework runs them concurrently where it can, serially where it has to. Streaming is first-class. The documentation is good, the community is sane, and the ingest tooling LlamaIndex is famous for is still the best in the business if you’re indexing PDFs, web pages, and structured data into a vector store.
The catch: for non-RAG agent shapes — say, an ops automation agent with twelve tools and no document store — Workflows feels like the wrong shape. You can make it work, but you’ll be doing without primitives that LangGraph or OpenAI Agents SDK give you for free. If “retrieve and reason” is 80% of your product, LlamaIndex. If it’s 20%, look elsewhere.
Pydantic AI: the typed Python option
Pydantic AI is the framework I keep recommending to engineers who hate frameworks. Which is most of them.
The pitch is small: agents are Python objects, tool calls are decorated functions with Pydantic-validated arguments, dependencies are passed via a typed context. Run an agent with await agent.run(...). That’s basically it. No graph, no events, no roles. Just functions with types.
The reason this works in production is that it doesn’t hide anything. When the LLM returns a bad tool call argument, Pydantic raises a validation error you can read. When you want to swap models, you swap one line. Multi-provider support is genuinely first-class — Anthropic, OpenAI, Gemini, Bedrock, Ollama all behave the same way at the call site.
What you give up: there is no built-in checkpointer. There is no graph visualization. Human-in-the-loop is something you build yourself with whatever queue/database you already have. For a small product team with a clear use case, that’s a feature — you’re not paying complexity tax for things you don’t use. For a 30-engineer platform team building a shared agent runtime, you might want LangGraph’s batteries.
Mastra: TypeScript teams finally have a real option
For a long time the answer to “what’s the TypeScript equivalent of LangGraph” was “the Vercel AI SDK plus a lot of glue code.” Mastra is the first TypeScript-native agent framework I’d call production-credible.
It bundles workflows (graph-style), agents (with tools and memory), evals, and RAG into one package that feels coherent. The DX is the best part — your IDE actually understands your tool definitions because they’re typed end-to-end, no as any escape hatches. Streaming works. Multi-provider works. Memory backends pluggable (Postgres, Redis, in-memory).
The reasons to hold back: the project is young by production standards, the integrations ecosystem is thinner than the Python frameworks, and “I need this exact thing from LangChain” probably means a port. If your stack is Node/Next.js and you’ve been doing the AI work in a separate Python service purely to access these libraries, Mastra is the first real argument to bring it home. If your team is already happy in Python, you don’t need Mastra.
Production realities the demo videos don’t show
Most framework comparisons stop at API ergonomics. The decisions that bite you later are the ones below.
Durability and checkpointing. LangGraph has the most mature story — Postgres-backed, resumable, with explicit checkpoint hooks. OpenAI Agents SDK has tracing but not durable state across process restarts. Pydantic AI and Mastra hand you the primitives and expect you to wire up your own persistence. LlamaIndex Workflows added a context-serialization story that’s improving but not yet at LangGraph’s level. If your agent runs for more than 30 seconds end-to-end, you need to think about this on day one, not month six.
Observability. LangSmith remains best-in-class but pricey and tied to the LangChain ecosystem. Langfuse is the open-source choice that works with everything — there’s a separate post on this blog about LLM observability tools that goes deeper. OpenAI’s built-in trace dashboard is great if you’re OpenAI-only. Mastra has solid built-in tracing with optional OTel export. Whatever you pick, install it before you ship — retrofitting observability into an agent codebase is miserable.
Cost tracking and rate limiting. Almost no framework handles this well out of the box. You’ll end up putting an LLM gateway (Portkey, LiteLLM, OpenRouter) in front of whatever framework you choose, both for fallback routing and for per-tenant spend caps. Plan for it.
Multi-tenant SaaS. This is where role-based frameworks especially struggle. If “user A’s agent state must never leak to user B” is a hard requirement, audit the framework’s memory and context handling carefully. LangGraph’s per-thread state model and Pydantic AI’s explicit dependency injection make this clean. Some of the others, less so.
Picking by problem shape
Skip the “best framework” framing. Pick by the shape of what you’re building.
- RAG-heavy product (docs Q&A, research assistant): LlamaIndex Workflows. Pydantic AI as runner-up if you want lighter.
- Multi-step agent with tools, branching, retries: LangGraph. Worth the learning curve.
- OpenAI-only SaaS feature, small team: OpenAI Agents SDK. Stop overthinking.
- TypeScript/Next.js stack, want everything in one repo: Mastra.
- Small product, want to understand every line, hate frameworks: Pydantic AI.
- Customer support automation with handoffs to humans: LangGraph (for the checkpointing) or OpenAI Agents SDK (for handoff primitives) — pick by your model strategy.
- You are evaluating CrewAI: try CrewAI Flows, not classic Crew. Or just use Pydantic AI for a smaller surface area and similar feel.
One thing nobody warns you about: framework migrations are not as bad as they sound, if you’ve kept your tool definitions and prompts in their own modules. The agent loop is usually 200–400 lines and rewritable in a weekend. Your tools, your prompts, and your evals are the asset. Build those well and you can swap the framework underneath later without crying.
The agent framework you pick today is a one-year bet, not a five-year one. Pick deliberately, get something into prod, and instrument it well. The framework that wins in 2027 is probably already on the list above — but which one wins is going to depend on which one your team can actually debug at 2am.