Langfuse vs Helicone vs Portkey vs LiteLLM in 2026

May 9, 2026

Last month a startup I advise pinged me at 11pm: “Our Anthropic spend tripled overnight and we have no idea why. Help.” They’d been shipping Claude features for six months on the bare SDK — no proxy, no traces. The cost shock and the panic were predictable. The fix took two days.

If you’re running LLMs in production in 2026 and you don’t have a gateway and an observability layer, you’re flying with no instruments. The catch is that the market keeps mashing those two things into one phrase — “LLM observability” — and dumping tools into it that solve different problems. Langfuse and Helicone are not the same thing as LiteLLM and Portkey, even though Google pretends they are.

Here’s how I’d actually pick.

Two layers, not one

Stop thinking of “LLM observability” as one product. It’s two.

The gateway sits in your hot path. App talks to gateway, gateway talks to provider. It does routing across providers, retries, fallback when Claude has a bad ten minutes, response caching, rate limiting, virtual keys for departments, and budget enforcement. LiteLLM, Portkey, OpenRouter, and the newer Bifrost project live here.
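To make the fallback part concrete, here's a minimal sketch of what a gateway does when a provider has a bad ten minutes. Everything in it is hypothetical: `ProviderError`, `call_with_fallback`, and the set of retryable status codes are names and choices I made up for illustration, not any real gateway's API.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


# Status codes treated as retryable in this sketch (an assumption):
# rate limits, generic server errors, and "overloaded".
RETRYABLE = {429, 500, 502, 503, 529}


def call_with_fallback(
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    prompt: str,
    retries_per_provider: int = 2,
    backoff_s: float = 0.0,
) -> Tuple[str, str]:
    """Try providers in order. Retry retryable errors, then fall through to the next one.

    Returns (provider_name, response_text). Non-retryable errors are re-raised
    immediately, since a bad request will not get better on another provider.
    """
    last_error: Optional[Exception] = None
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, call(prompt)
            except ProviderError as err:
                if err.status not in RETRYABLE:
                    raise
                last_error = err
                # Exponential backoff between attempts (0 by default).
                time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error
```

Real gateways add jitter, per-provider timeouts, and circuit breakers on top, but the shape is the same: an ordered list of providers and a rule for when to give up on each.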

The observability layer sits beside it. It ingests traces (usually async, usually via OTel), stores them, runs evals, manages prompt versions, builds dashboards. Langfuse, Helicone, Phoenix, Braintrust, LangSmith live here. Some of them — notably Helicone and Portkey — also act as proxies, which is where the confusion starts.

You usually want both layers. You can pick a stack where one tool covers both (Portkey, Helicone) or pair best-of-breed (LiteLLM in front, Langfuse beside). Both are reasonable. Picking one and pretending it covers the other is how startups end up writing custom fallback logic at 2am.

The 2026 reference shape

your app
   ↓
LLM gateway (routing, fallback, cache, rate limits, virtual keys)
   ↓
provider APIs (Anthropic, OpenAI, Google, Mistral, Groq, …)

gateway → observability (traces, evals, prompt mgmt, cost attribution)

The observability path is async. Your hot path doesn’t block on Langfuse; it ships spans over OTel or a batched HTTP client. If your “observability” tool is in the synchronous path — Helicone is, by design — that’s a separate trade-off: convenience now for a hard dependency on their edge later.
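Here's roughly what "doesn't block" means in code: the hot path drops a span into an in-memory buffer and returns, while a background thread ships batches. This is a stdlib-only sketch, not how any particular SDK is implemented; `AsyncTraceShipper` and the `send_batch` callback are hypothetical names, and a production client would also flush on a timer rather than only on batch size.

```python
import queue
import threading
from typing import Callable, List


class AsyncTraceShipper:
    """Buffers spans in memory and flushes them in batches from a background thread."""

    def __init__(self, send_batch: Callable[[List[dict]], None],
                 batch_size: int = 50, max_queue: int = 10_000):
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._queue = queue.Queue(maxsize=max_queue)
        self._stop = object()  # sentinel that tells the worker to finish
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, span: dict) -> bool:
        """Called from the hot path. Never blocks; drops the span if the buffer is full."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            return False

    def close(self) -> None:
        """Flush whatever is still buffered and stop the worker."""
        self._queue.put(self._stop)
        self._worker.join()

    def _run(self) -> None:
        batch: list = []
        while True:
            item = self._queue.get()
            if item is self._stop:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._flush(batch)
                batch = []
        if batch:
            self._flush(batch)

    def _flush(self, batch: list) -> None:
        try:
            self._send_batch(batch)
        except Exception:
            # An observability outage must not take the app down; the batch is lost.
            pass
```

The design choice worth noticing is in `record`: when the buffer is full it drops the span instead of waiting. You lose a trace, but your request latency never depends on the tracing backend.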

That trade-off is most of the story for picking between these tools.

Langfuse: the OSS observability default

Langfuse is what I tell people to start with for traces, evals, and prompt management. It’s open-source (MIT core, EE features under a separate license), self-hostable, and the cloud tier is generously priced. Langfuse 3 moved the trace store onto ClickHouse with Postgres for metadata, which is the right architecture and also the part that bites you on self-host.

What it actually gives you:

  • Traces across multi-step agents — tool calls, sub-spans, LLM responses
  • Prompt management with versioning and an A/B layer
  • Evals — both LLM-as-judge and code-defined, with dataset runs
  • Score annotations from humans or automated pipelines
  • A UI that doesn’t make you cry

Where it bites:

  • Self-hosting Langfuse 3 means running ClickHouse, Postgres, Redis, and the worker stack. If “we have a Postgres and that’s it” describes your ops surface, you’ll feel it.
  • The Python and JS SDKs are good. The Go SDK is community-maintained, and it shows.
  • It is not a gateway. You still need fallback and retry logic upstream.

The cloud tier is the path of least resistance. Hobby is free with generous limits; Pro sits around $59/month with usage; Enterprise is a separate quote. For a small team self-hosting on one beefy box, ops cost is real but not crazy.

Helicone: one-line proxy observability

Helicone took a different bet — sit in the hot path as a proxy, and traces come basically for free with one base-URL change. That’s hard to argue with for a startup that needs visibility yesterday.

from anthropic import Anthropic
client = Anthropic(base_url="https://anthropic.helicone.ai")

Add your Helicone API key as a Helicone-Auth header and you’re done. Traces, costs, latency, prompt logs, all in their UI. No SDK instrumentation, no OTel setup, no library wrapping.

The trade-off is real and unavoidable: every LLM request goes through Helicone’s edge. If they have a bad day, you have a bad day. The edge is global, and in practice this hasn’t been a recurring pain, but it’s the architectural bet. Some teams switch to async logging (Helicone supports it via their wrapper) to drop the synchronous dependency, at the cost of losing the one-line setup story.

Helicone now also covers caching, rate limiting, and basic routing, which makes it more of a hybrid than pure observability. The core is open source (Apache 2.0), most people land on the hosted tier, and self-hosting is doable.

I like Helicone for solo devs and small teams who want to ship and not babysit observability infra. At series-B scale you’ve outgrown one-line magic and want a real OTel pipeline anyway.

Portkey: the unified-control-plane bet

Portkey tries to be everything — gateway, observability, prompt management, guardrails, virtual keys, AI security. It’s well-funded, the product is genuinely broad, and the “one control plane for AI” pitch is what enterprise buyers want to hear.

If you’re at a regulated company that needs SSO, RBAC, audit logs, PII redaction, prompt firewalling, and a unified place to enforce per-team budgets, Portkey is on the shortlist. It’s a credible LangSmith-and-LiteLLM-and-LaunchDarkly-for-LLMs in one box.

What I dislike: when one vendor owns gateway and observability, you’ve hard-coupled two layers that should be loosely joined. If their gateway hiccups, your observability also goes dark, because the trace pipeline runs through the same surface. The breadth also means individual capabilities aren’t always best-in-class — the prompt management is fine, the evals are fine, routing is fine. Nothing is “wow.”

For a 200-person org, “fine across the board with one bill” beats “best-of-breed with five vendors.” For a 5-person team, that calculus flips.

LiteLLM: the SRE-friendly proxy

LiteLLM is what I install when I want gateway behavior and nothing else. Open-source (MIT), speaks 100+ providers behind an OpenAI-compatible API, runs as a Python library or a standalone proxy, and it’s boring in the good way.

You get:

  • One API surface for Anthropic, OpenAI, Bedrock, Vertex, Azure, Together, Groq, Mistral, Cohere, and a long tail
  • Routing rules with fallback (try Claude Sonnet, fall back to GPT-5 mini if it 529s)
  • Budget tracking per virtual key
  • Caching (Redis or in-memory)
  • A rough UI that won’t impress anyone but works
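Per-key budget tracking is simple at its core: check spend before a call, record cost after it. The sketch below is my own illustration of that idea, not LiteLLM's implementation or config format; `VirtualKeyBudgets`, `allow`, and `record` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VirtualKeyBudgets:
    """Tracks spend per virtual key against a fixed USD limit."""

    limits_usd: Dict[str, float]
    spent_usd: Dict[str, float] = field(default_factory=dict)

    def allow(self, key: str) -> bool:
        """Check before a call: True while the key is still under its limit."""
        return self.spent_usd.get(key, 0.0) < self.limits_usd[key]

    def record(self, key: str, cost_usd: float) -> float:
        """Record the cost of a finished call and return the remaining budget.

        The remainder can go negative, because the cost of a call is only
        known after the provider responds.
        """
        self.spent_usd[key] = self.spent_usd.get(key, 0.0) + cost_usd
        return self.limits_usd[key] - self.spent_usd[key]
```

In a gateway, `allow` runs before the request is forwarded and `record` runs once token usage comes back. Unknown keys raise `KeyError` here, which is the behavior you want: no key, no call.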

What it doesn’t give you: a serious traces and evals UI. You pipe LiteLLM logs into Langfuse, Phoenix, or Datadog, and that’s the canonical setup — LiteLLM in front for routing, Langfuse beside for tracing. That two-tool combo is the reference architecture I most often deploy.

LiteLLM has a paid Enterprise tier with SSO, audit, and support. The OSS proxy is genuinely production-capable; I’ve seen it run multi-million-call workloads on a small Kubernetes deployment without drama.

OpenRouter: the marketplace

OpenRouter is a different shape. It’s a multi-provider router with a credits-based marketplace — buy credits, call any model OpenRouter has, they handle billing and routing under the hood. There’s a UI, model rankings, and a pricing page that compares providers head-to-head.

For prototyping or solo projects, OpenRouter is hard to beat. You get access to dozens of models without juggling N billing relationships, and they sometimes have surplus-capacity pricing that beats list rates on the underlying provider.

For production at scale, two real concerns. One: you’ve added a vendor with margin in your hot path, and at high volume you’ll save money going direct. Two: you’re trusting OpenRouter’s SLA and rate limits on top of the provider’s. I see OpenRouter most often in side projects and dev workflows, and as a fallback in LiteLLM configs for the long-tail open-source models you don’t want to host.

The honorable mentions

A few tools I considered and parked:

  • Phoenix (Arize) — strong OSS observability, OTel-native, particularly nice for ML teams already in the Arize world. Worth a look if you don’t like the Langfuse stack shape.
  • Braintrust — eval-first, slick UI for prompt experiments and dataset-driven scoring. Pricier, very polished, ML-team-flavored.
  • LangSmith — LangChain’s tool. If you’re on LangChain, this is the path of least resistance. If you’re not, it doesn’t make a strong case for itself.
  • Lunary — OSS, narrower scope than Langfuse, picked by teams who want something lightweight.
  • Honeycomb / Datadog LLM — your existing observability vendor probably ships an LLM module by now. Never best-of-breed, but “we already have it” is a real argument when budgets are tight.

Cost math at three scales

Illustrative as of early 2026 — verify against current pricing pages before you spreadsheet anything. The point isn’t the exact dollars, it’s the shape.

1M LLM calls a month (small SaaS feature):

  • Langfuse Pro: ~$59/mo plus a usage component
  • Helicone hosted: free tier covers a chunk, then ~$25–$50/mo
  • Portkey starter: ~$50–$100/mo
  • LiteLLM self-hosted: just the box, ~$20/mo on a small VM
  • OpenRouter: ~5% margin on top of provider costs, no platform fee

At this scale, observability cost is rounding error compared to the LLM bill. Pick what your team will use.

10M calls a month (a real product):

Storage and trace retention start mattering. Langfuse cloud lands in the low hundreds of dollars a month. Self-hosting is cheaper, but you’re paying in ops. Portkey and Helicone scale roughly with traffic. The LiteLLM proxy is still cheap. The real question is engineering time, not the bill.

100M calls a month (AI-native company):

Self-hosted Langfuse on ClickHouse becomes attractive. Sample rates start to matter — you don’t trace every span. Gateway choice matters more for unit cost (fallback strategies and response caching can save 20–40% on repeat queries). At this scale you’re probably running LiteLLM (or Bifrost) plus a self-hosted observability layer, and you have a person whose actual job is owning the AI infra.
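On the sampling point: one common approach is deterministic head sampling, where you hash the trace ID so every span in a trace gets the same keep-or-drop decision. A minimal sketch, assuming string trace IDs; `keep_trace` is a hypothetical helper, not an API from Langfuse or any OTel SDK.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to keep a trace, deterministically from its ID.

    The same trace_id always gets the same answer, so all spans of one
    trace are kept or dropped together, even across services.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

In practice you'd usually layer an override on top, such as always keeping traces that contain an error, so sampling never hides the requests you most want to see.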

Self-hosted vs cloud, honestly

The OSS-self-host pitch sounds great until you realize what “running ClickHouse” actually means for a non-data team. Backups, schema migrations, compaction tuning, disk pressure alerts at 3am. If your team has never run ClickHouse, factor in two weeks of learning and one regretted Sunday.

For most teams under series B, the right call is: use the SaaS tier for Langfuse or Helicone or Portkey, run LiteLLM as a small in-cluster service if you want gateway control, and revisit when your annual SaaS bill exceeds roughly what one engineer-month costs you. Past that threshold, rolling your own starts to be cheaper than paying.
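That threshold is easy to check as arithmetic. The function name and the example figures in the usage note are assumptions for illustration; plug in your own loaded engineer cost.

```python
def saas_exceeds_engineer_month(saas_monthly_usd: float,
                                engineer_annual_cost_usd: float) -> bool:
    """True when a year of SaaS spend costs more than one engineer-month."""
    annual_saas = saas_monthly_usd * 12
    engineer_month = engineer_annual_cost_usd / 12
    return annual_saas > engineer_month
```

With a hypothetical $180,000 loaded annual cost, one engineer-month is $15,000. A $59/month plan is $708 a year, nowhere near it; a $1,500/month bill is $18,000 a year, which crosses the line.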

Claude Agent SDK and OpenAI Agents SDK integration

As of 2026, both SDKs emit OTel-shaped traces, which means any observability tool that ingests OTel works in principle. In practice:

  • Langfuse has first-class Anthropic and OpenAI support, decorators for tool calls, and an Agent SDK example repo. Smoothest path for Claude agents today.
  • Helicone works via the proxy URL; agent traces show up as a sequence of LLM calls, which is fine for cost but loses the agent-graph view unless you set custom session IDs.
  • Portkey has Agent SDK adapters; the unified-control-plane angle is genuinely useful here for tool-call budgeting.
  • LiteLLM captures call-level traces; agent-level shape comes from your observability layer, not LiteLLM itself.
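For the Helicone bullet above: as I understand their sessions feature, you group an agent's calls by sending `Helicone-Session-Id`, `Helicone-Session-Path`, and `Helicone-Session-Name` headers on each request. A small sketch of building those headers; `session_headers` and `new_session_id` are hypothetical helpers of mine, and you should verify the header names against Helicone's current docs.

```python
import uuid
from typing import Dict, Optional


def new_session_id() -> str:
    """One ID per agent run; reuse it for every LLM call in that run."""
    return str(uuid.uuid4())


def session_headers(session_id: str, path: str,
                    name: Optional[str] = None) -> Dict[str, str]:
    """Build per-request headers that tie one LLM call to an agent session.

    `path` describes where in the agent graph the call happened,
    for example "/research/search".
    """
    if not path.startswith("/"):
        raise ValueError("path should look like '/agent/step'")
    headers = {
        "Helicone-Session-Id": session_id,
        "Helicone-Session-Path": path,
    }
    if name:
        headers["Helicone-Session-Name"] = name
    return headers


# Usage with the Anthropic Python SDK (not executed here):
#   sid = new_session_id()
#   client.messages.create(..., extra_headers=session_headers(sid, "/research/search"))
```

The point is that the session ID stays constant across the run while the path changes per step, which is what lets the UI rebuild a tree out of a flat sequence of calls.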

If you’re building Claude Agent SDK pipelines today, my default: LiteLLM in front for fallback and budget, Langfuse beside it for traces and evals. Portkey is the alternative if you want one vendor. Helicone is the alternative if you want zero setup.

Picking by team shape

  • Solo dev shipping a chatbot — Helicone, free tier, one line of code. Move on with your life.
  • Growth-stage SaaS, 5–20 engineers — LiteLLM proxy plus Langfuse cloud. Cheap, OSS-hedged, real eval pipeline.
  • Regulated enterprise — Portkey, or self-hosted Langfuse plus LiteLLM Enterprise. SSO, audit, and RBAC matter more than UI polish.
  • AI-native startup with custom evals — Braintrust or Langfuse self-hosted, plus Bifrost or LiteLLM. You’ll have a strong opinion on evals, and you want the source.

Migration notes

A few traps people hit:

  • LangSmith → Langfuse: the OTel adapter handles most calls, but tool-call shapes differ. Plan a week to remap dashboards.
  • Raw SDK → Portkey: easy switch, but the virtual-keys model needs you to redo per-team budget logic.
  • Ad-hoc fallback code → LiteLLM: the routing rules YAML isn’t as expressive as some custom Python you’ve written. Run a chaos test on failover modes before flipping.
  • Helicone sync → async: if you start synchronous and later want to drop the proxy dependency, the async client takes more wiring than the docs suggest. Start async if you suspect you’ll need it eventually.

If you take one thing from this: stop searching for “the best LLM observability tool.” Decide first whether you need a gateway, an observability layer, or both. Then pick within each category. The teams that get burned in 2026 aren’t the ones who pick “wrong” — they’re the ones who pick one tool, assume it covers everything, and find out at 11pm that it doesn’t.

Try this: open your top three LLM endpoints and ask, “if Anthropic 529s for ten minutes right now, what happens?” The honesty of that answer tells you which layer to fix first.