I’ve spent the last eight months helping three different teams move off Datadog. Two of them had bills north of $400k a year; the third was a fifteen-person startup paying more for observability than for AWS. Every conversation started the same way: “OpenTelemetry is the standard now, so we just self-host the backend, right?”
Sort of. The problem is that “the backend” stopped being one thing. By 2026 there are four credible OSS options that can actually carry production traffic, and they make wildly different bets about storage, query languages, and how much of your weekend you’d like to spend tuning compaction.
This post is the comparison I wished existed before we picked. Real ops cost, real failure modes, and honest opinions about which one I’d hand to a small SRE team versus a platform org.
Why this is finally a real choice
A few years ago the self-hosted observability conversation was basically “stand up Prometheus and hope your logs stay small.” Now OpenTelemetry has consolidated instrumentation across logs, metrics, and traces, and four projects have grown up enough to be worth comparing seriously:
- SigNoz: single-binary-ish all-in-one on ClickHouse. UI is purpose-built for OTel.
- Grafana LGTM: Loki for logs, Tempo for traces, Mimir for metrics, Grafana for the front end. The de facto default in Prometheus shops, though it is a Grafana Labs stack rather than a CNCF project.
- OpenObserve: single Rust binary, S3-native storage, columnar Parquet. The new kid that punches above its weight.
- Uptrace: OTel-first on ClickHouse + Postgres. Smaller community, very focused product.
All four ingest OTLP natively. All four can show you a flame graph at 2am. After that, they diverge fast.
Architecture in one paragraph each
SigNoz runs ClickHouse as the single store for logs, metrics, and traces, with a Query Service in front and a React UI on top. You also need ZooKeeper (or ClickHouse Keeper) for cluster coordination once you scale past one node. Deployment is Helm or Docker Compose. The whole thing feels like one product because it is one product.
Grafana LGTM is four services that pretend to be one. Loki stores compressed log chunks in object storage behind a small label-only index, Tempo writes trace blocks to object storage in much the same way, Mimir is essentially Cortex with a friendlier face for Prometheus-style metrics, and Grafana glues them together. You deploy each as its own stateful service, each with its own scaling story. S3 (or GCS, or Azure Blob) backs all three of the data services. The decoupling is a feature when you have a platform team and a bug when you don’t.
OpenObserve is a single Rust binary that writes Parquet files to S3 and stores metadata in either local RocksDB or a managed Postgres. There’s no Kafka, no ZooKeeper, no separate query layer. You point your OTLP exporter at it and it works. Compared to the other three, the deployment story is almost suspiciously simple.
Uptrace uses ClickHouse for data and Postgres for metadata, with a Go backend and Vue frontend. It looks structurally similar to SigNoz, but the project is leaner and the UI is more SQL-forward — it expects you to be comfortable writing ClickHouse SQL when the prebuilt views run out.
Storage cost: where the math actually lives
This is the part vendor blogs always hand-wave through. Object storage is ~$23/TB-month on S3 standard. EBS gp3 is closer to $80/TB-month before IOPS. That gap drives most of the architectural decisions.
LGTM and OpenObserve both push the cold path to S3, which is where their cost story comes from. Loki indexes are tiny because the design avoids full-text indexing — labels do most of the work, and grep happens at query time. Tempo doesn’t index spans for search at all in the cheapest tier; you query by trace ID or via metrics-generated exemplars. Mimir compacts metrics into long-term object storage blocks. Your TB of logs sits on S3 for ~$23 plus a thin compute layer.
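To make the “labels do most of the work, and grep happens at query time” idea concrete, here is a toy model in Python. It is a sketch of the general design only, not Loki’s actual implementation: the class name `TinyLabelStore` and its methods are invented for illustration, and everything lives in memory.

```python
from collections import defaultdict


class TinyLabelStore:
    """Toy label-indexed log store: index the labels, never the log text."""

    def __init__(self) -> None:
        # stream id (frozen label set) -> raw log lines, in arrival order
        self.streams: dict[frozenset, list[str]] = {}
        # (label, value) -> ids of the streams carrying that label pair
        self.index: dict[tuple[str, str], set[frozenset]] = defaultdict(set)

    def push(self, labels: dict[str, str], line: str) -> None:
        stream_id = frozenset(labels.items())
        if stream_id not in self.streams:
            self.streams[stream_id] = []
            for pair in stream_id:
                self.index[pair].add(stream_id)
        self.streams[stream_id].append(line)

    def query(self, selector: dict[str, str], contains: str) -> list[str]:
        # Step 1: the small label index narrows the search to a few streams.
        candidates = set(self.streams)
        for pair in selector.items():
            candidates &= self.index.get(pair, set())
        # Step 2: brute-force substring scan over only those streams.
        return [
            line
            for stream_id in candidates
            for line in self.streams[stream_id]
            if contains in line
        ]


if __name__ == "__main__":
    store = TinyLabelStore()
    store.push({"app": "api", "env": "prod"}, "GET /health 200")
    store.push({"app": "api", "env": "prod"}, "POST /pay 500 timeout")
    store.push({"app": "worker", "env": "prod"}, "job 42 timeout")
    print(store.query({"app": "api"}, "timeout"))
```

The index grows with the number of distinct label sets, not with log volume, which is why it stays small. The price is that the scan in step 2 gets slower the less the labels narrow things down.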
SigNoz and Uptrace put data on ClickHouse, which traditionally meant EBS. Both now support tiered storage with S3 as the cold tier, but the hot tier still lives on local SSD or EBS, and ClickHouse’s compression is good enough that the bill is often closer than you’d guess. For raw ingest the difference between “ClickHouse on EBS for 30 days hot” and “Loki on S3 for 30 days warm” can be smaller than the LGTM camp claims — sometimes 2x, not 10x. Check the numbers against your own retention before you architect around them.
The honest takeaway: at 100GB/day you won’t notice the difference. At 10TB/day, LGTM’s S3-native split and OpenObserve’s Parquet-on-S3 design start mattering a lot, and the math gets brutal for ClickHouse-on-EBS unless you’re aggressive about tiering.
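If you want to run that arithmetic against your own numbers, a back-of-the-envelope calculator is enough. The sketch below uses the two list prices quoted above; the compression ratio, replica count, and tier split are placeholders I made up, so substitute measured values. It deliberately ignores compute, request charges, and provisioned-but-unused disk.

```python
from dataclasses import dataclass

# Approximate list prices from the text, in USD per TB-month.
S3_STANDARD_PER_TB = 23.0
EBS_GP3_PER_TB = 80.0


@dataclass
class Tier:
    name: str
    days: int             # how long data stays in this tier
    price_per_tb: float   # USD per TB-month
    replicas: int = 1     # full copies you pay for (assumption: 1 for S3)


def monthly_storage_cost(ingest_gb_per_day: float,
                         compression_ratio: float,
                         tiers: list[Tier]) -> float:
    """Steady-state monthly storage bill across all tiers."""
    stored_tb_per_day = ingest_gb_per_day / compression_ratio / 1000.0
    return sum(
        stored_tb_per_day * t.days * t.price_per_tb * t.replicas
        for t in tiers
    )


if __name__ == "__main__":
    # Hypothetical layouts: all-EBS, all-S3, and 7 days hot then S3.
    layouts = {
        "ebs_only": [Tier("ebs", 30, EBS_GP3_PER_TB, replicas=2)],
        "s3_only": [Tier("s3", 30, S3_STANDARD_PER_TB)],
        "tiered": [Tier("ebs", 7, EBS_GP3_PER_TB, replicas=2),
                   Tier("s3", 23, S3_STANDARD_PER_TB)],
    }
    for ingest in (100, 10_000):  # GB/day
        for name, tiers in layouts.items():
            cost = monthly_storage_cost(ingest, compression_ratio=8, tiers=tiers)
            print(f"{ingest:>6} GB/day  {name:<9} ${cost:,.2f}/month")
```

The useful part is less the absolute figure than how quickly the gap moves when you change the hot-tier length or the replica count, which is exactly the tiering lever mentioned above.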
Ops overhead, ranked
This is the section that should matter more than it usually does in these comparisons. Software is cheap; the SRE who has to keep it running is not.
Easiest to run: OpenObserve. One binary, one S3 bucket, one Postgres. I’ve personally stood it up in an afternoon and it just kept working. The community is smaller, which means fewer answers when something weird happens, but the surface area is small enough that you can read the code when you have to.
Easy enough for a single SRE: SigNoz. ClickHouse has rough edges — schema migrations, mutation lag, the occasional “why is this query OOM-ing” — but the SigNoz team has done most of the hard tuning for you. Helm chart works. The biggest gotcha is that ClickHouse Keeper failures are non-obvious and the recovery story is annoying.
Easy enough for a single SRE if you’re patient: Uptrace. Same ClickHouse story as SigNoz, but smaller team, fewer prebuilt dashboards, and you’ll spend more time writing your own SQL. If you’re comfortable with that trade, it’s lighter than SigNoz.
Needs a platform team: Grafana LGTM. Loki at scale has a learning curve: the index format and schema change between versions (v11 vs v13 schema, TSDB vs the older boltdb-shipper index), label cardinality footguns are real, and ingester memory tuning is its own skill. Mimir is fine if you already speak Prometheus deeply. Tempo is the calmest of the three. Run all four and you have four services to operate, four upgrade cycles, four sets of metrics about your observability stack. This is fine if you have three or more engineers on the platform team. It is not fine if observability is one person’s 30% rotation.
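The cardinality footgun is easy to see with arithmetic. In Loki every unique combination of label values is its own stream, so the worst-case stream count is the product of the distinct values per label. The label names and counts below are invented for illustration.

```python
from math import prod


def max_stream_count(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on streams: product of distinct values across labels.

    Real deployments see fewer, because not every combination occurs,
    but the bound shows how fast one unbounded label blows things up.
    """
    return prod(distinct_values_per_label.values())


if __name__ == "__main__":
    sane = {"cluster": 3, "namespace": 20, "app": 50, "level": 4}
    # Same labels plus one high-cardinality label (hypothetical user_id).
    footgun = {**sane, "user_id": 100_000}
    print("sane labels:   ", max_stream_count(sane))
    print("with user_id:  ", max_stream_count(footgun))
```

One unbounded label multiplies everything else, which is why values like user IDs or request IDs belong in the log line rather than in the label set.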
I have watched a two-person SRE team try to run LGTM at 2TB/day. They were spending more time keeping the stack alive than using it. They migrated to SigNoz in a quarter and got their weekends back. Anecdote, not data, but I’ve heard variants of this story enough times to treat it as a pattern.
Query experience: what investigation actually feels like
Different stacks reward different muscle memory. This shapes adoption more than the docs admit.
LGTM is the most familiar if your team already lives in Grafana. LogQL feels like PromQL with |= filters tacked on, which is either elegant or annoying depending on your background. The correlation panel is the killer feature — click a span, see its logs, see the metrics it generated, all in one view. Nothing else does this as cleanly out of the box.
SigNoz built a query UI that doesn’t make you learn a new language for the basic stuff. You can drag-build a query for traces or metrics, and there’s a SQL escape hatch when you need it. It’s friendlier for engineers who don’t want to memorize PromQL syntax. The trade-off is that very advanced queries push you back to raw ClickHouse SQL anyway.
OpenObserve is SQL-first, full stop. You’re writing SELECT ... FROM logs WHERE ... for most non-trivial things. If your team likes SQL this is great. If they don’t, the friction shows up immediately. The histogram and stream views are clean, but the dashboards aren’t as polished as Grafana’s.
Uptrace is roughly similar — ClickHouse SQL is the floor, and the UI is more focused on the OTel signal types than on dashboard composition. It’s the right pick for an engineering-heavy team that thinks of observability as “querying my service” rather than “looking at dashboards.”
Migration from Datadog without breaking anything
The pattern that works: dual-ship for at least a month. The OTel Collector lets you fan out the same telemetry to both your old vendor and your new self-hosted backend simultaneously. Set up the new backend, point a Collector pipeline at both sinks, then move dashboards and alerts incrementally.
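As a rough sketch of what that fan-out looks like, the Python below builds a Collector configuration with one OTLP receiver and two exporters on every pipeline. Assumptions to check against your setup: it uses the `datadog` exporter from the Collector contrib distribution and the `otlphttp` exporter for the new backend, the endpoint URL and the `DD_API_KEY` variable name are placeholders, and the output is JSON, which a YAML parser should accept but which you may prefer to convert to YAML by hand.

```python
import json

SIGNALS = ("traces", "metrics", "logs")


def dual_ship_config(new_backend_endpoint: str) -> dict:
    """Collector config that sends each signal to both old and new sinks."""
    return {
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "processors": {"batch": {}},
        "exporters": {
            # Existing vendor; the key is read from the environment.
            "datadog": {"api": {"key": "${env:DD_API_KEY}"}},
            # New self-hosted backend (placeholder endpoint).
            "otlphttp/selfhosted": {"endpoint": new_backend_endpoint},
        },
        "service": {
            "pipelines": {
                signal: {
                    "receivers": ["otlp"],
                    "processors": ["batch"],
                    "exporters": ["datadog", "otlphttp/selfhosted"],
                }
                for signal in SIGNALS
            }
        },
    }


if __name__ == "__main__":
    config = dual_ship_config("https://otel.example.internal:4318")
    print(json.dumps(config, indent=2))
```

Once the new backend has earned your trust, cutting over is a matter of dropping `datadog` from each pipeline’s exporter list, one signal at a time if you want to be careful.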
The signals OpenTelemetry covers cleanly: traces, metrics, structured logs. The signals you’ll still pay a vendor for in some form: RUM (browser real-user monitoring), session replay, synthetic monitoring, and continuous profiling. The OTel profiling SIG has shipped a spec and the collector supports profile signals now, but the self-hosted UIs are still catching up. If profiling is critical, expect to bolt on Grafana Pyroscope (which integrates most cleanly with LGTM) or Parca.
Don’t underestimate the dashboard migration. A team I worked with had 600 Datadog dashboards. About 80 of them were genuinely used. Three were load-bearing. Identify the load-bearing ones first, rebuild them, and let the rest die quietly. Trying to migrate everything is how these projects get killed by exhaustion.
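One way to run that triage is to sort dashboards into three buckets from whatever usage data you can get. The sketch below assumes you have, per dashboard, a view count and a flag for whether an on-call runbook or alert links to it. Both fields, the `Dashboard` type, and the 10-view threshold are hypothetical, not something a vendor export is guaranteed to give you.

```python
from dataclasses import dataclass


@dataclass
class Dashboard:
    title: str
    views_last_90d: int        # hypothetical usage metric
    linked_from_runbook: bool  # referenced by an alert or on-call runbook


def triage(dashboards: list[Dashboard], min_views: int = 10) -> dict[str, list[str]]:
    """Bucket dashboards: rebuild first, rebuild later, or let die."""
    buckets: dict[str, list[str]] = {"load_bearing": [], "used": [], "let_die": []}
    for d in dashboards:
        if d.linked_from_runbook:
            buckets["load_bearing"].append(d.title)
        elif d.views_last_90d >= min_views:
            buckets["used"].append(d.title)
        else:
            buckets["let_die"].append(d.title)
    return buckets


if __name__ == "__main__":
    sample = [
        Dashboard("checkout-latency", 412, True),
        Dashboard("queue-depth", 57, False),
        Dashboard("q3-experiment", 0, False),
    ]
    for bucket, titles in triage(sample).items():
        print(bucket, titles)
```

Rebuild the `load_bearing` bucket before cutover, the `used` bucket on request, and nothing from `let_die` unless someone complains.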
A decision matrix that actually decides
I’m going to commit to opinions here, because the “it depends” tables in most comparison posts are useless.
- Two-to-five-engineer startup, mostly Node/Go services, hates ops work: OpenObserve. The single-binary deploy and S3-only storage mean there is almost nothing to break. You’ll outgrow the dashboard polish before you outgrow the architecture.
- Mid-size SaaS, 50–200 engineers, one SRE owns observability: SigNoz. It’s the best balance of “out of the box” and “actually scales.” Migrate from Datadog onto it, run it for a year, decide later whether you need to graduate to LGTM. Most teams won’t.
- Platform team of three or more, big Prometheus/Grafana culture already: Grafana LGTM. You have the bandwidth and the muscle memory. The correlation panel pays for itself once you’ve trained the on-call rotation. Bonus: Grafana Cloud is a clean exit if you ever want managed.
- Regulated industry, on-prem or air-gapped, SQL-heavy engineering team: Uptrace or SigNoz, with the choice coming down to whether your team prefers prebuilt UI (SigNoz) or raw SQL flexibility (Uptrace).
- You’re paying Datadog over $1M/year and have a real platform team: Honestly, go look at Grafana Cloud or Chronosphere first. The self-hosted total cost at that scale starts to lose to managed alternatives once you price the engineering time properly. Self-hosting wins at the middle of the curve, not the very top.
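For anyone who prefers rules to prose, the matrix can be written as a small function. This is my reading of the five cases, not a formal algorithm: the field names, the order in which rules are checked, and the fallback to SigNoz for teams that match none of the cases are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Team:
    engineers: int
    observability_owners: int      # people with real time for the stack
    datadog_spend_usd: int = 0     # current annual bill
    grafana_culture: bool = False  # already deep in Prometheus/Grafana
    air_gapped: bool = False       # regulated, on-prem, or air-gapped
    prefers_raw_sql: bool = False


def recommend(team: Team) -> str:
    """Return the stack suggested by the decision matrix above."""
    has_platform_team = team.observability_owners >= 3
    if team.datadog_spend_usd > 1_000_000 and has_platform_team:
        return "Evaluate Grafana Cloud or Chronosphere first"
    if team.air_gapped:
        return "Uptrace" if team.prefers_raw_sql else "SigNoz"
    if has_platform_team and team.grafana_culture:
        return "Grafana LGTM"
    if team.engineers <= 5:
        return "OpenObserve"
    return "SigNoz"  # assumed default for everything in between


if __name__ == "__main__":
    print(recommend(Team(engineers=4, observability_owners=0)))
    print(recommend(Team(engineers=120, observability_owners=1)))
    print(recommend(Team(engineers=300, observability_owners=4, grafana_culture=True)))
```

The ordering carries most of the opinion: spend and compliance constraints override team size, and LGTM is only reachable if you have both the people and the existing Grafana habits.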
The honest limitations
A few things every self-hosted post should say but rarely does.
You will spend engineering time on this. Pretending otherwise is how teams end up two years in with a haunted stack nobody understands. Budget at least 20% of one engineer’s time, ongoing.
The single-pane-of-glass story is real with LGTM and reasonable with SigNoz, but it’s never as integrated as a vendor’s product. Cross-signal correlation costs you something everywhere except Grafana, and even there it requires careful instrumentation.
OTel itself is still evolving. The profile signal is new, the events spec is in flux, and SDK behavior varies between languages more than you’d like. Pinning collector versions and watching the upstream release notes is part of the job now.
And if your real problem is that nobody’s owning observability — that you have 600 dashboards and nobody knows which three matter — switching backends won’t fix that. Pick the stack you can staff, then spend the savings on actually using it.
If you’re early in the evaluation, my honest suggestion: spin up OpenObserve in a sandbox this week. Point one service at it. See how it feels. The deploy is cheap enough that you’ll learn more in two hours than you will from any comparison post, including this one.