Skip to main content
Logo
Overview

Replicate vs Modal vs Together vs Fireworks vs Baseten 2026

May 16, 2026
14 min read

The GPU floor collapsed this year. H100 on-demand bottomed out around $1.49/hr at the cheaper specialty providers — roughly 80% under what AWS was charging in 2023. Modal is selling H100s at $3.95/hr per-second with no minimum, Together AI is serving Llama 3.3 70B at $0.88 per million tokens, and B200 capacity is showing up in the same catalogs at prices people would have laughed at twelve months ago.

The practical effect: if you’ve been paying OpenAI per-token for anything other than the flagship reasoning models, you’re probably overpaying now. The question stopped being can I afford to host an open model in production and started being which platform should I park it on. Replicate, Modal, Together AI, Fireworks, and Baseten are the five most-searched names when engineering teams actually sit down to pick one. None of them are the same shape underneath, and the vendor blogs ranking on these comparison keywords are useless because nobody is going to tell you their product is the wrong choice for your workload.

So here’s a five-way that tries to be honest about it, including when each is wrong.

What an “AI inference platform” actually is

Three things get muddled together in this category, and the muddle is why most comparison content reads like nonsense.

A GPU rental shop (RunPod, Lambda Labs, Vast.ai, Spheron) hands you a box with a GPU in it. You ssh in, you run whatever, you pay by the hour. No serving stack, no autoscaling, no model catalog. The cheapest H100s live here.

An LLM gateway (OpenRouter, Portkey, LiteLLM) sits in front of model APIs and routes requests across them. It doesn’t run the model. We covered the gateway/observability layer in the May post on Langfuse, Helicone, Portkey, and LiteLLM.

A model API (OpenAI, Anthropic) charges per token to call a model you don’t host and can’t fine-tune outside the vendor’s offering.

An inference platform is the middle layer. It runs the model for you, exposes an HTTP endpoint, handles autoscaling and cold starts, and bills per-second or per-token instead of per-instance. The five platforms in this post all sit here, but they sit at slightly different points along the spectrum from “managed catalog” to “bring your own container.”

That spectrum is the only mental model that makes the comparison tractable. Together AI and Fireworks are almost fully managed — you call a model name, they handle everything. Replicate is mostly managed but adds a deploy-your-own-Cog story. Baseten is hybrid — managed runtime, your model artifact. Modal is the most BYO — you write Python, they give you a GPU on demand, you build your own serving stack.

Replicate: the demos-and-niche-models platform

Replicate is what most people’s first inference platform was, and that legacy shapes both its strengths and the places it gets passed over now.

The catalog is enormous. If somebody open-sourced an image, video, audio, or weird-multimodal model in the last three years, there is a Cog wrapper for it on Replicate, often within 48 hours of release. For prototyping or non-LLM workloads — Stable Diffusion variants, SDXL fine-tunes, AnimateDiff, MusicGen, Whisper — Replicate is still the fastest way from “I read the paper” to “I have an HTTP endpoint.”

Per-second billing on A100 80GB sits at $0.001400/sec (about $5/hr), with smaller GPUs cheaper. That’s not competitive for sustained high-QPS LLM workloads, but it’s fine when usage is bursty and you don’t want to think about it.

Where Replicate hurts: the LLM fine-tune-and-deploy story is weaker than Together or Fireworks, the SLA is not loudly published, and cold starts are the worst-kept secret in the platform. Replicate’s own docs note 10-30 second cold starts on public models, and large diffusion or 70B-class LLMs can spend 30-120 seconds loading before they answer the first request. That time is billed at the same per-second rate as inference. For a chat product where p99 matters more than median, that’s disqualifying.

The pricing model is also shifting — Replicate is moving to charge private models for startup and idle at half the per-second rate while halving the active rate. Net-net it’s getting cheaper, but it makes the cost model less predictable than competitors who quote flat per-token prices.

Best fit: prototypes, image/video/audio models, anything where you want to try fifty open-source models in a week. Not the right tool for 24/7 LLM chat.

Modal is the one that doesn’t quite fit the “inference platform” label, and that’s the point.

You write Python functions, decorate them, and Modal runs them on whatever GPU you ask for. There is no model catalog. There is no “call this endpoint” experience out of the box. You build the serving loop yourself — usually with vLLM or TensorRT-LLM or SGLang — and Modal gives you per-second autoscaling, fan-out, fast container starts, and a deeply integrated dev loop with modal run and modal serve.

The pricing is honest and per-second: B200 at $0.001736/sec ($6.25/hr), H100 at $0.001097/sec ($3.95/hr), with smaller GPUs and CPUs proportionally cheaper. There are no markups for the actual GPU price — what you see is what you pay during execution. Idle containers and cold starts aren’t billed once the function returns. (Modal pricing has the full table.)

The catch: those base prices are for preemptible workloads. Production non-preemptible US workloads carry roughly a 3.75x combined multiplier when you stack regional and reliability tiers, and that detail is buried in the docs. So Modal H100s for sustained production work are closer to ~$3.95/hr effective, which is more expensive than dedicated GPU rental at RunPod or Spheron. Modal isn’t selling cheap hardware — it’s selling the per-second elasticity, the fan-out, and the developer experience.

The thing Modal does that nobody else does as well: bursty, parallel, mixed-workload inference. If your job is “process 50,000 PDFs through Llama 3.3 70B in the next hour and then go to sleep,” Modal will spin up hundreds of containers, run them, and tear them down without you operating a single Kubernetes node. The same code runs locally and in production. Custom CUDA, weird audio models, mixed CPU/GPU pipelines — all trivial.

Where it hurts: if you want managed inference, this is not the product. You will write a serving loop. You will think about how to keep containers warm. There is no published SLA on the lower tiers. And the regional/reliability multipliers can sneak up on you if you don’t read the pricing page carefully.

Best fit: teams with infra muscle who want a real compute platform, not a hosted-model API. Especially good for batch, fan-out, fine-tune jobs, and custom serving stacks.

Together AI: the hosted open-weights workhorse

Together is the boring, correct choice for most teams hosting an open LLM in production. That’s not a backhanded compliment — boring and correct is what you want when the alternative is operating vLLM yourself at 2 a.m.

The catalog covers 200+ open-weights models with a unified OpenAI-compatible API. Llama 3.3 70B is $0.88 per million input and $0.88 per million output tokens, with similar pricing on Mixtral, Qwen, and DeepSeek variants. (Together AI pricing has the current table.) That price is roughly 90% under GPT-4o-class pricing for comparable quality on instruction-following tasks, which is the whole reason this category exists.

Together publishes a 99.9% uptime SLA, dedicated endpoint pricing for predictable QPS, sub-100ms TTFT on most models, and the engineering is genuinely good — they’ve contributed real research on speculative decoding and FlashAttention. Fine-tuning is supported and the deploy-your-fine-tune workflow is straightforward.

The honest weaknesses: it’s hosted-only, so if you need on-prem or VPC-resident inference for compliance reasons, this is not the product. The catalog skews heavily to LLMs and embeddings — image, video, and audio coverage is thinner than Replicate’s. And pricing is per-token, which is great for chat workloads but punishing for pipelines that push tens of thousands of tokens through prompts repeatedly.

Best fit: production LLM chat or RAG where you’ve outgrown OpenAI’s per-token bill but don’t want to operate inference yourself.

Fireworks AI: the speed specialist

Fireworks is what you reach for when latency is the metric.

The pitch is FireAttention — their custom attention kernel, now in v3 — which the company benchmarks at roughly 4x throughput over open-source vLLM and 1.4-1.8x over NIM containers, depending on model and context length. (FireAttention v3 announcement covers the AMD MI300 numbers; the broader engine claim is documented in their earlier posts.) For long-context inference specifically — 32K+ prompts in production chat or coding workloads — Fireworks is consistently faster than Together at comparable price-per-million-token.

Llama 3.3 70B sits around $0.90 per million tokens, within a few cents of Together. Same 99.9% SLA. Same OpenAI-compatible API. The difference is throughput-per-dollar at high QPS and long contexts.

What you’re trading: catalog breadth is narrower than Together’s, the non-LLM story (image, audio, video) is much thinner, and fine-tune-then-deploy-elsewhere isn’t as smooth — Fireworks fine-tunes are best served on Fireworks. If you care primarily about variety, that’s a real cost.

Best fit: production chat, coding assistants, and agentic workloads where p99 latency on long contexts is what your users are actually complaining about.

Baseten: the production MLOps platform

Baseten is the one most non-MLOps engineers haven’t heard of, and the one MLOps engineers tend to land on after they’ve tried the others.

The packaging unit is Truss — an open-source format for bundling a model artifact, its dependencies, and its serving code. You write a Truss, push it, and Baseten gives you a dedicated deployment with autoscaling, observability, multi-region support, A/B routing, and detailed metrics. (Baseten pricing is per-minute on dedicated GPUs, with B200 chips down through T4s in the tier list.)

The big philosophical difference: Baseten is dedicated-deployment-first, not shared-serverless-first. You pay for the GPU minutes your deployment is up — including idle — and in exchange you get predictable latency, no noisy neighbors, and behavior that looks like a real production service. There’s a Pro plan with capacity reservation for high-volume workloads and an Enterprise plan with the usual VPC/SOC2/data-residency story.

Where it gets uncomfortable: there is no shared serverless pool. If your workload is genuinely bursty and goes idle for 16 hours a day, you’ll either pay for idle capacity or accept the cold-start hit when scaling from zero. Baseten autoscaling is real, but the floor is “GPU is up.” For hobbyist or low-traffic projects, this is overkill and expensive.

Best fit: a real ML team with real production traffic that wants Datadog-quality observability and Kubernetes-quality reliability without operating Kubernetes.

The cost-per-token reality check

Per-token pricing for the popular open models in May 2026, rounded to make the comparison legible:

PlatformLlama 3.3 70B in/out (per 1M tokens)Mixtral 8x22BPricing model
Together AI$0.88 / $0.88~$1.20 / $1.20Per-token
Fireworks AI~$0.90 / $0.90~$1.20 / $1.20Per-token
Replicatevaries by deploymentvariesPer-second
Modaldepends on your serving stackdependsPer-second GPU
Basetenper-GPU-minuteper-GPU-minutePer-minute

A few things to notice. First, Together and Fireworks are priced within rounding error of each other on the headline LLMs. Don’t pick between them on price alone — pick on which one is faster for your prompt shape (Fireworks for long context, Together for everything else) or whose catalog has the model you actually want to serve.

Second, the per-second platforms (Replicate, Modal, Baseten) can be cheaper or dramatically more expensive than the per-token platforms depending on your utilization. A single H100 serving Llama 3.3 70B with reasonable batching can push ~10,000 tokens/sec, which at $3.95/hr works out to roughly $0.11 per million tokens of compute. That sounds amazing until you remember (a) you have to actually achieve that throughput, which requires real engineering, (b) you pay for the GPU when it’s idle, and (c) cold starts are billed too.

The break-even is genuinely at high utilization. If your deployment is at 60-80% load most of the time, dedicated GPU on Modal or Baseten will be 3-5x cheaper than Together’s per-token price. At 20% load, it will be the same or more expensive. At 5% load — which describes most prototype-stage products — per-token wins by a mile.

Third, Deepinfra and similar discount providers are quietly serving Llama 3.3 70B at around $0.23 input / $0.40 output per million tokens, which is roughly 3x cheaper than Together. I haven’t put them on the main list because their reliability story isn’t on par with Together’s published SLA, but if your workload is non-critical it’s worth knowing they exist.

Cold start is the most-lied-about metric

Every inference platform’s marketing claims sub-second cold starts. Almost none of them deliver this in practice on real-world models.

Replicate openly publishes 10-30 second cold starts on public models, 30-120 seconds on large diffusion or 70B LLMs. That’s honest and unusual.

Modal cold starts can be sub-second on small CPU functions and 5-15 seconds for GPU-attached containers loading a typical LLM artifact, mostly limited by how fast the image and weights can be pulled. Modal’s lookup and warm-pool patterns can get this under 2 seconds if you commit engineering effort to it.

Together and Fireworks effectively don’t have cold starts visible to you — they keep a warm pool of every model in their catalog. That’s a real advantage, paid for by your per-token price.

Baseten cold starts depend on whether you scale-to-zero or keep a min-instance floor. Scale-to-zero is honestly closer to 30-90 seconds on a 70B model. Min-instances=1 is a few hundred ms to first token. Pick your tradeoff.

If your product is user-facing chat and you can’t tolerate first-byte delays above 500ms on a cold path, the only platforms that get you there without engineering work are Together and Fireworks.

Routing patterns that beat single-vendor

The teams I’ve seen run inference well in 2026 don’t pick one platform. They route by workload shape.

  • Streaming chat with short prompts → Fireworks (best TTFT at the price)
  • Streaming chat with long context (RAG, coding, agentic) → Fireworks again, FireAttention pays off here
  • High-volume batch processing → Together’s batch API or Modal with vLLM
  • Niche / non-LLM models → Replicate (catalog wins)
  • Custom pipelines, fine-tune jobs, fan-out work → Modal
  • Dedicated deployments with strict observability requirements → Baseten
  • Compliance-bound workloads in your own VPC → Baseten Enterprise or Modal with private cluster

The engineering effort to maintain three SDK clients is real but usually pays for itself within a quarter. An LLM gateway (covered in the earlier post linked above) makes this much less painful — you write code against one OpenAI-compatible client and route at the gateway layer.

Decision matrix, condensed

  • You’re prototyping or your traffic is spiky and unpredictable: Replicate or Together. Don’t over-engineer.
  • You have steady production LLM traffic and want one bill: Together AI. The boring answer is usually right.
  • You care about latency on long contexts more than catalog breadth: Fireworks AI.
  • You have an MLOps team and want dedicated production deployments with real observability: Baseten.
  • You want a real compute platform and your team can write Python serving code: Modal.
  • You need to fine-tune cheaply and serve the result: Together for hosted LLMs, Modal if it’s a non-LLM model or you want full control.

The thing I’d push back on, gently: don’t pick the platform with the cheapest per-token price on a spreadsheet. Pick the one whose failure modes you’re willing to operate around. Cold starts, fine-tune workflows, observability gaps, and pricing-multiplier surprises hurt more in production than a 10% price delta on a model you’re not even sure is the right model yet.

If you’ve got a free afternoon, write the same toy RAG pipeline against Together, Fireworks, and Modal in parallel and watch which one you actually enjoy debugging. That answer matters more than any benchmark in this post.

Sources: