Fireworks AI vs Together AI vs Groq
TL;DR
Groq, Fireworks AI, and Together AI all run open-source models faster and cheaper than OpenAI — but they optimize for different things. Groq (LPU chips) is the fastest raw inference provider, with Llama 3.3 70B at 500+ tokens/second. Fireworks is the most developer-friendly with production features (structured output, fine-tuning, function calling). Together AI has the widest model selection and the best fine-tuning pipeline. If you're choosing between them: Groq for latency-critical apps, Fireworks for production SaaS, Together for research and fine-tuning.
Key Takeaways
- Groq: 500+ tokens/sec on Llama 70B — fastest available, LPU hardware advantage
- Fireworks: best structured output + function calling for open models, production-grade
- Together AI: 200+ models, best fine-tuning workflow, multimodal support
- Cost: all three are 5-20x cheaper than GPT-4o for comparable quality open models
- OpenAI-compatible: all three use the same /chat/completions API format
- When to use: Groq for chatbots, Fireworks for API products, Together for model research
The Case for Third-Party Inference
Why not just use OpenAI?
- OpenAI GPT-4o: $5/M input, $15/M output; ~80 tokens/sec; closed-source models only
- Llama 3.3 70B on Groq: $0.59/M input, $0.79/M output; 500+ tokens/sec; open weights, no vendor lock-in
For many tasks (code, Q&A, summarization), Llama 70B quality is close to GPT-4o, at roughly 8-19x lower per-token cost and about 6x higher throughput.
Groq: The Speed King
Best for: real-time applications, latency-sensitive chatbots, anything needing sub-second response
// Groq is OpenAI API-compatible — drop-in replacement:
import Groq from 'groq-sdk';
const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });
const completion = await groq.chat.completions.create({
messages: [{ role: 'user', content: 'Explain quantum computing in 3 sentences.' }],
model: 'llama-3.3-70b-versatile',
temperature: 0.7,
max_tokens: 500,
stream: true, // Streaming works the same as OpenAI
});
for await (const chunk of completion) {
process.stdout.write(chunk.choices[0]?.delta?.content ?? '');
}
// Or with the raw OpenAI SDK (just change baseURL):
import OpenAI from 'openai';
const groqClient = new OpenAI({
apiKey: process.env.GROQ_API_KEY,
baseURL: 'https://api.groq.com/openai/v1',
});
// Exact same API as OpenAI from here
Groq Models (2026)
| Model | Tokens/sec | Context | Input $/M | Output $/M |
|---|---|---|---|---|
| llama-3.3-70b-versatile | 500+ | 128K | $0.59 | $0.79 |
| llama-3.1-8b-instant | 750+ | 128K | $0.05 | $0.08 |
| mixtral-8x7b-32768 | 500+ | 32K | $0.24 | $0.24 |
| gemma2-9b-it | 600+ | 8K | $0.20 | $0.20 |
| llama-3.2-90b-vision | 300+ | 128K | $0.90 | $0.90 |
How Groq achieves this: LPU (Language Processing Unit) — custom silicon designed specifically for sequential token generation. Unlike GPUs (optimized for parallelism), LPUs excel at the autoregressive nature of LLM decoding.
Groq Limitations
❌ No fine-tuning (fixed public models only)
❌ Limited model selection vs Together/Fireworks
❌ No persistent storage or embeddings
❌ Rate limits more aggressive than competitors
✅ Best latency
✅ OpenAI-compatible API
✅ Predictable performance
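Since aggressive rate limits are Groq's main operational caveat, production code should retry 429 responses with backoff rather than failing outright. Here is a minimal sketch; the `status` field matches what the OpenAI SDK attaches to API errors, but the retry counts and delays are illustrative assumptions, not Groq-documented values.

```typescript
// Retry a request with exponential backoff when the provider returns HTTP 429.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Only retry rate-limit errors, and only up to maxRetries times.
      if (err?.status !== 429 || attempt >= maxRetries) throw err;
      const delayMs = baseDelayMs * 2 ** attempt; // 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

Usage: wrap any call, e.g. `const reply = await withBackoff(() => groq.chat.completions.create({ ... }))`.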
Fireworks AI: Production-Grade Open Models
Best for: production API products, structured output requirements, function calling with open models
import OpenAI from 'openai';
const fireworks = new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: 'https://api.fireworks.ai/inference/v1',
});
// Structured output (Fireworks FireFunction-v2 is best for this):
const response = await fireworks.chat.completions.create({
model: 'accounts/fireworks/models/firefunction-v2',
messages: [
{ role: 'user', content: 'Extract the name and email from: "John Smith, john@example.com"' },
],
tools: [
{
type: 'function',
function: {
name: 'extract_contact',
description: 'Extract contact information',
parameters: {
type: 'object',
properties: {
name: { type: 'string' },
email: { type: 'string', format: 'email' },
},
required: ['name', 'email'],
},
},
},
],
tool_choice: { type: 'function', function: { name: 'extract_contact' } },
});
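Because tool_choice forces the extract_contact call, the structured data comes back as a JSON string in the tool call's arguments rather than in message.content. A small helper (hypothetical, not part of any SDK) makes reading it explicit; the response shape is the standard OpenAI-compatible one.

```typescript
// Pull typed arguments out of a forced tool call in an
// OpenAI-compatible chat completion response.
type ToolCallResponse = {
  choices: { message: { tool_calls?: { function: { arguments: string } }[] } }[];
};

function extractContact(response: ToolCallResponse): { name: string; email: string } | null {
  const call = response.choices[0]?.message.tool_calls?.[0];
  if (!call) return null; // model produced no tool call
  // arguments is a JSON string conforming to the tool's parameter schema
  return JSON.parse(call.function.arguments);
}
```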
Fireworks Key Differentiators
1. FireFunction-v2 — open-source model fine-tuned specifically for function calling:
// FireFunction-v2 matches GPT-4 on function calling benchmarks
// At $0.90/M tokens vs GPT-4o's $15/M output
model: 'accounts/fireworks/models/firefunction-v2'
2. Structured Output with any model:
// Fireworks supports structured JSON output on most models
// via response_format or grammar-based sampling:
const reviewText = 'Great phone, battery lasts two days.'; // example input
const response = await fireworks.chat.completions.create({
model: 'accounts/fireworks/models/llama-v3p3-70b-instruct',
response_format: {
type: 'json_schema',
json_schema: {
name: 'product_review',
schema: {
type: 'object',
properties: {
sentiment: { type: 'string', enum: ['positive', 'negative', 'neutral'] },
score: { type: 'number', minimum: 1, maximum: 5 },
summary: { type: 'string' },
},
required: ['sentiment', 'score', 'summary'],
},
},
},
messages: [{ role: 'user', content: `Review: "${reviewText}"` }],
});
3. Fine-tuning pipeline:
// Upload training data (Node's built-in fetch needs a Blob, not a stream):
import fs from 'node:fs';
const formData = new FormData();
formData.append('file', new Blob([fs.readFileSync('training.jsonl')]), 'training.jsonl');
await fetch('https://api.fireworks.ai/v1/files', {
method: 'POST',
headers: { Authorization: `Bearer ${process.env.FIREWORKS_API_KEY}` },
body: formData,
});
// Create fine-tuning job:
await fetch('https://api.fireworks.ai/v1/fine_tuning/jobs', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.FIREWORKS_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: 'llama-v3p1-8b-instruct',
training_file: 'file-abc123',
hyperparameters: { n_epochs: 3, learning_rate_multiplier: 1.0 },
}),
});
Fireworks Models
| Model | Context | Input $/M | Notes |
|---|---|---|---|
| llama-v3p3-70b-instruct | 131K | $0.90 | Best general purpose |
| firefunction-v2 | 8K | $0.90 | Best function calling |
| llama-v3p2-11b-vision | 131K | $0.20 | Vision + text |
| phi-3-vision-128k | 128K | $0.20 | Lightweight vision |
| mixtral-8x22b-instruct | 65K | $0.90 | Complex reasoning |
| llama-v3p1-405b-instruct | 131K | $3.00 | Most capable open model |
Together AI: The Model Research Platform
Best for: trying many open-source models, fine-tuning experiments, teams researching model capabilities
import OpenAI from 'openai';
const together = new OpenAI({
apiKey: process.env.TOGETHER_API_KEY,
baseURL: 'https://api.together.xyz/v1',
});
const response = await together.chat.completions.create({
model: 'meta-llama/Llama-3.3-70B-Instruct-Turbo',
messages: [{ role: 'user', content: 'What is 17 * 23?' }],
max_tokens: 100,
temperature: 0.1,
});
Together's 200+ Model Selection
AI21 Labs:
jamba-1.5-large, jamba-1.5-mini
Alibaba:
Qwen/Qwen2.5-72B-Instruct, Qwen/QwQ-32B-Preview
DeepSeek:
deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-R1
Google:
google/gemma-2-27b-it, google/gemma-2-9b-it
Meta:
meta-llama/Llama-3.3-70B-Instruct-Turbo
meta-llama/Llama-3.1-405B-Instruct-Turbo
meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo
Mistral:
mistralai/Mixtral-8x22B-Instruct-v0.1
mistralai/Mistral-7B-Instruct-v0.3
NovaSky:
NovaSky-AI/Sky-T1-32B-Preview
Nvidia:
nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
Together's Fine-Tuning (Most Complete)
Together has the most complete fine-tuning pipeline among the three:
// Together fine-tuning with LoRA:
const fineTuneJob = await fetch('https://api.together.xyz/v1/fine-tunes', {
method: 'POST',
headers: {
Authorization: `Bearer ${process.env.TOGETHER_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: 'meta-llama/Meta-Llama-3-8B-Instruct-Reference',
training_file: 'file-abc123',
n_epochs: 3,
learning_rate: 1e-5,
batch_size: 16,
lora: true, // LoRA fine-tuning (cheaper, faster)
lora_rank: 8,
lora_alpha: 16,
lora_dropout: 0.05,
}),
}).then((r) => r.json());
// After fine-tuning, deploy as dedicated endpoint:
// model: 'your-org/your-fine-tuned-model'
Together Embeddings
Among the three, Together has the most mature embeddings offering:
const embedding = await together.embeddings.create({
model: 'togethercomputer/m2-bert-80M-8k-retrieval',
input: 'text to embed',
});
// 768-dimension embeddings at $0.008/M tokens
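The point of these embeddings is retrieval: embed your documents once, embed each query, and rank by cosine similarity. A minimal sketch of that ranking step, provider-agnostic:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return document indices sorted from most to least similar to the query.
function rankBySimilarity(query: number[], docs: number[][]): number[] {
  return docs
    .map((vec, i) => ({ i, score: cosineSimilarity(query, vec) }))
    .sort((x, y) => y.score - x.score)
    .map((x) => x.i);
}
```

In practice `query` and each entry of `docs` would be the 768-dimension vectors returned by together.embeddings.create.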
Speed Benchmark (2026)
Real-world tokens/second on Llama 3.3 70B:
| Provider | Tokens/sec (output) | First token latency |
|---|---|---|
| Groq | 500-700 | ~80ms |
| Fireworks | 100-150 | ~200ms |
| Together | 80-120 | ~250ms |
| OpenAI GPT-4o | 80-100 | ~400ms |
| Anthropic Claude 3.5 | 60-80 | ~500ms |
Groq's LPU advantage is real and significant — 3-5x faster than GPU-based providers.
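You can reproduce these numbers yourself by timing a streaming completion. The sketch below counts content-bearing chunks as a rough proxy for tokens (chunk-to-token mapping varies by provider, so treat the result as ballpark); the minimal client interface mirrors the OpenAI SDK shape, so any of the three clients above drops in.

```typescript
// Measure first-token latency and approximate output tokens/sec
// from a streaming chat completion.
type StreamChunk = { choices: { delta?: { content?: string } }[] };

interface ChatStreamClient {
  chat: {
    completions: {
      create(params: {
        model: string;
        messages: { role: string; content: string }[];
        stream: true;
        max_tokens?: number;
      }): Promise<AsyncIterable<StreamChunk>>;
    };
  };
}

async function benchmark(client: ChatStreamClient, model: string, prompt: string) {
  const start = performance.now();
  let firstTokenMs: number | null = null;
  let chunks = 0; // one chunk is roughly one token for most providers
  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
    stream: true,
    max_tokens: 256,
  });
  for await (const chunk of stream) {
    if (chunk.choices[0]?.delta?.content) {
      if (firstTokenMs === null) firstTokenMs = performance.now() - start;
      chunks++;
    }
  }
  const elapsedSec = (performance.now() - start) / 1000;
  return { firstTokenMs, approxTokensPerSec: chunks / elapsedSec };
}
```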
Cost Comparison at Scale
For a chatbot handling 1M queries/month (avg 500 input + 200 output tokens):
| Provider + Model | Monthly Cost | Notes |
|---|---|---|
| Groq Llama 3.3 70B | ~$453 | $0.59 input + $0.79 output /M |
| Fireworks Llama 70B | ~$630 | $0.90/M in both directions |
| Together Llama 70B | ~$455 | Per-token rates similar to Groq |
| OpenAI GPT-4o-mini | ~$195 | $0.15 input + $0.60 output /M |
| OpenAI GPT-4o | ~$5,500 | $5 input + $15 output /M |
At these rates, the 70B-class open-model providers run roughly 9-12x cheaper than GPT-4o for the same volume. GPT-4o-mini undercuts them on price, but it is a lighter-weight model; the comparison assumes you need 70B-class quality.
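The monthly figures follow from simple linear arithmetic on the per-token rates, so it's easy to plug in your own traffic profile:

```typescript
// Monthly cost = (input tokens x input rate) + (output tokens x output rate),
// with rates quoted per million tokens.
function monthlyCostUSD(
  queries: number,
  inputTokensPerQuery: number,
  outputTokensPerQuery: number,
  inputRatePerM: number,
  outputRatePerM: number,
): number {
  const inputM = (queries * inputTokensPerQuery) / 1e6;   // millions of input tokens
  const outputM = (queries * outputTokensPerQuery) / 1e6; // millions of output tokens
  return inputM * inputRatePerM + outputM * outputRatePerM;
}

// 1M queries/month at 500 input + 200 output tokens each:
monthlyCostUSD(1_000_000, 500, 200, 0.59, 0.79); // Groq Llama 3.3 70B, ~$453
monthlyCostUSD(1_000_000, 500, 200, 5.0, 15.0);  // GPT-4o, ~$5,500
```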
Decision Framework
Use GROQ if:
→ Chatbot or voice app requiring <200ms response
→ User-facing real-time streaming
→ You don't need fine-tuning
Use FIREWORKS if:
→ You need reliable structured output (JSON schemas)
→ Function calling with open models
→ Production API product where schema compliance matters
→ Fine-tuning with production deployment
Use TOGETHER if:
→ You need to try 10+ different models
→ Research or model comparison
→ Fine-tuning with LoRA and custom hyperparameters
→ Embeddings alongside completions
→ You want DeepSeek R1 or other newer models first
Use all three (via abstraction):
→ Vercel AI SDK or LiteLLM can route to any provider
→ A/B test models across providers
→ Failover: if Groq rate limits, fall back to Together
Multi-Provider Abstraction
// Route to fastest available provider:
import OpenAI from 'openai';
type Provider = 'groq' | 'fireworks' | 'together';
const providers: Record<Provider, OpenAI> = {
groq: new OpenAI({
apiKey: process.env.GROQ_API_KEY,
baseURL: 'https://api.groq.com/openai/v1',
}),
fireworks: new OpenAI({
apiKey: process.env.FIREWORKS_API_KEY,
baseURL: 'https://api.fireworks.ai/inference/v1',
}),
together: new OpenAI({
apiKey: process.env.TOGETHER_API_KEY,
baseURL: 'https://api.together.xyz/v1',
}),
};
const modelMap: Record<Provider, string> = {
groq: 'llama-3.3-70b-versatile',
fireworks: 'accounts/fireworks/models/llama-v3p3-70b-instruct',
together: 'meta-llama/Llama-3.3-70B-Instruct-Turbo',
};
async function inferWithFallback(prompt: string, preferredProvider: Provider = 'groq') {
const order: Provider[] = [
preferredProvider,
...(['groq', 'fireworks', 'together'] as Provider[]).filter((p) => p !== preferredProvider),
];
for (const provider of order) {
try {
const response = await providers[provider].chat.completions.create({
model: modelMap[provider],
messages: [{ role: 'user', content: prompt }],
});
return response.choices[0].message.content;
} catch (err) {
console.error(`${provider} failed, trying next...`, err);
}
}
throw new Error('All providers failed');
}
Compare all AI inference APIs at APIScout.