Groq API Review: Fastest LLM Inference 2026
What Is Groq?
Groq is an AI inference company built around a custom chip called the LPU (Language Processing Unit). Where GPU-based inference farms run at 40–100 tokens/second on typical LLMs, Groq's LPU delivers 276–1,500+ tokens/second depending on model — typically 4–20x faster.
The speed difference is real and observable. If you've tried Groq's playground, you've watched full, high-quality responses complete in under a second. For most AI APIs, that's impossible.
This review covers what Groq actually is, which models are available, what it costs, where it excels, and where its limitations will bite you.
TL;DR
Groq is the right choice when latency is your primary constraint: real-time voice AI, interactive coding assistants, streaming chat where users feel every token delay. It's the wrong choice when you need the most capable models (Groq offers open-source models, not GPT-4.1 or Claude Opus), custom fine-tuning for self-serve accounts, embeddings, or image generation. Think of Groq as a speed-optimized inference layer for open-source models.
Key Takeaways
- Speed: 276–1,500+ tokens/second depending on model (vs. 40–100 for GPU-based APIs)
- Models: Open-source only — Llama 4, Llama 3.x, Mixtral, Gemma, Whisper
- API: Fully OpenAI-compatible; just change `base_url`
- Free tier: Available with rate limits; no credit card required
- Pricing: Competitive — often cheaper than OpenAI equivalents
- Key limitation: No proprietary models (no GPT, Claude, Gemini)
- Best for: Real-time chat, voice, coding assistants, low-latency inference
The LPU: Why Groq Is Fast
Standard GPU inference works by batching requests and running matrix multiplications across thousands of GPU cores in parallel. It's excellent for training (lots of parallelism) but less efficient for inference, where you're often running single requests sequentially.
Groq's LPU is a deterministic, single-core processor designed exclusively for inference. It:
- Runs computation in a fixed, predictable pipeline (no queuing variation)
- Has extremely high memory bandwidth (fast token generation = fast memory reads)
- Avoids GPU scheduling overhead entirely
The result is near-zero queue time and significantly higher sustained throughput per token. Groq claims and demonstrates the fastest inference of any commercial API — consistently.
The tradeoff: The LPU is optimized for a specific class of operations. It doesn't support training or the more exotic compute patterns used by diffusion models or specialized architectures. Groq runs transformer-based language models, and it runs them very fast.
Supported Models (2026)
Llama 4 (Meta)
- `llama-4-scout-17b-16e-instruct` — Llama 4 Scout (17B active / 109B total MoE)
- `llama-4-maverick-17b-128e-instruct` — Llama 4 Maverick (17B active / 400B+ total MoE); deprecated Feb 20, 2026, use Llama 4 Scout instead
Llama 3.x (Meta)
- `llama-3.3-70b-versatile` — Llama 3.3 70B; strong general performance
- `llama-3.1-8b-instant` — Fastest model on Groq; 1,000+ t/s
- `llama-3.3-70b-specdec` — Speculative decoding variant for speed
Mixtral (Mistral AI)
- `mixtral-8x7b-32768` — Mixtral MoE; 32K context; solid multilingual
Gemma (Google)
- `gemma2-9b-it` — Google's Gemma 2 9B instruct
Speech-to-Text
- `whisper-large-v3` — OpenAI's Whisper v3 (speech transcription)
- `whisper-large-v3-turbo` — Faster Whisper v3 variant
- `distil-whisper-large-v3-en` — English-only; ultra-fast
Speed Benchmarks
Groq's sustained output speeds (approximate, varies with load):
| Model | Tokens/Second | Context |
|---|---|---|
| llama-3.1-8b-instant | 1,200–1,500 t/s | 128K |
| llama-3.3-70b-versatile | ~276 t/s | 128K |
| llama-4-scout | ~460 t/s | 128K |
| llama-4-maverick | ~200 t/s (deprecated) | 128K |
| mixtral-8x7b-32768 | 500–700 t/s | 32K |
| gemma2-9b-it | 900–1,100 t/s | 8K |
For comparison, GPT-4o on OpenAI's API typically runs 40–80 tokens/second. Groq's Llama 3.3 70B — a strong model — runs 3–7x faster at ~276 t/s. The smaller 8B model hits 1,200+ t/s, a 15–30x difference.
What this means for users: a 500-token response on GPT-4o takes 6–12 seconds. The same response on Groq's 70B model takes under 2 seconds, and under half a second on the 8B model.
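The arithmetic behind that comparison is worth making explicit: generation time is roughly output tokens divided by sustained throughput, ignoring network latency and time-to-first-token.

```python
def response_seconds(tokens: int, tokens_per_second: float) -> float:
    """Approximate generation time, ignoring network and time-to-first-token."""
    return tokens / tokens_per_second

# A 500-token response at GPT-4o-like speeds (40-80 t/s)
# vs. Groq's Llama 3.3 70B (~276 t/s):
gpt4o_worst = response_seconds(500, 40)   # 12.5 s
gpt4o_best = response_seconds(500, 80)    # 6.25 s
groq_70b = response_seconds(500, 276)     # ~1.8 s
```

The gap widens with response length: the longer the output, the more those per-token savings compound.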
Pricing
Groq's pricing (per 1M tokens):
| Model | Input | Output |
|---|---|---|
| llama-3.1-8b-instant | $0.05 | $0.08 |
| llama-3.3-70b-versatile | $0.59 | $0.79 |
| llama-4-scout | $0.11 | $0.34 |
| llama-4-maverick | $0.50 | $0.77 |
| mixtral-8x7b-32768 | $0.24 | $0.24 |
| gemma2-9b-it | $0.20 | $0.20 |
Whisper pricing (per audio hour):
- whisper-large-v3: $0.111/hour
- whisper-large-v3-turbo: $0.04/hour
These are among the cheapest inference prices available for these model sizes. Llama 3.3 70B at $0.59/$0.79 per 1M tokens costs more per token than GPT-4o-mini ($0.15/$0.60), but it is a far more capable model class for the money, and dramatically faster.
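To see what those per-token prices mean at pipeline scale, here's a back-of-envelope cost calculation using the table's Llama 3.3 70B rates and an assumed request shape of 500 input / 300 output tokens:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost in USD for one request; prices are per 1M tokens."""
    return input_tokens * in_price / 1e6 + output_tokens * out_price / 1e6

# Llama 3.3 70B on Groq: $0.59 input / $0.79 output per 1M tokens
per_request = request_cost(500, 300, 0.59, 0.79)  # ~$0.000532
million_requests = per_request * 1_000_000        # ~$532
```

At roughly $532 per million such requests, this is the kind of workload where 70B-class quality stays affordable at scale.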
Free Tier
Groq's free tier includes:
- No credit card required
- Rate-limited but functional for development
- Access to all models
- Resets daily/hourly depending on the model
Free tier rate limits are quite restrictive (e.g., 30 RPM, 14,400 RPD for smaller models). Fine for prototyping, not for production.
API Usage
Groq's API is OpenAI-compatible. Switch by changing two lines:
```python
from openai import OpenAI

client = OpenAI(
    api_key="gsk_your_groq_key",
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to parse JSON safely."},
    ],
    temperature=0.1,
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
Streaming
```python
stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Tell me a quick joke."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
At Groq speeds, streaming Llama 3.1 8B delivers text faster than anyone can read it. Users notice.
Speech-to-Text
```python
with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-large-v3-turbo",
        response_format="text",
        language="en",
    )
print(transcription)
```
Groq's Whisper is the fastest managed Whisper transcription available. The turbo variant processes audio at 50–100x real-time speed.
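At those multiples, the wall-clock math is simple: transcription time is audio length divided by the real-time speed multiple.

```python
def transcription_seconds(audio_seconds: float, speed_multiple: float) -> float:
    """Wall-clock time to transcribe audio at an N-x real-time speed."""
    return audio_seconds / speed_multiple

hour = 3600
fastest = transcription_seconds(hour, 100)  # 36 s
slowest = transcription_seconds(hour, 50)   # 72 s
```

An hour of audio therefore transcribes in roughly 36–72 seconds on the turbo variant, before network overhead.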
Structured Output
```python
from pydantic import BaseModel

class Sentiment(BaseModel):
    score: float       # -1.0 to 1.0
    label: str         # positive / negative / neutral
    confidence: float

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {
            "role": "system",
            # json_object mode needs the prompt to ask for JSON explicitly
            "content": "Reply with a JSON object: score (-1.0 to 1.0), "
                       "label (positive/negative/neutral), confidence (0 to 1).",
        },
        {"role": "user", "content": "Analyze sentiment: 'This product is amazing!'"},
    ],
    response_format={"type": "json_object"},
)

# Parse manually — Groq supports json_object mode but not json_schema validation yet
result = Sentiment.model_validate_json(response.choices[0].message.content)
```
Note: Groq supports the `json_object` response format but not the newer `json_schema` / structured-output mode that OpenAI introduced. You can prompt for JSON and parse it, but you don't get schema enforcement.
Real-World Use Cases
Voice AI / Speech Processing
Groq + Whisper is the fastest pipeline for speech-to-text. Combine with a fast LLM for voice assistants:
- Transcribe audio with Whisper Large v3 Turbo
- Process/respond with Llama 3.3 70B
- TTS with ElevenLabs or OpenAI
Full round-trip under 500ms is achievable.
Real-Time Coding Assistant
IDE integrations where latency matters most. Llama 3.3 70B on Groq completes 200-token code suggestions before users finish reading the first line.
Interactive Chat Applications
Consumer chat apps where slow responses kill engagement. The speed difference between Groq and standard APIs is dramatic enough that users comment on it.
High-Volume, Cost-Sensitive Pipelines
Groq's Llama pricing is among the cheapest for 70B-class models. For pipelines processing millions of requests, the cost savings add up — especially if you don't need GPT-4.1-level capability.
Limitations
No proprietary models: No GPT, no Claude, no Gemini. If you need frontier proprietary-model capability, Groq isn't an option.
No embeddings API: Use OpenAI, Cohere, or Voyage AI for embeddings. Groq is inference-only.
Limited fine-tuning: Standard Groq doesn't support fine-tuning. Enterprise accounts can access LoRA fine-tuning via GroqCloud, but it's not available for self-serve. For custom models, use Together AI or OpenAI fine-tuning and deploy elsewhere.
No image generation or vision (mostly): Some vision-capable models are being added but it's not the primary use case.
Context limits: Most models cap at 128K tokens. Not a limitation for most use cases, but 1M context (Gemini) isn't available.
Rate limits at free tier: Restrictive. Plan for paid tier in production.
Model variety: ~15 models vs. 100+ on OpenRouter. If you need obscure models, Groq won't have them.
Building a Low-Latency Chat Application
Here's a full example of a low-latency chat server using FastAPI and Groq:
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import OpenAI

app = FastAPI()

groq_client = OpenAI(
    api_key="gsk_your_key",
    base_url="https://api.groq.com/openai/v1",
)

@app.post("/chat")
async def chat(message: str, model: str = "llama-3.3-70b-versatile"):
    """Stream a chat response at Groq speed."""
    # A sync generator: StreamingResponse iterates it in a threadpool,
    # so the blocking OpenAI client doesn't stall the event loop.
    def generate():
        stream = groq_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": message}],
            stream=True,
            max_tokens=1024,
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    return StreamingResponse(generate(), media_type="text/plain")

@app.post("/classify")
async def classify(text: str) -> dict:
    """Fast intent classification using the 8B model."""
    response = groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",  # Fastest model; 1,000+ t/s
        messages=[
            {
                "role": "system",
                "content": "Classify the intent. Reply with one word: "
                           "question, complaint, compliment, or request.",
            },
            {"role": "user", "content": text},
        ],
        max_tokens=10,
        temperature=0,
    )
    return {"intent": response.choices[0].message.content.strip()}
```
The 8B model at 1,000+ t/s is effectively instantaneous for short classification tasks — sub-100ms total response time including network.
Groq in Production: Real Patterns
Voice AI Pipeline
```python
from openai import OpenAI

groq = OpenAI(
    api_key="gsk_your_key",
    base_url="https://api.groq.com/openai/v1",
)

async def voice_pipeline(audio_bytes: bytes) -> str:
    """Transcribe → understand → respond at sub-500ms total."""
    # Step 1: Transcribe audio (Whisper turbo is 50x real-time)
    transcription = groq.audio.transcriptions.create(
        file=("audio.webm", audio_bytes, "audio/webm"),
        model="whisper-large-v3-turbo",
        response_format="text",
    )
    user_text = transcription

    # Step 2: Generate response (Llama 3.3 70B at ~276 t/s)
    response = groq.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": "You are a helpful voice assistant. Be concise."},
            {"role": "user", "content": user_text},
        ],
        max_tokens=256,  # Short response for voice
    )
    return response.choices[0].message.content

# Full round-trip: typically 200–500ms
# Compare: same pipeline on OpenAI → 2–5 seconds
```
Parallel Batch Inference
When you need to process many items quickly, Groq's throughput compounds:
```python
import asyncio
from openai import AsyncOpenAI

groq = AsyncOpenAI(
    api_key="gsk_your_key",
    base_url="https://api.groq.com/openai/v1",
)

async def classify_many(texts: list[str]) -> list[str]:
    """Classify hundreds of texts in parallel using Groq's throughput."""
    async def classify_one(text: str) -> str:
        response = await groq.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[
                {"role": "system", "content": "Label as positive/negative/neutral. One word only."},
                {"role": "user", "content": text},
            ],
            max_tokens=5,
            temperature=0,
        )
        return response.choices[0].message.content.strip()

    # Fire all requests in parallel
    return await asyncio.gather(*[classify_one(t) for t in texts])

# 100 texts classified in ~2 seconds (vs. ~60 seconds on standard GPU inference)
```
# 100 texts classified in ~2 seconds (vs. ~60 seconds on standard GPU inference)
Rate Limits in Production
Groq's free tier is restrictive. Paid tiers are much more permissive:
| Tier | Requests/min | Tokens/min | Tokens/day |
|---|---|---|---|
| Free | 30 RPM | 14,400 TPM | 500K TPD |
| Developer (paid) | 6,000 RPM | 200K TPM | Unlimited |
| Batch | Higher | Higher | Higher |
For serious production workloads, you need the Developer tier. The free tier is genuinely only for development and demos.
Groq also supports batch inference — submit jobs asynchronously for higher throughput and lower cost. Good for offline processing pipelines.
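Even on the paid tier, burst traffic can hit the RPM ceiling and return 429s, so production clients typically pair a concurrency cap with exponential backoff. A sketch of that pattern; `RateLimited` here is a stand-in for the SDK's actual rate-limit exception (e.g. `openai.RateLimitError`), and the limit of 20 is an arbitrary example:

```python
import asyncio
import random

class RateLimited(Exception):
    """Stand-in for the SDK's rate-limit error (openai.RateLimitError in practice)."""

async def with_backoff(call, max_retries: int = 5, base_delay: float = 0.5,
                       retry_on: type[Exception] = RateLimited):
    """Run a zero-arg async callable, retrying with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            return await call()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            await asyncio.sleep(base_delay * 2 ** attempt + random.random() * 0.1)

async def bounded_gather(calls, limit: int = 20):
    """Cap in-flight requests so a burst stays under the tier's RPM ceiling."""
    sem = asyncio.Semaphore(limit)

    async def run(call):
        async with sem:
            return await with_backoff(call)

    return await asyncio.gather(*(run(c) for c in calls))
```

Wrapping the `classify_one`-style coroutines from the batch example above in `bounded_gather` keeps parallelism high without tripping 429s.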
Groq vs. The Alternatives
| Factor | Groq | OpenAI | Anthropic | Together AI | OpenRouter |
|---|---|---|---|---|---|
| Speed | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Model quality ceiling | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Cost | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Model variety | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Developer experience | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Fine-tuning | LoRA (enterprise) | ✅ | ❌ | ✅ | ❌ (pass-through) |
When to Use Groq
✅ Use Groq when:
- Latency is a primary product requirement (voice, real-time, interactive)
- You're using Llama 4, Llama 3.x, Mixtral, or Gemma
- You want the cheapest inference for 70B-class models
- You need fast Whisper transcription
❌ Skip Groq when:
- You need GPT-4.1, Claude Opus, or Gemini quality
- You need embeddings, fine-tuning, or image generation
- You need 1M+ context windows
- You need structured output schema enforcement
Bottom Line
Groq is not trying to replace OpenAI or Anthropic — it's a specialized inference infrastructure for open-source models that prioritizes speed above all else. The LPU architecture delivers on its promise: Llama 4 at 1 second per response is a genuinely different product experience than the same model at 10 seconds.
If your application's quality bar is met by Llama 4 or Llama 3.3 70B (and in 2026 that bar sits quite high), and latency matters, Groq is the easy choice. At these prices, you're not sacrificing much on cost either.
Compare all LLM inference APIs at APIScout.
Related: OpenRouter API: One Key for 100+ LLMs · How to Choose an LLM API in 2026