
Groq API Review: Fastest LLM Inference 2026

By APIScout Team
Tags: groq · llm · inference · api · performance · lpu · 2026

What Is Groq?

Groq is an AI inference company built around a custom chip called the LPU (Language Processing Unit). Where GPU-based inference farms run at 40–100 tokens/second on typical LLMs, Groq's LPU delivers 276–1,500+ tokens/second depending on model — typically 4–20x faster.

The speed difference is real and observable. If you've tried Groq's playground, you'll have watched a full multi-paragraph response complete in under a second. For most AI APIs, that's impossible.

This review covers what Groq actually is, which models are available, what it costs, where it excels, and where its limitations will bite you.

TL;DR

Groq is the right choice when latency is your primary constraint: real-time voice AI, interactive coding assistants, streaming chat where users feel every token delay. It's the wrong choice when you need the most capable models (Groq offers open-source models, not GPT-4.1 or Claude Opus), custom fine-tuning for self-serve accounts, embeddings, or image generation. Think of Groq as a speed-optimized inference layer for open-source models.

Key Takeaways

  • Speed: 276–1,500+ tokens/second depending on model (vs. 40–100 for GPU-based APIs)
  • Models: Open-source only — Llama 4, Llama 3.x, Mixtral, Gemma, Whisper
  • API: Fully OpenAI-compatible; just change base_url
  • Free tier: Available with rate limits; no credit card required
  • Pricing: Competitive — often cheaper than OpenAI equivalents
  • Key limitation: No proprietary models (no GPT, Claude, Gemini)
  • Best for: Real-time chat, voice, coding assistants, low-latency inference

The LPU: Why Groq Is Fast

Standard GPU inference works by batching requests and running matrix multiplications across thousands of GPU cores in parallel. It's excellent for training (lots of parallelism) but less efficient for inference, where you're often running single requests sequentially.

Groq's LPU is a deterministic, single-core processor designed exclusively for inference. It:

  • Runs computation in a fixed, predictable pipeline (no queuing variation)
  • Has extremely high memory bandwidth (fast token generation = fast memory reads)
  • Avoids GPU scheduling overhead entirely

The result is near-zero queue time and significantly higher sustained throughput per token. Groq claims and demonstrates the fastest inference of any commercial API — consistently.

The tradeoff: The LPU is optimized for a specific class of operations. It doesn't support training or the more exotic compute patterns used by diffusion models or specialized architectures. Groq runs transformer-based language models, and it runs them very fast.

Supported Models (2026)

Llama 4 (Meta)

  • llama-4-scout-17b-16e-instruct — Llama 4 Scout (17B active / 109B total MoE)
  • llama-4-maverick-17b-128e-instruct — Llama 4 Maverick (17B active / 400B+ total MoE) — deprecated Feb 20, 2026; use Llama 4 Scout instead

Llama 3.x (Meta)

  • llama-3.3-70b-versatile — Llama 3.3 70B; strong general performance
  • llama-3.1-8b-instant — Fastest model on Groq; 1,000+ t/s
  • llama-3.3-70b-specdec — Speculative decoding variant for speed

Mixtral (Mistral AI)

  • mixtral-8x7b-32768 — Mixtral MoE; 32K context; solid multilingual

Gemma (Google)

  • gemma2-9b-it — Google's Gemma 2 9B instruct

Speech-to-Text

  • whisper-large-v3 — OpenAI's Whisper v3 (speech transcription)
  • whisper-large-v3-turbo — Faster Whisper v3 variant
  • distil-whisper-large-v3-en — English-only; ultra-fast
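Model IDs rotate as deprecations land (Maverick above being one example), so it's worth pulling the live list rather than hardcoding IDs. A minimal sketch using the same OpenAI-compatible client as the rest of this review; it assumes your key is in a `GROQ_API_KEY` environment variable, and `chat_models` is just an illustrative helper:

```python
def chat_models(model_ids):
    """Drop the speech-to-text entries, keeping chat-capable model IDs."""
    return [m for m in model_ids if "whisper" not in m]

if __name__ == "__main__":
    import os
    from openai import OpenAI  # same OpenAI-compatible client used below

    client = OpenAI(
        api_key=os.environ["GROQ_API_KEY"],  # assumes the key is in your env
        base_url="https://api.groq.com/openai/v1",
    )
    model_ids = sorted(m.id for m in client.models.list().data)
    for model_id in chat_models(model_ids):
        print(model_id)
```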

Speed Benchmarks

Groq's sustained output speeds (approximate, varies with load):

| Model | Tokens/Second | Context |
| --- | --- | --- |
| llama-3.1-8b-instant | 1,200–1,500 t/s | 128K |
| llama-3.3-70b-versatile | ~276 t/s | 128K |
| llama-4-scout | ~460 t/s | 128K |
| llama-4-maverick | ~200 t/s (deprecated) | 128K |
| mixtral-8x7b-32768 | 500–700 t/s | 32K |
| gemma2-9b-it | 900–1,100 t/s | 8K |

For comparison, GPT-4o on OpenAI's API typically runs 40–80 tokens/second. Groq's Llama 3.3 70B — a strong model — runs 3–7x faster at ~276 t/s. The smaller 8B model hits 1,200+ t/s, a 15–30x difference.

What this means for users: A 500-token response on GPT-4o takes 6–12 seconds. The same response on Groq's 70B model takes under 2 seconds.
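These throughput numbers vary with load, and you can verify them against your own workload: the `usage` field on each response makes the math trivial. A rough sketch (the model and prompt are arbitrary choices; assumes `GROQ_API_KEY` is set). Note this measures total wall-clock time including network and time-to-first-token, so it slightly understates sustained decode speed:

```python
import time

def tokens_per_second(completion_tokens, elapsed_seconds):
    """Sustained decode throughput for a single request."""
    return completion_tokens / elapsed_seconds

if __name__ == "__main__":
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["GROQ_API_KEY"],
        base_url="https://api.groq.com/openai/v1",
    )
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": "Explain TCP slow start in about 300 words."}],
        max_tokens=512,
    )
    elapsed = time.perf_counter() - start
    tps = tokens_per_second(response.usage.completion_tokens, elapsed)
    print(f"{response.usage.completion_tokens} tokens in {elapsed:.2f}s = {tps:.0f} t/s")
```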

Pricing

Groq's pricing (per 1M tokens):

| Model | Input | Output |
| --- | --- | --- |
| llama-3.1-8b-instant | $0.05 | $0.08 |
| llama-3.3-70b-versatile | $0.59 | $0.79 |
| llama-4-scout | $0.11 | $0.34 |
| llama-4-maverick | $0.50 | $0.77 |
| mixtral-8x7b-32768 | $0.24 | $0.24 |
| gemma2-9b-it | $0.20 | $0.20 |

Whisper pricing (per audio hour):

  • whisper-large-v3: $0.111/hour
  • whisper-large-v3-turbo: $0.04/hour

These are among the cheapest inference prices available for these model sizes. Llama 3.3 70B at $0.59/$0.79 per 1M tokens costs only modestly more than GPT-4o-mini ($0.15/$0.60) while being a far larger model, so on a per-capability basis it's the better deal, and it's dramatically faster.
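At these rates, back-of-envelope budgeting is simple. A small calculator using the prices from the table above (frozen at review time; check the live pricing page before committing to a budget):

```python
# Per-1M-token prices (input, output) in USD, from the table above
PRICES = {
    "llama-3.1-8b-instant": (0.05, 0.08),
    "llama-3.3-70b-versatile": (0.59, 0.79),
    "llama-4-scout": (0.11, 0.34),
    "mixtral-8x7b-32768": (0.24, 0.24),
    "gemma2-9b-it": (0.20, 0.20),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of one request at the listed per-1M-token rates."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# 1M requests of 500 input / 300 output tokens on the 70B model:
total = 1_000_000 * request_cost("llama-3.3-70b-versatile", 500, 300)  # ~$532
```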

Free Tier

Groq's free tier includes:

  • No credit card required
  • Rate-limited but functional for development
  • Access to all models
  • Resets daily/hourly depending on the model

Free tier rate limits are quite restrictive (e.g., 30 RPM, 14,400 RPD for smaller models). Fine for prototyping, not for production.
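If you develop against the free tier, expect 429 responses, so a small exponential-backoff wrapper keeps prototypes usable. This sketch matches errors by type name so it works with the SDK's `RateLimitError` without importing the package here; `max_retries` and `base_delay` are arbitrary defaults to tune:

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry fn() with exponential backoff plus jitter on rate-limit errors.

    Matches any exception whose type name contains 'RateLimit', which
    covers openai.RateLimitError without importing the SDK here.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if "RateLimit" not in type(exc).__name__ or attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

Wrap any client call, e.g. `call_with_backoff(lambda: client.chat.completions.create(...))`.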

API Usage

Groq's API is OpenAI-compatible. Switch by changing two lines:

from openai import OpenAI

client = OpenAI(
    api_key="gsk_your_groq_key",
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to parse JSON safely."},
    ],
    temperature=0.1,
    max_tokens=1024,
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Tell me a quick joke."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

At Groq speeds, streaming Llama 3.1 8B renders text faster than anyone can read; a full response paints on screen almost instantly. Users notice.
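Perceived snappiness depends on time-to-first-token (TTFT) as much as sustained throughput, and streaming makes both easy to measure. A sketch for timing any streaming response; `drain_stream` is an illustrative helper (the injectable clock exists only so the timing logic is testable), and the live portion assumes `GROQ_API_KEY` is set:

```python
import time

def drain_stream(pieces, clock=time.perf_counter):
    """Consume an iterable of text chunks; return (ttft, total, full_text)."""
    start = clock()
    first = None
    parts = []
    for piece in pieces:
        if first is None:
            first = clock() - start  # approximate time to first token
        parts.append(piece)
    return first, clock() - start, "".join(parts)

if __name__ == "__main__":
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["GROQ_API_KEY"],
        base_url="https://api.groq.com/openai/v1",
    )
    stream = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[{"role": "user", "content": "Summarize HTTP/2 in one paragraph."}],
        stream=True,
    )
    pieces = (c.choices[0].delta.content for c in stream if c.choices[0].delta.content)
    ttft, total, text = drain_stream(pieces)
    print(f"TTFT {ttft * 1000:.0f}ms, total {total * 1000:.0f}ms, {len(text)} chars")
```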

Speech-to-Text

with open("audio.mp3", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        file=audio_file,
        model="whisper-large-v3-turbo",
        response_format="text",
        language="en",
    )
print(transcription)

Groq's Whisper is the fastest managed Whisper transcription available. The turbo variant processes audio at 50–100x real-time speed.

Structured Output

from pydantic import BaseModel

class Sentiment(BaseModel):
    score: float  # -1.0 to 1.0
    label: str    # positive / negative / neutral
    confidence: float

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {
            "role": "system",
            "content": "Reply in JSON with keys: score (float, -1.0 to 1.0), "
                       "label (positive/negative/neutral), confidence (float).",
        },
        {"role": "user", "content": "Analyze sentiment: 'This product is amazing!'"},
    ],
    response_format={"type": "json_object"},
)

# Parse and validate manually — Groq supports json_object mode but not
# json_schema validation yet (the prompt itself must ask for JSON)
import json
result = Sentiment(**json.loads(response.choices[0].message.content))

Note: Groq supports json_object response format but not the newer json_schema / structured output mode that OpenAI introduced. You can prompt for JSON and parse, but you don't get schema enforcement.
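Until schema enforcement lands, a practical pattern is to validate client-side and simply re-ask on failure. A stdlib-only sketch; `parse_sentiment` and `sentiment_with_retry` are illustrative names, and you could equally validate with Pydantic as above:

```python
import json

def parse_sentiment(raw):
    """Client-side stand-in for schema enforcement: dict if valid, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    ok = (
        isinstance(data, dict)
        and isinstance(data.get("score"), (int, float))
        and isinstance(data.get("label"), str)
        and isinstance(data.get("confidence"), (int, float))
    )
    return data if ok else None

def sentiment_with_retry(generate, max_attempts=3):
    """Call generate() (which returns raw model text) until the output validates."""
    for _ in range(max_attempts):
        parsed = parse_sentiment(generate())
        if parsed is not None:
            return parsed
    raise ValueError("model never produced valid JSON")
```

Here `generate` would be a closure over the `client.chat.completions.create(...)` call shown above, so each retry re-runs the request.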

Real-World Use Cases

Voice AI / Speech Processing

Groq + Whisper is the fastest pipeline for speech-to-text. Combine with a fast LLM for voice assistants:

  • Transcribe audio with Whisper Large v3 Turbo
  • Process/respond with Llama 3.3 70B
  • TTS with ElevenLabs or OpenAI

Full round-trip under 500ms is achievable.

Real-Time Coding Assistant

IDE integrations where latency matters most. Llama 3.3 70B on Groq completes 200-token code suggestions before users finish reading the first line.

Interactive Chat Applications

Consumer chat apps where slow responses kill engagement. The speed difference between Groq and standard APIs is dramatic enough that users comment on it.

High-Volume, Cost-Sensitive Pipelines

Groq's Llama pricing is among the cheapest for 70B-class models. For pipelines processing millions of requests, the cost savings add up — especially if you don't need GPT-4.1-level capability.

Limitations

No proprietary models: No GPT, no Claude, no Gemini. If you need frontier proprietary model capability, Groq is not your option.

No embeddings API: Use OpenAI, Cohere, or Voyage AI for embeddings. Groq is inference-only.

Limited fine-tuning: Standard Groq doesn't support fine-tuning. Enterprise accounts can access LoRA fine-tuning via GroqCloud, but it's not available for self-serve. For custom models, use Together AI or OpenAI fine-tuning and deploy elsewhere.

No image generation or vision (mostly): Some vision-capable models are being added but it's not the primary use case.

Context limits: Most models cap at 128K tokens. Not a limitation for most use cases, but 1M context (Gemini) isn't available.

Rate limits at free tier: Restrictive. Plan for paid tier in production.

Model variety: ~15 models vs. 100+ on OpenRouter. If you need obscure models, Groq won't have them.

Building a Low-Latency Chat Application

Here's a full example of a low-latency chat server using FastAPI and Groq:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()

groq_client = AsyncOpenAI(
    api_key="gsk_your_key",
    base_url="https://api.groq.com/openai/v1",
)

@app.post("/chat")
async def chat(message: str, model: str = "llama-3.3-70b-versatile"):
    """Stream a chat response at Groq speed."""

    async def generate():
        # AsyncOpenAI keeps the event loop free while tokens stream in
        stream = await groq_client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": message}],
            stream=True,
            max_tokens=1024,
        )
        async for chunk in stream:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    return StreamingResponse(generate(), media_type="text/plain")

@app.post("/classify")
async def classify(text: str) -> dict:
    """Fast intent classification using 8B model."""
    response = await groq_client.chat.completions.create(
        model="llama-3.1-8b-instant",  # Fastest model; 1000+ t/s
        messages=[
            {
                "role": "system",
                "content": "Classify the intent. Reply with one word: question, complaint, compliment, or request.",
            },
            {"role": "user", "content": text},
        ],
        max_tokens=10,
        temperature=0,
    )
    return {"intent": response.choices[0].message.content.strip()}

The 8B model at 1,000+ t/s is effectively instantaneous for short classification tasks — sub-100ms total response time including network.

Groq in Production: Real Patterns

Voice AI Pipeline

from openai import OpenAI

groq = OpenAI(
    api_key="gsk_your_key",
    base_url="https://api.groq.com/openai/v1",
)

def voice_pipeline(audio_bytes: bytes) -> str:
    """Transcribe → understand → respond at sub-500ms total."""

    # Step 1: Transcribe audio (Whisper turbo is 50x real-time)
    transcription = groq.audio.transcriptions.create(
        file=("audio.webm", audio_bytes, "audio/webm"),
        model="whisper-large-v3-turbo",
        response_format="text",
    )
    user_text = transcription

    # Step 2: Generate response (Llama 3.3 70B at ~276 t/s)
    response = groq.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": "You are a helpful voice assistant. Be concise."},
            {"role": "user", "content": user_text},
        ],
        max_tokens=256,  # Short response for voice
    )

    return response.choices[0].message.content

# Full round-trip: typically 200–500ms
# Compare: same pipeline on OpenAI → 2–5 seconds

Parallel Batch Inference

When you need to process many items quickly, Groq's throughput compounds:

import asyncio
from openai import AsyncOpenAI

groq = AsyncOpenAI(
    api_key="gsk_your_key",
    base_url="https://api.groq.com/openai/v1",
)

async def classify_many(texts: list[str]) -> list[str]:
    """Classify hundreds of texts in parallel using Groq's throughput."""
    async def classify_one(text: str) -> str:
        response = await groq.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[
                {"role": "system", "content": "Label as positive/negative/neutral. One word only."},
                {"role": "user", "content": text},
            ],
            max_tokens=5,
            temperature=0,
        )
        return response.choices[0].message.content.strip()

    # Fire all requests in parallel
    results = await asyncio.gather(*[classify_one(t) for t in texts])
    return results

# 100 texts classified in ~2 seconds (vs. ~60 seconds on standard GPU inference)
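One caution on the pattern above: firing hundreds of requests simultaneously will trip the RPM caps discussed below on most tiers. A bounded variant using `asyncio.Semaphore`; `gather_bounded` is an illustrative helper, and `limit` should be picked to match your tier:

```python
import asyncio

async def gather_bounded(coro_fns, limit=20):
    """Run coroutine factories with at most `limit` in flight, preserving order."""
    sem = asyncio.Semaphore(limit)

    async def run(fn):
        async with sem:
            return await fn()

    return await asyncio.gather(*(run(fn) for fn in coro_fns))
```

`classify_many` above could then replace its bare `gather` with `await gather_bounded([lambda t=t: classify_one(t) for t in texts], limit=25)`.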

Rate Limits in Production

Groq's free tier is restrictive. Paid tiers are much more permissive:

| Tier | Requests/min | Tokens/min | Tokens/day |
| --- | --- | --- | --- |
| Free | 30 RPM | 14,400 TPM | 500K TPD |
| Developer (paid) | 6,000 RPM | 200K TPM | Unlimited |
| Batch | Higher | Higher | Higher |

For serious production workloads, you need the Developer tier. The free tier is genuinely only for development and demos.

Groq also supports batch inference — submit jobs asynchronously for higher throughput and lower cost. Good for offline processing pipelines.

Groq vs. The Alternatives

| Factor | Groq | OpenAI | Anthropic | Together AI | OpenRouter |
| --- | --- | --- | --- | --- | --- |
| Speed | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| Model quality ceiling | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Cost | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ |
| Model variety | ⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Developer experience | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Fine-tuning | LoRA (enterprise) | ✅ | ❌ | ✅ | ❌ (pass-through) |

When to Use Groq

✅ Use Groq when:

  • Latency is a primary product requirement (voice, real-time, interactive)
  • You're using Llama 4, Llama 3.x, Mixtral, or Gemma
  • You want the cheapest inference for 70B-class models
  • You need fast Whisper transcription

❌ Skip Groq when:

  • You need GPT-4.1, Claude Opus, or Gemini quality
  • You need embeddings, fine-tuning, or image generation
  • You need 1M+ context windows
  • You need structured output schema enforcement

Bottom Line

Groq is not trying to replace OpenAI or Anthropic — it's a specialized inference infrastructure for open-source models that prioritizes speed above all else. The LPU architecture delivers on its promise: Llama 4 at 1 second per response is a genuinely different product experience than the same model at 10 seconds.

If your application's quality bar is met by Llama 4 or Llama 3.3 70B (and in 2026 that bar is quite high), and latency matters, Groq is the easy choice. At these prices, you're not sacrificing much on cost either.


Compare all LLM inference APIs at APIScout.

Related: OpenRouter API: One Key for 100+ LLMs · How to Choose an LLM API in 2026
