# Building an AI-Powered App: Choosing Your API Stack
*By the APIScout Team*
Building an AI-powered app isn't just picking an LLM. It's choosing providers for inference, embeddings, vector storage, guardrails, monitoring, and more. Here's the complete stack and how to choose the right API for each layer.
## The AI API Stack
```
┌─────────────────────────────────────┐
│ Frontend / User Interface           │  Chat UI, streaming display
├─────────────────────────────────────┤
│ AI Gateway / Router                 │  LiteLLM, Portkey, or custom
│ (model routing, fallback, caching)  │
├─────────────────────────────────────┤
│ LLM Provider                        │  OpenAI, Anthropic, Google, OSS
│ (chat, completion, reasoning)       │
├─────────────────────────────────────┤
│ Embeddings                          │  OpenAI, Cohere, Voyage AI
│ (text → vectors for search/RAG)     │
├─────────────────────────────────────┤
│ Vector Database                     │  Pinecone, Weaviate, Qdrant
│ (similarity search, retrieval)      │
├─────────────────────────────────────┤
│ Document Processing                 │  Unstructured, LlamaParse
│ (PDF, HTML → chunks)                │
├─────────────────────────────────────┤
│ Guardrails / Safety                 │  Guardrails AI, NeMo
│ (content filtering, validation)     │
├─────────────────────────────────────┤
│ Monitoring / Observability          │  Helicone, Langfuse, Braintrust
│ (cost tracking, quality, latency)   │
└─────────────────────────────────────┘
```
## Layer 1: LLM Provider
| Provider | Best Model | Strength | Pricing (1M tokens) |
|---|---|---|---|
| Anthropic | Claude Sonnet | Coding, analysis, safety | $3 in / $15 out |
| OpenAI | GPT-4o | Multimodal, ecosystem | $5 in / $15 out |
| Google | Gemini 2.0 Pro | Long context, multimodal | $1.25 in / $5 out |
| Groq | Llama 3.3 70B | Ultra-fast inference | $0.59 in / $0.79 out |
| Together AI | Open models | Variety, competitive price | $0.20-3.00 |
### Choosing Your Primary LLM
- **Need best reasoning/coding?** → Claude Sonnet or GPT-4o
- **Need cheapest good model?** → Gemini 2.0 Flash or Llama 3.3
- **Need fastest inference?** → Groq
- **Need multimodal (vision)?** → GPT-4o or Claude Sonnet
- **Need open-source/self-host?** → Llama 3.3 via Together/vLLM
- **Need 1M+ token context?** → Gemini (2M context window)
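The decision list above can be sketched as a small helper that maps a primary requirement to a default model. The model IDs below are illustrative snapshots of the table in this post; they change frequently, so treat them as placeholders, not canonical names:

```typescript
// Map a primary requirement to a sensible default model.
// Model IDs are illustrative and drift over time — check provider docs.
type Requirement =
  | 'reasoning'
  | 'cheapest'
  | 'fastest'
  | 'multimodal'
  | 'open-source'
  | 'long-context';

function pickModel(need: Requirement): string {
  switch (need) {
    case 'reasoning':    return 'claude-sonnet-4-20250514';
    case 'cheapest':     return 'gemini-2.0-flash';
    case 'fastest':      return 'groq/llama-3.3-70b-versatile';
    case 'multimodal':   return 'gpt-4o';
    case 'open-source':  return 'llama-3.3-70b';
    case 'long-context': return 'gemini-2.0-pro';
  }
}
```

Because `Requirement` is an exhaustive union, the compiler flags any requirement you forget to handle — useful when model names churn.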
### Basic Integration
```typescript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function chat(userMessage: string, systemPrompt?: string) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system: systemPrompt || 'You are a helpful assistant.',
    messages: [{ role: 'user', content: userMessage }],
  });
  return response.content[0].type === 'text'
    ? response.content[0].text
    : '';
}
```
## Layer 2: Embeddings
| Provider | Model | Dimensions | Pricing (1M tokens) |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | $0.02 |
| OpenAI | text-embedding-3-large | 3072 | $0.13 |
| Cohere | embed-v4 | 1024 | $0.10 |
| Voyage AI | voyage-3 | 1024 | $0.06 |
| Google | text-embedding-004 | 768 | Free (low volume) |
### Embedding Pipeline
```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

async function embedText(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,
  });
  return response.data.map(item => item.embedding);
}

// Chunk documents before embedding
function chunkText(text: string, chunkSize: number = 500, overlap: number = 50): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    start += chunkSize - overlap;
  }
  return chunks;
}
```
## Layer 3: Vector Database
| Database | Type | Free Tier | Best For |
|---|---|---|---|
| Pinecone | Managed | 1 index, 100K vectors | Simplest setup |
| Weaviate | Managed / Self-hosted | 14-day trial | Hybrid search |
| Qdrant | Managed / Self-hosted | 1GB free | Performance, self-hosting |
| Chroma | Self-hosted | Free (OSS) | Local development |
| pgvector | PostgreSQL extension | Free (with Postgres) | Already using Postgres |
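The RAG pipeline below assumes a `vectorDB` client exposing Pinecone-style `upsert` and `query` methods. For local experimentation before you commit to a provider, a minimal in-memory stand-in with the same shape might look like this (the class and interface names are ours, not a real SDK, and it is not production-grade):

```typescript
// Minimal in-memory vector store with a Pinecone-like surface.
// For local development only — swap in a real client for production.
interface VectorRecord {
  id: string;
  values: number[];
  metadata: Record<string, string>;
}

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class InMemoryVectorDB {
  private records = new Map<string, VectorRecord>();

  async upsert(records: VectorRecord[]): Promise<void> {
    for (const r of records) this.records.set(r.id, r);
  }

  async query(params: { vector: number[]; topK: number }) {
    // Brute-force scan: fine for thousands of vectors, not millions.
    const matches = [...this.records.values()]
      .map(r => ({ ...r, score: cosine(r.values, params.vector) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, params.topK);
    return { matches };
  }
}

const vectorDB = new InMemoryVectorDB();
```

Because the surface matches, you can develop the indexing and query code against this and swap in Pinecone or Qdrant later without touching the pipeline.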
### RAG Pipeline
```typescript
// Complete RAG (Retrieval-Augmented Generation) pipeline.
// `vectorDB` is your vector store client (e.g. a Pinecone index),
// and `chunkText`/`embedText` are defined in Layer 2 above.

// 1. Index documents
async function indexDocuments(documents: { id: string; text: string }[]) {
  for (const doc of documents) {
    const chunks = chunkText(doc.text);
    const embeddings = await embedText(chunks);
    // Store in vector database
    await vectorDB.upsert(
      chunks.map((chunk, i) => ({
        id: `${doc.id}_${i}`,
        values: embeddings[i],
        metadata: { text: chunk, documentId: doc.id },
      }))
    );
  }
}

// 2. Query with RAG
async function ragQuery(question: string): Promise<string> {
  // Embed the question
  const [questionEmbedding] = await embedText([question]);

  // Find relevant chunks
  const results = await vectorDB.query({
    vector: questionEmbedding,
    topK: 5,
  });

  // Build context from retrieved chunks
  const context = results.matches
    .map(match => match.metadata.text)
    .join('\n\n');

  // Generate answer with context
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system: `Answer based on this context:\n\n${context}`,
    messages: [{ role: 'user', content: question }],
  });
  return response.content[0].type === 'text' ? response.content[0].text : '';
}
```
## Layer 4: AI Gateway
Route requests across multiple providers with fallback, caching, and cost tracking:
```typescript
// Using LiteLLM as AI gateway
import { completion } from 'litellm';

// Same interface, any provider
const response = await completion({
  model: 'anthropic/claude-sonnet-4-20250514', // or 'gpt-4o', 'groq/llama-3.3-70b'
  messages: [{ role: 'user', content: 'Hello' }],
  // Automatic fallback
  fallbacks: ['gpt-4o', 'groq/llama-3.3-70b-versatile'],
});
```
## Layer 5: Monitoring
| Tool | What It Tracks | Pricing |
|---|---|---|
| Helicone | Requests, cost, latency | Free tier available |
| Langfuse | Traces, evaluations, prompts | Free tier, open-source |
| Braintrust | Evals, experiments, logging | Free tier available |
| LangSmith | Traces, testing, monitoring | Free tier (LangChain) |
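Whichever tool you adopt, it helps to understand the core number they all compute: per-request cost from token counts and a price table. A minimal sketch, using the per-1M-token prices from the Layer 1 table (which drift over time, so treat the values as placeholders):

```typescript
// Estimate the USD cost of one request from token counts.
// Prices are per 1M tokens, mirroring the Layer 1 table; they go stale.
const PRICES: Record<string, { input: number; output: number }> = {
  'claude-sonnet':  { input: 3.0,  output: 15.0 },
  'gpt-4o':         { input: 5.0,  output: 15.0 },
  'gemini-2.0-pro': { input: 1.25, output: 5.0 },
  'llama-3.3-70b':  { input: 0.59, output: 0.79 },
};

function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) throw new Error(`No pricing entry for ${model}`);
  return (inputTokens / 1_000_000) * p.input + (outputTokens / 1_000_000) * p.output;
}
```

Logging this per request (most providers return token usage in the response) is the cheapest possible cost monitor while you evaluate the hosted tools above.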
## Recommended Stacks by Use Case

### Chatbot / Customer Support

- **LLM:** Anthropic Claude Sonnet (safety, instruction-following)
- **Embeddings:** OpenAI text-embedding-3-small (cheap, good quality)
- **Vector DB:** Pinecone (managed, zero ops)
- **Monitor:** Helicone (cost tracking)
### RAG / Knowledge Base

- **LLM:** Anthropic Claude Sonnet (long context, citations)
- **Embeddings:** Cohere embed-v4 (best retrieval quality)
- **Vector DB:** Weaviate (hybrid search: vector + keyword)
- **Processing:** LlamaParse (PDF extraction)
- **Monitor:** Langfuse (trace retrieval quality)
### Code Generation

- **LLM:** Anthropic Claude Sonnet (best coding benchmarks)
- **Gateway:** LiteLLM (fallback to GPT-4o if needed)
- **Monitor:** Braintrust (eval code quality)
### Cost-Optimized

- **LLM:** Groq / Together AI (Llama 3.3 70B)
- **Embeddings:** Google text-embedding-004 (free tier)
- **Vector DB:** pgvector (free with existing Postgres)
- **Gateway:** LiteLLM (route to cheapest available)
## Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Using GPT-4 for everything | 10x overspending | Route simple tasks to cheaper/smaller models |
| No cost monitoring | Surprise bills | Add Helicone or Langfuse from day 1 |
| Embedding entire documents | Poor retrieval quality | Chunk documents (300-500 tokens per chunk) |
| No fallback provider | Outage = app down | AI gateway with automatic failover |
| Skipping guardrails | Harmful outputs, prompt injection | Add input/output validation |
| Not evaluating quality | Don't know if changes improve things | Set up automated evals |
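The first fix in the table — routing simple tasks to cheaper models — can start as crude as a length-and-keyword heuristic before you invest in a real gateway. Everything here (thresholds, hint words, model names) is an illustrative assumption, not a recommendation; production routers use classifiers or evals:

```typescript
// Naive task router: short, simple prompts go to a cheap model,
// anything that looks hard goes to a stronger one.
// Thresholds, hint words, and model IDs are all illustrative.
const CHEAP_MODEL = 'gemini-2.0-flash';
const STRONG_MODEL = 'claude-sonnet-4-20250514';

const HARD_HINTS = ['code', 'debug', 'analyze', 'prove', 'refactor'];

function routeModel(prompt: string): string {
  const lower = prompt.toLowerCase();
  const looksHard = HARD_HINTS.some(h => lower.includes(h));
  return looksHard || prompt.length > 500 ? STRONG_MODEL : CHEAP_MODEL;
}
```

Even a heuristic this blunt, combined with the cost estimator from the monitoring section, makes the "10x overspending" failure mode visible in your logs.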
Compare AI APIs across every layer of the stack on APIScout — LLMs, embeddings, vector databases, and monitoring tools side by side.