Building an AI-Powered App: Choosing Your API Stack

APIScout Team

Tags: ai, llm, api stack, vector database, embeddings

Building an AI-powered app isn't just picking an LLM. It's choosing providers for inference, embeddings, vector storage, guardrails, monitoring, and more. Here's the complete stack and how to choose the right API for each layer.

The AI API Stack

┌─────────────────────────────────────┐
│  Frontend / User Interface          │  Chat UI, streaming display
├─────────────────────────────────────┤
│  AI Gateway / Router                │  LiteLLM, Portkey, or custom
│  (model routing, fallback, caching) │
├─────────────────────────────────────┤
│  LLM Provider                       │  OpenAI, Anthropic, Google, OSS
│  (chat, completion, reasoning)      │
├─────────────────────────────────────┤
│  Embeddings                         │  OpenAI, Cohere, Voyage AI
│  (text → vectors for search/RAG)    │
├─────────────────────────────────────┤
│  Vector Database                    │  Pinecone, Weaviate, Qdrant
│  (similarity search, retrieval)     │
├─────────────────────────────────────┤
│  Document Processing                │  Unstructured, LlamaParse
│  (PDF, HTML → chunks)               │
├─────────────────────────────────────┤
│  Guardrails / Safety                │  Guardrails AI, NeMo
│  (content filtering, validation)    │
├─────────────────────────────────────┤
│  Monitoring / Observability         │  Helicone, Langfuse, Braintrust
│  (cost tracking, quality, latency)  │
└─────────────────────────────────────┘

Layer 1: LLM Provider

Provider      Best Model       Strength                    Pricing (per 1M tokens)
Anthropic     Claude Sonnet    Coding, analysis, safety    $3 in / $15 out
OpenAI        GPT-4o           Multimodal, ecosystem       $5 in / $15 out
Google        Gemini 2.0 Pro   Long context, multimodal    $1.25 in / $5 out
Groq          Llama 3.3 70B    Ultra-fast inference        $0.59 in / $0.79 out
Together AI   Open models      Variety, competitive price  $0.20-3.00

Choosing Your Primary LLM

Need best reasoning/coding? → Claude Sonnet or GPT-4o
Need cheapest good model? → Gemini 2.0 Flash or Llama 3.3
Need fastest inference? → Groq
Need multimodal (vision)? → GPT-4o or Claude Sonnet
Need open-source/self-host? → Llama 3.3 via Together/vLLM
Need 1M+ token context? → Gemini (2M context window)
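
The decision guide above can be encoded as a simple routing helper. The model identifiers here are illustrative defaults, not authoritative names; check each provider's docs for current model strings:

```typescript
// Maps a primary requirement to a default model choice, following the
// decision guide above. Model names are examples and will drift over time.
type Need = 'reasoning' | 'cheapest' | 'fastest' | 'multimodal' | 'self-host' | 'long-context';

function pickModel(need: Need): string {
  switch (need) {
    case 'reasoning':    return 'claude-sonnet';          // or 'gpt-4o'
    case 'cheapest':     return 'gemini-2.0-flash';       // or Llama 3.3
    case 'fastest':      return 'groq/llama-3.3-70b';
    case 'multimodal':   return 'gpt-4o';                 // or Claude Sonnet
    case 'self-host':    return 'llama-3.3-70b';          // via Together/vLLM
    case 'long-context': return 'gemini-2.0-pro';         // 2M context window
  }
}
```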

Basic Integration

import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic();

async function chat(userMessage: string, systemPrompt?: string) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system: systemPrompt || 'You are a helpful assistant.',
    messages: [{ role: 'user', content: userMessage }],
  });

  return response.content[0].type === 'text'
    ? response.content[0].text
    : '';
}
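
In production you also need to handle rate limits (HTTP 429) and transient errors. The official SDKs retry some failures automatically; for calls that need custom handling, a generic retry-with-backoff wrapper looks like this (a sketch, with the API call injected as a function):

```typescript
// Generic retry with exponential backoff. The call function is injected,
// so this works for any provider SDK.
async function withRetry<T>(
  call: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      // Exponential backoff: 500ms, 1s, 2s, ...
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}

// Usage: const reply = await withRetry(() => chat('Hello'));
```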

Layer 2: Embeddings

Provider    Model                    Dimensions   Pricing (per 1M tokens)
OpenAI      text-embedding-3-small   1536         $0.02
OpenAI      text-embedding-3-large   3072         $0.13
Cohere      embed-v4                 1024         $0.10
Voyage AI   voyage-3                 1024         $0.06
Google      text-embedding-004       768          Free (low volume)

Embedding Pipeline

import OpenAI from 'openai';

const openai = new OpenAI();

async function embedText(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: texts,
  });

  return response.data.map(item => item.embedding);
}

// Chunk documents before embedding
function chunkText(text: string, chunkSize: number = 500, overlap: number = 50): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    start += chunkSize - overlap;
  }

  return chunks;
}
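
Similarity between the vectors embedText returns is usually measured with cosine similarity, which is the metric most vector databases default to (alongside dot product and Euclidean distance). A minimal implementation:

```typescript
// Cosine similarity between two embedding vectors: 1 means same direction,
// 0 means orthogonal (unrelated), -1 means opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have equal dimensions');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```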

Layer 3: Vector Database

Database   Type                    Free Tier              Best For
Pinecone   Managed                 1 index, 100K vectors  Simplest setup
Weaviate   Managed / self-hosted   14-day trial           Hybrid search
Qdrant     Managed / self-hosted   1 GB free              Performance, self-hosting
Chroma     Self-hosted             Free (OSS)             Local development
pgvector   Postgres extension      Free (with Postgres)   Already using Postgres
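
Before committing to a managed service, it helps to see how small the core idea is. A brute-force in-memory index is a few dozen lines and is fine for local development and tests; real vector databases add persistence, metadata filtering, and approximate-nearest-neighbor indexes that stay fast at millions of vectors. A sketch:

```typescript
// Tiny in-memory vector index: brute-force top-K by cosine similarity.
// For development only -- O(n) per query does not scale.
interface VectorRecord {
  id: string;
  values: number[];
  metadata: Record<string, string>;
}

class InMemoryIndex {
  private records: VectorRecord[] = [];

  upsert(items: VectorRecord[]): void {
    for (const item of items) {
      const existing = this.records.findIndex(r => r.id === item.id);
      if (existing >= 0) this.records[existing] = item;
      else this.records.push(item);
    }
  }

  query(vector: number[], topK: number): VectorRecord[] {
    const score = (v: number[]): number => {
      let dot = 0, normA = 0, normB = 0;
      for (let i = 0; i < v.length; i++) {
        dot += v[i] * vector[i];
        normA += v[i] * v[i];
        normB += vector[i] * vector[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    };
    return [...this.records]
      .sort((a, b) => score(b.values) - score(a.values))
      .slice(0, topK);
  }
}
```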

RAG Pipeline

// Complete RAG (Retrieval-Augmented Generation) pipeline.
// Assumes `vectorDB` is your vector database client (e.g. a Pinecone index)
// and reuses embedText/chunkText from the embedding pipeline above.

// 1. Index documents
async function indexDocuments(documents: { id: string; text: string }[]) {
  for (const doc of documents) {
    const chunks = chunkText(doc.text);
    const embeddings = await embedText(chunks);

    // Store in vector database
    await vectorDB.upsert(
      chunks.map((chunk, i) => ({
        id: `${doc.id}_${i}`,
        values: embeddings[i],
        metadata: { text: chunk, documentId: doc.id },
      }))
    );
  }
}

// 2. Query with RAG
async function ragQuery(question: string): Promise<string> {
  // Embed the question
  const [questionEmbedding] = await embedText([question]);

  // Find relevant chunks
  const results = await vectorDB.query({
    vector: questionEmbedding,
    topK: 5,
  });

  // Build context from retrieved chunks
  const context = results.matches
    .map(match => match.metadata.text)
    .join('\n\n');

  // Generate answer with context
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    system: `Answer based on this context:\n\n${context}`,
    messages: [{ role: 'user', content: question }],
  });

  return response.content[0].type === 'text' ? response.content[0].text : '';
}

Layer 4: AI Gateway

Route requests across multiple providers with fallback, caching, and cost tracking:

// Using LiteLLM as AI gateway
import { completion } from 'litellm';

// Same interface, any provider
const response = await completion({
  model: 'anthropic/claude-sonnet-4-20250514', // or 'gpt-4o', 'groq/llama-3.3-70b'
  messages: [{ role: 'user', content: 'Hello' }],
  // Automatic fallback
  fallbacks: ['gpt-4o', 'groq/llama-3.3-70b-versatile'],
});
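
Under the hood, a gateway's "automatic fallback" is roughly this loop: try providers in order, return the first success, surface the last error only if everything fails. A provider-agnostic sketch, with the actual API calls injected as functions:

```typescript
// Fallback across providers: first success wins. Provider call functions
// are injected so the sketch is independent of any particular SDK.
type ProviderCall = (prompt: string) => Promise<string>;

async function withFallback(providers: ProviderCall[], prompt: string): Promise<string> {
  let lastError: unknown;
  for (const call of providers) {
    try {
      return await call(prompt);
    } catch (err) {
      lastError = err; // provider down or rate limited -- try the next one
    }
  }
  throw new Error(`All providers failed: ${String(lastError)}`);
}
```

A real gateway layers retries, caching, and cost-aware routing on top of this, which is why using LiteLLM or Portkey usually beats rolling your own.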

Layer 5: Monitoring

Tool         What It Tracks                 Pricing
Helicone     Requests, cost, latency        Free tier available
Langfuse     Traces, evaluations, prompts   Free tier, open-source
Braintrust   Evals, experiments, logging    Free tier available
LangSmith    Traces, testing, monitoring    Free tier (LangChain)
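
At their core, these tools record latency, token usage, and cost per request. A stripped-down sketch of that instrumentation, wrapping any async LLM call (the `usage` shape here is an assumption for the example, not any particular SDK's response format):

```typescript
// Minimal call-level instrumentation: latency and token counts per request.
// The { usage: { input, output } } shape is assumed for this sketch;
// map it to your SDK's actual response fields.
interface CallLog {
  model: string;
  latencyMs: number;
  inputTokens: number;
  outputTokens: number;
}

const logs: CallLog[] = [];

async function tracked<T extends { usage: { input: number; output: number } }>(
  model: string,
  call: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  const result = await call();
  logs.push({
    model,
    latencyMs: Date.now() - start,
    inputTokens: result.usage.input,
    outputTokens: result.usage.output,
  });
  return result;
}
```

Hosted tools add dashboards, alerting, and trace trees on top, but even this much is enough to catch a runaway prompt before the invoice does.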

Recommended Stacks by Use Case

Chatbot / Customer Support

LLM:        Anthropic Claude Sonnet (safety, instruction-following)
Embeddings: OpenAI text-embedding-3-small (cheap, good quality)
Vector DB:  Pinecone (managed, zero ops)
Monitor:    Helicone (cost tracking)

RAG / Knowledge Base

LLM:        Anthropic Claude Sonnet (long context, citations)
Embeddings: Cohere embed-v4 (best retrieval quality)
Vector DB:  Weaviate (hybrid search: vector + keyword)
Processing: LlamaParse (PDF extraction)
Monitor:    Langfuse (trace retrieval quality)

Code Generation

LLM:        Anthropic Claude Sonnet (best coding benchmarks)
Gateway:    LiteLLM (fallback to GPT-4o if needed)
Monitor:    Braintrust (eval code quality)

Cost-Optimized

LLM:        Groq / Together AI (Llama 3.3 70B)
Embeddings: Google text-embedding-004 (free tier)
Vector DB:  pgvector (free with existing Postgres)
Gateway:    LiteLLM (route to cheapest available)

Common Mistakes

Mistake                      Impact                              Fix
Using GPT-4 for everything   10x overspending                    Route simple tasks to cheaper/smaller models
No cost monitoring           Surprise bills                      Add Helicone or Langfuse from day 1
Embedding entire documents   Poor retrieval quality              Chunk documents (300-500 tokens per chunk)
No fallback provider         Outage = app down                   AI gateway with automatic failover
Skipping guardrails          Harmful outputs, prompt injection   Add input/output validation
Not evaluating quality       Don't know if changes improve it    Set up automated evals
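
On the guardrails point: dedicated tools like Guardrails AI and NeMo Guardrails go far beyond this, but even a naive input gate catches oversized inputs and the most obvious injection phrases. A deliberately simple sketch (the pattern list is illustrative, nowhere near exhaustive, and no substitute for a real guardrails layer):

```typescript
// Naive input validation: length cap plus a blocklist of obvious
// injection phrases. Illustrative only -- real guardrails use dedicated
// tooling and model-based classification.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?previous instructions/i,
  /reveal (your )?system prompt/i,
];

function validateInput(input: string, maxChars = 8000): { ok: boolean; reason?: string } {
  if (input.length > maxChars) {
    return { ok: false, reason: 'Input too long' };
  }
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(input)) {
      return { ok: false, reason: 'Possible prompt injection' };
    }
  }
  return { ok: true };
}
```

Pair it with output validation (schema checks, content filtering) before anything reaches the user.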

Compare AI APIs across every layer of the stack on APIScout — LLMs, embeddings, vector databases, and monitoring tools side by side.
