Building AI-Powered Features in Web Apps: LLMs, Embeddings, and Fine-Tuning
Complete guide to integrating AI into web applications — LLMs, RAG, embeddings, fine-tuning, and production considerations.
AI is no longer experimental. It's production-ready, accessible, and economical. Companies are shipping AI features to millions of users. The question isn't whether to use AI, but how to use it effectively.
The challenge: AI systems are probabilistic, not deterministic. They can be expensive and slow, they hallucinate, and they can leak private data. Building production-grade AI features requires understanding these constraints and designing systems around them.
I've shipped multiple AI features across different applications. Let me share the patterns that work.
## Architecture Patterns

### Pattern 1: Basic Chat with Context Injection
```typescript
// app/api/chat/route.ts
import { Anthropic } from '@anthropic-ai/sdk';

const client = new Anthropic();

export async function POST(request: Request) {
  const { messages, context } = await request.json();

  // System prompt with context (RAG-like)
  const systemPrompt = `You are a helpful assistant specialized in React development.

User Profile Context:
- Experience Level: ${context.userLevel}
- Focus Areas: ${context.focusAreas.join(', ')}

Answer questions concisely with code examples when relevant.`;

  try {
    const response = await client.messages.create({
      model: 'claude-3-5-sonnet-20241022',
      max_tokens: 1024,
      system: systemPrompt,
      messages: messages.map((msg: { role: string; content: string }) => ({
        role: msg.role as 'user' | 'assistant',
        content: msg.content,
      })),
    });

    return Response.json({
      content: response.content[0].type === 'text' ? response.content[0].text : '',
      stopReason: response.stop_reason,
    });
  } catch (error) {
    console.error('Chat error:', error);
    return Response.json({ error: 'Failed to generate response' }, { status: 500 });
  }
}
```
### Pattern 2: Retrieval-Augmented Generation (RAG)
RAG combines LLMs with custom knowledge bases for accurate, relevant answers:
```typescript
// lib/embeddings.ts
import { Pinecone } from '@pinecone-database/pinecone';
import { OpenAIEmbeddings } from '@langchain/openai';

const pc = new Pinecone({
  apiKey: process.env.PINECONE_API_KEY!,
});

const embeddings = new OpenAIEmbeddings({
  apiKey: process.env.OPENAI_API_KEY,
});

export async function searchDocuments(query: string) {
  // 1. Embed the query
  const queryEmbedding = await embeddings.embedQuery(query);

  // 2. Search the knowledge base
  const index = pc.index('documentation');
  const results = await index.query({
    vector: queryEmbedding,
    topK: 5,
    includeMetadata: true,
  });

  // 3. Return matching documents
  return results.matches.map((match) => ({
    id: match.id,
    score: match.score ?? 0,
    content: (match.metadata?.content as string) ?? '',
    source: (match.metadata?.source as string) ?? '',
  }));
}
```
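`searchDocuments` assumes the index was already populated. A common preprocessing step is splitting long documents into overlapping chunks before embedding and upserting them; here's a minimal sketch (`chunkText` is a hypothetical helper, and the chunk size and overlap values are arbitrary assumptions you should tune for your embedding model):

```typescript
// Split a long document into overlapping chunks so each one fits the
// embedding model's context window and retrieval stays granular.
export function chunkText(text: string, chunkSize = 500, overlap = 100): string[] {
  if (chunkSize <= overlap) {
    throw new Error('chunkSize must be larger than overlap');
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    // Step forward, keeping `overlap` characters of shared context between chunks
    start += chunkSize - overlap;
  }
  return chunks;
}
```

Each chunk would then be embedded and upserted into the Pinecone index along with its source metadata, so `searchDocuments` can return it later.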
```typescript
// app/api/rag-chat/route.ts
import { searchDocuments } from '@/lib/embeddings';
import { Anthropic } from '@anthropic-ai/sdk';

export async function POST(request: Request) {
  const { query } = await request.json();

  // 1. Search documents
  const relevantDocs = await searchDocuments(query);

  // 2. Format context
  const context = relevantDocs
    .map(
      (doc) => `Source: ${doc.source} (relevance: ${(doc.score * 100).toFixed(0)}%)
${doc.content}`
    )
    .join('\n\n---\n\n');

  // 3. Ask the LLM with context
  const client = new Anthropic();
  const response = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    system: `You are a helpful assistant. Answer based on the provided context.
If the context doesn't contain relevant information, say so.

Context:
${context}`,
    messages: [{ role: 'user', content: query }],
  });

  return Response.json({
    answer: response.content[0].type === 'text' ? response.content[0].text : '',
    sources: relevantDocs.map((doc) => doc.source),
  });
}
```
### Pattern 3: Vision/Image Analysis
```typescript
// app/api/analyze-image/route.ts
import { Anthropic } from '@anthropic-ai/sdk';

export async function POST(request: Request) {
  const formData = await request.formData();
  const file = formData.get('image') as File;

  if (!file) {
    return Response.json({ error: 'No image provided' }, { status: 400 });
  }

  // Convert the image to base64
  const buffer = await file.arrayBuffer();
  const base64 = Buffer.from(buffer).toString('base64');

  const client = new Anthropic();
  const response = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [
      {
        role: 'user',
        content: [
          {
            type: 'image',
            source: {
              type: 'base64',
              // Use the uploaded file's actual MIME type, defaulting to JPEG
              media_type: (file.type || 'image/jpeg') as
                | 'image/jpeg'
                | 'image/png'
                | 'image/gif'
                | 'image/webp',
              data: base64,
            },
          },
          {
            type: 'text',
            text: 'Analyze this image and describe what you see. Focus on: 1) Main objects, 2) Colors and composition, 3) Any text visible, 4) Overall mood/context',
          },
        ],
      },
    ],
  });

  return Response.json({
    analysis: response.content[0].type === 'text' ? response.content[0].text : '',
  });
}
```
## Advanced: Fine-Tuning Custom Models
For specialized tasks, fine-tune a smaller model on your data:
```typescript
// scripts/prepare-training-data.ts
import { OpenAI } from 'openai';

interface TrainingExample {
  input: string;
  output: string;
}

const trainingData: TrainingExample[] = [
  { input: 'how to use useState?', output: 'useState is a React hook that lets you add state to functional components...' },
  { input: 'what is context API?', output: 'Context API allows you to manage global state without prop drilling...' },
  // ... 100+ examples
];

// Format for OpenAI fine-tuning: one JSON object per line (JSONL)
const jsonlContent = trainingData
  .map((example) =>
    JSON.stringify({
      messages: [
        { role: 'user', content: example.input },
        { role: 'assistant', content: example.output },
      ],
    })
  )
  .join('\n');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Save the JSONL file locally for inspection
await Bun.write('training-data.jsonl', jsonlContent);

// Upload the training data (note: pass the file *content*, not the filename)
const uploadedFile = await openai.files.create({
  file: new File([jsonlContent], 'training.jsonl'),
  purpose: 'fine-tune',
});

// Start the fine-tuning job
const fineTuneJob = await openai.fineTuning.jobs.create({
  training_file: uploadedFile.id,
  model: 'gpt-3.5-turbo',
  suffix: 'react-expert', // The resulting model ID will include this suffix, e.g. ft:gpt-3.5-turbo:org:react-expert:...
});

console.log('Fine-tuning started:', fineTuneJob.id);
```
## Cost Optimization

AI can get expensive fast. A few ways to optimize:
```typescript
// 1. Cache responses for identical queries
// (generateResponse is your app's LLM call; this Map has no eviction, so
// use a bounded or TTL cache in production)
const responseCache = new Map<string, string>();

export async function getChatResponse(query: string) {
  if (responseCache.has(query)) {
    return responseCache.get(query);
  }
  const response = await generateResponse(query);
  responseCache.set(query, response);
  return response;
}

// 2. Use cheaper models when possible (complexQuery is an app-specific flag)
const model = complexQuery
  ? 'claude-3-5-sonnet-20241022' // More expensive, more capable
  : 'claude-3-haiku-20240307'; // Cheaper, faster

// 3. Batch requests (chunk and processBatch are app-specific helpers)
async function batchProcess(items: string[]) {
  // Process in groups to reduce API calls
  const batches = chunk(items, 10);
  const results = await Promise.all(batches.map(processBatch));
  return results.flat();
}

// 4. Implement rate limiting
import { Ratelimit } from '@upstash/ratelimit';
import { Redis } from '@upstash/redis';

const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(10, '1 h'), // 10 requests/hour per IP
});

export async function POST(request: Request) {
  const ip = request.headers.get('x-forwarded-for') || 'unknown';
  const { success } = await ratelimit.limit(ip);

  if (!success) {
    return Response.json({ error: 'Rate limited' }, { status: 429 });
  }
  // Process request
}
```
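To know whether these optimizations are actually paying off, it helps to estimate spend per request from the token counts the API returns. A back-of-the-envelope helper; the per-million-token prices below are illustrative assumptions, not current list prices, so check your provider's pricing page:

```typescript
// Illustrative USD prices per million tokens (assumptions, verify before use)
const PRICES: Record<string, { input: number; output: number }> = {
  'claude-3-5-sonnet-20241022': { input: 3.0, output: 15.0 },
  'claude-3-haiku-20240307': { input: 0.25, output: 1.25 },
};

// Estimate the cost of a single request from its token usage
export function estimateCostUSD(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICES[model];
  if (!price) throw new Error(`Unknown model: ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}
```

Logging this per request alongside the model choice makes the sonnet-versus-haiku tradeoff concrete in your dashboards.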
## Handling Hallucinations and Failures

LLMs can confidently say wrong things. Mitigations:
```typescript
export async function generateWithValidation(prompt: string) {
  const response = await generateResponse(prompt); // your app's LLM call

  // 1. Check for confidence signals
  if (response.includes("I don't know") || response.includes('uncertain')) {
    return { answer: response, confidence: 'low' };
  }

  // 2. Validate against known facts (validateFacts is an app-specific checker)
  const factCheck = await validateFacts(response);
  if (!factCheck.isValid) {
    return {
      answer: response,
      warning: `Potential inaccuracy detected: ${factCheck.issues.join(', ')}`,
    };
  }

  // 3. Ask for sources
  const followUp = await client.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    messages: [
      { role: 'user', content: prompt },
      { role: 'assistant', content: response },
      { role: 'user', content: 'Where did you get this information? Provide sources.' },
    ],
  });

  return {
    answer: response,
    sources: followUp.content[0].type === 'text' ? followUp.content[0].text : null,
  };
}
```
## Privacy and Security
```typescript
// 1. Never send PII to external APIs
function sanitizeInput(input: string): string {
  // Redact email addresses and US-style phone numbers (extend for your data)
  return input
    .replace(/[\w.-]+@[\w.-]+\.\w+/g, '[EMAIL]')
    .replace(/\(\d{3}\)\s?\d{3}-\d{4}/g, '[PHONE]');
}

// 2. Use local models for sensitive data
import { Ollama } from 'ollama';

const ollama = new Ollama({ host: 'http://localhost:11434' });

async function processConfidentialData(data: string) {
  // Runs locally, so no data leaves your server
  const response = await ollama.generate({
    model: 'mistral', // Open-source model
    prompt: data,
    stream: false,
  });
  return response.response;
}

// 3. Implement audit logging
async function logAIRequest(userId: string, input: string, output: string, model: string) {
  await db.logs.create({
    userId,
    inputHash: hashValue(input), // Don't store raw input on disk
    outputHash: hashValue(output),
    model,
    timestamp: new Date(),
  });
}
```
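The `hashValue` helper above is left undefined; one way to implement it with Node's built-in `crypto` module is a SHA-256 hex digest:

```typescript
import { createHash } from 'node:crypto';

// One-way hash so audit logs can correlate requests without storing raw content
export function hashValue(value: string): string {
  return createHash('sha256').update(value).digest('hex');
}
```

Because the hash is deterministic, identical inputs can still be matched across log entries, which is exactly what you want for deduplication and abuse investigation without retaining the plaintext.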
## Production Monitoring
```typescript
// app/api/insights/route.ts
import { recordMetric } from '@/lib/monitoring';

export async function POST(request: Request) {
  const startTime = Date.now();

  try {
    const response = await generateResponse(); // your app's LLM call
    const duration = Date.now() - startTime;

    await recordMetric('ai_request_success', {
      duration,
      model: 'claude-3',
      tokens: response.usage.output_tokens,
    });

    return Response.json({ answer: response });
  } catch (error) {
    const duration = Date.now() - startTime;
    const message = error instanceof Error ? error.message : 'Unknown error';

    await recordMetric('ai_request_error', {
      duration,
      error: message,
      model: 'claude-3',
    });

    return Response.json({ error: message }, { status: 500 });
  }
}
```
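`recordMetric` is imported from a hypothetical `@/lib/monitoring` module. If you don't have a metrics backend yet, an in-memory aggregator is enough to get started; this is a sketch, not production code (a real setup would ship to StatsD, Prometheus, or similar):

```typescript
// lib/monitoring.ts (sketch): aggregate counts and total duration per metric name
type MetricData = { duration?: number; [key: string]: unknown };

const metrics = new Map<string, { count: number; totalDuration: number }>();

export async function recordMetric(name: string, data: MetricData): Promise<void> {
  const entry = metrics.get(name) ?? { count: 0, totalDuration: 0 };
  entry.count += 1;
  entry.totalDuration += data.duration ?? 0;
  metrics.set(name, entry);
}

// Read back an aggregate, e.g. for a /metrics endpoint
export function getMetric(name: string) {
  return metrics.get(name);
}
```

Even this minimal version lets you compute average latency (`totalDuration / count`) and error rates per model before investing in real observability tooling.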
## Common Pitfalls
| Problem | Solution |
|---|---|
| Unreliable outputs | Use few-shot prompting + validation |
| Long latency | Cache responses, use streaming |
| High costs | Use cheaper models, implement caching, batch requests |
| Hallucinations | Validate against known facts, ask for sources |
| Privacy breaches | Use local models or sanitize input |
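The table recommends few-shot prompting, which hasn't been shown yet: you prepend worked input/output pairs to the conversation so the model imitates their format and style. A sketch, with illustrative types and a hypothetical helper name:

```typescript
type ChatMessage = { role: 'user' | 'assistant'; content: string };

// Prepend worked examples so the model imitates their style and format
export function buildFewShotMessages(
  examples: Array<{ input: string; output: string }>,
  userQuery: string
): ChatMessage[] {
  const shots: ChatMessage[] = examples.flatMap((ex) => [
    { role: 'user' as const, content: ex.input },
    { role: 'assistant' as const, content: ex.output },
  ]);
  return [...shots, { role: 'user', content: userQuery }];
}
```

The resulting array drops straight into the `messages` parameter of the chat route from Pattern 1; two or three well-chosen examples often stabilize output format more cheaply than fine-tuning.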
## Conclusion
AI features are now a core part of modern web applications. The key is understanding the tradeoffs: cost vs quality, speed vs accuracy, complexity vs maintainability.
Start simple—add a chatbot or content summarizer. Monitor costs and latency. Then gradually add more sophisticated AI features as you learn your users' needs.
The companies that win won't be those with the best AI, but those that integrate AI most thoughtfully into their products.