Testing AI Applications: Beyond Vibes
"It works on my prompt" is not a testing strategy. Here's how to actually test AI apps.
The Challenge
AI systems are:
- Non-deterministic: Same input, different outputs, so single-run assertions are fragile (see the sketch after this list)
- Subjective: "Good" is hard to define
- Expensive: Every test costs API credits
- Slow: Latency makes test suites painful
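Non-determinism in particular changes how you write assertions: instead of asserting on one response, run each case a few times and assert on the pass rate. A minimal sketch, assuming a Jest/Vitest-style runner; the `ai.summarize` client, `testArticle` fixture, and 80% threshold are illustrative stand-ins for your own code:

// Run one test case several times and require a minimum pass rate,
// instead of asserting on a single non-deterministic output.
async function passRate(
  run: () => Promise<string>,
  passes: (output: string) => boolean,
  runs = 5,
): Promise<number> {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    if (passes(await run())) passed++;
  }
  return passed / runs;
}

describe("summarizer (sampled)", () => {
  it("should mention the main entity in at least 80% of runs", async () => {
    const rate = await passRate(
      () => ai.summarize(testArticle),          // assumed client and fixture
      (output) => output.includes("Acme Corp"), // assumed check for this fixture
    );
    expect(rate).toBeGreaterThanOrEqual(0.8);
  });
});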
The Testing Pyramid for AI
        /\
       /  \       Human Evaluation
      /----\
     /      \     LLM-as-Judge
    /--------\
   /          \   Assertion Tests
  /------------\
 /              \  Unit Tests
/________________\
Level 1: Unit Tests
Test your deterministic code:
describe("PromptBuilder", () => {
it("should include system message", () => {
const prompt = buildPrompt({ task: "summarize", context: "..." });
expect(prompt).toContain("You are a summarization assistant");
});
it("should truncate long contexts", () => {
const longContext = "x".repeat(100000);
const prompt = buildPrompt({ task: "summarize", context: longContext });
expect(prompt.length).toBeLessThan(50000);
});
});
Level 2: Assertion Tests
Test structural properties of outputs:
describe("AI Responses", () => {
it("should return valid JSON", async () => {
const response = await ai.analyze(testInput);
expect(() => JSON.parse(response)).not.toThrow();
});
it("should include required fields", async () => {
const response = await ai.extractEntities(testInput);
expect(response).toHaveProperty("entities");
expect(response.entities).toBeInstanceOf(Array);
});
it("should not hallucinate URLs", async () => {
const response = await ai.generateArticle(topic);
const urls = extractUrls(response);
for (const url of urls) {
const exists = await checkUrlExists(url);
expect(exists).toBe(true);
}
});
});
Level 3: LLM-as-Judge
Use a model to evaluate another model:
async function evaluateResponse(input: string, output: string) {
  const evaluation = await judge.evaluate({
    input,
    output,
    criteria: [
      "Accuracy: Is the information factually correct?",
      "Relevance: Does it answer the question?",
      "Completeness: Are all aspects covered?",
      "Clarity: Is it easy to understand?",
    ],
  });
  return evaluation.scores; // { accuracy: 0.9, relevance: 0.95, ... }
}

describe("Response Quality", () => {
  it("should score above threshold", async () => {
    const output = await ai.answer(testQuestion);
    const scores = await evaluateResponse(testQuestion, output);
    expect(scores.accuracy).toBeGreaterThan(0.8);
    expect(scores.relevance).toBeGreaterThan(0.9);
  });
});
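The `judge` object above is deliberately abstract. One way to back it is to ask a second model to return JSON scores; this is a sketch assuming the official `openai` Node SDK and a `gpt-4o-mini` judge model, with prompt wording and score parsing that are illustrative rather than prescriptive:

import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Sketch of a judge: ask a second model to score the output against each
// criterion and reply with a JSON object mapping criterion name to score.
async function judgeEvaluate(args: {
  input: string;
  output: string;
  criteria: string[];
}): Promise<Record<string, number>> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini", // assumed judge model; pick one you trust for grading
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You are a strict evaluator. Score the assistant output against each " +
          "criterion from 0 to 1 and reply with a JSON object mapping criterion name to score.",
      },
      {
        role: "user",
        content: `Input:\n${args.input}\n\nOutput:\n${args.output}\n\nCriteria:\n${args.criteria.join("\n")}`,
      },
    ],
  });

  return JSON.parse(completion.choices[0].message.content ?? "{}");
}

In practice you would normalize the returned keys (the criteria above are capitalized, while the tests read `scores.accuracy`) and pin the judge model version so thresholds stay comparable across runs.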
Level 4: Human Evaluation
For critical paths, there's no substitute:
// Flag responses for human review
if (confidence < 0.7 || containsSensitiveTopics(response)) {
  await queueForHumanReview(response);
}
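`queueForHumanReview` is whatever gets items in front of reviewers: a database table, a ticketing system, an internal tool. A minimal in-memory sketch of the shape such a queue might take; the `ReviewItem` fields and the 1% sampling rate are illustrative assumptions:

// Illustrative review-queue shape; in production, persist this somewhere
// your reviewers actually look.
interface ReviewItem {
  input: string;
  output: string;
  modelVersion: string;
  confidence: number;
  reason: "low_confidence" | "sensitive_topic" | "random_sample";
  createdAt: Date;
}

const reviewQueue: ReviewItem[] = [];

async function queueForHumanReview(item: ReviewItem): Promise<void> {
  reviewQueue.push(item);
}

// Also sample a small fraction of "confident" outputs, so reviewers can
// catch failure modes your automated checks miss.
function shouldSampleForReview(sampleRate = 0.01): boolean {
  return Math.random() < sampleRate;
}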
CI/CD Integration
Run the cheap tests on every push and sample the expensive AI evaluations:
name: AI Tests
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - name: Unit & Assertion Tests
        run: npm test
      - name: AI Evaluation (sampling)
        run: npm run test:ai -- --sample=50
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      - name: Cost Check
        # Assumes the evaluation step exported TEST_COST (e.g. via $GITHUB_ENV).
        run: |
          if [ "$TEST_COST" -gt "10" ]; then
            echo "Tests exceeded \$10 budget"
            exit 1
          fi
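The `--sample=50` flag and the `TEST_COST` variable are assumptions about your own eval runner. A sketch of one way such a runner could sample cases, tally cost, and export the total for the budget check; the case and result shapes are illustrative, while the `$GITHUB_ENV` write is the standard GitHub Actions way to pass a variable to later steps:

import { appendFileSync } from "node:fs";

// Illustrative shapes; adapt to your own eval cases and grading logic.
interface EvalCase { id: string; input: string }
interface EvalResult { id: string; passed: boolean; costUsd: number; latencyMs: number }

function sampleCases<T>(cases: T[], n: number): T[] {
  // Fisher-Yates shuffle of a copy, then take the first n.
  const copy = [...cases];
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy.slice(0, n);
}

async function runSampledEvals(
  allCases: EvalCase[],
  sampleSize: number,
  runOne: (c: EvalCase) => Promise<{ passed: boolean; costUsd: number }>,
): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const c of sampleCases(allCases, sampleSize)) {
    const start = Date.now();
    const { passed, costUsd } = await runOne(c);
    results.push({ id: c.id, passed, costUsd, latencyMs: Date.now() - start });
  }

  // Export total cost (rounded up to whole dollars) so the Cost Check
  // step's integer comparison works.
  const totalCost = results.reduce((sum, r) => sum + r.costUsd, 0);
  if (process.env.GITHUB_ENV) {
    appendFileSync(process.env.GITHUB_ENV, `TEST_COST=${Math.ceil(totalCost)}\n`);
  }
  return results;
}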
Key Metrics to Track
- Pass rate over time
- Latency: p50, p95, p99
- Cost per test run
- Regression detection: compare each run's scores against a stored baseline (see the sketch after this list)
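A minimal sketch of computing these from per-test records; the `TestRecord` shape is an assumption about what your runner emits:

// Illustrative per-test record; adapt to whatever your runner produces.
interface TestRecord {
  passed: boolean;
  latencyMs: number;
  costUsd: number;
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedValues: number[], p: number): number {
  const index = Math.min(
    sortedValues.length - 1,
    Math.ceil((p / 100) * sortedValues.length) - 1,
  );
  return sortedValues[Math.max(0, index)];
}

function summarize(records: TestRecord[]) {
  const latencies = records.map((r) => r.latencyMs).sort((a, b) => a - b);
  return {
    passRate: records.filter((r) => r.passed).length / records.length,
    latencyP50: percentile(latencies, 50),
    latencyP95: percentile(latencies, 95),
    latencyP99: percentile(latencies, 99),
    costPerTest: records.reduce((sum, r) => sum + r.costUsd, 0) / records.length,
  };
}

Storing each run's summary alongside the commit SHA turns regression detection into a diff against the previous baseline.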
Testing AI is hard. But shipping untested AI is harder.