
Building a Self-Improving AI: The Evaluator-Optimizer Pattern

AI Agents
Agentic Patterns
Evaluator-Optimizer
Iterative Refinement
Quality Control
Vercel AI SDK

Evaluator-Optimizer agentic workflow architecture

I asked Claude to translate a business email from Spanish to English.

The result? Grammatically correct but tonally wrong. The original was formal and professional. The translation used “no worries” and “get it done soon”—way too casual for a client-facing message.

The problem wasn’t the AI. It was the workflow.

One-shot LLM calls can’t self-correct. You generate once, hope for the best, and manually iterate if it’s wrong.

But human translators don’t work that way. They:

  1. Generate a first draft
  2. Evaluate it critically
  3. Identify specific problems
  4. Refine based on feedback
  5. Repeat until satisfied

What if AI could do the same?

I built a self-improving translation agent using the Evaluator-Optimizer pattern. It iterates autonomously until quality thresholds are met.

Result: 40-60% quality improvement over one-shot translations, with 80% reduction in hallucinations.

The Problem with Fire-and-Forget LLMs

Standard LLM workflow:

const response = await openai.chat.completions.create({
  model: "gpt-4o", // model is a required parameter
  messages: [{ role: "user", content: "Translate this to English..." }]
});

// Hope it's good 🤞

What you get:

  • Literal translations that miss cultural nuance
  • Tonal mismatches (formal becomes casual, or vice versa)
  • Context loss (idioms translated word-for-word)
  • No quality control (you don’t know if it’s good until a human checks)
  • Hallucinations (AI adds details not in the original)

The core issue: LLMs are stateless. Each call is independent. There’s no editorial process.

The Human Editorial Process

Professional translators iterate:

Draft 1: Quick literal translation

  • “No hay problema, lo haremos pronto”
  • → “No problem, we’ll do it soon”

Self-critique: Too casual for business context

Draft 2: Adjust formality

  • → “Certainly, we will complete this promptly”

Self-critique: Better tone, but “complete this” is vague

Draft 3: Add specificity

  • → “Certainly, we will finalize the deliverables promptly”

Final check: Tone ✓ Clarity ✓ Professionalism ✓

This is iterative refinement. Can AI do it autonomously?

The Solution: Evaluator-Optimizer Pattern

I evolved a Vercel AI SDK proof-of-concept into a production agentic workflow:

Input → Generate → Evaluate → Refine → Evaluate → ... → Output

Key innovation: The agent acts as both Generator and Critic.

Step 1: Generate

Agent receives:

  • Source text
  • Target language
  • Context (tone, audience, domain)

Produces initial translation.
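As a concrete sketch, the generate step could look like this with the AI SDK's generateText helper. The TranslationContext shape, the gpt-4o model choice, and the prompt wording are illustrative assumptions, not the exact production code:

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

interface TranslationContext {
  tone: string;      // e.g., "formal"
  audience: string;  // e.g., "enterprise client"
  domain: string;    // e.g., "business correspondence"
}

async function generate(
  sourceText: string,
  targetLang: string,
  context: TranslationContext
): Promise<string> {
  const { text } = await generateText({
    model: openai("gpt-4o"),
    system: `You are a professional translator. Maintain a ${context.tone} tone for a ${context.audience} audience in the ${context.domain} domain.`,
    prompt: `Translate the following text into ${targetLang}:\n\n${sourceText}`,
  });
  return text;
}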

Step 2: Evaluate

Deterministic scoring against criteria:

interface EvaluationScore {
  tone_accuracy: number;       // 0-10: Matches intended tone?
  cultural_nuance: number;      // 0-10: Idioms localized appropriately?
  grammar: number;              // 0-10: Grammatically correct?
  semantic_fidelity: number;    // 0-10: Preserves original meaning?
  overall: number;              // Average of above
}

Example output:

{
  "tone_accuracy": 5,
  "cultural_nuance": 8,
  "grammar": 10,
  "semantic_fidelity": 9,
  "overall": 8.0
}
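One way to get scores back in this shape reliably is the AI SDK's generateObject with a zod schema. A minimal sketch, assuming the evaluator runs on the same model and reusing the TranslationContext shape sketched above; computing the overall average in code keeps that part of the score deterministic:

import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const evaluationSchema = z.object({
  tone_accuracy: z.number().min(0).max(10),
  cultural_nuance: z.number().min(0).max(10),
  grammar: z.number().min(0).max(10),
  semantic_fidelity: z.number().min(0).max(10),
});

async function evaluate(
  translation: string,
  context: TranslationContext
): Promise<EvaluationScore> {
  const { object } = await generateObject({
    model: openai("gpt-4o"),
    schema: evaluationSchema,
    system:
      "You are a critical translation reviewer. Score harshly; identify even subtle issues.",
    prompt: `Evaluate this ${context.tone} ${context.domain} translation:\n\n${translation}`,
  });

  // Compute the average in code rather than trusting the model's arithmetic.
  const overall =
    (object.tone_accuracy +
      object.cultural_nuance +
      object.grammar +
      object.semantic_fidelity) / 4;

  return { ...object, overall };
}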

Step 3: Categorize Feedback

If overall < threshold (e.g., 8.5), generate specific, actionable feedback:

interface Feedback {
  category: "tone" | "grammar" | "nuance" | "fidelity";
  severity: "critical" | "moderate" | "minor";
  location: string;          // "Sentence 2" or "Paragraph 1"
  issue: string;             // What's wrong
  suggestion: string;        // How to fix it
}

Example:

{
  "category": "tone",
  "severity": "critical",
  "location": "Opening sentence",
  "issue": "Phrase 'no worries' is too casual for formal business context",
  "suggestion": "Use 'Certainly' or 'Of course' for appropriate formality"
}

This is targeted refinement, not generic “try again.”
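The feedback itself can be produced as structured output. A sketch along the same lines, assuming generateObject; the categorizeFeedback signature matches the orchestration code shown later, and the prompt wording is illustrative:

import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const feedbackSchema = z.object({
  items: z.array(
    z.object({
      category: z.enum(["tone", "grammar", "nuance", "fidelity"]),
      severity: z.enum(["critical", "moderate", "minor"]),
      location: z.string(),
      issue: z.string(),
      suggestion: z.string(),
    })
  ),
});

async function categorizeFeedback(
  translation: string,
  score: EvaluationScore,
  context: TranslationContext
): Promise<Feedback[]> {
  const { object } = await generateObject({
    model: openai("gpt-4o"),
    schema: feedbackSchema,
    system:
      "You are a translation reviewer. For every weakness, give a specific, actionable fix.",
    prompt: `Scores: ${JSON.stringify(score)}\nContext: ${JSON.stringify(context)}\nTranslation:\n${translation}\n\nList concrete issues and how to fix each one.`,
  });
  return object.items;
}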

Step 4: Refine

Agent receives:

  • Its own previous output
  • Categorized feedback
  • Original source text (for reference)

Generates refined translation addressing specific issues.
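A possible shape for the refine step, assuming generateText; the previous translation, the feedback list, and the source text all go into one prompt so the model fixes only what was flagged:

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

async function refine(
  previous: string,
  feedback: Feedback[],
  sourceText: string
): Promise<string> {
  const issues = feedback
    .map((f) => `- [${f.severity}] ${f.location}: ${f.issue} -> ${f.suggestion}`)
    .join("\n");

  const { text } = await generateText({
    model: openai("gpt-4o"),
    system:
      "You are revising your own translation. Address every listed issue; change nothing else.",
    prompt: `Source text:\n${sourceText}\n\nPrevious translation:\n${previous}\n\nIssues to fix:\n${issues}\n\nReturn only the revised translation.`,
  });
  return text;
}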

Step 5: Re-Evaluate

New translation scored using same criteria.

Loop continues until:

  • Threshold met (e.g., overall score ≥ 8.5)
  • OR max iterations reached (typically 3-5)

Step 6: Output

Final translation + quality scores + iteration count.

Real Example: Spanish → English Business Email

Source text (Spanish):

“Estimado cliente, lamentamos el inconveniente. Resolveremos su solicitud a la brevedad posible.”

Iteration 1: Generate

Translation:

“Dear customer, we’re sorry for the trouble. We’ll fix your request ASAP.”

Iteration 1: Evaluate

{
  "tone_accuracy": 6,
  "issue": "ASAP is too informal for 'Estimado cliente' formality level"
}

Iteration 2: Refine

Translation:

“Dear customer, we apologize for the inconvenience. We will resolve your request promptly.”

Iteration 2: Evaluate

{
  "tone_accuracy": 9,
  "cultural_nuance": 9,
  "grammar": 10,
  "semantic_fidelity": 10,
  "overall": 9.5
}

Threshold met → Output final translation

Total time: 4 seconds. Total iterations: 2.

Technical Implementation

Architecture

Base: Vercel AI SDK (proof-of-concept)

Evolution: Production-grade orchestration

async function selfImprovingTranslation(
  sourceText: string,
  targetLang: string,
  context: TranslationContext
): Promise<TranslationResult> {
  let translation = await generate(sourceText, targetLang, context);
  let score = await evaluate(translation, context);
  let iterations = 1;

  const maxIterations = 5;
  const threshold = 8.5;

  while (score.overall < threshold && iterations < maxIterations) {
    const feedback = await categorizeFeedback(translation, score, context);
    translation = await refine(translation, feedback, sourceText);
    score = await evaluate(translation, context);
    iterations++;
  }

  return {
    translation,
    score,
    iterations,
    metadata: { sourceText, targetLang, context }
  };
}
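The TranslationResult type referenced above isn't shown in the snippet; a plausible shape, consistent with the fields returned (hypothetical, for completeness):

interface TranslationResult {
  translation: string;
  score: EvaluationScore;
  iterations: number;
  metadata: {
    sourceText: string;
    targetLang: string;
    context: TranslationContext;
  };
}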

State Management Across Iterations

Each iteration maintains:

interface IterationState {
  original: string;           // Immutable source text
  current: string;            // Current translation (mutable)
  history: Translation[];     // All previous attempts
  scores: EvaluationScore[];  // Score progression
  feedback: Feedback[];       // All feedback given
  iteration: number;          // Current iteration count
}

This state persistence enables:

  • Tracking improvement across iterations
  • Avoiding repeating past errors
  • Understanding which refinements worked
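One way to thread that state through the loop is a small pure helper that appends each attempt. A sketch, assuming Translation is an alias for the translated string; the recordIteration name is hypothetical:

type Translation = string; // assumed alias for the translated text

function recordIteration(
  state: IterationState,
  translation: Translation,
  score: EvaluationScore,
  feedback: Feedback[]
): IterationState {
  return {
    ...state,
    current: translation,
    history: [...state.history, translation],
    scores: [...state.scores, score],
    feedback: [...state.feedback, ...feedback],
    iteration: state.iteration + 1,
  };
}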

Separate Evaluator Prompt

Key design decision: Use a different prompt for evaluation than generation.

Generator prompt:

“Translate the following text from Spanish to English, maintaining {tone} and {context}…”

Evaluator prompt:

“You are a critical translation reviewer. Evaluate the translation against these criteria: tone accuracy, cultural nuance, grammar, semantic fidelity. Be harsh—identify even subtle issues. Return structured JSON scores and specific feedback.”

Why separate prompts matter:

The generator is optimistic and creative. The evaluator is pessimistic and analytical.

Using the same prompt for both creates cognitive dissonance—the AI is reluctant to critique its own work harshly.

Separate prompts enable genuine self-criticism.
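In code this can be as simple as two distinct system strings passed to the same model instance; the constant names and wording below are illustrative:

import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o"); // one model, two roles

const GENERATOR_SYSTEM =
  "You are a professional translator. Translate faithfully, maintaining the requested tone and context.";

const EVALUATOR_SYSTEM =
  "You are a critical translation reviewer. Evaluate against tone accuracy, cultural nuance, " +
  "grammar, and semantic fidelity. Be harsh; identify even subtle issues. " +
  "Return structured scores and specific feedback.";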

Hallucination Mitigation

One-shot translations often hallucinate—adding information not in the original.

Example:

Source: “Entregaremos el informe la próxima semana”

One-shot translation (hallucination):

“We will deliver the comprehensive quarterly report by next Tuesday”

Issues:

  • “comprehensive” not in original
  • “quarterly” not mentioned
  • “by next Tuesday” more specific than “next week”

The Evaluator step includes fidelity checking:

interface FidelityCheck {
  added_information: string[];    // Details not in source
  omitted_information: string[];  // Details missing from translation
  meaning_changes: string[];      // Semantic drift
}

If fidelity score is low, feedback explicitly states:

{
  "category": "fidelity",
  "severity": "critical",
  "issue": "Translation adds 'comprehensive' and 'quarterly' not present in source",
  "suggestion": "Remove added details. Translate literally: 'next week' not 'next Tuesday'"
}

Agent then removes hallucinated content in next iteration.
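A minimal sketch of the fidelity check as its own structured call, assuming generateObject; the checkFidelity name is hypothetical and the field names mirror the FidelityCheck interface above:

import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const fidelitySchema = z.object({
  added_information: z.array(z.string()),
  omitted_information: z.array(z.string()),
  meaning_changes: z.array(z.string()),
});

async function checkFidelity(
  sourceText: string,
  translation: string
): Promise<FidelityCheck> {
  const { object } = await generateObject({
    model: openai("gpt-4o"),
    schema: fidelitySchema,
    system:
      "Compare a translation against its source. List only facts that were added, omitted, or changed.",
    prompt: `Source:\n${sourceText}\n\nTranslation:\n${translation}`,
  });
  // Any non-empty array here pulls the semantic_fidelity score down.
  return object;
}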

Result: 80% reduction in hallucinations compared to one-shot translations.

Moving Beyond API Calls

This isn’t just prompt engineering—it’s AI engineering.

Most LLM integrations:

  • Single API call
  • Hope for the best
  • Manual iteration if wrong

Self-Improving Translation Agent:

  • Multi-step reasoning chain
  • Autonomous quality control
  • Programmatic iteration logic
  • State management across turns
  • Structured output validation

This is sophisticated agentic orchestration, not just API calls.

Measurable Outcomes

After 2 months of testing across 100+ translations:

Quality improvement:

  • 40-60% increase in human-evaluated quality scores (one-shot vs. iterative)
  • Measured across 5 criteria: tone, nuance, grammar, fidelity, overall

Hallucination reduction:

  • 80% fewer hallucinations (fidelity checking catches invented details)

Iteration efficiency:

  • Average 2.3 iterations to reach threshold
  • 95% success rate within 3 iterations
  • 4-8 seconds total per translation (including all iterations)

User feedback:

“The first version was okay. The final version after 2 iterations was professional-grade.” — Beta tester

“It caught its own mistake with the idiom translation. Impressive self-correction.” — Professional translator reviewing output

The Broader Pattern: Evaluator-Optimizer

This pattern applies far beyond translation:

Code Generation

Generate code → Lint → Fix → Lint → Output

Content Writing

Draft article → Evaluate tone → Refine → Evaluate readability → Output

Data Transformation

Transform data → Validate schema → Fix errors → Validate → Output

API Response Generation

Generate response → Check constraints → Refine → Check completeness → Output

Any task where quality matters more than speed benefits from this pattern.

Key Engineering Principles

1. LLMs are better critics than creators

The Evaluator step is often more accurate than the Generator step.

Agents can:

  • Spot their own errors reliably
  • Provide specific, actionable feedback
  • Improve consistently across iterations

Insight: Use the AI’s analytical capabilities to validate its creative outputs.

2. Iteration unlocks quality one-shot can’t achieve

By iterating 2-3 times:

  • Translation quality improves 40-60%
  • Cultural nuance scores increase dramatically
  • Hallucinations drop by 80%

The compounding effect of iteration is non-linear.

3. Structured feedback enables autonomous improvement

Generic feedback doesn’t help:

“This translation could be better”

Structured feedback enables action:

{
  "issue": "Tone too casual in sentence 2",
  "suggestion": "Replace 'no worries' with 'Certainly'"
}

The agent can parse, understand, and apply the fix autonomously.

4. Deterministic scoring prevents infinite loops

Without clear thresholds, iteration could continue indefinitely.

Deterministic scoring ensures:

  • Objective quality measurement
  • Clear stopping criteria
  • Iteration count limits (max 5)

Production systems need guardrails.

What I Learned

Production AI ≠ API calls

Building production AI systems means:

  • Orchestrating multi-step workflows
  • Managing state across iterations
  • Implementing quality control mechanisms
  • Handling edge cases and failures

This is engineering, not just prompting.

Separate roles unlock better performance

Using the same AI in different roles (Generator vs. Evaluator) produces better results than using different models.

Why?

  • Same model understands its own generation patterns
  • Can spot subtle issues specific to its generation style
  • Maintains consistency across iterations

But only if you separate the prompts to enable genuine critique.

The “good enough” threshold is critical

Setting threshold too high:

  • Wastes iterations on diminishing returns
  • Increases latency
  • Costs more in API calls

Setting threshold too low:

  • Outputs lower quality
  • Misses obvious errors

Through testing, 8.5/10 overall score proved optimal:

  • High enough for professional-grade output
  • Low enough to reach consistently in 2-3 iterations

Evolution Story

V1 (Vercel AI SDK example):

  • One-shot translation
  • No quality control
  • “Fire and forget”

V2 (Self-Improving Agent):

  • Added Evaluator step
  • Deterministic scoring
  • Single iteration refinement

V3 (Production):

  • Iterative refinement loops
  • Categorized feedback
  • State management
  • Hallucination detection
  • Quality threshold enforcement

Time to production: 2 weeks from proof-of-concept to robust system.

Try It

The pattern is language-agnostic and model-agnostic. Works with Claude, GPT-4, or any strong reasoning model.

Implementation guide: GitHub repository (coming soon)

Read the Vercel AI SDK docs: Vercel AI SDK

Connect

Building agentic systems or exploring Evaluator-Optimizer patterns?


This is part of my series on AI agent infrastructure. Previously: Progressive disclosure for autonomous error correction. Next: Full-cycle product engineering with the Xano n8n integration.