I asked Claude to translate a business email from Spanish to English.
The result? Grammatically correct but tonally wrong. The original was formal and professional. The translation used “no worries” and “get it done soon”—way too casual for a client-facing message.
The problem wasn’t the AI. It was the workflow.
One-shot LLM calls can’t self-correct. You generate once, hope for the best, and manually iterate if it’s wrong.
But human translators don’t work that way. They:
- Generate a first draft
- Evaluate it critically
- Identify specific problems
- Refine based on feedback
- Repeat until satisfied
What if AI could do the same?
I built a self-improving translation agent using the Evaluator-Optimizer pattern. It iterates autonomously until quality thresholds are met.
Result: 40-60% quality improvement over one-shot translations, with 80% reduction in hallucinations.
The Problem with Fire-and-Forget LLMs
Standard LLM workflow:
import OpenAI from "openai";
const openai = new OpenAI();

const response = await openai.chat.completions.create({
  model: "gpt-4o", // the API requires a model; any chat model works here
  messages: [{ role: "user", content: "Translate this to English..." }]
});
// Hope it's good 🤞
What you get:
- Literal translations that miss cultural nuance
- Tonal mismatches (formal becomes casual, or vice versa)
- Context loss (idioms translated word-for-word)
- No quality control (you don’t know if it’s good until a human checks)
- Hallucinations (AI adds details not in the original)
The core issue: LLMs are stateless. Each call is independent. There’s no editorial process.
The Human Editorial Process
Professional translators iterate:
Draft 1: Quick literal translation
- “No hay problema, lo haremos pronto”
- → “No problem, we’ll do it soon”
Self-critique: Too casual for business context
Draft 2: Adjust formality
- → “Certainly, we will complete this promptly”
Self-critique: Better tone, but “complete this” is vague
Draft 3: Add specificity
- → “Certainly, we will finalize the deliverables promptly”
Final check: Tone ✓ Clarity ✓ Professionalism ✓
This is iterative refinement. Can AI do it autonomously?
The Solution: Evaluator-Optimizer Pattern
I evolved a Vercel AI SDK proof-of-concept into a production agentic workflow:
Input → Generate → Evaluate → Refine → Evaluate → ... → Output
Key innovation: The agent acts as both Generator and Critic.
Step 1: Generate
Agent receives:
- Source text
- Target language
- Context (tone, audience, domain)
Produces initial translation.
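A minimal sketch of this step, assuming the Vercel AI SDK's generateText helper and GPT-4o as the model. The TranslationContext shape and prompt wording here are illustrative, not the production code:

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Hypothetical context shape; the real project's types may differ.
interface TranslationContext {
  tone: string;      // e.g. "formal"
  audience: string;  // e.g. "external client"
  domain: string;    // e.g. "business correspondence"
}

async function generate(
  sourceText: string,
  targetLang: string,
  context: TranslationContext
): Promise<string> {
  const { text } = await generateText({
    model: openai("gpt-4o"),
    system:
      `You are a professional translator. Translate into ${targetLang}, ` +
      `preserving a ${context.tone} tone for a ${context.audience} audience ` +
      `in the ${context.domain} domain. Do not add or omit information.`,
    prompt: sourceText,
  });
  return text;
}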
Step 2: Evaluate
Deterministic scoring against criteria:
interface EvaluationScore {
  tone_accuracy: number;      // 0-10: Matches intended tone?
  cultural_nuance: number;    // 0-10: Idioms localized appropriately?
  grammar: number;            // 0-10: Grammatically correct?
  semantic_fidelity: number;  // 0-10: Preserves original meaning?
  overall: number;            // Average of above
}
Example output:
{
  "tone_accuracy": 5,
  "cultural_nuance": 8,
  "grammar": 10,
  "semantic_fidelity": 9,
  "overall": 8.0
}
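One way to get these scores back as validated JSON is the AI SDK's generateObject with a Zod schema mirroring EvaluationScore. This sketch reuses the hypothetical TranslationContext from the earlier sketch and also passes the source text so the evaluator can judge semantic fidelity; the prompt wording is illustrative:

import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const evaluationSchema = z.object({
  tone_accuracy: z.number().min(0).max(10),
  cultural_nuance: z.number().min(0).max(10),
  grammar: z.number().min(0).max(10),
  semantic_fidelity: z.number().min(0).max(10),
  overall: z.number().min(0).max(10),
});

async function evaluate(
  translation: string,
  sourceText: string,
  context: TranslationContext
): Promise<z.infer<typeof evaluationSchema>> {
  const { object } = await generateObject({
    model: openai("gpt-4o"),
    schema: evaluationSchema,
    system:
      "You are a critical translation reviewer. Score each criterion from 0 to 10. " +
      "Be harsh: identify even subtle issues.",
    prompt: `Source:\n${sourceText}\n\nTranslation:\n${translation}\n\nIntended tone: ${context.tone}`,
  });
  return object;
}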
Step 3: Categorize Feedback
If overall < threshold (e.g., 8.5), generate specific, actionable feedback:
interface Feedback {
  category: "tone" | "grammar" | "nuance" | "fidelity";
  severity: "critical" | "moderate" | "minor";
  location: string;    // "Sentence 2" or "Paragraph 1"
  issue: string;       // What's wrong
  suggestion: string;  // How to fix it
}
Example:
{
  "category": "tone",
  "severity": "critical",
  "location": "Opening sentence",
  "issue": "Phrase 'no worries' is too casual for formal business context",
  "suggestion": "Use 'Certainly' or 'Of course' for appropriate formality"
}
This is targeted refinement, not generic “try again.”
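Constraining the critique to the Feedback shape above is what makes it machine-actionable. A sketch of the corresponding Zod schema, wrapped in an object so it can be requested with the same generateObject call as before:

import { z } from "zod";

const feedbackSchema = z.object({
  items: z.array(
    z.object({
      category: z.enum(["tone", "grammar", "nuance", "fidelity"]),
      severity: z.enum(["critical", "moderate", "minor"]),
      location: z.string(),   // e.g. "Sentence 2"
      issue: z.string(),      // what's wrong
      suggestion: z.string(), // how to fix it
    })
  ),
});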
Step 4: Refine
Agent receives:
- Its own previous output
- Categorized feedback
- Original source text (for reference)
Generates refined translation addressing specific issues.
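A minimal sketch of the refinement call, building the prompt from the Feedback items defined earlier; the prompt wording is illustrative:

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

async function refine(
  previous: string,
  feedback: Feedback[],
  sourceText: string
): Promise<string> {
  const issues = feedback
    .map((f) => `- [${f.severity}] ${f.location}: ${f.issue} (fix: ${f.suggestion})`)
    .join("\n");

  const { text } = await generateText({
    model: openai("gpt-4o"),
    system: "Revise the translation so that every listed issue is addressed. Change nothing else.",
    prompt:
      `Source text:\n${sourceText}\n\n` +
      `Previous translation:\n${previous}\n\n` +
      `Issues to fix:\n${issues}`,
  });
  return text;
}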
Step 5: Re-Evaluate
New translation scored using same criteria.
Loop continues until:
- Threshold met (e.g., overall score ≥ 8.5)
- OR max iterations reached (typically 3-5)
Step 6: Output
Final translation + quality scores + iteration count.
Real Example: Spanish → English Business Email
Source text (Spanish):
“Estimado cliente, lamentamos el inconveniente. Resolveremos su solicitud a la brevedad posible.”
Iteration 1: Generate
Translation:
“Dear customer, we’re sorry for the trouble. We’ll fix your request ASAP.”
Iteration 1: Evaluate
{
  "tone_accuracy": 6,
  "issue": "ASAP is too informal for 'Estimado cliente' formality level"
}
Iteration 2: Refine
Translation:
“Dear customer, we apologize for the inconvenience. We will resolve your request promptly.”
Iteration 2: Evaluate
{
  "tone_accuracy": 9,
  "cultural_nuance": 9,
  "grammar": 10,
  "semantic_fidelity": 10,
  "overall": 9.5
}
Threshold met → Output final translation
Total time: 4 seconds. Total iterations: 2.
Technical Implementation
Architecture
Base: Vercel AI SDK (proof-of-concept)
Evolution: Production-grade orchestration
async function selfImprovingTranslation(
  sourceText: string,
  targetLang: string,
  context: TranslationContext
): Promise<TranslationResult> {
  // Initial draft and first evaluation
  let translation = await generate(sourceText, targetLang, context);
  let score = await evaluate(translation, context);
  let iterations = 1;

  const maxIterations = 5;
  const threshold = 8.5;

  // Keep refining until the quality bar is met or the iteration cap is hit
  while (score.overall < threshold && iterations < maxIterations) {
    const feedback = await categorizeFeedback(translation, score, context);
    translation = await refine(translation, feedback, sourceText);
    score = await evaluate(translation, context);
    iterations++;
  }

  return {
    translation,
    score,
    iterations,
    metadata: { sourceText, targetLang, context }
  };
}
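Usage is a single awaited call; the caller gets the final translation plus its audit trail. The context object below follows the hypothetical shape sketched earlier, and the commented values echo the example above:

const result = await selfImprovingTranslation(
  "Estimado cliente, lamentamos el inconveniente. Resolveremos su solicitud a la brevedad posible.",
  "en",
  { tone: "formal", audience: "external client", domain: "business correspondence" }
);

console.log(result.translation);    // final, threshold-passing translation
console.log(result.score.overall);  // e.g. 9.5
console.log(result.iterations);     // e.g. 2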
State Management Across Iterations
Each iteration maintains:
interface IterationState {
  original: string;           // Immutable source text
  current: string;            // Current translation (mutable)
  history: Translation[];     // All previous attempts
  scores: EvaluationScore[];  // Score progression
  feedback: Feedback[];       // All feedback given
  iteration: number;          // Current iteration count
}
This state persistence enables:
- Tracking improvement across iterations
- Avoiding repeating past errors
- Understanding which refinements worked
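A sketch of how that state might be appended after each pass, assuming Translation is just the translated string (the real type may carry more metadata):

// Assumption: the history stores plain translated strings.
type Translation = string;

function recordIteration(
  state: IterationState,
  translation: Translation,
  score: EvaluationScore,
  feedback: Feedback[]
): IterationState {
  return {
    ...state,
    current: translation,
    history: [...state.history, translation],
    scores: [...state.scores, score],
    feedback: [...state.feedback, ...feedback],
    iteration: state.iteration + 1,
  };
}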
Separate Evaluator Prompt
Key design decision: Use a different prompt for evaluation than for generation.
Generator prompt:
“Translate the following text from Spanish to English, maintaining {tone} and {context}…”
Evaluator prompt:
“You are a critical translation reviewer. Evaluate the translation against these criteria: tone accuracy, cultural nuance, grammar, semantic fidelity. Be harsh—identify even subtle issues. Return structured JSON scores and specific feedback.”
Why separate prompts matter:
The generator is optimistic and creative. The evaluator is pessimistic and analytical.
Using the same prompt for both creates cognitive dissonance—the AI is reluctant to critique its own work harshly.
Separate prompts enable genuine self-criticism.
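In code, the separation is nothing more than two distinct system prompts; a sketch of how they might be kept side by side (wording is illustrative):

// Generator persona: creative, follows tone and context instructions.
const GENERATOR_SYSTEM = (targetLang: string, ctx: TranslationContext) =>
  `You are a professional translator. Translate into ${targetLang}, ` +
  `matching a ${ctx.tone} tone for a ${ctx.audience} audience. ` +
  `Do not add or omit information.`;

// Evaluator persona: analytical, explicitly told to be harsh.
const EVALUATOR_SYSTEM =
  "You are a critical translation reviewer. Evaluate the translation against these criteria: " +
  "tone accuracy, cultural nuance, grammar, semantic fidelity. Be harsh: identify even subtle " +
  "issues. Return structured JSON scores and specific feedback.";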
Hallucination Mitigation
One-shot translations often hallucinate—adding information not in the original.
Example:
Source: “Entregaremos el informe la próxima semana”
One-shot translation (hallucination):
“We will deliver the comprehensive quarterly report by next Tuesday”
Issues:
- “comprehensive” not in original
- “quarterly” not mentioned
- “by next Tuesday” more specific than “next week”
The Evaluator step includes fidelity checking:
interface FidelityCheck {
  added_information: string[];    // Details not in source
  omitted_information: string[];  // Details missing from translation
  meaning_changes: string[];      // Semantic drift
}
If fidelity score is low, feedback explicitly states:
{
  "category": "fidelity",
  "severity": "critical",
  "issue": "Translation adds 'comprehensive' and 'quarterly' not present in source",
  "suggestion": "Remove added details. Translate literally: 'next week' not 'next Tuesday'"
}
Agent then removes hallucinated content in next iteration.
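A sketch of the fidelity check as its own structured call, again assuming generateObject and GPT-4o (the prompt is illustrative):

import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const fidelitySchema = z.object({
  added_information: z.array(z.string()),    // details not in source
  omitted_information: z.array(z.string()),  // details missing from translation
  meaning_changes: z.array(z.string()),      // semantic drift
});

async function checkFidelity(sourceText: string, translation: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o"),
    schema: fidelitySchema,
    system:
      "Compare the translation against the source, sentence by sentence. " +
      "List any information that was added, omitted, or changed in meaning.",
    prompt: `Source:\n${sourceText}\n\nTranslation:\n${translation}`,
  });
  return object; // non-empty arrays here become 'fidelity' feedback items
}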
Result: 80% reduction in hallucinations compared to one-shot translations.
Moving Beyond API Calls
This isn’t just prompt engineering—it’s AI engineering.
Most LLM integrations:
- Single API call
- Hope for the best
- Manual iteration if wrong
Self-Improving Translation Agent:
- Multi-step reasoning chain
- Autonomous quality control
- Programmatic iteration logic
- State management across turns
- Structured output validation
This is sophisticated agentic orchestration, not just API calls.
Measurable Outcomes
After 2 months of testing across 100+ translations:
Quality improvement:
- 40-60% increase in human-evaluated quality scores (one-shot vs. iterative)
- Measured across 5 criteria: tone, nuance, grammar, fidelity, overall
Hallucination reduction:
- 80% fewer hallucinations (fidelity checking catches invented details)
Iteration efficiency:
- Average 2.3 iterations to reach threshold
- 95% success rate within 3 iterations
- 4-8 seconds total per translation (including all iterations)
User feedback:
“The first version was okay. The final version after 2 iterations was professional-grade.” — Beta tester
“It caught its own mistake with the idiom translation. Impressive self-correction.” — Professional translator reviewing output
The Broader Pattern: Evaluator-Optimizer
This pattern applies far beyond translation:
Code Generation
Generate code → Lint → Fix → Lint → Output
Content Writing
Draft article → Evaluate tone → Refine → Evaluate readability → Output
Data Transformation
Transform data → Validate schema → Fix errors → Validate → Output
API Response Generation
Generate response → Check constraints → Refine → Check completeness → Output
Any task where quality matters more than speed benefits from this pattern.
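Stripped of translation specifics, the loop is a small, reusable contract between a generator, an evaluator, and a refiner. A generic sketch, with names and defaults that are illustrative rather than taken from the project:

interface EvaluatorOptimizerOptions<T> {
  generate: () => Promise<T>;
  evaluate: (candidate: T) => Promise<number>;          // 0-10 quality score
  refine: (candidate: T, score: number) => Promise<T>;  // produce an improved candidate
  threshold?: number;
  maxIterations?: number;
}

async function evaluatorOptimizer<T>({
  generate,
  evaluate,
  refine,
  threshold = 8.5,
  maxIterations = 5,
}: EvaluatorOptimizerOptions<T>): Promise<{ output: T; score: number; iterations: number }> {
  let output = await generate();
  let score = await evaluate(output);
  let iterations = 1;

  while (score < threshold && iterations < maxIterations) {
    output = await refine(output, score);
    score = await evaluate(output);
    iterations++;
  }

  return { output, score, iterations };
}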
Key Engineering Principles
1. LLMs are better critics than creators
The Evaluator step is often more accurate than the Generator step.
Agents can:
- Spot their own errors reliably
- Provide specific, actionable feedback
- Improve consistently across iterations
Insight: Use the AI’s analytical capabilities to validate its creative outputs.
2. Iteration unlocks quality one-shot can’t achieve
By iterating 2-3 times:
- Translation quality improves 40-60%
- Cultural nuance scores increase dramatically
- Hallucinations reduced by 80%
The compounding effect of iteration is non-linear.
3. Structured feedback enables autonomous improvement
Generic feedback doesn’t help:
“This translation could be better”
Structured feedback enables action:
{
  "issue": "Tone too casual in sentence 2",
  "suggestion": "Replace 'no worries' with 'Certainly'"
}
The agent can parse, understand, and apply the fix autonomously.
4. Deterministic scoring prevents infinite loops
Without clear thresholds, iteration could continue indefinitely.
Deterministic scoring ensures:
- Objective quality measurement
- Clear stopping criteria
- Iteration count limits (max 5)
Production systems need guardrails.
What I Learned
Production AI ≠ API calls
Building production AI systems means:
- Orchestrating multi-step workflows
- Managing state across iterations
- Implementing quality control mechanisms
- Handling edge cases and failures
This is engineering, not just prompting.
Separate roles unlock better performance
Using the same AI in different roles (Generator vs. Evaluator) produces better results than using different models.
Why?
- Same model understands its own generation patterns
- Can spot subtle issues specific to its generation style
- Maintains consistency across iterations
But only if you separate the prompts to enable genuine critique.
The “good enough” threshold is critical
Setting threshold too high:
- Wastes iterations on diminishing returns
- Increases latency
- Costs more in API calls
Setting threshold too low:
- Outputs lower quality
- Misses obvious errors
Through testing, an overall score of 8.5/10 proved optimal:
- High enough for professional-grade output
- Low enough to reach consistently in 2-3 iterations
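In code, the "good enough" bar is just a pair of tunables; a minimal sketch using the values above (adjust per domain and latency budget):

const QUALITY_CONFIG = {
  threshold: 8.5,    // professional-grade output, reachable in 2-3 iterations
  maxIterations: 5,  // hard stop that bounds latency and API cost
};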
Evolution Story
V1 (Vercel AI SDK example):
- One-shot translation
- No quality control
- “Fire and forget”
V2 (Self-Improving Agent):
- Added Evaluator step
- Deterministic scoring
- Single iteration refinement
V3 (Production):
- Iterative refinement loops
- Categorized feedback
- State management
- Hallucination detection
- Quality threshold enforcement
Time to production: 2 weeks from proof-of-concept to robust system.
Try It
The pattern is language-agnostic and model-agnostic. Works with Claude, GPT-4, or any strong reasoning model.
Implementation guide: GitHub repository (coming soon)
Read the Vercel AI SDK docs: Vercel AI SDK
Connect
Building agentic systems or exploring Evaluator-Optimizer patterns?
- Email: daniel@pragsys.io
- Follow my work: Dev.to | Substack | X/Twitter
- More case studies: danielpetro.dev/work
This is part of my series on AI agent infrastructure. Previously: Progressive disclosure for autonomous error correction. Next: Full-cycle product engineering with the Xano n8n integration.