A single prompt kicked off a full autonomous evaluation pipeline. Neo explored the codebase, designed the benchmarking strategy, ran the experiments, and delivered a 19% quality improvement with a 79% cost reduction.
What is Neo
Neo is a fully autonomous AI Engineering Agent that lives in your VS Code IDE or Cursor. You give it a high-level goal. It plans the implementation, writes the code, runs tests, and iterates until the project is complete. No hand-holding, no step-by-step instructions.
Neo handles the full AI/ML engineering stack: RAG pipelines, model fine-tuning, evaluation harnesses, classical ML experimentation, and deployment pipelines. It can explore an unfamiliar codebase, identify what is broken or missing, design a solution, and ship it.
The project was a customer support chatbot for TechFlow Solutions, built on a standard RAG setup: ChromaDB for vector retrieval, a system prompt, and an LLM to generate responses. The team had a general sense it was "okay" but nobody had actually measured anything. No scores, no baselines, no way to know if a change made things better or worse.
The prompt given to Neo was straightforward: evaluate the chatbot for quality and cost, optimize it, and report back. What followed was a three-phase process that Neo designed and executed autonomously.
The setup: TechFlow's chatbot handles customer questions about pricing, features, and account management. The knowledge base is a set of markdown docs loaded into ChromaDB. Six representative turns from a real production session were used as the evaluation dataset across all three phases.
Phase 1: Establishing a Real Baseline
The first thing Neo identified was that the existing evaluator.py was not actually evaluating anything. It was doing pattern-matching heuristics: checking whether responses contained certain keywords, counting source references, and flagging boilerplate phrases. The scores it produced looked like numbers but carried no real signal about response quality.
Neo rewrote the evaluator from scratch to make real API calls to anthropic/claude-haiku-4-5 via OpenRouter. The judge receives the user question, the assistant response, and the retrieved source chunks with their similarity scores. It scores each turn on four metrics - relevance, accuracy, helpfulness, and overall quality - all on a 0-10 scale, and returns a reasoning string per turn.
Before running any evaluation, Neo also reverted the agent to its original git state. All previously applied changes were rolled back. The goal was a clean, honest baseline with no prior optimizations baked in.
What the judge found
| Metric | Value |
|---|---|
| Overall Average | 6.62 |
| Relevance | 7.50 |
| Accuracy | 7.17 |
| Helpfulness | 5.67 |
The overall average of 6.62 out of 10 is not terrible, but the helpfulness score of 5.67 was the immediate flag. The agent was technically answering questions but not actually helping users get what they needed. The per-turn breakdown told a more specific story:
| Turn | User Question | Relevance | Accuracy | Helpfulness | Overall |
|---|---|---|---|---|---|
| 1 | "hey what do you guys do?" | 6 | 10 | 3 | 5 |
| 2 | "tell me about the pricing plans" | 10 | 7 | 9 | 8 |
| 3 | "what happens after 50 workflows are exhausted?" | 7 | 6 | 6 | 6 |
| 4 | "what should I do after my quota is exhausted?" | 7 | 8 | 6 | 7 |
| 5 | "what does techflow do?" | 9 | 5 | 6 | 6 |
| 6 | "How to use FlowSync?" | 6 | 7 | 4 | 5 |
Turn 1 was the most revealing. The user asked a simple "what do you do?" opener and the agent responded with: "I don't have access to specific information about our company's services." The judge gave it a helpfulness score of 3. Neo traced the root cause: retrieval returned zero documents. The SIMILARITY_THRESHOLD was set to 0.7, which in ChromaDB's cosine distance space is strict. A casual opener like "hey what do you guys do?" produced embeddings that did not get close enough to any chunk to pass the filter.
Critical finding: Turn 1 retrieved 0 documents because the similarity threshold was too strict for casual, low-specificity queries. The agent had no context to work with and correctly said so, but that is not helpful to a user who just wants to know what the product does. This was a retrieval configuration bug, not an LLM problem.
Turn 5 had the opposite problem. The agent retrieved good context but then fabricated product names that were not in the documents. The judge caught it and scored accuracy at 5. The real answer was in the retrieved chunks. The agent hallucinated details it did not need to hallucinate.
Baseline session cost: $0.002420 per session using google/gemini-3.1-flash-lite-preview, priced live from the OpenRouter /models API.
Phase 2: Iterative Agent Optimization
With a real baseline established, Neo applied fixes in two groups. The production model stayed the same throughout Phase 2 so any score changes could be cleanly attributed to agent changes, not LLM differences.
Fix Group A: Retrieval layer
Neo identified three retrieval-level problems and addressed each directly:
- Lowered
SIMILARITY_THRESHOLDfrom0.7to0.35. ChromaDB returns cosine distance, not similarity. Lower distance means more similar. The original threshold was filtering out too much context on low-specificity queries. - Added a top-K fallback: if all retrieved docs still get filtered after the threshold check, return the top 3 by distance anyway. The agent should never enter a turn with zero context.
- Added
_deduplicate_chunks(): removes chunks with more than 80% token overlap from the same source file. Turns 3 and 4 in the baseline had three nearly identical FAQ chunks in the context window, wasting tokens and adding noise for the LLM.
| Metric | Value |
|---|---|
| Overall Average | 7.62 |
| Relevance | 8.83 |
| Accuracy | 7.33 |
| Helpfulness | 7.00 |
A 15.1% improvement in overall average, from 6.62 to 7.62. Helpfulness jumped from 5.67 to 7.00, the biggest single-metric gain. Turn 1 went from a 5 to a 7 because the agent now had actual context. Turn 5 jumped to 9.5 because with deduplication, the context was cleaner and the hallucination stopped.
Fix Group B: LLM interaction layer
Two changes on top of Fix Group A:
- Added
_build_messages_with_history(): passes the last 3 conversation turns as prior messages to the LLM. Without this, every turn was stateless. The agent on Turn 4 had no memory of Turn 3, making follow-up questions awkward to handle. - Rewrote the system prompt: removed the boilerplate "Hello! Thank you for reaching out" greeting that appeared on every single turn, added explicit tone de-escalation instructions for frustrated users, and added a grounding rule: only state facts present in the retrieved documents.
| Metric | Value |
|---|---|
| Overall Average | 7.33 |
| Relevance | 8.50 |
| Accuracy | 7.50 |
| Helpfulness | 6.33 |
Why did the score dip from 7.62 to 7.33? The new system prompt's strict grounding rule made the agent more accurate but less helpful on turns where the documents did not fully answer the question. On Turn 4, the agent previously offered upgrade suggestions from memory. With the grounding rule, it said "the documents don't specify this, contact support." Technically more accurate. Less useful to the user. The judge penalized helpfulness for it. This is a real tradeoff and the right call for a factual support bot, but it is worth knowing it happened.
Net result: +10.7% overall quality versus the original baseline, with a 15.6% token reduction from deduplication. Both fix groups confirmed net-positive. Neo verified this before proceeding to Phase 3.

Phase 3: Model Sweep
Neo then swept model choices while keeping the optimized agent fixed. That separated agent quality from model quality and made the cost comparison meaningful.
All 5 models were evaluated with the same Claude Haiku 4.5 judge, same questions, and same agent configuration. Pricing was fetched live from the OpenRouter /models API at evaluation time so the cost numbers reflected current rates.
Quality scores
| Model | Relevance | Accuracy | Helpfulness | Overall Avg |
|---|---|---|---|---|
| Gemma 4 26B (Google) | 9.00 | 8.67 | 6.50 | 7.88 |
| Mistral Small 3.2 (Mistral AI) | 8.33 | 8.00 | 6.83 | 7.62 |
| Llama 4 Scout (Meta) | 8.50 | 7.33 | 6.83 | 7.46 |
| Gemini 3.1 Flash Lite (baseline ref) | 8.50 | 7.50 | 6.33 | 7.33 |
| Nova Micro v1 (Amazon) | 7.33 | 8.50 | 5.67 | 6.96 |
Gemma 4 26B came out on top with a 7.88 average. It scored highest on relevance (9.0) and accuracy (8.67). The judge consistently noted that Gemma's responses were well-grounded in the retrieved context without unnecessary padding. Mistral Small 3.2 was a close second at 7.62, with the best helpfulness score among all sweep models.
Nova Micro was the weakest on quality at 6.96, though still above the original unoptimized baseline of 6.62. Its helpfulness score dropped back to 5.67, similar to the original baseline. The model tends to give short, accurate-but-terse responses that the judge penalizes for not being actionable enough.

Cost per session
This is where the picture gets interesting. The baseline model costs $0.002043 per session with the optimized agent. Every other model in the sweep costs less than a quarter of that.
| Model | Input $/M tokens | Session Cost | vs Baseline Cost | Quality Avg |
|---|---|---|---|---|
| Gemini 3.1 Flash Lite (ref) | $0.25 | $0.002043 | - | 7.33 |
| Nova Micro v1 (Amazon) | $0.035 | $0.000238 | -88% | 6.96 |
| Mistral Small 3.2 | $0.075 | $0.000430 | -79% | 7.62 |
| Llama 4 Scout (Meta) | $0.080 | $0.000472 | -77% | 7.46 |
| Gemma 4 26B (Google) | $0.060 | $0.000509 | -75% | 7.88 |

Quality vs cost: the full picture
Looking at scores alone, Gemma wins. Looking at cost alone, Nova Micro wins. The scatter plot puts both axes together and makes the decision clear.

Gemma 4 26B is the clear winner. It scores higher than the baseline model on quality (7.88 vs 7.33) while costing 75% less per session. There is no tradeoff to make here. It is both better and cheaper.
If cost is the primary constraint and a slight quality drop is acceptable, Nova Micro at $0.000238 per session is a viable option. At 6.96 average quality it is still above the original unoptimized baseline of 6.62, and the 88% cost reduction is significant at scale.
End-to-end improvement: Starting from the original unoptimized agent at 6.62/10 and $0.002420/session, the optimized agent running Gemma 4 26B delivers 7.88/10 quality (+19%) at $0.000509/session (-79% cost). Both metrics moved in the right direction simultaneously.
What This Exercise Showed
- Measure before you optimize. The team had a vague sense the chatbot was "okay." The judge said helpfulness was 5.67 out of 10. Those are very different starting points. Without a real baseline, any optimization is guesswork and any claimed improvement is unverifiable.
- Retrieval problems look like LLM problems. Turn 1 looked like the model did not know about the company. The actual problem was that zero documents were retrieved. Always check what context the LLM actually received before blaming the generation step.
- Stricter grounding trades helpfulness for accuracy. The system prompt rewrite improved accuracy but hurt helpfulness on knowledge-gap turns. That is a real tradeoff and the right call for a factual support bot, but it needs to be a conscious decision, not a side effect.
- Cheaper models are not worse models. Gemma 4 26B cost 75% less and scored higher. The assumption that the most expensive model is the best model does not hold here. Always sweep before committing to a production model.
- LLM-as-judge is worth the cost. Each evaluation run cost a few cents in Claude Haiku API calls. The judge caught hallucinations, identified retrieval failures, and gave consistent per-turn reasoning that heuristics would have missed entirely.
What to improve next
The system prompt grounding rule caused a helpfulness regression on knowledge-gap turns. A better approach is a two-path response: if the retrieved context answers the question, answer from it; if not, say so clearly and give the user a concrete next step, such as a support email, docs link, or escalation path, rather than just "I don't know." The current agent does the second part inconsistently.
Adding a brief company summary directly into the system prompt would also help. Turn 1-style openers ("what do you do?") would always get a useful response regardless of retrieval quality. That single change would likely push the overall average above 8.0.
The evaluation pipeline itself is reusable. The same judge, the same 6 questions, and the same cost tracker can be run against any future agent change in a few minutes. That is the real output of this work: not just better scores, but a repeatable way to know whether the next change helps or hurts.
Try This Yourself with Neo
Everything Neo did here is reproducible. The full project, evaluation harness, session logs, and result files are in the repository. If you want to build on top of this or run your own evaluation, clone the project and point Neo at it.
How to get started: Install the Neo extension in VS Code or Cursor. Clone the TechFlow chatbot project. Open it in your IDE and give Neo a goal. Neo will explore the codebase, plan the work, and execute it autonomously.
Example prompts to continue this work
Improve the knowledge-gap handling
"The chatbot currently says 'I don't know, contact support' when retrieved docs don't answer the question. Improve this so it gives the user a specific next step (support email, docs link, or escalation path) based on the question type. Re-run the evaluation and confirm helpfulness improves."
Add a company overview to the system prompt
"Turn 1 in our evaluation scored 5/10 because casual openers like 'what do you do?' don't retrieve good context. Add a concise TechFlow company summary directly to the system prompt so the agent can always answer overview questions. Re-evaluate and confirm Turn 1 score improves."
Expand the evaluation dataset
"The current evaluation uses 6 turns from one session. Generate 20 diverse test questions covering pricing, features, account management, and edge cases. Run the full evaluation on this expanded dataset and report whether the quality rankings across models hold."
Add a cost-aware routing layer
"Build a query router that sends simple factual questions to Nova Micro (cheapest, $0.000238/session) and complex multi-turn questions to Gemma 4 26B (best quality). Evaluate the hybrid setup and report the blended cost and quality score."
Run a full re-evaluation after any change
"Run the full evaluation after the latest agent change. Compare the new quality, helpfulness, and cost numbers against the baseline and report whether the change helped or hurt."
All evaluation data, session logs, and result files are in eval/results/ of the project repository. Judge model: anthropic/claude-haiku-4-5 via OpenRouter. Pricing fetched live from the OpenRouter /models API.
Try NEO in Your IDE
Install the NEO extension to bring AI-powered development directly into your workflow:
- VS Code: NEO in VS Code
- Cursor: Install NEO for Cursor →
