Blog

24 articles

Research, case studies, and engineering deep-dives from the HeyNEO team.

Agent Constitution: A Policy Layer That Actually Stops Agents from Doing Dumb Things

NEO built a YAML-driven policy enforcement framework for AI agents with AST-restricted conditions, staged PII detection, JSONL audits, and a FastAPI + React dashboard so rule violations are visible and enforceable.

May 14, 2026·10 min

AI Agents & Automation

ContextTimeMachine: Replay an Agent's Context Window at Any Turn You Choose

NEO built a post-hoc context debugger that reconstructs the exact context window at any turn, tracks when facts drop out, and finds divergence points across two agent runs.

May 14, 2026·10 min

AI Agents & Automation

LiveContext: A Real-Time Stream View of What's Actually in Your Agent's Context Window

NEO built a transparent OpenAI, Anthropic, and Ollama proxy that streams context composition, token usage, evictions, and attention density to a live dashboard.

May 14, 2026·10 min

AI Agents & Automation

agentsync: Git-Backed Sync for AI Team Configs with a 52-Point Compliance Audit

NEO built a git-backed CLI for AI team configuration sync with tree-level three-way merge, conflict-safe pull and push flows, and a 52-point security and compliance audit.

May 14, 2026·10 min

LLM Evaluation & Benchmarking

ASR Evaluation Framework: Benchmarking Five Speech Models on Accuracy, Speed, and Robustness

NEO built an ASR benchmarking harness that compares five speech models across 15+ scenarios with WER, CER, RTF, and inference-time outputs in a stable JSON schema.

May 14, 2026·10 min

LLM Evaluation & Benchmarking

How Neo Evaluated and Optimized a RAG Chatbot

A full case study on replacing gut-feel RAG tuning with LLM-as-judge evaluation, retrieval fixes, and model sweep benchmarking that delivered +19% quality and -79% session cost.

May 8, 2026·14 min

LLM Evaluation & Benchmarking

Which Local LLM Should You Actually Use for Coding and Agentic Workflows?

A complete evaluation of qwen3.6:27b, qwen3.6:35b-a3b, qwen3-coder:30b, and deepseek-coder:33b across code generation, function calling, and agent capabilities, all run entirely on local hardware.

May 7, 2026·10 min

AI Agents & Automation

Agent Factory GUI: Visual No-Code Builder for AI Agent Workflows

NEO built Agent Factory GUI, a drag-and-drop visual environment for composing, testing, and exporting production AI agent pipelines without writing boilerplate.

March 23, 2026·8 min

Applied AI / Domain-Specific Pipelines

Arxiv Paper to Podcast: ML Research You Can Listen To

NEO built a pipeline that converts arXiv research papers into podcast-style audio episodes with two AI hosts, TTS voice synthesis, background music, and MP3 output.

March 23, 2026·8 min

AI Agents & Automation

AutoDoc: An Autonomous Agent That Reads Your Codebase and Writes the Docs

NEO built AutoDoc, an autonomous agent that traverses your codebase, understands structure and intent, and generates accurate, up-to-date documentation without manual effort.

March 23, 2026·8 min

Applied AI / Domain-Specific Pipelines

Building a Multimodal RAG System That Retrieves Text, Images, and Tables Together

NEO built a multimodal RAG system with CLIP embeddings and ChromaDB: 0.030s retrieval, 60%+ cross-modal accuracy, and ingestion for PDFs, images (OCR), and tables (schema extraction).

March 17, 2026·8 min

Model Optimization & Inference

Carbon-Aware Model Training: Cutting CO2 by 43% Without Sacrificing Accuracy

NEO built a PyTorch pipeline that schedules training around grid carbon intensity and tracks emissions with CodeCarbon, achieving a 43.2% CO2 reduction on MNIST with accuracy within 0.3% of baseline.

March 16, 2026·8 min
