Eval-Driven RAG for Technical Documents
Built an evaluation-first RAG system for dense technical documents, using a labeled benchmark to separate retrieval quality from grounded answer quality before optimizing generation.
Eval-Driven RAG for Technical Documents is a flagship build around one engineering question: when a RAG system's answers look good, can you demonstrate whether retrieval was actually strong enough to deserve credit for them?
The standalone repository is public here: rag-eval-system
What It Is
This is an evaluation-first RAG system for dense technical documents. I used a small but adversarial corpus and a labeled benchmark to test retrieval quality before treating answer generation as the main problem.
That matters because technical documents create exactly the kinds of failure cases that make weak RAG systems look better than they are: repeated terminology, section-sensitive evidence, multi-span answers, and queries that should trigger abstention instead of confident synthesis.
Why It Matters
Most RAG demos optimize for output quality first. I wanted a system that could show, with evidence, where retrieval succeeds, where it fails, and when grounded generation is still the bottleneck.
That makes the project useful as a proof artifact, not just a demo artifact.
System Design
The system is intentionally lightweight and inspectable.
- raw technical documents are normalized and chunked deterministically
- retrieval baselines run against the same chunk artifact
- grounded generation sits on top of the strongest retrieval path
- evaluation scores retrieval and answer quality separately
That separation is the point. If the answer is weak, I want to know whether the problem is evidence retrieval, evidence coverage, or answer selection.
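The separation above can be sketched as two independent metrics: one scoring whether the right evidence chunks were retrieved, one scoring the final answer text. The function names, metric choices, and toy values here are illustrative, not taken from the repository.

```python
def retrieval_hit(retrieved_ids, gold_ids, k=3):
    """1.0 if any gold evidence chunk id appears in the top-k retrieved ids."""
    return 1.0 if set(retrieved_ids[:k]) & set(gold_ids) else 0.0

def answer_f1(predicted: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    p, g = predicted.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

# Scoring the two stages separately localizes the failure:
# high retrieval_hit + low answer_f1 means answer selection is the bottleneck,
# while low retrieval_hit means the evidence never arrived at all.
```

With this split, a weak answer on a benchmark item immediately decomposes into "evidence never retrieved" versus "evidence retrieved but answer selection failed."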
Retrieval Baselines
The current benchmark compares three retrieval methods:
- keyword overlap as a simple deterministic floor
- BM25 as the strongest sparse baseline
- TF-IDF/cosine as a lightweight semantic baseline
The benchmark story is already useful: BM25 currently performs best on this dense technical corpus, while the lightweight semantic baseline loses precision on terminology-heavy and section-sensitive cases.
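To make the baseline comparison concrete, here is a minimal sketch of the two sparse scorers over a toy corpus: raw keyword overlap and a standard BM25 formulation (the TF-IDF/cosine baseline is omitted for brevity). The corpus, parameter values, and function names are illustrative assumptions, not code from the repository.

```python
import math
from collections import Counter

def keyword_overlap(query: str, doc: str) -> int:
    """Deterministic floor: count distinct query terms present in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Standard BM25: IDF-weighted, length-normalized term-frequency scoring."""
    toks = [d.lower().split() for d in docs]
    q = query.lower().split()
    n = len(docs)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(w for t in toks for w in set(t))  # document frequency per term
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in q:
            if w not in tf:
                continue
            idf = math.log((n - df[w] + 0.5) / (df[w] + 0.5) + 1)
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = [
    "the bm25 ranking function weights term frequency",
    "cosine similarity over tfidf vectors",
    "keyword overlap counts shared terms",
]
ranked = bm25_scores("bm25 term frequency", docs)
```

Running both scorers against the same chunk artifact is what makes the head-to-head comparison fair: the only variable is the scoring function.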
Generation Findings
The first published generation benchmark uses a deterministic grounded generator. That generator only uses retrieved chunks, which makes answer behavior easy to inspect.
The key finding is honest and important:
- deterministic generation is grounded
- grounded does not mean correct
- the current bottleneck is answer quality, not hallucination
An LLM-backed path also exists in the standalone repo, but it is not part of the current published benchmark results.
What This Proves
This project shows how I approach AI systems work when reliability matters:
- evaluation is part of system design, not a reporting step
- retrieval quality and generation quality should be separated
- engineering tradeoffs should be measurable, not implied
- public project artifacts should include failure analysis, not just polished outputs
That is the signal I wanted this flagship to carry.
Limitations
This is a strong first public release, not a finished platform.
- the corpus is small and intentionally adversarial
- chunking is simple and deterministic
- the semantic baseline is lightweight rather than neural
- the published generation result is the deterministic control, not an LLM benchmark
Those constraints are useful because they make the next engineering steps clear instead of hiding them behind premature complexity.