Eval-Driven RAG for Technical Documents
Built an evaluation-first RAG system for dense technical documents, using a labeled benchmark to separate retrieval quality from grounded answer quality before optimizing generation.
Eval-Driven RAG for Technical Documents is a flagship build around one engineering question: when a RAG system's answers look good, can you demonstrate whether retrieval was actually strong enough to deserve credit for them?
The standalone repository is public here: rag-eval-system
What It Is
This is an evaluation-first RAG system for dense technical documents. I used a small but adversarial corpus and a labeled benchmark to test retrieval quality before treating answer generation as the main problem.
That matters because technical documents create exactly the kinds of failure cases that make weak RAG systems look better than they are: repeated terminology, section-sensitive evidence, multi-span answers, and queries that should trigger abstention instead of confident synthesis.
Why It Matters
Most RAG demos optimize for output quality first. I wanted a system that could show, with evidence, where retrieval succeeds, where it fails, and when grounded generation is still the bottleneck.
That makes the project useful as a proof artifact, not just a demo artifact.
System Design
The system is intentionally lightweight and inspectable.
- raw technical documents are normalized and chunked deterministically
- retrieval baselines run against the same chunk artifact
- grounded generation sits on top of the strongest retrieval path
- evaluation scores retrieval and answer quality separately
That separation is the point. If the answer is weak, I want to know whether the problem is evidence retrieval, evidence coverage, or answer selection.
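The separation above can be sketched as two independent metrics: one scoring whether the right evidence chunks were retrieved, one scoring the final answer text. The function names, metric choices, and toy values here are illustrative, not taken from the repository.

```python
def retrieval_hit(retrieved_ids, gold_ids, k=3):
    """1.0 if any gold evidence chunk id appears in the top-k retrieved ids."""
    return 1.0 if set(retrieved_ids[:k]) & set(gold_ids) else 0.0

def answer_f1(predicted: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and the gold answer."""
    p, g = predicted.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

# Scoring the two stages separately localizes the failure:
# high retrieval_hit + low answer_f1 means answer selection is the bottleneck,
# while low retrieval_hit means the evidence never arrived at all.
```

With this split, a weak answer on a benchmark item immediately decomposes into "evidence never retrieved" versus "evidence retrieved but answer selection failed."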
Retrieval Baselines
The current benchmark compares three retrieval methods:
- keyword overlap as a simple deterministic floor
- BM25 as the strongest sparse baseline
- TF-IDF/cosine as a lightweight semantic baseline
The benchmark story is already useful: BM25 currently performs best on this dense technical corpus, while the lightweight semantic baseline loses precision on terminology-heavy and section-sensitive cases.
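To make the baseline comparison concrete, here is a minimal sketch of the two sparse scorers over a toy corpus: raw keyword overlap and a standard BM25 formulation (the TF-IDF/cosine baseline is omitted for brevity). The corpus, parameter values, and function names are illustrative assumptions, not code from the repository.

```python
import math
from collections import Counter

def keyword_overlap(query: str, doc: str) -> int:
    """Deterministic floor: count distinct query terms present in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Standard BM25: IDF-weighted, length-normalized term-frequency scoring."""
    toks = [d.lower().split() for d in docs]
    q = query.lower().split()
    n = len(docs)
    avgdl = sum(len(t) for t in toks) / n
    df = Counter(w for t in toks for w in set(t))  # document frequency per term
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in q:
            if w not in tf:
                continue
            idf = math.log((n - df[w] + 0.5) / (df[w] + 0.5) + 1)
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = [
    "the bm25 ranking function weights term frequency",
    "cosine similarity over tfidf vectors",
    "keyword overlap counts shared terms",
]
ranked = bm25_scores("bm25 term frequency", docs)
```

Running both scorers against the same chunk artifact is what makes the head-to-head comparison fair: the only variable is the scoring function.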
Generation Findings
The first published generation benchmark uses a deterministic grounded generator. That generator only uses retrieved chunks, which makes answer behavior easy to inspect.
The key finding is honest and important:
- deterministic generation is grounded
- grounded does not mean correct
- the current bottleneck is answer quality, not hallucination
An LLM-backed path also exists in the standalone repo, but it is not part of the current published benchmark results.
What This Proves
This project shows how I approach AI systems work when reliability matters:
- evaluation is part of system design, not a reporting step
- retrieval quality and generation quality should be separated
- engineering tradeoffs should be measurable, not implied
- public project artifacts should include failure analysis, not just polished outputs
That is the signal I wanted this flagship to carry.
Limitations
This is a strong first public release, not a finished platform.
- the corpus is small and intentionally adversarial
- chunking is simple and deterministic
- the semantic baseline is lightweight rather than neural
- the published generation result is the deterministic control, not an LLM benchmark
Those constraints are useful because they make the next engineering steps clear instead of hiding them behind premature complexity.