2025 · AI Systems Builder · Live

RAG Systems That Hold Up

Designed PDF-based retrieval systems with careful chunking, grounding, evaluation, and failure analysis so answers stayed useful beyond the demo stage.

  • RAG
  • Retrieval
  • Evaluation
  • Grounding
  • PDF Systems

RAG Systems That Hold Up is about moving retrieval-augmented generation out of the demo phase. The core work was not making a model sound fluent. It was designing a retrieval system that could stay grounded, inspectable, and useful when documents became messy and the questions stopped being easy.

Overview

This project focused on PDF-based RAG workflows where answer quality depended on far more than prompt wording. The real work sat in document parsing, chunk design, retrieval quality, citation grounding, and failure analysis. I approached the system as an information retrieval problem wrapped around an LLM, not the other way around.

Context

Many document-heavy AI workflows rely on PDFs that were never designed for clean machine use. Text extraction often yields broken output: inconsistent section structure, duplicated running headers, tables and footnotes flattened into the body text, and long passages that make naive chunking behave poorly.

That creates a familiar failure pattern: the system appears strong in a controlled demo, but becomes unreliable when retrieval quality drops or context boundaries become ambiguous. For a RAG workflow to be genuinely useful, the retrieval layer has to be strong enough that the generation layer is not constantly compensating for bad evidence.

Problem

The central problem was trustworthiness.

RAG systems often fail for reasons that are easy to hide but hard to ignore in production:

  • chunks are too large, too small, or semantically incoherent
  • retrieved passages are adjacent to the answer but not actually sufficient for grounding it
  • PDFs introduce extraction noise that pollutes indexing quality
  • the model responds fluently even when retrieval is weak
  • teams optimize for response quality before understanding retrieval failure modes

The challenge was to design a workflow where retrieval quality could be inspected directly and where answer behavior stayed tied to evidence.
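Making retrieval inspectable mostly means recording enough per-query state to audit an answer after the fact. A minimal sketch of such a trace might look like this; the class names, fields, and score threshold are hypothetical, not the project's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class RetrievedChunk:
    chunk_id: str
    source_page: int
    score: float
    text: str

@dataclass
class RetrievalTrace:
    """Everything needed to audit one answer: what was asked, what was retrieved, and why."""
    query: str
    chunks: list[RetrievedChunk] = field(default_factory=list)

    def sufficient(self, min_score: float = 0.5) -> bool:
        # A crude inspectability check: did any retrieved evidence clear the score bar?
        return any(c.score >= min_score for c in self.chunks)
```

With traces like this logged per query, "why did the system say that?" becomes a lookup rather than a reconstruction.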

System Design

I structured the system around four layers: document preparation, chunking, retrieval, and grounded response generation.

First, PDFs had to be parsed into a cleaner intermediate representation so that indexing reflected document structure as faithfully as possible. Then chunking strategies were designed to preserve semantic coherence while still producing retrieval units small enough to rank effectively.

On top of that, the retrieval layer was tuned around practical usefulness rather than raw top-k output. The goal was to retrieve evidence that was both relevant and sufficient, not merely text that shared keywords with the query.

Generation came after retrieval was made inspectable. Responses were shaped around grounded context, with an emphasis on traceability back to retrieved passages rather than open-ended completion behavior.
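Traceability back to retrieved passages can be enforced at prompt-assembly time by numbering the evidence and asking for citations. This is a generic sketch of that pattern, not the project's actual prompt; the wording and function name are assumptions.

```python
def grounded_prompt(query: str, passages: list[str]) -> str:
    """Assemble a prompt that ties the answer back to numbered evidence passages."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using ONLY the passages below. Cite passage numbers like [1].\n"
        "If the passages are insufficient, say so instead of guessing.\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {query}\nAnswer:"
    )
```

Numbered citations make each claim in the response checkable against a specific retrieval unit, which is what keeps generation tied to evidence rather than open-ended completion.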

Key Technical Decisions

One important decision was to treat chunking as a first-class design variable. Different documents required different chunk boundaries, overlap strategies, and metadata handling. Poor chunk design creates downstream retrieval problems that no prompt can fully repair.
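The overlap-and-metadata idea can be sketched with a sliding window that keeps a character offset on every chunk for traceability. This is one simple strategy among the several the project weighed, with hypothetical names and sizes; real boundaries would follow document structure, not fixed widths.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_id: str
    start: int  # character offset into the source, kept for traceability

def sliding_chunks(text: str, doc_id: str, size: int = 800, overlap: int = 200) -> list[Chunk]:
    """Fixed-size chunks with overlap, so boundary sentences appear in two units."""
    assert 0 <= overlap < size
    chunks, pos = [], 0
    while pos < len(text):
        chunks.append(Chunk(text[pos:pos + size], doc_id, pos))
        pos += size - overlap  # step back by `overlap` characters each window
    return chunks
```

The overlap trades index size for robustness: evidence that straddles a boundary still lands intact in at least one retrieval unit.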

Another was to separate retrieval quality from answer quality during analysis. If a response was weak, I wanted to know whether the problem came from parsing, chunking, ranking, or generation. That separation made debugging much faster and prevented the system from being tuned blindly at the prompt layer.
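That layer-by-layer attribution can be mechanized for queries with known gold evidence: check which layer first lost the evidence. The sketch below uses naive substring matching and hypothetical names; a real harness would match normalized spans.

```python
def diagnose(gold_span: str, parsed_text: str, chunks: list[str],
             retrieved: list[str], answer_text: str) -> str:
    """Attribute a bad answer to the first layer that lost the gold evidence."""
    if gold_span not in parsed_text:
        return "parsing"     # evidence never survived extraction
    if not any(gold_span in c for c in chunks):
        return "chunking"    # evidence split across chunk boundaries
    if not any(gold_span in c for c in retrieved):
        return "ranking"     # evidence indexed but not retrieved
    if gold_span not in answer_text:
        return "generation"  # evidence retrieved but not used
    return "ok"
```

Running this over a labeled query set turns "the answer was bad" into a per-layer failure count, which is what makes tuning targeted rather than blind.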

I also emphasized grounding over style. A shorter answer tied clearly to strong evidence is usually more useful than a polished answer built on weak retrieval.

Reliability and Evaluation

Evaluation focused on whether the system could behave reliably across non-ideal documents and harder question patterns.

That meant examining retrieval failures directly: missed evidence, ambiguous chunks, irrelevant top results, weak grounding, and responses that exceeded what the retrieved context justified. Instead of asking only whether the answer looked good, I asked whether the evidence path was strong enough to trust the answer.
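Missed-evidence failures in particular reduce to a simple metric once queries are labeled with their gold chunk: recall at k over the retrieval results. A minimal sketch, assuming per-query result lists of chunk ids and one gold id each:

```python
def retrieval_recall_at_k(results: list[list[str]], gold: list[str], k: int = 5) -> float:
    """Fraction of queries whose gold chunk id appears in the top-k retrieved ids."""
    if not gold:
        return 0.0
    hits = sum(1 for retrieved, g in zip(results, gold) if g in retrieved[:k])
    return hits / len(gold)
```

A metric like this separates "retrieval missed the evidence" from every downstream failure mode, so answer quality and retrieval quality can be tracked independently.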

This kind of evaluation matters because RAG systems often degrade quietly. Without explicit failure analysis, a system can appear effective while repeatedly succeeding for the wrong reasons.

Outcome

The result was a more reliable retrieval workflow for PDF-based knowledge systems: cleaner chunk behavior, more useful retrieval, stronger grounding, and a clearer view of where the system still failed.

That makes the system more valuable not just technically, but operationally. When retrieval is inspectable and failure modes are understood, the surrounding workflow becomes easier to improve with discipline rather than guesswork.

What I Owned

I owned the design direction for retrieval behavior, chunking strategy, grounding discipline, and evaluation logic, including how the system was analyzed when retrieval and answer quality diverged.

Reflection

This project reflects how I think about AI systems more broadly: reliability comes from architecture and evaluation, not just model capability. For RAG especially, the difference between something impressive and something trustworthy is usually the quality of the retrieval system behind it.