JungRAG: Technical Process

Building a RAG System for Jung's Works

A technical overview of how this retrieval-augmented generation system was built, from raw PDF extraction to semantic search and LLM integration.

1. System Overview

Retrieval-Augmented Generation (RAG) combines the knowledge retrieval capabilities of search systems with the natural language generation of large language models. Instead of relying solely on an LLM's training data, RAG retrieves relevant context from a curated knowledge base at query time.

This system indexes 34 volumes of Jung's writings—approximately 47,000 text chunks totaling millions of words—into a vector database. When a user asks a question, the system:

  1. Converts the query into a vector embedding
  2. Finds the most semantically similar text chunks
  3. Passes those chunks as context to Claude
  4. Returns a grounded response with citations
// Pipeline overview
PDF/EPUB → Extract → Clean → Chunk → Embed → Index → Query → Retrieve → Generate

2. Text Extraction

The source material was collected from Anna's Archive. The corpus consists of PDFs and EPUBs of varying quality—some are clean digital publications, others are OCR scans from the 1950s with significant artifacts.

PDF extraction uses PyMuPDF (fitz) to extract text while preserving reading order. For scanned documents, the extracted text often contains OCR errors like "j u n g" instead of "Jung" or broken hyphenation across page boundaries.

EPUB extraction parses the XML/HTML structure to extract text content while stripping formatting tags, preserving semantic structure like chapters and sections where possible.

# PDF extraction with PyMuPDF
import fitz  # PyMuPDF

def extract_pdf(path):
    doc = fitz.open(path)
    pages = []
    for page in doc:
        # "text" mode returns plain text in reading order
        pages.append(page.get_text("text"))
    doc.close()
    return "\n".join(pages)
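
The text above doesn't name the EPUB tooling; a minimal sketch of the same idea, assuming ebooklib for parsing and BeautifulSoup for tag stripping:

# EPUB extraction (sketch; the library choice is an assumption)
import ebooklib
from ebooklib import epub
from bs4 import BeautifulSoup

def extract_epub(path):
    book = epub.read_epub(path)
    sections = []
    # Each ITEM_DOCUMENT is one XHTML content file, roughly a chapter or section
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        # get_text() strips the markup; the separator keeps block boundaries
        sections.append(soup.get_text(separator="\n"))
    return "\n\n".join(sections)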

3. Text Cleaning

Raw extracted text requires extensive cleaning before it's suitable for semantic search. The cleaning pipeline handles several categories of issues:

OCR artifact correction: Fixes character spacing errors ("j u n g" → "Jung"), corrects common OCR misreadings, and removes scanning artifacts like "Copyrighted Material" watermarks that appear throughout some volumes.

Structural cleaning: Removes front matter (title pages, copyright notices, tables of contents) and back matter (indices, bibliographies) that would add noise to semantic search without adding value.

Normalization: Standardizes unicode characters, fixes broken hyphenation across line/page breaks, normalizes whitespace, and ensures consistent paragraph boundaries.

# OCR artifact correction
import re

def clean_ocr_artifacts(text):
    # Fix spaced-out names ("j u n g" → "Jung")
    text = re.sub(r"j\s+u\s+n\s+g", "Jung", text, flags=re.I)
    text = re.sub(r"f\s+r\s+e\s+u\s+d", "Freud", text, flags=re.I)

    # Remove scanning watermarks
    text = re.sub(r"Copyrighted Material\s*", "", text, flags=re.I)

    # Rejoin words hyphenated across line breaks
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)

    return text
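
The normalization pass described above, as a rough sketch (NFKC unicode normalization is an assumption; the exact rules aren't specified):

# Normalization (sketch)
import re
import unicodedata

def normalize_text(text):
    # Standardize unicode: ligatures, full-width forms, etc. → canonical characters
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    # Consistent paragraph boundaries: 3+ newlines → a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()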

4. Semantic Chunking

Chunking is perhaps the most critical step in RAG pipeline design. The goal is to split text into segments that are small enough to be relevant (a full book chapter would dilute the signal) but large enough to preserve context (a single sentence loses meaning).

Target size: Chunks target 400 tokens with a maximum of 600 tokens. This balances retrieval precision with context preservation—small enough that retrieved chunks are topically focused, large enough that they contain complete thoughts.

Boundary detection: The chunker respects semantic boundaries. It never splits mid-sentence—chunks end at sentence boundaries (periods, question marks). Where possible, it also preserves paragraph structure, preferring to break at paragraph boundaries over mid-paragraph.

Chapter detection: Each chunk is tagged with its source chapter, detected through regex patterns matching headers like "CHAPTER VII", "Part Two", or "Lecture 3". This metadata enables citation.

Concept tagging: Chunks are automatically tagged with Jungian concepts they contain (shadow, anima, individuation, etc.) using keyword detection. This enriches the metadata for potential filtered searches.
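
A sketch of that keyword detection; the lexicon shown is illustrative, not the full list used:

# Concept tagging (sketch)
JUNGIAN_CONCEPTS = [
    "shadow", "anima", "animus", "individuation",
    "archetype", "collective unconscious", "persona",
]

def tag_concepts(chunk_text):
    lowered = chunk_text.lower()
    return [concept for concept in JUNGIAN_CONCEPTS if concept in lowered]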

# Chunking parameters
TARGET_TOKENS = 400
MAX_TOKENS = 600
MIN_TOKENS = 80

# Result: 47,576 chunks from 34 volumes
# Average chunk: ~350 tokens
# Chunk metadata: work_title, chapter, concepts[],
#                 prev_chunk_id, next_chunk_id
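
A simplified sketch of the boundary-respecting chunker. It approximates token counts by word count and omits the paragraph, chapter, and concept handling described above:

# Sentence-boundary chunking (sketch)
import re

def chunk_text(text, target=TARGET_TOKENS, maximum=MAX_TOKENS, minimum=MIN_TOKENS):
    # Split on sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.?!])\s+", text)
    chunks, current, size = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # crude token estimate
        # Close the current chunk once it has reached the target size,
        # or if adding this sentence would push it past the maximum
        if current and (size >= target or size + n > maximum):
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += n
    if current:
        tail = " ".join(current)
        if size >= minimum or not chunks:
            chunks.append(tail)
        else:
            chunks[-1] += " " + tail  # merge a too-short tail into the previous chunk
    return chunks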

5. Vector Embeddings

Vector embeddings are the mathematical foundation of semantic search. An embedding model converts text into a high-dimensional vector (a list of numbers) that captures semantic meaning—texts with similar meanings have vectors that are close together in this space.

The model: This system uses multilingual-e5-large, a model available through Pinecone's hosted inference, which produces 1024-dimensional embeddings. The "e5" family is trained contrastively on text pairs, learning to place semantically similar texts nearby in vector space.

The linear algebra: Each text chunk becomes a point in 1024-dimensional space. To find relevant chunks for a query, we compute the cosine similarity between the query vector and all chunk vectors:

cos(θ) = (A · B) / (||A|| × ||B||)

Where A · B is the dot product and ||A|| is the magnitude. Cosine similarity ranges from -1 (opposite) to 1 (identical), with higher scores indicating greater semantic similarity.
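
In code, the formula is a one-liner (NumPy shown for illustration):

# Cosine similarity, as defined above
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))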

Query prefixing: The e5 model uses instruction prefixes—passages are prefixed with "passage: " during indexing, and queries are prefixed with "query: " during search. This asymmetric approach improves retrieval quality.

# Embedding generation
# During indexing (each chunk)
text = "passage: " + chunk.text
embedding = model.encode(text)  # → [0.023, -0.041, ..., 0.018]  (1024 dims)

# During query
query = "query: What is the shadow?"
query_embedding = model.encode(query)

# Similarity search returns top-k nearest neighbors

6. Semantic Retrieval

Vector database: The 47,576 chunk embeddings are stored in Pinecone, a managed vector database optimized for similarity search. Pinecone uses approximate nearest neighbor (ANN) algorithms to search billions of vectors in milliseconds.

ANN indexing: Exact nearest neighbor search requires comparing the query against every vector—O(n) complexity. Pinecone uses hierarchical navigable small world (HNSW) graphs to achieve approximate results in O(log n) time, trading a small amount of recall for massive speed gains.
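
For intuition, here is a toy NumPy version of the exact scan that an ANN index avoids (this is not how Pinecone works internally):

# Brute-force top-k: score the query against every chunk vector, O(n).
# Assumes rows of chunk_matrix and query_vec are L2-normalized,
# so a dot product equals cosine similarity.
import numpy as np

def exact_top_k(query_vec, chunk_matrix, k=6):
    scores = chunk_matrix @ query_vec        # one similarity score per chunk
    top = np.argsort(scores)[-k:][::-1]      # indices of the k highest scores
    return top, scores[top]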

Retrieval strategy: For each query, the system retrieves the top 6 chunks by cosine similarity, then keeps only those with a similarity score above 0.7, capped at 3 chunks. This ensures that only highly relevant context reaches the LLM.

// Retrieval with filtering
async function queryPinecone(query: string) {
  const embedding = await generateQueryEmbedding(query);

  const results = await index.query({
    vector: embedding,
    topK: 6,
    includeMetadata: true,
  });

  // Filter to high-relevance only
  return results.matches
    .filter(m => m.score > 0.7)
    .slice(0, 3);
}

7. LLM Integration

Context injection: Retrieved chunks are formatted and injected into Claude's system prompt. Each chunk is numbered [1], [2], etc., with its source (work title and chapter) clearly marked. This gives Claude both the content and the citation information needed for grounded responses.

Prompt engineering: The system prompt instructs Claude to be conversational rather than academic, to keep responses concise (2-4 sentences for simple questions), and to only cite sources when directly quoting. This creates a more natural interaction pattern.

Grounding: Because Claude receives the actual source text, it can accurately represent Jung's ideas rather than relying on its training data (which may contain inaccuracies or lack depth). The citations provide verifiability.

// System prompt structure
const systemPrompt = `
You're a knowledgeable guide to Jung's psychology.
Be concise and conversational.

Rules:
- Keep responses to 2-4 sentences max
- Only cite [1], [2] when directly quoting
- Sound like a knowledgeable friend, not a textbook

Sources:
[1] Memories, Dreams, Reflections, Ch. 6
"The shadow is a moral problem that challenges..."

[2] Archetypes of the Collective Unconscious
"The shadow personifies everything that the subject..."
`;

8. Frontend Architecture

Stack: The frontend is built with Next.js 14 using the App Router, deployed on Vercel. The API route handles the RAG pipeline—embedding the query, searching Pinecone, calling Claude, and returning the response with sources.

Serverless architecture: The entire backend runs as a serverless function. Each request spins up an isolated instance, calls the external APIs (Pinecone for retrieval, Anthropic for generation), and returns. No persistent server to maintain.

Exploration UX: The interface is designed for exploration rather than linear chat. Each query creates an expandable card showing the response and sources. Related concepts are extracted from responses and offered as follow-up queries, encouraging users to explore connections between Jung's ideas.

// API route flow
import { NextRequest, NextResponse } from "next/server";
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function POST(request: NextRequest) {
  const { query } = await request.json();

  // 1. Retrieve relevant chunks
  const sources = await queryPinecone(query);

  // 2. Build context for Claude
  const context = formatSourcesForPrompt(sources);

  // 3. Generate response
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 550,
    system: buildSystemPrompt(context),
    messages: [{ role: "user", content: query }],
  });

  // 4. Return with sources for citation
  return NextResponse.json({
    message: response.content[0].text,
    sources: sources,
  });
}

Summary

This RAG system demonstrates how to make a large corpus of specialized text accessible through natural language queries. The key technical decisions—chunk size, embedding model, retrieval filtering, prompt design—all compound to determine the quality of the final output.

The result is a system that can answer questions about Jungian psychology with responses grounded in primary sources, complete with citations that allow users to explore further in the original texts.