1. System Overview
Retrieval-Augmented Generation (RAG) combines the knowledge retrieval capabilities of search systems with the natural language generation of large language models. Instead of relying solely on an LLM's training data, RAG retrieves relevant context from a curated knowledge base at query time.
This system indexes 34 volumes of Jung's writings—approximately 47,000 text chunks totaling millions of words—into a vector database. When a user asks a question, the system:
- Converts the query into a vector embedding
- Finds the most semantically similar text chunks
- Passes those chunks as context to Claude
- Returns a grounded response with citations
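In outline, the query-time flow looks roughly like the sketch below. The function names and clients here are placeholders, not the actual implementation; the real code appears in the later sections.

def answer(query, embed, search, generate):
    # Placeholder sketch of the query-time RAG flow.
    # embed/search/generate stand in for the embedding model,
    # the Pinecone index, and the Claude client described below.
    query_vector = embed("query: " + query)                  # 1. embed the query
    matches = search(query_vector, top_k=6)                  # 2. nearest neighbors
    context = [m for m in matches if m["score"] > 0.7][:3]   # 3. keep strong matches
    reply = generate(context, query)                         # 4. grounded generation
    return {"answer": reply, "sources": context}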
2. Text Extraction
The source material was collected from Anna's Archive. The corpus consists of PDFs and EPUBs of varying quality—some are clean digital publications, others are OCR scans from the 1950s with significant artifacts.
PDF extraction uses PyMuPDF (fitz) to extract text while preserving reading order. For scanned documents, the extracted text often contains OCR errors like "j u n g" instead of "Jung" or broken hyphenation across page boundaries.
EPUB extraction parses the XML/HTML structure to extract text content while stripping formatting tags, preserving semantic structure like chapters and sections where possible.
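A minimal sketch of that EPUB path, assuming ebooklib and BeautifulSoup as the parsers (the actual pipeline's chapter handling may differ):

from ebooklib import epub, ITEM_DOCUMENT
from bs4 import BeautifulSoup

def extract_epub(path):
    book = epub.read_epub(path)
    parts = []
    # Each ITEM_DOCUMENT is an XHTML section, typically a chapter
    for item in book.get_items_of_type(ITEM_DOCUMENT):
        soup = BeautifulSoup(item.get_content(), "html.parser")
        parts.append(soup.get_text(separator="\n"))
    return "\n\n".join(parts)

The PyMuPDF-based PDF path is shown next.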
import fitz  # PyMuPDF

def extract_pdf(path):
    doc = fitz.open(path)
    text = ""
    for page in doc:
        text += page.get_text("text")
    return text

3. Text Cleaning
Raw extracted text requires extensive cleaning before it's suitable for semantic search. The cleaning pipeline handles several categories of issues:
OCR artifact correction: Fixes character spacing errors ("j u n g" → "Jung"), corrects common OCR misreadings, and removes scanning artifacts like "Copyrighted Material" watermarks that appear throughout some volumes.
Structural cleaning: Removes front matter (title pages, copyright notices, tables of contents) and back matter (indices, bibliographies) that would add noise to semantic search without adding value.
Normalization: Standardizes unicode characters, fixes broken hyphenation across line/page breaks, normalizes whitespace, and ensures consistent paragraph boundaries.
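As a rough sketch of the normalization pass (the NFKC step and the specific whitespace rules are assumptions; the OCR-artifact corrections appear in the next snippet):

import re
import unicodedata

def normalize_text(text):
    # Standardize unicode forms (e.g., the ligature "ﬁ" becomes "fi")
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    # Reduce three or more newlines to a single paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()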
import re

def clean_ocr_artifacts(text):
    # Fix spaced-out names
    text = re.sub(r"j\s+u\s+n\s+g", "Jung", text, flags=re.I)
    text = re.sub(r"f\s+r\s+e\s+u\s+d", "Freud", text, flags=re.I)
    # Remove watermarks
    text = re.sub(r"Copyrighted Material\s*", "", text, flags=re.I)
    # Fix broken hyphenation across line breaks
    text = re.sub(r"(\w+)-\n(\w+)", r"\1\2", text)
    return text

4. Semantic Chunking
Chunking is perhaps the most critical step in RAG pipeline design. The goal is to split text into segments that are small enough to be relevant (a full book chapter would dilute the signal) but large enough to preserve context (a single sentence loses meaning).
Target size: Chunks target 400 tokens with a maximum of 600 tokens. This balances retrieval precision with context preservation—small enough that retrieved chunks are topically focused, large enough that they contain complete thoughts.
Boundary detection: The chunker respects semantic boundaries. It never splits mid-sentence—chunks end at sentence boundaries (periods, question marks). Where possible, it also preserves paragraph structure, preferring to break at paragraph boundaries over mid-paragraph.
Chapter detection: Each chunk is tagged with its source chapter, detected through regex patterns matching headers like "CHAPTER VII", "Part Two", or "Lecture 3". This metadata enables citation.
Concept tagging: Chunks are automatically tagged with Jungian concepts they contain (shadow, anima, individuation, etc.) using keyword detection. This enriches the metadata for potential filtered searches.
TARGET_TOKENS = 400
MAX_TOKENS = 600
MIN_TOKENS = 80

# Result: 47,576 chunks from 34 volumes
# Average chunk: ~350 tokens
# Chunk metadata: work_title, chapter, concepts[],
#                 prev_chunk_id, next_chunk_id
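A simplified sketch of the boundary-respecting chunker (regex sentence splitting and whitespace-based token counting are stand-ins here; the real chunker also records chapter and concept metadata):

import re

TARGET_TOKENS, MIN_TOKENS = 400, 80  # as above

def chunk_by_sentences(text):
    # Split at sentence boundaries so no chunk ever breaks mid-sentence
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # crude token estimate
        if current and count + n > TARGET_TOKENS:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        tail = " ".join(current)
        if count >= MIN_TOKENS or not chunks:
            chunks.append(tail)
        else:
            chunks[-1] += " " + tail  # fold a short tail into the previous chunk
    return chunks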
5. Vector Embeddings
Vector embeddings are the mathematical foundation of semantic search. An embedding model converts text into a high-dimensional vector (a list of numbers) that captures semantic meaning—texts with similar meanings have vectors that are close together in this space.
The model: This system uses Pinecone's multilingual-e5-large model, which produces 1024-dimensional embeddings. The "e5" architecture is trained contrastively on text pairs, learning to place semantically similar texts nearby in vector space.
The linear algebra: Each text chunk becomes a point in 1024-dimensional space. To find relevant chunks for a query, we compute the cosine similarity between the query vector and all chunk vectors:
similarity(A, B) = (A · B) / (||A|| ||B||)

Where A · B is the dot product and ||A|| is the magnitude (Euclidean norm). Cosine similarity ranges from -1 (opposite) to 1 (identical), with higher scores indicating greater semantic similarity.
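For illustration (the production system delegates this to Pinecone), the same computation in NumPy:

import numpy as np

def cosine_similarity(a, b):
    # (A · B) / (||A|| * ||B||)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: cosine_similarity([1, 0], [1, 1]) ≈ 0.707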
Query prefixing: The e5 model uses instruction prefixes—passages are prefixed with "passage: " during indexing, and queries are prefixed with "query: " during search. This asymmetric approach improves retrieval quality.
# During indexing (each chunk)
text = "passage: " + chunk.text
embedding = model.encode(text)
# → [0.023, -0.041, ..., 0.018]  (1024 dims)

# During query
query = "query: What is the shadow?"
query_embedding = model.encode(query)
# Similarity search returns top-k nearest neighbors
6. Semantic Retrieval
Vector database: The 47,576 chunk embeddings are stored in Pinecone, a managed vector database optimized for similarity search. Pinecone uses approximate nearest neighbor (ANN) algorithms to search billions of vectors in milliseconds.
ANN indexing: Exact nearest neighbor search requires comparing the query against every vector—O(n) complexity. Pinecone uses hierarchical navigable small world (HNSW) graphs to achieve approximate results in O(log n) time, trading a small amount of recall for massive speed gains.
Retrieval strategy: For each query, the system retrieves the top 6 chunks by cosine similarity, discards any with a similarity score of 0.7 or below, and keeps at most 3 of the remainder. This ensures only highly relevant context reaches the LLM.
async function queryPinecone(query: string) {
  const embedding = await generateQueryEmbedding(query);
  const results = await index.query({
    vector: embedding,
    topK: 6,
    includeMetadata: true,
  });
  // Filter to high-relevance only
  return results.matches
    .filter(m => m.score > 0.7)
    .slice(0, 3);
}

7. LLM Integration
Context injection: Retrieved chunks are formatted and injected into Claude's system prompt. Each chunk is numbered [1], [2], etc., with its source (work title and chapter) clearly marked. This gives Claude both the content and the citation information needed for grounded responses.
Prompt engineering: The system prompt instructs Claude to be conversational rather than academic, to keep responses concise (2-4 sentences for simple questions), and to only cite sources when directly quoting. This creates a more natural interaction pattern.
Grounding: Because Claude receives the actual source text, it can accurately represent Jung's ideas rather than relying on its training data (which may contain inaccuracies or lack depth). The citations provide verifiability.
const systemPrompt = `
You're a knowledgeable guide to Jung's psychology. Be concise and conversational.

Rules:
- Keep responses to 2-4 sentences max
- Only cite [1], [2] when directly quoting
- Sound like a knowledgeable friend, not a textbook

Sources:
[1] Memories, Dreams, Reflections, Ch. 6
"The shadow is a moral problem that challenges..."
[2] Archetypes of the Collective Unconscious
"The shadow personifies everything that the subject..."
`;
8. Frontend Architecture
Stack: The frontend is built with Next.js 14 using the App Router, deployed on Vercel. The API route handles the RAG pipeline—embedding the query, searching Pinecone, calling Claude, and returning the response with sources.
Serverless architecture: The entire backend runs as a serverless function. Each request spins up an isolated instance, calls the external APIs (Pinecone for retrieval, Anthropic for generation), and returns. No persistent server to maintain.
Exploration UX: The interface is designed for exploration rather than linear chat. Each query creates an expandable card showing the response and sources. Related concepts are extracted from responses and offered as follow-up queries, encouraging users to explore connections between Jung's ideas.
export async function POST(request: NextRequest) {
  const { query } = await request.json();

  // 1. Retrieve relevant chunks
  const sources = await queryPinecone(query);

  // 2. Build context for Claude
  const context = formatSourcesForPrompt(sources);

  // 3. Generate response
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 550,
    system: buildSystemPrompt(context),
    messages: [{ role: "user", content: query }],
  });

  // 4. Return with sources for citation
  return NextResponse.json({
    message: response.content[0].text,
    sources: sources,
  });
}

Summary
This RAG system demonstrates how to make a large corpus of specialized text accessible through natural language queries. The key technical decisions—chunk size, embedding model, retrieval filtering, prompt design—all compound to determine the quality of the final output.
The result is a system that can answer questions about Jungian psychology with responses grounded in primary sources, complete with citations that allow users to explore further in the original texts.