Overview
This guide shows how to build a production-ready Retrieval-Augmented Generation (RAG) stack using Mastra for orchestration and ZeroEntropy.dev for fast, scalable retrieval. You will ingest content, embed and index it, retrieve top-k passages, rerank, and ground your LLM responses with citations.
Why Mastra + ZeroEntropy
Mastra: TypeScript-first agents, tools, and workflows for clean RAG pipelines.
ZeroEntropy: High-performance vector search, hybrid recall, and rerankers with simple APIs.
Architecture
Ingest: Load PDFs, HTML, Markdown, or API data.
Chunk: Split into semantic passages with overlap for context windows.
Embed: Create vectors using your preferred model.
Index: Store embeddings in ZeroEntropy for ANN search.
Retrieve: Top-k candidates per query; optionally, hybrid lexical+vector.
Rerank: Cross-encoder or rules to refine evidence.
Generate: LLM answers grounded in retrieved passages with citations.
Feedback: Log failures and improve prompts, chunking, or schemas.
Step 1: Prepare Your Corpus
Normalize filenames, extract text, and capture metadata like source, section, and timestamp. Good metadata enables time-bounded retrieval, tenanting, and audit trails.
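For concreteness, a minimal chunk and metadata shape might look like the following TypeScript types. The field names are illustrative, not a schema required by Mastra or ZeroEntropy.

```typescript
// Hypothetical metadata shape for ingested chunks; adjust fields to your corpus.
interface ChunkMetadata {
  source: string;      // e.g. "handbook/security.md" or a source URL
  section: string;     // heading or logical section the chunk came from
  tenant: string;      // enables per-tenant scoping at query time
  docType: "pdf" | "html" | "markdown" | "api";
  updatedAt: string;   // ISO timestamp for time-bounded retrieval
}

interface Chunk {
  id: string;          // stable ID used for citations and incremental updates
  text: string;
  metadata: ChunkMetadata;
}
```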
Step 2: Chunk and Embed
Use token-aware chunking (e.g., 500–800 tokens with 10–15% overlap).
Embed each chunk; store vector, raw text, and metadata.
Keep a stable chunk ID to support citations and updates.
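A rough sketch of this step, using a words-per-token heuristic for token-aware chunking and the Vercel AI SDK's embedMany for batch embedding; both are stand-ins for whatever tokenizer and embedding client you prefer.

```typescript
import { readFile } from "node:fs/promises";
import { embedMany } from "ai";
import { openai } from "@ai-sdk/openai";

// Token-aware chunking approximated with a ~0.75 words-per-token heuristic;
// swap in a real tokenizer (e.g. tiktoken) for production use.
function chunkText(text: string, chunkTokens = 600, overlapTokens = 90): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const wordsPerChunk = Math.floor(chunkTokens * 0.75);
  const step = wordsPerChunk - Math.floor(overlapTokens * 0.75);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + wordsPerChunk).join(" "));
    if (start + wordsPerChunk >= words.length) break;
  }
  return chunks;
}

const text = await readFile("docs/handbook.md", "utf8"); // example path
const pieces = chunkText(text);

// One batched embedding call; the model name is an example, not a requirement.
const { embeddings } = await embedMany({
  model: openai.embedding("text-embedding-3-small"),
  values: pieces,
});
```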
Step 3: Index in ZeroEntropy
Create a collection and upsert chunks with vector+metadata.
Enable filters for tenant/team, doc type, and time windows.
Optionally precompute BM25 terms for hybrid search.
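The sketch below illustrates the upsert shape, reusing the chunk fields from Steps 1 and 2. The endpoint path, auth header, and payload fields are placeholders, so check ZeroEntropy's API reference for the actual collection and document endpoints.

```typescript
// Illustrative upsert against a ZeroEntropy collection; URL and payload are placeholders.
type IndexedChunk = {
  id: string;                         // stable chunk ID from Step 1
  text: string;
  vector: number[];                   // embedding from Step 2
  metadata: Record<string, unknown>;  // tenant, docType, updatedAt, ...
};

async function upsertChunks(collection: string, chunks: IndexedChunk[]): Promise<void> {
  const res = await fetch(
    `https://api.zeroentropy.dev/v1/collections/${collection}/upsert`, // placeholder URL
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.ZEROENTROPY_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ documents: chunks }),
    },
  );
  if (!res.ok) throw new Error(`Upsert failed: ${res.status} ${await res.text()}`);
}
```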
Step 4: Wire Retrieval in Mastra
Define a tool that calls ZeroEntropy’s search API with top-k and filters.
Add a reranker step to rescore candidates before generation.
Pass the final evidence set to your LLM prompt template.
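A retrieval tool along these lines follows Mastra's createTool pattern; verify the exact signature against the current Mastra docs, and note that the ZeroEntropy search call is again a placeholder.

```typescript
import { createTool } from "@mastra/core/tools";
import { z } from "zod";

// Retrieval tool the agent can call; the search URL and payload are illustrative.
export const searchDocs = createTool({
  id: "search-docs",
  description: "Retrieve top-k passages from the ZeroEntropy index",
  inputSchema: z.object({
    query: z.string(),
    tenant: z.string(),
    k: z.number().default(20),
  }),
  execute: async ({ context }) => {
    const res = await fetch("https://api.zeroentropy.dev/v1/collections/docs/search", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.ZEROENTROPY_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        query: context.query,
        topK: context.k,
        filter: { tenant: context.tenant }, // scope results per tenant
      }),
    });
    if (!res.ok) throw new Error(`Search failed: ${res.status}`);
    return res.json(); // candidate passages handed to the reranker step
  },
});
```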
Prompt Template (Generation)
Use a structure like: “You are a factual assistant. Use only the Evidence below. Cite chunk IDs next to claims. If unsure, say you don’t know.” Include a short instruction to list sources at the end for traceability.
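One way to assemble that prompt in code, with chunk IDs inlined so the model can cite them; the Evidence shape and buildPrompt helper are illustrative.

```typescript
// Assemble the generation prompt from reranked evidence; chunk IDs are
// included inline so the model can cite them next to each claim.
type Evidence = { id: string; text: string };

function buildPrompt(question: string, evidence: Evidence[]): string {
  const evidenceBlock = evidence
    .map((e) => `[${e.id}] ${e.text}`)
    .join("\n\n");

  return [
    "You are a factual assistant. Use only the Evidence below.",
    "Cite chunk IDs next to claims. If unsure, say you don't know.",
    "List the cited sources at the end of your answer.",
    "",
    "Evidence:",
    evidenceBlock,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```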
Policies and Guardrails
Filtering: Scope by tenant, role, region, and recency.
Safety: Block disallowed categories and enforce redaction rules at retrieval time.
Citations: Require chunk IDs; reject answers lacking evidence.
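A simple post-generation check for the citation rule might look like this; the [chunk-id] convention matches the prompt template above, so adapt the regex to your own ID format.

```typescript
// Post-generation guardrail: reject answers that cite no known chunk ID,
// or that cite IDs outside the retrieved evidence set.
function hasValidCitations(answer: string, evidenceIds: string[]): boolean {
  const cited = [...answer.matchAll(/\[([^\]]+)\]/g)].map((m) => m[1]);
  return cited.length > 0 && cited.every((id) => evidenceIds.includes(id));
}
```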
Evaluation
Answer quality: Exact-match/F1 or rubric scores with human review.
Attribution: Citation precision/recall; broken-citation rate.
Latency: P95 end-to-end; per-stage timing (retrieve, rerank, generate).
Cost: Spend drivers such as top-k, reranker batch size, and max-token limits.
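For attribution, per-answer citation precision and recall can be computed against a labeled eval set, for example:

```typescript
// Citation precision = cited chunks that are truly relevant / all cited chunks.
// Citation recall = relevant chunks that were cited / all relevant chunks.
function citationScores(cited: string[], relevant: string[]) {
  const citedSet = new Set(cited);
  const relevantSet = new Set(relevant);
  const hits = [...citedSet].filter((id) => relevantSet.has(id)).length;
  return {
    precision: citedSet.size ? hits / citedSet.size : 0,
    recall: relevantSet.size ? hits / relevantSet.size : 0,
  };
}
```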
Performance Tips
Use smaller, high-quality embeddings; test dimensionality vs recall.
Start with k=20 retrieval, rerank to 5–8 evidence chunks.
Cache hot queries; memoize reranker scores for frequent pairs.
Tune chunk size and overlap by domain; legal/medical often need larger chunks.
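A minimal in-process memo for reranker scores might look like the following; swap the Map for Redis or another shared store if multiple workers serve queries.

```typescript
// Memoize reranker scores for frequent (query, chunk) pairs.
const rerankCache = new Map<string, number>();

async function cachedRerankScore(
  query: string,
  chunkId: string,
  score: (q: string, id: string) => Promise<number>, // your reranker call
): Promise<number> {
  const key = `${query}::${chunkId}`;
  const hit = rerankCache.get(key);
  if (hit !== undefined) return hit;
  const value = await score(query, chunkId);
  rerankCache.set(key, value);
  return value;
}
```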
Operational Playbook
Data freshness: Automate ingest on file change; re-embed only affected chunks (see the hashing sketch after this list).
Versioning: Keep index versions; roll back quickly on bad ingests.
Observability: Log k-hit distribution, reranker deltas, and refusal rates.
A/B testing: Compare prompts, rerankers, and chunkers on a fixed eval set.
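For the data-freshness item above, one way to re-embed only affected chunks is to hash chunk text and compare against the hashes recorded at the last ingest:

```typescript
import { createHash } from "node:crypto";

// Return only chunks whose content hash changed since the previous ingest.
// previousHashes would typically be loaded from your ingest metadata store.
function changedChunks(
  chunks: { id: string; text: string }[],
  previousHashes: Map<string, string>,
): { id: string; text: string }[] {
  return chunks.filter((c) => {
    const hash = createHash("sha256").update(c.text).digest("hex");
    return previousHashes.get(c.id) !== hash;
  });
}
```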
Use Cases
Internal knowledge: Policies, SOPs, architecture docs with RBAC filters.
Customer support: Deflection bots with strict citations and update freshness.
Healthcare/finance: Time-bounded retrieval, auditable answers, PII/PHI handling.
Getting Started
Explore the Mastra examples on GitHub.
Create a ZeroEntropy collection at ZeroEntropy.dev and ingest a small pilot set.
Ship an MVP with k=20 retrieval reranked to 5, strict citations, and a latency budget of 1–2 s.
Conclusion
Mastra orchestrates clean RAG workflows, and ZeroEntropy delivers fast, accurate retrieval. Together, they provide a pragmatic path from prototype to production: better grounding, clearer citations, and predictable latency. Start small, instrument well, and iterate on chunking, prompts, and reranking to reach reliable RAG at scale.