How to Do RAG with Mastra and ZeroEntropy

Aug 17, 2025


Overview

This guide shows how to build a production-ready Retrieval-Augmented Generation (RAG) stack using Mastra for orchestration and ZeroEntropy.dev for fast, scalable retrieval. You will ingest content, embed and index it, retrieve top-k passages, rerank, and ground your LLM responses with citations.

Why Mastra + ZeroEntropy

  • Mastra: TypeScript-first agents, tools, and workflows for clean RAG pipelines.

  • ZeroEntropy: High-performance vector search, hybrid recall, and rerankers with simple APIs.

Architecture

  • Ingest: Load PDFs, HTML, Markdown, or API data.

  • Chunk: Split into semantic passages with overlap for context windows.

  • Embed: Create vectors using your preferred model.

  • Index: Store embeddings in ZeroEntropy for ANN search.

  • Retrieve: Top-k candidates per query; optionally, hybrid lexical+vector.

  • Rerank: Cross-encoder or rules to refine evidence.

  • Generate: LLM answers grounded in retrieved passages with citations.

  • Feedback: Log failures and improve prompts, chunking, or schemas.

Step 1: Prepare Your Corpus

Normalize filenames, extract text, and capture metadata like source, section, and timestamp. Good metadata enables time-bounded retrieval, tenanting, and audit trails.
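A minimal sketch of this normalization step in TypeScript. The field names (`source`, `section`, `timestamp`, `tenant`) and the ID scheme are illustrative choices, not a required schema:

```typescript
// Sketch: normalize a raw file into a document record carrying the metadata
// that later enables time-bounded retrieval, tenanting, and audit trails.
interface DocRecord {
  id: string;
  text: string;
  metadata: {
    source: string;    // normalized filename or URL
    section: string;   // heading or logical section, if known
    timestamp: string; // ISO 8601, enables time-bounded retrieval
    tenant?: string;   // optional: supports per-tenant filtering
  };
}

function normalizeDoc(
  rawName: string,
  text: string,
  section: string,
  modified: Date,
  tenant?: string
): DocRecord {
  // Lowercase and replace whitespace so IDs stay stable across re-ingests.
  const source = rawName.trim().toLowerCase().replace(/\s+/g, "-");
  return {
    id: `${source}#${section.toLowerCase().replace(/\s+/g, "-")}`,
    text: text.trim(),
    metadata: { source, section, timestamp: modified.toISOString(), tenant },
  };
}
```

Keeping the ID deterministic (filename plus section) means a re-ingest of the same file overwrites rather than duplicates.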

Step 2: Chunk and Embed

  • Use token-aware chunking (e.g., 500–800 tokens with 10–15% overlap).

  • Embed each chunk; store vector, raw text, and metadata.

  • Keep a stable chunk ID to support citations and updates.
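The chunking step above can be sketched as follows. A real pipeline would count tokens with the embedding model's own tokenizer; whitespace splitting stands in for it here:

```typescript
// Sketch of token-aware chunking with overlap. Whitespace "tokens" are a
// stand-in for a real tokenizer; chunkSize/overlapRatio follow the guide's
// suggested 500-800 tokens with 10-15% overlap.
function chunkTokens(
  text: string,
  chunkSize = 600,
  overlapRatio = 0.12
): { id: string; text: string }[] {
  const tokens = text.split(/\s+/).filter(Boolean);
  const overlap = Math.floor(chunkSize * overlapRatio);
  const step = chunkSize - overlap;
  const chunks: { id: string; text: string }[] = [];
  for (let start = 0; start < tokens.length; start += step) {
    const piece = tokens.slice(start, start + chunkSize);
    // Stable positional chunk ID: supports citations and targeted updates.
    chunks.push({ id: `chunk-${chunks.length}`, text: piece.join(" ") });
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}
```

Each chunk's last `overlap` tokens reappear at the start of the next chunk, so passages that straddle a boundary are still retrievable whole.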

Step 3: Index in ZeroEntropy

  • Create a collection and upsert chunks with vector+metadata.

  • Enable filters for tenant/team, doc type, and time windows.

  • Optionally precompute BM25 terms for hybrid search.

Step 4: Wire Retrieval in Mastra

  • Define a tool that calls ZeroEntropy’s search API with top-k and filters.

  • Add a reranker step to rescore candidates before generation.

  • Pass the final evidence set to your LLM prompt template.
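The retrieve-then-rerank flow can be sketched as a plain async function; in Mastra you would wrap this in a tool definition with an input schema. The `searchFn`/`rerankFn` signatures are illustrative stand-ins for the ZeroEntropy search and reranker calls, injected here so the pipeline logic is testable:

```typescript
// Sketch of the retrieve -> rerank step. searchFn and rerankFn are
// placeholders for the actual ZeroEntropy API calls.
interface Hit {
  chunkId: string;
  text: string;
  score: number;
}

async function retrieveEvidence(
  query: string,
  searchFn: (q: string, topK: number) => Promise<Hit[]>,
  rerankFn: (q: string, hits: Hit[]) => Promise<Hit[]>,
  topK = 20, // wide ANN (or hybrid) recall
  keep = 6   // narrow evidence set passed to the LLM
): Promise<Hit[]> {
  const candidates = await searchFn(query, topK);    // fast first-stage recall
  const reranked = await rerankFn(query, candidates); // cross-encoder rescoring
  return reranked.slice(0, keep);                     // final evidence set
}
```

Injecting the two functions keeps the orchestration logic independent of any one vendor SDK and easy to unit test with stubs.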

Prompt Template (Generation)

Use a structure like: “You are a factual assistant. Use only the Evidence below. Cite chunk IDs next to claims. If unsure, say you don’t know.” Include a short instruction to list sources at the end for traceability.
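A minimal sketch of assembling this template in TypeScript; the exact wording is adjustable:

```typescript
// Sketch: build the grounded generation prompt from the evidence set.
// Evidence lines carry chunk IDs so the model can cite them inline.
function buildPrompt(
  question: string,
  evidence: { chunkId: string; text: string }[]
): string {
  const evidenceBlock = evidence
    .map((e) => `[${e.chunkId}] ${e.text}`)
    .join("\n");
  return [
    "You are a factual assistant. Use only the Evidence below.",
    "Cite chunk IDs next to claims. If unsure, say you don't know.",
    "List your sources at the end.",
    "",
    "Evidence:",
    evidenceBlock,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```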

Policies and Guardrails

  • Filtering: Scope by tenant, role, region, and recency.

  • Safety: Block disallowed categories and enforce redact rules at retrieval time.

  • Citations: Require chunk IDs; reject answers lacking evidence.
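The citation requirement can be enforced as a post-generation check. This sketch assumes citations are written inline as `[chunk-id]`; adapt the pattern to your own format:

```typescript
// Sketch of a post-generation guardrail: accept an answer only if it cites
// at least one chunk and every cited ID exists in the evidence set.
function checkCitations(answer: string, evidenceIds: Set<string>): boolean {
  const cited = [...answer.matchAll(/\[([^\]]+)\]/g)].map((m) => m[1]);
  if (cited.length === 0) return false; // no evidence cited -> reject
  return cited.every((id) => evidenceIds.has(id));
}
```

A rejected answer can be retried with a stronger citation instruction or surfaced as a refusal rather than an ungrounded claim.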

Evaluation

  • Answer quality: Exact-match/F1 or rubric scores with human review.

  • Attribution: Citation precision/recall; broken-citation rate.

  • Latency: P95 end-to-end; per-stage timing (retrieve, rerank, generate).

  • Cost: Top-k, reranker batch size, and max-tokens controls.
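Citation precision and recall from the attribution bullet above reduce to a small set computation, sketched here against a human-labeled set of relevant chunks:

```typescript
// Sketch: citation precision (share of cited chunks that are relevant) and
// recall (share of relevant chunks that got cited), computed over unique IDs.
function citationPR(cited: string[], relevant: string[]) {
  const citedSet = new Set(cited);
  const relSet = new Set(relevant);
  const tp = [...citedSet].filter((id) => relSet.has(id)).length;
  return {
    precision: citedSet.size > 0 ? tp / citedSet.size : 0,
    recall: relSet.size > 0 ? tp / relSet.size : 0,
  };
}
```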

Performance Tips

  • Use smaller, high-quality embeddings; test dimensionality vs recall.

  • Start with k=20 retrieval, rerank to 5–8 evidence chunks.

  • Cache hot queries; memoize reranker scores for frequent pairs.

  • Tune chunk size and overlap by domain; legal/medical often need larger chunks.

Operational Playbook

  • Data freshness: Automate ingest on file change; re-embed only affected chunks.

  • Versioning: Keep index versions; roll back quickly on bad ingests.

  • Observability: Log k-hit distribution, reranker deltas, and refusal rates.

  • A/B testing: Compare prompts, rerankers, and chunkers on a fixed eval set.

Use Cases

  • Internal knowledge: Policies, SOPs, architecture docs with RBAC filters.

  • Customer support: Deflection bots with strict citations and update freshness.

  • Healthcare/finance: Time-bounded retrieval, auditable answers, PII/PHI handling.

Getting Started

  • Explore Mastra examples: Mastra on GitHub

  • Create a ZeroEntropy collection and ingest a small pilot set: ZeroEntropy.dev

  • Ship an MVP with k=20 retrieval reranked down to 5 evidence chunks, strict citations, and a latency budget of 1–2 s.

Conclusion

Mastra orchestrates clean RAG workflows, and ZeroEntropy delivers fast, accurate retrieval. Together, they provide a pragmatic path from prototype to production: better grounding, clearer citations, and predictable latency. Start small, instrument well, and iterate on chunking, prompts, and reranking to reach reliable RAG at scale.
