Deep Dive: The Architecture of ZeroEntropy v1


ZeroEntropy is a full-stack, hybrid retrieval platform that combines sparse retrieval (BM25), dense retrieval (vector embeddings), and LLMs in the loop to deliver enterprise-grade search over unstructured documents.


At ZeroEntropy we’ve re-imagined every layer of the retrieval stack, from PDF parsing to query execution, to deliver end-to-end document intelligence on par with an entire team of expert search engineers, all behind one simple API.


In the sections below, we’ll dive into each layer of our architecture, from ingestion and indexing to query execution and security, demonstrating how we achieve sub-second latency, 90%+ recall, and enterprise-grade compliance (SOC 2, HIPAA).





Ingestion Architecture Diagram

1. High-Level Overview


At its core, ZeroEntropy is a hybrid search system combining:

  • Sparse retrieval (BM25) for lightning-fast keyword matching

  • Dense embedding retrieval for semantic relevance

  • LLM-in-the-loop for query understanding, keyword generation, and final result re-ranking

By combining all three, we avoid the “either/or” trade-offs of vanilla search systems.

2. Document Ingestion & Chunking


Render → OCR → VLM

  • Why? Many PDFs, DOCXs, and PPTs hide text inside images, so we convert each page to a JPEG and OCR it.*

  • Why keep the JPEG? At query time, you can request the original page image alongside your top hits, which is useful if you want to feed the image into a VLM.

  • Why VLM? Diagrams, flowcharts, tables, and formulas carry meaning that OCR alone would miss. (A minimal sketch of this flow follows below.)
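
To make this concrete, here’s a minimal sketch of the render → OCR → VLM flow. It uses pdf2image and pytesseract as stand-ins for our production renderer and OCR engine, and describe_visuals() is a hypothetical VLM call, not a real API:

```python
# Minimal sketch of the render → OCR → VLM ingestion flow.
# pdf2image/pytesseract stand in for our production components;
# describe_visuals() is a hypothetical VLM call.
import io

from pdf2image import convert_from_path  # pip install pdf2image
import pytesseract                       # pip install pytesseract

def describe_visuals(jpeg_bytes: bytes) -> str:
    """Hypothetical: send the page image to a VLM and get back a textual
    description of its diagrams, flowcharts, tables, and formulas."""
    raise NotImplementedError

def ingest_pdf(path: str) -> list[dict]:
    pages = []
    for i, image in enumerate(convert_from_path(path, dpi=200)):
        buf = io.BytesIO()
        image.save(buf, format="JPEG", quality=85)  # JPEG kept for query time
        jpeg = buf.getvalue()
        page_text = pytesseract.image_to_string(image)  # text hidden in images
        visual_text = describe_visuals(jpeg)            # meaning OCR would miss
        pages.append({"page": i, "jpeg": jpeg, "text": f"{page_text}\n{visual_text}"})
    return pages
```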


Hierarchical chunking

We detect language, pick the right tokenizer & stemmer, then split into words → sentences → paragraphs. By keeping contextual spans, we try to create meaningful chunks. We currently support two chunk sizes: coarse (around 2,000 chars) and fine (around 200 chars).
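
As a rough illustration (assuming English text; the real splitter is language-aware and keeps contextual spans), packing sentences into the two chunk sizes might look like this:

```python
# Simplified two-granularity chunker; the sizes match the post, the rest
# is a sketch rather than our production splitter.
import re

COARSE, FINE = 2000, 200  # approximate target chunk sizes, in characters

def chunk(text: str, target: int) -> list[str]:
    """Greedily pack sentences into ~target-char chunks, never splitting
    mid-sentence, so each chunk remains a meaningful span."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > target:
            chunks.append(current.strip())
            current = ""
        current += sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks

coarse_chunks = chunk(document_text, COARSE)  # document_text: from ingestion
fine_chunks = chunk(document_text, FINE)
```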

3. Indexing: Sparse & Dense


Index Type    | What we index                                             | Why it matters
ParadeDB BM25 | Paragraph & document tokens, plus LLM-generated keywords  | Fuzzy, typo-tolerant keyword recall; fast wildcard/fuzzy matching via a BK-tree
Turbopuffer   | Embeddings of every node (sentences → document)           | Sub-second semantic search at scale



We feed our keyword index not just raw stems, but LLM-suggested synonyms, acronyms and domain terms, so a search for “FDA 21 CFR Part 11” surfaces your internal policies, even if they never literally mention “21 CFR.”
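
At index time, that enrichment can be pictured like this; llm_expand_terms() is hypothetical, and this is not ParadeDB’s actual schema:

```python
# Sketch of keyword enrichment at index time. llm_expand_terms() is a
# hypothetical LLM call that proposes synonyms, acronyms, and domain terms.
def llm_expand_terms(chunk_text: str) -> list[str]:
    """Hypothetical: prompt an LLM for synonyms/acronyms/domain terms,
    e.g. a chunk about electronic-records compliance might yield
    ["FDA", "21 CFR Part 11", "e-signature"]."""
    raise NotImplementedError

def build_keyword_row(chunk_text: str) -> dict:
    return {
        "content": chunk_text,  # raw text, stemmed by the index
        "llm_keywords": " ".join(llm_expand_terms(chunk_text)),  # extra searchable field
    }
```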




Query Architecture Diagram

4. Query Execution Walkthrough


LLM Rewriters

  • Query Rewriter: Refines your input into a clearer embedding prompt (e.g. “procedure for submitting Form 10-K to the SEC”).

  • Keyword Generator: Scores key terms (e.g. “10-K” = 0.8, “file” = –0.2) to improve matching.

  • Performance Modes:

    • Fast Mode skips LLM steps for sub-500 ms responses

    • Deep Mode runs the full LLM pipeline in 2–3 seconds (an illustrative sketch of both rewriters follows this list)
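
Illustrative data only (the exact weights are hypothetical) for the raw query “how do I file a 10-K with the SEC?”:

```python
raw_query = "how do I file a 10-K with the SEC?"

# Query Rewriter → a cleaner prompt for the embedding model.
embedding_prompt = "procedure for submitting Form 10-K to the SEC"

# Keyword Generator → signed term weights for the sparse index:
# positive boosts discriminative terms, negative demotes generic ones.
keyword_weights = {"10-K": 0.8, "SEC": 0.5, "file": -0.2}

# Fast Mode skips both LLM calls (sub-500 ms); Deep Mode runs them (2–3 s).
```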


Tokenization & Typo Correction

We split the raw query into tokens and run them through a BK-tree typo corrector so even “10-Kk” maps back to “10-K.”
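
For readers unfamiliar with the structure, here’s a minimal, textbook BK-tree (not our production corrector) showing how “10-Kk” maps back to “10-K”:

```python
# Textbook BK-tree over edit distance: the triangle inequality lets search
# prune any subtree whose edge distance is outside [d - max_dist, d + max_dist].
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words: list[str]):
        self.root = (words[0], {})  # node = (word, {edge distance: child})
        for word in words[1:]:
            self.add(word)

    def add(self, word: str) -> None:
        node, children = self.root
        while True:
            d = levenshtein(word, node)
            if d == 0:
                return
            if d not in children:
                children[d] = (word, {})
                return
            node, children = children[d]

    def search(self, word: str, max_dist: int = 1) -> list[tuple[str, int]]:
        matches, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = levenshtein(word, node)
            if d <= max_dist:
                matches.append((node, d))
            stack.extend(child for dist, child in children.items()
                         if d - max_dist <= dist <= d + max_dist)
        return matches

vocab = BKTree(["10-K", "10-Q", "8-K", "SEC", "filing"])
print(vocab.search("10-Kk"))  # [('10-K', 1)] — one edit away
```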


Sparse + Dense Fan-Out

  • Dense Recall: We query Turbopuffer embeddings to fetch the top-N semantically related chunks.

  • Sparse Recall: We use ParadeDB’s BM25 index to retrieve the top-N’ keyword matches.


Reciprocal Rank Fusion

We merge the sparse and dense rankings using reciprocal rank fusion to select the final top K results, combining complementary signals for up to a 10–15% boost in overall accuracy.
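
RRF itself is only a few lines: each document’s fused score is the sum of 1/(k + rank) over the lists it appears in, with k ≈ 60 by convention. The ranked lists below are hypothetical:

```python
# Reciprocal rank fusion: score(d) = sum_i 1 / (k + rank_i(d)).
def rrf(rankings: list[list[str]], k: int = 60, top: int = 10) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top]

dense_hits = ["chunk_17", "chunk_03", "chunk_42"]   # from the vector index
sparse_hits = ["chunk_03", "chunk_99", "chunk_17"]  # from BM25
print(rrf([dense_hits, sparse_hits], top=3))
# ['chunk_03', 'chunk_17', 'chunk_99'] — chunks both retrievers agree on win.
```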

5. Security & Deployment

  • SOC 2 Type II compliant.

  • HIPAA compliant, following industry best practices.

  • End-to-end encryption for data in transit and at rest.

  • On-prem deployment available for enterprise users as easy-to-deploy Docker images.

Wrapping Up

ZeroEntropy isn’t “another vector search engine.” It’s a full-stack retrieval platform that:

  • Knows how to parse your most fiendish PDFs.

  • Indexes meaningful snippets at every granularity.

  • Blends classical IR, embeddings, and LLM intelligence under the hood.

  • Scales from a single document to billions of nodes without compromising accuracy or speed.


Ready to see it in action?

Explore the docs →

Book a demo →


  * Why do you convert to JPEG? JPEG lets us render at 3–4× the resolution of PNG at the same file size, so we can serve higher-quality images within a given latency, storage, and bandwidth budget.
