harrier-27b: Can 27B Parameters Beat zembed-1?

Apr 8, 2026
TL;DR
  • zembed-1 retains the #1 overall position, outperforming harrier-27b on average NDCG@10 (0.715 vs 0.706) and average Recall@100 (0.771 vs 0.749)
  • On a per-dataset basis, zembed-1 wins 17 out of 28 datasets against harrier-27b on NDCG@10
  • voyage-4 and harrier-27b are neck-and-neck for the #2 spot — tied 14–14 on dataset wins, with voyage-4 ahead by a sliver on overall averages
  • The Harrier family scales well internally (270M → 0.6B → 27B), but even the largest variant doesn’t close the gap to zembed-1
  • Explore the full interactive dashboard →

zembed-1 vs Harrier

A New Challenger, Evaluated Properly

Harrier is a recently released family of open-weight embedding models from Microsoft (built on Gemma and Qwen models), spanning three sizes: 270M, 0.6B, and 27B parameters. The largest variant, harrier-27b, has generated well-deserved attention. On the binary-labeled MTEB benchmark, it ranked first among embedding models at the time of release.

But as we explored in Beyond Binary, MTEB has a discrimination problem: given its (overwhelmingly) binary annotations, it can’t tell the difference between a document which perfectly answers a query and one which may only tangentially address it. So we ran all three Harrier models through the same evaluation pipeline we use for our evals dashboard: 28 diverse datasets, three independent judges, continuous relevance scores from 0 to 10.
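To make the binary-versus-graded distinction concrete, here is a minimal sketch of NDCG@10 computed on the same pair of rankings with graded and with binarized labels. It uses the textbook linear-gain NDCG formula; the judge scores and the binarization threshold of 5 are hypothetical values chosen for illustration, and this is not the dashboard's actual scoring code.

```python
import math

def dcg(gains, k=10):
    """Discounted cumulative gain over the top-k gains (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg(gains, k=10):
    """NDCG@k: DCG of the given ranking divided by DCG of the ideal ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

def binarize(gains, threshold=5):
    """Hypothetical binarization: scores at or above the threshold count as relevant."""
    return [1 if g >= threshold else 0 for g in gains]

# Hypothetical judge scores (0-10), listed in the order each model ranked the documents.
ranking_a = [10, 6, 0]  # perfect answer first, tangential answer second
ranking_b = [6, 10, 0]  # tangential answer first, perfect answer second

graded_a, graded_b = ndcg(ranking_a), ndcg(ranking_b)
binary_a, binary_b = ndcg(binarize(ranking_a)), ndcg(binarize(ranking_b))

print(f"graded: A={graded_a:.3f}  B={graded_b:.3f}")
print(f"binary: A={binary_a:.3f}  B={binary_b:.3f}")
```

With binary labels both rankings look identical, because the perfect and the tangential document are both simply "relevant". With graded labels, the ranking that puts the tangential document first scores lower.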

The question is not whether harrier-27b is a good model. As a matter of fact, it is. (And at 27 billion parameters and a whopping 5,376-dimensional output vector, we would certainly hope so.) But is it the best?

The Three Model Problem: MTEB Evals

Embed-only Recall@K, averaged across 28 datasets and 3 judges. Toggle between models to compare.

On the global average across all 28 evaluation datasets, there are three embedding models which markedly outperform the rest:

Model        NDCG@10  Recall@10  Recall@100
zembed-1     0.715    0.471      0.771
voyage-4     0.712    0.473      0.751
harrier-27b  0.706    0.468      0.749

Below those, qwen3-4b, jina-v5-text-small, cohere-embed-v4, and openai-v3-large (in that order) form a cluster of second-tier performance. But if you need top-tier accuracy, your choice lies in that trio.

So what separates them? On NDCG@10, not much — roughly one point across the trio (though it’s worth noting that harrier still comes out worst). But NDCG@10 is not the whole story.

On Recall@100 — the metric that determines whether a relevant document even makes it to your reranker — zembed-1 leads by +2.0 points over voyage-4, and +2.2 over harrier-27b.

That is where the separation becomes real. A reranker or other downstream system can reorder or rework whatever the embedding model surfaces, but it cannot conjure up a document the embedder failed to retrieve. zembed-1’s recall advantage compounds downstream: fewer relevant documents lost at the retrieval stage means a strictly better candidate set for everything that follows.
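The point about recall acting as a ceiling can be sketched in a few lines. The document IDs, candidate list, and reranker scores below are hypothetical; the sketch only illustrates that reordering a candidate set cannot raise its recall.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def rerank(candidates, scores):
    """Reorder candidates by a (hypothetical) reranker score, best first."""
    return sorted(candidates, key=lambda doc_id: scores.get(doc_id, 0.0), reverse=True)

relevant = {"d1", "d2", "d3", "d4"}

# Hypothetical first-stage output: d4 never makes it into the candidate set.
candidates = ["d9", "d1", "d7", "d2", "d8", "d3"]

# Hypothetical reranker scores; d4 would score highest, but it was never retrieved.
reranker_scores = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.95}

reranked = rerank(candidates, reranker_scores)

top3_before = recall_at_k(candidates, relevant, k=3)
top3_after = recall_at_k(reranked, relevant, k=3)
ceiling = recall_at_k(candidates, relevant, k=len(candidates))

print(f"Recall@3 before reranking: {top3_before:.2f}")
print(f"Recall@3 after reranking:  {top3_after:.2f}")
print(f"Ceiling set by the embedder: {ceiling:.2f}")
```

The reranker lifts Recall@3 up to the ceiling, but no further: the missing document stays missing no matter how good the second stage is.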

Head-to-Head: zembed-1 vs harrier-27b

Averages, of course, can obscure as much as they reveal. So let us go dataset by dataset. Our evals dashboard covers 28 datasets drawn from three MTEB task categories (retrieval, reranking, and instruction retrieval), spanning legal (AILAStatutes, LegalBench), medical (CovidRetrieval, TRECCOVID), financial (FiQA, FinQA), multilingual (MIRACL, MLQA, Belebele, WikipediaRetrieval), and technical domains (StackOverflowQA, SCIDOCS), among others.

Across 28 evaluation datasets, zembed-1 outperforms harrier-27b on NDCG@10 on 17 of them.

The pattern of where each model wins is telling. zembed-1 dominates on instruction retrieval (Core17, News21, Robust04: tasks which require parsing nuanced query intent, not merely matching keywords), medical and legal domains (CovidRetrieval, LegalBench, TRECCOVID), finance (FiQA2018, FinQARetrieval), and technology (StackOverflowQA). harrier-27b, for its part, shows strength on multilingual reranking and a handful of niche benchmarks (RuBQReranking, Russian paragraph reranking; and TwitterHjerne, Danish Twitter retrieval).

Dataset                          zembed-1  harrier-27b  Delta
FinQARetrieval                   0.717     0.619        +9.8
FiQA2018                         0.823     0.748        +7.5
Robust04InstructionRetrieval     0.857     0.788        +6.9
LEMBPasskeyRetrieval             0.891     0.825        +6.6
Core17InstructionRetrieval       0.899     0.837        +6.2
TRECCOVID                        0.922     0.871        +5.1
StackOverflowQA                  0.695     0.651        +4.4
CovidRetrieval                   0.820     0.796        +2.3
NQ                               0.767     0.746        +2.0
AlloprofReranking                0.851     0.832        +1.9
LegalBenchCorporateLobbying      0.875     0.860        +1.5
MLQARetrieval                    0.042     0.029        +1.2
MIRACLRetrievalHardNegatives     0.568     0.556        +1.2
T2Reranking                      0.804     0.794        +1.0
News21InstructionRetrieval       0.919     0.910        +0.8
BelebeleRetrieval                0.179     0.172        +0.7
WikipediaRetrievalMultilingual   0.778     0.774        +0.5
HagridRetrieval                  0.897     0.899        -0.2
SciFact                          0.789     0.792        -0.3
ArguAna                          0.564     0.566        -0.3
QuoraRetrieval                   0.685     0.689        -0.4
VoyageMMarcoReranking            0.732     0.739        -0.7
StatcanDialogueDatasetRetrieval  0.723     0.742        -1.9
WikipediaRerankingMultilingual   0.567     0.605        -3.8
AILAStatutes                     0.700     0.740        -4.0
RuBQReranking                    0.736     0.801        -6.5
TwitterHjerneRetrieval           0.694     0.775        -8.1
SCIDOCS                          0.540     0.623        -8.3
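For readers who want to check the head-to-head tally themselves, the sketch below recounts wins from the rounded NDCG@10 values in the table above. Because it works from three-decimal figures, a recomputed delta can differ from the published one by about a tenth of a point; the variable names are illustrative only.

```python
# (zembed-1, harrier-27b) NDCG@10 per dataset, copied from the table above.
scores = {
    "FinQARetrieval": (0.717, 0.619),
    "FiQA2018": (0.823, 0.748),
    "Robust04InstructionRetrieval": (0.857, 0.788),
    "LEMBPasskeyRetrieval": (0.891, 0.825),
    "Core17InstructionRetrieval": (0.899, 0.837),
    "TRECCOVID": (0.922, 0.871),
    "StackOverflowQA": (0.695, 0.651),
    "CovidRetrieval": (0.820, 0.796),
    "NQ": (0.767, 0.746),
    "AlloprofReranking": (0.851, 0.832),
    "LegalBenchCorporateLobbying": (0.875, 0.860),
    "MLQARetrieval": (0.042, 0.029),
    "MIRACLRetrievalHardNegatives": (0.568, 0.556),
    "T2Reranking": (0.804, 0.794),
    "News21InstructionRetrieval": (0.919, 0.910),
    "BelebeleRetrieval": (0.179, 0.172),
    "WikipediaRetrievalMultilingual": (0.778, 0.774),
    "HagridRetrieval": (0.897, 0.899),
    "SciFact": (0.789, 0.792),
    "ArguAna": (0.564, 0.566),
    "QuoraRetrieval": (0.685, 0.689),
    "VoyageMMarcoReranking": (0.732, 0.739),
    "StatcanDialogueDatasetRetrieval": (0.723, 0.742),
    "WikipediaRerankingMultilingual": (0.567, 0.605),
    "AILAStatutes": (0.700, 0.740),
    "RuBQReranking": (0.736, 0.801),
    "TwitterHjerneRetrieval": (0.694, 0.775),
    "SCIDOCS": (0.540, 0.623),
}

def delta(dataset):
    """zembed-1 minus harrier-27b on NDCG@10, in raw score units."""
    zembed, harrier = scores[dataset]
    return zembed - harrier

zembed_wins = sum(1 for d in scores if delta(d) > 0)
harrier_wins = sum(1 for d in scores if delta(d) < 0)
biggest_zembed_win = max(scores, key=delta)
biggest_harrier_win = min(scores, key=delta)

print(f"zembed-1 wins {zembed_wins} of {len(scores)}, harrier-27b wins {harrier_wins}")
print(f"Largest zembed-1 margin: {biggest_zembed_win}")
print(f"Largest harrier-27b margin: {biggest_harrier_win}")
```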

The Race for Second Place

As we established in our previous head-to-head, voyage-4 has been the reigning #2 embedding model. With harrier-27b now in the picture, that position is genuinely contested:

Metric                       voyage-4  harrier-27b
Average NDCG@10              0.712     0.706
Average Recall@100           0.751     0.749
Dataset wins (head-to-head)  14        14

It is… remarkably close. The two models split dataset wins exactly (14–14), and depending on your vertical, either could be the better runner-up. If forced to pick, voyage-4 still holds thin but consistent edges on the overall averages — enough to keep the #2 title, but barely. (Neither, however, threatens first place.)

Harrier’s Scaling Story

One genuinely interesting aspect of the Harrier family is its range of sizes. The scaling is clean — and instructive:

Model         Params  Avg NDCG@10  Avg Recall@100
harrier-270m  270M    0.625        0.674
harrier-0.6b  600M    0.658        0.704
harrier-27b   27B     0.706        0.749

A +3.3 point NDCG jump from 270M to 0.6B, then +4.9 points from 0.6B to 27B. Returns to scale are not completely diminishing: the largest absolute improvement comes from the largest model. Credit where credit is due: this is a decently executed scaling curve, particularly for the 0.6b model, which comes within a point of Cohere’s flagship embed-v4 on NDCG@10 (0.658 vs 0.667).

What Does Each Size Buy You?
  • harrier-270m (0.625) outperforms bge-m3 (0.593) and openai-v3-small (0.601) — entirely respectable for a 270M-parameter model
  • harrier-0.6b (0.658) is competitive with cohere-embed-v4 (0.667)
  • harrier-27b (0.706) enters the top three — but requires 27 billion parameters and 5,376-dimensional output vectors to get there, compared to zembed-1’s 4 billion parameters and 2,560 dimensions

The contrast in size between harrier-27b and all other models bears emphasis: 27 billion parameters is absolutely massive for an embedding model, and that’s not a compliment.

zembed-1 achieves its #1 ranking with 4 billion parameters and a 2,560-dimensional output. harrier-27b needs nearly 7x the parameter count and 2x the dimensionality to land 0.9 points behind on NDCG@10. In a production setting — where embedding compute, storage costs, and index size are real constraints — the efficiency gap is hardly academic. Would you pay for a model with 7x higher inference cost, probably 7x as much latency, which outputs an embedding that’s twice as costly to store, just to get worse results?

We wouldn’t.
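The storage side of that comparison is simple arithmetic. The sketch below assumes uncompressed float32 vectors (4 bytes per value) and a hypothetical corpus of 10 million documents; real indexes add overhead and often use quantization, so treat the numbers as a rough lower bound on raw vector storage.

```python
def raw_vector_storage_gib(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in GiB, ignoring index overhead and compression."""
    return num_vectors * dims * bytes_per_value / 2**30

NUM_DOCS = 10_000_000  # hypothetical corpus size

# Dimensions and parameter counts as stated in the post.
ZEMBED_DIMS, HARRIER_DIMS = 2560, 5376
ZEMBED_PARAMS, HARRIER_PARAMS = 4e9, 27e9

zembed_gib = raw_vector_storage_gib(NUM_DOCS, ZEMBED_DIMS)
harrier_gib = raw_vector_storage_gib(NUM_DOCS, HARRIER_DIMS)
dim_ratio = HARRIER_DIMS / ZEMBED_DIMS
param_ratio = HARRIER_PARAMS / ZEMBED_PARAMS

print(f"zembed-1:    {zembed_gib:.1f} GiB")
print(f"harrier-27b: {harrier_gib:.1f} GiB")
print(f"Dimensionality ratio: {dim_ratio:.2f}x, parameter ratio: {param_ratio:.2f}x")
```

Storage scales linearly with dimensionality, so the ratio holds at any corpus size. Parameter count is only a rough proxy for inference cost and latency, which also depend on hardware, batching, and sequence length.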

What This Means

harrier-27b is a legitimate top-three embedding model — quite possibly the strongest new entrant we have seen since voyage-4. It is genuinely competitive, especially on multilingual reranking tasks, and we expect Microsoft will continue to iterate on the family.

But the leaderboard has not changed:

zembed-1 leads on average NDCG@10, wins 17 of 28 datasets head-to-head against harrier-27b, and holds the highest Recall@100 of any embedding model — at 1/7th the parameter count and half the vector dimensionality.

For the full interactive breakdown across all models, datasets, metrics, and reranker combinations, explore the evaluation dashboard.

Get Started

zembed-1 is available today through multiple deployment options:

from zeroentropy import ZeroEntropy

zclient = ZeroEntropy()
response = zclient.models.embed(
    model="zembed-1",
    input_type="query",  # "query" or "document"
    input="What is retrieval augmented generation?",  # string or list[str]
    dimensions=2560,  # optional: must be one of [2560, 1280, 640, 320, 160, 80, 40]
    encoding_format="float",  # "float" or "base64"
    latency="fast",  # "fast" or "slow"; omit for auto
)

Documentation: docs.zeroentropy.dev

HuggingFace: huggingface.co/zeroentropy

Get in touch: Discord community or contact@zeroentropy.dev

Talk to us if you need a custom deployment, volume pricing, or want to see how zembed-1 + zerank-2 performs on your data.
