Vector Databases · Production RAG Retrieval, Honest Defaults
Vector database development — production RAG retrieval, honest defaults
Vector databases are the retrieval layer of modern AI applications — and the 2026 honest answer is “it depends on scale, your stack, and whether you need hybrid search.” Our default for most projects: pgvector inside Postgres or Supabase — production-grade in 2026, matching dedicated vector DBs at the 1-10M scale where most apps live, keeping your data in one place you own. We migrate to Pinecone, Qdrant, Weaviate, or Milvus when there’s a measured reason. For RAG that needs hybrid keyword + vector, Elasticsearch usually wins. We architect the whole retrieval layer — embeddings, chunking, reranking, evaluation — not just pick a database.
pgvector, Pinecone, Qdrant, Weaviate, Milvus — we deploy all five honestly
Vector database development for AI applications
In 2026 the vector database isn’t the essential storage layer for your AI app; it’s the smart retrieval layer that controls costs and improves quality. The honest answer to “which one should we use?” is much shorter than the FAQs suggest.
Vector databases are the backbone of modern AI applications, powering everything from semantic search to retrieval-augmented generation (RAG), AI agents, recommendation systems, and intelligent document retrieval. At NerdHeadz we design and build vector search infrastructure that scales from prototype to production — using pgvector, Pinecone, Qdrant, Weaviate, and Milvus.
Our 2026 default: for most projects we start with pgvector inside Supabase or Postgres. The “Postgres is slow for vectors” narrative is dead — Supabase’s own benchmarks show pgvector HNSW matching or beating Qdrant on equivalent compute at 99% accuracy, and Supabase, Neon, and Instacart run it in production at significant scale. For projects in the 1–10M vector range where most apps live, pgvector keeps your data in one place you own — the selfware-thesis answer to vector retrieval. We migrate to dedicated vector DBs when there’s a measured reason — not when a vendor pitch suggests it.
For production RAG that needs hybrid retrieval — keyword precision via BM25 combined with vector semantic recall — pure vector DBs alone often underperform. Elasticsearch (or OpenSearch) with native hybrid search is usually the right answer for that shape; see our Elasticsearch page for that side of the architecture. Most production RAG stacks are hybrid.
We work across the full retrieval stack — not just the database. Embedding pipeline design, hybrid retrieval architecture, cross-encoder reranking, evaluation pipelines that measure retrieval quality honestly, production observability for the failure modes that actually bite. The vector DB is the smallest engineering problem in production RAG; we treat the whole layer with the same care.
What we actually build
Vector DB selection & setup
pgvector + Supabase as our 2026 default for most projects; Pinecone Serverless when zero-ops managed matters; Qdrant or Weaviate self-hosted when scale, performance, or open-source preference shape the call; Milvus for billion-vector workloads. Honest decision per project, configured for your real query patterns.
Embedding pipeline development
Robust pipelines transforming text, images, and structured data into vector embeddings. Model selection (OpenAI text-embedding-3-large, Cohere Embed v3, open-source), chunking strategy (the highest-impact decision in RAG), metadata schema design, batch processing, deduplication, version management.
RAG integration
Connecting vector retrieval to LLMs (Anthropic, OpenAI, Gemini) for grounded responses. Prompt engineering, context-window management, citation generation, agentic patterns with MCP tool use.
Hybrid search & reranking
The production retrieval pattern that beats pure vector alone: BM25 keyword + dense vector + (optional) ELSER sparse, fused with reciprocal rank fusion, then cross-encoder reranking for the top-k. For projects that need this depth, we layer Elasticsearch/OpenSearch alongside the vector DB.
Performance & cost optimization
HNSW parameter tuning, quantization (BBQ, scalar, binary), index strategy per query pattern, caching layers, batch sizing. The difference between “works at 1M vectors” and “works at 100M vectors with sub-50ms latency” is hours of careful tuning — we do it properly.
Migration & scaling
pgvector → Qdrant when single-node Postgres limits start to bind (~50M vectors). Pinecone → self-hosted alternatives when costs exceed roughly $600/mo and DevOps capacity exists. Chroma → a production-grade option after prototyping. Real migrations with parity validation, not “lift and shift” promises.
The 2026 strategic frame — what vector DBs are actually for now
An important reframe most buyers haven’t fully absorbed: the role of vector databases in 2026 is meaningfully different from 2023.
2023’s frame: “Your LLM can’t fit your data. You need a vector database to store and retrieve it.” Vector DBs were positioned as essential storage — the only way to make your data available to an LLM with a small context window.
2026’s reality: Claude Opus, GPT, and Gemini all support 1M+ token context windows. An LLM can now fit a small book in a single prompt. The reason to use a vector database has shifted from “you have to” to “it controls costs and improves quality”:
Cost control: stuffing 500K tokens into every query costs ~$2.50 per call with Claude Opus. Retrieving the right 5K tokens and sending those costs ~$0.03. Vector retrieval is now a 100× cost-optimization layer, not an enabler.
Quality: LLM accuracy degrades with longer contexts — the “lost in the middle” phenomenon is well-documented. Smart retrieval of the most relevant 5–20K tokens consistently outperforms naive long-context stuffing.
Latency: long-context queries are slow. Retrieval + a small-context call is consistently faster than huge-context queries.
Auditability: when you retrieve specific chunks, you can cite them. When you stuff a huge context, the LLM blends sources opaquely.
This reframe matters for buyer decisions. If your project has a small corpus that fits in 200K–500K tokens, you may not need a vector DB at all — just cache the corpus in the LLM prompt. If your corpus is larger, queries are frequent enough that cost-per-call matters, or you need citations and auditability, vector retrieval earns its place. We help you decide which case you’re actually in — and architect accordingly.
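The cost arithmetic above can be sketched in a few lines. The per-token price is an assumption back-derived from the figures in this section ($2.50 for a 500K-token call implies $5 per 1M input tokens), not a quoted vendor rate; the function name and the 10,000-queries-per-month volume are hypothetical.

```python
# Assumption: $5 per 1M input tokens, back-derived from "$2.50 per 500K-token call".
# Check your provider's current price list before relying on any of these numbers.
PRICE_PER_MILLION_INPUT_TOKENS = 5.00


def input_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Dollar cost of sending `tokens` input tokens in a single call."""
    return tokens / 1_000_000 * price_per_million


stuffed = input_cost(500_000)   # whole corpus in every prompt
retrieved = input_cost(5_000)   # only the retrieved chunks
ratio = stuffed / retrieved

queries_per_month = 10_000      # hypothetical volume
monthly_saving = (stuffed - retrieved) * queries_per_month

print(f"stuffed: ${stuffed:.2f}  retrieved: ${retrieved:.3f}  ratio: {ratio:.0f}x")
print(f"monthly saving at {queries_per_month:,} queries: ${monthly_saving:,.0f}")
```

The sketch ignores output tokens, retrieval infrastructure, and any prompt-caching discount, all of which shift the real ratio. It is a starting point for the per-project math, not a substitute for it.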
The pgvector + Supabase default — the selfware-thesis answer
For most projects we start with pgvector inside Postgres or Supabase. It’s the 2026 selfware-thesis default: vector retrieval inside the database you already own, no separate infrastructure, no platform lock-in. Three reasons.
It’s production-grade in 2026
The “Postgres is slow for vectors” narrative is dead. Since pgvector 0.5.0 brought HNSW indexing, performance matches or beats dedicated vector DBs at 1–10M scale — Supabase’s own benchmarks show pgvector HNSW outperforming Qdrant on equivalent compute at 99% accuracy. Companies including Supabase, Neon, and Instacart run pgvector in production at significant scale. The 0.7+ release series (current in 2026) adds parallel index builds, improved HNSW, and better memory management.
Your data stays in one place you own
Vector embeddings live in the same Postgres database as the rest of your application data — same auth, same connection pool, same backup strategy, same query language (SQL with vector operations). No syncing between a primary DB and a separate vector store. No additional infrastructure to monitor. No vendor lock-in. This is the architectural simplicity the broader selfware thesis is built around.
Real filtering and joins, not just similarity
Because pgvector is Postgres, you get SQL filtering, joins, transactions, and relational integrity alongside vector search. SELECT … WHERE category = $1 AND tenant_id = $2 ORDER BY embedding <-> $3 LIMIT 10 — metadata filtering and semantic search in one query, with the full Postgres optimizer behind it. Dedicated vector DBs all support filtering but most don’t do it as cleanly.
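As a rough illustration of that one-query pattern from application code, here is a minimal sketch that builds the parameterized statement and the pgvector text literal. The table name documents and the columns id and content are hypothetical, and the %s placeholders assume a psycopg-style driver rather than the $1 style shown above.

```python
from typing import List, Sequence, Tuple


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format a sequence of floats in pgvector's text input form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_filtered_search(
    category: str,
    tenant_id: str,
    embedding: Sequence[float],
    limit: int = 10,
) -> Tuple[str, tuple]:
    """Return (sql, params) for a metadata-filtered nearest-neighbour query.

    Hypothetical schema: documents(id, content, category, tenant_id, embedding vector).
    `<->` is pgvector's L2-distance operator; `<=>` would give cosine distance.
    """
    sql = (
        "SELECT id, content "
        "FROM documents "
        "WHERE category = %s AND tenant_id = %s "
        "ORDER BY embedding <-> %s::vector "
        "LIMIT %s"
    )
    params = (category, tenant_id, to_vector_literal(embedding), limit)
    return sql, params


sql, params = build_filtered_search("faq", "tenant-1", [0.1, 0.2, 0.3])
print(sql)
print(params)
```

With a psycopg cursor this would typically be run as cur.execute(sql, params). Keeping values in params rather than interpolating them into the string leaves quoting to the driver.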
When pgvector starts to bind
The honest production ceiling is single-node Postgres limits: roughly 50M vectors on a well-provisioned instance (depending on dimension size and query patterns). Beyond that, dedicated vector DBs that scale horizontally (Qdrant, Milvus, Pinecone) become the right call. The next block is the honest decision for when that migration is worth doing, and when it isn’t.
The 5 production vector DBs — honestly compared
Five real options in 2026. Each wins different slices. Here’s the honest map we use to pick — ending with our per-case recommendation. pgvector is highlighted as our default; it earns the lead, it doesn’t win every row.
| Dimension | pgvector (default) | Pinecone | Qdrant | Weaviate | Milvus |
| --- | --- | --- | --- | --- | --- |
| What it is | Postgres extension | Managed-only cloud | Rust-based, open-source | Open-source, schema-rich | Open-source, distributed |
| Scale ceiling | ~50M (single Postgres node) | Effectively unlimited (managed) | Excellent single-node; multi-node available | Strong distributed mode | Billion-vector+ |
| Hybrid search | Native via Postgres FTS + pgvector | Limited (recent sparse-dense support) | Recent BM25 + dense fusion | Best-in-class native hybrid | Distributed hybrid |
| Filtering | Full SQL: joins, indices | Metadata filtering | Best-in-class filter performance | Schema-driven | Partition-based |
| Hosting | Wherever Postgres runs | Managed-only | Self-host or Cloud | Self-host or Cloud | Self-host or Zilliz Cloud |
| License | PostgreSQL (open) | Proprietary | Apache 2.0 | BSD-3 | Apache 2.0 |
| Starting cost | ~$25/mo Supabase Pro | $25/mo Starter | $30/mo self-host VPS | Free self-host or $25+/mo Cloud | Free self-host or Zilliz Cloud |
| Cost at scale | Linear with Postgres compute | Climbs with usage | Linear self-host; managed climbs | Linear self-host | Linear self-host |
| Our pick when | 🟢 Default: already on Postgres/Supabase, 1–10M scale, want one DB to manage | 🟢 Zero-ops managed required, RAG-first team, willing to pay for simplicity | 🟢 Open-source preferred, self-hosted, best filtering, scaling beyond the pgvector ceiling | 🟢 Hybrid search is structural, schema-rich needs, multi-tenancy critical | |
Most production stacks are hybrid — pgvector for the in-stack default + Elasticsearch for hybrid keyword+vector + occasionally a dedicated vector DB for scale. We pick by use case, not by single-tool dogma. See our Elasticsearch page for the hybrid-retrieval side of this picture.
Hybrid retrieval — when vector + Elasticsearch beats pure vector alone
Most production RAG benefits from hybrid retrieval — combining keyword precision and vector semantic recall. Pure vector has real weaknesses; pure keyword has real weaknesses. The honest 2026 answer is often to use both, fused intelligently.
What pure vector retrieval misses
Exact-match queries — “SKU ABC-123” or “the user named John Smith” should match the literal string, not semantic neighbours. Vector retrieval can semantically over-broaden.
Out-of-distribution terms — proper nouns, brand names, acronyms, code identifiers often have no meaningful semantic neighbours and are best retrieved by keyword.
Recency and freshness signals — keyword indexes naturally support boost-by-date; vector retrieval treats all neighbours equally.
What pure keyword retrieval misses
Semantic paraphrases — “doctor” and “physician” mean the same thing; BM25 doesn’t know that.
Conceptual relationships — “how do I reduce my taxes?” matches documents about “deductions” and “credits” via semantic similarity, not literal keywords.
The production answer is hybrid: BM25 keyword + dense vector + (optional) ELSER sparse, fused with reciprocal rank fusion. Elasticsearch (or OpenSearch) ships this natively in a single _search call; pgvector + Postgres FTS does it via composed queries; dedicated vector DBs (Weaviate, recent Qdrant, recent Pinecone) increasingly support hybrid natively at varying maturity.
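Reciprocal rank fusion itself is small enough to sketch. The version below assumes each retriever returns an ordered list of document IDs; the IDs are made up, and k = 60 is the constant commonly used for RRF, not something tuned for any particular corpus.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def reciprocal_rank_fusion(rankings: Iterable[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword retriever and a vector retriever.
bm25_hits = ["doc_sku", "doc_a", "doc_b"]
vector_hits = ["doc_a", "doc_c", "doc_sku"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between BM25 and cosine similarity, which is much of its appeal. Documents that both retrievers rank well rise to the top.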
For projects that need genuine hybrid retrieval at scale, we usually layer Elasticsearch alongside the application — see the Elasticsearch page for that side of the architecture. For projects where pure vector retrieval is what’s needed, the vector DB landscape on this page is the right place. The two pages are complementary, not competing.
The whole retrieval layer — what actually matters in production RAG
Most “vector DB comparison” guides obsess over indexing speed and latency benchmarks. The honest production reality: the vector DB is the smallest engineering problem in your RAG system. Four things matter more.
Highest impact
1. Embedding strategy
Which model (OpenAI text-embedding-3-large, Cohere Embed v3, open-source) — and the chunking strategy that feeds it. Chunk size, overlap, semantic boundaries, document hierarchy, metadata schema. This is the highest-impact decision in your entire RAG pipeline — get it wrong and no vector DB choice will save you.
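To make chunk size and overlap concrete, here is a deliberately simple sketch that splits on whitespace. That is an assumption for illustration only: production pipelines usually count model tokens and respect semantic boundaries such as headings and paragraphs. The default sizes are arbitrary.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary retrievable from either side, at the cost of storing and embedding some words twice.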
Candidate quality
2. Retrieval architecture
Single-stage vs multi-stage retrieval. Pre-filtering vs post-filtering. Metadata schema for hybrid filter+similarity. Query expansion. Routing across multiple indexes (per-tenant, per-language, per-document-type). This shapes whether you’re retrieving the right candidates before ranking even starts.
The missing piece
3. Reranking
The vector DB returns top-100 candidates; a cross-encoder reranker (Cohere Rerank, BGE, custom) reorders them by deeper relevance, taking the top 5–10 to send to the LLM. This step is where retrieval quality goes from “okay” to “production-grade” — and it’s frequently the missing piece in struggling RAG systems.
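The control flow of that second stage can be sketched without committing to a particular reranker. In the sketch below, score_pair is a stand-in for a cross-encoder call; the word-overlap scorer is a toy placeholder so the example runs on its own, and it is not a real relevance model.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and keep the `top_n` highest-scoring candidates."""
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer: fraction of query words present in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Hypothetical first-stage candidates, as if returned by the vector DB.
candidates = [
    "reset your password in settings",
    "billing and invoices",
    "how to reset a forgotten password",
]
top = rerank("reset forgotten password", candidates, toy_overlap_score, top_n=2)
print(top)
```

In a real system score_pair would wrap a hosted or self-hosted reranking model. The point of the shape is that the first stage stays cheap and broad while the expensive pairwise scoring only touches the shortlist.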
Know it works
4. Evaluation & observability
Retrieval-quality metrics (hit rate, MRR, NDCG) measured against a real evaluation set, not vibes. Production observability for the failure modes that bite (empty retrievals, semantic drift, embedding-model regressions, slow queries). Without this, you can’t tell whether your RAG is working — only whether it doesn’t crash.
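The three retrieval metrics named here are straightforward to compute once a labelled set exists. The sketch below assumes binary relevance (a chunk is either correct or not) and ranked lists without duplicates; the chunk IDs and the three-query evaluation set are invented for illustration.

```python
import math
from typing import List, Sequence, Set


def hit_rate_at_k(results: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int) -> float:
    """Share of queries with at least one relevant chunk in the top k."""
    hits = sum(
        1 for ranked, rel in zip(results, relevant)
        if any(doc in rel for doc in ranked[:k])
    )
    return hits / len(results)


def mean_reciprocal_rank(results: Sequence[Sequence[str]], relevant: Sequence[Set[str]]) -> float:
    """Average of 1 / rank of the first relevant chunk (0 when none is retrieved)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc in enumerate(ranked, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(results)


def ndcg_at_k(ranked: Sequence[str], rel: Set[str], k: int) -> float:
    """Binary-relevance NDCG@k for a single query."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in rel
    )
    ideal_hits = min(len(rel), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical evaluation set: retrieved chunk IDs per query, and the known-correct chunks.
results: List[List[str]] = [["c1", "c2", "c3"], ["c9", "c4", "c5"], ["c7", "c8", "c6"]]
relevant: List[Set[str]] = [{"c1"}, {"c4"}, {"c0"}]

print("hit rate@3:", hit_rate_at_k(results, relevant, k=3))
print("MRR:", mean_reciprocal_rank(results, relevant))
print("NDCG@3 (query 2):", ndcg_at_k(results[1], relevant[1], k=3))
```

Tracking these numbers across chunking, embedding-model, and reranker changes is what turns “it feels better” into a measured improvement.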
We architect all four — not just pick the database. For deeper context on retrieval architecture, see our RAG service page.
Production sizing & pricing — the honest math
Two honest pictures: what production vector infrastructure actually requires (RAM scaling reality), and how the five options compare on cost as projects grow.
Visual 1 · RAM per vector count
RAM required for an in-memory HNSW index (1536-dim vectors)
| Vectors | Approx. RAM | Provisioning |
| --- | --- | --- |
| 1M | ~7 GB | 16 GB instance |
| 10M | ~70 GB | 64–128 GB + NVMe |
| 25M | ~175 GB | 256 GB + NVMe |
| 50M | ~350 GB | single-node ceiling |
| 100M+ | sharding | sharding / distribution |
Rule of thumb: ~6–8 GB RAM per 1M vectors of 1536 dimensions for an HNSW index in memory. Most apps live in the 1–10M range where pgvector + Supabase or a single-node Qdrant fits comfortably. Beyond ~50M you’re in distributed-system territory. Quantization (BBQ, scalar, binary) compresses these numbers significantly — see our Elasticsearch page for the BBQ deep dive. ¹
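The rule of thumb follows from simple arithmetic: a 1536-dimension float32 vector is 1536 × 4 = 6,144 bytes, so 1M vectors need about 6.1 GB before any index overhead. The sketch below adds a graph-overhead multiplier of 1.15, which is an assumption chosen to land inside the 6–8 GB range quoted above, not a measured value; real overhead depends on HNSW parameters and the engine.

```python
def hnsw_ram_gb(
    num_vectors: int,
    dims: int = 1536,
    bytes_per_component: int = 4,   # float32; use 1 to approximate int8 scalar quantization
    overhead_factor: float = 1.15,  # assumed allowance for HNSW graph links and bookkeeping
) -> float:
    """Rough in-memory size of an HNSW index in decimal gigabytes."""
    raw_bytes = num_vectors * dims * bytes_per_component
    return raw_bytes * overhead_factor / 1e9


for count in (1_000_000, 10_000_000, 25_000_000, 50_000_000):
    print(f"{count:>11,} vectors: ~{hnsw_ram_gb(count):.0f} GB")

print(f"1M vectors, int8 components: ~{hnsw_ram_gb(1_000_000, bytes_per_component=1):.1f} GB")
```

Treat the output as a sizing conversation starter. Measure the real index size on a sample of your data before provisioning.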
Visual 2 · monthly cost at 50M vectors
Cost at production-large scale: illustrative monthly cost at 50M vectors
| Option | Monthly cost |
| --- | --- |
| Qdrant self-host | $200–500/mo |
| Weaviate self-host | $200–500/mo |
| pgvector + Supabase | $300–600/mo |
| Milvus self-host | $300–800/mo |
| Pinecone Serverless | $400–1,500+/mo |
The full picture across three scales: at prototype (100K vectors) all five run ~$0–30/mo and are roughly comparable. At small production (5M) pgvector + Supabase ($25–100) usually wins on simplicity — one database, one bill — vs $80–200 for Pinecone. At large production (50M+, shown above) Pinecone managed climbs fastest, while self-hosted Qdrant / Weaviate / Milvus give better cost control if DevOps capacity exists. Vector DB is typically a small fraction of total AI cost — LLM tokens dominate. We model your real query volume before recommending. ²
When vector databases aren’t the answer — and we’ll say so
If your corpus is small enough to fit in a modern LLM’s context window (Claude Opus, GPT, and Gemini all support 1M+ tokens now), you may not need a vector database at all; caching the corpus in the LLM prompt may be the simpler answer. We’ll do the math on cost-per-query vs context-stuffing honestly before recommending vector infrastructure.
If your retrieval is fundamentally about exact-match queries — finding documents by ID, SKU, customer name, code identifier — keyword search (Postgres FTS, Elasticsearch, or Meilisearch) often outperforms vector retrieval. Semantic similarity is the wrong tool for “find this specific thing.”
If your retrieval is fundamentally hybrid — needing both keyword precision and semantic recall — Elasticsearch/OpenSearch with native hybrid search usually beats stacking pgvector with a separate keyword layer.
And if your application doesn’t actually need semantic search — if the user query maps cleanly to filter parameters and structured queries — a real database with proper indexes will outperform any vector retrieval. Vector DBs are the right tool for semantic similarity over unstructured or semi-structured content. Outside that window, simpler tools usually win, and we’ll say so.
Proof · Clients
Teams who picked NerdHeadz to build production RAG and vector retrieval.
From embedding pipelines and pgvector deployments to hybrid-retrieval RAG with reranking and evaluation — what a buyer evaluating a real retrieval engagement actually cares about.
This system has been a dream of mine for almost a year. I have tried to build it myself and finally came to the conclusion I needed help. The NerdHeadz team has built me exactly what I was dreaming about and more! Working with them has been an absolute pleasure. I can't thank them enough.
The selfware-thesis answer to vector retrieval — production-grade, one DB to manage, no platform lock-in. We default here for the 1–10M scale where most apps live.
All five production options, deployed honestly.
pgvector, Pinecone, Qdrant, Weaviate, Milvus — we know each deeply and pick honestly per project. No vendor preference, no single-tool dogma. The right database for your actual scale and stack.
The whole retrieval layer, not just the DB.
Embedding pipelines, chunking strategy, hybrid retrieval, cross-encoder reranking, evaluation, production observability. The vector DB is the smallest engineering problem; we treat the rest with the same care.
Hybrid retrieval when it earns its place.
For production RAG needing both keyword precision and semantic recall, we layer Elasticsearch alongside the vector DB. Complementary, not competing — and we architect honestly.
Vector database development — FAQ
A vector database stores data as high-dimensional vectors (numerical representations) and enables similarity search — finding items that are semantically similar rather than just matching keywords. You need one if you’re building AI features like semantic search, recommendations, RAG (retrieval-augmented generation), image similarity, AI agents with memory, or any application that needs to understand meaning rather than just match text. 2026 update: with LLM context windows now reaching 1M+ tokens, vector DBs are increasingly a cost-and-quality optimization rather than essential storage — see the strategic-frame block above.
It depends on scale, infrastructure, and requirements. pgvector + Supabase (our default in 2026) — when you’re already on Postgres, want one database to manage, and your scale is in the 1–10M range where most apps live. Pinecone — fully managed, ideal for teams that want zero ops overhead. Qdrant — best price-performance self-hosted, best-in-class filtering. Weaviate — best hybrid search natively, schema-rich, multi-tenancy improvements in 1.28. Milvus — billion-vector workloads with mature distributed sharding. Chroma — local development and prototyping (out of its depth in production at scale). We help evaluate the trade-offs honestly per project.
In a RAG pipeline, your documents are split into chunks, converted to vector embeddings, and stored in a vector database. When a user asks a question, the query is also converted to a vector, and the database finds the most semantically similar document chunks. These chunks are then fed to an LLM as context, enabling accurate answers grounded in your actual data rather than the model’s training data. The vector DB is the smallest engineering problem in production RAG — chunking strategy, hybrid retrieval, reranking, and evaluation matter much more.
No — vector databases complement traditional databases rather than replacing them. Traditional databases handle structured queries, transactions, and exact lookups. Vector databases handle similarity search and semantic understanding. Most production systems use both: a traditional database for your core data and a vector database (often pgvector inside the same Postgres) for AI-powered search and retrieval features.
Costs vary based on data volume, query throughput, and complexity. A basic RAG pipeline with pgvector + Supabase typically starts at $5–15K for development. Enterprise implementations with custom embedding pipelines, hybrid retrieval (vector + Elasticsearch), cross-encoder reranking, evaluation pipelines, and high-availability setups range from $20–60K. Ongoing infrastructure: pgvector + Supabase Pro starts ~$25/mo; Pinecone Serverless ~$25–200/mo for small production; large-scale dedicated vector DBs $300–1,500+/mo. Vector DB is typically a small fraction of total AI costs — LLM tokens dominate.
Yes, emphatically. The “Postgres is slow for vectors” narrative comes from the IVFFlat-only era, before HNSW landed in 2023. Since pgvector 0.5.0 brought HNSW indexing, performance matches or beats dedicated vector DBs at 1–10M scale. Supabase’s own benchmarks show pgvector HNSW outperforming Qdrant on equivalent compute at 99% accuracy. Companies including Supabase, Neon, and Instacart run pgvector in production at significant scale. The 0.7+ release series (current in 2026) adds parallel index builds, improved HNSW, and better memory management. The honest production ceiling is single-node Postgres limits (~50M vectors on a well-provisioned instance), and you migrate to dedicated DBs when that ceiling binds.
When you have a measured reason. Three real triggers: (1) you’re approaching the single-node Postgres ceiling (~50M vectors) and need horizontal scaling — Qdrant multi-node or Milvus; (2) your latency requirement is sub-10ms p99 at scale and pgvector tuning isn’t enough — Qdrant’s filtering performance specifically may win; (3) you need a hybrid-search architecture that Postgres FTS + pgvector can’t deliver cleanly — Weaviate or Elasticsearch. Migration is 2–6 weeks depending on scale, with parity validation. We don’t recommend migrating speculatively — pgvector is production-grade until measured evidence says otherwise.
Most production RAG benefits from hybrid retrieval — BM25 keyword + dense vector + (optional) ELSER sparse, fused with reciprocal rank fusion. Pure vector misses exact-match queries; pure keyword misses semantic recall. Elasticsearch (or OpenSearch) ships hybrid retrieval natively in a single _search call — usually our pick when hybrid retrieval is structural. pgvector + Postgres FTS does it via composed queries (workable but less polished). Weaviate ships hybrid natively. Pinecone and Qdrant have recent sparse-dense support at varying maturity. We architect honestly per project.
Three real options in 2026: OpenAI text-embedding-3-large (high quality, ~$0.13 per 1M tokens, widely supported); Cohere Embed v3 (multilingual strength, ~$0.10 per 1M tokens, multiple input types); open-source (e5-large, BGE, gte — free, self-hostable, competitive quality). The decision matters more than vector DB choice — the model defines what “semantic similarity” actually means for your data. We benchmark against your real corpus, not generic leaderboards.
Cross-encoder reranking is the production-RAG quality step that’s frequently missing. The vector DB returns the top-100 candidates by approximate semantic similarity; a cross-encoder reranker (Cohere Rerank, BGE-Reranker, custom-trained) reorders them by deeper relevance, taking the top 5–10 to send to the LLM. This step is where retrieval quality goes from “okay” to “production-grade.” For RAG systems that “work in demo but fail in production,” missing or weak reranking is one of the most common causes.
Real evaluation, not vibes: build a labelled evaluation set (50–500 query-result pairs with known correct chunks), measure hit rate at k (does the right chunk appear in the top-k?), mean reciprocal rank (where in the ranking?), NDCG, and end-to-end RAG accuracy (does the LLM answer correctly given the retrieved chunks?). Without measured evaluation, you can’t tell whether your RAG is improving — only that it didn’t crash. We build the evaluation pipeline alongside the production system.
Two main patterns. (1) Memory: AI agents use vector DBs for long-term memory — embedding past conversation turns, retrieved knowledge, and observed behaviours, then retrieving relevant context for new interactions. (2) Tool selection: when an agent has many tools available, vector retrieval can route the query to the most semantically relevant tools (especially useful with MCP tool discovery). For both patterns, the same vector DB landscape and selection logic on this page applies.
Maybe not — and we’ll do the math honestly. If your corpus fits in 200K–500K tokens, caching it directly in the LLM prompt may be simpler and cheaper than building vector infrastructure. If your corpus is larger, query frequency is high enough that cost-per-call matters, or you need citations and auditability, vector retrieval still earns its place — and increasingly as a quality + cost optimization rather than essential storage. The strategic-frame block above covers this in detail.
Yes — a common engagement. We audit existing pgvector, Pinecone, Qdrant, Weaviate, or Chroma deployments, identify performance issues (chunking strategy, index tuning, query patterns), surface evaluation gaps (no measured retrieval quality), and either improve the existing system or migrate to a better-fit DB if that’s the right move. Most “failing” RAG systems aren’t failing because of the vector DB — they’re failing because of chunking, embedding-model choice, missing reranking, or absent evaluation.
Vector retrieval & semantic search work we’ve shipped
Production vector retrieval and semantic search across insurance NLP, AI content tools, and faceted marketplace search — three genuinely retrieval-relevant builds.
Groovy Web, Pinecone vs pgvector vs Chroma vs Weaviate 2026: Best Vector DB by Use Case — pgvector production-grade, single-node Postgres ceiling (~50M), RAM rule of thumb.
Get AI Perks, Best Vector Databases 2026: Pinecone vs Weaviate vs Qdrant vs Chroma — the LLM context-window reframe (vector DBs as smart retrieval, not essential storage), pricing at scale.
Tensoria, Pinecone vs Qdrant vs Weaviate vs pgvector — 100M Vector Benchmark 2026 — production landscape consolidated to 5 options, RAM scaling, ingest throughput.
Firecrawl, Best Vector Databases in 2026: A Complete Comparison Guide — VectorDBBench numbers, managed vs self-hosted trade-offs.
Deepak Gupta, Top 5 Vector Databases 2026 — Pinecone Serverless analysis, production landscape.
MyEngineeringPath, Pinecone vs Qdrant vs Weaviate — Which Vector DB for Your RAG? 2026 — phase-3 maturity, multi-tenancy improvements.
Supabase and Neon production deployment case studies.
NerdHeadz vector database engagement experience.
The vector database landscape evolved significantly through 2024–2026 — Pinecone Serverless GA, pgvector HNSW maturity, OpenSearch v3, Elasticsearch BBQ quantization. Verify current vendor versions, pricing, and feature parity at publish; figures verified as of 2026-Q2.
Let’s scope your retrieval layer
Building production RAG or AI retrieval? Let’s talk.
30-minute scoping call. Whether you’re starting a RAG project from scratch, evaluating which vector DB fits your stack, hitting pgvector’s ceiling and considering migration, or have a struggling RAG system that needs honest diagnosis — we’ll architect the right retrieval layer (database, embeddings, reranking, evaluation) and send a fixed-price quote.