Tuning Max Search Results (k) in RAG: Recall, Redundancy, and Latency
The k you choose for retrieval directly affects recall, redundancy, and runtime. Too low and you miss supporting evidence; too high and you flood the context window or pay unnecessary latency. Here's where these knobs surface in the code and how to pick sensible values.
Where k Comes From
The retriever pulls max_search_results from SearchRetrievalSettings if available, otherwise falls back to a static default.
```python
# backend/core/semantic_searcher.py (excerpt)
def semantic_search(self, query: str, k: int = None, ...):
    rag_settings = self._get_rag_config_settings()
    sr_settings = self._get_search_retrieval_settings()
    if k is None:
        if sr_settings and getattr(sr_settings, "max_search_results", None):
            k = int(sr_settings.max_search_results)
        else:
            k = AppConfig.DEFAULT_SEARCH_K
    # Get more results than needed for filtering and reranking
    search_k = k * AppConfig.SEARCH_EXPANSION_MULTIPLIER
```
The expansion multiplier pulls extra candidates to allow thresholding and MMR re-ranking without starving the final top-k.
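To make the over-fetch-then-trim flow concrete, here is a hedged sketch that expands k, filters weak hits, and trims back to k. The similarity_search_with_score call, the 0.25 cutoff, and the assumption that higher scores mean more similar are illustrative, not taken from the project's code.
```python
# Sketch only: over-fetch candidates, drop weak hits, return the final top-k.
def trim_to_k(vector_store, query: str, k: int = 8, multiplier: int = 3,
              score_threshold: float = 0.25):
    search_k = k * multiplier  # e.g. 8 * 3 = 24 candidates pulled from the store
    candidates = vector_store.similarity_search_with_score(query, k=search_k)
    kept = [doc for doc, score in candidates if score >= score_threshold]
    return kept[:k]  # the caller still sees at most k results
```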
Interaction with MMR
When MMR is enabled, the retriever uses a larger fetch_k and a lambda_mult to balance relevance and diversity.
```python
# backend/core/semantic_searcher.py (excerpt)
if use_mmr:
    if rag_settings:
        fetch_k = max(search_k, rag_settings.rag_mmr_fetch_k)
        lambda_mult = rag_settings.rag_mmr_lambda_mult
    else:
        fetch_k = max(search_k, AppConfig.RAG_MMR_FETCH_K)
        lambda_mult = AppConfig.RAG_MMR_LAMBDA_MULT
    docs = self.vector_store.max_marginal_relevance_search(
        query, k=search_k, fetch_k=fetch_k, lambda_mult=lambda_mult
    )
```
This pattern gives you diversity among the final k results while still honoring max_search_results as the target.
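For intuition about what lambda_mult actually does, here is a minimal, self-contained MMR sketch over precomputed embeddings. It is not the vector store's implementation; it only shows how the score mixes query relevance against similarity to documents already selected.
```python
import numpy as np

def mmr_select(query_vec: np.ndarray, doc_vecs: np.ndarray,
               k: int, lambda_mult: float = 0.5) -> list[int]:
    """Pick k document indices balancing relevance and diversity."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    relevance = [cosine(query_vec, d) for d in doc_vecs]
    selected: list[int] = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            # Penalize candidates that look like anything already picked.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance[i] - (1.0 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected  # lambda_mult=1.0 -> pure relevance; 0.0 -> pure diversity
```
With lambda_mult near 1.0 this degenerates to plain top-k; pushing it toward 0.0 trades relevance for coverage, which matters more as k grows.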
Practical Ranges
- Small, focused corpora: k = 3–5 (with MMR off or λ ≥ 0.6)
- Medium/general corpora: k = 6–10 (MMR on, λ ≈ 0.5, fetch_k ≥ 20)
- Long-context models: consider k = 8–12, but watch redundancy; prefer MMR (these ranges are sketched as presets below)
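The ranges above could be captured as retrieval presets. The names and exact numbers here are illustrative defaults, not settings the codebase defines:
```python
# Hypothetical presets mirroring the ranges above (illustrative values only).
RETRIEVAL_PRESETS = {
    "small_focused":  {"k": 4,  "use_mmr": False, "lambda_mult": 0.6, "fetch_k": 12},
    "medium_general": {"k": 8,  "use_mmr": True,  "lambda_mult": 0.5, "fetch_k": 24},
    "long_context":   {"k": 10, "use_mmr": True,  "lambda_mult": 0.5, "fetch_k": 30},
}
# lambda_mult is only consulted when use_mmr is True.
```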
Evaluation Tactics
- Track answer correctness vs. k on a validation set; look for diminishing returns (see the sweep sketch after this list).
- Measure context-window utilization and truncation rates per k bucket.
- Latency budget: ensure expansion multiplier + fetch_k stay within SLOs.
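A simple way to run that comparison is a k sweep over a validation set. The sketch below assumes a searcher exposing semantic_search as in the excerpt above; the grader function and the validation-set format are placeholders for whatever evaluation you already have.
```python
import time

def sweep_k(searcher, validation_set, grade, k_values=(3, 5, 8, 10, 12)):
    """Measure accuracy and latency per k to find the point of diminishing returns."""
    report = {}
    for k in k_values:
        hits, latencies = 0, []
        for query, expected in validation_set:
            start = time.perf_counter()
            docs = searcher.semantic_search(query, k=k)
            latencies.append(time.perf_counter() - start)
            hits += int(grade(docs, expected))  # placeholder grading callback
        latencies.sort()
        report[k] = {
            "accuracy": hits / len(validation_set),
            "p50_latency_s": latencies[len(latencies) // 2],
        }
    return report  # pick the smallest k where accuracy has flattened out
```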
Guardrails
- Always use an expansion multiplier ≥ 2 to allow thresholding to drop weak hits.
- If latency spikes, reduce fetch_k first, not the final k.
- Avoid k > 12 unless you have strong de-duplication and summarization in place (a simple dedup sketch follows this list).
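One cheap de-duplication guardrail is to drop near-identical chunks before they reach the prompt. This sketch uses plain text similarity with an arbitrary 0.9 cutoff and assumes LangChain-style documents with a page_content attribute; adjust both to your chunking.
```python
from difflib import SequenceMatcher

def dedupe_chunks(docs, threshold: float = 0.9):
    """Keep only chunks that are not near-duplicates of an already-kept chunk."""
    kept = []
    for doc in docs:
        is_dup = any(
            SequenceMatcher(None, doc.page_content, prev.page_content).ratio() >= threshold
            for prev in kept
        )
        if not is_dup:
            kept.append(doc)
    return kept
```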