Maximum Marginal Relevance in RAG: Fighting the Echo Chamber Problem
You ask your AI assistant about “Vue.js development best practices” and get back three documents about component composition, two about component composition, and one more about component composition. Sound familiar? This is the similarity trap that plagues many RAG systems. They find the most relevant content but often return near-duplicates that don’t add much value.
Enter Maximum Marginal Relevance (MMR), a clever algorithm that asks a simple but powerful question: “What if we prioritized both relevance AND diversity?”
The Problem: Too Much of a Good Thing
Traditional RAG systems use pure similarity search. Ask about “frontend frameworks,” and you might get:
- Document A: “Vue.js is a progressive framework…” (Score: 0.92)
- Document B: “Vue.js offers reactive data binding…” (Score: 0.91)
- Document C: “Vue.js provides component-based architecture…” (Score: 0.89)
- Document D: “React is a popular JavaScript library…” (Score: 0.75)
- Document E: “Angular is a comprehensive framework…” (Score: 0.73)
Notice the problem? The top three results are all about Vue.js. While highly relevant, they’re redundant. The user gets a narrow view instead of comprehensive coverage.
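Under the hood, that ranking is nothing more than a cosine-similarity sort. Here's a minimal sketch of the traditional approach (assuming the query and documents are already embedded as vectors):

import numpy as np

def top_k_by_similarity(query_vec, doc_vecs, k=5):
    """Rank documents purely by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]

Every document competes only on closeness to the query, which is exactly why three near-identical Vue.js chunks can sweep the top spots.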
How MMR Changes the Game
MMR uses a mathematical formula that balances two competing goals:
MMR Score = λ × Relevance - (1-λ) × Max Similarity to Selected
Where:
- λ (lambda): Controls the relevance vs. diversity tradeoff (0.0 to 1.0)
- Relevance: How well the document matches your query
- Max Similarity to Selected: How similar this document is to already-selected results
Instead of just picking the most relevant documents, MMR asks: “Is this document different enough from what we’ve already selected to add value?”
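Plugging numbers in makes the penalty concrete. With λ = 0.6, a document that is highly relevant (0.85) but nearly identical to something already selected (0.90) scores only modestly:

lambda_mult = 0.6
relevance = 0.85
max_sim_to_selected = 0.90

mmr_score = lambda_mult * relevance - (1 - lambda_mult) * max_sim_to_selected
print(round(mmr_score, 2))  # 0.51 - 0.36 ≈ 0.15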
MMR in Action: Real Examples
Here’s the difference using actual queries from my system:
Query: “Tell me about Nick’s development experience”
Without MMR (Traditional Similarity)
Results:
1. Senior Frontend Developer role at Company A (0.89)
2. Frontend Developer role at Company B (0.87)
3. JavaScript Developer role at Company C (0.85)
4. Frontend Technologies summary (0.84)
5. Vue.js project details (0.82)
Analysis: Heavy focus on frontend, missing backend/full-stack experience
With MMR (λ = 0.6)
Results:
1. Senior Frontend Developer role at Company A (0.89) - Most relevant
2. Backend Python experience summary (0.76) - Different domain
3. Project management responsibilities (0.71) - Different skill type
4. Client consultation experience (0.69) - Different role aspect
5. Technical writing portfolio (0.65) - Different expertise area
Analysis: Comprehensive view of diverse skills and experiences
The MMR version gives a much more complete picture of my background.
Understanding the Lambda Parameter
The λ (lambda_mult) parameter in your admin settings is the key to MMR’s behavior:
λ = 1.0 (Pure Relevance)
- Identical to traditional similarity search
- Highest quality matches
- Risk of redundancy
- Use when: Precise answers to specific questions
λ = 0.5 (Balanced)
- Equal weight to relevance and diversity
- Good compromise for most use cases
- Prevents obvious duplicates while maintaining quality
- Use when: General knowledge queries
λ = 0.0 (Pure Diversity)
- Maximum variety, potentially sacrificing relevance
- Can include marginally related content
- Risk of confusing responses
- Use when: Exploratory research, brainstorming
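To see where the tradeoff tips, here's a tiny sketch (made-up scores, purely illustrative) pitting a redundant candidate against a diverse one across a few lambda values:

candidates = {
    "redundant candidate (relevance 0.90, overlap 0.95)": (0.90, 0.95),
    "diverse candidate (relevance 0.70, overlap 0.20)": (0.70, 0.20),
}

for lam in (1.0, 0.8, 0.6, 0.3):
    scores = {name: lam * rel - (1 - lam) * sim
              for name, (rel, sim) in candidates.items()}
    print(f"lambda={lam}: picks the {max(scores, key=scores.get)}")

# With these numbers the redundant document wins down to about lambda ≈ 0.8;
# below that, the diverse document takes over.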
The Technical Deep Dive
Here’s how MMR actually works under the hood:
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_selection(documents, query_embedding, k=5, lambda_mult=0.6):
    """Select documents using Maximum Marginal Relevance."""
    selected = []
    candidates = documents.copy()

    # Step 1: Select the most relevant document
    relevance_scores = [cosine_similarity(query_embedding, doc.embedding)
                        for doc in candidates]
    best_idx = max(range(len(candidates)), key=lambda i: relevance_scores[i])
    selected.append(candidates.pop(best_idx))

    # Step 2: Iteratively select remaining documents
    while len(selected) < k and candidates:
        mmr_scores = []
        for candidate in candidates:
            # Relevance of this candidate to the query
            relevance = cosine_similarity(query_embedding, candidate.embedding)

            # Maximum similarity to any already-selected document
            max_sim_to_selected = max(
                cosine_similarity(candidate.embedding, selected_doc.embedding)
                for selected_doc in selected
            )

            # Apply the MMR formula
            mmr_score = (lambda_mult * relevance
                         - (1 - lambda_mult) * max_sim_to_selected)
            mmr_scores.append(mmr_score)

        # Select the document with the highest MMR score
        best_idx = max(range(len(mmr_scores)), key=lambda i: mmr_scores[i])
        selected.append(candidates.pop(best_idx))

    return selected
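A quick toy run shows the effect (hypothetical data; the only assumption is that each document object exposes an embedding attribute, as the sketch above expects):

from types import SimpleNamespace

toy_docs = [
    SimpleNamespace(text="Vue.js is a progressive framework...", embedding=np.array([0.90, 0.10, 0.00])),
    SimpleNamespace(text="Vue.js offers reactive data binding...", embedding=np.array([0.88, 0.12, 0.00])),
    SimpleNamespace(text="Backend APIs are built with FastAPI...", embedding=np.array([0.20, 0.90, 0.10])),
    SimpleNamespace(text="PostgreSQL handles persistence...", embedding=np.array([0.10, 0.30, 0.90])),
]
query_vec = np.array([1.00, 0.20, 0.10])

for lam in (1.0, 0.4):
    picked = mmr_selection(toy_docs, query_vec, k=3, lambda_mult=lam)
    print(lam, [doc.text[:25] for doc in picked])

# At lambda=1.0 both Vue.js chunks make the cut; at lambda=0.4 the near-duplicate
# is dropped in favor of the backend and database chunks.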
Practical MMR Configuration
The Fetch-K Parameter
In your admin settings, you’ll see MMR Fetch K - this controls the initial candidate pool:
- Fetch K = 20: Get top 20 most relevant documents
- MMR K = 5: Use MMR to select 5 diverse documents from those 20
- Why this matters: Larger fetch pools enable more diversity choices
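In code, the two stages compose naturally. Here's a sketch reusing top_k_by_similarity and mmr_selection from earlier (again assuming each document carries an embedding attribute):

def retrieve_with_mmr(query_vec, documents, k=5, fetch_k=20, lambda_mult=0.6):
    # Stage 1: pull a generous candidate pool by plain cosine similarity
    doc_vecs = np.stack([doc.embedding for doc in documents])
    candidate_idx = top_k_by_similarity(query_vec, doc_vecs, k=fetch_k)
    candidates = [documents[i] for i in candidate_idx]

    # Stage 2: let MMR choose the final k diverse documents from that pool
    return mmr_selection(candidates, query_vec, k=k, lambda_mult=lambda_mult)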
Real-World Tuning Strategy
Start with these defaults:
- Use MMR: true
- Lambda: 0.6 (balanced relevance/diversity)
- K: 4-6 (optimal context length for most LLMs)
- Fetch K: 20-30 (good diversity pool)
Adjust based on content:
- Technical documentation: Higher lambda (0.7-0.8) for precision
- Creative content: Lower lambda (0.4-0.5) for variety
- Mixed content: Balanced lambda (0.5-0.6)
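If your retrieval layer happens to sit on LangChain (an assumption on my part; the same defaults apply whatever library is behind your admin panel), the settings above map onto the standard MMR retriever kwargs roughly like this:

# Hypothetical wiring; `vectorstore` stands in for whatever FAISS/Chroma/pgvector
# store the application already builds.
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,              # final documents handed to the LLM
        "fetch_k": 25,       # candidate pool MMR chooses from
        "lambda_mult": 0.6,  # relevance vs. diversity balance
    },
)
docs = retriever.invoke("Tell me about Nick's development experience")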
When MMR Transforms Your Experience
Before MMR: The Repetition Problem
User: "What technologies does Nick use for web development?"
Response: "Nick primarily uses Vue.js for frontend development.
He has extensive experience with Vue.js and the Vue ecosystem.
Nick's Vue.js projects include several single-page applications
built with Vue.js and related technologies."
Result: Repetitive, narrow focus
After MMR: The Comprehensive View
User: "What technologies does Nick use for web development?"
Response: "Nick uses a diverse technology stack for web development.
On the frontend, he specializes in Vue.js and React with TypeScript.
For backend development, he works with Python and FastAPI.
He also has experience with database design using PostgreSQL and
cloud deployment on platforms like Railway and Vercel."
Result: Complete, informative overview
Common MMR Pitfalls
Over-Diversification
- Problem: Lambda too low (0.2-0.3)
- Symptom: Responses include barely relevant information
- Solution: Increase lambda to 0.5-0.6
Under-Diversification
- Problem: Lambda too high (0.8-0.9)
- Symptom: Still getting similar documents
- Solution: Decrease lambda and increase fetch-k
Wrong Fetch Size
- Problem: Fetch-k too small (< 10)
- Symptom: Limited diversity options
- Solution: Increase fetch-k to 20-50
The Performance Trade-off
MMR isn’t free - it requires additional computation:
# Performance comparison
Traditional Search: O(n log k) # Simple similarity ranking
MMR Search: O(k × fetch_k) # Additional similarity calculations
# Typical performance impact
Query time increase: 15-30%
Memory usage increase: 10-20%
Quality improvement: 40-60% (subjective)
For most applications, the quality improvement far outweighs the performance cost.
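If you'd rather measure the overhead on your own corpus than trust rough percentages, a synthetic micro-benchmark (random vectors, reusing cosine_similarity and mmr_selection from the earlier sketch) only takes a few lines:

import time
import numpy as np
from types import SimpleNamespace

rng = np.random.default_rng(42)
corpus = [SimpleNamespace(embedding=v) for v in rng.normal(size=(2000, 384))]
query_vec = rng.normal(size=384)

start = time.perf_counter()
sims = [cosine_similarity(query_vec, doc.embedding) for doc in corpus]
top_5 = sorted(range(len(corpus)), key=lambda i: sims[i], reverse=True)[:5]
plain_time = time.perf_counter() - start

start = time.perf_counter()
mmr_selection(corpus, query_vec, k=5, lambda_mult=0.6)  # whole corpus, worst case
mmr_time = time.perf_counter() - start

print(f"similarity only: {plain_time:.3f}s, with MMR: {mmr_time:.3f}s")
# In production MMR only runs over the fetch_k pool, so the gap is much smaller.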
Advanced MMR Techniques
Content-Aware Lambda
Adjust lambda based on query type:
def get_dynamic_lambda(query, content_type):
    if "specific" in query.lower() or "what is" in query.lower():
        return 0.8  # Favor precision
    elif "overview" in query.lower() or "tell me about" in query.lower():
        return 0.4  # Favor diversity
    else:
        return 0.6  # Balanced default
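Run against one of the earlier example queries, the heuristic behaves as you'd expect:

lam = get_dynamic_lambda("Tell me about Nick's development experience", content_type=None)
print(lam)  # 0.4: "tell me about" signals an overview-style query, so favor diversity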
Multi-Stage MMR
Use different parameters for different selection stages:
# Stage 1: High relevance threshold
primary_docs = mmr_select(query, documents, k=3, lambda_mult=0.8)

# Stage 2: High diversity for supporting context
# (remaining_docs = whatever wasn't picked in stage 1)
context_docs = mmr_select(query, remaining_docs, k=2, lambda_mult=0.3)
Monitoring MMR Effectiveness
Track these metrics to optimize your MMR settings:
metrics = {
    "response_diversity": measure_topic_coverage(response),
    "user_satisfaction": track_user_feedback(),
    "query_coverage": analyze_knowledge_gaps(),
    "redundancy_score": detect_content_overlap(),
}
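Those metric names are placeholders for whatever instrumentation you have. As one concrete example, a redundancy score can be as simple as the average pairwise similarity among the retrieved chunks (a sketch reusing cosine_similarity from earlier):

def redundancy_score(chunk_embeddings):
    """Average pairwise cosine similarity of retrieved chunks;
    values near 1.0 mean the result set is repeating itself."""
    pairs = [cosine_similarity(chunk_embeddings[i], chunk_embeddings[j])
             for i in range(len(chunk_embeddings))
             for j in range(i + 1, len(chunk_embeddings))]
    return sum(pairs) / len(pairs) if pairs else 0.0

If MMR is doing its job, this number should drop noticeably compared to plain similarity search on the same query.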
The Bottom Line
MMR transforms RAG systems from echo chambers into comprehensive knowledge assistants. It’s the difference between getting the same information three different ways versus getting three different perspectives on your question.
When to use MMR:
- General knowledge queries
- Exploratory questions
- “Tell me about…” requests
- Research and analysis tasks
When to skip MMR:
- Precise factual lookups
- “What is the exact…” questions
- Single-concept queries
- Performance-critical applications
The magic happens when you find the right balance between relevance and diversity for your specific use case. Too much relevance creates tunnel vision. Too much diversity creates confusion. MMR helps you find that sweet spot where your AI assistant becomes truly helpful rather than just technically correct.