Maximum Marginal Relevance in RAG: Fighting the Echo Chamber Problem
You ask your AI assistant about “Vue.js development best practices” and get back three documents about component composition, two about component composition, and one more about component composition. Sound familiar? This is the similarity trap that plagues many RAG systems. They find the most relevant content but often return near-duplicates that don’t add much value.
Enter Maximum Marginal Relevance (MMR), a clever algorithm that asks a simple but powerful question: “What if we prioritized both relevance AND diversity?”
The Problem: Too Much of a Good Thing
Traditional RAG systems use pure similarity search. Ask about “frontend frameworks,” and you might get:
- Document A: “Vue.js is a progressive framework…” (Score: 0.92)
- Document B: “Vue.js offers reactive data binding…” (Score: 0.91)
- Document C: “Vue.js provides component-based architecture…” (Score: 0.89)
- Document D: “React is a popular JavaScript library…” (Score: 0.75)
- Document E: “Angular is a comprehensive framework…” (Score: 0.73)
Notice the problem? The top three results are all about Vue.js. While highly relevant, they’re redundant. The user gets a narrow view instead of comprehensive coverage.
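Under the hood, that ranking is nothing more than a cosine-similarity sort. Here's a minimal sketch of the traditional approach (assuming the query and documents are already embedded as vectors):

import numpy as np

def top_k_by_similarity(query_vec, doc_vecs, k=5):
    """Rank documents purely by cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]

Every document competes only on closeness to the query, which is exactly why three near-identical Vue.js chunks can sweep the top spots.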
How MMR Changes the Game
MMR uses a mathematical formula that balances two competing goals:
MMR Score = λ × Relevance - (1-λ) × Max Similarity to Selected
Where:
- λ (lambda): Controls the relevance vs. diversity tradeoff (0.0 to 1.0)
- Relevance: How well the document matches your query
- Max Similarity to Selected: How similar this document is to already-selected results
Instead of just picking the most relevant documents, MMR asks: “Is this document different enough from what we’ve already selected to add value?”
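Plugging numbers in makes the penalty concrete. With λ = 0.6, a document that is highly relevant (0.85) but nearly identical to something already selected (0.90) scores only modestly:

lambda_mult = 0.6
relevance = 0.85
max_sim_to_selected = 0.90

mmr_score = lambda_mult * relevance - (1 - lambda_mult) * max_sim_to_selected
print(round(mmr_score, 2))  # 0.51 - 0.36 ≈ 0.15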
MMR in Action: Real Examples
Here’s the difference using actual queries from my system:
Query: “Tell me about Nick’s development experience”
Without MMR (Traditional Similarity)
Results:
1. Senior Frontend Developer role at Company A (0.89)
2. Frontend Developer role at Company B (0.87)
3. JavaScript Developer role at Company C (0.85)
4. Frontend Technologies summary (0.84)
5. Vue.js project details (0.82)
Analysis: Heavy focus on frontend, missing backend/full-stack experience
With MMR (λ = 0.6)
Results:
1. Senior Frontend Developer role at Company A (0.89) - Most relevant
2. Backend Python experience summary (0.76) - Different domain
3. Project management responsibilities (0.71) - Different skill type
4. Client consultation experience (0.69) - Different role aspect
5. Technical writing portfolio (0.65) - Different expertise area
Analysis: Comprehensive view of diverse skills and experiences
The MMR version gives a much more complete picture of my background.
Understanding the Lambda Parameter
The λ (lambda_mult) parameter in your admin settings is the key to MMR’s behavior:
λ = 1.0 (Pure Relevance)
- Identical to traditional similarity search
- Highest quality matches
- Risk of redundancy
- Use when: Precise answers to specific questions
λ = 0.5 (Balanced)
- Equal weight to relevance and diversity
- Good compromise for most use cases
- Prevents obvious duplicates while maintaining quality
- Use when: General knowledge queries
λ = 0.0 (Pure Diversity)
- Maximum variety, potentially sacrificing relevance
- Can include marginally related content
- Risk of confusing responses
- Use when: Exploratory research, brainstorming
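To see where the tradeoff tips, here's a tiny sketch (made-up scores, purely illustrative) pitting a redundant candidate against a diverse one across a few lambda values:

candidates = {
    "redundant candidate (relevance 0.90, overlap 0.95)": (0.90, 0.95),
    "diverse candidate (relevance 0.70, overlap 0.20)": (0.70, 0.20),
}

for lam in (1.0, 0.8, 0.6, 0.3):
    scores = {name: lam * rel - (1 - lam) * sim
              for name, (rel, sim) in candidates.items()}
    print(f"lambda={lam}: picks the {max(scores, key=scores.get)}")

# With these numbers the redundant document wins down to about lambda ≈ 0.8;
# below that, the diverse document takes over.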
The Technical Deep Dive
Here’s how MMR actually works under the hood:
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_selection(documents, query_embedding, k=5, lambda_mult=0.6):
    """Select documents using Maximum Marginal Relevance."""
    selected = []
    candidates = documents.copy()

    # Step 1: Select the most relevant document
    relevance_scores = [cosine_similarity(query_embedding, doc.embedding)
                        for doc in candidates]
    best_idx = max(range(len(candidates)), key=lambda i: relevance_scores[i])
    selected.append(candidates.pop(best_idx))

    # Step 2: Iteratively select remaining documents
    while len(selected) < k and candidates:
        mmr_scores = []
        for candidate in candidates:
            # Relevance of this candidate to the query
            relevance = cosine_similarity(query_embedding, candidate.embedding)

            # Maximum similarity to any already-selected document
            max_sim_to_selected = max(
                cosine_similarity(candidate.embedding, selected_doc.embedding)
                for selected_doc in selected
            )

            # Apply the MMR formula
            mmr_score = (lambda_mult * relevance
                         - (1 - lambda_mult) * max_sim_to_selected)
            mmr_scores.append(mmr_score)

        # Select the document with the highest MMR score
        best_idx = max(range(len(mmr_scores)), key=lambda i: mmr_scores[i])
        selected.append(candidates.pop(best_idx))

    return selected
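A quick toy run shows the effect (hypothetical data; the only assumption is that each document object exposes an embedding attribute, as the sketch above expects):

from types import SimpleNamespace

toy_docs = [
    SimpleNamespace(text="Vue.js is a progressive framework...", embedding=np.array([0.90, 0.10, 0.00])),
    SimpleNamespace(text="Vue.js offers reactive data binding...", embedding=np.array([0.88, 0.12, 0.00])),
    SimpleNamespace(text="Backend APIs are built with FastAPI...", embedding=np.array([0.20, 0.90, 0.10])),
    SimpleNamespace(text="PostgreSQL handles persistence...", embedding=np.array([0.10, 0.30, 0.90])),
]
query_vec = np.array([1.00, 0.20, 0.10])

for lam in (1.0, 0.4):
    picked = mmr_selection(toy_docs, query_vec, k=3, lambda_mult=lam)
    print(lam, [doc.text[:25] for doc in picked])

# At lambda=1.0 both Vue.js chunks make the cut; at lambda=0.4 the near-duplicate
# is dropped in favor of the backend and database chunks.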
Practical MMR Configuration
The Fetch-K Parameter
In your admin settings, you’ll see MMR Fetch K - this controls the initial candidate pool:
- Fetch K = 20: Get top 20 most relevant documents
- MMR K = 5: Use MMR to select 5 diverse documents from those 20
- Why this matters: Larger fetch pools enable more diversity choices
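In code, the two stages compose naturally. Here's a sketch reusing top_k_by_similarity and mmr_selection from earlier (again assuming each document carries an embedding attribute):

def retrieve_with_mmr(query_vec, documents, k=5, fetch_k=20, lambda_mult=0.6):
    # Stage 1: pull a generous candidate pool by plain cosine similarity
    doc_vecs = np.stack([doc.embedding for doc in documents])
    candidate_idx = top_k_by_similarity(query_vec, doc_vecs, k=fetch_k)
    candidates = [documents[i] for i in candidate_idx]

    # Stage 2: let MMR choose the final k diverse documents from that pool
    return mmr_selection(candidates, query_vec, k=k, lambda_mult=lambda_mult)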
Real-World Tuning Strategy
Start with these defaults:
- Use MMR: true
- Lambda: 0.6 (balanced relevance/diversity)
- K: 4-6 (optimal context length for most LLMs)
- Fetch K: 20-30 (good diversity pool)
Adjust based on content:
- Technical documentation: Higher lambda (0.7-0.8) for precision
- Creative content: Lower lambda (0.4-0.5) for variety
- Mixed content: Balanced lambda (0.5-0.6)
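If your retrieval layer happens to sit on LangChain (an assumption on my part; the same defaults apply whatever library is behind your admin panel), the settings above map onto the standard MMR retriever kwargs roughly like this:

# Hypothetical wiring; `vectorstore` stands in for whatever FAISS/Chroma/pgvector
# store the application already builds.
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,              # final documents handed to the LLM
        "fetch_k": 25,       # candidate pool MMR chooses from
        "lambda_mult": 0.6,  # relevance vs. diversity balance
    },
)
docs = retriever.invoke("Tell me about Nick's development experience")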
When MMR Transforms Your Experience
Before MMR: The Repetition Problem
User: "What technologies does Nick use for web development?"
Response: "Nick primarily uses Vue.js for frontend development.
He has extensive experience with Vue.js and the Vue ecosystem.
Nick's Vue.js projects include several single-page applications
built with Vue.js and related technologies."
Result: Repetitive, narrow focus
After MMR: The Comprehensive View
User: "What technologies does Nick use for web development?"
Response: "Nick uses a diverse technology stack for web development.
On the frontend, he specializes in Vue.js and React with TypeScript.
For backend development, he works with Python and FastAPI.
He also has experience with database design using PostgreSQL and
cloud deployment on platforms like Railway and Vercel."
Result: Complete, informative overview
Common MMR Pitfalls
Over-Diversification
- Problem: Lambda too low (0.2-0.3)
- Symptom: Responses include barely relevant information
- Solution: Increase lambda to 0.5-0.6
Under-Diversification
- Problem: Lambda too high (0.8-0.9)
- Symptom: Still getting similar documents
- Solution: Decrease lambda and increase fetch-k
Wrong Fetch Size
- Problem: Fetch-k too small (< 10)
- Symptom: Limited diversity options
- Solution: Increase fetch-k to 20-50
The Performance Trade-off
MMR isn’t free - it requires additional computation:
# Performance comparison
Traditional Search: O(n log k) # Simple similarity ranking
MMR Search: O(k × fetch_k) # Additional similarity calculations
# Typical performance impact
Query time increase: 15-30%
Memory usage increase: 10-20%
Quality improvement: 40-60% (subjective)
For most applications, the quality improvement far outweighs the performance cost.
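If you'd rather measure the overhead on your own corpus than trust rough percentages, a synthetic micro-benchmark (random vectors, reusing cosine_similarity and mmr_selection from the earlier sketch) only takes a few lines:

import time
import numpy as np
from types import SimpleNamespace

rng = np.random.default_rng(42)
corpus = [SimpleNamespace(embedding=v) for v in rng.normal(size=(2000, 384))]
query_vec = rng.normal(size=384)

start = time.perf_counter()
sims = [cosine_similarity(query_vec, doc.embedding) for doc in corpus]
top_5 = sorted(range(len(corpus)), key=lambda i: sims[i], reverse=True)[:5]
plain_time = time.perf_counter() - start

start = time.perf_counter()
mmr_selection(corpus, query_vec, k=5, lambda_mult=0.6)  # whole corpus, worst case
mmr_time = time.perf_counter() - start

print(f"similarity only: {plain_time:.3f}s, with MMR: {mmr_time:.3f}s")
# In production MMR only runs over the fetch_k pool, so the gap is much smaller.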
Advanced MMR Techniques
Content-Aware Lambda
Adjust lambda based on query type:
def get_dynamic_lambda(query, content_type):
    if "specific" in query.lower() or "what is" in query.lower():
        return 0.8  # Favor precision
    elif "overview" in query.lower() or "tell me about" in query.lower():
        return 0.4  # Favor diversity
    else:
        return 0.6  # Balanced default
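Run against one of the earlier example queries, the heuristic behaves as you'd expect:

lam = get_dynamic_lambda("Tell me about Nick's development experience", content_type=None)
print(lam)  # 0.4: "tell me about" signals an overview-style query, so favor diversity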
Multi-Stage MMR
Use different parameters for different selection stages:
# Stage 1: High relevance threshold
primary_docs = mmr_select(query, documents, k=3, lambda_mult=0.8)

# Stage 2: High diversity for supporting context
# (remaining_docs = whatever wasn't picked in stage 1)
context_docs = mmr_select(query, remaining_docs, k=2, lambda_mult=0.3)
Monitoring MMR Effectiveness
Track these metrics to optimize your MMR settings:
metrics = {
    "response_diversity": measure_topic_coverage(response),
    "user_satisfaction": track_user_feedback(),
    "query_coverage": analyze_knowledge_gaps(),
    "redundancy_score": detect_content_overlap(),
}
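Those metric names are placeholders for whatever instrumentation you have. As one concrete example, a redundancy score can be as simple as the average pairwise similarity among the retrieved chunks (a sketch reusing cosine_similarity from earlier):

def redundancy_score(chunk_embeddings):
    """Average pairwise cosine similarity of retrieved chunks;
    values near 1.0 mean the result set is repeating itself."""
    pairs = [cosine_similarity(chunk_embeddings[i], chunk_embeddings[j])
             for i in range(len(chunk_embeddings))
             for j in range(i + 1, len(chunk_embeddings))]
    return sum(pairs) / len(pairs) if pairs else 0.0

If MMR is doing its job, this number should drop noticeably compared to plain similarity search on the same query.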
The Bottom Line
MMR transforms RAG systems from echo chambers into comprehensive knowledge assistants. It’s the difference between getting the same information three different ways versus getting three different perspectives on your question.
When to use MMR:
- General knowledge queries
- Exploratory questions
- “Tell me about…” requests
- Research and analysis tasks
When to skip MMR:
- Precise factual lookups
- “What is the exact…” questions
- Single-concept queries
- Performance-critical applications
The magic happens when you find the right balance between relevance and diversity for your specific use case. Too much relevance creates tunnel vision. Too much diversity creates confusion. MMR helps you find that sweet spot where your AI assistant becomes truly helpful rather than just technically correct.