Hybrid Agentic RAG with LangSmith & Traceloop Observability
How we delivered a hybrid Agentic RAG system with dual observability (LangSmith + Traceloop), 12+ RAGAS 2025 evaluation metrics, and production-ready TypeScript in just 3 weeks.
The Challenge
The client had a basic vector search implementation for their AI assistant but needed to explore whether alternative retrieval strategies could improve results. They wanted to compare their existing approach against keyword search, RRF-fusion, and agentic dual-tool methods, but they had no systematic way to evaluate them.
The team faced critical questions:
- Which retrieval strategy works best for which query types?
- How do we measure RAG quality beyond simple accuracy?
- How do we evaluate both production traces AND synthetic test cases?
- How do we implement continuous improvement with measurable metrics?
Without answers, they were flying blind - unable to optimize their AI assistant's retrieval layer with confidence.
The Solution
We delivered a comprehensive RAG evaluation and observability system:
Hybrid Agentic Retrieval POC
- Built 3 new retrieval strategies to compare against existing vector search:
  - Keyword search for exact term matching
  - RRF-fusion: a single tool that executes both vector and keyword search per call, then fuses results using Reciprocal Rank Fusion (e.g., a 7:3 ratio favoring vector)
  - Dual-tool agentic: the agent has access to vector and keyword tools separately, deciding which to call based on the query
- Systematic A/B testing infrastructure for strategy comparison
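The RRF-fusion strategy above can be sketched as a small pure function. This is a minimal illustration, not the delivered implementation: the function and parameter names are ours, and the `k = 60` damping constant is the commonly used RRF default.

```typescript
// Reciprocal Rank Fusion (RRF): merge two ranked result lists by summing
// weighted reciprocal-rank scores per document, then re-sorting.
type RankedList = string[]; // ordered document IDs, best first

function rrfFuse(
  vectorResults: RankedList,
  keywordResults: RankedList,
  vectorWeight = 0.7, // the 7:3 ratio favoring vector search
  keywordWeight = 0.3,
  k = 60 // standard RRF damping constant
): string[] {
  const scores = new Map<string, number>();
  const accumulate = (list: RankedList, weight: number) => {
    list.forEach((id, rank) => {
      // rank is 0-based here, so rank + 1 is the document's position.
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  };
  accumulate(vectorResults, vectorWeight);
  accumulate(keywordResults, keywordWeight);
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A document appearing in both lists accumulates score from each, so cross-strategy agreement naturally rises to the top even with the vector-favoring weights.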
12+ Evaluation Metrics (RAGAS 2025 Aligned)
| Tier | Metric | What It Measures |
|---|---|---|
| 1 - Critical | Pairwise Preference | Overall response quality |
| 1 - Critical | Context Recall | % of relevant docs retrieved |
| 1 - Critical | Response Relevance | Does response answer the question? |
| 2 - Diagnostic | Pairwise Retrieval | Quality of retrieved context |
| 2 - Diagnostic | Context Precision | % of retrieved docs that are relevant |
| 2 - Diagnostic | Faithfulness | Response grounded in context |
| 2 - Diagnostic | RAG Score | Harmonic mean of 4 core metrics |
| 3 - Additional | F0.5/F1/F2 Scores | Precision/recall trade-offs |
| 3 - Additional | Tool Usage | Effectiveness of tool calls |
| 3 - Additional | Retrieval Efficiency | For agentic multi-step RAG |
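The composite scores in the table follow standard definitions. As a hedged sketch: we assume the four core metrics feeding the RAG Score are context recall, context precision, faithfulness, and response relevance (consistent with the Tier 1/2 rows above), and the F-scores use the textbook F-beta formula.

```typescript
// Harmonic mean: punishes any single weak metric, unlike an arithmetic mean.
function harmonicMean(values: number[]): number {
  if (values.some((v) => v <= 0)) return 0; // degenerate at zero
  return values.length / values.reduce((sum, v) => sum + 1 / v, 0);
}

// RAG Score: harmonic mean of the four core metrics (assumed set).
function ragScore(
  contextRecall: number,
  contextPrecision: number,
  faithfulness: number,
  responseRelevance: number
): number {
  return harmonicMean([contextRecall, contextPrecision, faithfulness, responseRelevance]);
}

// F-beta: beta < 1 favors precision (F0.5), beta > 1 favors recall (F2),
// beta = 1 is the balanced F1.
function fBeta(precision: number, recall: number, beta: number): number {
  const b2 = beta * beta;
  const denom = b2 * precision + recall;
  return denom === 0 ? 0 : ((1 + b2) * precision * recall) / denom;
}
```

Because the harmonic mean collapses toward the weakest input, a system with perfect faithfulness but poor recall still scores low, which is the intended diagnostic behavior.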
Smart Evaluators with Dual Mode
- Production Trace Evaluation: LLM-as-judge semantic analysis (no re-execution)
- Synthetic Dataset Evaluation: Fresh workflow runs against criteria
- Same evaluators adapt automatically based on data type
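The dual-mode behavior can be illustrated with a discriminated union: the same evaluator branches on whether it receives a recorded production trace (judge only, no re-execution) or a synthetic case (fresh run, then judge). The types and function names here are illustrative, and the workflow/judge calls are shown synchronously for clarity.

```typescript
// A recorded production trace vs. a synthetic test case.
type ProductionTrace = { kind: "trace"; input: string; output: string; context: string[] };
type SyntheticCase = { kind: "synthetic"; input: string; criteria: string };
type EvalItem = ProductionTrace | SyntheticCase;

// runWorkflow stands in for a fresh RAG execution; judge stands in for an
// LLM-as-judge call returning a 0..1 score.
function evaluateItem(
  item: EvalItem,
  runWorkflow: (input: string) => { output: string; context: string[] },
  judge: (output: string, reference: string) => number
): number {
  if (item.kind === "trace") {
    // Production mode: score the recorded output semantically; no re-execution.
    return judge(item.output, item.context.join("\n"));
  }
  // Synthetic mode: execute the workflow fresh, then score against criteria.
  const run = runWorkflow(item.input);
  return judge(run.output, item.criteria);
}
```

The key property is that callers pass either data shape to one evaluator, and the `kind` tag, not the caller, determines whether the workflow re-runs.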
Dual Observability Stack
- LangSmith integration for experiment tracking and trace analysis
- Traceloop via Vercel AI SDK OpenTelemetry (9 AI calls instrumented)
- Real-time monitoring without instrumentation complexity
The Results
How Dual-tool Agentic Works
| Query Type | Example | LLM Decision |
|---|---|---|
| 📝 Factual Questions | "Where is the API key?" | Selects keyword search |
| 💡 Conceptual Questions | "How does auth work?" | Selects vector search |
| 🧩 Complex Multi-part | "How to configure and show examples" | Uses both iteratively |
Why Dual-tool Agentic Strategy Wins
The dual-tool agentic approach emerged as the recommended strategy because it allows the LLM to intelligently select the most appropriate retrieval method based on query characteristics:
- Adaptive Intelligence: The agent analyzes each query and decides whether to use vector search (for semantic/conceptual questions), keyword search (for exact matches), or both
- Query-Aware Routing: Complex multi-part questions benefit from iterative tool selection, while simple factual queries get direct keyword matches
- Best of Both Worlds: Unlike fixed fusion ratios, the agent dynamically optimizes retrieval per query context
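In the delivered system this routing decision is made by the LLM from the tool descriptions; the heuristic below is only a deterministic stand-in that mirrors the table above, with an invented function name and illustrative keyword patterns.

```typescript
type Route = "keyword" | "vector" | "both";

// Toy approximation of the agent's tool choice: multi-part queries use both
// tools iteratively, lookup-style factual queries use keyword search, and
// conceptual questions fall through to vector search.
function routeQuery(query: string): Route {
  const q = query.toLowerCase();
  // Split on conjunctions and sentence breaks to detect multi-part questions.
  const parts = q.split(/\band\b|;|\?/).filter((p) => p.trim().length > 0);
  if (parts.length > 1) return "both";
  // Illustrative signals for exact-match ("where is X") style lookups.
  const factual = /\b(where|which file|api key|version|port|flag)\b/.test(q);
  return factual ? "keyword" : "vector";
}
```

An actual agent makes this choice with full query understanding rather than regexes, which is precisely why it outperforms any fixed heuristic or fusion ratio.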
Technical Deliverables
- 12+ evaluation metrics with tiered priority system (RAGAS 2025 aligned)
- Smart evaluators adapting between production traces and synthetic datasets
- Synthetic dataset generation for consistent testing
- RAG score computation (harmonic mean + F-scores)
- Full LangSmith experiment tracking
- Traceloop observability via OpenTelemetry
Business Impact
- Intelligent, query-aware retrieval strategy
- Comprehensive evaluation framework for continuous improvement
- Production-ready TypeScript code from day one
- Zero integration overhead with existing infrastructure
Ready to Transform Your Project?
Let's discuss how we can deliver production-ready solutions for your business.
Schedule a Call