Hybrid Agentic RAG with LangSmith & Traceloop Observability
How we delivered a hybrid Agentic RAG system with dual observability (LangSmith + Traceloop), 12+ RAGAS 2025 evaluation metrics, and production-ready TypeScript in just 3 weeks.
The Challenge
The client had a basic vector search implementation for their AI assistant but needed to explore whether alternative retrieval strategies could improve results. They wanted to compare their existing approach against keyword search, RRF-fusion, and agentic dual-tool methods, but they had no systematic way to evaluate them.
The team faced critical questions:
- Which retrieval strategy works best for which query types?
- How do we measure RAG quality beyond simple accuracy?
- How do we evaluate both production traces AND synthetic test cases?
- How do we implement continuous improvement with measurable metrics?
Without answers, they were flying blind - unable to optimize their AI assistant's retrieval layer with confidence.
The Solution
We delivered a comprehensive RAG evaluation and observability system:
Hybrid Agentic Retrieval POC
- Built 3 new retrieval strategies to compare against existing vector search:
  - Keyword search for exact term matching
  - RRF-fusion: a single tool that executes both vector and keyword search per call, then fuses results using Reciprocal Rank Fusion (e.g., a 7:3 ratio favoring vector)
  - Dual-tool agentic: the agent has access to vector and keyword tools separately, deciding which to call based on the query
- Systematic A/B testing infrastructure for strategy comparison
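The RRF-fusion strategy above can be sketched as a small pure function. This is a minimal illustration, not the delivered implementation: the function and parameter names are ours, and the `k = 60` damping constant is the commonly used RRF default.

```typescript
// Reciprocal Rank Fusion (RRF): merge two ranked result lists by summing
// weighted reciprocal-rank scores per document, then re-sorting.
type RankedList = string[]; // ordered document IDs, best first

function rrfFuse(
  vectorResults: RankedList,
  keywordResults: RankedList,
  vectorWeight = 0.7, // the 7:3 ratio favoring vector search
  keywordWeight = 0.3,
  k = 60 // standard RRF damping constant
): string[] {
  const scores = new Map<string, number>();
  const accumulate = (list: RankedList, weight: number) => {
    list.forEach((id, rank) => {
      // rank is 0-based here, so rank + 1 is the document's position.
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  };
  accumulate(vectorResults, vectorWeight);
  accumulate(keywordResults, keywordWeight);
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A document appearing in both lists accumulates score from each, so cross-strategy agreement naturally rises to the top even with the vector-favoring weights.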
12+ Evaluation Metrics (RAGAS 2025 Aligned)
| Tier | Metric | What It Measures |
|---|---|---|
| 1 - Critical | Pairwise Preference | Overall response quality |
| 1 - Critical | Context Recall | % of relevant docs retrieved |
| 1 - Critical | Response Relevance | Does response answer the question? |
| 2 - Diagnostic | Pairwise Retrieval | Quality of retrieved context |
| 2 - Diagnostic | Context Precision | % of retrieved docs that are relevant |
| 2 - Diagnostic | Faithfulness | Response grounded in context |
| 2 - Diagnostic | RAG Score | Harmonic mean of 4 core metrics |
| 3 - Additional | F0.5/F1/F2 Scores | Precision/recall trade-offs |
| 3 - Additional | Tool Usage | Effectiveness of tool calls |
| 3 - Additional | Retrieval Efficiency | For agentic multi-step RAG |
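The composite scores in the table follow standard definitions. As a hedged sketch: we assume the four core metrics feeding the RAG Score are context recall, context precision, faithfulness, and response relevance (consistent with the Tier 1/2 rows above), and the F-scores use the textbook F-beta formula.

```typescript
// Harmonic mean: punishes any single weak metric, unlike an arithmetic mean.
function harmonicMean(values: number[]): number {
  if (values.some((v) => v <= 0)) return 0; // degenerate at zero
  return values.length / values.reduce((sum, v) => sum + 1 / v, 0);
}

// RAG Score: harmonic mean of the four core metrics (assumed set).
function ragScore(
  contextRecall: number,
  contextPrecision: number,
  faithfulness: number,
  responseRelevance: number
): number {
  return harmonicMean([contextRecall, contextPrecision, faithfulness, responseRelevance]);
}

// F-beta: beta < 1 favors precision (F0.5), beta > 1 favors recall (F2),
// beta = 1 is the balanced F1.
function fBeta(precision: number, recall: number, beta: number): number {
  const b2 = beta * beta;
  const denom = b2 * precision + recall;
  return denom === 0 ? 0 : ((1 + b2) * precision * recall) / denom;
}
```

Because the harmonic mean collapses toward the weakest input, a system with perfect faithfulness but poor recall still scores low, which is the intended diagnostic behavior.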
Smart Evaluators with Dual Mode
- Production Trace Evaluation: LLM-as-judge semantic analysis (no re-execution)
- Synthetic Dataset Evaluation: Fresh workflow runs against criteria
- Same evaluators adapt automatically based on data type
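The dual-mode behavior can be illustrated with a discriminated union: the same evaluator branches on whether it receives a recorded production trace (judge only, no re-execution) or a synthetic case (fresh run, then judge). The types and function names here are illustrative, and the workflow/judge calls are shown synchronously for clarity.

```typescript
// A recorded production trace vs. a synthetic test case.
type ProductionTrace = { kind: "trace"; input: string; output: string; context: string[] };
type SyntheticCase = { kind: "synthetic"; input: string; criteria: string };
type EvalItem = ProductionTrace | SyntheticCase;

// runWorkflow stands in for a fresh RAG execution; judge stands in for an
// LLM-as-judge call returning a 0..1 score.
function evaluateItem(
  item: EvalItem,
  runWorkflow: (input: string) => { output: string; context: string[] },
  judge: (output: string, reference: string) => number
): number {
  if (item.kind === "trace") {
    // Production mode: score the recorded output semantically; no re-execution.
    return judge(item.output, item.context.join("\n"));
  }
  // Synthetic mode: execute the workflow fresh, then score against criteria.
  const run = runWorkflow(item.input);
  return judge(run.output, item.criteria);
}
```

The key property is that callers pass either data shape to one evaluator, and the `kind` tag, not the caller, determines whether the workflow re-runs.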
Dual Observability Stack
- LangSmith integration for experiment tracking and trace analysis
- Traceloop via Vercel AI SDK OpenTelemetry (9 AI calls instrumented)
- Real-time monitoring without instrumentation complexity
The Results
How Dual-tool Agentic Works
| Query Type | Example | LLM Decision |
|---|---|---|
| 📝 Factual Questions | "Where is the API key?" | Selects keyword search |
| 💡 Conceptual Questions | "How does auth work?" | Selects vector search |
| 🧩 Complex Multi-part | "How to configure and show examples" | Uses both iteratively |
Why Dual-tool Agentic Strategy Wins
The dual-tool agentic approach emerged as the recommended strategy because it allows the LLM to intelligently select the most appropriate retrieval method based on query characteristics:
- Adaptive Intelligence: The agent analyzes each query and decides whether to use vector search (for semantic/conceptual questions), keyword search (for exact matches), or both
- Query-Aware Routing: Complex multi-part questions benefit from iterative tool selection, while simple factual queries get direct keyword matches
- Best of Both Worlds: Unlike fixed fusion ratios, the agent dynamically optimizes retrieval per query context
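In the delivered system this routing decision is made by the LLM from the tool descriptions; the heuristic below is only a deterministic stand-in that mirrors the table above, with an invented function name and illustrative keyword patterns.

```typescript
type Route = "keyword" | "vector" | "both";

// Toy approximation of the agent's tool choice: multi-part queries use both
// tools iteratively, lookup-style factual queries use keyword search, and
// conceptual questions fall through to vector search.
function routeQuery(query: string): Route {
  const q = query.toLowerCase();
  // Split on conjunctions and sentence breaks to detect multi-part questions.
  const parts = q.split(/\band\b|;|\?/).filter((p) => p.trim().length > 0);
  if (parts.length > 1) return "both";
  // Illustrative signals for exact-match ("where is X") style lookups.
  const factual = /\b(where|which file|api key|version|port|flag)\b/.test(q);
  return factual ? "keyword" : "vector";
}
```

An actual agent makes this choice with full query understanding rather than regexes, which is precisely why it outperforms any fixed heuristic or fusion ratio.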
Technical Deliverables
- 12+ evaluation metrics with tiered priority system (RAGAS 2025 aligned)
- Smart evaluators adapting between production traces and synthetic datasets
- Synthetic dataset generation for consistent testing
- RAG score computation (harmonic mean + F-scores)
- Full LangSmith experiment tracking
- Traceloop observability via OpenTelemetry
Business Impact
- Intelligent, query-aware retrieval strategy
- Comprehensive evaluation framework for continuous improvement
- Production-ready TypeScript code from day one
- Zero integration overhead with existing infrastructure
Ready to Transform Your Project?
Let's discuss how we can deliver production-ready solutions for your business.
Schedule a Call