Skip to main content

Roadmap v0.1

Productizing Intelligent Routing with Comprehensive Evaluation

Release Goal

This release focuses on productizing the semantic router with:

  1. Intelligent routing with configurable reasoning modes and model-family-aware templating
  2. Kubernetes-native deployment with auto-configuration from model evaluation
  3. Comprehensive benchmarking and monitoring beyond MMLU-Pro
  4. Production-ready caching and observability

Key P0 Deliverables

  • Router intelligence: Reasoning controller, ExtProc plugins, semantic caching
  • Operations: K8s operator, benchmarks, monitoring
  • Quality: Test coverage, integration tests, structured logging

Priority Criteria

P0
Critical / Must-Have

Directly impacts core functionality or correctness. Without this, the system cannot be reliably used in production.

P1
Important / Should-Have

Improves system quality, efficiency, or usability but is not blocking the basic workflow.

P2
Nice-to-Have / Exploratory

Experimental or advanced features that extend system capability.

RouterCore (area/core)

Model Selection and Configuration

#1 Reasoning mode controller

P0
Acceptance: Configurable reasoning effort levels per category; template handling for different model families (GPT OSS/Qwen3/DeepSeek/etc); metrics for reasoning mode decisions and model-specific template usage.

Routing Logic

#2 ExtProc modular architecture

P0
Acceptance: Interface for different classifiers and functionality modules; config for feature gating and module enablement; Plugin API versioning.

#3 Load and latency-aware endpoint resilience

P2
Acceptance: Endpoint selection using request concurrency and/or TTFT/TPOT. Using SLO driven metrics to automatic failover with load weighted selection between redundant endpoints; circuit breaker with error rate and load signal deviation thresholds.

Semantic Cache

#4 Production-ready semantic caching

P0
Acceptance: Support more backends in semantic caching; hit rates are tracked; cache eviction is configurable; performance benchmarks are included.

Research (area/research)

#5 Multi-factor routing algorithm

P1
Acceptance: Routing formula combining quality (model_scores), load (ModelLoad counter), and latency (ModelCompletionLatency histogram), and token usage and pricing; configurable for broad SLO based targets; documented in architecture guide.

#6 Dynamic model scoring system

P2
Acceptance: Online model score updates based on model accuracy, latency, and cost metrics; auto-updates model_scores in config; replaces static scoring in A/B test or through RL.

#7 Expand the use cases and evaluations

P2
Acceptance: Explore more use cases and evaluations for different categories, model families and tasks.

Networking (area/networking)

#8 Envoy ExtProc integration for AI gateways

P0
Acceptance: 1. ExtProc header/body mutation consistent with LLM-d/Envoy AI Gateway filter chains (documented setup for each) 2. Example Envoy configs for common patterns (e.g., A/B test, canary routing)

Bench (area/benchmark)

#9 Router benchmark CLI

P0
Acceptance: Command run_bench.sh with: 1. Per-category metrics: accuracy, response time, token counts (prompt/completion/total) 2. Per-model metrics: success rate, error distribution, latency distribution 3. Export to CSV/JSON for analysis

#10 Performance test suite

P0
Acceptance: Comprehensive test framework (including but not limited to MMLU-Pro accuracy, PII/jailbreak detection, latency) with configurable thresholds; CI integration with baseline metrics.

#11 Reasoning mode evaluation

P1
Acceptance: Compare standard vs. reasoning mode using: 1. Response correctness on MMLU(-Pro) and non-MMLU test sets 2. Token usage (completion_tokens/prompt_tokens ratio) 3. Response time per output token

User Experience (area/user-experience)

#12 Developer quickstart examples

P0
Acceptance: A new user can reproduce an evaluation report in under 10 minutes.

Test and Release (area/tooling, area/ci)

#13 More ExtProc test coverage

P0
Acceptance: 1. Ginkgo test suite for ExtProc components (request/response handling, model selection) 2. Integration tests for model-specific reasoning (GPT OSS/Qwen3/DeepSeek template kwargs) 3. Config validation tests (model paths, thresholds, cache settings)

#14 Classifier test framework

P1
Acceptance: 1. Test vectors for category/PII/jailbreak detection with expected outputs 2. Mock BERT model for fast CI runs 3. Snapshot tests for classification boundaries

Environment (area/environment)

#15 Container and k8s deployment readiness

P0
Acceptance: K8s manifests with model init container, health/readiness probes, resource limits, and metrics endpoints; documented deployment flow. Operator that uses LLM model eval to generate config yaml and startup the k8s deployment

#16 Model management automation

P1
Acceptance: Automated model download/verification from HuggingFace; version pinning; graceful fallback on missing models or revisions; HuggingFace model uploading CI to ensure models are fully evaluated before overwriting existing models.

Observability (area/observability)

#17 Minimal operator dashboard

P0
Acceptance: Grafana panels from logs/metrics for reasoning rate, cost, latency, and refusal rates.

#18 Structured logs and metrics

P0
Acceptance: Model choice, reasoning flag, token counts, cost, and reason codes are emitted; alerts are configurable.

#19 Routing policy visualization

P1
Acceptance: Grafana dashboard showing routing flow (source->target models), confidence distributions, and cost metrics; alerts on threshold violations.

Docs (area/document)

#20 Reasoning routing quickstart

P0
Acceptance: Short guide with config.yaml fields, example request/response, and a comprehensive evaluation command, within a recorded video for demo the reasoning use case..

#21 Policy cookbook and troubleshooting

P1
Acceptance: Short recipes with config.yaml snippets for categories/model_scores & use_reasoning, classifier thresholds, model_config PII policy, and vllm_endpoints mapping; troubleshooting maps common logs/errors to exact config fixes.

#22 Model performance evaluation guide

P1
Acceptance: Documents automated workflow to evaluate models (including but not limited to MMLU-Pro), generate performance-based routing config, and update categories[].model_scores; includes example evaluation->config pipeline.