Documentation — Wardis

Installation

Wardis runs its infrastructure (PostgreSQL, ClickHouse, and Redis) as Docker containers, while the gateway and dashboard run from the repository with pnpm. You need Docker, Docker Compose, Node.js, and pnpm installed on your machine. The whole stack starts with two commands.

# Clone the repository

$ git clone https://github.com/yourorg/wardis

$ cd wardis

# Start the infrastructure (PostgreSQL, ClickHouse, Redis)

$ docker compose up

# In a separate terminal — install dependencies and start gateway + dashboard

$ pnpm install && pnpm dev

The gateway starts on localhost:4000 and the dashboard on localhost:3000.

For semantic cache support (optional), start the full profile, which includes the embedding service and Qdrant:

$ docker compose --profile full up

Requirements

→ Docker & Docker Compose

→ Node.js 20+

→ pnpm 9+

→ ~2 GB RAM for the default stack (4 GB recommended with the full profile for semantic cache)

First login & setup

When you open localhost:3000 for the first time, you'll see the setup wizard. It walks you through four steps. The admin account you create in the first step becomes the organization owner.

Step 1: Create your admin account with email and password.

Step 2: Name your organization. This becomes the top level for all teams, keys, and budgets.

Step 3: Create your first team (optional; you can add teams later).

Step 4: Add a provider API key (OpenAI, Anthropic, etc.) so the gateway can proxy requests.

Connect a provider

Provider API keys are managed exclusively through the dashboard — environment variables like OPENAI_API_KEY are intentionally ignored. This ensures every request is tracked, budgeted, and audited.

Go to Gateway → Providers in the sidebar and add your provider's API key. Wardis supports:

OpenAI

Anthropic

Google Gemini

AWS Bedrock

Azure OpenAI

Ollama

Any OpenAI-compatible provider also works — DeepSeek, Groq, Mistral, Together AI, Fireworks AI, Cerebras, and Z.ai (GLM).

Your first request

Wardis exposes an OpenAI-compatible API. Point your existing code at the gateway instead of the provider — no other changes needed.

# Create an API key in the dashboard first (Workspace → API Keys)

$ curl http://localhost:4000/v1/chat/completions \
    -H "Authorization: Bearer YOUR_WARDIS_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "gpt-4o",
      "messages": [{"role": "user", "content": "Hello"}]
    }'

Or in Python with the OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000/v1",
    api_key="YOUR_WARDIS_KEY"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)

The request is proxied to OpenAI, and the response is returned in the same format. Wardis logs the token count, cost, and latency automatically. Check the dashboard to see it appear in real time.

Dashboard overview

The dashboard is organized into four sections, accessible from the left sidebar:

Analytics

Dashboard overview, usage logs, and agent task tracking. This is where you see costs, token counts, and request history.

Control

Budgets, alerts, and automation rules. Set spending limits per org, team, or key. Configure notifications and budget-triggered routing.

Gateway

Provider configuration, routing rules, and semantic cache settings. Manage which AI providers are available and how requests are routed.

Workspace

API keys, teams, users & RBAC, and audit log. Manage access, invite team members, and review all actions.

Usage & analytics

The usage page shows every request that passes through the gateway. You can filter by date range, model, provider, team, API key, and status. Each row shows the model used, token count (input/output), cost in USD, latency, and status.

Click any request to see the full details including request metadata, headers, and the associated agent task (if any). All event data is stored in ClickHouse for fast queries even over millions of rows.

API keys

API keys authenticate requests to the Wardis gateway. Each key belongs to a team and can have its own budget limit. Keys are hashed with bcrypt and never stored in plain text.

Go to Workspace → API Keys to create, rotate, or deactivate keys. Each key shows its usage metrics — requests, tokens, and cost — directly in the table.

# Use your Wardis API key in requests — not the provider key

$ curl -H "Authorization: Bearer wd-your-key-here" \
    http://localhost:4000/v1/chat/completions

Keys are prefixed with wd- to distinguish them from provider keys.

Teams

Teams group users and API keys together. Each team can have its own monthly budget. Costs from all keys belonging to a team are aggregated at the team level.

For SaaS builders, teams map naturally to customers — one team per customer, one key per customer, one budget per customer. See the SaaS use case for details.

Providers

Wardis proxies requests to AI providers using an OpenAI-compatible API. You configure provider API keys through the dashboard, not through environment variables. This ensures every request is fully tracked and audited.

Supported providers

Native adapters: OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure OpenAI, Ollama. Each has a dedicated adapter that handles request/response format translation.

OpenAI-compatible: DeepSeek, Groq, Mistral, Together AI, Fireworks AI, Cerebras, Z.ai (GLM). Any provider that accepts the OpenAI chat completions format works out of the box.

Model pricing

Wardis ships with pricing data for over 2,100 models via the LiteLLM pricing database. Prices are updated regularly. You can view and search all models on the Gateway → Providers page in the dashboard.
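As a rough illustration of how a per-request cost can be derived from a pricing table, the sketch below multiplies token counts by per-token prices. The field names `input_cost_per_token` and `output_cost_per_token` follow the LiteLLM pricing file. The model name, the prices, and the `estimate_cost` function are placeholders invented for this example and are not part of Wardis.

```python
from decimal import Decimal

# Placeholder pricing entry; real values would come from the pricing database.
PRICING = {
    "example-model": {
        "input_cost_per_token": Decimal("0.000002"),
        "output_cost_per_token": Decimal("0.000008"),
    }
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the estimated USD cost of one request (hypothetical helper)."""
    entry = PRICING[model]
    return (
        input_tokens * entry["input_cost_per_token"]
        + output_tokens * entry["output_cost_per_token"]
    )


cost = estimate_cost("example-model", input_tokens=1000, output_tokens=500)
print(cost)
```

Decimal is used here instead of float so that many small per-request amounts can be summed without rounding drift.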

Budgets

Budgets are enforced in real time on every request before it reaches the provider. The hierarchy is: organization → team → API key. A request is blocked if any level in the hierarchy has exceeded its limit.

# Budget hierarchy example

org_budget: $10,000/month
  team_engineering: $5,000
    key_prod: $3,000
    key_staging: $500
  team_support: $2,000

Enforcement behavior

At 80% of budget — a warning alert is sent (email, Slack, or webhook).

At 95% — budget-aware routing can automatically switch to a cheaper model.

At 100% — the request is rejected with HTTP 429.

Budgets reset automatically at the start of each calendar month (UTC). Set budgets inline on the Control → Budgets page.
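The enforcement behavior above can be sketched as a check over every level in the hierarchy, where the most heavily used level decides the outcome. This is a minimal illustration, not the gateway's implementation: the `BudgetLevel` class, the `evaluate` function, and the returned labels are hypothetical, and every limit is assumed to be greater than zero.

```python
from dataclasses import dataclass


@dataclass
class BudgetLevel:
    name: str     # e.g. "org", "team_engineering", "key_prod"
    limit: float  # monthly limit in USD (assumed > 0)
    spent: float  # spend so far this month in USD


def evaluate(levels: list[BudgetLevel]) -> str:
    """Return "reject", "downgrade", "warn", or "allow" for a request.

    The level with the highest usage ratio determines the result.
    """
    worst = max(level.spent / level.limit for level in levels)
    if worst >= 1.0:
        return "reject"     # corresponds to the HTTP 429 case
    if worst >= 0.95:
        return "downgrade"  # budget-aware routing may pick a cheaper model
    if worst >= 0.80:
        return "warn"       # a warning alert would be sent
    return "allow"


hierarchy = [
    BudgetLevel("org", 10_000, 4_200),
    BudgetLevel("team_engineering", 5_000, 4_100),
    BudgetLevel("key_prod", 3_000, 2_900),
]
print(evaluate(hierarchy))
```

In this example the organization is at 42% and the team at 82%, but the key is above 95%, so the key drives the decision.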

Alerts

Alert rules notify you when something needs attention. You can create rules for budget thresholds, cost spikes, error rates, and latency.

Budget threshold — triggered when a team or key reaches a percentage of its budget.

Cost spike — triggered when spend increases more than 200% in a 5-minute window.

Error rate — triggered when provider errors exceed a threshold.

Latency — triggered when response times exceed a threshold.

Alerts can be sent via email, Slack webhook, or generic webhook. Each rule has a configurable cooldown period to prevent alert fatigue. Manage rules on the Control → Alerts page.
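To make the cost spike rule and the cooldown concrete, here is a small sketch. It assumes the spike is measured by comparing spend in the current window with spend in the previous window, which the text does not spell out. The names `is_cost_spike` and `Cooldown` are hypothetical.

```python
def is_cost_spike(previous_window_usd: float, current_window_usd: float,
                  threshold_pct: float = 200.0) -> bool:
    """True if spend grew by more than threshold_pct between two windows."""
    if previous_window_usd <= 0:
        return False  # assumption: no baseline means no spike is reported
    increase_pct = (
        (current_window_usd - previous_window_usd) / previous_window_usd * 100
    )
    return increase_pct > threshold_pct


class Cooldown:
    """Suppress repeat notifications for a rule until the cooldown elapses."""

    def __init__(self, seconds: float) -> None:
        self.seconds = seconds
        self.last_fired: dict[str, float] = {}

    def should_fire(self, rule_id: str, now: float) -> bool:
        last = self.last_fired.get(rule_id)
        if last is not None and now - last < self.seconds:
            return False  # still cooling down
        self.last_fired[rule_id] = now
        return True
```

Under this reading, an increase of more than 200% means the current window costs more than three times the previous one.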

Agent tracking

AI agents — especially multi-agent systems — can generate hundreds of LLM requests across orchestrators, subagents, and tool calls. Wardis correlates all of these into a single task with a detailed cost breakdown by agent and by level in the agent tree.

How it works

Agents inject tracking headers into their requests:

X-Wardis-Task-ID: task_abc123

X-Wardis-Agent-ID: agent_orchestrator

X-Wardis-Agent-Type: team_lead

X-Wardis-Parent-Task-ID: parent_task_xyz

Wardis groups all requests sharing the same task ID and builds the agent tree automatically. The result is a single view showing the orchestrator, each subagent, and every tool call — with cost, token count, and duration at every level.
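A small helper can keep the header names consistent across agents. The sketch below builds the four headers shown above and checks the agent type against the types listed in this section. The function name `tracking_headers` is hypothetical, and treating the parent header as optional is an assumption.

```python
from typing import Optional

AGENT_TYPES = {"standalone", "team_lead", "subagent", "teammate", "external"}


def tracking_headers(
    task_id: str,
    agent_id: str,
    agent_type: str = "standalone",
    parent_task_id: Optional[str] = None,
) -> dict[str, str]:
    """Build the X-Wardis-* headers for one agent's requests."""
    if agent_type not in AGENT_TYPES:
        raise ValueError(f"unknown agent type: {agent_type}")
    headers = {
        "X-Wardis-Task-ID": task_id,
        "X-Wardis-Agent-ID": agent_id,
        "X-Wardis-Agent-Type": agent_type,
    }
    # Assumption: the parent header is only sent when a parent task exists.
    if parent_task_id is not None:
        headers["X-Wardis-Parent-Task-ID"] = parent_task_id
    return headers


headers = tracking_headers("task_abc123", "agent_orchestrator", "team_lead")
print(headers)
```

With the OpenAI Python SDK, a dict like this can be passed per request through the `extra_headers` argument of `client.chat.completions.create(...)`.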

Agent types

standalone — a single agent working alone.

team_lead — an orchestrator coordinating other agents.

subagent — a child agent spawned by a parent task.

teammate — a peer agent in a team (parallel execution).

external — agents from external frameworks (LangChain, CrewAI, etc.).

View agent tasks on the Analytics → Agents page. Each task shows the full breakdown with a timeline of when each agent started and finished.
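The per-agent breakdown can be pictured as a simple aggregation over request events that share a task ID. The sketch below is only an illustration of that idea. The event shape, the field names, and the `breakdown_by_agent` function are assumptions, not the schema Wardis stores.

```python
from collections import defaultdict

# Hypothetical events; the field names are assumptions for this example.
EVENTS = [
    {"task_id": "task_abc123", "agent_id": "agent_orchestrator", "cost_usd": 0.020, "tokens": 1200},
    {"task_id": "task_abc123", "agent_id": "agent_researcher", "cost_usd": 0.005, "tokens": 400},
    {"task_id": "task_abc123", "agent_id": "agent_researcher", "cost_usd": 0.010, "tokens": 800},
    {"task_id": "task_other", "agent_id": "agent_orchestrator", "cost_usd": 0.500, "tokens": 9000},
]


def breakdown_by_agent(events: list[dict], task_id: str) -> dict[str, dict]:
    """Sum cost, tokens, and request count per agent for one task."""
    totals: dict[str, dict] = defaultdict(
        lambda: {"cost_usd": 0.0, "tokens": 0, "requests": 0}
    )
    for event in events:
        if event["task_id"] != task_id:
            continue  # ignore requests that belong to other tasks
        agent = totals[event["agent_id"]]
        agent["cost_usd"] += event["cost_usd"]
        agent["tokens"] += event["tokens"]
        agent["requests"] += 1
    return dict(totals)


report = breakdown_by_agent(EVENTS, "task_abc123")
print(report)
```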

Routing

Routing rules control how requests are directed to providers. You can route based on model, cost, latency, or custom conditions. Fallback chains let you automatically retry with a different provider if the primary one fails.

Budget-aware routing

When a team or key approaches its budget limit, Wardis can automatically switch requests to a cheaper model. For example, it can route from claude-opus-4.7 to haiku-4.5 once 95% of the budget is consumed. Configure this on the Control → Automation page.
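The downgrade decision can be sketched as a lookup that only applies once usage crosses a threshold. The mapping below reuses the model names from the example in the text. The `pick_model` function and its parameters are hypothetical and do not reflect how rules are stored in the dashboard.

```python
# Downgrade map taken from the example in the text.
DOWNGRADES = {"claude-opus-4.7": "haiku-4.5"}


def pick_model(requested: str, budget_used: float, threshold: float = 0.95,
               downgrades: dict[str, str] = DOWNGRADES) -> str:
    """Return the model to use given the fraction of budget already consumed."""
    if budget_used >= threshold:
        # Models without a configured cheaper alternative are left unchanged.
        return downgrades.get(requested, requested)
    return requested


print(pick_model("claude-opus-4.7", budget_used=0.96))
```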

Fallback chains

If a provider returns an error or is unreachable, routing rules can automatically fall back to an alternative provider. Configure routes on the Gateway → Routing page.

Semantic cache

The semantic cache avoids sending duplicate or near-duplicate requests to providers. It works on two levels: exact match (SHA-256 hash via Redis) and semantic similarity (embeddings via Qdrant). Cache is opt-in per API key.
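The two levels can be illustrated with an in-memory stand-in: a dict plays the role of Redis for exact matches, and a list of vectors plays the role of Qdrant for similarity search. This is a sketch of the general technique under those assumptions. The class and function names are hypothetical, and the embeddings are supplied by the caller rather than computed.

```python
import hashlib
import json
import math


def exact_key(request: dict) -> str:
    """SHA-256 of a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TwoLevelCache:
    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.exact: dict[str, str] = {}                   # stands in for Redis
        self.vectors: list[tuple[list[float], str]] = []  # stands in for Qdrant

    def put(self, request: dict, embedding: list[float], response: str) -> None:
        self.exact[exact_key(request)] = response
        self.vectors.append((embedding, response))

    def get(self, request: dict, embedding: list[float]):
        hit = self.exact.get(exact_key(request))
        if hit is not None:
            return hit  # level 1: exact match
        best = max(self.vectors, key=lambda item: cosine(item[0], embedding),
                   default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]  # level 2: similar enough
        return None
```

A higher threshold means fewer semantic hits but less risk of returning a response to a question that only looks similar.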

Setup

Semantic cache requires the embedding service and Qdrant. Start them with the full Docker profile:

$ docker compose --profile full up

Then enable caching per API key on the Gateway → Cache page. You can configure the TTL and similarity threshold per key.

Cached responses include the header X-Wardis-Cache: hit so you can verify cache behavior. Cache stats — hit rate and estimated savings — are shown on the cache page.

API reference

The gateway exposes an OpenAI-compatible proxy API and a management API for programmatic control.

Proxy endpoints

POST  /v1/chat/completions  Chat completion proxy

POST  /v1/completions       Legacy completions

POST  /v1/embeddings        Embeddings proxy

GET   /v1/models            List available models

Management API

GET   /api/overview        Dashboard metrics

GET   /api/usage           Usage data with filters

GET   /api/keys            List API keys

POST  /api/keys            Create API key

GET   /api/teams           List teams

GET   /api/budgets/status  Budget status

GET   /api/tasks/:taskId   Agent task details

GET   /api/alerts/rules    Alert rules

GET   /api/cache/stats     Cache statistics

All management endpoints require session authentication (cookie-based via the dashboard) or a valid API key.
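For scripted access, a management call might look like the sketch below, which uses only the Python standard library. It assumes the API key is sent as a Bearer token in the Authorization header, in the same way as for proxy requests. The text does not confirm this for management endpoints, and `management_request` is a hypothetical helper.

```python
import urllib.request

GATEWAY = "http://localhost:4000"


def management_request(path: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for a management endpoint such as /api/budgets/status."""
    return urllib.request.Request(
        GATEWAY + path,
        # Assumption: management endpoints accept the key as a Bearer token.
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


req = management_request("/api/budgets/status", "wd-your-key-here")

# To send it against a running gateway:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode())
```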

MCP server

Wardis includes an MCP (Model Context Protocol) server that Claude Code and other agents can connect to for automatic cost tracking. When configured, agents can start tasks, check budgets, and receive cost reports without any manual header injection.

Available tools

start_task(name, budget_limit?) — creates a task, returns a task ID to inject into headers.

end_task(task_id) — closes the task and returns total cost, duration, and request count.

get_budget_status(team_id?) — returns current budget remaining and percentage used.

set_model_preference(task_type, model) — saves a model preference for a type of task.

get_cost_report(period) — returns a cost report for the specified time period.
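MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` message. The sketch below only shows the shape of such a message for `start_task`. A real MCP client library also handles the initialization handshake and transport. The task name and the budget value are made up, and the unit of `budget_limit` is an assumption.

```python
import itertools
import json

_ids = itertools.count(1)  # JSON-RPC request IDs


def tool_call(name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 tools/call message for an MCP tool."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


# Argument names follow the tool signature start_task(name, budget_limit?).
payload = tool_call("start_task", {"name": "nightly-report", "budget_limit": 5.0})
print(payload)
```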

Configuration

Add the Wardis MCP server to your .mcp.json:

{
  "mcpServers": {
    "wardis": {
      "url": "http://localhost:4000/mcp"
    }
  }
}

Architecture

Wardis is a monorepo built with pnpm workspaces. The main components are:

Gateway — apps/gateway

Node.js / TypeScript / Fastify. Handles proxy, auth, token counting, cost calculation, budget enforcement, agent correlation, routing, and caching. Runs on port 4000.

Dashboard — apps/dashboard

Next.js 15 / React 19 / TailwindCSS. The management UI. Proxies API calls to the gateway on the same origin. Runs on port 3000.

Embedding Service — apps/embedding-service

Python / FastAPI. Generates embeddings for the semantic cache using paraphrase-multilingual-MiniLM-L12-v2. Runs on port 8080. Optional — only needed for semantic cache.

Data stores

PostgreSQL

Primary database. Stores organizations, teams, users, API keys, provider configs, alert rules, routing rules, and audit log. Managed via Drizzle ORM.

ClickHouse

Analytics database. Stores all LLM request events and agent task aggregations. Columnar storage for fast analytical queries over millions of rows.

Redis

Real-time budget tracking, rate limiting, exact-match cache, and alert cooldown tracking. Sub-millisecond operations.

Qdrant

Vector database for semantic cache. Stores request embeddings and finds similar requests by cosine similarity. Optional — only needed for semantic cache.
