Observability
Prometheus + Grafana monitoring stack for Geetanjali.
Overview
The observability stack provides metrics collection, visualization, and alerting. It runs as optional containers alongside the core application.
Backend /metrics ──┐
                   ├──▶ Prometheus (scrape 15s) ──▶ Grafana (dashboards + alerts)
Worker /metrics  ──┘
Quick Start
# Start with observability
make obs-up
# Or: docker compose -f docker-compose.observability.yml up -d
# Access dashboards
# Production: https://grafana.geetanjaliapp.com
# Local dev: http://localhost:3000 (Grafana)
# Prometheus is internal-only (not exposed)
Default Grafana login is admin; the password is whatever GRAFANA_ADMIN_PASSWORD is set to in .env.
Metrics Architecture
Metrics are split across separate modules to prevent duplicate registration:
| Module | Scraped From | Purpose |
|---|---|---|
| metrics_business.py | Backend only | Business gauges (consultations, users, feedback) |
| metrics_infra.py | Backend only | Infrastructure gauges (postgres, redis, queue) |
| metrics_events.py | Both services | Event counters (cache, email, vector search) |
| metrics_llm.py | Worker only | LLM request counters and circuit breaker states |
Why the split? Each Python process has its own Prometheus registry. Without separation, both backend and worker would register the same gauges, causing duplicate/conflicting values.
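For illustration, a minimal sketch of what the split might look like with prometheus_client (the variable names are hypothetical; the metric names match the tables below):

```python
# Hypothetical sketch of the module split; actual definitions live in
# the metrics_*.py modules named above.
from prometheus_client import Counter, Gauge

# metrics_business.py -- gauges owned by the backend's scheduled collector
CONSULTATIONS_TOTAL = Gauge(
    "geetanjali_consultations_total", "Total completed consultations"
)

# metrics_llm.py -- counters owned by the worker, which makes the LLM calls
LLM_REQUESTS = Counter(
    "geetanjali_llm_requests_total", "LLM API requests", ["provider", "status"]
)

# Each process imports only the modules it owns, so backend and worker never
# register the same metric name in their respective registries.
```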
Metrics Reference
Business Metrics (Backend only)
| Metric | Type | Description |
|---|---|---|
| geetanjali_consultations_total | Gauge | Total completed consultations |
| geetanjali_consultations_24h | Gauge | Consultations in last 24 hours |
| geetanjali_consultation_completion_rate | Gauge | Ratio of completed to total (0-1) |
| geetanjali_active_users_24h | Gauge | Unique users in last 24 hours |
| geetanjali_registered_users_total | Gauge | Total registered users |
| geetanjali_signups_24h | Gauge | New registrations in last 24 hours |
| geetanjali_exports_total | Gauge | Total exports (all formats) |
| geetanjali_exports_24h | Gauge | Exports in last 24 hours |
| geetanjali_verses_served_total | Gauge | Total verses cited across all outputs |
| geetanjali_avg_messages_per_case | Gauge | Average messages per consultation |
| geetanjali_feedback_positive_rate | Gauge | Fraction of positive feedback (0-1) |
Newsletter Metrics (Backend only)
| Metric | Type | Labels | Description |
|---|---|---|---|
| geetanjali_newsletter_subscribers_total | Gauge | - | Active verified subscribers |
| geetanjali_newsletter_subscribers_by_time | Gauge | send_time | Subscribers per time slot |
| geetanjali_newsletter_emails_sent_24h | Gauge | - | Emails delivered in last 24 hours |
Sharing Metrics (Backend only)
| Metric | Type | Labels | Description |
|---|---|---|---|
| geetanjali_shared_cases_total | Gauge | mode | Shared cases by visibility (public/link) |
| geetanjali_case_views_24h | Gauge | - | Views on shared cases in last 24 hours |
SEO Metrics (Backend only)
| Metric | Type | Labels | Description |
|---|---|---|---|
| geetanjali_seo_pages_total | Gauge | page_type | Generated pages by type (verse, chapter, topic) |
| geetanjali_seo_generation_pages_generated | Gauge | - | Pages generated in last run |
| geetanjali_seo_generation_pages_skipped | Gauge | - | Pages skipped (unchanged) |
| geetanjali_seo_generation_pages_errors | Gauge | - | Errors in last run |
| geetanjali_seo_generation_last_duration_seconds | Gauge | - | Last run duration in seconds |
| geetanjali_seo_generation_last_success_timestamp | Gauge | - | Last success Unix timestamp |
Infrastructure Metrics (Backend only)
| Metric | Type | Description |
|---|---|---|
| geetanjali_postgres_up | Gauge | PostgreSQL availability (1/0) |
| geetanjali_postgres_connections_active | Gauge | Active database connections |
| geetanjali_postgres_connections_idle | Gauge | Idle database connections |
| geetanjali_postgres_database_size_bytes | Gauge | Database size in bytes |
| geetanjali_redis_connections | Gauge | Active Redis connections |
| geetanjali_redis_memory_usage_percent | Gauge | Redis memory usage (%) |
| geetanjali_ollama_up | Gauge | Ollama LLM availability (1/0) |
| geetanjali_ollama_models_loaded | Gauge | Models loaded in Ollama |
| geetanjali_chromadb_up | Gauge | ChromaDB availability (1/0) |
| geetanjali_chromadb_collection_count | Gauge | Vectors in collection |
Queue Metrics (Backend only)
| Metric | Type | Description |
|---|---|---|
| geetanjali_queue_depth | Gauge | Jobs waiting in RQ queue |
| geetanjali_worker_count | Gauge | Active RQ workers |
| geetanjali_failed_jobs | Gauge | Failed jobs in RQ registry |
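A hedged sketch of how these values could be read from RQ (the queue name "default" and the Redis URL are assumptions, not taken from the codebase):

```python
# Illustrative only: reading RQ stats that feed the queue gauges above.
import redis
from rq import Queue, Worker

redis_conn = redis.Redis.from_url("redis://redis:6379/0")  # assumed URL
queue = Queue("default", connection=redis_conn)            # assumed queue name

queue_depth = len(queue)                             # -> geetanjali_queue_depth
worker_count = Worker.count(connection=redis_conn)   # -> geetanjali_worker_count
failed_jobs = queue.failed_job_registry.count        # -> geetanjali_failed_jobs
```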
LLM Metrics (Primarily Worker)
| Metric | Type | Labels | Description |
|---|---|---|---|
| geetanjali_llm_requests_total | Counter | provider, status | LLM API requests |
| geetanjali_llm_tokens_total | Counter | provider, token_type | Tokens used (input/output) |
| geetanjali_llm_fallback_total | Counter | primary, fallback, reason | Fallback events |
| geetanjali_llm_circuit_breaker_state | Gauge | provider | Circuit breaker state |
| geetanjali_escalation_reasons_total | Counter | reason, provider | Escalation events by reason |
| geetanjali_confidence_post_repair | Histogram | provider | Confidence distribution after repair |
| geetanjali_repair_success_total | Counter | field, status | Repair attempts by field and outcome |
| geetanjali_consultation_cost_usd_total | Counter | provider | Total LLM cost by provider |
| geetanjali_consultation_tokens_total | Counter | provider | Total tokens consumed by provider |
| geetanjali_daily_limit_exceeded_total | Counter | tracking_type | Daily limit hits (ip/session) |
| geetanjali_request_validation_rejected_total | Counter | reason | Rejected requests (token_too_large/duplicate) |
Escalation Metrics:
- reason labels: missing_critical_field_options, missing_critical_field_recommended_action, missing_critical_field_executive_summary, low_confidence_post_repair
- provider labels: gemini, anthropic, ollama
- field labels: options, recommended_action, executive_summary, reflection_prompts, sources, scholar_flag
- status labels: success, failed
Query Examples:
# Escalation rate (%)
(increase(geetanjali_escalation_reasons_total[5m]) /
increase(geetanjali_consultation_total[5m])) * 100
# Post-escalation confidence (p95)
histogram_quantile(0.95, sum(rate(geetanjali_confidence_post_repair_bucket{provider="anthropic"}[5m])) by(le))
# Repair success rate
(increase(geetanjali_repair_success_total{status="success"}[1h]) /
increase(geetanjali_repair_success_total[1h])) * 100
Note: LLM metrics primarily come from the worker service. In development with RQ disabled, backend may also emit these metrics.
Cache Metrics (Both services)
| Metric | Type | Labels | Description |
|---|---|---|---|
| geetanjali_cache_hits_total | Counter | key_type | Cache hits by type |
| geetanjali_cache_misses_total | Counter | key_type | Cache misses by type |
Key types: verse, search, metadata, case, rag, featured, other
API Metrics (Both services)
| Metric | Type | Labels | Description |
|---|---|---|---|
| geetanjali_api_errors_total | Counter | error_type, endpoint | API errors by type |
Email Metrics (Both services)
| Metric | Type | Labels | Description |
|---|---|---|---|
| geetanjali_email_sends_total | Counter | email_type, result | Send attempts |
| geetanjali_email_send_duration_seconds | Histogram | email_type | Send latency |
| geetanjali_email_circuit_breaker_state | Gauge | - | Circuit breaker state |
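As an illustration, a hedged sketch of how a send attempt might be counted and timed (the wrapped send function is hypothetical; only the metric names and labels come from the table above):

```python
# Illustrative only: recording email send attempts and latency.
from prometheus_client import Counter, Histogram

EMAIL_SENDS = Counter(
    "geetanjali_email_sends_total", "Email send attempts", ["email_type", "result"]
)
EMAIL_SEND_DURATION = Histogram(
    "geetanjali_email_send_duration_seconds", "Email send latency", ["email_type"]
)

def send_tracked_email(send_email, email_type: str, **kwargs) -> None:
    """Wrap a send function (hypothetical) so every attempt is measured."""
    with EMAIL_SEND_DURATION.labels(email_type=email_type).time():
        try:
            send_email(**kwargs)
            EMAIL_SENDS.labels(email_type=email_type, result="success").inc()
        except Exception:
            EMAIL_SENDS.labels(email_type=email_type, result="error").inc()
            raise
```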
Vector Search Metrics (Both services)
| Metric | Type | Labels | Description |
|---|---|---|---|
| geetanjali_vector_search_fallback_total | Counter | reason | Fallbacks to SQL search |
| geetanjali_chromadb_circuit_breaker_state | Gauge | - | Circuit breaker state |
Circuit Breaker Metrics
All circuit breakers expose a state gauge (0=closed, 1=half_open, 2=open):
| Metric | Service | Labels |
|---|---|---|
| geetanjali_llm_circuit_breaker_state | LLM providers | provider (ollama, anthropic) |
| geetanjali_chromadb_circuit_breaker_state | Vector store | - |
| geetanjali_email_circuit_breaker_state | Email service | - |
State transitions are tracked:
| Metric | Type | Labels | Description |
|---|---|---|---|
| geetanjali_circuit_breaker_transitions_total | Counter | service, from_state, to_state | State changes |
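As an illustration, a minimal sketch (not the project's actual breaker implementation) of how a breaker might publish its state gauge and transition counter using the 0/1/2 encoding above:

```python
# Hypothetical sketch: publishing circuit breaker state with prometheus_client.
from prometheus_client import Counter, Gauge

STATE_VALUES = {"closed": 0, "half_open": 1, "open": 2}

LLM_CB_STATE = Gauge(
    "geetanjali_llm_circuit_breaker_state",
    "Circuit breaker state (0=closed, 1=half_open, 2=open)",
    ["provider"],
)
CB_TRANSITIONS = Counter(
    "geetanjali_circuit_breaker_transitions_total",
    "Circuit breaker state changes",
    ["service", "from_state", "to_state"],
)

def record_transition(provider: str, from_state: str, to_state: str) -> None:
    """Update the gauge and count the transition whenever the breaker flips."""
    LLM_CB_STATE.labels(provider=provider).set(STATE_VALUES[to_state])
    CB_TRANSITIONS.labels(
        service="llm", from_state=from_state, to_state=to_state
    ).inc()
```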
Architecture
Observability Stack

 ┌────────────┐
 │  Backend   │───┐
 │   :8000    │   │   ┌────────────┐     ┌────────────┐
 │  /metrics  │   ├──▶│ Prometheus │────▶│  Grafana   │
 └────────────┘   │   │   :9090    │     │   :3000    │
                  │   └────────────┘     └─────┬──────┘
 ┌────────────┐   │                            │
 │   Worker   │───┘                            │ Alerts
 │   :8001    │                                ▼
 │  /metrics  │                          ┌──────────┐
 └────────────┘                          │  Resend  │
                                         │ (email)  │
 Backend /metrics exposes:               └──────────┘
  • Business gauges
  • Infrastructure gauges
  • Event counters

 Worker /metrics exposes:
  • LLM request counters
  • Event counters (cache, email)
  • Circuit breaker states
Metrics Collection
Gauge Collection (Backend)
Business and infrastructure gauges are collected by an APScheduler job running every 60 seconds:
# backend/services/metrics_collector.py
class MetricsCollector:
    def collect_all(self):
        self._collect_business_metrics()
        self._collect_infrastructure_metrics()
        self._collect_queue_metrics()
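For context, a hedged sketch of how that job might be scheduled (the wiring below is an assumption; only MetricsCollector.collect_all comes from the snippet above):

```python
# Illustrative scheduler wiring; the real startup code may differ.
from apscheduler.schedulers.background import BackgroundScheduler

from backend.services.metrics_collector import MetricsCollector  # path assumed from the comment above

collector = MetricsCollector()
scheduler = BackgroundScheduler()

# Refresh business, infrastructure, and queue gauges every 60 seconds
# (configurable via METRICS_COLLECTION_INTERVAL, see Environment Variables).
scheduler.add_job(collector.collect_all, "interval", seconds=60)
scheduler.start()
```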
Counter/Histogram Collection (Both)
Event-based metrics (counters, histograms) are updated in real-time when events occur:
- Cache hits/misses tracked on each operation
- LLM requests tracked per API call
- Email sends tracked per delivery attempt
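For example, a hedged sketch of how a cache lookup might bump the hit/miss counters as the event happens (the helper and Redis wiring are illustrative; the metric names and key_type label come from the Cache Metrics table above):

```python
# Illustrative sketch of real-time counter updates on each cache lookup.
from prometheus_client import Counter

CACHE_HITS = Counter("geetanjali_cache_hits_total", "Cache hits by type", ["key_type"])
CACHE_MISSES = Counter("geetanjali_cache_misses_total", "Cache misses by type", ["key_type"])

def get_cached(redis_conn, key: str, key_type: str = "other"):
    """Fetch a value from Redis, recording a hit or miss at the moment it occurs."""
    value = redis_conn.get(key)
    if value is not None:
        CACHE_HITS.labels(key_type=key_type).inc()
    else:
        CACHE_MISSES.labels(key_type=key_type).inc()
    return value
```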
Query Patterns
Different metric types require different Prometheus queries due to how metrics are collected:
| Type | Pattern | Example | Why |
|---|---|---|---|
| Gauges | Filter by job | geetanjali_postgres_up{job="backend"} | Collected only by backend scheduler |
| Counters | Sum across jobs | sum(geetanjali_cache_hits_total) | Events occur in both services |
| CB states | Max (worst state) | max(geetanjali_llm_circuit_breaker_state{provider="ollama"}) | Shows worst state across services |
Prometheus Configuration
Dual-scrape config in monitoring/prometheus/prometheus.yml:
scrape_configs:
  # Backend: all metrics (business, infra, events)
  - job_name: 'backend'
    static_configs:
      - targets: ['backend:8000']
    metrics_path: /metrics

  # Worker: event counters and LLM metrics only
  - job_name: 'worker'
    static_configs:
      - targets: ['worker:8001']
    metrics_path: /metrics
Grafana Dashboards
Geetanjali Overview
Pre-configured dashboard at monitoring/grafana/dashboards/geetanjali-overview.json:
- Service Health: Database, cache, LLM, vector store status
- Circuit Breakers: LLM (Ollama/Anthropic), ChromaDB, Email states
- Business: Consultations, active users, completion rate, feedback
- Queue: Job depth, worker count, failed jobs
- Newsletter: Subscribers, emails sent
- SEO Pages (collapsed): Total pages, time since generation, errors, pages by type
Importing Dashboards
Dashboards are auto-provisioned through Grafana’s provisioning config. To import one manually:
- Open Grafana (local: http://localhost:3000, prod: https://grafana.geetanjaliapp.com)
- Go to Dashboards → Import
- Upload a JSON file from monitoring/grafana/dashboards/
Alerting
Configuring Alerts
- In Grafana, go to Alerting → Contact Points
- Add Resend integration with your API key
- Create alert rules for critical metrics
Recommended Alerts
| Alert | Condition | Severity |
|---|---|---|
| Database Down | geetanjali_postgres_up{job="backend"} == 0 for 1m | Critical |
| Worker Down | up{job="worker"} == 0 for 2m | Critical |
| LLM Circuit Open | max(geetanjali_llm_circuit_breaker_state) == 2 for 5m | Warning |
| ChromaDB Circuit Open | geetanjali_chromadb_circuit_breaker_state == 2 for 5m | Warning |
| Queue Backlog | geetanjali_queue_depth > 10 for 5m | Warning |
| High Failed Jobs | geetanjali_failed_jobs > 5 | Warning |
| No Recent Activity | geetanjali_consultations_24h == 0 for 24h | Info |
| Newsletter Not Sent | geetanjali_newsletter_emails_sent_24h == 0 for 24h | Warning |
| Escalation Rate Spike | (increase(geetanjali_escalation_reasons_total[5m]) / increase(geetanjali_consultation_total[5m])) * 100 > 5 for 5m | Warning |
| Low Post-Escalation Confidence | histogram_quantile(0.95, sum(rate(geetanjali_confidence_post_repair_bucket{provider="anthropic"}[5m])) by(le)) < 0.85 for 10m | Warning |
| High Daily Limit Hits | increase(geetanjali_daily_limit_exceeded_total[1h]) > 20 | Info |
| High Cost Spike | increase(geetanjali_consultation_cost_usd_total[1h]) > 100 | Warning |
Troubleshooting
Metrics not appearing
- Check backend logs: docker compose logs backend | grep metrics
- Check worker logs: docker compose logs worker | grep metrics
- Verify endpoints:
  curl http://localhost:8000/metrics   # Backend
  curl http://localhost:8001/metrics   # Worker
- Check Prometheus targets: http://localhost:9090/targets
Duplicate metrics in Grafana
If gauges show multiple values, ensure you’re filtering by job:
# Correct - backend-only gauge
geetanjali_postgres_up{job="backend"}
# Incorrect - may get values from both services
geetanjali_postgres_up
Grafana can’t connect to Prometheus
- Verify Prometheus is running: docker compose ps prometheus
- Check that the data source URL is http://prometheus:9090 (internal Docker network)
Stale metrics
Gauges refresh every 60 seconds. If values seem stale:
- Check APScheduler is running in backend logs
- Verify database connectivity
Environment Variables
| Variable | Default | Description |
|---|---|---|
| METRICS_ENABLED | true | Enable /metrics endpoint |
| METRICS_COLLECTION_INTERVAL | 60 | Seconds between gauge collection |
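As a sketch of how METRICS_ENABLED might gate the endpoint (assuming a FastAPI app and prometheus_client; the actual route registration in the codebase may differ):

```python
# Hedged sketch: exposing /metrics only when METRICS_ENABLED is true.
import os

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()

if os.getenv("METRICS_ENABLED", "true").lower() == "true":
    @app.get("/metrics")
    def metrics() -> Response:
        # Serialize the default registry in Prometheus text format.
        return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```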
See Also
- Deployment — Docker Compose files and deployment modes
- Architecture — System design overview
- Setup Guide — Development environment