Prometheus Metrics
Olytix Core exposes Prometheus metrics covering system health, query performance, and day-to-day operations. This guide lists the available metrics and shows how to configure scraping, alerting rules, Grafana dashboards, and custom metrics.
Enabling Metrics
Configuration
Enable metrics via environment variables:
# Enable Prometheus metrics
OLYTIX_METRICS__ENABLED=true
# Metrics endpoint path (default: /metrics)
OLYTIX_METRICS__PATH=/metrics
# Include default process metrics
OLYTIX_METRICS__INCLUDE_PROCESS=true
# Include Python runtime metrics
OLYTIX_METRICS__INCLUDE_RUNTIME=true
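The double underscore in these variable names looks like a nesting delimiter, though this guide does not say how Olytix Core actually parses them. Under that assumption, a minimal sketch of how such variables could be grouped into a nested settings dictionary (`load_prefixed_settings` is a hypothetical helper, not part of Olytix Core):

```python
def load_prefixed_settings(environ, prefix="OLYTIX_", delimiter="__"):
    """Group PREFIX_SECTION__KEY variables into {section: {key: value}}.

    Assumption: "__" separates the settings section from the key.
    """
    settings = {}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue
        section, sep, key = name[len(prefix):].partition(delimiter)
        if not sep:
            continue  # no delimiter, so not a nested setting
        value = raw
        if raw.lower() in ("true", "false"):
            value = raw.lower() == "true"  # treat true/false as booleans
        settings.setdefault(section.lower(), {})[key.lower()] = value
    return settings


# In a real process you would pass os.environ instead of this sample mapping.
example_env = {
    "OLYTIX_METRICS__ENABLED": "true",
    "OLYTIX_METRICS__PATH": "/metrics",
    "OLYTIX_METRICS__INCLUDE_PROCESS": "true",
    "PATH": "/usr/bin",
}
metrics_settings = load_prefixed_settings(example_env)["metrics"]
print(metrics_settings)
```

Unrelated variables such as `PATH` are ignored because they lack the prefix.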
Verify Metrics Endpoint
# Check metrics endpoint
curl http://localhost:8000/metrics
# Sample output
# HELP olytix_core_http_requests_total Total HTTP requests
# TYPE olytix_core_http_requests_total counter
olytix_core_http_requests_total{method="GET",endpoint="/api/v1/query",status="200"} 1523
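To check the endpoint from a script rather than by eye, the text exposition format can be parsed line by line. The following is a rough standard-library sketch, not a full parser (it skips comment lines and ignores timestamps); `parse_samples` is a hypothetical helper, and the sample text is illustrative data written with an underscore-only metric name.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (metric_name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Illustrative text; in practice, read it from the metrics endpoint,
# e.g. with urllib.request.urlopen("http://localhost:8000/metrics").
exposition = """\
# HELP olytix_core_http_requests_total Total HTTP requests
# TYPE olytix_core_http_requests_total counter
olytix_core_http_requests_total{method="GET",endpoint="/api/v1/query",status="200"} 1523
"""
samples = parse_samples(exposition)
print(samples)
```

For production use, a maintained parser such as the one shipped with the `prometheus_client` package is likely a better choice than this regex-based sketch.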
Available Metrics
HTTP Request Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_http_requests_total | Counter | method, endpoint, status | Total HTTP requests |
| olytix_core_http_request_duration_seconds | Histogram | method, endpoint | Request latency distribution |
| olytix_core_http_requests_in_progress | Gauge | method, endpoint | Requests currently being processed |
| olytix_core_http_request_size_bytes | Histogram | method, endpoint | Request payload size |
| olytix_core_http_response_size_bytes | Histogram | method, endpoint | Response payload size |
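Histogram metrics such as the request duration are exposed as cumulative `_bucket` series, and percentiles are estimated from those buckets rather than measured exactly. The sketch below approximates the linear interpolation that PromQL's `histogram_quantile` performs; it is a simplified illustration (it assumes non-negative bucket bounds and ignores several edge cases), and the bucket values are made up.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include the (math.inf, total_count) bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Falls in the +Inf bucket (or an empty one): report the
                # highest finite bound seen so far.
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Made-up cumulative counts for latency buckets, in seconds.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = estimate_quantile(0.95, latency_buckets)
p50 = estimate_quantile(0.50, latency_buckets)
print(p95, p50)
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.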
Query Engine Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_query_total | Counter | cube, type, status | Total queries executed |
| olytix_core_query_duration_seconds | Histogram | cube, type | Query execution time |
| olytix_core_query_rows_returned | Histogram | cube | Number of rows returned |
| olytix_core_query_compile_duration_seconds | Histogram | cube | SQL compilation time |
| olytix_core_query_warehouse_duration_seconds | Histogram | warehouse, cube | Warehouse query time |
Cache Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_cache_hits_total | Counter | cache_type | Cache hit count |
| olytix_core_cache_misses_total | Counter | cache_type | Cache miss count |
| olytix_core_cache_size_bytes | Gauge | cache_type | Current cache size |
| olytix_core_cache_evictions_total | Counter | cache_type | Cache eviction count |
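The hit and miss counters only ever increase, so a hit ratio is computed from how much each counter grew over a window, not from the raw totals. A minimal sketch of that arithmetic with made-up scrape values follows (`hit_ratio_between` is a hypothetical helper; unlike PromQL's `rate()`, it does not handle counter resets):

```python
def hit_ratio_between(before, after):
    """Cache hit ratio over the window between two scrapes.

    Each argument is a dict with cumulative "hits" and "misses" counts.
    Returns None when there were no lookups in the window.
    """
    hits = after["hits"] - before["hits"]
    misses = after["misses"] - before["misses"]
    lookups = hits + misses
    if lookups == 0:
        return None  # avoid dividing by zero on an idle cache
    return hits / lookups


# Made-up counter values from two consecutive scrapes.
first_scrape = {"hits": 1000, "misses": 400}
second_scrape = {"hits": 1800, "misses": 600}
ratio = hit_ratio_between(first_scrape, second_scrape)
print(ratio)
```

Here the cache served 800 hits out of 1000 lookups in the window, even though the lifetime ratio implied by the raw totals is lower.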
Pre-aggregation Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_preagg_builds_total | Counter | cube, preagg, status | Pre-aggregation build count |
| olytix_core_preagg_build_duration_seconds | Histogram | cube, preagg | Build duration |
| olytix_core_preagg_hits_total | Counter | cube, preagg | Pre-aggregation query hits |
| olytix_core_preagg_size_bytes | Gauge | cube, preagg | Pre-aggregation storage size |
| olytix_core_preagg_rows | Gauge | cube, preagg | Row count in pre-aggregation |
Warehouse Connection Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_warehouse_connections_active | Gauge | warehouse | Active connections |
| olytix_core_warehouse_connections_idle | Gauge | warehouse | Idle connections |
| olytix_core_warehouse_connection_errors_total | Counter | warehouse, error_type | Connection errors |
| olytix_core_warehouse_query_errors_total | Counter | warehouse, error_type | Query execution errors |
Worker Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_worker_tasks_total | Counter | task_name, status | Total tasks executed |
| olytix_core_worker_task_duration_seconds | Histogram | task_name | Task execution time |
| olytix_core_worker_queue_size | Gauge | queue_name | Pending tasks in queue |
| olytix_core_worker_active_tasks | Gauge | worker_id | Currently running tasks |
System Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_info | Gauge | version | Olytix Core version info |
| olytix_core_uptime_seconds | Gauge | - | Time since startup |
| process_cpu_seconds_total | Counter | - | CPU time used |
| process_resident_memory_bytes | Gauge | - | Resident memory usage |
| process_open_fds | Gauge | - | Open file descriptors |
Prometheus Configuration
Basic Scrape Config
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'olytix-core'
    static_configs:
      - targets: ['olytix-core-api:8000']
    metrics_path: /metrics
    scheme: http

  - job_name: 'olytix-core-workers'
    static_configs:
      - targets: ['olytix-core-worker-1:8001', 'olytix-core-worker-2:8001']
Kubernetes ServiceMonitor
# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: olytix-core
  namespace: olytix-core
  labels:
    app.kubernetes.io/name: olytix-core
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: olytix-core
      app.kubernetes.io/component: api
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - olytix-core
PodMonitor for Workers
# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: olytix-core-workers
  namespace: olytix-core
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: olytix-core
      app.kubernetes.io/component: worker
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 30s
Alerting Rules
PrometheusRule for Olytix Core
# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: olytix-core-alerts
  namespace: olytix-core
spec:
  groups:
    - name: olytix-core.rules
      rules:
        # High error rate
        - alert: OlytixCoreHighErrorRate
          expr: |
            sum(rate(olytix_core_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(olytix_core_http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

        # High latency
        - alert: OlytixCoreHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(olytix_core_http_request_duration_seconds_bucket[5m])) by (le)
            ) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High request latency"
            description: "95th percentile latency is {{ $value | humanizeDuration }}"

        # Query performance degradation
        - alert: OlytixCoreSlowQueries
          expr: |
            histogram_quantile(0.95,
              sum(rate(olytix_core_query_duration_seconds_bucket[5m])) by (le, cube)
            ) > 10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Slow queries detected for cube {{ $labels.cube }}"
            description: "95th percentile query time is {{ $value | humanizeDuration }}"

        # Low cache hit rate
        - alert: OlytixCoreLowCacheHitRate
          expr: |
            sum(rate(olytix_core_cache_hits_total[5m]))
            / (sum(rate(olytix_core_cache_hits_total[5m])) + sum(rate(olytix_core_cache_misses_total[5m])))
            < 0.5
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Low cache hit rate"
            description: "Cache hit rate is {{ $value | humanizePercentage }}"

        # Warehouse connection issues
        - alert: OlytixCoreWarehouseConnectionErrors
          expr: |
            sum(rate(olytix_core_warehouse_connection_errors_total[5m])) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Warehouse connection errors"
            description: "Connection errors at {{ $value }} per second"

        # Worker queue backlog
        - alert: OlytixCoreWorkerQueueBacklog
          expr: |
            olytix_core_worker_queue_size > 100
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Worker queue backlog growing"
            description: "Queue size is {{ $value }} tasks"

        # Pre-aggregation build failures
        # increase() yields a count over the window, matching the description below
        - alert: OlytixCorePreaggBuildFailures
          expr: |
            sum(increase(olytix_core_preagg_builds_total{status="failed"}[1h])) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pre-aggregation build failures"
            description: "{{ $value }} failed builds in the last hour"

        # Pod not ready
        - alert: OlytixCorePodNotReady
          expr: |
            kube_pod_status_ready{namespace="olytix-core", condition="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Olytix Core pod not ready"
            description: "Pod {{ $labels.pod }} is not ready"
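Each rule's `for:` clause means the expression must stay true for that long before the alert fires; until then the alert is pending. The sketch below models that behavior for the 5% error-rate threshold with `for: 5m`, using made-up evaluation results at 60-second intervals. It is a simplified model of the pending and firing states, not Prometheus's actual rule evaluator.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp, condition_is_true) pairs to alert states."""
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Made-up error ratios, one per 60-second evaluation.
error_ratios = [0.02, 0.08, 0.09, 0.07, 0.10, 0.06, 0.12, 0.01]
evaluations = [(i * 60, ratio > 0.05) for i, ratio in enumerate(error_ratios)]
states = alert_states(evaluations, for_seconds=300)
print(states)
```

The condition first becomes true at t=60, so the alert only fires at t=360, once it has held for a full 300 seconds, and it clears as soon as the ratio drops back below the threshold.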
Grafana Dashboards
Olytix Core Overview Dashboard
{
"dashboard": {
"title": "Olytix Core Overview",
"uid": "olytix-core-overview",
"panels": [
{
"title": "Request Rate",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(olytix_core_http_requests_total[5m])) by (status)",
"legendFormat": "{{status}}"
}
]
},
{
"title": "Request Latency (p95)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(olytix_core_http_request_duration_seconds_bucket[5m])) by (le))",
"legendFormat": "p95"
},
{
"expr": "histogram_quantile(0.50, sum(rate(olytix_core_http_request_duration_seconds_bucket[5m])) by (le))",
"legendFormat": "p50"
}
]
},
{
"title": "Error Rate",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 8},
"targets": [
{
"expr": "sum(rate(olytix_core_http_requests_total{status=~\"5..\"}[5m])) / sum(rate(olytix_core_http_requests_total[5m])) * 100",
"legendFormat": "Error %"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 1, "color": "yellow"},
{"value": 5, "color": "red"}
]
}
}
}
},
{
"title": "Cache Hit Rate",
"type": "gauge",
"gridPos": {"h": 4, "w": 6, "x": 6, "y": 8},
"targets": [
{
"expr": "sum(rate(olytix_core_cache_hits_total[5m])) / (sum(rate(olytix_core_cache_hits_total[5m])) + sum(rate(olytix_core_cache_misses_total[5m]))) * 100"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{"value": 0, "color": "red"},
{"value": 50, "color": "yellow"},
{"value": 80, "color": "green"}
]
}
}
}
},
{
"title": "Active Connections",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 12, "y": 8},
"targets": [
{
"expr": "sum(olytix_core_warehouse_connections_active)"
}
]
},
{
"title": "Queue Size",
"type": "stat",
"gridPos": {"h": 4, "w": 6, "x": 18, "y": 8},
"targets": [
{
"expr": "sum(olytix_core_worker_queue_size)"
}
]
}
]
}
}
Query Performance Dashboard
{
"dashboard": {
"title": "Olytix Core Query Performance",
"uid": "olytix-core-queries",
"panels": [
{
"title": "Query Rate by Cube",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(olytix_core_query_total[5m])) by (cube)",
"legendFormat": "{{cube}}"
}
]
},
{
"title": "Query Latency by Cube (p95)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(olytix_core_query_duration_seconds_bucket[5m])) by (le, cube))",
"legendFormat": "{{cube}}"
}
]
},
{
"title": "Pre-aggregation Hit Rate",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
"targets": [
{
"expr": "sum(rate(olytix_core_preagg_hits_total[5m])) by (cube) / sum(rate(olytix_core_query_total[5m])) by (cube) * 100",
"legendFormat": "{{cube}}"
}
]
},
{
"title": "Warehouse Query Time",
"type": "heatmap",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
"targets": [
{
"expr": "sum(rate(olytix_core_query_warehouse_duration_seconds_bucket[5m])) by (le)",
"format": "heatmap"
}
]
}
]
}
}
Import Dashboards
# Import via Grafana API
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @olytix-core-overview-dashboard.json

# Or use Grafana provisioning
# /etc/grafana/provisioning/dashboards/olytix-core.yaml
apiVersion: 1
providers:
  - name: 'Olytix Core'
    folder: 'Olytix Core'
    type: file
    options:
      path: /var/lib/grafana/dashboards/olytix-core
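The same API import can be scripted. The sketch below builds the request body for Grafana's `POST /api/dashboards/db` endpoint and sends it with the standard library. It assumes token-based authentication instead of the `admin:admin` credentials in the curl example; `build_import_payload` and `import_dashboard` are hypothetical helper names, and the dashboard document here is a trimmed stand-in.

```python
import json
import urllib.request


def build_import_payload(document, overwrite=True):
    """Wrap a dashboard document in the body the import endpoint expects."""
    # Accept either a bare dashboard or one already wrapped in "dashboard".
    dashboard = dict(document.get("dashboard", document))
    dashboard["id"] = None  # let Grafana assign the numeric id; uid stays stable
    return {"dashboard": dashboard, "overwrite": overwrite}


def import_dashboard(base_url, token, document):
    """POST the dashboard to Grafana (not called in this sketch)."""
    request = urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=json.dumps(build_import_payload(document)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Trimmed stand-in for the overview dashboard JSON.
document = {
    "dashboard": {
        "title": "Olytix Core Overview",
        "uid": "olytix-core-overview",
        "panels": [],
    }
}
payload = build_import_payload(document)
print(json.dumps(payload, indent=2))
```

Setting `overwrite` to true lets repeated imports update the dashboard with the same `uid` instead of failing on a conflict.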
Custom Metrics
Adding Application Metrics
# In your Olytix Core project
from prometheus_client import Counter, Histogram

# Define custom metrics (use underscores, not hyphens, in metric names)
CUSTOM_QUERY_COUNTER = Counter(
    'olytix_core_custom_queries_total',
    'Custom query counter',
    ['query_type', 'department']
)

CUSTOM_PROCESSING_TIME = Histogram(
    'olytix_core_custom_processing_seconds',
    'Custom processing time',
    ['operation']
)

# Use in code
CUSTOM_QUERY_COUNTER.labels(query_type='forecast', department='sales').inc()

# Time a block of work; the duration is recorded when the block exits
with CUSTOM_PROCESSING_TIME.labels(operation='transform').time():
    # processing logic
    pass
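Before registering a custom metric, it can help to check its name and labels against Prometheus's traditional naming rules: letters, digits, underscores, and colons for metric names, and no label names starting with a double underscore. The following is a small sketch of such a check; `check_metric` is a hypothetical helper, not part of `prometheus_client` or Olytix Core.

```python
import re

METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def check_metric(name, label_names=()):
    """Return a list of naming problems; an empty list means the name is fine."""
    problems = []
    if not METRIC_NAME_RE.fullmatch(name):
        problems.append(f"invalid metric name: {name!r}")
    for label in label_names:
        if not LABEL_NAME_RE.fullmatch(label):
            problems.append(f"invalid label name: {label!r}")
        elif label.startswith("__"):
            problems.append(f"reserved label name: {label!r}")
    return problems


good = check_metric("olytix_core_custom_queries_total", ["query_type", "department"])
bad = check_metric("custom-queries_total", ["__internal"])
print(good)
print(bad)
```

A hyphen is the most common mistake: in a PromQL expression it is read as a subtraction operator, so a hyphenated name cannot be queried in the usual unquoted form.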
Next Steps
- Configure structured logging for detailed tracing
- Set up circuit breakers for resilience
- Implement retry policies for reliability