Prometheus Metrics

For Platform Teams

Olytix Core exposes Prometheus metrics for monitoring system health, query performance, caching, and background workers. This guide covers enabling the metrics endpoint, the available metrics, Prometheus scrape configuration, alerting rules, Grafana dashboard examples, and custom metrics.

Enabling Metrics

Configuration

Enable metrics via environment variables:

# Enable Prometheus metrics
OLYTIX_METRICS__ENABLED=true

# Metrics endpoint path (default: /metrics)
OLYTIX_METRICS__PATH=/metrics

# Include default process metrics
OLYTIX_METRICS__INCLUDE_PROCESS=true

# Include Python runtime metrics
OLYTIX_METRICS__INCLUDE_RUNTIME=true
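
A deployment script can confirm what the process will see before it starts. The following is a minimal sketch, not Olytix Core's own settings loader: the `MetricsSettings` class, the accepted boolean spellings, and the `False` defaults for the three flags are assumptions made for illustration. Only the `/metrics` default path comes from the configuration above.

```python
import os
from dataclasses import dataclass

# Spellings treated as "true" (an assumption; the real parser may differ).
_TRUTHY = {"1", "true", "yes", "on"}


def _env_bool(env, name, default):
    """Read a boolean flag from a mapping of environment variables."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


@dataclass(frozen=True)
class MetricsSettings:
    """Hypothetical container for the OLYTIX_METRICS__* variables."""
    enabled: bool
    path: str
    include_process: bool
    include_runtime: bool


def load_metrics_settings(env=None):
    """Build MetricsSettings from os.environ, or from a mapping passed in."""
    env = os.environ if env is None else env
    return MetricsSettings(
        enabled=_env_bool(env, "OLYTIX_METRICS__ENABLED", False),
        path=env.get("OLYTIX_METRICS__PATH", "/metrics"),
        include_process=_env_bool(env, "OLYTIX_METRICS__INCLUDE_PROCESS", False),
        include_runtime=_env_bool(env, "OLYTIX_METRICS__INCLUDE_RUNTIME", False),
    )
```

Calling `load_metrics_settings()` with no argument reads the current environment; passing a plain dict makes the check easy to unit test.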

Verify Metrics Endpoint

# Check metrics endpoint
curl http://localhost:8000/metrics

# Sample output
# HELP olytix_core_http_requests_total Total HTTP requests
# TYPE olytix_core_http_requests_total counter
olytix_core_http_requests_total{method="GET",endpoint="/api/v1/query",status="200"} 1523
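
The same check can be scripted, for example as a smoke test after deployment. This is a rough sketch using only the standard library. The helper names `parse_samples`, `fetch_metrics`, and `has_metric` are hypothetical, and the parser handles only plain sample lines of the text exposition format (comments and timestamps are skipped).

```python
import re
import urllib.request

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces; values may contain escaped characters.
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples from Prometheus exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # skip anything that is not a well-formed sample
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url="http://localhost:8000/metrics", timeout=5.0):
    """Download the metrics page as text (URL taken from the curl example)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def has_metric(text, name):
    """Report whether at least one sample with this metric name is present."""
    return any(sample_name == name for sample_name, _, _ in parse_samples(text))
```

A smoke test would call `has_metric(fetch_metrics(), name)` with the HTTP request counter from the sample output and fail the rollout if it returns `False`.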

Available Metrics

HTTP Request Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_http_requests_total | Counter | method, endpoint, status | Total HTTP requests |
| olytix_core_http_request_duration_seconds | Histogram | method, endpoint | Request latency distribution |
| olytix_core_http_requests_in_progress | Gauge | method, endpoint | Requests currently being processed |
| olytix_core_http_request_size_bytes | Histogram | method, endpoint | Request payload size |
| olytix_core_http_response_size_bytes | Histogram | method, endpoint | Response payload size |

Query Engine Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_query_total | Counter | cube, type, status | Total queries executed |
| olytix_core_query_duration_seconds | Histogram | cube, type | Query execution time |
| olytix_core_query_rows_returned | Histogram | cube | Number of rows returned |
| olytix_core_query_compile_duration_seconds | Histogram | cube | SQL compilation time |
| olytix_core_query_warehouse_duration_seconds | Histogram | warehouse, cube | Warehouse query time |

Cache Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_cache_hits_total | Counter | cache_type | Cache hit count |
| olytix_core_cache_misses_total | Counter | cache_type | Cache miss count |
| olytix_core_cache_size_bytes | Gauge | cache_type | Current cache size |
| olytix_core_cache_evictions_total | Counter | cache_type | Cache eviction count |
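
The hit and miss counters only become a hit rate once they are compared over a time window. As an illustration of that arithmetic, the sketch below takes two scrapes of each counter and computes the ratio of hits to total lookups. The function names are hypothetical, and the reset handling is a simplification of what PromQL's `rate()` does.

```python
def counter_increase(earlier, later):
    """Increase of a counter between two scrapes.

    A counter only goes up, so a lower later value means the process
    restarted and the counter began again from zero.
    """
    return later - earlier if later >= earlier else later


def cache_hit_ratio(hits_before, hits_after, misses_before, misses_after):
    """Hits divided by total lookups in the window, or None with no traffic."""
    hits = counter_increase(hits_before, hits_after)
    misses = counter_increase(misses_before, misses_after)
    total = hits + misses
    if total == 0:
        return None  # avoid dividing by zero when the cache was idle
    return hits / total
```

For example, 80 new hits and 20 new misses between two scrapes give a ratio of 0.8, which is the value a hit-rate alert or gauge panel would compare against its threshold.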

Pre-aggregation Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_preagg_builds_total | Counter | cube, preagg, status | Pre-aggregation build count |
| olytix_core_preagg_build_duration_seconds | Histogram | cube, preagg | Build duration |
| olytix_core_preagg_hits_total | Counter | cube, preagg | Queries served from a pre-aggregation |
| olytix_core_preagg_size_bytes | Gauge | cube, preagg | Pre-aggregation storage size |
| olytix_core_preagg_rows | Gauge | cube, preagg | Row count in pre-aggregation |

Warehouse Connection Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_warehouse_connections_active | Gauge | warehouse | Active connections |
| olytix_core_warehouse_connections_idle | Gauge | warehouse | Idle connections |
| olytix_core_warehouse_connection_errors_total | Counter | warehouse, error_type | Connection errors |
| olytix_core_warehouse_query_errors_total | Counter | warehouse, error_type | Query execution errors |

Worker Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_worker_tasks_total | Counter | task_name, status | Total tasks executed |
| olytix_core_worker_task_duration_seconds | Histogram | task_name | Task execution time |
| olytix_core_worker_queue_size | Gauge | queue_name | Pending tasks in queue |
| olytix_core_worker_active_tasks | Gauge | worker_id | Currently running tasks |

System Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| olytix_core_info | Gauge | version | Olytix Core version info |
| olytix_core_uptime_seconds | Gauge | - | Time since startup |
| process_cpu_seconds_total | Counter | - | CPU time used |
| process_resident_memory_bytes | Gauge | - | Resident memory usage |
| process_open_fds | Gauge | - | Open file descriptors |

Prometheus Configuration

Basic Scrape Config

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'olytix-core'
    static_configs:
      - targets: ['olytix-core-api:8000']
    metrics_path: /metrics
    scheme: http

  - job_name: 'olytix-core-workers'
    static_configs:
      - targets: ['olytix-core-worker-1:8001', 'olytix-core-worker-2:8001']

Kubernetes ServiceMonitor

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: olytix-core
  namespace: olytix-core
  labels:
    app.kubernetes.io/name: olytix-core
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: olytix-core
      app.kubernetes.io/component: api
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - olytix-core

PodMonitor for Workers

# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: olytix-core-workers
  namespace: olytix-core
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: olytix-core
      app.kubernetes.io/component: worker
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 30s

Alerting Rules

PrometheusRule for Olytix Core

# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: olytix-core-alerts
  namespace: olytix-core
spec:
  groups:
    - name: olytix-core.rules
      rules:
        # High error rate: more than 5% of requests return a 5xx status
        - alert: OlytixCoreHighErrorRate
          expr: |
            sum(rate(olytix_core_http_requests_total{status=~"5.."}[5m]))
              / sum(rate(olytix_core_http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

        # High latency: p95 request latency above 2 seconds
        - alert: OlytixCoreHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(olytix_core_http_request_duration_seconds_bucket[5m])) by (le)
            ) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High request latency"
            description: "95th percentile latency is {{ $value | humanizeDuration }}"

        # Query performance degradation: p95 query time above 10 seconds per cube
        - alert: OlytixCoreSlowQueries
          expr: |
            histogram_quantile(0.95,
              sum(rate(olytix_core_query_duration_seconds_bucket[5m])) by (le, cube)
            ) > 10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Slow queries detected for cube {{ $labels.cube }}"
            description: "95th percentile query time is {{ $value | humanizeDuration }}"

        # Low cache hit rate: fewer than half of lookups are served from cache
        - alert: OlytixCoreLowCacheHitRate
          expr: |
            sum(rate(olytix_core_cache_hits_total[5m]))
              / (sum(rate(olytix_core_cache_hits_total[5m])) + sum(rate(olytix_core_cache_misses_total[5m])))
              < 0.5
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Low cache hit rate"
            description: "Cache hit rate is {{ $value | humanizePercentage }}"

        # Warehouse connection issues
        - alert: OlytixCoreWarehouseConnectionErrors
          expr: |
            sum(rate(olytix_core_warehouse_connection_errors_total[5m])) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Warehouse connection errors"
            description: "Connection errors at {{ $value }} per second"

        # Worker queue backlog
        - alert: OlytixCoreWorkerQueueBacklog
          expr: |
            olytix_core_worker_queue_size > 100
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Worker queue backlog growing"
            description: "Queue size is {{ $value }} tasks"

        # Pre-aggregation build failures
        # increase() counts failures over the window; rate() would give a per-second value
        - alert: OlytixCorePreaggBuildFailures
          expr: |
            sum(increase(olytix_core_preagg_builds_total{status="failed"}[1h])) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pre-aggregation build failures"
            description: "{{ $value }} failed builds in the last hour"

        # Pod not ready (requires kube-state-metrics)
        - alert: OlytixCorePodNotReady
          expr: |
            kube_pod_status_ready{namespace="olytix-core", condition="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Olytix Core pod not ready"
            description: "Pod {{ $labels.pod }} is not ready"
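
The latency and slow-query alerts above compare `histogram_quantile(0.95, ...)` against a threshold. The value is an estimate interpolated from bucket counts, not an exact percentile, so its accuracy depends on the bucket boundaries. The sketch below reproduces the linear interpolation PromQL applies to classic histograms, assuming positive bucket bounds. It is a simplified illustration with made-up bucket values, not the Prometheus implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified: assumes positive upper bounds and cumulative counts, and
    leaves out the NaN and negative-bound edge cases PromQL handles.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for index, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if upper == math.inf:
        # The quantile lies beyond the last finite bound, so report that bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    if count == below:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Illustrative bucket counts: 100 requests, 98 of them faster than 2 seconds.
example = [(0.5, 60), (1.0, 90), (2.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, example)
```

With these buckets the 95th request falls in the 1.0 to 2.0 second bucket, so the estimate lands between those bounds. Any quantile that falls in the `+Inf` bucket is reported as the highest finite bound, which is why that bound should sit above the alert threshold.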

Grafana Dashboards

Olytix Core Overview Dashboard

{
  "dashboard": {
    "title": "Olytix Core Overview",
    "uid": "olytix-core-overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(olytix_core_http_requests_total[5m])) by (status)",
            "legendFormat": "{{status}}"
          }
        ]
      },
      {
        "title": "Request Latency (p50, p95)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(olytix_core_http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.50, sum(rate(olytix_core_http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "p50"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "sum(rate(olytix_core_http_requests_total{status=~\"5..\"}[5m])) / sum(rate(olytix_core_http_requests_total[5m])) * 100",
            "legendFormat": "Error %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "Cache Hit Rate",
        "type": "gauge",
        "gridPos": {"h": 4, "w": 6, "x": 6, "y": 8},
        "targets": [
          {
            "expr": "sum(rate(olytix_core_cache_hits_total[5m])) / (sum(rate(olytix_core_cache_hits_total[5m])) + sum(rate(olytix_core_cache_misses_total[5m]))) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"value": 0, "color": "red"},
                {"value": 50, "color": "yellow"},
                {"value": 80, "color": "green"}
              ]
            }
          }
        }
      },
      {
        "title": "Active Connections",
        "type": "stat",
        "gridPos": {"h": 4, "w": 6, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "sum(olytix_core_warehouse_connections_active)"
          }
        ]
      },
      {
        "title": "Queue Size",
        "type": "stat",
        "gridPos": {"h": 4, "w": 6, "x": 18, "y": 8},
        "targets": [
          {
            "expr": "sum(olytix_core_worker_queue_size)"
          }
        ]
      }
    ]
  }
}

Query Performance Dashboard

{
  "dashboard": {
    "title": "Olytix Core Query Performance",
    "uid": "olytix-core-queries",
    "panels": [
      {
        "title": "Query Rate by Cube",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(olytix_core_query_total[5m])) by (cube)",
            "legendFormat": "{{cube}}"
          }
        ]
      },
      {
        "title": "Query Latency by Cube (p95)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(olytix_core_query_duration_seconds_bucket[5m])) by (le, cube))",
            "legendFormat": "{{cube}}"
          }
        ]
      },
      {
        "title": "Pre-aggregation Hit Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "sum(rate(olytix_core_preagg_hits_total[5m])) by (cube) / sum(rate(olytix_core_query_total[5m])) by (cube) * 100",
            "legendFormat": "{{cube}}"
          }
        ]
      },
      {
        "title": "Warehouse Query Time",
        "type": "heatmap",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "sum(rate(olytix_core_query_warehouse_duration_seconds_bucket[5m])) by (le)",
            "format": "heatmap"
          }
        ]
      }
    ]
  }
}

Import Dashboards

# Import via Grafana API
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @olytix-core-overview-dashboard.json

# Or use Grafana provisioning
# /etc/grafana/provisioning/dashboards/olytix-core.yaml
apiVersion: 1
providers:
  - name: 'Olytix Core'
    folder: 'Olytix Core'
    type: file
    options:
      path: /var/lib/grafana/dashboards/olytix-core
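
When the import runs from a deployment script rather than by hand, the request body can be built in code. The sketch below wraps a dashboard definition for `POST /api/dashboards/db` and sets `overwrite` so that re-running the import updates the dashboard with the same `uid`. The function names are hypothetical, and authenticating with a bearer token (instead of the `admin:admin` basic auth in the curl example) is an assumption about how the Grafana instance is set up.

```python
import json
import urllib.request


def build_import_payload(dashboard_json, overwrite=True):
    """Wrap a dashboard definition in the body used by /api/dashboards/db."""
    document = json.loads(dashboard_json)
    # Accept a bare dashboard or one already wrapped in {"dashboard": ...}.
    dashboard = document.get("dashboard", document)
    # A null id lets Grafana create the dashboard or match it by uid.
    dashboard["id"] = None
    return {"dashboard": dashboard, "overwrite": overwrite}


def import_dashboard(payload, base_url="http://localhost:3000", token=""):
    """Send the payload to Grafana and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # assumed token-based auth
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

A script would read each dashboard file, pass its contents to `build_import_payload`, and hand the result to `import_dashboard` along with a token from its secret store.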

Custom Metrics

Adding Application Metrics

# In your Olytix Core project
from prometheus_client import Counter, Histogram

# Define custom metrics (names use underscores; hyphens are not valid in metric names)
CUSTOM_QUERY_COUNTER = Counter(
    'olytix_core_custom_queries_total',
    'Custom query counter',
    ['query_type', 'department']
)

CUSTOM_PROCESSING_TIME = Histogram(
    'olytix_core_custom_processing_seconds',
    'Custom processing time',
    ['operation']
)

# Use in code: increment the counter for one labelled query
CUSTOM_QUERY_COUNTER.labels(query_type='forecast', department='sales').inc()

# Time a block of work; the elapsed seconds are recorded on exit
with CUSTOM_PROCESSING_TIME.labels(operation='transform').time():
    # processing logic
    pass
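
Under the classic Prometheus naming rules, metric names match `[a-zA-Z_:][a-zA-Z0-9_:]*` and label names match `[a-zA-Z_][a-zA-Z0-9_]*`, with label names starting with `__` reserved for internal use. A hyphen is outside that set, and an unquoted hyphen in PromQL is read as subtraction. The sketch below checks names before they are registered. The helper functions are hypothetical and are not part of `prometheus_client`.

```python
import re

# Classic Prometheus data model patterns for metric and label names.
_METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
_LABEL_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def is_valid_metric_name(name):
    """True when the whole name matches the classic metric name pattern."""
    return _METRIC_NAME.fullmatch(name) is not None


def is_valid_label_name(name):
    """True for well-formed label names that avoid the reserved __ prefix."""
    return _LABEL_NAME.fullmatch(name) is not None and not name.startswith("__")


def sanitize_metric_name(name):
    """Replace disallowed characters with underscores."""
    cleaned = re.sub(r"[^a-zA-Z0-9_:]", "_", name)
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned  # a name may not start with a digit
    return cleaned
```

This is useful when a metric prefix is derived from a product or service name such as `olytix-core`: `sanitize_metric_name` turns the hyphen into an underscore, and the two `is_valid_*` checks can run in a unit test over every custom metric a project defines.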

Next Steps