Prometheus Metrics

For Platform Teams

Olytix Core exposes comprehensive Prometheus metrics for monitoring system health, query performance, and operational insights. This guide covers available metrics, configuration, and Grafana dashboard examples.

Enabling Metrics

Configuration

Enable metrics via environment variables:

# Enable Prometheus metrics
OLYTIX_METRICS__ENABLED=true

# Metrics endpoint path (default: /metrics)
OLYTIX_METRICS__PATH=/metrics

# Include default process metrics
OLYTIX_METRICS__INCLUDE_PROCESS=true

# Include Python runtime metrics
OLYTIX_METRICS__INCLUDE_RUNTIME=true

Verify Metrics Endpoint

# Check metrics endpoint
curl http://localhost:8000/metrics

# Sample output
# HELP olytix-core_http_requests_total Total HTTP requests
# TYPE olytix-core_http_requests_total counter
olytix-core_http_requests_total{method="GET",endpoint="/api/v1/query",status="200"} 1523

Available Metrics

HTTP Request Metrics

Metric	Type	Labels	Description
`olytix-core_http_requests_total`	Counter	method, endpoint, status	Total HTTP requests
`olytix-core_http_request_duration_seconds`	Histogram	method, endpoint	Request latency distribution
`olytix-core_http_requests_in_progress`	Gauge	method, endpoint	Currently processing requests
`olytix-core_http_request_size_bytes`	Histogram	method, endpoint	Request payload size
`olytix-core_http_response_size_bytes`	Histogram	method, endpoint	Response payload size

Query Engine Metrics

Metric	Type	Labels	Description
`olytix-core_query_total`	Counter	cube, type, status	Total queries executed
`olytix-core_query_duration_seconds`	Histogram	cube, type	Query execution time
`olytix-core_query_rows_returned`	Histogram	cube	Number of rows returned
`olytix-core_query_compile_duration_seconds`	Histogram	cube	SQL compilation time
`olytix-core_query_warehouse_duration_seconds`	Histogram	warehouse, cube	Warehouse query time

Cache Metrics

Metric	Type	Labels	Description
`olytix-core_cache_hits_total`	Counter	cache_type	Cache hit count
`olytix-core_cache_misses_total`	Counter	cache_type	Cache miss count
`olytix-core_cache_size_bytes`	Gauge	cache_type	Current cache size
`olytix-core_cache_evictions_total`	Counter	cache_type	Cache eviction count

Pre-aggregation Metrics

Metric	Type	Labels	Description
`olytix-core_preagg_builds_total`	Counter	cube, preagg, status	Pre-aggregation build count
`olytix-core_preagg_build_duration_seconds`	Histogram	cube, preagg	Build duration
`olytix-core_preagg_hits_total`	Counter	cube, preagg	Pre-aggregation query hits
`olytix-core_preagg_size_bytes`	Gauge	cube, preagg	Pre-aggregation storage size
`olytix-core_preagg_rows`	Gauge	cube, preagg	Row count in pre-aggregation

Warehouse Connection Metrics

Metric	Type	Labels	Description
`olytix-core_warehouse_connections_active`	Gauge	warehouse	Active connections
`olytix-core_warehouse_connections_idle`	Gauge	warehouse	Idle connections
`olytix-core_warehouse_connection_errors_total`	Counter	warehouse, error_type	Connection errors
`olytix-core_warehouse_query_errors_total`	Counter	warehouse, error_type	Query execution errors

Worker Metrics

Metric	Type	Labels	Description
`olytix-core_worker_tasks_total`	Counter	task_name, status	Total tasks executed
`olytix-core_worker_task_duration_seconds`	Histogram	task_name	Task execution time
`olytix-core_worker_queue_size`	Gauge	queue_name	Pending tasks in queue
`olytix-core_worker_active_tasks`	Gauge	worker_id	Currently running tasks

System Metrics

Metric	Type	Labels	Description
`olytix-core_info`	Gauge	version	Olytix Core version info
`olytix-core_uptime_seconds`	Gauge	-	Time since startup
`process_cpu_seconds_total`	Counter	-	CPU time used
`process_resident_memory_bytes`	Gauge	-	Memory usage
`process_open_fds`	Gauge	-	Open file descriptors

Prometheus Configuration

Basic Scrape Config

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'olytix-core'
    static_configs:
      - targets: ['olytix-core-api:8000']
    metrics_path: /metrics
    scheme: http

  - job_name: 'olytix-core-workers'
    static_configs:
      - targets: ['olytix-core-worker-1:8001', 'olytix-core-worker-2:8001']

Kubernetes ServiceMonitor

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: olytix-core
  namespace: olytix-core
  labels:
    app.kubernetes.io/name: olytix-core
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: olytix-core
      app.kubernetes.io/component: api
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
      - olytix-core

PodMonitor for Workers

# podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: olytix-core-workers
  namespace: olytix-core
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: olytix-core
      app.kubernetes.io/component: worker
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
      interval: 30s

Alerting Rules

PrometheusRule for Olytix Core

# prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: olytix-core-alerts
  namespace: olytix-core
spec:
  groups:
    - name: olytix-core.rules
      rules:
        # High error rate
        - alert: Olytix CoreHighErrorRate
          expr: |
            sum(rate(olytix-core_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(olytix-core_http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"

        # High latency
        - alert: Olytix CoreHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(olytix-core_http_request_duration_seconds_bucket[5m])) by (le)
            ) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High request latency"
            description: "95th percentile latency is {{ $value | humanizeDuration }}"

        # Query performance degradation
        - alert: Olytix CoreSlowQueries
          expr: |
            histogram_quantile(0.95,
              sum(rate(olytix-core_query_duration_seconds_bucket[5m])) by (le, cube)
            ) > 10
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Slow queries detected for cube {{ $labels.cube }}"
            description: "95th percentile query time is {{ $value | humanizeDuration }}"

        # Low cache hit rate
        - alert: Olytix CoreLowCacheHitRate
          expr: |
            sum(rate(olytix-core_cache_hits_total[5m]))
            / (sum(rate(olytix-core_cache_hits_total[5m])) + sum(rate(olytix-core_cache_misses_total[5m])))
            < 0.5
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Low cache hit rate"
            description: "Cache hit rate is {{ $value | humanizePercentage }}"

        # Warehouse connection issues
        - alert: Olytix CoreWarehouseConnectionErrors
          expr: |
            sum(rate(olytix-core_warehouse_connection_errors_total[5m])) > 0.1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Warehouse connection errors"
            description: "Connection errors at {{ $value }} per second"

        # Worker queue backlog
        - alert: Olytix CoreWorkerQueueBacklog
          expr: |
            olytix-core_worker_queue_size > 100
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Worker queue backlog growing"
            description: "Queue size is {{ $value }} tasks"

        # Pre-aggregation build failures
        - alert: Olytix CorePreaggBuildFailures
          expr: |
            sum(rate(olytix-core_preagg_builds_total{status="failed"}[1h])) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pre-aggregation build failures"
            description: "{{ $value }} failed builds in the last hour"

        # Pod not ready
        - alert: Olytix CorePodNotReady
          expr: |
            kube_pod_status_ready{namespace="olytix-core", condition="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Olytix Core pod not ready"
            description: "Pod {{ $labels.pod }} is not ready"

Grafana Dashboards

Olytix Core Overview Dashboard

{
  "dashboard": {
    "title": "Olytix Core Overview",
    "uid": "olytix-core-overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(olytix-core_http_requests_total[5m])) by (status)",
            "legendFormat": "{{status}}"
          }
        ]
      },
      {
        "title": "Request Latency (p95)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(olytix-core_http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.50, sum(rate(olytix-core_http_request_duration_seconds_bucket[5m])) by (le))",
            "legendFormat": "p50"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "sum(rate(olytix-core_http_requests_total{status=~\"5..\"}[5m])) / sum(rate(olytix-core_http_requests_total[5m])) * 100",
            "legendFormat": "Error %"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"value": 0, "color": "green"},
                {"value": 1, "color": "yellow"},
                {"value": 5, "color": "red"}
              ]
            }
          }
        }
      },
      {
        "title": "Cache Hit Rate",
        "type": "gauge",
        "gridPos": {"h": 4, "w": 6, "x": 6, "y": 8},
        "targets": [
          {
            "expr": "sum(rate(olytix-core_cache_hits_total[5m])) / (sum(rate(olytix-core_cache_hits_total[5m])) + sum(rate(olytix-core_cache_misses_total[5m]))) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"value": 0, "color": "red"},
                {"value": 50, "color": "yellow"},
                {"value": 80, "color": "green"}
              ]
            }
          }
        }
      },
      {
        "title": "Active Connections",
        "type": "stat",
        "gridPos": {"h": 4, "w": 6, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "sum(olytix-core_warehouse_connections_active)"
          }
        ]
      },
      {
        "title": "Queue Size",
        "type": "stat",
        "gridPos": {"h": 4, "w": 6, "x": 18, "y": 8},
        "targets": [
          {
            "expr": "sum(olytix-core_worker_queue_size)"
          }
        ]
      }
    ]
  }
}

Query Performance Dashboard

{
  "dashboard": {
    "title": "Olytix Core Query Performance",
    "uid": "olytix-core-queries",
    "panels": [
      {
        "title": "Query Rate by Cube",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum(rate(olytix-core_query_total[5m])) by (cube)",
            "legendFormat": "{{cube}}"
          }
        ]
      },
      {
        "title": "Query Latency by Cube (p95)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum(rate(olytix-core_query_duration_seconds_bucket[5m])) by (le, cube))",
            "legendFormat": "{{cube}}"
          }
        ]
      },
      {
        "title": "Pre-aggregation Hit Rate",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "sum(rate(olytix-core_preagg_hits_total[5m])) by (cube) / sum(rate(olytix-core_query_total[5m])) by (cube) * 100",
            "legendFormat": "{{cube}}"
          }
        ]
      },
      {
        "title": "Warehouse Query Time",
        "type": "heatmap",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "sum(rate(olytix-core_query_warehouse_duration_seconds_bucket[5m])) by (le)",
            "format": "heatmap"
          }
        ]
      }
    ]
  }
}

Import Dashboards

# Import via Grafana API
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -d @olytix-core-overview-dashboard.json

# Or use Grafana provisioning
# /etc/grafana/provisioning/dashboards/olytix-core.yaml
apiVersion: 1
providers:
  - name: 'Olytix Core'
    folder: 'Olytix Core'
    type: file
    options:
      path: /var/lib/grafana/dashboards/olytix-core

Custom Metrics

Adding Application Metrics

# In your Olytix Core project
from prometheus_client import Counter, Histogram

# Define custom metrics
CUSTOM_QUERY_COUNTER = Counter(
    'olytix-core_custom_queries_total',
    'Custom query counter',
    ['query_type', 'department']
)

CUSTOM_PROCESSING_TIME = Histogram(
    'olytix-core_custom_processing_seconds',
    'Custom processing time',
    ['operation']
)

# Use in code
CUSTOM_QUERY_COUNTER.labels(query_type='forecast', department='sales').inc()

with CUSTOM_PROCESSING_TIME.labels(operation='transform').time():
    # processing logic
    pass

Next Steps

Configure structured logging for detailed tracing
Set up circuit breakers for resilience
Implement retry policies for reliability

Enabling Metrics​

Configuration​

Verify Metrics Endpoint​

Available Metrics​

HTTP Request Metrics​

Query Engine Metrics​

Cache Metrics​

Pre-aggregation Metrics​

Warehouse Connection Metrics​

Worker Metrics​

System Metrics​

Prometheus Configuration​

Basic Scrape Config​

Kubernetes ServiceMonitor​

PodMonitor for Workers​

Alerting Rules​

PrometheusRule for Olytix Core​

Grafana Dashboards​

Olytix Core Overview Dashboard​

Query Performance Dashboard​

Import Dashboards​

Custom Metrics​

Adding Application Metrics​

Next Steps​