Pre-Aggregations
Pre-aggregations are materialized rollup tables that dramatically accelerate query performance and reduce data warehouse costs by pre-computing common aggregations.
Overview
┌─────────────────────────────────────────────────────────────────────┐
│                       PRE-AGGREGATION SYSTEM                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌──────────────┐                    ┌──────────────┐              │
│   │    QUERY     │                    │    PREAGG    │              │
│   │              │                    │   MATCHER    │              │
│   │  measures    │───────────────────▶│              │              │
│   │  dimensions  │                    │  Find match  │              │
│   │  filters     │                    │  or fallback │              │
│   └──────────────┘                    └──────┬───────┘              │
│                                              │                      │
│                    ┌─────────────────────────┼─────────────────┐    │
│                    │                         │                 │    │
│                    ▼                         ▼                 ▼    │
│   ┌──────────────────────┐  ┌──────────────────────┐  ┌─────────┐   │
│   │    PRE-AGG TABLE     │  │    PRE-AGG TABLE     │  │  BASE   │   │
│   │   daily_by_region    │  │   monthly_summary    │  │  TABLE  │   │
│   │                      │  │                      │  │         │   │
│   │   50 MB              │  │   5 MB               │  │  50 GB  │   │
│   │   ~2 seconds         │  │   ~0.5 seconds       │  │  ~60s   │   │
│   └──────────────────────┘  └──────────────────────┘  └─────────┘   │
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                       BUILD & REFRESH                       │   │
│   │                                                             │   │
│   │  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐    │   │
│   │  │ Scheduled │ │ On-Demand │ │Incremental│ │   Full    │    │   │
│   │  │  (Cron)   │ │   (API)   │ │  Refresh  │ │  Rebuild  │    │   │
│   │  └───────────┘ └───────────┘ └───────────┘ └───────────┘    │   │
│   │                                                             │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Key Features
Automatic Query Matching
When you run a query, Olytix Core automatically checks if a pre-aggregation can answer it:
| Match Type | Description | Performance |
|---|---|---|
| Exact | Pre-agg has exactly the needed measures/dimensions | Fastest |
| Superset | Pre-agg has more than needed, filters applied | Very fast |
| Rollup | Pre-agg can be aggregated further (day → month) | Fast |
| None | Query must hit base table | Slowest |
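The precedence in the table above can be sketched as a small selection function. This is a minimal illustration, not the Olytix Core matcher itself; `PreAgg`, `classify`, and `pick_match` are hypothetical names:

```python
from dataclasses import dataclass

# Ranking of match types, best first (mirrors the table above).
MATCH_RANK = {"exact": 0, "superset": 1, "rollup": 2, "none": 3}

# Granularities ordered fine -> coarse, used for rollup checks.
GRAINS = ["minute", "hour", "day", "week", "month", "quarter", "year"]

@dataclass
class PreAgg:
    name: str
    measures: set
    dimensions: set
    granularity: str

def classify(preagg, measures, dimensions, granularity):
    """Classify how a pre-aggregation can serve a query, if at all."""
    if not (measures <= preagg.measures and dimensions <= preagg.dimensions):
        return "none"  # missing a measure or dimension: fall back to base table
    if preagg.granularity == granularity:
        full = measures == preagg.measures and dimensions == preagg.dimensions
        return "exact" if full else "superset"
    if GRAINS.index(preagg.granularity) < GRAINS.index(granularity):
        return "rollup"  # finer grain can be aggregated further (day -> month)
    return "none"  # pre-agg is coarser than the query requires

def pick_match(preaggs, measures, dimensions, granularity):
    """Return (name, match_type) of the best-ranked candidate, or None."""
    ranked = sorted(
        (MATCH_RANK[classify(p, measures, dimensions, granularity)], p.name)
        for p in preaggs
    )
    rank, name = ranked[0]
    if rank == MATCH_RANK["none"]:
        return None
    return name, {v: k for k, v in MATCH_RANK.items()}[rank]
```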
Significant Cost Reduction
Pre-aggregations typically reduce query costs by 90-99%:
Without Pre-Aggregation:
├── Scan: 50 GB of raw data
├── Time: ~60 seconds
└── Cost: $0.25 per query
With Pre-Aggregation:
├── Scan: 50 MB (0.1% of original)
├── Time: ~2 seconds (97% faster)
└── Cost: $0.0025 per query (99% savings)
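The savings follow directly from the scan ratio: in scan-priced warehouses, cost scales with bytes scanned. A quick check, assuming a flat $5 per TB rate (an illustrative figure, not an Olytix price):

```python
# Assumed flat warehouse rate of $5 per TB scanned (not an Olytix figure).
PRICE_PER_TB_USD = 5.00

def query_cost(scanned_gb: float) -> float:
    """Cost of a query that scans `scanned_gb` gigabytes."""
    return scanned_gb / 1024 * PRICE_PER_TB_USD

# 50 GB base-table scan vs. a 50 MB pre-aggregation scan:
base_cost = query_cost(50)
preagg_cost = query_cost(0.05)
savings = 1 - preagg_cost / base_cost  # scan ratio drives cost ratio
```

Scanning 0.1% of the data cuts the per-query cost by the same factor, whatever the warehouse's actual rate.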
Flexible Refresh Options
- Scheduled - Cron-based automatic refresh
- On-demand - Trigger refresh via API
- Incremental - Update only new/changed data
- Full rebuild - Complete reconstruction
Build Status Monitoring
Track the health of your pre-aggregations:
- Build success/failure status
- Last refresh time
- Row counts and size
- Query match statistics
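The simplest health signal is staleness: a pre-agg whose last build is older than its refresh schedule allows. A sketch of that check (field names assumed, using the `last_built` timestamp shown later in `PreAggregationInfo`):

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_built: datetime, max_age_hours: float) -> bool:
    """Flag a pre-aggregation whose last successful build is too old."""
    return datetime.now(timezone.utc) - last_built > timedelta(hours=max_age_hours)
```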
Usage
Defining Pre-Aggregations
Pre-aggregations are defined in your cube YAML files:
# cubes/orders.yml
cubes:
- name: Orders
sql: SELECT * FROM orders
measures:
- name: revenue
type: sum
sql: amount
- name: count
type: count
- name: avg_order_value
type: avg
sql: amount
dimensions:
- name: region
type: string
sql: region
- name: product_category
type: string
sql: product_category
- name: created_at
type: time
sql: created_at
pre_aggregations:
# Daily rollup by region
- name: daily_by_region
measures:
- revenue
- count
dimensions:
- region
time_dimension: created_at
granularity: day
refresh:
every: "1 hour"
partition_granularity: month
# Monthly summary (no dimensions - just totals)
- name: monthly_summary
measures:
- revenue
- count
- avg_order_value
time_dimension: created_at
granularity: month
refresh:
every: "6 hours"
# Full breakdown for detailed analysis
- name: daily_full
measures:
- revenue
- count
dimensions:
- region
- product_category
time_dimension: created_at
granularity: day
refresh:
every: "2 hours"
indexes:
- columns: [region]
- columns: [product_category]
Managing Pre-Aggregations
from olytix_core.preagg.service import PreAggregationService
from olytix_core.preagg.models import BuildRequest
service = PreAggregationService()
# List all pre-aggregations
preaggs = await service.list_preaggregations(cube_name="Orders")
# PreAggregationInfo:
# ├── name: "daily_by_region"
# ├── cube: "Orders"
# ├── measures: ["revenue", "count"]
# ├── dimensions: ["region"]
# ├── time_dimension: "created_at"
# ├── granularity: "day"
# ├── status: "ready"
# ├── last_built: "2024-01-15T10:00:00Z"
# ├── row_count: 45000
# ├── size_bytes: 52428800 # 50 MB
# ├── query_matches_24h: 145
# └── refresh_schedule: "0 * * * *" # Every hour
# Get specific pre-aggregation status
status = await service.get_status(
cube_name="Orders",
preagg_name="daily_by_region"
)
Building Pre-Aggregations
# Trigger immediate build
build_result = await service.build(
BuildRequest(
cube_name="Orders",
preagg_name="daily_by_region",
build_type="incremental", # or "full"
date_range=("2024-01-01", "2024-01-31") # Optional
)
)
# BuildResult:
# ├── success: True
# ├── rows_affected: 1500
# ├── duration_seconds: 45
# ├── partitions_built: ["2024-01"]
# └── next_scheduled: "2024-01-15T11:00:00Z"
# Build all pre-aggregations for a cube
await service.build_all(cube_name="Orders")
Checking Query Matches
# Check if a query will use pre-aggregation
match = await service.check_match(
query={
"measures": ["Orders.revenue"],
"dimensions": ["Orders.region"],
"timeDimensions": [{
"dimension": "Orders.created_at",
"granularity": "day",
"dateRange": "Last 30 days"
}]
}
)
# MatchResult:
# ├── matched: True
# ├── preagg_name: "daily_by_region"
# ├── match_type: "exact"
# ├── estimated_rows: 150
# └── estimated_scan_bytes: 15000
Invalidation
# Invalidate pre-aggregation (mark as stale)
await service.invalidate(
cube_name="Orders",
preagg_name="daily_by_region",
reason="Source data updated"
)
# Invalidate with date range
await service.invalidate(
cube_name="Orders",
preagg_name="daily_by_region",
date_range=("2024-01-15", "2024-01-15"),
reason="Data correction for Jan 15"
)
API Endpoints
List Pre-Aggregations
GET /api/v1/preaggregations?cube=Orders
Get Status
GET /api/v1/preaggregations/Orders/daily_by_region/status
Trigger Build
POST /api/v1/preaggregations/Orders/daily_by_region/build
Content-Type: application/json
{
"build_type": "incremental",
"date_range": {
"start": "2024-01-01",
"end": "2024-01-31"
}
}
Check Query Match
POST /api/v1/preaggregations/match
Content-Type: application/json
{
"query": {
"measures": ["Orders.revenue"],
"dimensions": ["Orders.region"],
"timeDimensions": [{
"dimension": "Orders.created_at",
"granularity": "day",
"dateRange": "Last 30 days"
}]
}
}
Invalidate
POST /api/v1/preaggregations/Orders/daily_by_region/invalidate
Content-Type: application/json
{
"reason": "Source data correction",
"date_range": {
"start": "2024-01-15",
"end": "2024-01-15"
}
}
Get Build History
GET /api/v1/preaggregations/Orders/daily_by_region/builds?limit=10
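Any HTTP client can drive these endpoints. As a minimal sketch, a helper that assembles the URL and JSON body for the trigger-build call (the base URL is an assumption; the path and payload match the endpoint shown above):

```python
import json

BASE_URL = "https://olytix.example.com"  # assumed deployment URL

def build_trigger_request(cube: str, preagg: str,
                          build_type: str = "incremental",
                          date_range=None):
    """Assemble the URL and JSON body for the trigger-build endpoint."""
    url = f"{BASE_URL}/api/v1/preaggregations/{cube}/{preagg}/build"
    body = {"build_type": build_type}
    if date_range:
        body["date_range"] = {"start": date_range[0], "end": date_range[1]}
    return url, json.dumps(body)
```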
Pre-Aggregation Design
Choosing Measures
Include measures that are frequently queried together:
# Good - commonly queried together
pre_aggregations:
- name: sales_metrics
measures:
- revenue
- count
- avg_order_value
# Avoid - rarely needed together
pre_aggregations:
- name: everything
measures:
- revenue
- count
- avg_order_value
- discount_total
- shipping_cost
- tax_amount
- refund_amount
Choosing Dimensions
Include dimensions used for filtering and grouping:
# Good - supports common query patterns
pre_aggregations:
- name: by_region_category
measures: [revenue, count]
dimensions:
- region # Frequently filtered
- product_category # Frequently grouped
# Consider cardinality
# High cardinality dimensions (like customer_id) may not be suitable
Choosing Granularity
Match your typical query patterns:
| Granularity | Best For | Typical Size |
|---|---|---|
| minute | Real-time dashboards | Large |
| hour | Intraday analysis | Medium-Large |
| day | Daily reporting | Medium |
| week | Weekly summaries | Small-Medium |
| month | Monthly/quarterly reports | Small |
Rollup Support
Coarser granularities can be rolled up from finer ones:
minute → hour → day → week → month → quarter → year
A query for "monthly revenue" can use:
1. monthly_summary (exact match - fastest)
2. daily_by_region (rollup from day to month - fast)
3. Base table (no pre-agg - slowest)
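Treating the chain above as strictly ordered, rollup eligibility and the "coarsest usable source" preference can be sketched as (hypothetical helper names):

```python
# Granularity chain from the section above, fine -> coarse.
GRAIN_ORDER = ["minute", "hour", "day", "week", "month", "quarter", "year"]

def can_roll_up(preagg_grain: str, query_grain: str) -> bool:
    """A pre-agg can serve a query if its grain is at least as fine."""
    return GRAIN_ORDER.index(preagg_grain) <= GRAIN_ORDER.index(query_grain)

def best_rollup_source(candidate_grains, query_grain):
    """Prefer the coarsest usable source: fewest rows to re-aggregate."""
    usable = [g for g in candidate_grains if can_roll_up(g, query_grain)]
    return max(usable, key=GRAIN_ORDER.index) if usable else None
```

With `["day", "month"]` available, a monthly query picks `month` (exact); with only `["day"]` it rolls up from `day`; with only `["month"]` a daily query falls through to the base table.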
Partitioning
Partition large pre-aggregations for efficient incremental updates:
pre_aggregations:
- name: daily_by_region
measures: [revenue, count]
dimensions: [region]
time_dimension: created_at
granularity: day
# Partition by month
partition_granularity: month
# Build range
build_range:
start: "2023-01-01"
# end: defaults to now
Partition Benefits
- Incremental refresh - Only rebuild affected partitions
- Parallel builds - Build multiple partitions simultaneously
- Selective invalidation - Invalidate specific date ranges
- Storage efficiency - Drop old partitions
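Incremental refresh and selective invalidation both reduce to mapping an affected date range onto partition keys. For `partition_granularity: month` that mapping can be sketched as (a hypothetical helper, not the internal implementation):

```python
from datetime import date

def monthly_partitions(start: date, end: date) -> list:
    """List the YYYY-MM partition keys covering [start, end]."""
    keys = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        keys.append(f"{y:04d}-{m:02d}")
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return keys
```

Only the partitions returned here need rebuilding; everything outside the range is left untouched.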
Refresh Strategies
Scheduled Refresh
pre_aggregations:
- name: daily_metrics
refresh:
every: "1 hour" # Check every hour
sql_trigger: # Or use SQL trigger
select: "SELECT MAX(updated_at) FROM orders"
SQL Triggers
Refresh when source data changes:
pre_aggregations:
- name: daily_metrics
refresh:
sql_trigger:
select: "SELECT MAX(updated_at) FROM orders"
check_interval: "5 minutes"
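Conceptually, a SQL trigger is a poll: re-run the trigger query every `check_interval` and refresh when its value differs from the last one seen. A sketch of that comparison (not the internal implementation):

```python
def needs_refresh(run_trigger_query, last_seen):
    """Re-run the trigger query; a changed value means the source moved."""
    current = run_trigger_query()  # e.g. MAX(updated_at) from the source table
    return current != last_seen, current
```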
Manual Triggers
For event-driven refreshes:
# After data pipeline completes
await service.build(
BuildRequest(
cube_name="Orders",
preagg_name="daily_by_region",
build_type="incremental"
)
)
Indexes
Add indexes to improve query performance on pre-aggregation tables:
pre_aggregations:
- name: daily_full
measures: [revenue, count]
dimensions: [region, product_category, customer_type]
time_dimension: created_at
granularity: day
indexes:
- columns: [region] # Single column
- columns: [product_category, region] # Composite
- columns: [customer_type]
Monitoring
Build Metrics
# Get pre-aggregation metrics
metrics = await service.get_metrics(
cube_name="Orders",
preagg_name="daily_by_region"
)
# PreAggMetrics:
# ├── builds_24h: 24
# ├── builds_failed_24h: 0
# ├── avg_build_duration_seconds: 45
# ├── query_matches_24h: 1250
# ├── query_matches_total: 45000
# ├── cache_hit_rate: 0.92
# ├── estimated_cost_savings_24h: 125.50
# └── size_trend: [
# {"date": "2024-01-13", "size_mb": 48},
# {"date": "2024-01-14", "size_mb": 49},
# {"date": "2024-01-15", "size_mb": 50}
# ]
Alerts
Configure alerts for build failures:
from olytix_core.preagg.alerts import PreAggAlert
await service.configure_alert(
PreAggAlert(
cube_name="Orders",
preagg_name="daily_by_region",
alert_on=["build_failure", "stale_data"],
stale_threshold_hours=2,
channels=["email", "slack"],
recipients=["data-team@company.com"]
)
)
Best Practices
Design Guidelines
- Start with query patterns - Analyze what queries run most frequently
- Keep dimensions low cardinality - High cardinality defeats the purpose
- Include time dimension - Most queries filter by time
- Create multiple granularities - Different use cases need different rollups
Performance Tips
- Partition large pre-aggs - Enable incremental refresh
- Add appropriate indexes - For commonly filtered dimensions
- Monitor build times - Long builds may indicate issues
- Schedule off-peak - Don't compete with query workload
Maintenance
- Monitor cache hit rates - Low rates mean pre-aggs aren't being used
- Review unused pre-aggs - Remove ones with zero matches
- Update for new patterns - Add pre-aggs for emerging query patterns
- Clean old partitions - Implement retention policies
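The review of unused pre-aggregations can be automated by scanning match statistics; a sketch that filters on the `query_matches_24h` field shown in the metrics example above (helper name is hypothetical):

```python
def find_unused(metrics_by_name: dict, min_matches: int = 1) -> list:
    """Pre-aggregations whose 24h query matches fall below a floor."""
    return sorted(
        name for name, m in metrics_by_name.items()
        if m.get("query_matches_24h", 0) < min_matches
    )
```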
Configuration
# pre-aggregation configuration
preaggregation:
enabled: true
storage:
type: "warehouse" # warehouse, external
schema: "preagg" # Schema for pre-agg tables
build:
max_concurrent_builds: 3
timeout_seconds: 3600
retry_attempts: 3
retry_delay_seconds: 60
refresh:
default_check_interval: "1 hour"
min_check_interval: "5 minutes"
query_matching:
enabled: true
prefer_preagg: true
max_rollup_granularity: "month"
monitoring:
track_matches: true
track_savings: true
retention_days: 30
Troubleshooting
Pre-Agg Not Being Used
- Check match - Use the /match endpoint to see why
- Verify freshness - Stale pre-aggs may be skipped
- Check dimensions - Query may need dimensions not in the pre-agg
- Check granularity - Query granularity may be finer than the pre-agg
Build Failures
- Check source data - Schema changes can break builds
- Check resources - Insufficient memory/timeout
- Check permissions - Write access to pre-agg schema
- Review logs - Detailed error in build history
Next Steps
- Query Optimization - Cost prediction and savings
- Query Explanation - See when pre-aggs are used
- Pre-Aggregations Reference - Detailed YAML reference