Pre-aggregations

For Data Analysts

Pre-aggregations are materialized aggregations that dramatically improve query performance. By pre-computing common aggregations, Olytix Core can serve queries from cached results instead of scanning raw data.

What is a Pre-aggregation?

A pre-aggregation is a cached summary table that:

Pre-computes measure aggregations
Groups by specified dimensions
Refreshes on a schedule
Automatically serves matching queries

┌─────────────────────────────────────────────────────────────┐
│                 Pre-aggregation Flow                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   Query Request                                              │
│   ┌─────────────┐                                           │
│   │ SUM(revenue)│                                           │
│   │ BY month    │                                           │
│   └──────┬──────┘                                           │
│          │                                                   │
│          ▼                                                   │
│   ┌─────────────────┐    Match?    ┌─────────────────┐     │
│   │ Query Planner   │────Yes──────►│ Pre-aggregation │     │
│   └────────┬────────┘              │ (cached table)  │     │
│            │                       └─────────────────┘     │
│            │ No match                      │               │
│            ▼                               ▼               │
│   ┌─────────────────┐              ┌─────────────────┐     │
│   │   Raw Data      │              │  Fast Response  │     │
│   │   (slow scan)   │              │  (< 100ms)      │     │
│   └─────────────────┘              └─────────────────┘     │
│                                                              │
└─────────────────────────────────────────────────────────────┘

When to Use Pre-aggregations

Use pre-aggregations when:

Queries consistently take more than a few seconds
The same aggregations are requested frequently
Data volumes exceed millions of rows
Dashboard load times need improvement
You want to reduce database load

Basic Pre-aggregation Definition

cubes:
  - name: orders
    sql: "SELECT * FROM {{ ref('fct_orders') }}"

    measures:
      - name: count
        type: count
      - name: total_revenue
        type: sum
        sql: total_amount

    dimensions:
      - name: order_date
        type: time
        sql: order_date
        granularities: [day, week, month, quarter, year]
      - name: status
        type: string
        sql: status

    pre_aggregations:
      - name: orders_by_day
        measures:
          - count
          - total_revenue
        dimensions:
          - status
        time_dimension: order_date
        granularity: day

Pre-aggregation Configuration Options

Each pre-aggregation supports these properties:

pre_aggregations:
  - name: orders_by_month
    measures:                          # Measures to pre-compute
      - count
      - total_revenue
    dimensions:                        # Dimensions to include
      - status
      - region
    time_dimension: order_date         # Time dimension for granularity
    granularity: month                 # Time granularity
    partition_granularity: month       # Partition size for incremental
    segments:                          # Optional segment filters
      - completed_orders
    refresh:                           # Refresh configuration
      every: "1 hour"
      incremental: true
      update_window: "7 days"
      timezone: "UTC"
    indexes:                           # Database indexes
      - columns: [status, region]
    build_range_start: "2020-01-01"   # Historical data start
    build_range_end: "now"             # Build up to now

Granularity Levels

Pre-aggregations support these time granularities:

Granularity	Description	Best For
`second`	Aggregated to second	Real-time monitoring
`minute`	Aggregated to minute	Short-term trends
`hour`	Aggregated to hour	Intraday analysis
`day`	Aggregated to day	Daily reporting
`week`	Aggregated to week	Weekly metrics
`month`	Aggregated to month	Monthly reporting
`quarter`	Aggregated to quarter	Quarterly reviews
`year`	Aggregated to year	Annual comparisons

Choose a granularity that is finer or equal to your most common query granularity.

Rollup Strategies

Simple Rollup

Pre-compute all measure combinations for specified dimensions:

pre_aggregations:
  - name: orders_by_status
    measures:
      - count
      - total_revenue
      - avg_order_value
    dimensions:
      - status
    time_dimension: order_date
    granularity: day

This serves queries like:

Revenue by status by day
Order count by status
Average order value by day

Multi-dimension Rollup

Include multiple dimensions for more query coverage:

pre_aggregations:
  - name: orders_by_region_status
    measures:
      - count
      - total_revenue
    dimensions:
      - region
      - status
      - payment_method
    time_dimension: order_date
    granularity: day

Time-only Rollup

For time-series queries without dimensional breakdown:

pre_aggregations:
  - name: orders_over_time
    measures:
      - count
      - total_revenue
      - unique_customers
    time_dimension: order_date
    granularity: day

Refresh Configuration

Control when and how pre-aggregations are refreshed:

Scheduled Refresh

pre_aggregations:
  - name: orders_daily
    measures: [count, total_revenue]
    time_dimension: order_date
    granularity: day
    refresh:
      every: "1 hour"        # Check for refresh every hour
      timezone: "UTC"        # Timezone for scheduling

Common refresh intervals:

5 minutes - Near real-time dashboards
1 hour - Standard operational reporting
1 day - Daily batch reporting
1 week - Historical analytics

Incremental Refresh

Only refresh recent data instead of the full table:

pre_aggregations:
  - name: orders_incremental
    measures: [count, total_revenue]
    time_dimension: order_date
    granularity: day
    partition_granularity: month
    refresh:
      every: "1 hour"
      incremental: true
      update_window: "7 days"   # Only refresh last 7 days

Incremental refresh is ideal when:

Historical data rarely changes
You have large datasets
You want to minimize refresh time

Partitioning

Partition pre-aggregations for efficient incremental updates:

pre_aggregations:
  - name: orders_partitioned
    measures: [count, total_revenue]
    dimensions: [status]
    time_dimension: order_date
    granularity: day
    partition_granularity: month      # Partition by month
    refresh:
      every: "1 hour"
      incremental: true
      update_window: "30 days"

Partition granularity determines how data is divided:

day - One partition per day
week - One partition per week
month - One partition per month (recommended)
year - One partition per year

Build Range

Control the historical range for pre-aggregation builds:

pre_aggregations:
  - name: orders_historical
    measures: [count, total_revenue]
    time_dimension: order_date
    granularity: month
    build_range_start: "2020-01-01"   # Start from 2020
    build_range_end: "now"             # Build up to current time

Use build ranges to:

Limit historical data processing
Control storage usage
Speed up initial builds

Indexes

Add database indexes to improve query performance:

pre_aggregations:
  - name: orders_indexed
    measures: [count, total_revenue]
    dimensions: [status, region, payment_method]
    time_dimension: order_date
    granularity: day
    indexes:
      - columns: [status]
      - columns: [region, status]
      - columns: [order_date, status]

Index columns that are frequently used in:

Filter conditions
Group by clauses
Join conditions

Segment Filters

Pre-aggregate only specific data segments:

cubes:
  - name: orders
    sql: "SELECT * FROM {{ ref('fct_orders') }}"

    segments:
      - name: completed
        sql: "status = 'completed'"

    pre_aggregations:
      - name: completed_orders_daily
        measures: [count, total_revenue]
        time_dimension: order_date
        granularity: day
        segments:
          - completed

Complete Example

Here is a comprehensive cube with multiple pre-aggregation strategies:

cubes:
  - name: orders
    sql: "SELECT * FROM {{ ref('fct_orders') }}"
    description: "E-commerce order transactions"

    measures:
      - name: count
        type: count
        description: "Total orders"

      - name: total_revenue
        type: sum
        sql: total_amount
        format: currency

      - name: avg_order_value
        type: avg
        sql: total_amount
        format: currency

      - name: unique_customers
        type: count_distinct
        sql: customer_id

    dimensions:
      - name: order_id
        type: number
        sql: order_id
        primary_key: true

      - name: order_date
        type: time
        sql: order_date
        granularities: [day, week, month, quarter, year]

      - name: status
        type: string
        sql: status

      - name: region
        type: string
        sql: region

      - name: payment_method
        type: string
        sql: payment_method

    segments:
      - name: completed
        sql: "status = 'completed'"

    pre_aggregations:
      # High-frequency dashboard queries
      - name: main_dashboard
        measures:
          - count
          - total_revenue
          - unique_customers
        dimensions:
          - status
          - region
        time_dimension: order_date
        granularity: day
        partition_granularity: month
        refresh:
          every: "15 minutes"
          incremental: true
          update_window: "3 days"
        indexes:
          - columns: [status]
          - columns: [region]

      # Monthly executive reports
      - name: monthly_summary
        measures:
          - count
          - total_revenue
          - avg_order_value
          - unique_customers
        dimensions:
          - region
        time_dimension: order_date
        granularity: month
        refresh:
          every: "1 day"
        build_range_start: "2020-01-01"

      # Payment method analysis
      - name: payment_analysis
        measures:
          - count
          - total_revenue
        dimensions:
          - payment_method
          - region
        time_dimension: order_date
        granularity: week
        refresh:
          every: "1 hour"

      # Completed orders only
      - name: completed_orders_daily
        measures:
          - count
          - total_revenue
        time_dimension: order_date
        granularity: day
        segments:
          - completed
        refresh:
          every: "30 minutes"
          incremental: true
          update_window: "7 days"

      # Time-series only (no dimensions)
      - name: orders_trend
        measures:
          - count
          - total_revenue
        time_dimension: order_date
        granularity: hour
        partition_granularity: day
        refresh:
          every: "5 minutes"
          incremental: true
          update_window: "1 day"

Query Matching

Olytix Core automatically uses pre-aggregations when a query matches:

Measures: All requested measures are in the pre-aggregation
Dimensions: All requested dimensions are in the pre-aggregation
Time granularity: Requested granularity is equal to or coarser than pre-aggregation
Time range: Query time range overlaps with pre-aggregation data
Segments: Query segments match pre-aggregation segments

Example query that matches main_dashboard pre-aggregation:

{
  "measures": ["orders.total_revenue", "orders.count"],
  "dimensions": ["orders.region", "orders.order_date.month"],
  "filters": [
    {
      "dimension": "orders.order_date",
      "operator": "gte",
      "values": ["2024-01-01"]
    }
  ]
}

Monitoring Pre-aggregations

Check pre-aggregation status via API:

# List all pre-aggregations
curl http://localhost:8000/api/v1/pre-aggregations

# Check specific pre-aggregation status
curl http://localhost:8000/api/v1/pre-aggregations/orders.main_dashboard

Response includes:

Last build time
Row count
Build duration
Next scheduled refresh

Best Practices

Start with Common Queries

Analyze your most frequent queries and create pre-aggregations for those patterns:

# If dashboards always query by region and month
pre_aggregations:
  - name: regional_monthly
    measures: [count, total_revenue]
    dimensions: [region]
    time_dimension: order_date
    granularity: month

Use Incremental for Large Datasets

For tables with millions of rows, always use incremental refresh:

pre_aggregations:
  - name: large_table_preagg
    measures: [count, total_revenue]
    time_dimension: order_date
    granularity: day
    partition_granularity: month
    refresh:
      incremental: true
      update_window: "7 days"

Match Granularity to Query Patterns

Choose the finest granularity commonly queried:

# If users query by day but also need week/month rollups
pre_aggregations:
  - name: orders_daily
    granularity: day    # Can serve day, week, month, quarter, year queries

Limit Dimension Combinations

Too many dimensions create large pre-aggregation tables:

# Good: Focused dimensions
pre_aggregations:
  - name: focused_preagg
    dimensions: [status, region]    # 2 dimensions

# Avoid: Too many dimensions
pre_aggregations:
  - name: bloated_preagg
    dimensions: [status, region, payment_method, category, channel, source]

Create Multiple Targeted Pre-aggregations

Instead of one large pre-aggregation, create several targeted ones:

pre_aggregations:
  # For regional reports
  - name: by_region
    measures: [count, total_revenue]
    dimensions: [region]
    granularity: day

  # For status monitoring
  - name: by_status
    measures: [count]
    dimensions: [status]
    granularity: hour

  # For payment analysis
  - name: by_payment
    measures: [count, total_revenue]
    dimensions: [payment_method]
    granularity: day

Next Steps

Cube Fundamentals - Review cube structure
Measures - Optimize measure definitions
Dimensions - Plan dimension granularities

What is a Pre-aggregation?​

When to Use Pre-aggregations​

Basic Pre-aggregation Definition​

Pre-aggregation Configuration Options​

Granularity Levels​

Rollup Strategies​

Simple Rollup​

Multi-dimension Rollup​

Time-only Rollup​

Refresh Configuration​

Scheduled Refresh​

Incremental Refresh​

Partitioning​

Build Range​

Indexes​

Segment Filters​

Complete Example​

Query Matching​

Monitoring Pre-aggregations​

Best Practices​

Start with Common Queries​

Use Incremental for Large Datasets​

Match Granularity to Query Patterns​

Limit Dimension Combinations​

Create Multiple Targeted Pre-aggregations​

Next Steps​