Skip to main content

Pre-aggregations

For Data Analysts

Pre-aggregations are materialized aggregations that dramatically improve query performance. By pre-computing common aggregations, Olytix Core can serve queries from cached results instead of scanning raw data.

What is a Pre-aggregation?

A pre-aggregation is a cached summary table that:

  • Pre-computes measure aggregations
  • Groups by specified dimensions
  • Refreshes on a schedule
  • Automatically serves matching queries
┌─────────────────────────────────────────────────────────────┐
│ Pre-aggregation Flow │
├─────────────────────────────────────────────────────────────┤
│ │
│ Query Request │
│ ┌─────────────┐ │
│ │ SUM(revenue)│ │
│ │ BY month │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ Match? ┌─────────────────┐ │
│ │ Query Planner │────Yes──────►│ Pre-aggregation │ │
│ └────────┬────────┘ │ (cached table) │ │
│ │ └─────────────────┘ │
│ │ No match │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Raw Data │ │ Fast Response │ │
│ │ (slow scan) │ │ (< 100ms) │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘

When to Use Pre-aggregations

Use pre-aggregations when:

  • Queries consistently take more than a few seconds
  • The same aggregations are requested frequently
  • Data volumes exceed millions of rows
  • Dashboard load times need improvement
  • You want to reduce database load

Basic Pre-aggregation Definition

cubes:
- name: orders
sql: "SELECT * FROM {{ ref('fct_orders') }}"

measures:
- name: count
type: count
- name: total_revenue
type: sum
sql: total_amount

dimensions:
- name: order_date
type: time
sql: order_date
granularities: [day, week, month, quarter, year]
- name: status
type: string
sql: status

pre_aggregations:
- name: orders_by_day
measures:
- count
- total_revenue
dimensions:
- status
time_dimension: order_date
granularity: day

Pre-aggregation Configuration Options

Each pre-aggregation supports these properties:

pre_aggregations:
- name: orders_by_month
measures: # Measures to pre-compute
- count
- total_revenue
dimensions: # Dimensions to include
- status
- region
time_dimension: order_date # Time dimension for granularity
granularity: month # Time granularity
partition_granularity: month # Partition size for incremental
segments: # Optional segment filters
- completed_orders
refresh: # Refresh configuration
every: "1 hour"
incremental: true
update_window: "7 days"
timezone: "UTC"
indexes: # Database indexes
- columns: [status, region]
build_range_start: "2020-01-01" # Historical data start
build_range_end: "now" # Build up to now

Granularity Levels

Pre-aggregations support these time granularities:

GranularityDescriptionBest For
secondAggregated to secondReal-time monitoring
minuteAggregated to minuteShort-term trends
hourAggregated to hourIntraday analysis
dayAggregated to dayDaily reporting
weekAggregated to weekWeekly metrics
monthAggregated to monthMonthly reporting
quarterAggregated to quarterQuarterly reviews
yearAggregated to yearAnnual comparisons

Choose a granularity that is finer or equal to your most common query granularity.

Rollup Strategies

Simple Rollup

Pre-compute all measure combinations for specified dimensions:

pre_aggregations:
- name: orders_by_status
measures:
- count
- total_revenue
- avg_order_value
dimensions:
- status
time_dimension: order_date
granularity: day

This serves queries like:

  • Revenue by status by day
  • Order count by status
  • Average order value by day

Multi-dimension Rollup

Include multiple dimensions for more query coverage:

pre_aggregations:
- name: orders_by_region_status
measures:
- count
- total_revenue
dimensions:
- region
- status
- payment_method
time_dimension: order_date
granularity: day

Time-only Rollup

For time-series queries without dimensional breakdown:

pre_aggregations:
- name: orders_over_time
measures:
- count
- total_revenue
- unique_customers
time_dimension: order_date
granularity: day

Refresh Configuration

Control when and how pre-aggregations are refreshed:

Scheduled Refresh

pre_aggregations:
- name: orders_daily
measures: [count, total_revenue]
time_dimension: order_date
granularity: day
refresh:
every: "1 hour" # Check for refresh every hour
timezone: "UTC" # Timezone for scheduling

Common refresh intervals:

  • 5 minutes - Near real-time dashboards
  • 1 hour - Standard operational reporting
  • 1 day - Daily batch reporting
  • 1 week - Historical analytics

Incremental Refresh

Only refresh recent data instead of the full table:

pre_aggregations:
- name: orders_incremental
measures: [count, total_revenue]
time_dimension: order_date
granularity: day
partition_granularity: month
refresh:
every: "1 hour"
incremental: true
update_window: "7 days" # Only refresh last 7 days

Incremental refresh is ideal when:

  • Historical data rarely changes
  • You have large datasets
  • You want to minimize refresh time

Partitioning

Partition pre-aggregations for efficient incremental updates:

pre_aggregations:
- name: orders_partitioned
measures: [count, total_revenue]
dimensions: [status]
time_dimension: order_date
granularity: day
partition_granularity: month # Partition by month
refresh:
every: "1 hour"
incremental: true
update_window: "30 days"

Partition granularity determines how data is divided:

  • day - One partition per day
  • week - One partition per week
  • month - One partition per month (recommended)
  • year - One partition per year

Build Range

Control the historical range for pre-aggregation builds:

pre_aggregations:
- name: orders_historical
measures: [count, total_revenue]
time_dimension: order_date
granularity: month
build_range_start: "2020-01-01" # Start from 2020
build_range_end: "now" # Build up to current time

Use build ranges to:

  • Limit historical data processing
  • Control storage usage
  • Speed up initial builds

Indexes

Add database indexes to improve query performance:

pre_aggregations:
- name: orders_indexed
measures: [count, total_revenue]
dimensions: [status, region, payment_method]
time_dimension: order_date
granularity: day
indexes:
- columns: [status]
- columns: [region, status]
- columns: [order_date, status]

Index columns that are frequently used in:

  • Filter conditions
  • Group by clauses
  • Join conditions

Segment Filters

Pre-aggregate only specific data segments:

cubes:
- name: orders
sql: "SELECT * FROM {{ ref('fct_orders') }}"

segments:
- name: completed
sql: "status = 'completed'"

pre_aggregations:
- name: completed_orders_daily
measures: [count, total_revenue]
time_dimension: order_date
granularity: day
segments:
- completed

Complete Example

Here is a comprehensive cube with multiple pre-aggregation strategies:

cubes:
- name: orders
sql: "SELECT * FROM {{ ref('fct_orders') }}"
description: "E-commerce order transactions"

measures:
- name: count
type: count
description: "Total orders"

- name: total_revenue
type: sum
sql: total_amount
format: currency

- name: avg_order_value
type: avg
sql: total_amount
format: currency

- name: unique_customers
type: count_distinct
sql: customer_id

dimensions:
- name: order_id
type: number
sql: order_id
primary_key: true

- name: order_date
type: time
sql: order_date
granularities: [day, week, month, quarter, year]

- name: status
type: string
sql: status

- name: region
type: string
sql: region

- name: payment_method
type: string
sql: payment_method

segments:
- name: completed
sql: "status = 'completed'"

pre_aggregations:
# High-frequency dashboard queries
- name: main_dashboard
measures:
- count
- total_revenue
- unique_customers
dimensions:
- status
- region
time_dimension: order_date
granularity: day
partition_granularity: month
refresh:
every: "15 minutes"
incremental: true
update_window: "3 days"
indexes:
- columns: [status]
- columns: [region]

# Monthly executive reports
- name: monthly_summary
measures:
- count
- total_revenue
- avg_order_value
- unique_customers
dimensions:
- region
time_dimension: order_date
granularity: month
refresh:
every: "1 day"
build_range_start: "2020-01-01"

# Payment method analysis
- name: payment_analysis
measures:
- count
- total_revenue
dimensions:
- payment_method
- region
time_dimension: order_date
granularity: week
refresh:
every: "1 hour"

# Completed orders only
- name: completed_orders_daily
measures:
- count
- total_revenue
time_dimension: order_date
granularity: day
segments:
- completed
refresh:
every: "30 minutes"
incremental: true
update_window: "7 days"

# Time-series only (no dimensions)
- name: orders_trend
measures:
- count
- total_revenue
time_dimension: order_date
granularity: hour
partition_granularity: day
refresh:
every: "5 minutes"
incremental: true
update_window: "1 day"

Query Matching

Olytix Core automatically uses pre-aggregations when a query matches:

  1. Measures: All requested measures are in the pre-aggregation
  2. Dimensions: All requested dimensions are in the pre-aggregation
  3. Time granularity: Requested granularity is equal to or coarser than pre-aggregation
  4. Time range: Query time range overlaps with pre-aggregation data
  5. Segments: Query segments match pre-aggregation segments

Example query that matches main_dashboard pre-aggregation:

{
"measures": ["orders.total_revenue", "orders.count"],
"dimensions": ["orders.region", "orders.order_date.month"],
"filters": [
{
"dimension": "orders.order_date",
"operator": "gte",
"values": ["2024-01-01"]
}
]
}

Monitoring Pre-aggregations

Check pre-aggregation status via API:

# List all pre-aggregations
curl http://localhost:8000/api/v1/pre-aggregations

# Check specific pre-aggregation status
curl http://localhost:8000/api/v1/pre-aggregations/orders.main_dashboard

Response includes:

  • Last build time
  • Row count
  • Build duration
  • Next scheduled refresh

Best Practices

Start with Common Queries

Analyze your most frequent queries and create pre-aggregations for those patterns:

# If dashboards always query by region and month
pre_aggregations:
- name: regional_monthly
measures: [count, total_revenue]
dimensions: [region]
time_dimension: order_date
granularity: month

Use Incremental for Large Datasets

For tables with millions of rows, always use incremental refresh:

pre_aggregations:
- name: large_table_preagg
measures: [count, total_revenue]
time_dimension: order_date
granularity: day
partition_granularity: month
refresh:
incremental: true
update_window: "7 days"

Match Granularity to Query Patterns

Choose the finest granularity commonly queried:

# If users query by day but also need week/month rollups
pre_aggregations:
- name: orders_daily
granularity: day # Can serve day, week, month, quarter, year queries

Limit Dimension Combinations

Too many dimensions create large pre-aggregation tables:

# Good: Focused dimensions
pre_aggregations:
- name: focused_preagg
dimensions: [status, region] # 2 dimensions

# Avoid: Too many dimensions
pre_aggregations:
- name: bloated_preagg
dimensions: [status, region, payment_method, category, channel, source]

Create Multiple Targeted Pre-aggregations

Instead of one large pre-aggregation, create several targeted ones:

pre_aggregations:
# For regional reports
- name: by_region
measures: [count, total_revenue]
dimensions: [region]
granularity: day

# For status monitoring
- name: by_status
measures: [count]
dimensions: [status]
granularity: hour

# For payment analysis
- name: by_payment
measures: [count, total_revenue]
dimensions: [payment_method]
granularity: day

Next Steps