Pre-aggregations
Pre-aggregations are materialized aggregations that dramatically improve query performance. By pre-computing common aggregations, Olytix Core can serve queries from cached results instead of scanning raw data.
What is a Pre-aggregation?
A pre-aggregation is a cached summary table that:
- Pre-computes measure aggregations
- Groups by specified dimensions
- Refreshes on a schedule
- Automatically serves matching queries
┌─────────────────────────────────────────────────────────────┐
│ Pre-aggregation Flow │
├─────────────────────────────────────────────────────────────┤
│ │
│ Query Request │
│ ┌─────────────┐ │
│ │ SUM(revenue)│ │
│ │ BY month │ │
│ └──────┬──────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ Match? ┌─────────────────┐ │
│ │ Query Planner │────Yes──────►│ Pre-aggregation │ │
│ └────────┬────────┘ │ (cached table) │ │
│ │ └─────────────────┘ │
│ │ No match │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Raw Data │ │ Fast Response │ │
│ │ (slow scan) │ │ (< 100ms) │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
When to Use Pre-aggregations
Use pre-aggregations when:
- Queries consistently take more than a few seconds
- The same aggregations are requested frequently
- Data volumes exceed millions of rows
- Dashboard load times need improvement
- You want to reduce database load
Basic Pre-aggregation Definition
cubes:
- name: orders
sql: "SELECT * FROM {{ ref('fct_orders') }}"
measures:
- name: count
type: count
- name: total_revenue
type: sum
sql: total_amount
dimensions:
- name: order_date
type: time
sql: order_date
granularities: [day, week, month, quarter, year]
- name: status
type: string
sql: status
pre_aggregations:
- name: orders_by_day
measures:
- count
- total_revenue
dimensions:
- status
time_dimension: order_date
granularity: day
Pre-aggregation Configuration Options
Each pre-aggregation supports these properties:
pre_aggregations:
- name: orders_by_month
measures: # Measures to pre-compute
- count
- total_revenue
dimensions: # Dimensions to include
- status
- region
time_dimension: order_date # Time dimension for granularity
granularity: month # Time granularity
partition_granularity: month # Partition size for incremental
segments: # Optional segment filters
- completed_orders
refresh: # Refresh configuration
every: "1 hour"
incremental: true
update_window: "7 days"
timezone: "UTC"
indexes: # Database indexes
- columns: [status, region]
build_range_start: "2020-01-01" # Historical data start
build_range_end: "now" # Build up to now
Granularity Levels
Pre-aggregations support these time granularities:
| Granularity | Description | Best For |
|---|---|---|
second | Aggregated to second | Real-time monitoring |
minute | Aggregated to minute | Short-term trends |
hour | Aggregated to hour | Intraday analysis |
day | Aggregated to day | Daily reporting |
week | Aggregated to week | Weekly metrics |
month | Aggregated to month | Monthly reporting |
quarter | Aggregated to quarter | Quarterly reviews |
year | Aggregated to year | Annual comparisons |
Choose a granularity that is finer or equal to your most common query granularity.
Rollup Strategies
Simple Rollup
Pre-compute all measure combinations for specified dimensions:
pre_aggregations:
- name: orders_by_status
measures:
- count
- total_revenue
- avg_order_value
dimensions:
- status
time_dimension: order_date
granularity: day
This serves queries like:
- Revenue by status by day
- Order count by status
- Average order value by day
Multi-dimension Rollup
Include multiple dimensions for more query coverage:
pre_aggregations:
- name: orders_by_region_status
measures:
- count
- total_revenue
dimensions:
- region
- status
- payment_method
time_dimension: order_date
granularity: day
Time-only Rollup
For time-series queries without dimensional breakdown:
pre_aggregations:
- name: orders_over_time
measures:
- count
- total_revenue
- unique_customers
time_dimension: order_date
granularity: day
Refresh Configuration
Control when and how pre-aggregations are refreshed:
Scheduled Refresh
pre_aggregations:
- name: orders_daily
measures: [count, total_revenue]
time_dimension: order_date
granularity: day
refresh:
every: "1 hour" # Check for refresh every hour
timezone: "UTC" # Timezone for scheduling
Common refresh intervals:
5 minutes- Near real-time dashboards1 hour- Standard operational reporting1 day- Daily batch reporting1 week- Historical analytics
Incremental Refresh
Only refresh recent data instead of the full table:
pre_aggregations:
- name: orders_incremental
measures: [count, total_revenue]
time_dimension: order_date
granularity: day
partition_granularity: month
refresh:
every: "1 hour"
incremental: true
update_window: "7 days" # Only refresh last 7 days
Incremental refresh is ideal when:
- Historical data rarely changes
- You have large datasets
- You want to minimize refresh time
Partitioning
Partition pre-aggregations for efficient incremental updates:
pre_aggregations:
- name: orders_partitioned
measures: [count, total_revenue]
dimensions: [status]
time_dimension: order_date
granularity: day
partition_granularity: month # Partition by month
refresh:
every: "1 hour"
incremental: true
update_window: "30 days"
Partition granularity determines how data is divided:
day- One partition per dayweek- One partition per weekmonth- One partition per month (recommended)year- One partition per year
Build Range
Control the historical range for pre-aggregation builds:
pre_aggregations:
- name: orders_historical
measures: [count, total_revenue]
time_dimension: order_date
granularity: month
build_range_start: "2020-01-01" # Start from 2020
build_range_end: "now" # Build up to current time
Use build ranges to:
- Limit historical data processing
- Control storage usage
- Speed up initial builds
Indexes
Add database indexes to improve query performance:
pre_aggregations:
- name: orders_indexed
measures: [count, total_revenue]
dimensions: [status, region, payment_method]
time_dimension: order_date
granularity: day
indexes:
- columns: [status]
- columns: [region, status]
- columns: [order_date, status]
Index columns that are frequently used in:
- Filter conditions
- Group by clauses
- Join conditions
Segment Filters
Pre-aggregate only specific data segments:
cubes:
- name: orders
sql: "SELECT * FROM {{ ref('fct_orders') }}"
segments:
- name: completed
sql: "status = 'completed'"
pre_aggregations:
- name: completed_orders_daily
measures: [count, total_revenue]
time_dimension: order_date
granularity: day
segments:
- completed
Complete Example
Here is a comprehensive cube with multiple pre-aggregation strategies:
cubes:
- name: orders
sql: "SELECT * FROM {{ ref('fct_orders') }}"
description: "E-commerce order transactions"
measures:
- name: count
type: count
description: "Total orders"
- name: total_revenue
type: sum
sql: total_amount
format: currency
- name: avg_order_value
type: avg
sql: total_amount
format: currency
- name: unique_customers
type: count_distinct
sql: customer_id
dimensions:
- name: order_id
type: number
sql: order_id
primary_key: true
- name: order_date
type: time
sql: order_date
granularities: [day, week, month, quarter, year]
- name: status
type: string
sql: status
- name: region
type: string
sql: region
- name: payment_method
type: string
sql: payment_method
segments:
- name: completed
sql: "status = 'completed'"
pre_aggregations:
# High-frequency dashboard queries
- name: main_dashboard
measures:
- count
- total_revenue
- unique_customers
dimensions:
- status
- region
time_dimension: order_date
granularity: day
partition_granularity: month
refresh:
every: "15 minutes"
incremental: true
update_window: "3 days"
indexes:
- columns: [status]
- columns: [region]
# Monthly executive reports
- name: monthly_summary
measures:
- count
- total_revenue
- avg_order_value
- unique_customers
dimensions:
- region
time_dimension: order_date
granularity: month
refresh:
every: "1 day"
build_range_start: "2020-01-01"
# Payment method analysis
- name: payment_analysis
measures:
- count
- total_revenue
dimensions:
- payment_method
- region
time_dimension: order_date
granularity: week
refresh:
every: "1 hour"
# Completed orders only
- name: completed_orders_daily
measures:
- count
- total_revenue
time_dimension: order_date
granularity: day
segments:
- completed
refresh:
every: "30 minutes"
incremental: true
update_window: "7 days"
# Time-series only (no dimensions)
- name: orders_trend
measures:
- count
- total_revenue
time_dimension: order_date
granularity: hour
partition_granularity: day
refresh:
every: "5 minutes"
incremental: true
update_window: "1 day"
Query Matching
Olytix Core automatically uses pre-aggregations when a query matches:
- Measures: All requested measures are in the pre-aggregation
- Dimensions: All requested dimensions are in the pre-aggregation
- Time granularity: Requested granularity is equal to or coarser than pre-aggregation
- Time range: Query time range overlaps with pre-aggregation data
- Segments: Query segments match pre-aggregation segments
Example query that matches main_dashboard pre-aggregation:
{
"measures": ["orders.total_revenue", "orders.count"],
"dimensions": ["orders.region", "orders.order_date.month"],
"filters": [
{
"dimension": "orders.order_date",
"operator": "gte",
"values": ["2024-01-01"]
}
]
}
Monitoring Pre-aggregations
Check pre-aggregation status via API:
# List all pre-aggregations
curl http://localhost:8000/api/v1/pre-aggregations
# Check specific pre-aggregation status
curl http://localhost:8000/api/v1/pre-aggregations/orders.main_dashboard
Response includes:
- Last build time
- Row count
- Build duration
- Next scheduled refresh
Best Practices
Start with Common Queries
Analyze your most frequent queries and create pre-aggregations for those patterns:
# If dashboards always query by region and month
pre_aggregations:
- name: regional_monthly
measures: [count, total_revenue]
dimensions: [region]
time_dimension: order_date
granularity: month
Use Incremental for Large Datasets
For tables with millions of rows, always use incremental refresh:
pre_aggregations:
- name: large_table_preagg
measures: [count, total_revenue]
time_dimension: order_date
granularity: day
partition_granularity: month
refresh:
incremental: true
update_window: "7 days"
Match Granularity to Query Patterns
Choose the finest granularity commonly queried:
# If users query by day but also need week/month rollups
pre_aggregations:
- name: orders_daily
granularity: day # Can serve day, week, month, quarter, year queries
Limit Dimension Combinations
Too many dimensions create large pre-aggregation tables:
# Good: Focused dimensions
pre_aggregations:
- name: focused_preagg
dimensions: [status, region] # 2 dimensions
# Avoid: Too many dimensions
pre_aggregations:
- name: bloated_preagg
dimensions: [status, region, payment_method, category, channel, source]
Create Multiple Targeted Pre-aggregations
Instead of one large pre-aggregation, create several targeted ones:
pre_aggregations:
# For regional reports
- name: by_region
measures: [count, total_revenue]
dimensions: [region]
granularity: day
# For status monitoring
- name: by_status
measures: [count]
dimensions: [status]
granularity: hour
# For payment analysis
- name: by_payment
measures: [count, total_revenue]
dimensions: [payment_method]
granularity: day
Next Steps
- Cube Fundamentals - Review cube structure
- Measures - Optimize measure definitions
- Dimensions - Plan dimension granularities