Pre-Aggregations
Pre-aggregations are materialized rollup tables that dramatically accelerate query performance and reduce data warehouse costs by pre-computing common aggregations.
Overview
┌─────────────────────────────────────────────────────────────────────┐
│                       PRE-AGGREGATION SYSTEM                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌──────────────┐                    ┌──────────────┐              │
│   │    QUERY     │                    │    PREAGG    │              │
│   │              │                    │   MATCHER    │              │
│   │  measures    │───────────────────▶│              │              │
│   │  dimensions  │                    │  Find match  │              │
│   │  filters     │                    │  or fallback │              │
│   └──────────────┘                    └──────┬───────┘              │
│                                              │                      │
│                    ┌─────────────────────────┼─────────────────┐    │
│                    │                         │                 │    │
│                    ▼                         ▼                 ▼    │
│   ┌──────────────────────┐  ┌──────────────────────┐  ┌─────────┐   │
│   │    PRE-AGG TABLE     │  │    PRE-AGG TABLE     │  │  BASE   │   │
│   │   daily_by_region    │  │   monthly_summary    │  │  TABLE  │   │
│   │                      │  │                      │  │         │   │
│   │   50 MB              │  │   5 MB               │  │  50 GB  │   │
│   │   ~2 seconds         │  │   ~0.5 seconds       │  │  ~60s   │   │
│   └──────────────────────┘  └──────────────────────┘  └─────────┘   │
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                       BUILD & REFRESH                       │   │
│   │                                                             │   │
│   │  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐    │   │
│   │  │ Scheduled │ │ On-Demand │ │Incremental│ │   Full    │    │   │
│   │  │  (Cron)   │ │   (API)   │ │  Refresh  │ │  Rebuild  │    │   │
│   │  └───────────┘ └───────────┘ └───────────┘ └───────────┘    │   │
│   │                                                             │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
Key Features
Automatic Query Matching
When you run a query, Olytix Core automatically checks if a pre-aggregation can answer it:
| Match Type | Description | Performance |
|---|---|---|
| Exact | Pre-agg has exactly the needed measures/dimensions | Fastest |
| Superset | Pre-agg has more than needed, filters applied | Very fast |
| Rollup | Pre-agg can be aggregated further (day → month) | Fast |
| None | Query must hit base table | Slowest |
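The precedence in the table above can be sketched as a small selection function. This is a minimal illustration, not the Olytix Core matcher itself; `PreAgg`, `classify`, and `pick_match` are hypothetical names:

```python
from dataclasses import dataclass

# Ranking of match types, best first (mirrors the table above).
MATCH_RANK = {"exact": 0, "superset": 1, "rollup": 2, "none": 3}

# Granularities ordered fine -> coarse, used for rollup checks.
GRAINS = ["minute", "hour", "day", "week", "month", "quarter", "year"]

@dataclass
class PreAgg:
    name: str
    measures: set
    dimensions: set
    granularity: str

def classify(preagg, measures, dimensions, granularity):
    """Classify how a pre-aggregation can serve a query, if at all."""
    if not (measures <= preagg.measures and dimensions <= preagg.dimensions):
        return "none"  # missing a measure or dimension: fall back to base table
    if preagg.granularity == granularity:
        full = measures == preagg.measures and dimensions == preagg.dimensions
        return "exact" if full else "superset"
    if GRAINS.index(preagg.granularity) < GRAINS.index(granularity):
        return "rollup"  # finer grain can be aggregated further (day -> month)
    return "none"  # pre-agg is coarser than the query requires

def pick_match(preaggs, measures, dimensions, granularity):
    """Return (name, match_type) of the best-ranked candidate, or None."""
    ranked = sorted(
        (MATCH_RANK[classify(p, measures, dimensions, granularity)], p.name)
        for p in preaggs
    )
    rank, name = ranked[0]
    if rank == MATCH_RANK["none"]:
        return None
    return name, {v: k for k, v in MATCH_RANK.items()}[rank]
```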
Significant Cost Reduction
Pre-aggregations typically reduce query costs by 90-99%:
Without Pre-Aggregation:
├── Scan: 50 GB of raw data
├── Time: ~60 seconds
└── Cost: $0.25 per query
With Pre-Aggregation:
├── Scan: 50 MB (0.1% of original)
├── Time: ~2 seconds (97% faster)
└── Cost: $0.0025 per query (99% savings)
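The savings follow directly from the scan ratio: in scan-priced warehouses, cost scales with bytes scanned. A quick check, assuming a flat $5 per TB rate (an illustrative figure, not an Olytix price):

```python
# Assumed flat warehouse rate of $5 per TB scanned (not an Olytix figure).
PRICE_PER_TB_USD = 5.00

def query_cost(scanned_gb: float) -> float:
    """Cost of a query that scans `scanned_gb` gigabytes."""
    return scanned_gb / 1024 * PRICE_PER_TB_USD

# 50 GB base-table scan vs. a 50 MB pre-aggregation scan:
base_cost = query_cost(50)
preagg_cost = query_cost(0.05)
savings = 1 - preagg_cost / base_cost  # scan ratio drives cost ratio
```

Scanning 0.1% of the data cuts the per-query cost by the same factor, whatever the warehouse's actual rate.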
Flexible Refresh Options
- Scheduled - Cron-based automatic refresh
- On-demand - Trigger refresh via API
- Incremental - Update only new/changed data
- Full rebuild - Complete reconstruction
Build Status Monitoring
Track the health of your pre-aggregations:
- Build success/failure status
- Last refresh time
- Row counts and size
- Query match statistics
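The simplest health signal is staleness: a pre-agg whose last build is older than its refresh schedule allows. A sketch of that check (field names assumed, using the `last_built` timestamp shown later in `PreAggregationInfo`):

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_built: datetime, max_age_hours: float) -> bool:
    """Flag a pre-aggregation whose last successful build is too old."""
    return datetime.now(timezone.utc) - last_built > timedelta(hours=max_age_hours)
```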
Usage
Defining Pre-Aggregations
Pre-aggregations are defined in your cube YAML files:
# cubes/orders.yml
cubes:
- name: Orders
sql: SELECT * FROM orders
measures:
- name: revenue
type: sum
sql: amount
- name: count
type: count
- name: avg_order_value
type: avg
sql: amount
dimensions:
- name: region
type: string
sql: region
- name: product_category
type: string
sql: product_category
- name: created_at
type: time
sql: created_at
pre_aggregations:
# Daily rollup by region
- name: daily_by_region
measures:
- revenue
- count
dimensions:
- region
time_dimension: created_at
granularity: day
refresh:
every: "1 hour"
partition_granularity: month
# Monthly summary (no dimensions - just totals)
- name: monthly_summary
measures:
- revenue
- count
- avg_order_value
time_dimension: created_at
granularity: month
refresh:
every: "6 hours"
# Full breakdown for detailed analysis
- name: daily_full
measures:
- revenue
- count
dimensions:
- region
- product_category
time_dimension: created_at
granularity: day
refresh:
every: "2 hours"
indexes:
- columns: [region]
- columns: [product_category]
Managing Pre-Aggregations
from olytix_core.preagg.service import PreAggregationService
from olytix_core.preagg.models import BuildRequest
service = PreAggregationService()
# List all pre-aggregations
preaggs = await service.list_preaggregations(cube_name="Orders")
# PreAggregationInfo:
# ├── name: "daily_by_region"
# ├── cube: "Orders"
# ├── measures: ["revenue", "count"]
# ├── dimensions: ["region"]
# ├── time_dimension: "created_at"
# ├── granularity: "day"
# ├── status: "ready"
# ├── last_built: "2024-01-15T10:00:00Z"
# ├── row_count: 45000
# ├── size_bytes: 52428800 # 50 MB
# ├── query_matches_24h: 145
# └── refresh_schedule: "0 * * * *" # Every hour
# Get specific pre-aggregation status
status = await service.get_status(
cube_name="Orders",
preagg_name="daily_by_region"
)
Building Pre-Aggregations
# Trigger immediate build
build_result = await service.build(
BuildRequest(
cube_name="Orders",
preagg_name="daily_by_region",
build_type="incremental", # or "full"
date_range=("2024-01-01", "2024-01-31") # Optional
)
)
# BuildResult:
# ├── success: True
# ├── rows_affected: 1500
# ├── duration_seconds: 45
# ├── partitions_built: ["2024-01"]
# └── next_scheduled: "2024-01-15T11:00:00Z"
# Build all pre-aggregations for a cube
await service.build_all(cube_name="Orders")
Checking Query Matches
# Check if a query will use pre-aggregation
match = await service.check_match(
query={
"measures": ["Orders.revenue"],
"dimensions": ["Orders.region"],
"timeDimensions": [{
"dimension": "Orders.created_at",
"granularity": "day",
"dateRange": "Last 30 days"
}]
}
)
# MatchResult:
# ├── matched: True
# ├── preagg_name: "daily_by_region"
# ├── match_type: "exact"
# ├── estimated_rows: 150
# └── estimated_scan_bytes: 15000
Invalidation
# Invalidate pre-aggregation (mark as stale)
await service.invalidate(
cube_name="Orders",
preagg_name="daily_by_region",
reason="Source data updated"
)
# Invalidate with date range
await service.invalidate(
cube_name="Orders",
preagg_name="daily_by_region",
date_range=("2024-01-15", "2024-01-15"),
reason="Data correction for Jan 15"
)
API Endpoints
List Pre-Aggregations
GET /api/v1/preaggregations?cube=Orders
Get Status
GET /api/v1/preaggregations/Orders/daily_by_region/status
Trigger Build
POST /api/v1/preaggregations/Orders/daily_by_region/build
Content-Type: application/json
{
"build_type": "incremental",
"date_range": {
"start": "2024-01-01",
"end": "2024-01-31"
}
}
Check Query Match
POST /api/v1/preaggregations/match
Content-Type: application/json
{
"query": {
"measures": ["Orders.revenue"],
"dimensions": ["Orders.region"],
"timeDimensions": [{
"dimension": "Orders.created_at",
"granularity": "day",
"dateRange": "Last 30 days"
}]
}
}
Invalidate
POST /api/v1/preaggregations/Orders/daily_by_region/invalidate
Content-Type: application/json
{
"reason": "Source data correction",
"date_range": {
"start": "2024-01-15",
"end": "2024-01-15"
}
}
Get Build History
GET /api/v1/preaggregations/Orders/daily_by_region/builds?limit=10
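Any HTTP client can drive these endpoints. As a minimal sketch, a helper that assembles the URL and JSON body for the trigger-build call (the base URL is an assumption; the path and payload match the endpoint shown above):

```python
import json

BASE_URL = "https://olytix.example.com"  # assumed deployment URL

def build_trigger_request(cube: str, preagg: str,
                          build_type: str = "incremental",
                          date_range=None):
    """Assemble the URL and JSON body for the trigger-build endpoint."""
    url = f"{BASE_URL}/api/v1/preaggregations/{cube}/{preagg}/build"
    body = {"build_type": build_type}
    if date_range:
        body["date_range"] = {"start": date_range[0], "end": date_range[1]}
    return url, json.dumps(body)
```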
Pre-Aggregation Design
Choosing Measures
Include measures that are frequently queried together:
# Good - commonly queried together
pre_aggregations:
- name: sales_metrics
measures:
- revenue
- count
- avg_order_value
# Avoid - rarely needed together
pre_aggregations:
- name: everything
measures:
- revenue
- count
- avg_order_value
- discount_total
- shipping_cost
- tax_amount
- refund_amount
Choosing Dimensions
Include dimensions used for filtering and grouping:
# Good - supports common query patterns
pre_aggregations:
- name: by_region_category
measures: [revenue, count]
dimensions:
- region # Frequently filtered
- product_category # Frequently grouped
# Consider cardinality
# High cardinality dimensions (like customer_id) may not be suitable
Choosing Granularity
Match your typical query patterns:
| Granularity | Best For | Typical Size |
|---|---|---|
| minute | Real-time dashboards | Large |
| hour | Intraday analysis | Medium-Large |
| day | Daily reporting | Medium |
| week | Weekly summaries | Small-Medium |
| month | Monthly/quarterly reports | Small |
Rollup Support
Coarser granularities can be rolled up from finer ones:
minute → hour → day → week → month → quarter → year
A query for "monthly revenue" can use:
1. monthly_summary (exact match - fastest)
2. daily_by_region (rollup from day to month - fast)
3. Base table (no pre-agg - slowest)
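Treating the chain above as strictly ordered, rollup eligibility and the "coarsest usable source" preference can be sketched as (hypothetical helper names):

```python
# Granularity chain from the section above, fine -> coarse.
GRAIN_ORDER = ["minute", "hour", "day", "week", "month", "quarter", "year"]

def can_roll_up(preagg_grain: str, query_grain: str) -> bool:
    """A pre-agg can serve a query if its grain is at least as fine."""
    return GRAIN_ORDER.index(preagg_grain) <= GRAIN_ORDER.index(query_grain)

def best_rollup_source(candidate_grains, query_grain):
    """Prefer the coarsest usable source: fewest rows to re-aggregate."""
    usable = [g for g in candidate_grains if can_roll_up(g, query_grain)]
    return max(usable, key=GRAIN_ORDER.index) if usable else None
```

With `["day", "month"]` available, a monthly query picks `month` (exact); with only `["day"]` it rolls up from `day`; with only `["month"]` a daily query falls through to the base table.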
Partitioning
Partition large pre-aggregations for efficient incremental updates:
pre_aggregations:
- name: daily_by_region
measures: [revenue, count]
dimensions: [region]
time_dimension: created_at
granularity: day
# Partition by month
partition_granularity: month
# Build range
build_range:
start: "2023-01-01"
# end: defaults to now
Partition Benefits
- Incremental refresh - Only rebuild affected partitions
- Parallel builds - Build multiple partitions simultaneously
- Selective invalidation - Invalidate specific date ranges
- Storage efficiency - Drop old partitions
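Incremental refresh and selective invalidation both reduce to mapping an affected date range onto partition keys. For `partition_granularity: month` that mapping can be sketched as (a hypothetical helper, not the internal implementation):

```python
from datetime import date

def monthly_partitions(start: date, end: date) -> list:
    """List the YYYY-MM partition keys covering [start, end]."""
    keys = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        keys.append(f"{y:04d}-{m:02d}")
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return keys
```

Only the partitions returned here need rebuilding; everything outside the range is left untouched.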
Refresh Strategies
Scheduled Refresh
pre_aggregations:
- name: daily_metrics
refresh:
every: "1 hour" # Check every hour
sql_trigger: # Or use SQL trigger
select: "SELECT MAX(updated_at) FROM orders"
SQL Triggers
Refresh when source data changes:
pre_aggregations:
- name: daily_metrics
refresh:
sql_trigger:
select: "SELECT MAX(updated_at) FROM orders"
check_interval: "5 minutes"
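Conceptually, a SQL trigger is a poll: re-run the trigger query every `check_interval` and refresh when its value differs from the last one seen. A sketch of that comparison (not the internal implementation):

```python
def needs_refresh(run_trigger_query, last_seen):
    """Re-run the trigger query; a changed value means the source moved."""
    current = run_trigger_query()  # e.g. MAX(updated_at) from the source table
    return current != last_seen, current
```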
Manual Triggers
For event-driven refreshes:
# After data pipeline completes
await service.build(
BuildRequest(
cube_name="Orders",
preagg_name="daily_by_region",
build_type="incremental"
)
)
Indexes
Add indexes to improve query performance on pre-aggregation tables:
pre_aggregations:
- name: daily_full
measures: [revenue, count]
dimensions: [region, product_category, customer_type]
time_dimension: created_at
granularity: day
indexes:
- columns: [region] # Single column
- columns: [product_category, region] # Composite
- columns: [customer_type]
Monitoring
Build Metrics
# Get pre-aggregation metrics
metrics = await service.get_metrics(
cube_name="Orders",
preagg_name="daily_by_region"
)
# PreAggMetrics:
# ├── builds_24h: 24
# ├── builds_failed_24h: 0
# ├── avg_build_duration_seconds: 45
# ├── query_matches_24h: 1250
# ├── query_matches_total: 45000
# ├── cache_hit_rate: 0.92
# ├── estimated_cost_savings_24h: 125.50
# └── size_trend: [
# {"date": "2024-01-13", "size_mb": 48},
# {"date": "2024-01-14", "size_mb": 49},
# {"date": "2024-01-15", "size_mb": 50}
# ]
Alerts
Configure alerts for build failures:
from olytix_core.preagg.alerts import PreAggAlert
await service.configure_alert(
PreAggAlert(
cube_name="Orders",
preagg_name="daily_by_region",
alert_on=["build_failure", "stale_data"],
stale_threshold_hours=2,
channels=["email", "slack"],
recipients=["data-team@company.com"]
)
)
Best Practices
Design Guidelines
- Start with query patterns - Analyze what queries run most frequently
- Keep dimensions low cardinality - High cardinality defeats the purpose
- Include time dimension - Most queries filter by time
- Create multiple granularities - Different use cases need different rollups
Performance Tips
- Partition large pre-aggs - Enable incremental refresh
- Add appropriate indexes - For commonly filtered dimensions
- Monitor build times - Long builds may indicate issues
- Schedule off-peak - Don't compete with query workload
Maintenance
- Monitor cache hit rates - Low rates mean pre-aggs aren't being used
- Review unused pre-aggs - Remove ones with zero matches
- Update for new patterns - Add pre-aggs for emerging query patterns
- Clean old partitions - Implement retention policies
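The review of unused pre-aggregations can be automated by scanning match statistics; a sketch that filters on the `query_matches_24h` field shown in the metrics example above (helper name is hypothetical):

```python
def find_unused(metrics_by_name: dict, min_matches: int = 1) -> list:
    """Pre-aggregations whose 24h query matches fall below a floor."""
    return sorted(
        name for name, m in metrics_by_name.items()
        if m.get("query_matches_24h", 0) < min_matches
    )
```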
Configuration
# pre-aggregation configuration
preaggregation:
enabled: true
storage:
type: "warehouse" # warehouse, external
schema: "preagg" # Schema for pre-agg tables
build:
max_concurrent_builds: 3
timeout_seconds: 3600
retry_attempts: 3
retry_delay_seconds: 60
refresh:
default_check_interval: "1 hour"
min_check_interval: "5 minutes"
query_matching:
enabled: true
prefer_preagg: true
max_rollup_granularity: "month"
monitoring:
track_matches: true
track_savings: true
retention_days: 30
Troubleshooting
Pre-Agg Not Being Used
- Check match - Use the /match endpoint to see why
- Verify freshness - Stale pre-aggs may be skipped
- Check dimensions - Query may need dimensions not in the pre-agg
- Check granularity - Query granularity may be finer than the pre-agg
Build Failures
- Check source data - Schema changes can break builds
- Check resources - Insufficient memory/timeout
- Check permissions - Write access to pre-agg schema
- Review logs - Detailed error in build history
Next Steps
- Query Optimization - Cost prediction and savings
- Query Explanation - See when pre-aggs are used
- Pre-Aggregations Reference - Detailed YAML reference