
Pre-Aggregations

For Data Analysts

Pre-aggregations are materialized rollup tables that dramatically accelerate query performance and reduce data warehouse costs by pre-computing common aggregations.

Overview

┌─────────────────────────────────────────────────────────────────────┐
│                       PRE-AGGREGATION SYSTEM                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌──────────────┐                     ┌──────────────┐             │
│   │    QUERY     │                     │    PREAGG    │             │
│   │              │                     │   MATCHER    │             │
│   │  measures    │────────────────────▶│              │             │
│   │  dimensions  │                     │  Find match  │             │
│   │  filters     │                     │  or fallback │             │
│   └──────────────┘                     └──────┬───────┘             │
│                                               │                     │
│                     ┌─────────────────────────┼────────────┐        │
│                     │                         │            │        │
│                     ▼                         ▼            ▼        │
│   ┌──────────────────────┐  ┌──────────────────────┐  ┌─────────┐  │
│   │   PRE-AGG TABLE      │  │   PRE-AGG TABLE      │  │  BASE   │  │
│   │  daily_by_region     │  │  monthly_summary     │  │  TABLE  │  │
│   │                      │  │                      │  │         │  │
│   │  50 MB               │  │  5 MB                │  │  50 GB  │  │
│   │  ~2 seconds          │  │  ~0.5 seconds        │  │  ~60s   │  │
│   └──────────────────────┘  └──────────────────────┘  └─────────┘  │
│                                                                     │
│   ┌─────────────────────────────────────────────────────────────┐   │
│   │                       BUILD & REFRESH                       │   │
│   │                                                             │   │
│   │  ┌─────────┐  ┌─────────┐  ┌───────────┐  ┌─────────┐      │   │
│   │  │Scheduled│  │On-Demand│  │Incremental│  │  Full   │      │   │
│   │  │ (Cron)  │  │  (API)  │  │  Refresh  │  │ Rebuild │      │   │
│   │  └─────────┘  └─────────┘  └───────────┘  └─────────┘      │   │
│   │                                                             │   │
│   └─────────────────────────────────────────────────────────────┘   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Key Features

Automatic Query Matching

When you run a query, Olytix Core automatically checks if a pre-aggregation can answer it:

Match Type   Description                                          Performance
Exact        Pre-agg has exactly the needed measures/dimensions   Fastest
Superset     Pre-agg has more than needed, filters applied        Very fast
Rollup       Pre-agg can be aggregated further (day → month)      Fast
None         Query must hit base table                            Slowest
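The matcher's decision can be sketched as a simple classification over measures, dimensions, and time granularity. This is an illustrative model only (the function and field names are hypothetical, and real matching also considers filters, partitions, and freshness), not Olytix Core's actual matching code:

```python
# Hypothetical sketch of match-type classification; simplified on purpose.

GRANULARITY_ORDER = ["minute", "hour", "day", "week", "month", "quarter", "year"]

def classify_match(query: dict, preagg: dict) -> str:
    """Return one of: 'exact', 'superset', 'rollup', 'none'."""
    q_measures, p_measures = set(query["measures"]), set(preagg["measures"])
    q_dims, p_dims = set(query.get("dimensions", [])), set(preagg.get("dimensions", []))

    # Every requested measure and dimension must exist in the pre-agg.
    if not (q_measures <= p_measures and q_dims <= p_dims):
        return "none"

    q_gran = GRANULARITY_ORDER.index(query["granularity"])
    p_gran = GRANULARITY_ORDER.index(preagg["granularity"])
    if q_gran < p_gran:
        return "none"      # query needs finer time buckets than the pre-agg stores
    if q_gran > p_gran:
        return "rollup"    # e.g. a day-grained pre-agg answering a monthly query
    if q_measures == p_measures and q_dims == p_dims:
        return "exact"
    return "superset"      # pre-agg has extra columns; filter/project them away
```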

Significant Cost Reduction

Pre-aggregations typically reduce query costs by 90-99%:

Without Pre-Aggregation:
├── Scan: 50 GB of raw data
├── Time: ~60 seconds
└── Cost: $0.25 per query

With Pre-Aggregation:
├── Scan: 50 MB (0.1% of original)
├── Time: ~2 seconds (97% faster)
└── Cost: $0.0025 per query (99% savings)

Flexible Refresh Options

  • Scheduled - Cron-based automatic refresh
  • On-demand - Trigger refresh via API
  • Incremental - Update only new/changed data
  • Full rebuild - Complete reconstruction

Build Status Monitoring

Track the health of your pre-aggregations:

  • Build success/failure status
  • Last refresh time
  • Row counts and size
  • Query match statistics

Usage

Defining Pre-Aggregations

Pre-aggregations are defined in your cube YAML files:

# cubes/orders.yml
cubes:
  - name: Orders
    sql: SELECT * FROM orders

    measures:
      - name: revenue
        type: sum
        sql: amount

      - name: count
        type: count

      - name: avg_order_value
        type: avg
        sql: amount

    dimensions:
      - name: region
        type: string
        sql: region

      - name: product_category
        type: string
        sql: product_category

      - name: created_at
        type: time
        sql: created_at

    pre_aggregations:
      # Daily rollup by region
      - name: daily_by_region
        measures:
          - revenue
          - count
        dimensions:
          - region
        time_dimension: created_at
        granularity: day
        refresh:
          every: "1 hour"
        partition_granularity: month

      # Monthly summary (no dimensions - just totals)
      - name: monthly_summary
        measures:
          - revenue
          - count
          - avg_order_value
        time_dimension: created_at
        granularity: month
        refresh:
          every: "6 hours"

      # Full breakdown for detailed analysis
      - name: daily_full
        measures:
          - revenue
          - count
        dimensions:
          - region
          - product_category
        time_dimension: created_at
        granularity: day
        refresh:
          every: "2 hours"
        indexes:
          - columns: [region]
          - columns: [product_category]

Managing Pre-Aggregations

from olytix_core.preagg.service import PreAggregationService
from olytix_core.preagg.models import BuildRequest

service = PreAggregationService()

# List all pre-aggregations
preaggs = await service.list_preaggregations(cube_name="Orders")

# PreAggregationInfo:
# ├── name: "daily_by_region"
# ├── cube: "Orders"
# ├── measures: ["revenue", "count"]
# ├── dimensions: ["region"]
# ├── time_dimension: "created_at"
# ├── granularity: "day"
# ├── status: "ready"
# ├── last_built: "2024-01-15T10:00:00Z"
# ├── row_count: 45000
# ├── size_bytes: 52428800 # 50 MB
# ├── query_matches_24h: 145
# └── refresh_schedule: "0 * * * *" # Every hour

# Get specific pre-aggregation status
status = await service.get_status(
    cube_name="Orders",
    preagg_name="daily_by_region"
)

Building Pre-Aggregations

# Trigger immediate build
build_result = await service.build(
    BuildRequest(
        cube_name="Orders",
        preagg_name="daily_by_region",
        build_type="incremental",                # or "full"
        date_range=("2024-01-01", "2024-01-31")  # Optional
    )
)

# BuildResult:
# ├── success: True
# ├── rows_affected: 1500
# ├── duration_seconds: 45
# ├── partitions_built: ["2024-01"]
# └── next_scheduled: "2024-01-15T11:00:00Z"

# Build all pre-aggregations for a cube
await service.build_all(cube_name="Orders")

Checking Query Matches

# Check if a query will use pre-aggregation
match = await service.check_match(
    query={
        "measures": ["Orders.revenue"],
        "dimensions": ["Orders.region"],
        "timeDimensions": [{
            "dimension": "Orders.created_at",
            "granularity": "day",
            "dateRange": "Last 30 days"
        }]
    }
)

# MatchResult:
# ├── matched: True
# ├── preagg_name: "daily_by_region"
# ├── match_type: "exact"
# ├── estimated_rows: 150
# └── estimated_scan_bytes: 15000

Invalidation

# Invalidate pre-aggregation (mark as stale)
await service.invalidate(
    cube_name="Orders",
    preagg_name="daily_by_region",
    reason="Source data updated"
)

# Invalidate with date range
await service.invalidate(
    cube_name="Orders",
    preagg_name="daily_by_region",
    date_range=("2024-01-15", "2024-01-15"),
    reason="Data correction for Jan 15"
)

API Endpoints

List Pre-Aggregations

GET /api/v1/preaggregations?cube=Orders

Get Status

GET /api/v1/preaggregations/Orders/daily_by_region/status

Trigger Build

POST /api/v1/preaggregations/Orders/daily_by_region/build
Content-Type: application/json

{
  "build_type": "incremental",
  "date_range": {
    "start": "2024-01-01",
    "end": "2024-01-31"
  }
}

Check Query Match

POST /api/v1/preaggregations/match
Content-Type: application/json

{
  "query": {
    "measures": ["Orders.revenue"],
    "dimensions": ["Orders.region"],
    "timeDimensions": [{
      "dimension": "Orders.created_at",
      "granularity": "day",
      "dateRange": "Last 30 days"
    }]
  }
}

Invalidate

POST /api/v1/preaggregations/Orders/daily_by_region/invalidate
Content-Type: application/json

{
  "reason": "Source data correction",
  "date_range": {
    "start": "2024-01-15",
    "end": "2024-01-15"
  }
}

Get Build History

GET /api/v1/preaggregations/Orders/daily_by_region/builds?limit=10

Pre-Aggregation Design

Choosing Measures

Include measures that are frequently queried together:

# Good - commonly queried together
pre_aggregations:
  - name: sales_metrics
    measures:
      - revenue
      - count
      - avg_order_value

# Avoid - rarely needed together
pre_aggregations:
  - name: everything
    measures:
      - revenue
      - count
      - avg_order_value
      - discount_total
      - shipping_cost
      - tax_amount
      - refund_amount

Choosing Dimensions

Include dimensions used for filtering and grouping:

# Good - supports common query patterns
pre_aggregations:
  - name: by_region_category
    measures: [revenue, count]
    dimensions:
      - region            # Frequently filtered
      - product_category  # Frequently grouped

# Consider cardinality
# High cardinality dimensions (like customer_id) may not be suitable

Choosing Granularity

Match your typical query patterns:

Granularity   Best For                    Typical Size
minute        Real-time dashboards        Large
hour          Intraday analysis           Medium-Large
day           Daily reporting             Medium
week          Weekly summaries            Small-Medium
month         Monthly/quarterly reports   Small
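The sizing intuition behind this table is roughly one row per (time bucket × dimension combination), which also explains the cardinality caution in the previous section. A back-of-the-envelope sketch (the function and numbers are illustrative, not product behavior):

```python
# Rough upper bound on pre-agg row count: every dimension combination
# appears in every time bucket. Real tables are sparser than this.

def estimate_rows(periods: int, *dimension_cardinalities: int) -> int:
    """periods: number of time buckets; remaining args: per-dimension cardinality."""
    rows = periods
    for card in dimension_cardinalities:
        rows *= card
    return rows

# daily_by_region over 2 years with 10 regions stays tiny:
daily = estimate_rows(730, 10)                  # 7,300 rows
# the same daily rollup keyed by a 1M-customer dimension explodes:
per_customer = estimate_rows(730, 1_000_000)    # 730,000,000 rows
```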

Rollup Support

Coarser granularities can be rolled up from finer ones:

minute → hour → day → week → month → quarter → year

Query for "monthly revenue" can use:
1. monthly_summary (exact match - fastest)
2. daily_by_region (rollup from day to month - fast)
3. Base table (no pre-agg - slowest)
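Option 2 works because sums and counts are additive: day-level rows can simply be re-summed into months, while an average must be rebuilt from the rolled-up sum and count rather than averaging the daily averages. A minimal sketch of that rollup (the row shape is hypothetical, not engine code):

```python
# Roll a day-grained pre-agg up to months. Additive measures (sum, count)
# are re-aggregated; avg_order_value is derived from the rolled-up totals.
from collections import defaultdict

def rollup_to_month(daily_rows):
    """daily_rows: iterable of (date 'YYYY-MM-DD', revenue, count) tuples."""
    monthly = defaultdict(lambda: [0.0, 0])
    for day, revenue, count in daily_rows:
        month = day[:7]                # 'YYYY-MM' partition key
        monthly[month][0] += revenue   # sums are additive
        monthly[month][1] += count     # counts are additive
    # (revenue, count, avg_order_value) per month; avg rebuilt from totals
    return {m: (rev, cnt, rev / cnt) for m, (rev, cnt) in monthly.items()}
```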

Partitioning

Partition large pre-aggregations for efficient incremental updates:

pre_aggregations:
  - name: daily_by_region
    measures: [revenue, count]
    dimensions: [region]
    time_dimension: created_at
    granularity: day

    # Partition by month
    partition_granularity: month

    # Build range
    build_range:
      start: "2023-01-01"
      # end: defaults to now

Partition Benefits

  1. Incremental refresh - Only rebuild affected partitions
  2. Parallel builds - Build multiple partitions simultaneously
  3. Selective invalidation - Invalidate specific date ranges
  4. Storage efficiency - Drop old partitions
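Benefits 1 and 3 both reduce to mapping a date range onto partition keys: only the partitions that overlap the affected range are rebuilt or invalidated. A sketch of that mapping for partition_granularity: month (the helper name is hypothetical, not part of the Olytix Core API):

```python
# Enumerate the 'YYYY-MM' partition keys covering an inclusive date range.

def month_partitions(start: str, end: str) -> list[str]:
    """start/end: 'YYYY-MM-DD' dates; returns the months they span."""
    y, m = int(start[:4]), int(start[5:7])
    end_y, end_m = int(end[:4]), int(end[5:7])
    parts = []
    while (y, m) <= (end_y, end_m):
        parts.append(f"{y:04d}-{m:02d}")
        m += 1
        if m > 12:       # roll over into the next year
            y, m = y + 1, 1
    return parts
```

A correction spanning 2024-01-15 through 2024-02-02 then touches only the "2024-01" and "2024-02" partitions; everything else stays as built.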

Refresh Strategies

Scheduled Refresh

pre_aggregations:
  - name: daily_metrics
    refresh:
      every: "1 hour"  # Check every hour
      sql_trigger:     # Or use SQL trigger
        select: "SELECT MAX(updated_at) FROM orders"

SQL Triggers

Refresh when source data changes:

pre_aggregations:
  - name: daily_metrics
    refresh:
      sql_trigger:
        select: "SELECT MAX(updated_at) FROM orders"
        check_interval: "5 minutes"

Manual Triggers

For event-driven refreshes:

# After data pipeline completes
await service.build(
    BuildRequest(
        cube_name="Orders",
        preagg_name="daily_by_region",
        build_type="incremental"
    )
)

Indexes

Add indexes to improve query performance on pre-aggregation tables:

pre_aggregations:
  - name: daily_full
    measures: [revenue, count]
    dimensions: [region, product_category, customer_type]
    time_dimension: created_at
    granularity: day

    indexes:
      - columns: [region]                    # Single column
      - columns: [product_category, region]  # Composite
      - columns: [customer_type]

Monitoring

Build Metrics

# Get pre-aggregation metrics
metrics = await service.get_metrics(
    cube_name="Orders",
    preagg_name="daily_by_region"
)

# PreAggMetrics:
# ├── builds_24h: 24
# ├── builds_failed_24h: 0
# ├── avg_build_duration_seconds: 45
# ├── query_matches_24h: 1250
# ├── query_matches_total: 45000
# ├── cache_hit_rate: 0.92
# ├── estimated_cost_savings_24h: 125.50
# └── size_trend: [
# {"date": "2024-01-13", "size_mb": 48},
# {"date": "2024-01-14", "size_mb": 49},
# {"date": "2024-01-15", "size_mb": 50}
# ]

Alerts

Configure alerts for build failures:

from olytix_core.preagg.alerts import PreAggAlert

await service.configure_alert(
    PreAggAlert(
        cube_name="Orders",
        preagg_name="daily_by_region",
        alert_on=["build_failure", "stale_data"],
        stale_threshold_hours=2,
        channels=["email", "slack"],
        recipients=["data-team@company.com"]
    )
)

Best Practices

Design Guidelines

  1. Start with query patterns - Analyze what queries run most frequently
  2. Keep dimensions low cardinality - High cardinality defeats the purpose
  3. Include time dimension - Most queries filter by time
  4. Create multiple granularities - Different use cases need different rollups

Performance Tips

  1. Partition large pre-aggs - Enable incremental refresh
  2. Add appropriate indexes - For commonly filtered dimensions
  3. Monitor build times - Long builds may indicate issues
  4. Schedule off-peak - Don't compete with query workload

Maintenance

  1. Monitor cache hit rates - Low rates mean pre-aggs aren't being used
  2. Review unused pre-aggs - Remove ones with zero matches
  3. Update for new patterns - Add pre-aggs for emerging query patterns
  4. Clean old partitions - Implement retention policies
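Points 1 and 2 can be scripted against the fields shown in PreAggregationInfo earlier (query_matches_24h, size_bytes). A minimal sketch, assuming that record shape; the threshold is arbitrary and the function is not part of the Olytix Core API:

```python
# Flag pre-aggregations with no recent query matches as candidates for
# removal or redesign. Input dicts mirror the PreAggregationInfo fields.

def audit_preaggs(infos, min_matches_24h=1):
    """Return (name, size_mb) for pre-aggs below the match threshold."""
    unused = []
    for info in infos:
        if info["query_matches_24h"] < min_matches_24h:
            unused.append((info["name"], round(info["size_bytes"] / (1024 * 1024), 1)))
    return unused

infos = [
    {"name": "daily_by_region", "query_matches_24h": 145, "size_bytes": 52428800},
    {"name": "legacy_rollup", "query_matches_24h": 0, "size_bytes": 10485760},
]
# audit_preaggs(infos) -> [("legacy_rollup", 10.0)]
```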

Configuration

# pre-aggregation configuration
preaggregation:
  enabled: true

  storage:
    type: "warehouse"  # warehouse, external
    schema: "preagg"   # Schema for pre-agg tables

  build:
    max_concurrent_builds: 3
    timeout_seconds: 3600
    retry_attempts: 3
    retry_delay_seconds: 60

  refresh:
    default_check_interval: "1 hour"
    min_check_interval: "5 minutes"

  query_matching:
    enabled: true
    prefer_preagg: true
    max_rollup_granularity: "month"

  monitoring:
    track_matches: true
    track_savings: true
    retention_days: 30

Troubleshooting

Pre-Agg Not Being Used

  1. Check match - Use /match endpoint to see why
  2. Verify freshness - Stale pre-aggs may be skipped
  3. Check dimensions - Query may need dimensions not in pre-agg
  4. Check granularity - Query granularity may be finer than pre-agg
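Checks 3 and 4 can be reproduced locally before reaching for the /match endpoint. An illustrative helper (simplified, and not the endpoint's real logic; field names follow the YAML examples above):

```python
# Report why a query cannot use a pre-aggregation: missing dimensions or
# a finer-than-stored time granularity. The function itself is hypothetical.

GRANULARITY_ORDER = ["minute", "hour", "day", "week", "month", "quarter", "year"]

def explain_miss(query: dict, preagg: dict) -> list[str]:
    reasons = []
    missing = set(query.get("dimensions", [])) - set(preagg.get("dimensions", []))
    if missing:
        reasons.append(f"dimensions not in pre-agg: {sorted(missing)}")
    if GRANULARITY_ORDER.index(query["granularity"]) < GRANULARITY_ORDER.index(preagg["granularity"]):
        reasons.append(
            f"query granularity '{query['granularity']}' is finer than "
            f"pre-agg granularity '{preagg['granularity']}'"
        )
    return reasons  # empty list: neither of these two causes applies
```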

Build Failures

  1. Check source data - Schema changes can break builds
  2. Check resources - Insufficient memory/timeout
  3. Check permissions - Write access to pre-agg schema
  4. Review logs - Detailed error in build history

Next Steps