# Retry Policies
Olytix Core implements configurable retry policies with multiple backoff strategies to handle transient failures gracefully. This guide covers retry configuration, backoff algorithms, and best practices for reliable operation.
## Overview
Retry policies help handle transient failures by:
- Automatically retrying failed operations
- Using backoff strategies to avoid overwhelming services
- Adding jitter to prevent thundering herd problems
- Respecting circuit breakers to stop retries when appropriate
## Retry Flow with Exponential Backoff

The flow is: execute the operation and, on success, return the result. On failure, wait for the current backoff delay, then retry; the delay grows exponentially between attempts. Once the maximum number of attempts (3 by default) is exhausted, the failure is raised to the caller.
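As a minimal sketch, the flow above is a loop around the operation. This is illustrative only (it assumes a synchronous `operation` callable and is not the `RetryPolicy` implementation):

```python
import time


def retry_with_backoff(operation, max_attempts=3, initial_delay=1.0,
                       multiplier=2.0, max_delay=60.0, sleep=time.sleep):
    """Execute `operation`, retrying with exponential backoff on failure."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the failure
            sleep(delay)  # wait before the next attempt
            delay = min(delay * multiplier, max_delay)  # grow the delay
```

The `sleep` parameter is injected so the loop can be exercised without real waiting.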
## Configuration

### Environment Variables

```bash
# Enable retry policies
OLYTIX_RETRY__ENABLED=true

# Default retry settings
OLYTIX_RETRY__MAX_ATTEMPTS=3
OLYTIX_RETRY__INITIAL_DELAY_SECONDS=1.0
OLYTIX_RETRY__MAX_DELAY_SECONDS=60.0
OLYTIX_RETRY__BACKOFF_MULTIPLIER=2.0
OLYTIX_RETRY__JITTER=true

# Warehouse-specific settings
OLYTIX_RETRY__WAREHOUSE__MAX_ATTEMPTS=5
OLYTIX_RETRY__WAREHOUSE__INITIAL_DELAY_SECONDS=2.0
OLYTIX_RETRY__WAREHOUSE__MAX_DELAY_SECONDS=120.0

# Cache-specific settings
OLYTIX_RETRY__CACHE__MAX_ATTEMPTS=2
OLYTIX_RETRY__CACHE__INITIAL_DELAY_SECONDS=0.1
OLYTIX_RETRY__CACHE__MAX_DELAY_SECONDS=1.0
```
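The double-underscore convention maps flat environment variables onto nested settings. A minimal sketch of that mapping (the `parse_env` helper and its exact rules are assumptions for illustration, not Olytix Core internals):

```python
def parse_env(env: dict[str, str], prefix: str = "OLYTIX_RETRY__") -> dict:
    """Map OLYTIX_RETRY__A__B=x onto nested {"a": {"b": "x"}} (illustrative)."""
    tree: dict = {}
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # ignore unrelated environment variables
        parts = key[len(prefix):].lower().split("__")
        node = tree
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # descend, creating levels as needed
        node[parts[-1]] = value
    return tree
```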
### Configuration File

```yaml
# config/retry.yaml
retry:
  enabled: true

  # Default configuration
  default:
    max_attempts: 3
    initial_delay_seconds: 1.0
    max_delay_seconds: 60.0
    backoff_strategy: exponential
    backoff_multiplier: 2.0
    jitter: true
    jitter_factor: 0.25
    retryable_exceptions:
      - ConnectionError
      - TimeoutError
      - TemporaryError
    non_retryable_exceptions:
      - ValidationError
      - AuthenticationError
      - NotFoundError

  # Per-service configurations
  services:
    warehouse:
      max_attempts: 5
      initial_delay_seconds: 2.0
      max_delay_seconds: 120.0
      backoff_strategy: exponential
      retryable_status_codes: [429, 500, 502, 503, 504]
    redis_cache:
      max_attempts: 2
      initial_delay_seconds: 0.1
      max_delay_seconds: 1.0
      backoff_strategy: fixed
    external_api:
      max_attempts: 4
      initial_delay_seconds: 1.0
      max_delay_seconds: 30.0
      backoff_strategy: exponential_with_full_jitter
      respect_retry_after: true
```
## Backoff Strategies

### Fixed Delay

Waits the same duration between each retry:

```python
from olytix_core.resilience import FixedBackoff, RetryPolicy

policy = RetryPolicy(
    max_attempts=3,
    backoff=FixedBackoff(delay_seconds=2.0),
)

# Retry delays: 2s, 2s, 2s
```
### Exponential Backoff

Multiplies the delay by a factor (here 2.0, so it doubles) after each failure:

```python
from olytix_core.resilience import ExponentialBackoff, RetryPolicy

policy = RetryPolicy(
    max_attempts=5,
    backoff=ExponentialBackoff(
        initial_delay_seconds=1.0,
        multiplier=2.0,
        max_delay_seconds=60.0,
    ),
)

# Retry delays: 1s, 2s, 4s, 8s, 16s (capped at 60s)
```
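The capped doubling reduces to a one-line formula. The function below is a sketch for intuition, not the library's `ExponentialBackoff`:

```python
def exponential_delay(attempt: int, initial: float = 1.0,
                      multiplier: float = 2.0, max_delay: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based), capped at max_delay."""
    return min(initial * multiplier ** attempt, max_delay)
```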
### Exponential with Jitter

Adds randomness to prevent the thundering herd problem:

```python
from olytix_core.resilience import ExponentialBackoffWithJitter, RetryPolicy

# Decorrelated jitter (recommended)
policy = RetryPolicy(
    max_attempts=5,
    backoff=ExponentialBackoffWithJitter(
        initial_delay_seconds=1.0,
        max_delay_seconds=60.0,
        jitter_type="decorrelated",  # or "full", "equal"
    ),
)

# Full jitter: random between 0 and the exponential delay
# Equal jitter: exponential/2 + random(0, exponential/2)
# Decorrelated: random between the base delay and previous * 3
```
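The three variants in the comments correspond to these formulas (a sketch following the widely cited AWS backoff-and-jitter approach; the function names are illustrative):

```python
import random


def full_jitter(base: float) -> float:
    """Random delay between 0 and the exponential delay."""
    return random.uniform(0, base)


def equal_jitter(base: float) -> float:
    """Half the exponential delay, plus a random half."""
    return base / 2 + random.uniform(0, base / 2)


def decorrelated_jitter(previous: float, initial: float, max_delay: float) -> float:
    """Random delay between the initial delay and 3x the previous delay."""
    return min(max_delay, random.uniform(initial, previous * 3))
```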
### Linear Backoff

Increases the delay by a fixed increment after each failure:

```python
from olytix_core.resilience import LinearBackoff, RetryPolicy

policy = RetryPolicy(
    max_attempts=5,
    backoff=LinearBackoff(
        initial_delay_seconds=1.0,
        increment_seconds=2.0,
        max_delay_seconds=30.0,
    ),
)

# Retry delays: 1s, 3s, 5s, 7s, 9s
```
### Custom Backoff

Implement your own backoff logic by subclassing `BackoffStrategy`:

```python
from olytix_core.resilience import BackoffStrategy, RetryPolicy


class FibonacciBackoff(BackoffStrategy):
    """Fibonacci-sequence backoff."""

    def __init__(self, max_delay_seconds: float = 60.0):
        self.max_delay_seconds = max_delay_seconds

    def get_delay(self, attempt: int) -> float:
        a, b = 1, 1
        for _ in range(attempt):
            a, b = b, a + b
        return min(a, self.max_delay_seconds)


policy = RetryPolicy(
    max_attempts=6,
    backoff=FibonacciBackoff(max_delay_seconds=30.0),
)

# Retry delays: 1s, 1s, 2s, 3s, 5s, 8s
```
## Implementation

### Decorator Usage

```python
from olytix_core.resilience import retry


@retry(
    max_attempts=3,
    backoff="exponential",
    initial_delay=1.0,
    retryable_exceptions=[ConnectionError, TimeoutError],
)
async def query_warehouse(sql: str) -> list[dict]:
    """Query warehouse with automatic retry."""
    return await warehouse.execute(sql)
```
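Under the hood, a decorator like this can be built from a loop around the awaited call. The sketch below uses only the standard library and is not the `olytix_core` implementation, but it shows the shape:

```python
import asyncio
import functools


def retry(max_attempts=3, initial_delay=1.0, multiplier=2.0,
          retryable_exceptions=(Exception,)):
    """Minimal async retry decorator (illustrative only)."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except tuple(retryable_exceptions):
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the error
                    await asyncio.sleep(delay)
                    delay *= multiplier  # exponential growth
        return wrapper
    return decorator
```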
### Context Manager

```python
from olytix_core.resilience import ExponentialBackoff, RetryContext


async def execute_with_retry(operation):
    """Execute operation with retry context."""
    async with RetryContext(
        max_attempts=3,
        backoff=ExponentialBackoff(initial_delay_seconds=1.0),
    ) as retry:
        while True:
            try:
                return await operation()
            except (ConnectionError, TimeoutError) as e:
                if not await retry.should_retry(e):
                    raise
                await retry.wait()
```
### Class-Based Implementation

```python
import structlog

from olytix_core.resilience import (
    ExponentialBackoffWithJitter,
    RetryConfig,
    RetryPolicy,
)

logger = structlog.get_logger()


class ResilientWarehouseAdapter:
    """Warehouse adapter with retry support."""

    def __init__(self, config: WarehouseConfig):
        self.warehouse = Warehouse(config)
        self.retry_policy = RetryPolicy(
            config=RetryConfig(
                max_attempts=config.retry.max_attempts,
                backoff=ExponentialBackoffWithJitter(
                    initial_delay_seconds=config.retry.initial_delay,
                    max_delay_seconds=config.retry.max_delay,
                ),
            ),
            on_retry=self._on_retry,
        )

    async def execute(self, sql: str) -> list[dict]:
        """Execute SQL with automatic retry."""
        return await self.retry_policy.execute(
            lambda: self.warehouse.execute(sql),
            context={"sql_hash": hash(sql)},
        )

    async def _on_retry(
        self,
        attempt: int,
        error: Exception,
        delay: float,
        context: dict,
    ):
        """Log retry attempts."""
        logger.warning(
            "query_retry",
            attempt=attempt,
            error_type=type(error).__name__,
            error_message=str(error),
            next_delay_seconds=delay,
            sql_hash=context.get("sql_hash"),
        )
```
## Advanced Patterns

### Retry with Circuit Breaker

```python
from olytix_core.resilience import circuit_breaker, retry


@circuit_breaker(name="warehouse", failure_threshold=5)
@retry(max_attempts=3, backoff="exponential")
async def query_warehouse(sql: str) -> list[dict]:
    """Query with both retry and circuit breaker.

    Retry handles transient failures.
    The circuit breaker stops retries when the service is down.
    """
    return await warehouse.execute(sql)
```
### Conditional Retry

```python
from typing import Any

from olytix_core.resilience import retry


def should_retry(error: Exception, response: Any) -> bool:
    """Determine if the operation should be retried."""
    if isinstance(error, RateLimitError):
        return True
    if isinstance(error, HTTPError):
        return error.status_code in [429, 500, 502, 503, 504]
    return False


@retry(
    max_attempts=5,
    condition=should_retry,
    backoff="exponential",
)
async def call_external_api(request: dict) -> dict:
    """Call API with conditional retry."""
    return await api_client.post("/endpoint", json=request)
```
### Retry-After Header Support

```python
from olytix_core.resilience import RateLimitBackoff, retry


@retry(
    max_attempts=5,
    backoff=RateLimitBackoff(
        fallback_delay_seconds=60.0,
        respect_retry_after=True,
    ),
)
async def rate_limited_api_call(data: dict) -> dict:
    """Call API respecting Retry-After headers."""
    # On RateLimitError, RateLimitBackoff uses the error's
    # retry_after value if available, else the fallback delay.
    return await api_client.post("/endpoint", json=data)
```
### Idempotency Keys

```python
import uuid

from olytix_core.resilience import retry


@retry(max_attempts=3, backoff="exponential")
async def _post_resource(data: dict, key: str) -> dict:
    return await api_client.post(
        "/resources",
        json=data,
        headers={"Idempotency-Key": key},
    )


async def create_resource(data: dict, idempotency_key: str | None = None) -> dict:
    """Create resource with idempotency for safe retries.

    The key is generated outside the retried function so that
    every retry attempt reuses the same key.
    """
    return await _post_resource(data, idempotency_key or str(uuid.uuid4()))


# Safe to retry - the same key ensures the operation runs at most once
result = await create_resource(
    {"name": "example"},
    idempotency_key="user-123-create-resource-abc",
)
```
### Retry Budget

```python
from olytix_core.resilience import RetryBudget, retry

# Limit total retries across all requests
budget = RetryBudget(
    max_retries_per_second=10,
    retry_ratio=0.2,  # Max 20% of requests can be retries
    min_retries_per_second=2,  # Always allow some retries
)


@retry(max_attempts=3, budget=budget)
async def query_with_budget(sql: str) -> list[dict]:
    """Query with a retry budget to prevent retry storms."""
    return await warehouse.execute(sql)
```
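A retry budget is essentially a token bucket: each retry spends a token, and tokens refill at a fixed rate. A self-contained sketch (the class and its knobs are illustrative, not the `RetryBudget` API):

```python
import time


class SimpleRetryBudget:
    """Token-bucket retry budget sketch (illustrative only)."""

    def __init__(self, max_retries_per_second: float, burst: float = 10.0,
                 clock=time.monotonic):
        self.rate = max_retries_per_second
        self.capacity = burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def allow_retry(self) -> bool:
        # Refill tokens based on elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token on this retry
            return True
        return False  # budget exhausted: let the error propagate
```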
## Error Handling

### Retryable vs Non-Retryable Errors

```python
from olytix_core.resilience import RetryPolicy

policy = RetryPolicy(
    max_attempts=3,
    retryable_exceptions=[
        # Transient errors - worth retrying
        ConnectionError,
        TimeoutError,
        ConnectionResetError,
        TemporaryError,
    ],
    non_retryable_exceptions=[
        # Permanent errors - don't retry
        ValidationError,
        AuthenticationError,
        AuthorizationError,
        NotFoundError,
        BadRequestError,
    ],
)
```
### Custom Error Classification

```python
from olytix_core.resilience import ErrorClassifier, retry


class WarehouseErrorClassifier(ErrorClassifier):
    """Classify warehouse errors for retry decisions."""

    def is_retryable(self, error: Exception) -> bool:
        if isinstance(error, WarehouseError):
            # Retry on specific error codes
            return error.code in [
                "QUERY_TIMEOUT",
                "CONNECTION_LOST",
                "TEMPORARY_FAILURE",
                "RATE_LIMITED",
            ]
        return isinstance(error, (ConnectionError, TimeoutError))

    def get_delay_hint(self, error: Exception) -> float | None:
        """Get a delay hint from the error."""
        if hasattr(error, "retry_after"):
            return error.retry_after
        return None


@retry(
    max_attempts=5,
    error_classifier=WarehouseErrorClassifier(),
)
async def query_warehouse(sql: str) -> list[dict]:
    return await warehouse.execute(sql)
```
## Monitoring

### Retry Metrics

```text
# Prometheus metrics exposed by Olytix Core
olytix_core_retry_attempts_total{service="warehouse", result="success"} 1523
olytix_core_retry_attempts_total{service="warehouse", result="failure"} 45
olytix_core_retry_attempts_total{service="warehouse", result="exhausted"} 12
olytix_core_retry_delay_seconds{service="warehouse", quantile="0.5"} 2.1
olytix_core_retry_delay_seconds{service="warehouse", quantile="0.95"} 15.3
```
### Alerting Rules

```yaml
# prometheus-rules.yaml
groups:
  - name: retry_alerts
    rules:
      - alert: HighRetryRate
        expr: |
          sum by (service) (rate(olytix_core_retry_attempts_total{result!="success"}[5m]))
            / sum by (service) (rate(olytix_core_retry_attempts_total[5m])) > 0.3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High retry rate for {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} of calls require retries"

      - alert: RetryExhaustion
        expr: |
          rate(olytix_core_retry_attempts_total{result="exhausted"}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Retry exhaustion for {{ $labels.service }}"
          description: "Requests failing after all retry attempts"

      - alert: HighRetryLatency
        expr: |
          olytix_core_retry_delay_seconds{quantile="0.95"} > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High retry delays"
          description: "95th percentile retry delay is {{ $value }}s"
```
### Logging

```python
import structlog

from olytix_core.resilience import RetryPolicy

logger = structlog.get_logger()

policy = RetryPolicy(
    max_attempts=3,
    on_retry=lambda attempt, error, delay: logger.warning(
        "operation_retry",
        attempt=attempt,
        max_attempts=3,
        error_type=type(error).__name__,
        error_message=str(error),
        next_delay_seconds=delay,
    ),
    on_exhausted=lambda error: logger.error(
        "retry_exhausted",
        error_type=type(error).__name__,
        error_message=str(error),
    ),
)
```
## Best Practices

### 1. Set Appropriate Timeouts

```python
# Total wait = initial_delay * (multiplier^retries - 1) / (multiplier - 1),
# with the max_delay cap applied to each individual delay.
# Example: 3 retries with a 1s initial delay and a 2x multiplier
# gives 1s + 2s + 4s = 7s of total wait time;
# adding operation time, expect roughly 10-15s total.

@retry(
    max_attempts=3,
    initial_delay=1.0,
    backoff_multiplier=2.0,
    # Also set an operation timeout
    timeout_seconds=30.0,  # Total timeout including retries
)
async def query_with_timeout(sql: str):
    return await warehouse.execute(sql, timeout=10)
```
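The total-wait formula in the comment above is the closed form of a geometric series and can be checked directly (assuming `multiplier != 1` and no `max_delay` cap):

```python
def total_backoff(initial: float, multiplier: float, retries: int) -> float:
    """Closed-form sum of exponential backoff delays (no cap, multiplier != 1)."""
    return initial * (multiplier ** retries - 1) / (multiplier - 1)
```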
### 2. Always Use Jitter

```python
from olytix_core.resilience import ExponentialBackoff, ExponentialBackoffWithJitter

# Without jitter - all clients retry at the same time
ExponentialBackoff(initial_delay_seconds=1.0)  # Avoid

# With jitter - retries spread out
ExponentialBackoffWithJitter(
    initial_delay_seconds=1.0,
    jitter_type="decorrelated",
)  # Recommended
```
### 3. Ensure Idempotency

```python
# Non-idempotent operation - dangerous to retry
async def create_order(order: Order):
    return await db.insert(order)  # May create duplicates!


# Idempotent operation - safe to retry
async def create_order_idempotent(order: Order):
    return await db.upsert(
        order,
        conflict_keys=["order_id"],  # Won't create duplicates
    )
```
### 4. Respect Backpressure

```python
from olytix_core.resilience import BackpressureHandler, retry


@retry(
    max_attempts=5,
    backpressure=BackpressureHandler(
        respect_retry_after=True,
        max_backoff_seconds=300,  # 5 min max
        stop_on_quota_exceeded=True,
    ),
)
async def api_call_with_backpressure(data: dict):
    """Respects rate limits and quotas."""
    return await external_api.call(data)
```
### 5. Log Retry Context

```python
import structlog

from olytix_core.resilience import retry

logger = structlog.get_logger()


@retry(
    max_attempts=3,
    on_retry=lambda a, e, d: logger.warning(
        "retrying_operation",
        attempt=a,
        error=str(e),
        delay=d,
        operation="warehouse_query",
    ),
)
async def tracked_operation(sql: str):
    # Context from on_retry is included in the retry logs
    return await warehouse.execute(sql)
```
## Next Steps
- Configure circuit breakers for failure isolation
- Set up monitoring for retry metrics
- Configure logging for debugging retry issues