Retry Policies

For Platform Teams

Olytix Core implements configurable retry policies with multiple backoff strategies to handle transient failures gracefully. This guide covers retry configuration, backoff algorithms, and best practices for reliable operation.

Overview

Retry policies help handle transient failures by:

  • Automatically retrying failed operations
  • Using backoff strategies to avoid overwhelming services
  • Adding jitter to prevent thundering herd problems
  • Respecting circuit breakers to stop retries when appropriate

Retry Flow with Exponential Backoff

[Interactive demo: the operation executes; on success the flow ends, and on failure it waits with exponential backoff and retries, up to a maximum attempt count (e.g. 3), after which the failure is surfaced.]
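The flow above reduces to a simple loop. A minimal, library-free sketch (the names `run_with_retry` and `flaky_operation` are illustrative, not part of Olytix Core):

```python
import time

def run_with_retry(operation, max_attempts=3, initial_delay=1.0, multiplier=2.0):
    """Run operation, retrying transient failures with exponential backoff."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the failure
            time.sleep(delay)
            delay *= multiplier

calls = {"n": 0}

def flaky_operation():
    """Stand-in operation that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"
```

With `max_attempts=3`, the third call succeeds after two backoff waits.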

Configuration

Environment Variables

# Enable retry policies
OLYTIX_RETRY__ENABLED=true

# Default retry settings
OLYTIX_RETRY__MAX_ATTEMPTS=3
OLYTIX_RETRY__INITIAL_DELAY_SECONDS=1.0
OLYTIX_RETRY__MAX_DELAY_SECONDS=60.0
OLYTIX_RETRY__BACKOFF_MULTIPLIER=2.0
OLYTIX_RETRY__JITTER=true

# Warehouse-specific settings
OLYTIX_RETRY__WAREHOUSE__MAX_ATTEMPTS=5
OLYTIX_RETRY__WAREHOUSE__INITIAL_DELAY_SECONDS=2.0
OLYTIX_RETRY__WAREHOUSE__MAX_DELAY_SECONDS=120.0

# Cache-specific settings
OLYTIX_RETRY__CACHE__MAX_ATTEMPTS=2
OLYTIX_RETRY__CACHE__INITIAL_DELAY_SECONDS=0.1
OLYTIX_RETRY__CACHE__MAX_DELAY_SECONDS=1.0
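The double underscore in these names acts as a nesting delimiter (e.g. `WAREHOUSE__MAX_ATTEMPTS` overrides the warehouse-specific setting). A rough sketch of how such variables could be collected into nested settings (the `load_retry_settings` helper is hypothetical, for illustration only):

```python
import os

def load_retry_settings(environ=os.environ, prefix="OLYTIX_RETRY__"):
    """Collect OLYTIX_RETRY__* variables into a nested dict,
    treating '__' as a nesting delimiter."""
    settings = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue
        path = key[len(prefix):].lower().split("__")
        node = settings
        for part in path[:-1]:
            node = node.setdefault(part, {})  # descend, creating levels as needed
        node[path[-1]] = value
    return settings
```

So `OLYTIX_RETRY__WAREHOUSE__MAX_ATTEMPTS=5` becomes `{"warehouse": {"max_attempts": "5"}}`.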

Configuration File

# config/retry.yaml
retry:
  enabled: true

  # Default configuration
  default:
    max_attempts: 3
    initial_delay_seconds: 1.0
    max_delay_seconds: 60.0
    backoff_strategy: exponential
    backoff_multiplier: 2.0
    jitter: true
    jitter_factor: 0.25
    retryable_exceptions:
      - ConnectionError
      - TimeoutError
      - TemporaryError
    non_retryable_exceptions:
      - ValidationError
      - AuthenticationError
      - NotFoundError

  # Per-service configurations
  services:
    warehouse:
      max_attempts: 5
      initial_delay_seconds: 2.0
      max_delay_seconds: 120.0
      backoff_strategy: exponential
      retryable_status_codes: [429, 500, 502, 503, 504]

    redis_cache:
      max_attempts: 2
      initial_delay_seconds: 0.1
      max_delay_seconds: 1.0
      backoff_strategy: fixed

    external_api:
      max_attempts: 4
      initial_delay_seconds: 1.0
      max_delay_seconds: 30.0
      backoff_strategy: exponential_with_full_jitter
      respect_retry_after: true

Backoff Strategies

Fixed Delay

Waits the same duration between each retry:

from olytix_core.resilience import RetryPolicy, FixedBackoff

policy = RetryPolicy(
    max_attempts=3,
    backoff=FixedBackoff(delay_seconds=2.0),
)

# Retry delays: 2s, 2s, 2s

Exponential Backoff

Doubles the delay after each failure:

from olytix_core.resilience import RetryPolicy, ExponentialBackoff

policy = RetryPolicy(
    max_attempts=5,
    backoff=ExponentialBackoff(
        initial_delay_seconds=1.0,
        multiplier=2.0,
        max_delay_seconds=60.0,
    ),
)

# Retry delays: 1s, 2s, 4s, 8s, 16s (capped at 60s)
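The delay schedule follows directly from the parameters; a small standalone helper (illustrative, not part of the library) that reproduces it:

```python
def exponential_delays(max_attempts, initial=1.0, multiplier=2.0, cap=60.0):
    """Delay before retry i: initial * multiplier**i, capped at cap."""
    return [min(initial * multiplier**i, cap) for i in range(max_attempts)]
```

With more attempts, the cap kicks in: `exponential_delays(8)` ends in two 60s delays.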

Exponential with Jitter

Adds randomness to prevent thundering herd:

from olytix_core.resilience import RetryPolicy, ExponentialBackoffWithJitter

# Decorrelated jitter (recommended)
policy = RetryPolicy(
    max_attempts=5,
    backoff=ExponentialBackoffWithJitter(
        initial_delay_seconds=1.0,
        max_delay_seconds=60.0,
        jitter_type="decorrelated",  # or "full", "equal"
    ),
)

# Full jitter: random between 0 and the exponential delay
# Equal jitter: exponential/2 + random(0, exponential/2)
# Decorrelated: random between base and previous * 3
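The three jitter variants described in the comments above can each be sketched in a line or two (illustrative helpers, not the library's implementation):

```python
import random

def full_jitter(exp_delay, rng=random):
    """Full jitter: uniform between 0 and the exponential delay."""
    return rng.uniform(0.0, exp_delay)

def equal_jitter(exp_delay, rng=random):
    """Equal jitter: half the delay fixed, half random."""
    return exp_delay / 2 + rng.uniform(0.0, exp_delay / 2)

def decorrelated_jitter(previous_delay, base=1.0, cap=60.0, rng=random):
    """Decorrelated jitter: uniform between base and 3x the previous delay."""
    return min(cap, rng.uniform(base, previous_delay * 3))
```

Full jitter spreads retries the widest; equal jitter guarantees at least half the backoff; decorrelated jitter feeds each delay back into the next draw.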

Linear Backoff

Increases the delay by a fixed amount after each failure:

from olytix_core.resilience import RetryPolicy, LinearBackoff

policy = RetryPolicy(
    max_attempts=5,
    backoff=LinearBackoff(
        initial_delay_seconds=1.0,
        increment_seconds=2.0,
        max_delay_seconds=30.0,
    ),
)

# Retry delays: 1s, 3s, 5s, 7s, 9s

Custom Backoff

Implement your own backoff logic:

from olytix_core.resilience import RetryPolicy, BackoffStrategy

class FibonacciBackoff(BackoffStrategy):
    """Fibonacci sequence backoff."""

    def __init__(self, max_delay_seconds: float = 60.0):
        self.max_delay_seconds = max_delay_seconds

    def get_delay(self, attempt: int) -> float:
        a, b = 1, 1
        for _ in range(attempt):
            a, b = b, a + b
        return min(a, self.max_delay_seconds)

policy = RetryPolicy(
    max_attempts=6,
    backoff=FibonacciBackoff(max_delay_seconds=30.0),
)

# Retry delays: 1s, 1s, 2s, 3s, 5s, 8s

Implementation

Decorator Usage

from olytix_core.resilience import retry

@retry(
    max_attempts=3,
    backoff="exponential",
    initial_delay=1.0,
    retryable_exceptions=[ConnectionError, TimeoutError],
)
async def query_warehouse(sql: str) -> list[dict]:
    """Query warehouse with automatic retry."""
    return await warehouse.execute(sql)

Context Manager

from olytix_core.resilience import ExponentialBackoff, RetryContext

async def execute_with_retry(operation):
    """Execute operation with retry context."""
    async with RetryContext(
        max_attempts=3,
        backoff=ExponentialBackoff(initial_delay_seconds=1.0),
    ) as retry:
        while True:
            try:
                return await operation()
            except (ConnectionError, TimeoutError) as e:
                if not await retry.should_retry(e):
                    raise
                await retry.wait()

Class-Based Implementation

import structlog

from olytix_core.resilience import (
    ExponentialBackoffWithJitter,
    RetryConfig,
    RetryPolicy,
)

logger = structlog.get_logger()

class ResilientWarehouseAdapter:
    """Warehouse adapter with retry support."""

    def __init__(self, config: WarehouseConfig):
        self.warehouse = Warehouse(config)
        self.retry_policy = RetryPolicy(
            config=RetryConfig(
                max_attempts=config.retry.max_attempts,
                backoff=ExponentialBackoffWithJitter(
                    initial_delay_seconds=config.retry.initial_delay,
                    max_delay_seconds=config.retry.max_delay,
                ),
            ),
            on_retry=self._on_retry,
        )

    async def execute(self, sql: str) -> list[dict]:
        """Execute SQL with automatic retry."""
        return await self.retry_policy.execute(
            lambda: self.warehouse.execute(sql),
            context={"sql_hash": hash(sql)},
        )

    async def _on_retry(
        self,
        attempt: int,
        error: Exception,
        delay: float,
        context: dict,
    ):
        """Log retry attempts."""
        logger.warning(
            "query_retry",
            attempt=attempt,
            error_type=type(error).__name__,
            error_message=str(error),
            next_delay_seconds=delay,
            sql_hash=context.get("sql_hash"),
        )

Advanced Patterns

Retry with Circuit Breaker

from olytix_core.resilience import retry, circuit_breaker

@circuit_breaker(name="warehouse", failure_threshold=5)
@retry(max_attempts=3, backoff="exponential")
async def query_warehouse(sql: str) -> list[dict]:
    """Query with both retry and circuit breaker.

    Retry handles transient failures.
    Circuit breaker prevents retries when the service is down.
    """
    return await warehouse.execute(sql)

Conditional Retry

from typing import Any

from olytix_core.resilience import retry

def should_retry(error: Exception, response: Any) -> bool:
    """Determine if the operation should be retried."""
    if isinstance(error, RateLimitError):
        return True
    if isinstance(error, HTTPError):
        return error.status_code in [429, 500, 502, 503, 504]
    return False

@retry(
    max_attempts=5,
    condition=should_retry,
    backoff="exponential",
)
async def call_external_api(request: dict) -> dict:
    """Call API with conditional retry."""
    return await api_client.post("/endpoint", json=request)

Retry-After Header Support

from olytix_core.resilience import retry, RateLimitBackoff

@retry(
    max_attempts=5,
    backoff=RateLimitBackoff(
        fallback_delay_seconds=60.0,
        respect_retry_after=True,
    ),
)
async def rate_limited_api_call(data: dict) -> dict:
    """Call API respecting Retry-After headers."""
    try:
        return await api_client.post("/endpoint", json=data)
    except RateLimitError:
        # RateLimitBackoff will use the error's retry_after if available
        raise

Idempotency Keys

import uuid

from olytix_core.resilience import retry

@retry(max_attempts=3, backoff="exponential")
async def create_resource(data: dict, idempotency_key: str | None = None) -> dict:
    """Create resource with idempotency for safe retries."""
    key = idempotency_key or str(uuid.uuid4())

    return await api_client.post(
        "/resources",
        json=data,
        headers={"Idempotency-Key": key},
    )

# Safe to retry - the same key ensures the operation runs once
result = await create_resource(
    {"name": "example"},
    idempotency_key="user-123-create-resource-abc",
)

Retry Budget

from olytix_core.resilience import RetryBudget, retry

# Limit total retries across all requests
budget = RetryBudget(
    max_retries_per_second=10,
    retry_ratio=0.2,  # Max 20% of requests can be retries
    min_retries_per_second=2,  # Always allow some retries
)

@retry(max_attempts=3, budget=budget)
async def query_with_budget(sql: str) -> list[dict]:
    """Query with retry budget to prevent retry storms."""
    return await warehouse.execute(sql)
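The ratio idea behind a retry budget can be sketched without any library support. This `SimpleRetryBudget` is an illustrative, count-based stand-in, not the rate-based `RetryBudget` above:

```python
class SimpleRetryBudget:
    """Allow a retry only while retries stay under retry_ratio of total
    requests, with a small always-allowed floor (min_retries)."""

    def __init__(self, retry_ratio=0.2, min_retries=2):
        self.retry_ratio = retry_ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self):
        if self.retries < self.min_retries:
            return True  # floor: always allow a few retries
        return self.retries < self.requests * self.retry_ratio

    def record_retry(self):
        self.retries += 1
```

Once retries reach 20% of observed requests, further retries are denied until more requests come through, which caps the extra load a retry storm can generate.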

Error Handling

Retryable vs Non-Retryable Errors

from olytix_core.resilience import RetryPolicy

policy = RetryPolicy(
    max_attempts=3,
    retryable_exceptions=[
        # Transient errors - worth retrying
        ConnectionError,
        TimeoutError,
        ConnectionResetError,
        TemporaryError,
    ],
    non_retryable_exceptions=[
        # Permanent errors - don't retry
        ValidationError,
        AuthenticationError,
        AuthorizationError,
        NotFoundError,
        BadRequestError,
    ],
)

Custom Error Classification

from olytix_core.resilience import retry, ErrorClassifier

class WarehouseErrorClassifier(ErrorClassifier):
    """Classify warehouse errors for retry decisions."""

    def is_retryable(self, error: Exception) -> bool:
        if isinstance(error, WarehouseError):
            # Retry on specific error codes
            return error.code in [
                "QUERY_TIMEOUT",
                "CONNECTION_LOST",
                "TEMPORARY_FAILURE",
                "RATE_LIMITED",
            ]
        return isinstance(error, (ConnectionError, TimeoutError))

    def get_delay_hint(self, error: Exception) -> float | None:
        """Get delay hint from error."""
        if hasattr(error, "retry_after"):
            return error.retry_after
        return None

@retry(
    max_attempts=5,
    error_classifier=WarehouseErrorClassifier(),
)
async def query_warehouse(sql: str) -> list[dict]:
    return await warehouse.execute(sql)

Monitoring

Retry Metrics

# Prometheus metrics exposed by Olytix Core
olytix_core_retry_attempts_total{service="warehouse", result="success"} 1523
olytix_core_retry_attempts_total{service="warehouse", result="failure"} 45
olytix_core_retry_attempts_total{service="warehouse", result="exhausted"} 12
olytix_core_retry_delay_seconds{service="warehouse", quantile="0.5"} 2.1
olytix_core_retry_delay_seconds{service="warehouse", quantile="0.95"} 15.3
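How such counters accumulate can be shown with a small stand-in recorder (the `RetryMetrics` class is hypothetical; real export goes through a Prometheus client library, and metric names use underscores because Prometheus forbids hyphens):

```python
from collections import Counter

class RetryMetrics:
    """Record one outcome per retry attempt and render
    Prometheus exposition-format counter lines."""

    def __init__(self):
        self.attempts = Counter()

    def record(self, service, result):
        # result is one of: success, failure, exhausted
        self.attempts[(service, result)] += 1

    def render(self):
        return [
            f'olytix_core_retry_attempts_total{{service="{s}", result="{r}"}} {n}'
            for (s, r), n in sorted(self.attempts.items())
        ]
```

Each (service, result) pair becomes one labeled time series, which is what the alerting rules below aggregate over.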

Alerting Rules

# prometheus-rules.yaml
groups:
  - name: retry_alerts
    rules:
      - alert: HighRetryRate
        expr: |
          sum by (service) (rate(olytix_core_retry_attempts_total{result!="success"}[5m]))
            / sum by (service) (rate(olytix_core_retry_attempts_total[5m])) > 0.3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High retry rate for {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} of calls require retries"

      - alert: RetryExhaustion
        expr: |
          rate(olytix_core_retry_attempts_total{result="exhausted"}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Retry exhaustion for {{ $labels.service }}"
          description: "Requests failing after all retry attempts"

      - alert: HighRetryLatency
        expr: |
          histogram_quantile(0.95, rate(olytix_core_retry_delay_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High retry delays"
          description: "95th percentile retry delay is {{ $value }}s"

Logging

import structlog

from olytix_core.resilience import RetryPolicy

logger = structlog.get_logger()

policy = RetryPolicy(
    max_attempts=3,
    on_retry=lambda attempt, error, delay: logger.warning(
        "operation_retry",
        attempt=attempt,
        max_attempts=3,
        error_type=type(error).__name__,
        error_message=str(error),
        next_delay_seconds=delay,
    ),
    on_exhausted=lambda error: logger.error(
        "retry_exhausted",
        error_type=type(error).__name__,
        error_message=str(error),
    ),
)

Best Practices

1. Set Appropriate Timeouts

# Total wait = initial_delay * (multiplier^max_attempts - 1) / (multiplier - 1)
# with the max_delay cap applied to each term

# Example: max_attempts=3 with 1s initial delay, 2x multiplier
# Delays: 1s + 2s + 4s = 7s total wait time
# Plus operation time: ~10-15s total

@retry(
    max_attempts=3,
    initial_delay=1.0,
    backoff_multiplier=2.0,
    # Also set an overall operation timeout
    timeout_seconds=30.0,  # Total timeout including retries
)
async def query_with_timeout(sql: str):
    return await warehouse.execute(sql, timeout=10)
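The worst-case wait from the formula above can be computed directly (a small helper for planning timeouts, not part of the library):

```python
def total_retry_wait(max_attempts, initial=1.0, multiplier=2.0, cap=60.0):
    """Worst-case total time spent waiting between attempts:
    the sum of per-attempt delays, each capped at cap."""
    return sum(min(initial * multiplier**i, cap) for i in range(max_attempts))
```

This gives a lower bound for `timeout_seconds`: add the expected per-attempt operation time on top of the total wait.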

2. Always Use Jitter

# Without jitter - all clients retry at the same time
ExponentialBackoff(initial_delay_seconds=1.0)  # Avoid

# With jitter - retries spread out
ExponentialBackoffWithJitter(
    initial_delay_seconds=1.0,
    jitter_type="decorrelated",
)  # Recommended

3. Ensure Idempotency

# Non-idempotent operation - dangerous to retry
async def create_order(order: Order):
    return await db.insert(order)  # May create duplicates!

# Idempotent operation - safe to retry
async def create_order_idempotent(order: Order):
    return await db.upsert(
        order,
        conflict_keys=["order_id"],  # Won't create duplicates
    )

4. Respect Backpressure

from olytix_core.resilience import retry, BackpressureHandler

@retry(
    max_attempts=5,
    backpressure=BackpressureHandler(
        respect_retry_after=True,
        max_backoff_seconds=300,  # 5 min max
        stop_on_quota_exceeded=True,
    ),
)
async def api_call_with_backpressure(data: dict):
    """Respects rate limits and quotas."""
    return await external_api.call(data)

5. Log Retry Context

import structlog

from olytix_core.resilience import retry

logger = structlog.get_logger()

@retry(
    max_attempts=3,
    on_retry=lambda a, e, d: logger.warning(
        "retrying_operation",
        attempt=a,
        error=str(e),
        delay=d,
        operation="warehouse_query",
    ),
)
async def tracked_operation(sql: str):
    # Bound log context is included in the retry logs automatically
    return await warehouse.execute(sql)

Next Steps