# Retry Policies
Olytix Core implements configurable retry policies with multiple backoff strategies to handle transient failures gracefully. This guide covers retry configuration, backoff algorithms, and best practices for reliable operation.
## Overview
Retry policies help handle transient failures by:
- Automatically retrying failed operations
- Using backoff strategies to avoid overwhelming services
- Adding jitter to prevent thundering herd problems
- Respecting circuit breakers to stop retries when appropriate
## Retry Flow with Exponential Backoff

The flow is: execute the operation and, on success, return the result. On failure, wait for the current backoff delay, then retry; the delay grows exponentially between attempts. Once the maximum number of attempts (3 by default) is exhausted, the failure is raised to the caller.
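As a minimal sketch, the flow above is a loop around the operation. This is illustrative only (it assumes a synchronous `operation` callable and is not the `RetryPolicy` implementation):

```python
import time


def retry_with_backoff(operation, max_attempts=3, initial_delay=1.0,
                       multiplier=2.0, max_delay=60.0, sleep=time.sleep):
    """Execute `operation`, retrying with exponential backoff on failure."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # attempts exhausted: surface the failure
            sleep(delay)  # wait before the next attempt
            delay = min(delay * multiplier, max_delay)  # grow the delay
```

The `sleep` parameter is injected so the loop can be exercised without real waiting.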
## Configuration

### Environment Variables

```bash
# Enable retry policies
OLYTIX_RETRY__ENABLED=true

# Default retry settings
OLYTIX_RETRY__MAX_ATTEMPTS=3
OLYTIX_RETRY__INITIAL_DELAY_SECONDS=1.0
OLYTIX_RETRY__MAX_DELAY_SECONDS=60.0
OLYTIX_RETRY__BACKOFF_MULTIPLIER=2.0
OLYTIX_RETRY__JITTER=true

# Warehouse-specific settings
OLYTIX_RETRY__WAREHOUSE__MAX_ATTEMPTS=5
OLYTIX_RETRY__WAREHOUSE__INITIAL_DELAY_SECONDS=2.0
OLYTIX_RETRY__WAREHOUSE__MAX_DELAY_SECONDS=120.0

# Cache-specific settings
OLYTIX_RETRY__CACHE__MAX_ATTEMPTS=2
OLYTIX_RETRY__CACHE__INITIAL_DELAY_SECONDS=0.1
OLYTIX_RETRY__CACHE__MAX_DELAY_SECONDS=1.0
```
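The double-underscore convention maps flat environment variables onto nested settings. A minimal sketch of that mapping (the `parse_env` helper and its exact rules are assumptions for illustration, not Olytix Core internals):

```python
def parse_env(env: dict[str, str], prefix: str = "OLYTIX_RETRY__") -> dict:
    """Map OLYTIX_RETRY__A__B=x onto nested {"a": {"b": "x"}} (illustrative)."""
    tree: dict = {}
    for key, value in env.items():
        if not key.startswith(prefix):
            continue  # ignore unrelated environment variables
        parts = key[len(prefix):].lower().split("__")
        node = tree
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # descend, creating levels as needed
        node[parts[-1]] = value
    return tree
```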
### Configuration File

```yaml
# config/retry.yaml
retry:
  enabled: true

  # Default configuration
  default:
    max_attempts: 3
    initial_delay_seconds: 1.0
    max_delay_seconds: 60.0
    backoff_strategy: exponential
    backoff_multiplier: 2.0
    jitter: true
    jitter_factor: 0.25
    retryable_exceptions:
      - ConnectionError
      - TimeoutError
      - TemporaryError
    non_retryable_exceptions:
      - ValidationError
      - AuthenticationError
      - NotFoundError

  # Per-service configurations
  services:
    warehouse:
      max_attempts: 5
      initial_delay_seconds: 2.0
      max_delay_seconds: 120.0
      backoff_strategy: exponential
      retryable_status_codes: [429, 500, 502, 503, 504]
    redis_cache:
      max_attempts: 2
      initial_delay_seconds: 0.1
      max_delay_seconds: 1.0
      backoff_strategy: fixed
    external_api:
      max_attempts: 4
      initial_delay_seconds: 1.0
      max_delay_seconds: 30.0
      backoff_strategy: exponential_with_full_jitter
      respect_retry_after: true
```
## Backoff Strategies

### Fixed Delay

Waits the same duration between each retry:

```python
from olytix_core.resilience import FixedBackoff, RetryPolicy

policy = RetryPolicy(
    max_attempts=3,
    backoff=FixedBackoff(delay_seconds=2.0),
)

# Retry delays: 2s, 2s, 2s
```
### Exponential Backoff

Multiplies the delay by a factor (here 2.0, so it doubles) after each failure:

```python
from olytix_core.resilience import ExponentialBackoff, RetryPolicy

policy = RetryPolicy(
    max_attempts=5,
    backoff=ExponentialBackoff(
        initial_delay_seconds=1.0,
        multiplier=2.0,
        max_delay_seconds=60.0,
    ),
)

# Retry delays: 1s, 2s, 4s, 8s, 16s (capped at 60s)
```
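The capped doubling reduces to a one-line formula. The function below is a sketch for intuition, not the library's `ExponentialBackoff`:

```python
def exponential_delay(attempt: int, initial: float = 1.0,
                      multiplier: float = 2.0, max_delay: float = 60.0) -> float:
    """Delay before retry number `attempt` (0-based), capped at max_delay."""
    return min(initial * multiplier ** attempt, max_delay)
```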
### Exponential with Jitter

Adds randomness to prevent the thundering herd problem:

```python
from olytix_core.resilience import ExponentialBackoffWithJitter, RetryPolicy

# Decorrelated jitter (recommended)
policy = RetryPolicy(
    max_attempts=5,
    backoff=ExponentialBackoffWithJitter(
        initial_delay_seconds=1.0,
        max_delay_seconds=60.0,
        jitter_type="decorrelated",  # or "full", "equal"
    ),
)

# Full jitter: random between 0 and the exponential delay
# Equal jitter: exponential/2 + random(0, exponential/2)
# Decorrelated: random between the base delay and previous * 3
```
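The three variants in the comments correspond to these formulas (a sketch following the widely cited AWS backoff-and-jitter approach; the function names are illustrative):

```python
import random


def full_jitter(base: float) -> float:
    """Random delay between 0 and the exponential delay."""
    return random.uniform(0, base)


def equal_jitter(base: float) -> float:
    """Half the exponential delay, plus a random half."""
    return base / 2 + random.uniform(0, base / 2)


def decorrelated_jitter(previous: float, initial: float, max_delay: float) -> float:
    """Random delay between the initial delay and 3x the previous delay."""
    return min(max_delay, random.uniform(initial, previous * 3))
```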
### Linear Backoff

Increases the delay by a fixed increment after each failure:

```python
from olytix_core.resilience import LinearBackoff, RetryPolicy

policy = RetryPolicy(
    max_attempts=5,
    backoff=LinearBackoff(
        initial_delay_seconds=1.0,
        increment_seconds=2.0,
        max_delay_seconds=30.0,
    ),
)

# Retry delays: 1s, 3s, 5s, 7s, 9s
```
### Custom Backoff

Implement your own backoff logic by subclassing `BackoffStrategy`:

```python
from olytix_core.resilience import BackoffStrategy, RetryPolicy


class FibonacciBackoff(BackoffStrategy):
    """Fibonacci-sequence backoff."""

    def __init__(self, max_delay_seconds: float = 60.0):
        self.max_delay_seconds = max_delay_seconds

    def get_delay(self, attempt: int) -> float:
        a, b = 1, 1
        for _ in range(attempt):
            a, b = b, a + b
        return min(a, self.max_delay_seconds)


policy = RetryPolicy(
    max_attempts=6,
    backoff=FibonacciBackoff(max_delay_seconds=30.0),
)

# Retry delays: 1s, 1s, 2s, 3s, 5s, 8s
```
## Implementation

### Decorator Usage

```python
from olytix_core.resilience import retry


@retry(
    max_attempts=3,
    backoff="exponential",
    initial_delay=1.0,
    retryable_exceptions=[ConnectionError, TimeoutError],
)
async def query_warehouse(sql: str) -> list[dict]:
    """Query warehouse with automatic retry."""
    return await warehouse.execute(sql)
```
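Under the hood, a decorator like this can be built from a loop around the awaited call. The sketch below uses only the standard library and is not the `olytix_core` implementation, but it shows the shape:

```python
import asyncio
import functools


def retry(max_attempts=3, initial_delay=1.0, multiplier=2.0,
          retryable_exceptions=(Exception,)):
    """Minimal async retry decorator (illustrative only)."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except tuple(retryable_exceptions):
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the error
                    await asyncio.sleep(delay)
                    delay *= multiplier  # exponential growth
        return wrapper
    return decorator
```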
### Context Manager

```python
from olytix_core.resilience import ExponentialBackoff, RetryContext


async def execute_with_retry(operation):
    """Execute operation with retry context."""
    async with RetryContext(
        max_attempts=3,
        backoff=ExponentialBackoff(initial_delay_seconds=1.0),
    ) as retry:
        while True:
            try:
                return await operation()
            except (ConnectionError, TimeoutError) as e:
                if not await retry.should_retry(e):
                    raise
                await retry.wait()
```
### Class-Based Implementation

```python
import structlog

from olytix_core.resilience import (
    ExponentialBackoffWithJitter,
    RetryConfig,
    RetryPolicy,
)

logger = structlog.get_logger()


class ResilientWarehouseAdapter:
    """Warehouse adapter with retry support."""

    def __init__(self, config: WarehouseConfig):
        self.warehouse = Warehouse(config)
        self.retry_policy = RetryPolicy(
            config=RetryConfig(
                max_attempts=config.retry.max_attempts,
                backoff=ExponentialBackoffWithJitter(
                    initial_delay_seconds=config.retry.initial_delay,
                    max_delay_seconds=config.retry.max_delay,
                ),
            ),
            on_retry=self._on_retry,
        )

    async def execute(self, sql: str) -> list[dict]:
        """Execute SQL with automatic retry."""
        return await self.retry_policy.execute(
            lambda: self.warehouse.execute(sql),
            context={"sql_hash": hash(sql)},
        )

    async def _on_retry(
        self,
        attempt: int,
        error: Exception,
        delay: float,
        context: dict,
    ):
        """Log retry attempts."""
        logger.warning(
            "query_retry",
            attempt=attempt,
            error_type=type(error).__name__,
            error_message=str(error),
            next_delay_seconds=delay,
            sql_hash=context.get("sql_hash"),
        )
```
## Advanced Patterns

### Retry with Circuit Breaker

```python
from olytix_core.resilience import circuit_breaker, retry


@circuit_breaker(name="warehouse", failure_threshold=5)
@retry(max_attempts=3, backoff="exponential")
async def query_warehouse(sql: str) -> list[dict]:
    """Query with both retry and circuit breaker.

    Retry handles transient failures.
    The circuit breaker stops retries when the service is down.
    """
    return await warehouse.execute(sql)
```
### Conditional Retry

```python
from typing import Any

from olytix_core.resilience import retry


def should_retry(error: Exception, response: Any) -> bool:
    """Determine if the operation should be retried."""
    if isinstance(error, RateLimitError):
        return True
    if isinstance(error, HTTPError):
        return error.status_code in [429, 500, 502, 503, 504]
    return False


@retry(
    max_attempts=5,
    condition=should_retry,
    backoff="exponential",
)
async def call_external_api(request: dict) -> dict:
    """Call API with conditional retry."""
    return await api_client.post("/endpoint", json=request)
```
### Retry-After Header Support

```python
from olytix_core.resilience import RateLimitBackoff, retry


@retry(
    max_attempts=5,
    backoff=RateLimitBackoff(
        fallback_delay_seconds=60.0,
        respect_retry_after=True,
    ),
)
async def rate_limited_api_call(data: dict) -> dict:
    """Call API respecting Retry-After headers."""
    # On RateLimitError, RateLimitBackoff uses the error's
    # retry_after value if available, else the fallback delay.
    return await api_client.post("/endpoint", json=data)
```
### Idempotency Keys

```python
import uuid

from olytix_core.resilience import retry


@retry(max_attempts=3, backoff="exponential")
async def _post_resource(data: dict, key: str) -> dict:
    return await api_client.post(
        "/resources",
        json=data,
        headers={"Idempotency-Key": key},
    )


async def create_resource(data: dict, idempotency_key: str | None = None) -> dict:
    """Create resource with idempotency for safe retries.

    The key is generated outside the retried function so that
    every retry attempt reuses the same key.
    """
    return await _post_resource(data, idempotency_key or str(uuid.uuid4()))


# Safe to retry - the same key ensures the operation runs at most once
result = await create_resource(
    {"name": "example"},
    idempotency_key="user-123-create-resource-abc",
)
```
### Retry Budget

```python
from olytix_core.resilience import RetryBudget, retry

# Limit total retries across all requests
budget = RetryBudget(
    max_retries_per_second=10,
    retry_ratio=0.2,  # Max 20% of requests can be retries
    min_retries_per_second=2,  # Always allow some retries
)


@retry(max_attempts=3, budget=budget)
async def query_with_budget(sql: str) -> list[dict]:
    """Query with a retry budget to prevent retry storms."""
    return await warehouse.execute(sql)
```
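A retry budget is essentially a token bucket: each retry spends a token, and tokens refill at a fixed rate. A self-contained sketch (the class and its knobs are illustrative, not the `RetryBudget` API):

```python
import time


class SimpleRetryBudget:
    """Token-bucket retry budget sketch (illustrative only)."""

    def __init__(self, max_retries_per_second: float, burst: float = 10.0,
                 clock=time.monotonic):
        self.rate = max_retries_per_second
        self.capacity = burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def allow_retry(self) -> bool:
        # Refill tokens based on elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # spend one token on this retry
            return True
        return False  # budget exhausted: let the error propagate
```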
## Error Handling

### Retryable vs Non-Retryable Errors

```python
from olytix_core.resilience import RetryPolicy

policy = RetryPolicy(
    max_attempts=3,
    retryable_exceptions=[
        # Transient errors - worth retrying
        ConnectionError,
        TimeoutError,
        ConnectionResetError,
        TemporaryError,
    ],
    non_retryable_exceptions=[
        # Permanent errors - don't retry
        ValidationError,
        AuthenticationError,
        AuthorizationError,
        NotFoundError,
        BadRequestError,
    ],
)
```
### Custom Error Classification

```python
from olytix_core.resilience import ErrorClassifier, retry


class WarehouseErrorClassifier(ErrorClassifier):
    """Classify warehouse errors for retry decisions."""

    def is_retryable(self, error: Exception) -> bool:
        if isinstance(error, WarehouseError):
            # Retry on specific error codes
            return error.code in [
                "QUERY_TIMEOUT",
                "CONNECTION_LOST",
                "TEMPORARY_FAILURE",
                "RATE_LIMITED",
            ]
        return isinstance(error, (ConnectionError, TimeoutError))

    def get_delay_hint(self, error: Exception) -> float | None:
        """Get a delay hint from the error."""
        if hasattr(error, "retry_after"):
            return error.retry_after
        return None


@retry(
    max_attempts=5,
    error_classifier=WarehouseErrorClassifier(),
)
async def query_warehouse(sql: str) -> list[dict]:
    return await warehouse.execute(sql)
```
## Monitoring

### Retry Metrics

```text
# Prometheus metrics exposed by Olytix Core
olytix_core_retry_attempts_total{service="warehouse", result="success"} 1523
olytix_core_retry_attempts_total{service="warehouse", result="failure"} 45
olytix_core_retry_attempts_total{service="warehouse", result="exhausted"} 12
olytix_core_retry_delay_seconds{service="warehouse", quantile="0.5"} 2.1
olytix_core_retry_delay_seconds{service="warehouse", quantile="0.95"} 15.3
```
### Alerting Rules

```yaml
# prometheus-rules.yaml
groups:
  - name: retry_alerts
    rules:
      - alert: HighRetryRate
        expr: |
          sum by (service) (rate(olytix_core_retry_attempts_total{result!="success"}[5m]))
            / sum by (service) (rate(olytix_core_retry_attempts_total[5m])) > 0.3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High retry rate for {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} of calls require retries"

      - alert: RetryExhaustion
        expr: |
          rate(olytix_core_retry_attempts_total{result="exhausted"}[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Retry exhaustion for {{ $labels.service }}"
          description: "Requests failing after all retry attempts"

      - alert: HighRetryLatency
        expr: |
          olytix_core_retry_delay_seconds{quantile="0.95"} > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High retry delays"
          description: "95th percentile retry delay is {{ $value }}s"
```
### Logging

```python
import structlog

from olytix_core.resilience import RetryPolicy

logger = structlog.get_logger()

policy = RetryPolicy(
    max_attempts=3,
    on_retry=lambda attempt, error, delay: logger.warning(
        "operation_retry",
        attempt=attempt,
        max_attempts=3,
        error_type=type(error).__name__,
        error_message=str(error),
        next_delay_seconds=delay,
    ),
    on_exhausted=lambda error: logger.error(
        "retry_exhausted",
        error_type=type(error).__name__,
        error_message=str(error),
    ),
)
```
## Best Practices

### 1. Set Appropriate Timeouts

```python
# Total wait = initial_delay * (multiplier^retries - 1) / (multiplier - 1),
# with the max_delay cap applied to each individual delay.
# Example: 3 retries with a 1s initial delay and a 2x multiplier
# gives 1s + 2s + 4s = 7s of total wait time;
# adding operation time, expect roughly 10-15s total.

@retry(
    max_attempts=3,
    initial_delay=1.0,
    backoff_multiplier=2.0,
    # Also set an operation timeout
    timeout_seconds=30.0,  # Total timeout including retries
)
async def query_with_timeout(sql: str):
    return await warehouse.execute(sql, timeout=10)
```
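The total-wait formula in the comment above is the closed form of a geometric series and can be checked directly (assuming `multiplier != 1` and no `max_delay` cap):

```python
def total_backoff(initial: float, multiplier: float, retries: int) -> float:
    """Closed-form sum of exponential backoff delays (no cap, multiplier != 1)."""
    return initial * (multiplier ** retries - 1) / (multiplier - 1)
```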
### 2. Always Use Jitter

```python
from olytix_core.resilience import ExponentialBackoff, ExponentialBackoffWithJitter

# Without jitter - all clients retry at the same time
ExponentialBackoff(initial_delay_seconds=1.0)  # Avoid

# With jitter - retries spread out
ExponentialBackoffWithJitter(
    initial_delay_seconds=1.0,
    jitter_type="decorrelated",
)  # Recommended
```
### 3. Ensure Idempotency

```python
# Non-idempotent operation - dangerous to retry
async def create_order(order: Order):
    return await db.insert(order)  # May create duplicates!


# Idempotent operation - safe to retry
async def create_order_idempotent(order: Order):
    return await db.upsert(
        order,
        conflict_keys=["order_id"],  # Won't create duplicates
    )
```
### 4. Respect Backpressure

```python
from olytix_core.resilience import BackpressureHandler, retry


@retry(
    max_attempts=5,
    backpressure=BackpressureHandler(
        respect_retry_after=True,
        max_backoff_seconds=300,  # 5 min max
        stop_on_quota_exceeded=True,
    ),
)
async def api_call_with_backpressure(data: dict):
    """Respects rate limits and quotas."""
    return await external_api.call(data)
```
### 5. Log Retry Context

```python
import structlog

from olytix_core.resilience import retry

logger = structlog.get_logger()


@retry(
    max_attempts=3,
    on_retry=lambda a, e, d: logger.warning(
        "retrying_operation",
        attempt=a,
        error=str(e),
        delay=d,
        operation="warehouse_query",
    ),
)
async def tracked_operation(sql: str):
    # Context from on_retry is included in the retry logs
    return await warehouse.execute(sql)
```
## Next Steps
- Configure circuit breakers for failure isolation
- Set up monitoring for retry metrics
- Configure logging for debugging retry issues