Retry & Backoff

Retries keep transient failures from becoming outages. Use backoff to prevent retry storms.

Principles

  • Treat transient errors as retryable.
  • Fail fast on permanent errors (bad input, missing resources).
  • Add jitter to spread retries over time.

When to retry

  • Network timeouts
  • Rate-limited downstream APIs
  • Temporary resource exhaustion

Avoid retrying when:

  • Validation fails
  • Authorization fails
  • The task is not idempotent
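The retry and no-retry lists above can be folded into a single predicate. The following is a minimal sketch, not part of Stem: `shouldRetry` and `RateLimitedException` are hypothetical names introduced for illustration, and the mapping from exception types to "transient" or "permanent" is an assumption you would adapt to your own error types.

```dart
import 'dart:async';
import 'dart:io';

/// Hypothetical error type for a downstream API that reported rate limiting.
class RateLimitedException implements Exception {
  const RateLimitedException();
}

/// Returns true when [error] looks transient and the task may safely run again.
bool shouldRetry(Object error, {required bool idempotent}) {
  // Never re-run a task whose side effects cannot be repeated safely.
  if (!idempotent) return false;

  // Transient: timeouts, dropped connections, rate limiting.
  if (error is TimeoutException) return true;
  if (error is SocketException) return true;
  if (error is RateLimitedException) return true;

  // Permanent: bad input fails the same way on every attempt.
  if (error is FormatException || error is ArgumentError) return false;

  // Unknown errors: this sketch fails fast rather than guessing.
  return false;
}
```

For example, `shouldRetry(TimeoutException('slow'), idempotent: true)` returns `true`, while the same error with `idempotent: false` returns `false`.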

Backoff strategy

  • Start with a short delay, then increase gradually.
  • Cap the maximum delay.
  • Add random jitter to avoid spikes.
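The three rules above can be sketched as capped exponential backoff with full jitter. This is a generic illustration, not Stem's own formula: `backoffDelay`, its parameter names, and its default values are assumptions made for the example.

```dart
import 'dart:math';

/// Delay before retry number [attempt] (0-based), using capped exponential
/// backoff with full jitter. Generic sketch; not Stem's internal formula.
Duration backoffDelay(
  int attempt, {
  Duration base = const Duration(milliseconds: 200),
  Duration cap = const Duration(seconds: 30),
  Random? random,
}) {
  final rng = random ?? Random();
  // Clamp the exponent so the shift below cannot overflow.
  final exponent = attempt < 0 ? 0 : (attempt > 30 ? 30 : attempt);
  // Start short, then double the ceiling on each attempt.
  final uncappedMs = base.inMilliseconds * (1 << exponent);
  // Cap the ceiling at the maximum delay.
  final ceilingMs = min(uncappedMs, cap.inMilliseconds);
  // Full jitter: pick uniformly in [0, ceiling] to spread retries over time.
  // Random.nextInt accepts at most 2^32, so keep [cap] below roughly 49 days.
  return Duration(milliseconds: rng.nextInt(ceilingMs + 1));
}
```

Full jitter is one of several common variants; it trades a predictable delay for the widest spread, which is what breaks up synchronized retry spikes.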

Stem defaults

Workers use ExponentialJitterRetryStrategy by default:

  • base: 2 seconds
  • max: 5 minutes

Retries are scheduled by publishing a new envelope with notBefore set to the next retry time. Each retry increments the attempt counter until TaskOptions.maxRetries is exhausted.
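To make the scheduling rule concrete, here is a simplified model of it. The `Envelope` class and `nextRetry` function below are hypothetical stand-ins written for this example, not Stem's actual types, and the sketch assumes the attempt counter starts at 0 for the first execution.

```dart
/// Hypothetical, simplified stand-in for a queued task message.
class Envelope {
  const Envelope({
    required this.taskName,
    required this.attempt,
    this.notBefore,
  });

  final String taskName;
  final int attempt; // Assumed to be 0 for the first execution.
  final DateTime? notBefore; // Earliest time a worker may run the task.
}

/// Returns the envelope to publish for the next retry, or null when the
/// retry budget is exhausted and the task should be treated as failed.
Envelope? nextRetry(
  Envelope failed, {
  required int maxRetries,
  required Duration delay,
  required DateTime now,
}) {
  if (failed.attempt >= maxRetries) return null;
  return Envelope(
    taskName: failed.taskName,
    attempt: failed.attempt + 1,
    notBefore: now.add(delay),
  );
}
```

Under these assumptions, `maxRetries: 2` allows up to three executions in total: the original attempt plus two retries.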

Task options

Use maxRetries on task handlers to cap retries:

lib/retry_backoff.dart
// Cap this task at two retries.
TaskOptions get options => const TaskOptions(maxRetries: 2);

Custom strategies

Provide a custom RetryStrategy to the worker when you need fixed delays, linear backoff, or bespoke logic:

lib/retry_backoff.dart
final RetryStrategy retryStrategy = ExponentialJitterRetryStrategy(
  base: const Duration(milliseconds: 200),
  max: const Duration(seconds: 2),
);

You can also implement your own strategy by conforming to the RetryStrategy interface and returning the desired delay for each attempt.
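The shape of a fixed-delay or linear strategy can be sketched without depending on Stem. The `DelayPolicy` interface below is a hypothetical stand-in: the actual method name and parameters of `RetryStrategy` are not shown on this page, so check the Stem API reference before adapting this. The sketch assumes `attempt` is 1 for the first retry.

```dart
/// Hypothetical stand-in for a retry-strategy interface.
/// This is NOT Stem's actual RetryStrategy signature.
abstract class DelayPolicy {
  /// Delay to wait before retry number [attempt] (1-based in this sketch).
  Duration delayFor(int attempt);
}

/// Waits the same amount of time before every retry.
class FixedDelay implements DelayPolicy {
  const FixedDelay(this.delay);

  final Duration delay;

  @override
  Duration delayFor(int attempt) => delay;
}

/// Grows the delay by [step] on each attempt, never exceeding [cap].
class LinearBackoff implements DelayPolicy {
  const LinearBackoff({required this.step, required this.cap});

  final Duration step;
  final Duration cap;

  @override
  Duration delayFor(int attempt) {
    final delay = step * attempt;
    return delay > cap ? cap : delay;
  }
}
```

With `LinearBackoff(step: Duration(seconds: 1), cap: Duration(seconds: 3))`, the second retry waits 2 seconds and every retry from the third onward waits 3 seconds.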

Observability cues

Watch these signals and metrics to verify retry behavior:

  • StemSignals.taskRetry includes the next retry timestamp.
  • stem.tasks.retried and stem.tasks.failed counters highlight spikes.
  • Rising dead-letter queue (DLQ) volume indicates that retries are being exhausted or that errors are permanent.

lib/retry_backoff.dart
void registerRetrySignals() {
  // Log the scheduled time of every retry as it is announced.
  StemSignals.taskRetry.connect((payload, _) {
    print('Retry scheduled at ${payload.nextRetryAt}');
  });
}

Operational checklist

  • Monitor retry rates and DLQ volume.
  • Alert on sustained retry spikes.
  • Requeue only after the root cause is fixed.

Next steps