Retry & Backoff

Retries keep transient failures from becoming outages. Use backoff to prevent retry storms.

Principles

Treat transient errors as retryable.
Fail fast on permanent errors (bad input, missing resources).
Add jitter to spread retries over time.

When to retry

Retry when the failure is likely to be transient:

Network timeouts
Rate-limited downstream APIs
Temporary resource exhaustion

Avoid retrying when:

Validation fails
Authorization fails
The task is not idempotent
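
The two lists above can be folded into a single check that a handler consults before asking for a retry. The following is a minimal sketch, not part of Stem: `FailureKind`, `transientKinds`, and `shouldRetry` are hypothetical names, and how real exceptions map onto these categories depends on your application.

```dart
/// Hypothetical failure categories mirroring the lists above.
enum FailureKind {
  networkTimeout,
  rateLimited,
  resourceExhausted,
  validation,
  authorization,
}

/// Failures that may succeed if the same call is repeated later.
const Set<FailureKind> transientKinds = {
  FailureKind.networkTimeout,
  FailureKind.rateLimited,
  FailureKind.resourceExhausted,
};

/// Returns true when a retry is likely to help.
///
/// A task that is not idempotent is never retried, even on a
/// transient failure, because running it twice may duplicate effects.
bool shouldRetry(FailureKind kind, {required bool idempotent}) =>
    idempotent && transientKinds.contains(kind);
```

Keeping the decision in one place makes it easier to audit which failures are retried and which fail fast.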

Backoff strategy

Start with a short delay, then increase gradually.
Cap the maximum delay.
Add random jitter to avoid spikes.
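
One common way to combine these three rules is exponential backoff with "full jitter": double a ceiling on each attempt, cap it, then pick a uniformly random delay below the ceiling. The sketch below illustrates that general technique under stated assumptions; `backoffDelay` and its parameters are hypothetical names, and it is not necessarily the formula Stem uses internally.

```dart
import 'dart:math';

/// Delay before retry number [attempt] (assumed 1-based).
///
/// The ceiling starts at [base], doubles on each attempt, and never
/// exceeds [cap]. The returned delay is uniform in [0, ceiling).
Duration backoffDelay(
  int attempt, {
  required Duration base,
  required Duration cap,
  Random? random,
}) {
  final rng = random ?? Random();
  // Limit the exponent so the shift stays small for large attempt counts.
  final exponent = attempt < 1 ? 0 : min(attempt - 1, 30);
  final uncapped = base.inMicroseconds * (1 << exponent);
  final ceiling = min(uncapped, cap.inMicroseconds);
  // Full jitter: spread retries across the whole interval below the ceiling.
  return Duration(microseconds: (rng.nextDouble() * ceiling).floor());
}
```

Passing a seeded `Random` makes the delays reproducible in tests. Full jitter trades a predictable minimum wait for the widest possible spread of retry times.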

Stem defaults

Workers use ExponentialJitterRetryStrategy by default:

base: 2 seconds
max: 5 minutes

Retries are scheduled by publishing a new envelope with notBefore set to the next retry time. Each retry increments the attempt counter until TaskOptions.maxRetries is exhausted.
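
The scheduling step described above can be sketched as follows. This is an illustration only: `Envelope` here is a simplified, hypothetical stand-in rather than Stem's actual type, `nextRetryEnvelope` is an invented helper, and the sketch assumes that maxRetries counts retries rather than total executions.

```dart
/// Hypothetical stand-in for a queued task message; not Stem's type.
class Envelope {
  const Envelope({required this.taskName, this.attempt = 0, this.notBefore});

  final String taskName;

  /// Number of retries already scheduled for this task.
  final int attempt;

  /// Earliest time at which a worker may run the task.
  final DateTime? notBefore;
}

/// Returns the envelope to publish for the next retry, or null when
/// [maxRetries] is exhausted and the task should be treated as failed.
Envelope? nextRetryEnvelope(
  Envelope current, {
  required int maxRetries,
  required Duration delay,
  required DateTime now,
}) {
  if (current.attempt >= maxRetries) return null;
  return Envelope(
    taskName: current.taskName,
    attempt: current.attempt + 1,
    notBefore: now.add(delay),
  );
}
```

Under these assumptions, a task with maxRetries set to 2 produces at most two retry envelopes before the helper returns null.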

Task options

Use maxRetries on task handlers to cap retries:

lib/retry_backoff.dart
  @override
  TaskOptions get options => const TaskOptions(maxRetries: 2);

Custom strategies

Provide a custom RetryStrategy to the worker when you need fixed delays, linear backoff, or bespoke logic:

Strategy config:

lib/retry_backoff.dart
final RetryStrategy retryStrategy = ExponentialJitterRetryStrategy(
  base: const Duration(milliseconds: 200),
  max: const Duration(seconds: 2),
);

StemApp worker config:

lib/retry_backoff.dart
  final workerConfig = StemWorkerConfig(retryStrategy: retryStrategy);

Custom strategy:

lib/retry_backoff.dart
class FixedDelayRetryStrategy implements RetryStrategy {
  const FixedDelayRetryStrategy(this.delay);

  final Duration delay;

  @override
  Duration nextDelay(int attempt, Object error, StackTrace stackTrace) => delay;
}

Fixed-delay worker:

lib/retry_backoff.dart
const fixedDelayWorkerConfig = StemWorkerConfig(
  retryStrategy: FixedDelayRetryStrategy(Duration(seconds: 1)),
);

To write your own strategy, implement the RetryStrategy interface and return the desired delay for each attempt from nextDelay.
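
As a further illustration, linear backoff might look like the sketch below. The `RetryStrategy` declaration is a local stand-in that copies the nextDelay signature shown earlier so that the snippet compiles on its own; in a real project you would implement Stem's interface instead. `LinearRetryStrategy` is a hypothetical name, and the sketch assumes that attempt is 1 for the first retry.

```dart
/// Local stand-in mirroring the nextDelay signature shown earlier.
/// Replace with Stem's own RetryStrategy in a real project.
abstract class RetryStrategy {
  Duration nextDelay(int attempt, Object error, StackTrace stackTrace);
}

/// Linear backoff: the delay grows by [step] per attempt, up to [max].
class LinearRetryStrategy implements RetryStrategy {
  const LinearRetryStrategy({required this.step, required this.max});

  final Duration step;
  final Duration max;

  @override
  Duration nextDelay(int attempt, Object error, StackTrace stackTrace) {
    // Treat anything below 1 as the first retry (assumed 1-based counter).
    final multiplier = attempt < 1 ? 1 : attempt;
    final delay = step * multiplier;
    return delay > max ? max : delay;
  }
}
```

With a step of one second and a max of five seconds, this yields delays of 1 s, 2 s, 3 s, and so on, levelling off at 5 s.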

Observability cues

Watch these signals and metrics to verify retry behavior:

StemSignals.taskRetry includes the next retry timestamp.
stem.tasks.retried and stem.tasks.failed counters highlight spikes.
Rising dead-letter queue (DLQ) volume indicates that retries are being exhausted or that errors are permanent.

lib/retry_backoff.dart
void registerRetrySignals() {
  StemSignals.taskRetry.connect((payload, _) {
    print('Retry scheduled at ${payload.nextRetryAt}');
  });
}

Operational checklist

Monitor retry rates and DLQ volume.
Alert on sustained retry spikes.
Requeue dead-lettered tasks only after the root cause is fixed.
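
One simple way to distinguish a sustained spike from a brief blip is to require several consecutive intervals above a threshold before alerting. The sketch below assumes you already collect retry counts per fixed interval (for example, from the stem.tasks.retried counter); `isSustainedSpike` and its parameters are hypothetical, and suitable threshold values depend on your workload.

```dart
/// Returns true when the most recent [windows] samples all exceed
/// [threshold].
///
/// Each entry in [retriesPerInterval] is the retry count for one
/// fixed interval (for example, one minute), ordered oldest first.
bool isSustainedSpike(
  List<int> retriesPerInterval, {
  required int threshold,
  required int windows,
}) {
  // Not enough history yet: do not alert.
  if (windows <= 0 || retriesPerInterval.length < windows) return false;
  final recent =
      retriesPerInterval.sublist(retriesPerInterval.length - windows);
  return recent.every((count) => count > threshold);
}
```

Requiring consecutive intervals keeps a single noisy minute from paging anyone, at the cost of a slower alert.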
