Retry & Backoff
Retries keep transient failures from becoming outages. Use backoff to prevent retry storms.
Principles
- Treat transient errors as retryable.
- Fail fast on permanent errors (bad input, missing resources).
- Add jitter to spread retries over time.
When to retry
- Network timeouts
- Rate-limited downstream APIs
- Temporary resource exhaustion
Avoid retrying when:
- Validation fails
- Authorization fails
- The task is not idempotent
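The two lists above can be folded into a single predicate. The following is a minimal sketch, not part of Stem: ValidationException, AuthorizationException, RateLimitedException, and isRetryable are hypothetical names introduced for illustration, and only TimeoutException comes from the Dart SDK.

```dart
import 'dart:async';

// Hypothetical error types for illustration; substitute your own.
class ValidationException implements Exception {}

class AuthorizationException implements Exception {}

class RateLimitedException implements Exception {}

/// Returns true when [error] looks transient and the task can safely run again.
bool isRetryable(Object error, {required bool idempotent}) {
  // Never retry a task whose side effects cannot be repeated safely.
  if (!idempotent) return false;
  // Permanent errors: retrying cannot change the outcome.
  if (error is ValidationException || error is AuthorizationException) {
    return false;
  }
  // Transient errors: timeouts and rate limits usually clear on their own.
  return error is TimeoutException || error is RateLimitedException;
}
```

Unknown error types fall through to "not retryable" here, which is the conservative choice; whether that suits a given workload is a judgment call.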
Backoff strategy
- Start with a short delay, then increase gradually.
- Cap the maximum delay.
- Add random jitter to avoid spikes.
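The three steps above can be sketched as one function. This is an illustrative example using capped exponential growth with "full jitter" (a uniform pick between zero and the capped delay). It is not necessarily how ExponentialJitterRetryStrategy computes its delay; backoffDelay and its parameters are hypothetical names, and a zero-based attempt number is assumed.

```dart
import 'dart:math' as math;

/// Computes a retry delay for a zero-based [attempt] (assumption).
///
/// The delay doubles per attempt starting from [base], is capped at [cap],
/// and is then randomized uniformly between zero and the capped value.
Duration backoffDelay(
  int attempt, {
  Duration base = const Duration(seconds: 2),
  Duration cap = const Duration(minutes: 5),
  math.Random? random,
}) {
  final rng = random ?? math.Random();
  // Bound the exponent so the shift cannot overflow for large attempt counts.
  final exponent = math.min(math.max(attempt, 0), 30);
  final uncappedMs = base.inMilliseconds * (1 << exponent);
  // Cap the maximum delay.
  final cappedMs = math.min(uncappedMs, cap.inMilliseconds);
  // Full jitter: spread retries across the whole window to avoid spikes.
  return Duration(milliseconds: (rng.nextDouble() * cappedMs).round());
}
```

Passing a seeded math.Random makes the delays reproducible in tests; in production the default unseeded generator is usually what you want.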
Stem defaults
Workers use ExponentialJitterRetryStrategy by default:
- base: 2 seconds
- max: 5 minutes
Retries are scheduled by publishing a new envelope with notBefore set to the
next retry time. Each retry increments the attempt counter until
TaskOptions.maxRetries is exhausted.
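The scheduling rule described above can be modeled roughly as follows. This is a simplified sketch, not Stem's implementation: Envelope and nextRetryEnvelope are hypothetical stand-ins, and the sketch assumes the attempt counter starts at 0 and that a retry is allowed while it is below maxRetries.

```dart
/// Hypothetical stand-in for a queued task envelope; not Stem's actual type.
class Envelope {
  const Envelope({required this.taskName, this.attempt = 0, this.notBefore});

  final String taskName;
  final int attempt; // Number of retries already scheduled (assumption).
  final DateTime? notBefore; // Earliest time a worker may run the task.
}

/// Returns the envelope to publish for the next retry, or null when
/// [maxRetries] is exhausted and the task should be treated as failed.
Envelope? nextRetryEnvelope(
  Envelope failed, {
  required Duration delay,
  required int maxRetries,
  required DateTime now,
}) {
  if (failed.attempt >= maxRetries) return null;
  return Envelope(
    taskName: failed.taskName,
    attempt: failed.attempt + 1,
    notBefore: now.add(delay),
  );
}
```

Under these assumptions, maxRetries: 2 yields at most two retry envelopes after the initial run; the third failure returns null.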
Task options
Use maxRetries on task handlers to cap retries:
lib/retry_backoff.dart
TaskOptions get options => const TaskOptions(maxRetries: 2);
Custom strategies
Provide a custom RetryStrategy to the worker when you need fixed delays,
linear backoff, or bespoke logic:
Strategy config:
lib/retry_backoff.dart
final RetryStrategy retryStrategy = ExponentialJitterRetryStrategy(
  base: const Duration(milliseconds: 200),
  max: const Duration(seconds: 2),
);
StemApp worker config:
lib/retry_backoff.dart
final workerConfig = StemWorkerConfig(retryStrategy: retryStrategy);
Custom strategy:
lib/retry_backoff.dart
class FixedDelayRetryStrategy implements RetryStrategy {
  const FixedDelayRetryStrategy(this.delay);

  final Duration delay;

  @override
  Duration nextDelay(int attempt, Object error, StackTrace stackTrace) => delay;
}
Fixed-delay worker:
lib/retry_backoff.dart
const fixedDelayWorkerConfig = StemWorkerConfig(
  retryStrategy: FixedDelayRetryStrategy(Duration(seconds: 1)),
);
For any other behavior, implement the RetryStrategy interface and return the
desired delay for each attempt.
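As one more illustration, a linear backoff strategy might look like the sketch below. To keep the example compilable on its own, it declares a local RetryStrategy stand-in that mirrors the nextDelay signature used by FixedDelayRetryStrategy; in a real project you would implement Stem's own interface instead. LinearBackoffRetryStrategy, step, and max are hypothetical names, and a zero-based attempt number is assumed.

```dart
/// Local stand-in so this sketch compiles by itself. It mirrors the
/// nextDelay signature shown earlier; use Stem's RetryStrategy in real code.
abstract class RetryStrategy {
  Duration nextDelay(int attempt, Object error, StackTrace stackTrace);
}

/// Grows the delay by a fixed [step] per attempt, capped at [max].
class LinearBackoffRetryStrategy implements RetryStrategy {
  const LinearBackoffRetryStrategy({required this.step, required this.max});

  final Duration step;
  final Duration max;

  @override
  Duration nextDelay(int attempt, Object error, StackTrace stackTrace) {
    // Assumes attempt is zero-based: attempt 0 waits one step, 1 waits two.
    final delay = step * (attempt + 1);
    return delay > max ? max : delay;
  }
}
```

The error and stackTrace arguments are ignored here, but a bespoke strategy could inspect them, for example to wait longer after a rate-limit error.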
Observability cues
Watch these signals and metrics to verify retry behavior:
- StemSignals.taskRetry includes the next retry timestamp.
- stem.tasks.retried and stem.tasks.failed counters highlight spikes.
- DLQ volume indicates retries are exhausting or errors are permanent.
lib/retry_backoff.dart
void registerRetrySignals() {
  StemSignals.taskRetry.connect((payload, _) {
    print('Retry scheduled at ${payload.nextRetryAt}');
  });
}
Operational checklist
- Monitor retry rates and DLQ volume.
- Alert on sustained retry spikes.
- Requeue only after the root cause is fixed.