Reliability Guide

This guide summarizes reliability practices for task systems using Stem.

Recovery workflow

  1. Identify the failing task or queue.
  2. Inspect recent errors and DLQ entries.
  3. Fix the root cause before replaying.
  4. Replay only the affected tasks.
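Step 4 is where recovery workflows most often go wrong: replaying the whole DLQ re-triggers unrelated failures. A minimal, language-agnostic sketch in Python of selective replay (the DLQ entry shape and the `replay` callback are hypothetical, not Stem's API):

```python
# Hypothetical DLQ entries: (task_id, task_name, error) tuples.
dlq = [
    ("t1", "send_email", "SMTP timeout"),
    ("t2", "resize_image", "corrupt input"),
    ("t3", "send_email", "SMTP timeout"),
]

def replay_affected(entries, task_name, replay):
    """Replay only entries for the task whose root cause was fixed."""
    replayed = []
    for task_id, name, _error in entries:
        if name == task_name:
            replay(task_id)
            replayed.append(task_id)
    return replayed

# After fixing the SMTP configuration, replay only the email tasks.
replayed = replay_affected(dlq, "send_email", replay=lambda task_id: None)
```

The unrelated `resize_image` failure stays in the DLQ until its own root cause is addressed.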

Broker fetch notes

  • The Redis Streams broker uses consumer groups plus XAUTOCLAIM to reclaim idle deliveries; long-running tasks should emit heartbeats or extend leases.
  • The Postgres broker polls with locked_until leases; tasks become visible to other workers again once the lease expires.
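The Postgres behavior reduces to a lease check: a task is fetchable only when it has no lease or its lease has expired. A simplified model in Python (the dict shape is illustrative, not Stem's actual schema):

```python
import datetime as dt

def visible(task, now):
    """A task is visible to workers when it has no lease or the lease expired."""
    locked_until = task.get("locked_until")
    return locked_until is None or locked_until <= now

now = dt.datetime(2024, 1, 1, 12, 0, 0)
leased = {"id": "a", "locked_until": now + dt.timedelta(seconds=30)}
expired = {"id": "b", "locked_until": now - dt.timedelta(seconds=1)}

print(visible(leased, now))   # False: still leased to another worker
print(visible(expired, now))  # True: lease expired, safe to re-fetch
```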

Workflow lease notes

  • Workflow runs are lease-based. Workers must renew leases while executing, and other workers can take over after the lease expires.
  • Keep runLeaseDuration >= broker visibility timeout to prevent redelivered workflow tasks from being dropped before takeover is possible.
  • Keep leaseExtension renewals ahead of both the workflow lease expiry and the broker visibility timeout.
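The two timing constraints above can be expressed as a configuration sanity check. A sketch with durations in seconds (the function and parameter names are illustrative, not Stem configuration keys):

```python
def check_lease_config(run_lease, visibility_timeout, renewal_interval):
    """Validate the two lease invariants from the notes above."""
    errors = []
    if run_lease < visibility_timeout:
        errors.append("runLeaseDuration must be >= broker visibility timeout")
    if renewal_interval >= min(run_lease, visibility_timeout):
        errors.append("lease renewals must fire before either deadline expires")
    return errors

# 60s lease, 30s visibility timeout, renew every 10s: both invariants hold.
print(check_lease_config(60, 30, 10))  # []
# Renewing every 45s lets the 30s visibility timeout lapse before a renewal.
print(check_lease_config(60, 30, 45))
```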

Poison-pill handling

  • If a task fails repeatedly for the same reason, treat it as a poison pill.
  • Move it to the DLQ and add guardrails or validation to prevent repeats.
  • Record the failure pattern for future detection.
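Detection can be as simple as counting repeated identical failures before moving a task to the DLQ. A sketch (the threshold and failure-record shape are illustrative):

```python
from collections import Counter

POISON_THRESHOLD = 3  # illustrative cutoff; tune per queue

def find_poison_pills(failures, threshold=POISON_THRESHOLD):
    """Flag (task_name, error) pairs that repeat at or above the threshold."""
    counts = Counter((name, error) for name, error in failures)
    return {key for key, n in counts.items() if n >= threshold}

failures = [
    ("import_csv", "bad header"),
    ("import_csv", "bad header"),
    ("import_csv", "bad header"),
    ("send_email", "SMTP timeout"),
]
print(find_poison_pills(failures))  # {('import_csv', 'bad header')}
```

Keying on the error message as well as the task name separates a genuine poison pill from a task that fails intermittently for varied reasons.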

Scheduler reliability

  • Run multiple Beat instances only when backed by a shared lock store.
  • Monitor schedule drift and firing failures to surface lock-store latency or contention.
  • Re-apply schedules after deploys to ensure definitions stay current.
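Running multiple Beat instances safely reduces to mutual exclusion: only the instance holding the shared lock fires a given tick. A toy sketch where an in-memory dict stands in for the shared lock store:

```python
import time

class LockStore:
    """Stand-in for a shared lock store (e.g. a database row or Redis key)."""
    def __init__(self):
        self._locks = {}

    def acquire(self, key, owner, ttl, now):
        holder = self._locks.get(key)
        if holder is None or holder[1] <= now:  # free, or previous lock expired
            self._locks[key] = (owner, now + ttl)
            return True
        return holder[0] == owner  # re-entrant for the current holder

store = LockStore()
now = time.time()
# Two Beat instances race for the same schedule tick; only one wins.
print(store.acquire("beat:tick", "beat-1", ttl=30, now=now))  # True
print(store.acquire("beat:tick", "beat-2", ttl=30, now=now))  # False
```

The TTL matters: if a Beat instance dies, the lock expires and another instance takes over on a later tick.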

Retries and backoff

  • Use bounded retries with jittered backoff to avoid thundering herds.
  • Separate transient failures from permanent failures.
  • For permanent errors, fail fast and alert.
retry_task/bin/worker.dart
  final worker = Worker(
    broker: broker,
    registry: registry,
    backend: backend,
    queue: 'retry-demo',
    consumerName: workerName,
    retryStrategy: ExponentialJitterRetryStrategy(
      base: const Duration(milliseconds: 200),
      max: const Duration(seconds: 1),
    ),
  );
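The strategy configured above grows the delay exponentially from the base, caps it at the max, and adds jitter. The underlying computation looks roughly like this full-jitter sketch in Python (a sketch of the general technique, not Stem's implementation):

```python
import random

def jittered_backoff(attempt, base=0.2, cap=1.0):
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

random.seed(0)  # deterministic for the example
delays = [round(jittered_backoff(a), 3) for a in range(5)]
print(delays)  # every delay stays under the 1s cap
```

Randomizing over the full range, rather than adding a small offset, is what spreads retries out and avoids thundering herds when many tasks fail at once.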

Heartbeats and progress

Use heartbeats and progress updates to prevent long-running tasks from being reclaimed prematurely.

progress_heartbeat/bin/worker.dart
  final worker = Worker(
    broker: broker,
    registry: registry,
    backend: backend,
    queue: progressQueue,
    subscription: RoutingSubscription.singleQueue(progressQueue),
    consumerName: workerName,
    heartbeatInterval: const Duration(seconds: 2),
    workerHeartbeatInterval: const Duration(seconds: 5),
    prefetchMultiplier: 1,
  );
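The key invariant behind the intervals above: heartbeats must land comfortably inside the broker's reclaim window, or a single missed beat lets the task be reclaimed mid-flight. A sketch of that check (durations in seconds; the safety margin is an illustrative choice):

```python
def safe_heartbeat(heartbeat_interval, reclaim_window, margin=2.0):
    """Require at least `margin` heartbeats to fit inside the reclaim window."""
    return heartbeat_interval * margin <= reclaim_window

print(safe_heartbeat(2, 30))   # True: a 2s heartbeat fits a 30s window many times over
print(safe_heartbeat(20, 30))  # False: one missed beat risks reclamation
```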

Observability signals

  • Track retry rates and DLQ volume as reliability signals.
  • Monitor queue backlog and worker heartbeats to detect stalls.
  • Tie task IDs to business logs for fast root-cause analysis.
  • Use StemSignals.taskRetry / taskFailed to drive notifications when error rates spike.
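A spike detector over those signals can be a simple ratio threshold over a rolling window. A sketch (window counts and the 10% threshold are illustrative, not Stem defaults):

```python
def retry_rate_alert(retries, completions, threshold=0.10):
    """Alert when retries exceed `threshold` of total task outcomes in a window."""
    total = retries + completions
    if total == 0:
        return False
    return retries / total > threshold

print(retry_rate_alert(5, 95))   # False: 5% retry rate is under threshold
print(retry_rate_alert(20, 80))  # True: 20% retry rate trips the alert
```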

Operational checks

stem health --broker "$STEM_BROKER_URL" --backend "$STEM_RESULT_BACKEND_URL"
stem observe queues
stem observe workers
stem dlq list --queue <queue>

Next steps