Job Reliability
Understanding cancellation, progress tracking, and resume capabilities
VirtuousAI ensures long-running jobs complete reliably, even during worker failures or deployments. This guide explains the reliability mechanisms and how to use them.
Execution Modes
Jobs run in one of two modes:
| Mode | Duration | Mechanism | Example |
|---|---|---|---|
| SYNC | Under 30 seconds | Inline in API request | Web searches, simple queries |
| ASYNC_QUEUE | Minutes to hours | SQS + Dramatiq workers | Data extractions, large syncs |
Long-running data extractions always use ASYNC_QUEUE mode with full reliability features.
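As a loose illustration of how a dispatcher might route between these modes (the threshold check and helper names here are assumptions, not the actual VirtuousAI internals):

```python
from enum import Enum

class ExecutionMode(Enum):
    SYNC = "sync"                # inline in the API request, under 30 s
    ASYNC_QUEUE = "async_queue"  # SQS + Dramatiq workers

def choose_mode(estimated_seconds: float) -> ExecutionMode:
    # Jobs expected to finish within the request window run inline;
    # anything longer goes through the durable queue with checkpointing.
    return (ExecutionMode.SYNC if estimated_seconds < 30
            else ExecutionMode.ASYNC_QUEUE)
```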
Cancelling Jobs
Pending Jobs
Jobs in `PENDING` status are cancelled immediately:
```bash
vai actions cancel run_abc123
# Run cancelled
```

Or via the API:

```bash
curl -X POST https://api.virtuousai.com/api/v1/action-runs/run_abc123/cancel \
  -H "Authorization: Bearer $VAI_API_KEY"
```

Running Jobs
Jobs in `RUNNING` status use cooperative cancellation:
- API sets `cancel_requested_at` on the run
- Worker checks for cancellation every 30 seconds
- Current resource completes (no partial data)
- Job transitions to `CANCELLED`
Cancellation is checked between resources, not mid-resource. The current resource always completes to avoid partial data.
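Sketched as code, the cooperative loop looks roughly like this (all names here are illustrative, not the real internals):

```python
def process_run(run, resources):
    for resource in resources:
        # Honor the cancel flag only at resource boundaries, so the
        # resource in flight always finishes and checkpoints cleanly.
        if run.refresh().cancel_requested_at is not None:
            run.transition_to("CANCELLED")
            return
        extract_resource(run, resource)   # completes fully, no partial data
        save_checkpoint(run, resource)    # resume point for later attempts
```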
Cancellation Sources
| Source | Trigger | Behavior | Final Status |
|---|---|---|---|
| User Request | API call or CLI | Cooperative, graceful | CANCELLED |
| SIGTERM | ECS deployment | Cooperative, 120s grace period | PENDING (resumable) |
| Lease Lost | Worker crash | Watchdog marks failed | FAILED (retryable) |
| Timeout | Dramatiq time limit | Forced termination | FAILED |
New: SIGTERM results in PENDING
When a worker receives SIGTERM during deployment, the job is set to PENDING (not CANCELLED or FAILED). This allows the new worker to immediately pick up and resume from the checkpoint without manual intervention.
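A minimal sketch of the shutdown wiring this implies, assuming a `threading.Event` as the cancellation token (the real implementation may differ):

```python
import signal
import threading

cancel_token = threading.Event()

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM on deployment and allows a 120 s grace period.
    # Setting the token immediately lets the extraction loop stop at the
    # next resource boundary, checkpoint, and hand the job back as
    # PENDING instead of CANCELLED.
    cancel_token.set()

signal.signal(signal.SIGTERM, handle_sigterm)
```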
Progress Tracking
Monitor extraction progress in real-time:
```bash
vai actions get run_abc123
```

Progress includes:
- Current phase: `extracting`, `normalizing`, `loading`
- Current resource being processed
- Resources completed vs total
- Rows extracted so far
- Elapsed time
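For scripted monitoring, a polling loop along these lines works; it assumes `GET /api/v1/action-runs/{id}` returns the fields above, so treat the exact schema as an assumption:

```python
import time
import requests

def wait_for_run(run_id: str, api_key: str) -> dict:
    url = f"https://api.virtuousai.com/api/v1/action-runs/{run_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        run = requests.get(url, headers=headers, timeout=30).json()
        # Print a one-line progress summary on each poll.
        print(run.get("phase"), run.get("resources_completed"),
              run.get("rows_extracted"))
        if run.get("phase") in ("completed", "failed", "cancelled"):
            return run
        time.sleep(10)
```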
Progress Phases
| Phase | Description |
|---|---|
| `starting` | Initializing extraction |
| `extracting` | Pulling data from source API |
| `normalizing` | Applying schema transformations |
| `loading` | Writing to S3 bronze layer |
| `completed` | Finished successfully |
| `failed` | Encountered error |
| `cancelled` | User or system cancelled |
Resume After Failure
If a job fails mid-extraction, it resumes from the last checkpoint.
Per-Resource Checkpointing
Resources are extracted one at a time with checkpoints between each.
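The skip-on-retry behavior this enables looks like the following sketch (checkpoint shape and helper names are assumptions):

```python
def extract_with_checkpoints(run, resources):
    # Resources finished in an earlier attempt are recorded in the
    # checkpoint and skipped when the job is retried.
    done = set(run.checkpoint.get("completed_resources", []))
    for resource in resources:
        if resource in done:
            continue
        extract_resource(run, resource)
        done.add(resource)
        run.save_checkpoint({"completed_resources": sorted(done)})
```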
Sliced Execution for Large Resources
Some resources (like Klaviyo events) can contain millions of records spanning years. These use time-window slicing for reliable extraction:
| Parameter | Value | Purpose |
|---|---|---|
| Slice Duration | 7 days (configurable) | Time range per slice |
| Checkpoint | After every slice | Resume point on failure |
| Progress | slices_completed / slices_total | Real-time visibility |
Example: 2 years of Klaviyo events = ~104 weekly slices
```json
{
  "resource_cursors": {
    "events": {
      "slices_completed": 47,
      "slices_total": 104
    }
  }
}
```

Benefits:
- Deployments can interrupt mid-resource without data loss
- Progress shows meaningful percentages (not stuck at 0%)
- Memory-efficient (processes one slice at a time)
Sliced execution commits progress after every slice. Even for massive resources, the maximum lost work on interruption is one slice (~7 days of data).
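A sketch of the slicing loop under these parameters; `extract_window` and the cursor helpers are hypothetical:

```python
from datetime import datetime, timedelta

def iter_slices(start: datetime, end: datetime, days: int = 7):
    # Split [start, end) into fixed-width time windows (default 7 days).
    cursor = start
    while cursor < end:
        yield cursor, min(cursor + timedelta(days=days), end)
        cursor += timedelta(days=days)

def extract_sliced(run, resource, start, end):
    slices = list(iter_slices(start, end))
    done = run.cursor(resource).get("slices_completed", 0)
    for i, (lo, hi) in enumerate(slices):
        if i < done:
            continue                       # committed in a prior attempt
        extract_window(resource, lo, hi)   # one slice in memory at a time
        run.save_cursor(resource, {"slices_completed": i + 1,
                                   "slices_total": len(slices)})
```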
Key points of the checkpoint flow:
- Each resource extracts independently
- Checkpoint saved after each resource completes
- On retry, completed resources are skipped
Example: Extracting profiles, events, lists
| Scenario | On Retry |
|---|---|
| Crashed during `profiles` | Re-extract `profiles` from dlt cursor |
| Crashed during `events` | Skip `profiles`, resume `events` |
| Crashed during `lists` | Skip `profiles` + `events`, resume `lists` |
dlt Incremental State
Within each resource, dlt maintains cursor state:
- Stored in S3 under `_dlt_pipeline_state/`
- Tracks last `updated_at` or similar cursor
- On restart, only fetches records after the cursor
dlt state commits at the END of each resource. If crashed mid-resource, that resource re-extracts from its last cursor (not mid-page).
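For reference, a dlt incremental resource looks roughly like this; `fetch_pages` stands in for the source API client:

```python
import dlt

@dlt.resource(name="events", write_disposition="append")
def events(updated_at=dlt.sources.incremental(
        "updated_at", initial_value="2020-01-01T00:00:00Z")):
    # dlt persists updated_at.last_value in the pipeline state; after a
    # restart, only records newer than the stored cursor are fetched.
    yield from fetch_pages(since=updated_at.last_value)
```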
Lease-Based Ownership
Workers must acquire a database lease before processing a job:
| Parameter | Value | Purpose |
|---|---|---|
| Lease Duration | 90 seconds | How long a worker owns a job |
| Heartbeat Interval | 30 seconds | How often lease is extended |
| Watchdog Grace | 180 seconds | How stale before recovery kicks in |
This prevents duplicate processing when:
- SQS delivers the same message twice
- A worker is slow but not dead
- Network partitions occur
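The acquisition step amounts to a compare-and-swap in the database. This sketch assumes a PostgreSQL table and a psycopg-style connection, with illustrative column names:

```python
LEASE_SQL = """
UPDATE action_runs
   SET lease_owner = %(worker_id)s,
       lease_expires_at = now() + interval '90 seconds'
 WHERE id = %(run_id)s
   AND (lease_owner IS NULL OR lease_expires_at < now())
"""

def acquire_lease(conn, run_id: str, worker_id: str) -> bool:
    # Exactly one worker wins, even if SQS delivers the message twice.
    # A heartbeat re-runs this every 30 s to keep the lease fresh.
    with conn.cursor() as cur:
        cur.execute(LEASE_SQL, {"run_id": run_id, "worker_id": worker_id})
        won = cur.rowcount == 1
    conn.commit()
    return won
```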
Deployment Safety
When new worker versions are deployed, in-flight jobs survive automatically through graceful shutdown handling.
How it works:
- ECS sends `SIGTERM` to running containers
- Workers have 120 seconds to finish gracefully
- Cancellation token is set immediately
- Current resource completes and checkpoints
- Job set to `PENDING` (ready for immediate pickup)
- New worker resumes from checkpoint
Best Practices for Deployments
- Check running jobs before deploying: `vai actions list --status running`
- Wait for completion if possible (safest)
- Deploy with confidence: jobs resume automatically from checkpoints
Watchdog Recovery
A background task monitors for abandoned jobs and automatically recovers them:
| Check | Frequency | Action |
|---|---|---|
| Expired leases | Every 5 minutes | Mark FAILED (retryable) |
| Stuck PENDING | Every 5 minutes | Re-enqueue if stale |
Jobs that fail with a `worker_lost` error are automatically re-enqueued if the retry count allows.
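One pass of such a watchdog might look like this; the `db` and `queue` helpers are assumptions:

```python
def watchdog_pass(db, queue):
    # A lease stale beyond the 180 s grace period means the worker is gone.
    for run in db.runs_with_expired_leases(grace_seconds=180):
        db.mark_failed(run.id, error="worker_lost", retryable=True)
        if run.retry_count < run.max_retries:
            queue.enqueue(run.id)          # automatic re-enqueue
    # PENDING runs that were never picked up are re-enqueued as well.
    for run in db.stale_pending_runs():
        queue.enqueue(run.id)
```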
Troubleshooting
Job stuck in RUNNING
- Check if worker is alive (lease should be fresh)
- If lease expired, watchdog will recover within 5 minutes
- Manual recovery: `vai actions retry run_abc123`
Job keeps failing
- Check error details: `vai actions get run_abc123`
- If `AUTH_ERROR`: update connection credentials
- If `RATE_LIMITED`: the job will auto-retry with backoff
- If `worker_lost`: infrastructure issue, check ECS logs
Data seems duplicated
Bronze layer may have duplicate files after crash/restart. This is expected:
- Bronze = raw data (duplicates acceptable)
- Silver layer deduplicates during transformation
SQS Visibility Heartbeat
For very long jobs (8+ hours), the system extends SQS message visibility:
| Parameter | Value |
|---|---|
| Initial Visibility | 30 minutes |
| Extension Interval | 5 minutes |
| Extension Amount | 10 minutes |
This prevents SQS from re-delivering messages during extremely long extractions.
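The heartbeat can be a background thread around boto3's `change_message_visibility` call; the wiring here is illustrative:

```python
import threading
import boto3

def start_visibility_heartbeat(queue_url: str, receipt_handle: str,
                               stop: threading.Event) -> threading.Thread:
    sqs = boto3.client("sqs")

    def beat():
        # Every 5 minutes, push the message's invisibility out another
        # 10 minutes so SQS never re-delivers a still-running job.
        while not stop.wait(300):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=600,
            )

    t = threading.Thread(target=beat, daemon=True)
    t.start()
    return t
```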