Job Reliability
Understanding cancellation, progress tracking, and resume capabilities
VirtuousAI ensures long-running jobs complete reliably, even during worker failures or deployments. This guide explains the reliability mechanisms and how to use them.
Execution Modes
Jobs run in one of two modes:
| Mode | Duration | Mechanism | Example |
|---|---|---|---|
| SYNC | Under 30 seconds | Inline in API request | Web searches, simple queries |
| ASYNC_QUEUE | Minutes to hours | SQS + Dramatiq workers | Data extractions, large syncs |
Long-running data extractions always use ASYNC_QUEUE mode with full reliability features.
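As a loose illustration of how a dispatcher might route between these modes (the threshold check and helper names here are assumptions, not the actual VirtuousAI internals):

```python
from enum import Enum

class ExecutionMode(Enum):
    SYNC = "sync"                # inline in the API request, under 30 s
    ASYNC_QUEUE = "async_queue"  # SQS + Dramatiq workers

def choose_mode(estimated_seconds: float) -> ExecutionMode:
    # Jobs expected to finish within the request window run inline;
    # anything longer goes through the durable queue with checkpointing.
    return (ExecutionMode.SYNC if estimated_seconds < 30
            else ExecutionMode.ASYNC_QUEUE)
```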
Cancelling Jobs
Pending Jobs
Jobs in `PENDING` status are cancelled immediately:
```bash
vai actions cancel run_abc123
# Run cancelled
```

Or via the API:

```bash
curl -X POST https://api.virtuousai.com/api/v1/action-runs/run_abc123/cancel \
  -H "Authorization: Bearer $VAI_API_KEY"
```

Running Jobs
Jobs in `RUNNING` status use cooperative cancellation:
- API sets `cancel_requested_at` on the run
- Worker checks for cancellation every 30 seconds
- Current resource completes (no partial data)
- Job transitions to `CANCELLED`
Cancellation is checked between resources, not mid-resource. The current resource always completes to avoid partial data.
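Sketched as code, the cooperative loop looks roughly like this (all names here are illustrative, not the real internals):

```python
def process_run(run, resources):
    for resource in resources:
        # Honor the cancel flag only at resource boundaries, so the
        # resource in flight always finishes and checkpoints cleanly.
        if run.refresh().cancel_requested_at is not None:
            run.transition_to("CANCELLED")
            return
        extract_resource(run, resource)   # completes fully, no partial data
        save_checkpoint(run, resource)    # resume point for later attempts
```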
Cancellation Sources
| Source | Trigger | Behavior | Final Status |
|---|---|---|---|
| User Request | API call or CLI | Cooperative, graceful | CANCELLED |
| SIGTERM | ECS deployment | Cooperative, 120s grace period | PENDING (resumable) |
| Lease Lost | Worker crash | Watchdog marks failed | FAILED (retryable) |
| Timeout | Dramatiq time limit | Forced termination | FAILED |
New: SIGTERM results in PENDING
When a worker receives SIGTERM during deployment, the job is set to PENDING (not CANCELLED or FAILED). This allows the new worker to immediately pick up and resume from the checkpoint without manual intervention.
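A minimal sketch of the shutdown wiring this implies, assuming a `threading.Event` as the cancellation token (the real implementation may differ):

```python
import signal
import threading

cancel_token = threading.Event()

def handle_sigterm(signum, frame):
    # ECS sends SIGTERM on deployment and allows a 120 s grace period.
    # Setting the token immediately lets the extraction loop stop at the
    # next resource boundary, checkpoint, and hand the job back as
    # PENDING instead of CANCELLED.
    cancel_token.set()

signal.signal(signal.SIGTERM, handle_sigterm)
```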
Progress Tracking
Monitor extraction progress in real-time:
```bash
vai actions get run_abc123
```

Progress includes:
- Current phase: `extracting`, `normalizing`, `loading`
- Current resource being processed
- Resources completed vs total
- Rows extracted so far
- Elapsed time
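For scripted monitoring, a polling loop along these lines works; it assumes `GET /api/v1/action-runs/{id}` returns the fields above, so treat the exact schema as an assumption:

```python
import time
import requests

def wait_for_run(run_id: str, api_key: str) -> dict:
    url = f"https://api.virtuousai.com/api/v1/action-runs/{run_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        run = requests.get(url, headers=headers, timeout=30).json()
        # Print a one-line progress summary on each poll.
        print(run.get("phase"), run.get("resources_completed"),
              run.get("rows_extracted"))
        if run.get("phase") in ("completed", "failed", "cancelled"):
            return run
        time.sleep(10)
```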
Progress Phases
| Phase | Description |
|---|---|
| `starting` | Initializing extraction |
| `extracting` | Pulling data from source API |
| `normalizing` | Applying schema transformations |
| `loading` | Writing to S3 bronze layer |
| `completed` | Finished successfully |
| `failed` | Encountered error |
| `cancelled` | User or system cancelled |
Resume After Failure
If a job fails mid-extraction, it resumes from the last checkpoint.
Per-Resource Checkpointing
Resources are extracted one at a time with checkpoints between each.
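The skip-on-retry behavior this enables looks like the following sketch (checkpoint shape and helper names are assumptions):

```python
def extract_with_checkpoints(run, resources):
    # Resources finished in an earlier attempt are recorded in the
    # checkpoint and skipped when the job is retried.
    done = set(run.checkpoint.get("completed_resources", []))
    for resource in resources:
        if resource in done:
            continue
        extract_resource(run, resource)
        done.add(resource)
        run.save_checkpoint({"completed_resources": sorted(done)})
```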
Sliced Execution for Large Resources
Some resources (like Klaviyo events) can contain millions of records spanning years. These use time-window slicing for reliable extraction:
| Parameter | Value | Purpose |
|---|---|---|
| Slice Duration | 7 days (configurable) | Time range per slice |
| Checkpoint | After every slice | Resume point on failure |
| Progress | slices_completed / slices_total | Real-time visibility |
Example: 2 years of Klaviyo events = ~104 weekly slices
```json
{
  "resource_cursors": {
    "events": {
      "slices_completed": 47,
      "slices_total": 104
    }
  }
}
```

Benefits:
- Deployments can interrupt mid-resource without data loss
- Progress shows meaningful percentages (not stuck at 0%)
- Memory-efficient (processes one slice at a time)
Sliced execution commits progress after every slice. Even for massive resources, the maximum lost work on interruption is one slice (~7 days of data).
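A sketch of the slicing loop under these parameters; `extract_window` and the cursor helpers are hypothetical:

```python
from datetime import datetime, timedelta

def iter_slices(start: datetime, end: datetime, days: int = 7):
    # Split [start, end) into fixed-width time windows (default 7 days).
    cursor = start
    while cursor < end:
        yield cursor, min(cursor + timedelta(days=days), end)
        cursor += timedelta(days=days)

def extract_sliced(run, resource, start, end):
    slices = list(iter_slices(start, end))
    done = run.cursor(resource).get("slices_completed", 0)
    for i, (lo, hi) in enumerate(slices):
        if i < done:
            continue                       # committed in a prior attempt
        extract_window(resource, lo, hi)   # one slice in memory at a time
        run.save_cursor(resource, {"slices_completed": i + 1,
                                   "slices_total": len(slices)})
```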
Key points of the checkpoint flow:
- Each resource extracts independently
- Checkpoint saved after each resource completes
- On retry, completed resources are skipped
Example: Extracting profiles, events, lists
| Scenario | On Retry |
|---|---|
| Crashed during `profiles` | Re-extract `profiles` from dlt cursor |
| Crashed during `events` | Skip `profiles`, resume `events` |
| Crashed during `lists` | Skip `profiles` + `events`, resume `lists` |
dlt Incremental State
Within each resource, dlt maintains cursor state:
- Stored in S3 under `_dlt_pipeline_state/`
- Tracks last `updated_at` or similar cursor
- On restart, only fetches records after the cursor
dlt state commits at the END of each resource. If crashed mid-resource, that resource re-extracts from its last cursor (not mid-page).
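For reference, a dlt incremental resource looks roughly like this; `fetch_pages` stands in for the source API client:

```python
import dlt

@dlt.resource(name="events", write_disposition="append")
def events(updated_at=dlt.sources.incremental(
        "updated_at", initial_value="2020-01-01T00:00:00Z")):
    # dlt persists updated_at.last_value in the pipeline state; after a
    # restart, only records newer than the stored cursor are fetched.
    yield from fetch_pages(since=updated_at.last_value)
```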
Lease-Based Ownership
Workers must acquire a database lease before processing a job:
| Parameter | Value | Purpose |
|---|---|---|
| Lease Duration | 90 seconds | How long a worker owns a job |
| Heartbeat Interval | 30 seconds | How often lease is extended |
| Watchdog Grace | 180 seconds | How stale before recovery kicks in |
This prevents duplicate processing when:
- SQS delivers the same message twice
- A worker is slow but not dead
- Network partitions occur
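The acquisition step amounts to a compare-and-swap in the database. This sketch assumes a PostgreSQL table and a psycopg-style connection, with illustrative column names:

```python
LEASE_SQL = """
UPDATE action_runs
   SET lease_owner = %(worker_id)s,
       lease_expires_at = now() + interval '90 seconds'
 WHERE id = %(run_id)s
   AND (lease_owner IS NULL OR lease_expires_at < now())
"""

def acquire_lease(conn, run_id: str, worker_id: str) -> bool:
    # Exactly one worker wins, even if SQS delivers the message twice.
    # A heartbeat re-runs this every 30 s to keep the lease fresh.
    with conn.cursor() as cur:
        cur.execute(LEASE_SQL, {"run_id": run_id, "worker_id": worker_id})
        won = cur.rowcount == 1
    conn.commit()
    return won
```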
Deployment Safety
When new worker versions are deployed, in-flight jobs survive automatically through graceful shutdown handling.
How it works:
- ECS sends `SIGTERM` to running containers
- Workers have 120 seconds to finish gracefully
- Cancellation token is set immediately
- Current resource completes and checkpoints
- Job set to `PENDING` (ready for immediate pickup)
- New worker resumes from checkpoint
Best Practices for Deployments
- Check running jobs before deploying: `vai actions list --status running`
- Wait for completion if possible (safest)
- Deploy with confidence: jobs resume automatically from checkpoints
Watchdog Recovery
A background task monitors for abandoned jobs and automatically recovers them:
| Check | Frequency | Action |
|---|---|---|
| Expired leases | Every 5 minutes | Mark FAILED (retryable) |
| Stuck PENDING | Every 5 minutes | Re-enqueue if stale |
Jobs that fail with a `worker_lost` error are automatically re-enqueued if the retry count allows.
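One pass of such a watchdog might look like this; the `db` and `queue` helpers are assumptions:

```python
def watchdog_pass(db, queue):
    # A lease stale beyond the 180 s grace period means the worker is gone.
    for run in db.runs_with_expired_leases(grace_seconds=180):
        db.mark_failed(run.id, error="worker_lost", retryable=True)
        if run.retry_count < run.max_retries:
            queue.enqueue(run.id)          # automatic re-enqueue
    # PENDING runs that were never picked up are re-enqueued as well.
    for run in db.stale_pending_runs():
        queue.enqueue(run.id)
```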
Troubleshooting
Job stuck in RUNNING
- Check if worker is alive (lease should be fresh)
- If lease expired, watchdog will recover within 5 minutes
- Manual recovery: `vai actions retry run_abc123`
Job keeps failing
- Check error details: `vai actions get run_abc123`
- If `AUTH_ERROR`: update connection credentials
- If `RATE_LIMITED`: the job will auto-retry with backoff
- If `worker_lost`: infrastructure issue, check ECS logs
Data seems duplicated
Bronze layer may have duplicate files after crash/restart. This is expected:
- Bronze = raw data (duplicates acceptable)
- Silver layer deduplicates during transformation
SQS Visibility Heartbeat
For very long jobs (8+ hours), the system extends SQS message visibility:
| Parameter | Value |
|---|---|
| Initial Visibility | 30 minutes |
| Extension Interval | 5 minutes |
| Extension Amount | 10 minutes |
This prevents SQS from re-delivering messages during extremely long extractions.
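The heartbeat can be a background thread around boto3's `change_message_visibility` call; the wiring here is illustrative:

```python
import threading
import boto3

def start_visibility_heartbeat(queue_url: str, receipt_handle: str,
                               stop: threading.Event) -> threading.Thread:
    sqs = boto3.client("sqs")

    def beat():
        # Every 5 minutes, push the message's invisibility out another
        # 10 minutes so SQS never re-delivers a still-running job.
        while not stop.wait(300):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=600,
            )

    t = threading.Thread(target=beat, daemon=True)
    t.start()
    return t
```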