Job Reliability

Understanding cancellation, progress tracking, and resume capabilities

VirtuousAI ensures long-running jobs complete reliably, even during worker failures or deployments. This guide explains the reliability mechanisms and how to use them.

Execution Modes

Jobs run in one of two modes:

| Mode | Duration | Mechanism | Example |
| --- | --- | --- | --- |
| SYNC | Under 30 seconds | Inline in API request | Web searches, simple queries |
| ASYNC_QUEUE | Minutes to hours | SQS + Dramatiq workers | Data extractions, large syncs |

Long-running data extractions always use ASYNC_QUEUE mode with full reliability features.

Cancelling Jobs

Pending Jobs

Jobs in PENDING status are cancelled immediately:

```shell
vai actions cancel run_abc123
# Run cancelled
```

Or via the API:

```shell
curl -X POST https://api.virtuousai.com/api/v1/action-runs/run_abc123/cancel \
  -H "Authorization: Bearer $VAI_API_KEY"
```

Running Jobs

Jobs in RUNNING status use cooperative cancellation:

  1. API sets cancel_requested_at on the run
  2. Worker checks for cancellation every 30 seconds
  3. Current resource completes (no partial data)
  4. Job transitions to CANCELLED

Cancellation is checked between resources, not mid-resource. The current resource always completes to avoid partial data.
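The cooperative loop can be sketched as follows. This is a minimal sketch: `cancel_requested` and `extract_resource` are illustrative stand-ins, not the actual VirtuousAI internals, and the 30-second polling cadence is collapsed into one check per resource.

```python
def run_job(resources, cancel_requested, extract_resource):
    """Extract resources one at a time with cooperative cancellation.

    The cancel check happens BETWEEN resources only: once a resource
    starts extracting, it runs to completion so no partial data is
    written. `cancel_requested` stands in for reading the run's
    cancel_requested_at column.
    """
    completed = []
    for resource in resources:
        if cancel_requested():
            return "CANCELLED", completed
        extract_resource(resource)
        completed.append(resource)
    return "COMPLETED", completed
```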

Cancellation Sources

| Source | Trigger | Behavior | Final Status |
| --- | --- | --- | --- |
| User Request | API call or CLI | Cooperative, graceful | CANCELLED |
| SIGTERM | ECS deployment | Cooperative, 120s grace period | PENDING (resumable) |
| Lease Lost | Worker crash | Watchdog marks failed | FAILED (retryable) |
| Timeout | Dramatiq time limit | Forced termination | FAILED |

New: SIGTERM results in PENDING

When a worker receives SIGTERM during deployment, the job is set to PENDING (not CANCELLED or FAILED). This allows the new worker to immediately pick up and resume from the checkpoint without manual intervention.
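A minimal sketch of that shutdown path (handler and field names are illustrative, not the real worker code):

```python
import signal
import threading

# Cancellation token set by the SIGTERM handler; the worker loop
# polls it between resources.
shutdown_requested = threading.Event()

def handle_sigterm(signum, frame):
    # ECS sent SIGTERM: begin graceful shutdown within the 120s grace period.
    shutdown_requested.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def finalize_on_shutdown(job):
    """After the current resource checkpoints, requeue instead of failing."""
    if shutdown_requested.is_set():
        job["status"] = "PENDING"  # resumable: the next worker picks it up
        return True
    return False
```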

Progress Tracking

Monitor extraction progress in real-time:

```shell
vai actions get run_abc123
```

Progress includes:

  • Current phase: extracting, normalizing, loading
  • Current resource being processed
  • Resources completed vs total
  • Rows extracted so far
  • Elapsed time
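As an illustration, a progress payload carrying those fields (field names assumed here, not the exact API schema) can be summarized in one line:

```python
def progress_summary(progress):
    """One-line summary of a run's progress fields (illustrative schema)."""
    pct = 100 * progress["resources_completed"] / progress["resources_total"]
    return (
        f"{progress['phase']} {progress['current_resource']}: "
        f"{progress['resources_completed']}/{progress['resources_total']} resources "
        f"({pct:.0f}%), {progress['rows_extracted']:,} rows, "
        f"{progress['elapsed_seconds']}s elapsed"
    )
```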

Progress Phases

| Phase | Description |
| --- | --- |
| starting | Initializing extraction |
| extracting | Pulling data from source API |
| normalizing | Applying schema transformations |
| loading | Writing to S3 bronze layer |
| completed | Finished successfully |
| failed | Encountered error |
| cancelled | User or system cancelled |

Resume After Failure

If a job fails mid-extraction, it resumes from the last checkpoint.

Per-Resource Checkpointing

Resources are extracted one at a time with checkpoints between each.

Sliced Execution for Large Resources

Some resources (like Klaviyo events) can contain millions of records spanning years. These use time-window slicing for reliable extraction:

| Parameter | Value | Purpose |
| --- | --- | --- |
| Slice Duration | 7 days (configurable) | Time range per slice |
| Checkpoint | After every slice | Resume point on failure |
| Progress | slices_completed / slices_total | Real-time visibility |

Example: 2 years of Klaviyo events = ~104 weekly slices

```json
{
  "resource_cursors": {
    "events": {
      "slices_completed": 47,
      "slices_total": 104
    }
  }
}
```

Benefits:

  • Deployments can interrupt mid-resource without data loss
  • Progress shows meaningful percentages (not stuck at 0%)
  • Memory-efficient (processes one slice at a time)

Sliced execution commits progress after every slice. Even for massive resources, the maximum lost work on interruption is one slice (~7 days of data).
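Under those assumptions, the slicing loop looks roughly like this (a sketch; `extract_window` and `checkpoint` are illustrative callables, not the real extraction code):

```python
from datetime import datetime, timedelta

SLICE_DURATION = timedelta(days=7)  # configurable slice length

def make_slices(start, end, slice_len=SLICE_DURATION):
    """Split [start, end) into fixed-length time windows."""
    slices, cur = [], start
    while cur < end:
        slices.append((cur, min(cur + slice_len, end)))
        cur += slice_len
    return slices

def extract_sliced(start, end, extract_window, checkpoint, slices_completed=0):
    """Extract one window at a time, committing a checkpoint after each
    slice so an interruption loses at most one slice of work."""
    slices = make_slices(start, end)
    for i in range(slices_completed, len(slices)):
        extract_window(*slices[i])
        checkpoint({"slices_completed": i + 1, "slices_total": len(slices)})
```

On resume, the saved `slices_completed` is passed back in so already-extracted windows are skipped.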

Checkpoint flow:

  1. Each resource extracts independently
  2. Checkpoint saved after each resource completes
  3. On retry, completed resources are skipped

Example: Extracting profiles, events, lists

| Scenario | On Retry |
| --- | --- |
| Crashed during profiles | Re-extract profiles from dlt cursor |
| Crashed during events | Skip profiles, resume events |
| Crashed during lists | Skip profiles + events, resume lists |
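The skip-completed logic behind that behavior can be sketched as (names illustrative):

```python
def resume_extraction(resources, checkpoint, extract_resource):
    """Extract only resources not yet recorded in the checkpoint,
    updating the checkpoint after each one completes."""
    for resource in resources:
        if resource in checkpoint["completed"]:
            continue  # finished on a previous attempt; skip it
        extract_resource(resource)
        checkpoint["completed"].append(resource)
    return checkpoint
```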

dlt Incremental State

Within each resource, dlt maintains cursor state:

  • Stored in S3 _dlt_pipeline_state/
  • Tracks last updated_at or similar cursor
  • On restart, only fetches records after cursor

dlt state commits at the end of each resource. If the job crashes mid-resource, that resource re-extracts from its last committed cursor (not mid-page).
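Conceptually, the incremental fetch works like this (a simplified stand-in for dlt's cursor handling, not its actual API):

```python
def fetch_incremental(source_records, cursor):
    """Return only records newer than the saved cursor, plus the new
    cursor value to commit once the resource finishes."""
    fresh = [r for r in source_records if r["updated_at"] > cursor]
    new_cursor = max((r["updated_at"] for r in fresh), default=cursor)
    return fresh, new_cursor
```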

Lease-Based Ownership

Workers must acquire a database lease before processing a job:

| Parameter | Value | Purpose |
| --- | --- | --- |
| Lease Duration | 90 seconds | How long a worker owns a job |
| Heartbeat Interval | 30 seconds | How often the lease is extended |
| Watchdog Grace | 180 seconds | How stale before recovery kicks in |

This prevents duplicate processing when:

  • SQS delivers the same message twice
  • A worker is slow but not dead
  • Network partitions occur
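A sketch of the lease logic, using an in-memory dict as a stand-in for what is really a conditional UPDATE on the database row:

```python
LEASE_SECONDS = 90  # how long a worker owns a job without a heartbeat

def try_acquire_lease(leases, job_id, worker_id, now):
    """Take the lease if it is unowned or expired; idempotent for the
    current owner. Duplicate SQS deliveries and slow-but-alive workers
    both fail this check and skip the job."""
    lease = leases.get(job_id)
    if lease is None or lease["expires_at"] <= now:
        leases[job_id] = {"owner": worker_id, "expires_at": now + LEASE_SECONDS}
        return True
    return lease["owner"] == worker_id

def heartbeat(leases, job_id, worker_id, now):
    """Extend the lease (every 30 seconds) while still processing."""
    lease = leases.get(job_id)
    if lease is not None and lease["owner"] == worker_id:
        lease["expires_at"] = now + LEASE_SECONDS
        return True
    return False
```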

Deployment Safety

When deploying new worker versions, jobs survive automatically through graceful shutdown handling.

How it works:

  1. ECS sends SIGTERM to running containers
  2. Workers have 120 seconds to finish gracefully
  3. Cancellation token is set immediately
  4. Current resource completes and checkpoints
  5. Job set to PENDING (ready for immediate pickup)
  6. New worker resumes from checkpoint

Best Practices for Deployments

  1. Check running jobs before deploying:

     vai actions list --status running

  2. Wait for completion if possible (safest).

  3. Deploy with confidence: jobs resume automatically from checkpoints.

Watchdog Recovery

A background task monitors for abandoned jobs and automatically recovers them:

| Check | Frequency | Action |
| --- | --- | --- |
| Expired leases | Every 5 minutes | Mark FAILED (retryable) |
| Stuck PENDING | Every 5 minutes | Re-enqueue if stale |

Jobs with worker_lost error are automatically re-enqueued if retry count allows.
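The sweep can be sketched as follows (field names are illustrative; the real watchdog runs as a background task every 5 minutes):

```python
WATCHDOG_GRACE = 180  # seconds of lease staleness before recovery

def watchdog_sweep(jobs, now, max_retries=3):
    """Fail RUNNING jobs whose lease went stale, then re-enqueue the
    ones with retries remaining. Returns the re-enqueued job ids."""
    reenqueued = []
    for job in jobs:
        stale = now - job["lease_heartbeat_at"] > WATCHDOG_GRACE
        if job["status"] == "RUNNING" and stale:
            job["status"] = "FAILED"
            job["error"] = "worker_lost"
            if job["retries"] < max_retries:
                job["status"] = "PENDING"  # back on the queue
                job["retries"] += 1
                reenqueued.append(job["id"])
    return reenqueued
```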

Troubleshooting

Job stuck in RUNNING

  1. Check if worker is alive (lease should be fresh)
  2. If lease expired, watchdog will recover within 5 minutes
  3. Manual recovery: vai actions retry run_abc123

Job keeps failing

  1. Check error details: vai actions get run_abc123
  2. If AUTH_ERROR: Update connection credentials
  3. If RATE_LIMITED: Job will auto-retry with backoff
  4. If worker_lost: Infrastructure issue, check ECS logs

Data seems duplicated

Bronze layer may have duplicate files after crash/restart. This is expected:

  • Bronze = raw data (duplicates acceptable)
  • Silver layer deduplicates during transformation

SQS Visibility Heartbeat

For very long jobs (8+ hours), the system extends SQS message visibility:

| Parameter | Value |
| --- | --- |
| Initial Visibility | 30 minutes |
| Extension Interval | 5 minutes |
| Extension Amount | 10 minutes |

This prevents SQS from re-delivering messages during extremely long extractions.
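The timing math can be checked with a small simulation (a sketch of the schedule above, not the actual SQS client code; each extension corresponds to a ChangeMessageVisibility call):

```python
INITIAL_VISIBILITY = 30 * 60   # seconds the message starts invisible
EXTENSION_INTERVAL = 5 * 60    # how often the heartbeat fires
EXTENSION_AMOUNT = 10 * 60     # new timeout set by each extension

def stays_invisible(job_seconds):
    """True if periodic visibility extensions keep the message invisible
    for the whole job, so SQS never re-delivers it mid-run."""
    deadline = INITIAL_VISIBILITY
    t = 0
    while t < job_seconds:
        t += EXTENSION_INTERVAL
        if t > deadline:
            return False  # the timeout lapsed before the next heartbeat
        deadline = t + EXTENSION_AMOUNT  # extension resets the timeout
    return True
```

Because each 5-minute heartbeat pushes the deadline 10 minutes out, the message always has at least a 5-minute buffer, regardless of job length.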
