Event-Driven ETL Dependency Orchestration
Declarative dependency management for complex data pipelines using YAML configuration and Pub/Sub messaging
The Challenge of Pipeline Orchestration
Modern data platforms often consist of dozens or hundreds of ETL jobs with complex interdependencies. Traditional workflow orchestration tools require defining these dependencies programmatically in DAGs (Directed Acyclic Graphs), which can become difficult to maintain as the system grows.
This article explores an alternative approach: declarative, event-driven dependency orchestration using YAML configuration files and cloud-native messaging systems.
System Architecture
Core Components
1. Dependency Configuration (YAML)
```yaml
project_id: data-platform-prod
region: europe-west4
pubsub_topic: etl-job-events

jobs:
  extract-customers:
    dependencies: []
    triggers:
      - transform-customers
    layer: 0

  transform-customers:
    dependencies:
      - extract-customers
    triggers:
      - load-customers
      - calculate-metrics
    parallel_triggers: true
    layer: 1
```
Jobs are defined with their dependencies, triggers, and execution layer. The `layer` property enables topological sorting and visualization.
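For concreteness, here is a minimal sketch of loading this file with PyYAML; the `load_yaml_config` helper name and the `dependencies.yaml` path are assumptions, chosen to match the resolution algorithm shown later:

```python
import yaml

def load_yaml_config(path: str = "dependencies.yaml") -> dict:
    """Load the dependency configuration from YAML (path is an assumed name)."""
    with open(path) as f:
        return yaml.safe_load(f)

config = load_yaml_config()
print(config["jobs"]["extract-customers"]["triggers"])  # ['transform-customers']
```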
2. Event Publishing
When a job completes, it publishes an event to Pub/Sub:
```json
{
  "job_name": "extract-customers",
  "status": "success",
  "timestamp": "2025-11-07T14:30:00Z",
  "run_id": "uuid-12345",
  "metrics": {
    "rows_processed": 150000,
    "duration_seconds": 120
  }
}
```
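A hedged sketch of the publishing side using the google-cloud-pubsub client; the project and topic names come from the example configuration above:

```python
import json
from google.cloud import pubsub_v1

def publish_completion_event(event: dict) -> None:
    """Publish a job completion event to the configured Pub/Sub topic."""
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("data-platform-prod", "etl-job-events")
    future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
    future.result()  # Block until Pub/Sub acknowledges the message
```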
3. Dependency Resolution Engine
A lightweight service listens to job completion events and, as sketched below:
- Reads the dependency configuration
- Checks if all dependencies for triggered jobs are satisfied
- Triggers downstream jobs when ready
- Handles parallel execution groups
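A minimal sketch of such a listener, assuming a subscription named `etl-job-events-sub` on the topic above and the `process_job_completion` handler defined in the resolution algorithm below:

```python
import json
from google.cloud import pubsub_v1

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    """Decode a completion event, resolve dependencies, then ack."""
    event = json.loads(message.data.decode("utf-8"))
    process_job_completion(event)  # Defined under "Dependency Resolution Algorithm"
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("data-platform-prod", "etl-job-events-sub")
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()  # Keep the process alive while messages stream in
```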
Layered Dependency Model
Jobs are organized into layers based on their dependencies. This creates a clear hierarchy and enables efficient parallel execution within each layer.
Layer 0
Dependencies: None
Characteristics:
- Source data extraction
- Reference data loading
- Can run in parallel
- Entry points to the pipeline
Example Jobs:
- extract-source-a
- extract-source-b
- load-reference-data
Layer 1
Dependencies: Layer 0 jobs
Characteristics:
- Basic transformations
- Data quality checks
- Initial enrichment
- Depend on a single Layer 0 job
Example Jobs:
- validate-source-a
- transform-source-b
- enrich-with-reference
Layer 2
Dependencies: Layer 0-1 jobs
Characteristics:
- Complex transformations
- Multi-source joins
- Derived metrics
- Can have parallel execution groups
Example Jobs:
- join-sources-a-b
- calculate-aggregates
- build-dimensions
Layer 3+
Dependencies: Multiple prior layers
Characteristics:
- Complex multi-dependency jobs
- Final aggregations
- Report generation
- ML feature engineering
Example Jobs:
- build-fact-tables
- generate-reports
- export-to-downstream
Key Architectural Patterns
1. Event-Driven vs. Polling
Traditional schedulers discover completed work by polling on a fixed interval, which adds latency between every pair of jobs. Here the orchestrator reacts to Pub/Sub events the moment a job publishes its completion, so downstream work starts without waiting out a polling cycle.
2. Parallel Execution Groups
```yaml
job-a:
  dependencies:
    - source-job
  triggers:
    - downstream-b
    - downstream-c
  parallel_triggers: true  # b and c run in parallel
```
The `parallel_triggers` flag enables fine-grained control over which jobs run sequentially vs. in parallel, without requiring an explicit DAG definition.
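One way the engine might honor the flag, sketched with the standard library's thread pool; `trigger_job` is the same (assumed) helper used in the algorithm below:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(triggered_jobs: list[str], parallel: bool) -> None:
    """Start downstream jobs either concurrently or one at a time."""
    if parallel:
        with ThreadPoolExecutor() as pool:
            pool.map(trigger_job, triggered_jobs)  # Pool drains before exiting
    else:
        for job in triggered_jobs:
            trigger_job(job)  # Sequential fallback
```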
3. Dependency Resolution Algorithm
```python
def process_job_completion(event: dict) -> None:
    job_name = event["job_name"]

    # Read the dependency configuration
    config = load_yaml_config()

    # Get the jobs this completion may trigger
    triggered_jobs = config["jobs"][job_name]["triggers"]

    for job in triggered_jobs:
        # Check whether all of the job's dependencies are satisfied
        dependencies = config["jobs"][job]["dependencies"]
        if all_jobs_completed(dependencies):
            trigger_job(job)  # All dependencies met - start the job
        else:
            log_waiting(job, dependencies)  # Still waiting on other dependencies
```
4. State Management
The orchestrator maintains a simple state store (Redis, Cloud Memorystore, etc.), sketched below, with:
- Job Completion Status: `job:run_id:status` → `"success" | "failed" | "running"`
- Run Metadata: `job:run_id:metadata` → `{timestamp, metrics, ...}`
- Pending Dependencies: `job:run_id:pending` → set of job names
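A hedged redis-py sketch of this store, including the `all_jobs_completed` check the resolution algorithm relies on. Two assumptions: the keys here add the job name to the scheme above, and one shared `run_id` (plus an extra parameter) identifies a whole pipeline run:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_completion(job_name: str, run_id: str, metrics: dict) -> None:
    """Persist a job's terminal status and metadata for this pipeline run."""
    r.set(f"job:{run_id}:{job_name}:status", "success")
    r.set(f"job:{run_id}:{job_name}:metadata", json.dumps(metrics))

def all_jobs_completed(dependencies: list[str], run_id: str) -> bool:
    """Return True once every dependency reports success for this run."""
    return all(
        r.get(f"job:{run_id}:{dep}:status") == "success"
        for dep in dependencies
    )
```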
Advantages of This Approach
Declarative Configuration
Dependencies are defined in YAML, not code. Non-engineers can understand and modify the pipeline structure.
Real-Time Triggering
Event-driven architecture eliminates polling delays. Downstream jobs start as soon as dependencies complete.
Separation of Concerns
Job logic is completely separate from orchestration logic. Jobs just publish events - they don't need to know about dependencies.
Easy Visualization
The YAML structure can be parsed to generate dependency graphs automatically. The layer property enables clean topological layouts.
Testable
Dependencies can be validated without running actual jobs. Unit tests can verify the configuration structure.
Easy Maintenance
Adding or modifying jobs only requires YAML changes. No need to redeploy the orchestration engine.
Implementation Considerations
1. Idempotency
Pub/Sub may deliver messages multiple times (delivery is at-least-once). Jobs must be idempotent and use `run_id` to deduplicate events.
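A sketch of one dedup tactic, reusing the Redis store from the state-management section: an atomic set-if-absent on the event's `run_id`, so a redelivered message is dropped before any work happens (the key prefix and TTL are assumptions):

```python
import redis

r = redis.Redis(decode_responses=True)

def is_first_delivery(run_id: str, ttl_seconds: int = 86400) -> bool:
    """Atomically claim a run_id; returns False on duplicate deliveries."""
    # SET with nx=True only succeeds if the key does not already exist
    return bool(r.set(f"seen:{run_id}", "1", nx=True, ex=ttl_seconds))
```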
2. Error Handling
```yaml
# Configuration includes retry policies
jobs:
  critical-job:
    dependencies: [...]
    triggers: [...]
    retry_policy:
      max_attempts: 3
      backoff_multiplier: 2
      initial_delay_seconds: 60
```
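Given these fields, the wait before each retry would follow standard exponential backoff; a small sketch of the arithmetic:

```python
def retry_delay(attempt: int, initial_delay_seconds: int = 60,
                backoff_multiplier: float = 2.0) -> float:
    """Delay before retry number `attempt` (1-based): 60s, 120s, 240s, ..."""
    return initial_delay_seconds * backoff_multiplier ** (attempt - 1)

# With max_attempts: 3, the job waits 60s before the second attempt
# and 120s before the third.
assert retry_delay(1) == 60
assert retry_delay(2) == 120
```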
3. Circular Dependency Detection
The configuration should be validated at deployment time using graph cycle detection algorithms (e.g., DFS with recursion stack) to prevent circular dependencies.
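A compact sketch of that check: DFS over the `jobs` mapping with an explicit recursion-stack set, flagging any back edge as a cycle:

```python
def has_cycle(config: dict) -> bool:
    """Return True if the dependency configuration contains a cycle."""
    jobs = config["jobs"]
    visited, on_stack = set(), set()

    def dfs(job: str) -> bool:
        visited.add(job)
        on_stack.add(job)
        for dep in jobs.get(job, {}).get("dependencies", []):
            if dep in on_stack:
                return True  # Back edge: dep is an ancestor of job
            if dep not in visited and dfs(dep):
                return True
        on_stack.remove(job)
        return False

    return any(dfs(job) for job in jobs if job not in visited)
```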
4. Monitoring & Observability
- Metrics: Job duration, success rate, dependency wait time
- Logging: Structured logs with `run_id` correlation (example after this list)
- Tracing: Distributed tracing across job executions
- Alerting: Failed jobs, stuck pipelines, SLA violations
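As one illustration of `run_id` correlation, a structured-logging sketch using only the standard library; the field names are assumptions:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("orchestrator")

def log_event(run_id: str, job_name: str, event: str, **fields) -> None:
    """Emit one JSON log line keyed by run_id for cross-job correlation."""
    logger.info(json.dumps({"run_id": run_id, "job": job_name, "event": event, **fields}))

log_event("uuid-12345", "extract-customers", "completed", duration_seconds=120)
```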
Comparison with Traditional Orchestrators
| Aspect | Event-Driven YAML | Traditional DAG (Airflow/Prefect) |
|---|---|---|
| Configuration | Declarative YAML | Python DAG files |
| Triggering | Real-time (event-driven) | Polling-based |
| Latency | Milliseconds | Seconds to minutes (polling interval) |
| Scalability | Horizontal (Pub/Sub scales automatically) | Vertical (scheduler bottleneck) |
| Learning Curve | Low (YAML + basic concepts) | High (Python + framework-specific APIs) |
| Job Coupling | Loose (jobs just publish events) | Tight (DAG imports job code) |
| Infrastructure | Lightweight (Pub/Sub + small orchestrator) | Heavy (scheduler, web server, database) |
Real-World Example
Consider a data pipeline with 28 jobs across six layers (Layer 0 through Layer 5):
```text
# Layer 0 (6 jobs): Extract from sources
- extract-market-data
- extract-product-catalog
- extract-customer-data
- extract-reference-tables
- extract-config-a
- extract-config-b

# Layer 1 (2 jobs): Primary processing
- process-headers (depends on extract-market-data)
- process-features (depends on process-headers)

# Layer 2 (2 jobs): Parallel processing
- process-metadata (depends on process-features)
  triggers: [build-entities-a, build-entities-b] (parallel)

# Layer 3 (8 jobs): Entity building
- build-entities-a (depends on process-metadata)
- build-entities-b (depends on process-metadata)
- build-descriptions-a (depends on build-entities-a)
- build-descriptions-b (depends on build-entities-b)
- build-relationships (depends on both entities)
- ...

# Layer 4 (5 jobs): Complex joins
- join-multi-source (depends on multiple Layer 3 jobs)
- calculate-derived-metrics
- ...

# Layer 5 (5 jobs): Final processing
- aggregate-reports
- export-to-downstream
- ...
```
Pipeline Execution:
- Layer 0: All 6 extract jobs run in parallel (no dependencies)
- Layer 1: Starts as soon as `extract-market-data` completes
- Layer 2: Triggered by Layer 1 completion
- Layer 3: `build-entities-a` and `build-entities-b` run in parallel
- Layers 4-5: Execute based on their specific dependencies
Total Pipeline Time: ~45 minutes (vs. 2+ hours with sequential execution)
Best Practices
1. Layer Assignments
Assign layers using topological sorting:
```python
import networkx as nx

def assign_layers(config):
    """Derive each job's layer index from the dependency graph."""
    graph = build_dependency_graph(config)
    # topological_generations yields jobs grouped by dependency depth
    for i, layer in enumerate(nx.topological_generations(graph)):
        for job in layer:
            config["jobs"][job]["layer"] = i
```
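The snippet above assumes a `build_dependency_graph` helper; a plausible sketch, with edges pointing from each dependency to its dependent so the generations come out in execution order:

```python
import networkx as nx

def build_dependency_graph(config: dict) -> nx.DiGraph:
    """Build a directed graph with an edge from each dependency to its dependent."""
    graph = nx.DiGraph()
    for job, spec in config["jobs"].items():
        graph.add_node(job)  # Keep jobs with no edges in the graph
        for dep in spec.get("dependencies", []):
            graph.add_edge(dep, job)
    return graph
```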
2. Naming Conventions
- Use descriptive, action-oriented names: `extract-customers`, not `job1`
- Include layer hints in names for readability: `l0-extract-*`, `l1-transform-*`
- Use consistent naming patterns across teams
3. Configuration Validation
- Validate YAML schema on commit
- Check for circular dependencies in CI/CD
- Verify all referenced jobs exist (see the sketch after this list)
- Test parallel execution groups for race conditions
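A minimal sketch of the reference check from this list, reusing the config loader assumed earlier; schema validation and the cycle check above would run alongside it in CI:

```python
def validate_references(config: dict) -> list[str]:
    """Collect every dependency or trigger that names an undefined job."""
    jobs = config["jobs"]
    errors = []
    for name, spec in jobs.items():
        for ref in spec.get("dependencies", []) + spec.get("triggers", []):
            if ref not in jobs:
                errors.append(f"{name} references unknown job: {ref}")
    return errors

assert validate_references(load_yaml_config()) == []
```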
4. Version Control
- Store configuration in Git alongside job code
- Use pull requests for dependency changes
- Tag releases for rollback capability
- Document major architectural changes
Conclusion
Event-driven ETL orchestration with declarative YAML configuration offers a compelling alternative to traditional workflow orchestrators for certain use cases. It excels in scenarios where:
- Low latency between jobs is critical
- The team prefers declarative over programmatic configuration
- Jobs are already well-modularized and loosely coupled
- Cloud-native messaging infrastructure is available
- The organization values simplicity over feature richness
However, it's not a silver bullet. Traditional orchestrators provide richer features like dynamic DAG generation, complex scheduling, and extensive monitoring UIs. The right choice depends on your specific requirements, team skills, and infrastructure constraints.
The layered dependency model combined with Pub/Sub messaging creates a powerful, scalable orchestration pattern that can handle complex pipelines while remaining simple to understand and maintain.