Event-Driven ETL Dependency Orchestration

Declarative dependency management for complex data pipelines using YAML configuration and Pub/Sub messaging

📅 November 2025 ⏱️ 12 min read 🏷️ Data Engineering

The Challenge of Pipeline Orchestration

Modern data platforms often consist of dozens or hundreds of ETL jobs with complex interdependencies. Traditional workflow orchestration tools require defining these dependencies programmatically in DAGs (Directed Acyclic Graphs), which can become difficult to maintain as the system grows.

This article explores an alternative approach: declarative, event-driven dependency orchestration using YAML configuration files and cloud-native messaging systems.

System Architecture

Core Components

1. Dependency Configuration (YAML)
project_id: data-platform-prod
region: europe-west4
pubsub_topic: etl-job-events

jobs:
  extract-customers:
    dependencies: []
    triggers:
      - transform-customers
    layer: 0

  transform-customers:
    dependencies:
      - extract-customers
    triggers:
      - load-customers
      - calculate-metrics
    parallel_triggers: true
    layer: 1

Jobs are defined with their dependencies, the jobs they trigger, and an execution layer. The layer property records each job's depth in the dependency graph, which supports visualization and layer-by-layer parallel execution.
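
As a minimal sketch, the orchestrator can load this configuration with PyYAML and look up a job's entry (the file name jobs.yaml is an assumption):

import yaml

# Load the dependency configuration (file name is an assumption)
with open("jobs.yaml") as f:
    config = yaml.safe_load(f)

# Look up the jobs triggered by extract-customers
job = config["jobs"]["extract-customers"]
print(job["triggers"])       # ['transform-customers']
print(job.get("layer", 0))   # 0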

2. Event Publishing

When a job completes, it publishes an event to Pub/Sub:

{
  "job_name": "extract-customers",
  "status": "success",
  "timestamp": "2025-11-07T14:30:00Z",
  "run_id": "uuid-12345",
  "metrics": {
    "rows_processed": 150000,
    "duration_seconds": 120
  }
}
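
A minimal publishing sketch using the google-cloud-pubsub client; the project, topic, and field values are placeholders taken from the configuration and event above:

import json
import uuid
from datetime import datetime, timezone

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("data-platform-prod", "etl-job-events")

event = {
    "job_name": "extract-customers",
    "status": "success",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "run_id": str(uuid.uuid4()),
    "metrics": {"rows_processed": 150000, "duration_seconds": 120},
}

# Pub/Sub messages are raw bytes; block until the publish is acknowledged
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
future.result()
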
3. Dependency Resolution Engine

A lightweight service listens to job completion events and:

  • Reads the dependency configuration
  • Checks if all dependencies for triggered jobs are satisfied
  • Triggers downstream jobs when ready
  • Handles parallel execution groups
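
A sketch of the listener side of this service, assuming a pre-created subscription named etl-job-events-orchestrator and the process_job_completion function shown later in this article:

import json

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    "data-platform-prod", "etl-job-events-orchestrator"  # assumed subscription name
)

def callback(message):
    event = json.loads(message.data.decode("utf-8"))
    if event.get("status") == "success":
        process_job_completion(event)  # dependency-resolution logic (defined later)
    message.ack()

# Blocks the main thread while the streaming pull runs in the background
streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull.result()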

Layered Dependency Model

Jobs are organized into layers based on their dependencies. This creates a clear hierarchy and enables efficient parallel execution within each layer.

Layer 0: Foundation

Dependencies: None

Characteristics:

  • Source data extraction
  • Reference data loading
  • Can run in parallel
  • Entry points to the pipeline

Example Jobs:

  • extract-source-a
  • extract-source-b
  • load-reference-data

Layer 1: Primary Processing

Dependencies: Layer 0 jobs

Characteristics:

  • Basic transformations
  • Data quality checks
  • Initial enrichment
  • Each typically depends on a single Layer 0 job

Example Jobs:

  • validate-source-a
  • transform-source-b
  • enrich-with-reference

Layer 2: Intermediate Processing

Dependencies: Layer 0-1 jobs

Characteristics:

  • Complex transformations
  • Multi-source joins
  • Derived metrics
  • Can have parallel execution groups

Example Jobs:

  • join-sources-a-b
  • calculate-aggregates
  • build-dimensions

Layers 3-5: Advanced Processing

Dependencies: Multiple prior layers

Characteristics:

  • Complex multi-dependency jobs
  • Final aggregations
  • Report generation
  • ML feature engineering

Example Jobs:

  • build-fact-tables
  • generate-reports
  • export-to-downstream

Key Architectural Patterns

1. Event-Driven vs. Polling

Event-Driven: Jobs publish completion events immediately. The orchestrator reacts to these events in real-time, enabling faster pipeline execution and reduced latency.

Polling (Traditional): The orchestrator continuously checks job status at intervals. This introduces latency (polling interval) and creates unnecessary load on the scheduler.

2. Parallel Execution Groups

job-a:
  dependencies:
    - source-job
  triggers:
    - downstream-b
    - downstream-c
  parallel_triggers: true  # b and c run in parallel

The parallel_triggers flag enables fine-grained control over which jobs should run sequentially vs. in parallel, without requiring explicit DAG definition.
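
One way for the resolution engine to honor this flag is to fan trigger calls out concurrently, for example with a thread pool. A sketch follows, where trigger_job stands in for whatever call actually starts a job:

from concurrent.futures import ThreadPoolExecutor

def fire_triggers(triggered_jobs, parallel):
    """Start downstream jobs sequentially or in parallel."""
    if parallel:
        # Fan out all triggers at once and wait for the submissions to finish
        with ThreadPoolExecutor(max_workers=max(len(triggered_jobs), 1)) as pool:
            list(pool.map(trigger_job, triggered_jobs))
    else:
        for job in triggered_jobs:
            trigger_job(job)

# e.g. fire_triggers(["downstream-b", "downstream-c"], parallel=True)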

3. Dependency Resolution Algorithm

def process_job_completion(event):
    job_name = event.job_name

    # Read configuration
    config = load_yaml_config()

    # Get triggered jobs
    triggered_jobs = config.jobs[job_name].triggers

    for job in triggered_jobs:
        # Check if all dependencies satisfied
        dependencies = config.jobs[job].dependencies
        if all_jobs_completed(dependencies):
            # All dependencies met - trigger job
            trigger_job(job)
        else:
            # Wait for other dependencies
            log_waiting(job, dependencies)

4. State Management

The orchestrator maintains a simple state store (Redis, Cloud Memorystore, etc.) with the following keys (a sketch follows the list):

  • Job Completion Status: job:run_id:status → "success" | "failed" | "running"
  • Run Metadata: job:run_id:metadata → {timestamp, metrics, ...}
  • Pending Dependencies: job:run_id:pending → Set of job names
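
A minimal sketch of this state store using the redis-py client. For simplicity it keys by pipeline run_id plus job name, a slight variation on the scheme above, and the Redis host is an assumption:

import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_completion(run_id, job_name, status, metrics):
    # Persist the job's completion status and run metadata
    r.set(f"job:{run_id}:{job_name}:status", status)
    r.set(f"job:{run_id}:{job_name}:metadata", json.dumps(metrics))

def dependencies_satisfied(run_id, dependencies):
    # A job is ready when every dependency reports "success"
    return all(
        r.get(f"job:{run_id}:{dep}:status") == "success" for dep in dependencies
    )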

Advantages of This Approach

✅ Declarative Configuration

Dependencies are defined in YAML, not code. Non-engineers can understand and modify the pipeline structure.

⚡ Real-Time Triggering

Event-driven architecture eliminates polling delays. Downstream jobs start as soon as dependencies complete.

🔄 Separation of Concerns

Job logic is completely separate from orchestration logic. Jobs just publish events - they don't need to know about dependencies.

📊 Easy Visualization

The YAML structure can be parsed to generate dependency graphs automatically. The layer property enables clean topological layouts.
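
For example, the configuration can be turned into a graph and laid out by layer with networkx. A sketch, assuming the config has been loaded into a dict as shown earlier, that every dependency is itself a defined job, and that matplotlib is available:

import matplotlib.pyplot as plt
import networkx as nx

def draw_pipeline(config):
    graph = nx.DiGraph()
    for name, spec in config["jobs"].items():
        # The layer property becomes the subset used for the layered layout
        graph.add_node(name, layer=spec.get("layer", 0))
        for dep in spec.get("dependencies", []):
            graph.add_edge(dep, name)

    pos = nx.multipartite_layout(graph, subset_key="layer")
    nx.draw(graph, pos, with_labels=True, node_size=1200, font_size=8)
    plt.show()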

🧪 Testable

Dependencies can be validated without running actual jobs. Unit tests can verify the configuration structure.

🔧 Easy Maintenance

Adding or modifying jobs only requires YAML changes. No need to redeploy the orchestration engine.

Implementation Considerations

1. Idempotency

Pub/Sub delivers messages at least once, so the same completion event may arrive more than once. Jobs and the orchestrator must be idempotent, using run_id to deduplicate events.
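
A sketch of event deduplication on the orchestrator side, using Redis SET NX so that a redelivered message with the same run_id is processed only once (the key name and TTL are assumptions):

import redis

r = redis.Redis(decode_responses=True)

def is_first_delivery(run_id, ttl_seconds=86400):
    # SET ... NX succeeds only if the key does not exist yet,
    # so redeliveries of the same run_id return False
    return bool(r.set(f"event:{run_id}:seen", "1", nx=True, ex=ttl_seconds))

def handle_event(event):
    if not is_first_delivery(event["run_id"]):
        return  # duplicate delivery - safe to ignore
    process_job_completion(event)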

2. Error Handling

# Configuration includes retry policies
jobs:
  critical-job:
    dependencies: [...]
    triggers: [...]
    retry_policy:
      max_attempts: 3
      backoff_multiplier: 2
      initial_delay_seconds: 60
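
These fields map directly onto an exponential backoff schedule. A sketch of how the orchestrator might compute the delay before each retry (field names match the YAML above):

def retry_delay_seconds(attempt, retry_policy):
    """Delay before a given retry attempt (attempt 1 is the first retry)."""
    initial = retry_policy["initial_delay_seconds"]
    multiplier = retry_policy["backoff_multiplier"]
    return initial * multiplier ** (attempt - 1)

# max_attempts: 3, backoff_multiplier: 2, initial_delay_seconds: 60
policy = {"max_attempts": 3, "backoff_multiplier": 2, "initial_delay_seconds": 60}
print([retry_delay_seconds(n, policy) for n in range(1, policy["max_attempts"] + 1)])
# -> [60, 120, 240]
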

3. Circular Dependency Detection

The configuration should be validated at deployment time using graph cycle detection algorithms (e.g., DFS with recursion stack) to prevent circular dependencies.
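
A sketch of that deployment-time check using DFS with a recursion stack; it raises as soon as a job is revisited while still on the current path:

def find_cycle(jobs):
    """jobs maps job name -> {'dependencies': [...], ...}; raises on a cycle."""
    visited, on_path = set(), set()

    def dfs(job):
        visited.add(job)
        on_path.add(job)
        for dep in jobs[job].get("dependencies", []):
            if dep in on_path:
                raise ValueError(f"Circular dependency involving {job} -> {dep}")
            if dep in jobs and dep not in visited:
                dfs(dep)
        on_path.remove(job)

    for job in jobs:
        if job not in visited:
            dfs(job)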

4. Monitoring & Observability

  • Metrics: Job duration, success rate, dependency wait time
  • Logging: Structured logs with run_id correlation
  • Tracing: Distributed tracing across job executions
  • Alerting: Failed jobs, stuck pipelines, SLA violations

Comparison with Traditional Orchestrators

Aspect | Event-Driven YAML | Traditional DAG (Airflow/Prefect)
------ | ----------------- | ---------------------------------
Configuration | Declarative YAML | Python DAG files
Triggering | Real-time (event-driven) | Polling-based
Latency | Milliseconds | Seconds to minutes (polling interval)
Scalability | Horizontal (Pub/Sub scales automatically) | Vertical (scheduler bottleneck)
Learning Curve | Low (YAML + basic concepts) | High (Python + framework-specific APIs)
Job Coupling | Loose (jobs just publish events) | Tight (DAG imports job code)
Infrastructure | Lightweight (Pub/Sub + small orchestrator) | Heavy (scheduler, web server, database)

Real-World Example

Consider a data pipeline with 28 jobs spread across layers 0-5:

# Layer 0 (6 jobs): Extract from sources
- extract-market-data
- extract-product-catalog
- extract-customer-data
- extract-reference-tables
- extract-config-a
- extract-config-b

# Layer 1 (2 jobs): Primary processing
- process-headers (depends on extract-market-data)
- process-features (depends on process-headers)

# Layer 2 (2 jobs): Parallel processing
- process-metadata (depends on process-features)
  triggers: [build-entities-a, build-entities-b] (parallel)

# Layer 3 (8 jobs): Entity building
- build-entities-a (depends on process-metadata)
- build-entities-b (depends on process-metadata)
- build-descriptions-a (depends on build-entities-a)
- build-descriptions-b (depends on build-entities-b)
- build-relationships (depends on both entities)
- ...

# Layer 4 (5 jobs): Complex joins
- join-multi-source (depends on multiple Layer 3 jobs)
- calculate-derived-metrics
- ...

# Layer 5 (5 jobs): Final processing
- aggregate-reports
- export-to-downstream
- ...

Pipeline Execution:

  • Layer 0: All 6 extract jobs run in parallel (no dependencies)
  • Layer 1: Starts as soon as extract-market-data completes
  • Layer 2: Triggered by Layer 1 completion
  • Layer 3: build-entities-a and build-entities-b run in parallel
  • Layers 4-5: Execute based on their specific dependencies

Total Pipeline Time: ~45 minutes (vs. ~2+ hours with sequential execution)

Best Practices

1. Layer Assignments

Assign layers using topological sorting:

import networkx as nx

def assign_layers(config):
    # build_dependency_graph is assumed to return an nx.DiGraph with an
    # edge from each dependency to the job that depends on it
    graph = build_dependency_graph(config)
    layers = nx.topological_generations(graph)
    for i, layer in enumerate(layers):
        for job in layer:
            config.jobs[job].layer = i

2. Naming Conventions

  • Use descriptive, action-oriented names: extract-customers, not job1
  • Include layer hints in names for readability: l0-extract-*, l1-transform-*
  • Use consistent naming patterns across teams

3. Configuration Validation

  • Validate YAML schema on commit
  • Check for circular dependencies in CI/CD
  • Verify all referenced jobs exist (see the sketch after this list)
  • Test parallel execution groups for race conditions
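
A sketch of a CI check covering the first three points; it reuses the find_cycle function from the circular-dependency section and assumes the config has already been loaded from YAML:

def validate_config(config):
    jobs = config["jobs"]
    errors = []
    for name, spec in jobs.items():
        # Every dependency and trigger must refer to a defined job
        for ref in spec.get("dependencies", []) + spec.get("triggers", []):
            if ref not in jobs:
                errors.append(f"{name} references unknown job {ref}")
    try:
        find_cycle(jobs)  # from the circular-dependency section above
    except ValueError as exc:
        errors.append(str(exc))
    if errors:
        raise SystemExit("\n".join(errors))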

4. Version Control

  • Store configuration in Git alongside job code
  • Use pull requests for dependency changes
  • Tag releases for rollback capability
  • Document major architectural changes

Conclusion

Event-driven ETL orchestration with declarative YAML configuration offers a compelling alternative to traditional workflow orchestrators for certain use cases. It excels in scenarios where:

  • Low latency between jobs is critical
  • The team prefers declarative over programmatic configuration
  • Jobs are already well-modularized and loosely coupled
  • Cloud-native messaging infrastructure is available
  • The organization values simplicity over feature richness

However, it's not a silver bullet. Traditional orchestrators provide richer features like dynamic DAG generation, complex scheduling, and extensive monitoring UIs. The right choice depends on your specific requirements, team skills, and infrastructure constraints.

The layered dependency model combined with Pub/Sub messaging creates a powerful, scalable orchestration pattern that can handle complex pipelines while remaining simple to understand and maintain.
