Event-Driven ETL Dependency Orchestration
Declarative dependency management for complex data pipelines using YAML configuration and Pub/Sub messaging
The Challenge of Pipeline Orchestration
Modern data platforms often consist of dozens or hundreds of ETL jobs with complex interdependencies. Traditional workflow orchestration tools require defining these dependencies programmatically in DAGs (Directed Acyclic Graphs), which can become difficult to maintain as the system grows.
This article explores an alternative approach: declarative, event-driven dependency orchestration using YAML configuration files and cloud-native messaging systems.
System Architecture
Core Components
1. Dependency Configuration (YAML)
```yaml
project_id: data-platform-prod
region: europe-west4
pubsub_topic: etl-job-events

jobs:
  extract-customers:
    dependencies: []
    triggers:
      - transform-customers
    layer: 0

  transform-customers:
    dependencies:
      - extract-customers
    triggers:
      - load-customers
      - calculate-metrics
    parallel_triggers: true
    layer: 1
```
Jobs are defined with their dependencies, triggers, and execution layer. The `layer` property enables topological sorting and visualization.
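For concreteness, here is a minimal sketch of loading this file with PyYAML; the `load_yaml_config` helper name and the `dependencies.yaml` path are assumptions, chosen to match the resolution algorithm shown later:

```python
import yaml

def load_yaml_config(path: str = "dependencies.yaml") -> dict:
    """Load the dependency configuration from YAML (path is an assumed name)."""
    with open(path) as f:
        return yaml.safe_load(f)

config = load_yaml_config()
print(config["jobs"]["extract-customers"]["triggers"])  # ['transform-customers']
```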
2. Event Publishing
When a job completes, it publishes an event to Pub/Sub:
```json
{
  "job_name": "extract-customers",
  "status": "success",
  "timestamp": "2025-11-07T14:30:00Z",
  "run_id": "uuid-12345",
  "metrics": {
    "rows_processed": 150000,
    "duration_seconds": 120
  }
}
```
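A hedged sketch of the publishing side using the google-cloud-pubsub client; the project and topic names come from the example configuration above:

```python
import json
from google.cloud import pubsub_v1

def publish_completion_event(event: dict) -> None:
    """Publish a job completion event to the configured Pub/Sub topic."""
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("data-platform-prod", "etl-job-events")
    future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
    future.result()  # Block until Pub/Sub acknowledges the message
```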
3. Dependency Resolution Engine
A lightweight service listens to job completion events and, as sketched below:
- Reads the dependency configuration
- Checks if all dependencies for triggered jobs are satisfied
- Triggers downstream jobs when ready
- Handles parallel execution groups
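A minimal sketch of such a listener, assuming a subscription named `etl-job-events-sub` on the topic above and the `process_job_completion` handler defined in the resolution algorithm below:

```python
import json
from google.cloud import pubsub_v1

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    """Decode a completion event, resolve dependencies, then ack."""
    event = json.loads(message.data.decode("utf-8"))
    process_job_completion(event)  # Defined under "Dependency Resolution Algorithm"
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("data-platform-prod", "etl-job-events-sub")
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()  # Keep the process alive while messages stream in
```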
Layered Dependency Model
Jobs are organized into layers based on their dependencies. This creates a clear hierarchy and enables efficient parallel execution within each layer.
Layer 0
Dependencies: None
Characteristics:
- Source data extraction
- Reference data loading
- Can run in parallel
- Entry points to the pipeline
Example Jobs:
- extract-source-a
- extract-source-b
- load-reference-data
Layer 1
Dependencies: Layer 0 jobs
Characteristics:
- Basic transformations
- Data quality checks
- Initial enrichment
- Depend on a single Layer 0 job
Example Jobs:
- validate-source-a
- transform-source-b
- enrich-with-reference
Layer 2
Dependencies: Layer 0-1 jobs
Characteristics:
- Complex transformations
- Multi-source joins
- Derived metrics
- Can have parallel execution groups
Example Jobs:
- join-sources-a-b
- calculate-aggregates
- build-dimensions
Layer 3+
Dependencies: Multiple prior layers
Characteristics:
- Complex multi-dependency jobs
- Final aggregations
- Report generation
- ML feature engineering
Example Jobs:
- build-fact-tables
- generate-reports
- export-to-downstream
Key Architectural Patterns
1. Event-Driven vs. Polling
Traditional schedulers discover completed work by polling on a fixed interval, which adds latency between every pair of jobs. Here the orchestrator reacts to Pub/Sub events the moment a job publishes its completion, so downstream work starts without waiting out a polling cycle.
2. Parallel Execution Groups
```yaml
job-a:
  dependencies:
    - source-job
  triggers:
    - downstream-b
    - downstream-c
  parallel_triggers: true  # b and c run in parallel
```
The `parallel_triggers` flag enables fine-grained control over which jobs run sequentially vs. in parallel, without requiring an explicit DAG definition.
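One way the engine might honor the flag, sketched with the standard library's thread pool; `trigger_job` is the same (assumed) helper used in the algorithm below:

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(triggered_jobs: list[str], parallel: bool) -> None:
    """Start downstream jobs either concurrently or one at a time."""
    if parallel:
        with ThreadPoolExecutor() as pool:
            pool.map(trigger_job, triggered_jobs)  # Pool drains before exiting
    else:
        for job in triggered_jobs:
            trigger_job(job)  # Sequential fallback
```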
3. Dependency Resolution Algorithm
```python
def process_job_completion(event: dict) -> None:
    job_name = event["job_name"]

    # Read the dependency configuration
    config = load_yaml_config()

    # Get the jobs this completion may trigger
    triggered_jobs = config["jobs"][job_name]["triggers"]

    for job in triggered_jobs:
        # Check whether all of the job's dependencies are satisfied
        dependencies = config["jobs"][job]["dependencies"]
        if all_jobs_completed(dependencies):
            trigger_job(job)  # All dependencies met - start the job
        else:
            log_waiting(job, dependencies)  # Still waiting on other dependencies
```
4. State Management
The orchestrator maintains a simple state store (Redis, Cloud Memorystore, etc.), sketched below, with:
- Job Completion Status: `job:run_id:status` → `"success" | "failed" | "running"`
- Run Metadata: `job:run_id:metadata` → `{timestamp, metrics, ...}`
- Pending Dependencies: `job:run_id:pending` → set of job names
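A hedged redis-py sketch of this store, including the `all_jobs_completed` check the resolution algorithm relies on. Two assumptions: the keys here add the job name to the scheme above, and one shared `run_id` (plus an extra parameter) identifies a whole pipeline run:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_completion(job_name: str, run_id: str, metrics: dict) -> None:
    """Persist a job's terminal status and metadata for this pipeline run."""
    r.set(f"job:{run_id}:{job_name}:status", "success")
    r.set(f"job:{run_id}:{job_name}:metadata", json.dumps(metrics))

def all_jobs_completed(dependencies: list[str], run_id: str) -> bool:
    """Return True once every dependency reports success for this run."""
    return all(
        r.get(f"job:{run_id}:{dep}:status") == "success"
        for dep in dependencies
    )
```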
Advantages of This Approach
Declarative Configuration
Dependencies are defined in YAML, not code. Non-engineers can understand and modify the pipeline structure.
Real-Time Triggering
Event-driven architecture eliminates polling delays. Downstream jobs start as soon as dependencies complete.
Separation of Concerns
Job logic is completely separate from orchestration logic. Jobs just publish events - they don't need to know about dependencies.
Easy Visualization
The YAML structure can be parsed to generate dependency graphs automatically. The layer property enables clean topological layouts.
Testable
Dependencies can be validated without running actual jobs. Unit tests can verify the configuration structure.
Easy Maintenance
Adding or modifying jobs only requires YAML changes. No need to redeploy the orchestration engine.
Implementation Considerations
1. Idempotency
Pub/Sub may deliver messages multiple times (delivery is at-least-once). Jobs must be idempotent and use `run_id` to deduplicate events.
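A sketch of one dedup tactic, reusing the Redis store from the state-management section: an atomic set-if-absent on the event's `run_id`, so a redelivered message is dropped before any work happens (the key prefix and TTL are assumptions):

```python
import redis

r = redis.Redis(decode_responses=True)

def is_first_delivery(run_id: str, ttl_seconds: int = 86400) -> bool:
    """Atomically claim a run_id; returns False on duplicate deliveries."""
    # SET with nx=True only succeeds if the key does not already exist
    return bool(r.set(f"seen:{run_id}", "1", nx=True, ex=ttl_seconds))
```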
2. Error Handling
```yaml
# Configuration includes retry policies
jobs:
  critical-job:
    dependencies: [...]
    triggers: [...]
    retry_policy:
      max_attempts: 3
      backoff_multiplier: 2
      initial_delay_seconds: 60
```
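Given these fields, the wait before each retry would follow standard exponential backoff; a small sketch of the arithmetic:

```python
def retry_delay(attempt: int, initial_delay_seconds: int = 60,
                backoff_multiplier: float = 2.0) -> float:
    """Delay before retry number `attempt` (1-based): 60s, 120s, 240s, ..."""
    return initial_delay_seconds * backoff_multiplier ** (attempt - 1)

# With max_attempts: 3, the job waits 60s before the second attempt
# and 120s before the third.
assert retry_delay(1) == 60
assert retry_delay(2) == 120
```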
3. Circular Dependency Detection
The configuration should be validated at deployment time using graph cycle detection algorithms (e.g., DFS with recursion stack) to prevent circular dependencies.
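A compact sketch of that check: DFS over the `jobs` mapping with an explicit recursion-stack set, flagging any back edge as a cycle:

```python
def has_cycle(config: dict) -> bool:
    """Return True if the dependency configuration contains a cycle."""
    jobs = config["jobs"]
    visited, on_stack = set(), set()

    def dfs(job: str) -> bool:
        visited.add(job)
        on_stack.add(job)
        for dep in jobs.get(job, {}).get("dependencies", []):
            if dep in on_stack:
                return True  # Back edge: dep is an ancestor of job
            if dep not in visited and dfs(dep):
                return True
        on_stack.remove(job)
        return False

    return any(dfs(job) for job in jobs if job not in visited)
```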
4. Monitoring & Observability
- Metrics: Job duration, success rate, dependency wait time
- Logging: Structured logs with `run_id` correlation (example after this list)
- Tracing: Distributed tracing across job executions
- Alerting: Failed jobs, stuck pipelines, SLA violations
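As one illustration of `run_id` correlation, a structured-logging sketch using only the standard library; the field names are assumptions:

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("orchestrator")

def log_event(run_id: str, job_name: str, event: str, **fields) -> None:
    """Emit one JSON log line keyed by run_id for cross-job correlation."""
    logger.info(json.dumps({"run_id": run_id, "job": job_name, "event": event, **fields}))

log_event("uuid-12345", "extract-customers", "completed", duration_seconds=120)
```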
Comparison with Traditional Orchestrators
| Aspect | Event-Driven YAML | Traditional DAG (Airflow/Prefect) |
|---|---|---|
| Configuration | Declarative YAML | Python DAG files |
| Triggering | Real-time (event-driven) | Polling-based |
| Latency | Milliseconds | Seconds to minutes (polling interval) |
| Scalability | Horizontal (Pub/Sub scales automatically) | Vertical (scheduler bottleneck) |
| Learning Curve | Low (YAML + basic concepts) | High (Python + framework-specific APIs) |
| Job Coupling | Loose (jobs just publish events) | Tight (DAG imports job code) |
| Infrastructure | Lightweight (Pub/Sub + small orchestrator) | Heavy (scheduler, web server, database) |
Real-World Example
Consider a data pipeline with 28 jobs across six layers (Layer 0 through Layer 5):
```text
# Layer 0 (6 jobs): Extract from sources
- extract-market-data
- extract-product-catalog
- extract-customer-data
- extract-reference-tables
- extract-config-a
- extract-config-b

# Layer 1 (2 jobs): Primary processing
- process-headers (depends on extract-market-data)
- process-features (depends on process-headers)

# Layer 2 (2 jobs): Parallel processing
- process-metadata (depends on process-features)
  triggers: [build-entities-a, build-entities-b] (parallel)

# Layer 3 (8 jobs): Entity building
- build-entities-a (depends on process-metadata)
- build-entities-b (depends on process-metadata)
- build-descriptions-a (depends on build-entities-a)
- build-descriptions-b (depends on build-entities-b)
- build-relationships (depends on both entities)
- ...

# Layer 4 (5 jobs): Complex joins
- join-multi-source (depends on multiple Layer 3 jobs)
- calculate-derived-metrics
- ...

# Layer 5 (5 jobs): Final processing
- aggregate-reports
- export-to-downstream
- ...
```
Pipeline Execution:
- Layer 0: All 6 extract jobs run in parallel (no dependencies)
- Layer 1: Starts as soon as `extract-market-data` completes
- Layer 2: Triggered by Layer 1 completion
- Layer 3: `build-entities-a` and `build-entities-b` run in parallel
- Layers 4-5: Execute based on their specific dependencies
Total Pipeline Time: ~45 minutes (vs. 2+ hours with sequential execution)
Best Practices
1. Layer Assignments
Assign layers using topological sorting:
```python
import networkx as nx

def assign_layers(config):
    """Derive each job's layer index from the dependency graph."""
    graph = build_dependency_graph(config)
    # topological_generations yields jobs grouped by dependency depth
    for i, layer in enumerate(nx.topological_generations(graph)):
        for job in layer:
            config["jobs"][job]["layer"] = i
```
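The snippet above assumes a `build_dependency_graph` helper; a plausible sketch, with edges pointing from each dependency to its dependent so the generations come out in execution order:

```python
import networkx as nx

def build_dependency_graph(config: dict) -> nx.DiGraph:
    """Build a directed graph with an edge from each dependency to its dependent."""
    graph = nx.DiGraph()
    for job, spec in config["jobs"].items():
        graph.add_node(job)  # Keep jobs with no edges in the graph
        for dep in spec.get("dependencies", []):
            graph.add_edge(dep, job)
    return graph
```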
2. Naming Conventions
- Use descriptive, action-oriented names: `extract-customers`, not `job1`
- Include layer hints in names for readability: `l0-extract-*`, `l1-transform-*`
- Use consistent naming patterns across teams
3. Configuration Validation
- Validate YAML schema on commit
- Check for circular dependencies in CI/CD
- Verify all referenced jobs exist (see the sketch after this list)
- Test parallel execution groups for race conditions
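A minimal sketch of the reference check from this list, reusing the config loader assumed earlier; schema validation and the cycle check above would run alongside it in CI:

```python
def validate_references(config: dict) -> list[str]:
    """Collect every dependency or trigger that names an undefined job."""
    jobs = config["jobs"]
    errors = []
    for name, spec in jobs.items():
        for ref in spec.get("dependencies", []) + spec.get("triggers", []):
            if ref not in jobs:
                errors.append(f"{name} references unknown job: {ref}")
    return errors

assert validate_references(load_yaml_config()) == []
```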
4. Version Control
- Store configuration in Git alongside job code
- Use pull requests for dependency changes
- Tag releases for rollback capability
- Document major architectural changes
Conclusion
Event-driven ETL orchestration with declarative YAML configuration offers a compelling alternative to traditional workflow orchestrators for certain use cases. It excels in scenarios where:
- Low latency between jobs is critical
- The team prefers declarative over programmatic configuration
- Jobs are already well-modularized and loosely coupled
- Cloud-native messaging infrastructure is available
- The organization values simplicity over feature richness
However, it's not a silver bullet. Traditional orchestrators provide richer features like dynamic DAG generation, complex scheduling, and extensive monitoring UIs. The right choice depends on your specific requirements, team skills, and infrastructure constraints.
The layered dependency model combined with Pub/Sub messaging creates a powerful, scalable orchestration pattern that can handle complex pipelines while remaining simple to understand and maintain.