ETL Metadata Flow Visualization
A comprehensive visualization of metadata flow throughout the ETL process, inspired by process mining and BPMN notation.
Overview
This page demonstrates how metadata is captured, stored, and utilized throughout an ETL pipeline. The visualization follows BPMN (Business Process Model and Notation) style, showing the flow of data and metadata through each stage of the process.
Interactive ETL Metadata Flow Simulation
- Rotate: Click and drag to rotate the view
- Zoom: Scroll to zoom in/out
- Click stages: Click on any stage to see metadata details
- Controls: Use the bottom controls to pause, reset, or adjust speed
What you're seeing:
- Cyan cubes = Data packets flowing through the pipeline
- Golden spheres = Metadata being captured at each stage
- Each stage pulses to show active processing
- Stages are color-coded: Green (Start) → Blue (Extract) → Orange (Transform) → Purple (Load) → Red (End)
ETL Process Steps & Metadata
Process Start
Type: Start Event
Metadata Tables: etl_events
Description: When an ETL process starts, an event record is created with a unique run_id, a timestamp, and an event_type of 'process_start'.
Example Record:
{
"id": 1,
"run_id": "5611cbad-819b-4c2b-8a15-3f5efc64dc02",
"service_name": "customer_data_etl",
"event_type": "process_start",
"status": "success",
"message": "ETL process started",
"timestamp": "2023-06-15T12:30:45.000Z"
}
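A record like this could be built by a small helper. The sketch below is illustrative Python, not the page's actual implementation; the field names mirror the etl_events table, while the `start_event` function name is an assumption, and persistence (SQL insert, message queue, etc.) is left open.

```python
import uuid
from datetime import datetime, timezone

def start_event(service_name):
    """Build an etl_events record marking the start of an ETL run.

    Hypothetical helper: field names follow the etl_events table
    described above; how the record is stored is out of scope here.
    """
    return {
        "run_id": str(uuid.uuid4()),
        "service_name": service_name,
        "event_type": "process_start",
        "status": "success",
        "message": "ETL process started",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

event = start_event("customer_data_etl")
```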
Extract
Type: Activity
Metadata Tables: etl_events, etl_pagination_info
Source Tables:
- source_system.customers (customer_id, customer_name, email, last_update, status)
- source_system.orders (order_id, customer_id, order_date, status)
- source_system.products (product_id, product_name, price)
Pagination Information:
{
"page_number": 1,
"page_size": 100,
"fetched_count": 100,
"has_next_page": true,
"total_count": 1500
}
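A pagination loop producing records like this might look as follows. This is a sketch: `fetch_page` is a hypothetical callable standing in for the real source API, and the 250-row fake source exists only so the example runs end to end.

```python
def fetch_pages(fetch_page, page_size=100):
    """Iterate over a paginated source, yielding (rows, pagination_info).

    fetch_page(page_number, page_size) is a hypothetical callable that
    returns (rows, total_count); a real source API will differ.
    Each yielded info dict mirrors an etl_pagination_info record.
    """
    page_number = 1
    fetched = 0
    while True:
        rows, total = fetch_page(page_number, page_size)
        fetched += len(rows)
        info = {
            "page_number": page_number,
            "page_size": page_size,
            "fetched_count": len(rows),
            "has_next_page": fetched < total,
            "total_count": total,
        }
        yield rows, info
        if not info["has_next_page"]:
            break
        page_number += 1

# Fake 250-row source, for illustration only.
data = list(range(250))
def fake_fetch(page, size):
    start = (page - 1) * size
    return data[start:start + size], len(data)

pages = list(fetch_pages(fake_fetch))
```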
Transform
Type: Activity
Metadata Tables: etl_events, etl_checkpoints, etl_metrics
Transformations Applied:
- Join customers with orders
- Calculate order totals
- Format dates to ISO
- Derive customer_lifetime_value
- Calculate days_since_last_order
Checkpoint Example:
{
"entity_type": "customer",
"entity_id": "CUST-001",
"status": "success",
"processed_at": "2023-06-15T12:31:15.000Z"
}
Metrics Captured:
- transformation_time_ms: 250
- records_processed: 1500
- data_quality_score: 0.95
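A minimal sketch of how the transform stage might write etl_checkpoints and etl_metrics rows. In-memory lists stand in for the metadata store, and `transform` is a hypothetical per-entity function; a failure marks that entity's checkpoint as error rather than aborting the run.

```python
import time
from datetime import datetime, timezone

checkpoints, metrics = [], []  # stand-ins for the metadata tables

def transform_with_checkpoints(run_id, customers, transform):
    """Apply `transform` to each customer, recording one etl_checkpoints
    row per entity and summary etl_metrics rows at the end."""
    start = time.monotonic()
    ok = 0
    for cust in customers:
        status = "success"
        try:
            transform(cust)
            ok += 1
        except Exception:
            status = "error"  # record the failure, keep processing
        checkpoints.append({
            "run_id": run_id,
            "entity_type": "customer",
            "entity_id": cust["customer_id"],
            "status": status,
            "processed_at": datetime.now(timezone.utc).isoformat(),
        })
    metrics.append({"run_id": run_id, "metric_name": "records_processed",
                    "metric_value": len(customers), "metric_unit": "count"})
    metrics.append({"run_id": run_id, "metric_name": "transformation_time_ms",
                    "metric_value": (time.monotonic() - start) * 1000,
                    "metric_unit": "ms"})
    return ok

customers = [{"customer_id": f"CUST-{i:03d}"} for i in (1, 2, 3)]
ok = transform_with_checkpoints("run-1", customers, lambda c: None)
```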
Load
Type: Activity
Metadata Tables: etl_events, etl_checkpoints, etl_metrics
Target Tables:
- dwh.dim_customers (with SCD Type 2)
- dwh.dim_products (with SCD Type 2)
- dwh.fact_orders
Loading Strategy:
- Incremental load with Slowly Changing Dimensions (SCD) Type 2
- Partition by order_date
- Indexes on customer_id, order_id, product_id
Load Metrics:
{
"records_loaded": 1495,
"load_time_ms": 450,
"target_system": "data_warehouse"
}
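The SCD Type 2 strategy above can be sketched as a small in-memory merge. This illustrates only the versioning logic; valid_from, valid_to, and is_current are assumed bookkeeping columns, and a real warehouse load would do this in SQL.

```python
from datetime import date

def scd2_upsert(dim_rows, incoming, key, tracked, today=None):
    """Minimal Slowly Changing Dimension Type 2 merge (a sketch).

    dim_rows: current dimension, a list of dicts with valid_from /
    valid_to / is_current bookkeeping. incoming: the new snapshot,
    keyed by `key`. tracked: attributes whose change expires the
    current version and opens a new one.
    """
    today = today or date.today().isoformat()
    current = {r[key]: r for r in dim_rows if r["is_current"]}
    for row in incoming:
        old = current.get(row[key])
        if old and all(old[a] == row[a] for a in tracked):
            continue  # unchanged: keep the current version
        if old:  # changed: close out the current version
            old["is_current"] = False
            old["valid_to"] = today
        dim_rows.append({**row, "valid_from": today, "valid_to": None,
                         "is_current": True})
    return dim_rows

dim = []
scd2_upsert(dim, [{"customer_id": "CUST-001", "email": "old@example.com"}],
            "customer_id", ["email"], today="2023-06-15")
scd2_upsert(dim, [{"customer_id": "CUST-001", "email": "new@example.com"}],
            "customer_id", ["email"], today="2023-07-01")
```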
Process End
Type: End Event
Metadata Tables: etl_events, etl_metrics
Process Summary:
{
"total_records": 1500,
"successful_records": 1495,
"failed_records": 5,
"duration_seconds": 120,
"status": "success"
}
Metadata Tables Overview
The ETL process utilizes four primary metadata tables to track all aspects of data processing:
etl_events
Purpose: Records all significant events in the ETL process
Key Fields:
- run_id (UUID)
- service_name
- event_type
- status
- message
- timestamp
etl_checkpoints
Purpose: Tracks processing status of individual entities
Key Fields:
- run_id (UUID)
- entity_type
- entity_id
- status
- processed_at
etl_metrics
Purpose: Collects quantitative measurements of ETL performance
Key Fields:
- run_id (UUID)
- metric_name
- metric_value
- metric_unit
- timestamp
etl_pagination_info
Purpose: Manages pagination state for large dataset processing
Key Fields:
- run_id (UUID)
- page_number
- page_size
- has_next_page
- end_cursor
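The four tables above can be sketched as SQLite DDL for experimentation; the column types and constraints are illustrative assumptions, not the canonical warehouse schema.

```python
import sqlite3

# Illustrative schema for the four metadata tables (types are
# assumptions; a production schema would add constraints and indexes).
DDL = """
CREATE TABLE etl_events (
    id INTEGER PRIMARY KEY,
    run_id TEXT NOT NULL,
    service_name TEXT,
    event_type TEXT,
    status TEXT,
    message TEXT,
    timestamp TEXT
);
CREATE TABLE etl_checkpoints (
    run_id TEXT NOT NULL,
    entity_type TEXT,
    entity_id TEXT,
    status TEXT,
    processed_at TEXT
);
CREATE TABLE etl_metrics (
    run_id TEXT NOT NULL,
    metric_name TEXT,
    metric_value REAL,
    metric_unit TEXT,
    timestamp TEXT
);
CREATE TABLE etl_pagination_info (
    run_id TEXT NOT NULL,
    page_number INTEGER,
    page_size INTEGER,
    has_next_page INTEGER,
    end_cursor TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```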
Table Interconnections
All four metadata tables share a run_id (UUID) to correlate information from the same ETL run. This enables comprehensive analysis of each pipeline execution, from start to finish, including all events, checkpoints, metrics, and pagination state.
Common Metadata Queries:
- Find all events for a specific ETL run
- Identify failed checkpoints and their error details
- Calculate average processing time by entity type
- Track pagination progress for large dataset extractions
- Generate lineage reports showing data flow through the pipeline
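The first of these queries, finding all events for a specific run, can be sketched against an in-memory SQLite database; the cut-down schema and sample rows below are purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE etl_events (run_id TEXT, event_type TEXT, status TEXT, timestamp TEXT)"
)
# Sample rows for two runs, for illustration only.
conn.executemany(
    "INSERT INTO etl_events VALUES (?, ?, ?, ?)",
    [
        ("run-1", "process_start", "success", "2023-06-15T12:30:45Z"),
        ("run-1", "process_end", "success", "2023-06-15T12:32:45Z"),
        ("run-2", "process_start", "error", "2023-06-15T13:00:00Z"),
    ],
)

# "Find all events for a specific ETL run" as a parameterized query.
events = conn.execute(
    "SELECT event_type, status FROM etl_events WHERE run_id = ? ORDER BY timestamp",
    ("run-1",),
).fetchall()
```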
Error Handling & Recovery
The metadata system supports robust error handling and recovery mechanisms:
Error Detection
Errors are captured in etl_events with status='error' and detailed error messages.
Entity Tracking
Failed entities are marked in etl_checkpoints, enabling targeted retry logic.
Recovery
Checkpoints enable resuming from the last successful state without reprocessing.
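The resume logic can be sketched as a filter over the previous run's checkpoints: anything never checkpointed, or checkpointed with a non-success status, is still pending. The function name and dict shape are assumptions for illustration.

```python
def pending_entities(all_ids, checkpoints):
    """Return the entity ids that still need processing, given the
    etl_checkpoints rows from a previous run: ids never checkpointed,
    or checkpointed with a non-success status."""
    done = {c["entity_id"] for c in checkpoints if c["status"] == "success"}
    return [e for e in all_ids if e not in done]

checkpoints = [
    {"entity_id": "CUST-001", "status": "success"},
    {"entity_id": "CUST-002", "status": "error"},
]
todo = pending_entities(["CUST-001", "CUST-002", "CUST-003"], checkpoints)
```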