ETL Metadata Flow Visualization

A comprehensive visualization of metadata flow throughout the ETL process, inspired by process mining and BPMN.


Overview

This page demonstrates how metadata is captured, stored, and utilized throughout an ETL pipeline. The visualization follows BPMN (Business Process Model and Notation) style, showing the flow of data and metadata through each stage of the process.

Interactive ETL Metadata Flow Simulation

Interactive Controls:
  • Rotate: Click and drag to rotate the view
  • Zoom: Scroll to zoom in/out
  • Click stages: Click on any stage to see metadata details
  • Controls: Use the bottom controls to pause, reset, or adjust speed

What you're seeing:
  • Cyan cubes = Data packets flowing through the pipeline
  • Golden spheres = Metadata being captured at each stage
  • Each stage pulses to show active processing
  • Stages are color-coded: Green (Start) → Blue (Extract) → Orange (Transform) → Purple (Load) → Red (End)

ETL Process Steps & Metadata

Start

Type: Start Event

Metadata Tables: etl_events

Description: When an ETL process starts, an event record is created with a unique run_id, timestamp, and status 'started'.

Example Record:
{
  "id": 1,
  "run_id": "5611cbad-819b-4c2b-8a15-3f5efc64dc02",
  "service_name": "customer_data_etl",
  "event_type": "process_start",
  "status": "started",
  "message": "ETL process started",
  "timestamp": "2023-06-15T12:30:45.000Z"
}
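
The start event above can be sketched as a small helper that writes one etl_events row. This is a minimal illustration: the in-memory SQLite database and the log_event helper are assumptions for the sketch, not part of any real ETL framework, though the table and column names follow the examples on this page.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Illustrative in-memory stand-in for the metadata store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        run_id TEXT NOT NULL,
        service_name TEXT NOT NULL,
        event_type TEXT NOT NULL,
        status TEXT NOT NULL,
        message TEXT,
        timestamp TEXT NOT NULL
    )
""")

def log_event(run_id, service_name, event_type, status, message):
    """Hypothetical helper: insert one event row for the current run."""
    conn.execute(
        "INSERT INTO etl_events "
        "(run_id, service_name, event_type, status, message, timestamp) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (run_id, service_name, event_type, status, message,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

# Each ETL run gets a fresh run_id shared by every metadata table.
run_id = str(uuid.uuid4())
log_event(run_id, "customer_data_etl", "process_start", "started",
          "ETL process started")
```

Every subsequent stage would call the same helper with its own event_type, reusing the run_id minted here.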

Extract

Type: Activity

Metadata Tables: etl_events, etl_pagination_info

Source Tables:
  • source_system.customers (customer_id, customer_name, email, last_update, status)
  • source_system.orders (order_id, customer_id, order_date, status)
  • source_system.products (product_id, product_name, price)

Pagination Information:
{
  "page_number": 1,
  "page_size": 100,
  "fetched_count": 100,
  "has_next_page": true,
  "total_count": 1500
}
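
The pagination state above can drive a simple extract loop that records one state snapshot per page. This is a sketch under stated assumptions: fetch_page simulates the source system, and the dicts mirror etl_pagination_info rows rather than writing to a real table.

```python
def fetch_page(page_number, page_size, total_count=1500):
    """Simulated source query: returns one page of record ids plus a
    has-next flag. A real client would query source_system tables."""
    start = (page_number - 1) * page_size
    end = min(start + page_size, total_count)
    return list(range(start, end)), end < total_count

def extract_all(page_size=100):
    """Loop until the source reports no further pages, recording the
    pagination state after each fetch (one dict per etl_pagination_info row)."""
    pages = []
    page_number, has_next = 1, True
    while has_next:
        records, has_next = fetch_page(page_number, page_size)
        pages.append({
            "page_number": page_number,
            "page_size": page_size,
            "fetched_count": len(records),
            "has_next_page": has_next,
        })
        page_number += 1
    return pages

state = extract_all()  # 1500 records at page_size=100 -> 15 pages
```

Persisting each snapshot before fetching the next page is what lets an interrupted extract resume from its last recorded page rather than page 1.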

Transform

Type: Activity

Metadata Tables: etl_events, etl_checkpoints, etl_metrics

Transformations Applied:
  • Join customers with orders
  • Calculate order totals
  • Format dates to ISO
  • Derive customer_lifetime_value
  • Calculate days_since_last_order

Checkpoint Example:
{
  "entity_type": "customer",
  "entity_id": "CUST-001",
  "status": "success",
  "processed_at": "2023-06-15T12:31:15.000Z"
}

Metrics Captured:
  • transformation_time_ms: 250
  • records_processed: 1500
  • data_quality_score: 0.95
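
A minimal sketch of how the checkpoint and metric records above might be produced per entity. The in-memory lists, transform_customer, and run_transform are hypothetical stand-ins for the real tables and transform logic.

```python
import time

checkpoints = []   # stands in for the etl_checkpoints table
metrics = []       # stands in for the etl_metrics table

def transform_customer(record):
    """Example transformation: derive customer_lifetime_value
    from the customer's order totals."""
    record["customer_lifetime_value"] = sum(record["order_totals"])
    return record

def run_transform(records):
    """Transform each entity, writing one checkpoint per entity and
    run-level metrics at the end."""
    start = time.perf_counter()
    succeeded = 0
    for rec in records:
        try:
            transform_customer(rec)
            status = "success"
            succeeded += 1
        except Exception:
            status = "error"
        checkpoints.append({"entity_type": "customer",
                            "entity_id": rec["customer_id"],
                            "status": status})
    metrics.append({"metric_name": "transformation_time_ms",
                    "metric_value": (time.perf_counter() - start) * 1000,
                    "metric_unit": "ms"})
    metrics.append({"metric_name": "records_processed",
                    "metric_value": len(records),
                    "metric_unit": "count"})
    return succeeded

processed = run_transform([
    {"customer_id": "CUST-001", "order_totals": [120.0, 80.0]},
    {"customer_id": "CUST-002", "order_totals": [45.5]},
])
```

Writing the checkpoint inside the loop, success or failure, is what gives the recovery logic described later a complete per-entity record to work from.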

Load

Type: Activity

Metadata Tables: etl_events, etl_checkpoints, etl_metrics

Target Tables:
  • dwh.dim_customers (with SCD Type 2)
  • dwh.dim_products (with SCD Type 2)
  • dwh.fact_orders
Loading Strategy:
  • Incremental load with Slowly Changing Dimensions (SCD) Type 2
  • Partition by order_date
  • Indexes on customer_id, order_id, product_id

Load Metrics:
{
  "records_loaded": 1495,
  "load_time_ms": 450,
  "target_system": "data_warehouse"
}
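
The SCD Type 2 strategy above can be sketched as a close-and-insert upsert: when a tracked attribute changes, the current dimension row is closed out and a new current row is inserted. The column names valid_from, valid_to, and is_current are common SCD2 conventions assumed for this sketch, not confirmed by this page, and the dimension is a plain list rather than dwh.dim_customers.

```python
from datetime import date

dim_customers = []  # stands in for dwh.dim_customers

def upsert_scd2(customer_id, email, as_of):
    """SCD Type 2 upsert: unchanged rows are left alone; changed rows are
    end-dated and superseded by a new current row."""
    current = next((r for r in dim_customers
                    if r["customer_id"] == customer_id and r["is_current"]),
                   None)
    if current and current["email"] == email:
        return  # no change: keep the existing current row
    if current:
        # Close out the old version, preserving history.
        current["is_current"] = False
        current["valid_to"] = as_of
    dim_customers.append({"customer_id": customer_id, "email": email,
                          "valid_from": as_of, "valid_to": None,
                          "is_current": True})

upsert_scd2("CUST-001", "old@example.com", date(2023, 6, 1))
upsert_scd2("CUST-001", "new@example.com", date(2023, 6, 15))
```

After the second call the dimension holds two rows for CUST-001: the historical one valid through 2023-06-15 and the new current one, so fact rows can always be joined to the version that was current at their order_date.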

End

Type: End Event

Metadata Tables: etl_events, etl_metrics

Process Summary:
{
  "total_records": 1500,
  "successful_records": 1495,
  "failed_records": 5,
  "duration_seconds": 120,
  "status": "success"
}

Metadata Tables Overview

The ETL process utilizes four primary metadata tables to track all aspects of data processing:

etl_events

Purpose: Records all significant events in the ETL process

Key Fields:

  • run_id (UUID)
  • service_name
  • event_type
  • status
  • message
  • timestamp

etl_checkpoints

Purpose: Tracks processing status of individual entities

Key Fields:

  • run_id (UUID)
  • entity_type
  • entity_id
  • status
  • processed_at

etl_metrics

Purpose: Collects quantitative measurements of ETL performance

Key Fields:

  • run_id (UUID)
  • metric_name
  • metric_value
  • metric_unit
  • timestamp

etl_pagination_info

Purpose: Manages pagination state for large dataset processing

Key Fields:

  • run_id (UUID)
  • page_number
  • page_size
  • has_next_page
  • end_cursor

Table Interconnections

Key Point: All metadata tables use the same run_id (UUID) to correlate information from the same ETL run. This enables comprehensive analysis of each pipeline execution, from start to finish, including all events, checkpoints, metrics, and pagination state.

Common Metadata Queries:
  • Find all events for a specific ETL run
  • Identify failed checkpoints and their error details
  • Calculate average processing time by entity type
  • Track pagination progress for large dataset extractions
  • Generate lineage reports showing data flow through the pipeline
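
The first of these queries, pulling everything for one run_id, can be sketched against trimmed-down versions of the tables. The schemas are reduced to the key fields listed on this page, and the rows are made up for illustration.

```python
import sqlite3

# Illustrative in-memory metadata store with two correlated tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_events (
        run_id TEXT, event_type TEXT, status TEXT, timestamp TEXT);
    CREATE TABLE etl_metrics (
        run_id TEXT, metric_name TEXT, metric_value REAL);
    INSERT INTO etl_events VALUES
        ('run-1', 'process_start', 'started', '2023-06-15T12:30:45Z'),
        ('run-1', 'process_end',   'success', '2023-06-15T12:32:45Z'),
        ('run-2', 'process_start', 'started', '2023-06-16T09:00:00Z');
    INSERT INTO etl_metrics VALUES
        ('run-1', 'records_processed', 1500),
        ('run-1', 'transformation_time_ms', 250);
""")

# All events for one ETL run, in order.
events = conn.execute(
    "SELECT event_type, status FROM etl_events "
    "WHERE run_id = ? ORDER BY timestamp", ("run-1",)
).fetchall()

# A metric from the same run, correlated purely through run_id.
records = conn.execute(
    "SELECT metric_value FROM etl_metrics "
    "WHERE run_id = ? AND metric_name = 'records_processed'", ("run-1",)
).fetchone()
```

Because run_id is the only join key needed, the same pattern extends to checkpoints and pagination state without any extra foreign keys.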

Error Handling & Recovery

The metadata system supports robust error handling and recovery mechanisms:

Error Detection

Errors are captured in etl_events with status='error' and detailed error messages.

Entity Tracking

Failed entities are marked in etl_checkpoints, enabling targeted retry logic.

Recovery

Checkpoints enable resuming from the last successful state without reprocessing.
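
Resume logic can be sketched as a set difference against successful checkpoints: anything not marked 'success' for this run is reprocessed. The entities_to_retry helper and the sample data are hypothetical.

```python
def entities_to_retry(all_entity_ids, checkpoints):
    """Return the entities that still need processing: those with no
    checkpoint at all, or whose last checkpoint was not 'success'."""
    done = {c["entity_id"] for c in checkpoints if c["status"] == "success"}
    return [e for e in all_entity_ids if e not in done]

# Simulated etl_checkpoints rows from an interrupted run.
checkpoints = [
    {"entity_id": "CUST-001", "status": "success"},
    {"entity_id": "CUST-002", "status": "error"},
]

# CUST-003 never got a checkpoint; CUST-002 failed; only CUST-001 is skipped.
retry = entities_to_retry(["CUST-001", "CUST-002", "CUST-003"], checkpoints)
```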

Related Resources