ETL Metadata Flow Visualization

A comprehensive visualization of metadata flow throughout the ETL process, inspired by process mining and BPMN.


Overview

This page demonstrates how metadata is captured, stored, and utilized throughout an ETL pipeline. The visualization follows BPMN (Business Process Model and Notation) style, showing the flow of data and metadata through each stage of the process.

Interactive ETL Metadata Flow Simulation

Interactive Controls:
  • Rotate: Click and drag to rotate the view
  • Zoom: Scroll to zoom in/out
  • Click stages: Click on any stage to see metadata details
  • Controls: Use the bottom controls to pause, reset, or adjust speed

What you're seeing:
  • Cyan cubes = Data packets flowing through the pipeline
  • Golden spheres = Metadata being captured at each stage
  • Each stage pulses to show active processing
  • Stages are color-coded: Green (Start) → Blue (Extract) → Orange (Transform) → Purple (Load) → Red (End)

ETL Process Steps & Metadata

Start

Type: Start Event

Metadata Tables: etl_events

Description: When an ETL process starts, an event record is created with a unique run_id, timestamp, and status 'started'.

Example Record:
{
  "id": 1,
  "run_id": "5611cbad-819b-4c2b-8a15-3f5efc64dc02",
  "service_name": "customer_data_etl",
  "event_type": "process_start",
  "status": "started",
  "message": "ETL process started",
  "timestamp": "2023-06-15T12:30:45.000Z"
}
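
The start event above can be sketched as a small helper that writes one etl_events row. This is a minimal illustration: the in-memory SQLite database and the log_event helper are assumptions for the sketch, not part of any real ETL framework, though the table and column names follow the examples on this page.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# Illustrative in-memory stand-in for the metadata store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE etl_events (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        run_id TEXT NOT NULL,
        service_name TEXT NOT NULL,
        event_type TEXT NOT NULL,
        status TEXT NOT NULL,
        message TEXT,
        timestamp TEXT NOT NULL
    )
""")

def log_event(run_id, service_name, event_type, status, message):
    """Hypothetical helper: insert one event row for the current run."""
    conn.execute(
        "INSERT INTO etl_events "
        "(run_id, service_name, event_type, status, message, timestamp) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (run_id, service_name, event_type, status, message,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

# Each ETL run gets a fresh run_id shared by every metadata table.
run_id = str(uuid.uuid4())
log_event(run_id, "customer_data_etl", "process_start", "started",
          "ETL process started")
```

Every subsequent stage would call the same helper with its own event_type, reusing the run_id minted here.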

Extract

Type: Activity

Metadata Tables: etl_events, etl_pagination_info

Source Tables:
  • source_system.customers (customer_id, customer_name, email, last_update, status)
  • source_system.orders (order_id, customer_id, order_date, status)
  • source_system.products (product_id, product_name, price)

Pagination Information:
{
  "page_number": 1,
  "page_size": 100,
  "fetched_count": 100,
  "has_next_page": true,
  "total_count": 1500
}
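
The pagination state above can drive a simple extract loop that records one state snapshot per page. This is a sketch under stated assumptions: fetch_page simulates the source system, and the dicts mirror etl_pagination_info rows rather than writing to a real table.

```python
def fetch_page(page_number, page_size, total_count=1500):
    """Simulated source query: returns one page of record ids plus a
    has-next flag. A real client would query source_system tables."""
    start = (page_number - 1) * page_size
    end = min(start + page_size, total_count)
    return list(range(start, end)), end < total_count

def extract_all(page_size=100):
    """Loop until the source reports no further pages, recording the
    pagination state after each fetch (one dict per etl_pagination_info row)."""
    pages = []
    page_number, has_next = 1, True
    while has_next:
        records, has_next = fetch_page(page_number, page_size)
        pages.append({
            "page_number": page_number,
            "page_size": page_size,
            "fetched_count": len(records),
            "has_next_page": has_next,
        })
        page_number += 1
    return pages

state = extract_all()  # 1500 records at page_size=100 -> 15 pages
```

Persisting each snapshot before fetching the next page is what lets an interrupted extract resume from its last recorded page rather than page 1.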

Transform

Type: Activity

Metadata Tables: etl_events, etl_checkpoints, etl_metrics

Transformations Applied:
  • Join customers with orders
  • Calculate order totals
  • Format dates to ISO
  • Derive customer_lifetime_value
  • Calculate days_since_last_order

Checkpoint Example:
{
  "entity_type": "customer",
  "entity_id": "CUST-001",
  "status": "success",
  "processed_at": "2023-06-15T12:31:15.000Z"
}

Metrics Captured:
  • transformation_time_ms: 250
  • records_processed: 1500
  • data_quality_score: 0.95
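
A minimal sketch of how the checkpoint and metric records above might be produced per entity. The in-memory lists, transform_customer, and run_transform are hypothetical stand-ins for the real tables and transform logic.

```python
import time

checkpoints = []   # stands in for the etl_checkpoints table
metrics = []       # stands in for the etl_metrics table

def transform_customer(record):
    """Example transformation: derive customer_lifetime_value
    from the customer's order totals."""
    record["customer_lifetime_value"] = sum(record["order_totals"])
    return record

def run_transform(records):
    """Transform each entity, writing one checkpoint per entity and
    run-level metrics at the end."""
    start = time.perf_counter()
    succeeded = 0
    for rec in records:
        try:
            transform_customer(rec)
            status = "success"
            succeeded += 1
        except Exception:
            status = "error"
        checkpoints.append({"entity_type": "customer",
                            "entity_id": rec["customer_id"],
                            "status": status})
    metrics.append({"metric_name": "transformation_time_ms",
                    "metric_value": (time.perf_counter() - start) * 1000,
                    "metric_unit": "ms"})
    metrics.append({"metric_name": "records_processed",
                    "metric_value": len(records),
                    "metric_unit": "count"})
    return succeeded

processed = run_transform([
    {"customer_id": "CUST-001", "order_totals": [120.0, 80.0]},
    {"customer_id": "CUST-002", "order_totals": [45.5]},
])
```

Writing the checkpoint inside the loop, success or failure, is what gives the recovery logic described later a complete per-entity record to work from.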

Load

Type: Activity

Metadata Tables: etl_events, etl_checkpoints, etl_metrics

Target Tables:
  • dwh.dim_customers (with SCD Type 2)
  • dwh.dim_products (with SCD Type 2)
  • dwh.fact_orders
Loading Strategy:
  • Incremental load with Slowly Changing Dimensions (SCD) Type 2
  • Partition by order_date
  • Indexes on customer_id, order_id, product_id

Load Metrics:
{
  "records_loaded": 1495,
  "load_time_ms": 450,
  "target_system": "data_warehouse"
}
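
The SCD Type 2 strategy above can be sketched as a close-and-insert upsert: when a tracked attribute changes, the current dimension row is closed out and a new current row is inserted. The column names valid_from, valid_to, and is_current are common SCD2 conventions assumed for this sketch, not confirmed by this page, and the dimension is a plain list rather than dwh.dim_customers.

```python
from datetime import date

dim_customers = []  # stands in for dwh.dim_customers

def upsert_scd2(customer_id, email, as_of):
    """SCD Type 2 upsert: unchanged rows are left alone; changed rows are
    end-dated and superseded by a new current row."""
    current = next((r for r in dim_customers
                    if r["customer_id"] == customer_id and r["is_current"]),
                   None)
    if current and current["email"] == email:
        return  # no change: keep the existing current row
    if current:
        # Close out the old version, preserving history.
        current["is_current"] = False
        current["valid_to"] = as_of
    dim_customers.append({"customer_id": customer_id, "email": email,
                          "valid_from": as_of, "valid_to": None,
                          "is_current": True})

upsert_scd2("CUST-001", "old@example.com", date(2023, 6, 1))
upsert_scd2("CUST-001", "new@example.com", date(2023, 6, 15))
```

After the second call the dimension holds two rows for CUST-001: the historical one valid through 2023-06-15 and the new current one, so fact rows can always be joined to the version that was current at their order_date.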

End

Type: End Event

Metadata Tables: etl_events, etl_metrics

Process Summary:
{
  "total_records": 1500,
  "successful_records": 1495,
  "failed_records": 5,
  "duration_seconds": 120,
  "status": "success"
}

Metadata Tables Overview

The ETL process utilizes four primary metadata tables to track all aspects of data processing:

etl_events

Purpose: Records all significant events in the ETL process

Key Fields:

  • run_id (UUID)
  • service_name
  • event_type
  • status
  • message
  • timestamp

etl_checkpoints

Purpose: Tracks processing status of individual entities

Key Fields:

  • run_id (UUID)
  • entity_type
  • entity_id
  • status
  • processed_at

etl_metrics

Purpose: Collects quantitative measurements of ETL performance

Key Fields:

  • run_id (UUID)
  • metric_name
  • metric_value
  • metric_unit
  • timestamp

etl_pagination_info

Purpose: Manages pagination state for large dataset processing

Key Fields:

  • run_id (UUID)
  • page_number
  • page_size
  • has_next_page
  • end_cursor

Table Interconnections

Key Point: All metadata tables use the same run_id (UUID) to correlate information from the same ETL run. This enables comprehensive analysis of each pipeline execution, from start to finish, including all events, checkpoints, metrics, and pagination state.

Common Metadata Queries:
  • Find all events for a specific ETL run
  • Identify failed checkpoints and their error details
  • Calculate average processing time by entity type
  • Track pagination progress for large dataset extractions
  • Generate lineage reports showing data flow through the pipeline
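
The first of these queries, pulling everything for one run_id, can be sketched against trimmed-down versions of the tables. The schemas are reduced to the key fields listed on this page, and the rows are made up for illustration.

```python
import sqlite3

# Illustrative in-memory metadata store with two correlated tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_events (
        run_id TEXT, event_type TEXT, status TEXT, timestamp TEXT);
    CREATE TABLE etl_metrics (
        run_id TEXT, metric_name TEXT, metric_value REAL);
    INSERT INTO etl_events VALUES
        ('run-1', 'process_start', 'started', '2023-06-15T12:30:45Z'),
        ('run-1', 'process_end',   'success', '2023-06-15T12:32:45Z'),
        ('run-2', 'process_start', 'started', '2023-06-16T09:00:00Z');
    INSERT INTO etl_metrics VALUES
        ('run-1', 'records_processed', 1500),
        ('run-1', 'transformation_time_ms', 250);
""")

# All events for one ETL run, in order.
events = conn.execute(
    "SELECT event_type, status FROM etl_events "
    "WHERE run_id = ? ORDER BY timestamp", ("run-1",)
).fetchall()

# A metric from the same run, correlated purely through run_id.
records = conn.execute(
    "SELECT metric_value FROM etl_metrics "
    "WHERE run_id = ? AND metric_name = 'records_processed'", ("run-1",)
).fetchone()
```

Because run_id is the only join key needed, the same pattern extends to checkpoints and pagination state without any extra foreign keys.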

Error Handling & Recovery

The metadata system supports robust error handling and recovery mechanisms:

Error Detection

Errors are captured in etl_events with status='error' and detailed error messages.

Entity Tracking

Failed entities are marked in etl_checkpoints, enabling targeted retry logic.

Recovery

Checkpoints enable resuming from the last successful state without reprocessing.
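
Resume logic can be sketched as a set difference against successful checkpoints: anything not marked 'success' for this run is reprocessed. The entities_to_retry helper and the sample data are hypothetical.

```python
def entities_to_retry(all_entity_ids, checkpoints):
    """Return the entities that still need processing: those with no
    checkpoint at all, or whose last checkpoint was not 'success'."""
    done = {c["entity_id"] for c in checkpoints if c["status"] == "success"}
    return [e for e in all_entity_ids if e not in done]

# Simulated etl_checkpoints rows from an interrupted run.
checkpoints = [
    {"entity_id": "CUST-001", "status": "success"},
    {"entity_id": "CUST-002", "status": "error"},
]

# CUST-003 never got a checkpoint; CUST-002 failed; only CUST-001 is skipped.
retry = entities_to_retry(["CUST-001", "CUST-002", "CUST-003"], checkpoints)
```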

Related Resources