System Architecture
1. Overview
- Pattern
  - Node-based pipeline over a medallion flow: Bronze -> Silver -> Gold -> Analysis -> Sinks.
  - Raw data -> cleaned tables -> business metrics -> views -> exports.
- Principles
  - All transforms use nodes with explicit dependencies.
  - Public interface is small: `list_targets`, `run`, `visualize_graph`.
  - Nodes live under `nodes/`, logic under `layers/`, infra under `infrastructure/`.
  - DAG images in `docs/assets/dag/` show full lineage.
2. Public API
- `list_targets()` - Returns all available targets.
- `run(targets, **inputs)` - Builds the graph, resolves dependencies, and runs only the required nodes.
  - Example: `run(["operator_petrinex_emissions"], target_year=2023)`.
- `visualize_graph(path)` - Writes a PNG of the DAG to `path`.
- Helper: `run_all(...)` wraps `run()` for standard runs and the CLI; it is not part of the public API.
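For example (a minimal sketch; the dict-of-DataFrames return shape is an assumption, not a documented contract):

```python
import validere

# Discover what can be built, run one target for a given year, and render the DAG.
targets = validere.list_targets()
results = validere.run(["operator_petrinex_emissions"], target_year=2023)

# Assumption: run() returns results keyed by target name.
emissions = results["operator_petrinex_emissions"]
validere.visualize_graph("artifacts/graphs/dag.png")
```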
3. Data flow and layers
- Bronze
  - Purpose: raw ingestion with minimal transforms.
  - Input: external source data.
  - Output: partitioned parquet under `data/bronze/`.
- Silver
  - Purpose: cleaning, standardization, aggregation, metadata joins.
  - Input: bronze tables.
  - Output: standardized parquet under `data/silver/`.
- Gold
  - Purpose: emissions, features, clustering, rankings, decision metrics, ML models.
  - Input: silver tables.
  - Output: analytics tables and views under `data/gold/`.
- Analysis
  - Purpose: scenarios, sensitivities, uncertainty, risk adjustments.
  - Input: gold tables and metrics.
  - Output: analysis tables and figures.
- Sinks
  - Purpose: materialization.
  - Data sinks: parquet exports.
  - Visualization sinks: PNG figures in `docs/assets/figures/`.
4. Node modules (nodes/)
- Bronze IO
  - `bronze_io.py`: load raw volumes, NGL, infrastructure, business associates, links, licences.
- Silver core
  - `silver_core.py`: activities, production, NGL, facility production, flows, metadata, operator aggregates.
- Gold emissions
  - `gold_emissions.py`: facility and operator emissions tables.
- Gold segmentation
  - `gold_segmentation.py`: features, clusters, cluster summaries, cluster opportunity metrics.
- Gold ranking
  - `gold_ranking.py`: baseline rankings, composite rankings, comparisons, key differences.
- Gold decision
  - `gold_decision.py`: `operator_decision_metrics` and related investment, operations, and compliance outputs.
- Gold ML
  - `gold_ml_panel.py`: loads the pre-built multi-year operator panel from disk.
  - `gold_ml_reduction.py`: reduction prediction nodes (features, labels, model, predictions, metrics).
  - `gold_ml_predictions.py`: loads pre-computed ML predictions from parquet (production path).
  - Panel architecture:
    - The main DAG operates on single-year cross-sections (parameterized by `target_year`); ML requires multi-year panels for temporal feature engineering and time-series splits.
    - Pattern: the panel is pre-built outside Hamilton using `scripts/ml/build_operator_panel.py`.
    - Node behavior: `operator_panel_multiyear` loads the panel from `data/gold/operator_panel_2022_2023.parquet`.
    - Rationale: avoids the anti-pattern of calling `_build_driver()` from inside nodes (see CLAUDE.md).
    - Benefits: clean separation of concerns, no recursive DAG execution, faster ML iteration.
- ML Helpers (`layers/gold`)
  - `ml_panel.py`: panel construction utilities.
  - `ml_reduction.py`: feature engineering, label construction, model training, evaluation.
  - `ml_targets.py`: target variable definitions for supervised learning.
  - `ml_models.py`: scikit-learn pipelines and model utilities.
- Gold API views
  - `gold_api_views.py`: `operator_viz_view` and related views for visualizations.
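Node modules follow the signature-as-dependency convention from the overview: a node's parameters name the upstream nodes it needs. A hypothetical silver-layer node might look like this sketch (function and column names are illustrative, not the real node set):

```python
import pandas as pd


def facility_production(raw_volumes: pd.DataFrame, facility_metadata: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical silver node: parameter names declare the upstream bronze nodes."""
    cleaned = (
        raw_volumes
        .dropna(subset=["facility_id", "volume"])           # minimal cleaning
        .assign(volume=lambda df: df["volume"].astype(float))
    )
    # Join metadata so downstream gold nodes see a standardized facility-month table.
    return cleaned.merge(facility_metadata, on="facility_id", how="left")
```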
ML architecture and status
Offline training + DAG loader pattern
We treat ML training as an offline concern and only load predictions inside the Hamilton DAG.
- Training
  - Reduction models are trained by scripts, not from inside nodes.
  - `scripts/ml/build_operator_panel.py` builds the multi-year operator panel.
  - `scripts/ml/run_reduction_ml_panel.py` trains the reduction model on that panel and writes:
    - `data/gold/reduction_predictions_{year}.parquet`
    - `data/gold/reduction_metrics_{year}.json`
    - `data/gold/reduction_feature_importance_{year}.parquet`
- Consumption
  - `operator_reduction_predictions_from_parquet` (in `nodes/gold_ml_predictions.py`) loads predictions if they exist.
  - `operator_viz_view` attaches `predicted_reduction_t` when available.
  - All visualizations and API views remain functional when predictions are missing.
- Graceful degradation (sketched below)
  - If the prediction parquet files do not exist, the loader returns an empty DataFrame with the correct schema.
  - `predicted_reduction_t` becomes `NaN`; charts and tables fall back to purely heuristic scores.
  - The core physics-based and financial pipeline is never blocked by missing ML artifacts.
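A minimal sketch of that fallback (file naming follows the training outputs listed above; the key column is an assumption):

```python
from pathlib import Path

import pandas as pd

# Assumed prediction schema: operator key plus the predicted reduction column.
PREDICTION_SCHEMA = {"operator_baid": "string", "predicted_reduction_t": "float64"}


def load_reduction_predictions(year: int, root: str = "data/gold") -> pd.DataFrame:
    """Load predictions if present; otherwise return an empty, correctly-typed frame."""
    path = Path(root) / f"reduction_predictions_{year}.parquet"
    if not path.exists():
        # Downstream left-joins then yield NaN for predicted_reduction_t.
        return pd.DataFrame({col: pd.Series(dtype=t) for col, t in PREDICTION_SCHEMA.items()})
    return pd.read_parquet(path)
```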
What the reduction ML pipeline does
The reduction ML pipeline is architecturally complete but intentionally treated as exploratory with 2 years of data.
- Panel construction
  - `operator_panel_multiyear` (from `nodes/gold_ml_panel.py`) loads `operator_panel_2022_2023.parquet`.
  - Grain: (operator_baid, year) with emissions, production, financial metrics, and scores.
- Feature and label building
  - `build_reduction_features_from_panel` constructs ML features from year t values (scale, intensity, NPV, risk, trends).
  - `build_reduction_labels_from_panel` constructs `realized_reduction_t = E_total_t − E_total_{t+1}` for operators with both years.
  - Training set: 366 operator-years (2022 with 2023 as t+1).
- Model training (offline, sketched below)
  - RandomForestRegressor with numeric + categorical preprocessing (SimpleImputer, StandardScaler, OneHotEncoder).
  - Time-aware train/test split: train on years < max(year), test on max(year).
  - Evaluation metrics (R², MAE, RMSE, bias, correlation) saved to JSON for inspection.
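A condensed sketch of that offline training step (feature columns are placeholders; the real feature set lives in `layers/gold/ml_reduction.py`):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["e_total_t", "production_t", "opportunity_score"]   # placeholder feature names
CATEGORICAL = ["cluster_label"]                                # placeholder feature name
LABEL = "realized_reduction_t"                                 # E_total_t - E_total_{t+1}


def train_reduction_model(panel: pd.DataFrame) -> Pipeline:
    """Time-aware split: fit on years < max(year); max(year) is held out for evaluation."""
    train = panel[panel["year"] < panel["year"].max()]

    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    model = Pipeline([("prep", preprocess), ("rf", RandomForestRegressor(random_state=0))])
    model.fit(train[NUMERIC + CATEGORICAL], train[LABEL])
    return model
```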
With only 2 years of data, these models are positioned as validation tools, not production forecasters. The primary value today is checking alignment between heuristic opportunity scores and realized behaviour.
When ML would become production-grade
To make reduction ML a first-class production input we would need:
- more years of data (4–5 years, e.g. 2020–2024) to capture both growth and contraction cycles
- explicit production scenario controls (growth vs stable vs decline) and interaction terms (e.g. opportunity_score × production_growth_pct)
- external drivers (commodity prices, regulatory changes, capital cycles) as features
- time-series cross-validation and out-of-sample validation (e.g. train on 2020–2023, test on 2024)
- clear business rules: ML refines ranking and segmentation but does not replace physics-based emissions or regulatory calculations
In the current state, the architecture is in place and battle-tested. The pipeline can start producing production-grade ML outputs once the data horizon and validation strategy are extended.
Repro steps
The end-to-end workflow is:
```bash
# 1. Build the multi-year operator panel (required for ML and reduction exports)
python scripts/ml/build_operator_panel.py

# 2. Export the reduction analysis tables (descriptive validation)
python scripts/reporting/export_reduction_tables.py

# 3. Train the reduction ML model (optional, exploratory with 2 years of data)
python scripts/ml/run_reduction_ml_panel.py

# 4. Run the main pipeline and generate all visualizations
validere usable --year 2023
```

Notes:
- Step 2 writes the CSVs that back the “realized reductions” and segmentation findings.
- Step 3 writes ML predictions and diagnostics; the DAG consumes them if present and ignores them if not.
- Step 4 is the primary entry point for users; they do not need to run ML training to get a consistent decision pipeline.
- Analysis core
  - `analysis_core.py`: carbon price sensitivity, weight sensitivity, scenario results, uncertainty, risk-adjusted scores.
- Visualization sinks
  - `viz_working.py`: core charts (emissions decomposition, intensity rankings, production vs intensity).
  - `viz_decision.py`: decision charts (MAC curve, efficiency frontier, thresholds, risk return, scenario shift, reduction heatmap).
- IO sinks
  - `sinks_io.py`: parquet writers for emissions, rankings, clusters, analysis outputs.
5. Helpers and infrastructure
- Silver helpers (`layers/silver`)
  - `activities.py`, `production.py`, `ngl.py`, `flows.py`, `metadata.py`, `throughput.py`.
- Gold helpers (`layers/gold`)
  - `emissions.py`, `features.py`, `clustering.py`, `rankings.py`, `comparison.py`, `sensitivity.py`, `scenarios.py`, `uncertainty.py`, `risk.py`, `ml_targets.py`, `ml_models.py`.
- Infrastructure (`infrastructure/`)
  - `parquet_writer.py`: partitioned parquet with schema validation (sketched below).
  - `transformers.py`: shared transforms.
  - `validators.py`: data quality checks.
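A minimal sketch of what `parquet_writer.py` is responsible for (the validation rule and signature are illustrative):

```python
from pathlib import Path

import pandas as pd


def write_partitioned(df: pd.DataFrame, path: str, required: dict[str, str]) -> None:
    """Validate required columns/dtypes, then write Snappy parquet partitioned by year and month."""
    missing = set(required) - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    df = df.astype(required)  # coerce to the declared schema before writing
    Path(path).mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, partition_cols=["year", "month"], compression="snappy")
```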
6. Execution and DAGs
- Typical usage
  - `targets = validere.list_targets()`
  - `results = validere.run(["operator_petrinex_emissions"], target_year=2023)`
  - `graph_path = validere.visualize_graph("artifacts/graphs/dag.png")`
- Graph behavior
  - Discover nodes from `validere.nodes.*`.
  - Build a dependency graph from function signatures (see the sketch after this list).
  - Validate for cycles.
  - Execute nodes in dependency order, with parallelism where possible.
- DAG images
  - Stored in `docs/assets/dag/`.
  - Include full pipeline and layer views (bronze, silver, gold emissions, gold analytics, rankings, analysis, viz).
  - Generated with `python -m validere.visualization.dag docs/assets/dag`.
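The graph-build step above can be sketched in a few lines (a hypothetical helper, not the framework's actual code): nodes are discovered as module functions, edges come from parameter names, and `graphlib` flags cycles.

```python
import inspect
from graphlib import CycleError, TopologicalSorter


def build_execution_order(module) -> list[str]:
    """Return node names in dependency order; raise if the graph has a cycle."""
    nodes = {
        name: fn
        for name, fn in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_")
    }
    # A parameter that matches another node's name is a dependency edge.
    deps = {
        name: {p for p in inspect.signature(fn).parameters if p in nodes}
        for name, fn in nodes.items()
    }
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        raise ValueError(f"dependency cycle detected: {exc.args[1]}") from exc
```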
7. Data schema and SCD2
- Silver tables
  - Production and NGL tables at facility-month grain.
  - Facility edges for flows and attribution.
  - `dim_facility` as a slowly changing dimension (SCD2).
- Gold tables
  - `facility_aggregated`: facility-month KPIs and intensities.
  - `emission_attribution`: facility-month per sender.
  - Additional gold outputs for operator and decision metrics.
- SCD2 pattern
  - Implemented in `silver.dim_facility` with `_valid_from`, `_valid_to`, `_is_current`, `version`.
  - Supports point-in-time joins and a history of facility attributes (see the sketch after this list).
- Storage
  - Parquet format with Snappy compression.
  - Partitioning by year and month where appropriate.
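Point-in-time joins against the SCD2 dimension can be expressed as below (a sketch; the fact-side `report_date` column and the open-ended `_valid_to` convention are assumptions):

```python
import pandas as pd


def point_in_time_join(facts: pd.DataFrame, dim_facility: pd.DataFrame) -> pd.DataFrame:
    """Attach facility attributes valid as of each fact row's report_date."""
    joined = facts.merge(dim_facility, on="facility_id", how="left")
    # Keep the dimension version whose validity window covers report_date;
    # current rows are assumed to have an open (null) _valid_to.
    in_window = (joined["report_date"] >= joined["_valid_from"]) & (
        (joined["report_date"] < joined["_valid_to"]) | joined["_valid_to"].isna()
    )
    return joined[in_window]
```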
8. Visualization API
- Purpose
  - Provide stable views for visualization.
  - Keep viz nodes free of joins and business rules.
- Operator view
  - Built in `layers/gold/api_views.py`, exposed as `operator_viz_view`.
  - Grain: one row per operator per year.
  - Contains ids, production, emissions, intensity, facility count, decision metrics, scenario ranks.
  - Guarantees: required columns exist, types are stable, data is already filtered by `target_year`.
- Facility view
  - Grain: one row per facility per year.
  - Contains facility and operator ids, production, activities, emissions.
  - Same guarantees on presence and types.
- Rules for viz nodes (see the sketch after this list)
  - Take `operator_viz_view` or the facility view as input, not raw gold tables.
  - Do not rejoin gold tables inside viz functions.
  - Assume view schemas are correct; avoid defensive column checks.
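Under those rules, a viz node is just a rendering function over the view, as in this sketch (the chart and the `emissions_intensity` column are illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd


def intensity_ranking_chart(operator_viz_view: pd.DataFrame) -> str:
    """Render a top-20 intensity ranking straight from the view; no joins, no business rules."""
    top = operator_viz_view.nlargest(20, "emissions_intensity")
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh(top["operator_baid"].astype(str), top["emissions_intensity"])
    ax.set_xlabel("Emissions intensity")
    out = "docs/assets/figures/intensity_ranking.png"
    fig.savefig(out, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out
```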
9. Boundaries, runsets, docs
- Public API
  - `validere.list_targets`
  - `validere.run`
  - `validere.visualize_graph`
- Deprecated
  - `validere.domain` and anything under `_archive` are legacy and not used in new code.
- Runsets
  - `validere.runsets.USABLE_TARGETS`: standard end-to-end run.
  - `validere.runsets.DAG_LAYER_TARGETS`: layer DAG targets.
- Documentation
  - Manually maintained: main narrative and handoff files in `docs/` (problem framing, methods, results, implications, architecture, decision playbooks, technical_handoff, troubleshooting).
  - Auto-generated: everything under `docs/api/`, built from source docstrings with `quartodoc build`.