System Architecture
1. Overview
- Pattern
  - Node-based pipeline over a medallion flow: Bronze -> Silver -> Gold -> Analysis -> Sinks.
  - Raw data -> cleaned tables -> business metrics -> views -> exports.
- Principles
  - All transforms use nodes with explicit dependencies.
  - Public interface is small: `list_targets`, `run`, `visualize_graph`.
  - Nodes live under `nodes/`, logic under `layers/`, infra under `infrastructure/`.
  - DAG images in `docs/assets/dag/` show full lineage.
2. Public API
- `list_targets()` - Returns all available targets.
- `run(targets, **inputs)` - Builds the graph, resolves dependencies, and runs only the required nodes.
  - Example: `run(["operator_petrinex_emissions"], target_year=2023)`.
- `visualize_graph(path)` - Writes a PNG of the DAG to `path`.
- Helper: `run_all(...)` wraps `run()` for standard runs and the CLI; it is not part of the public API.
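For example (a minimal sketch; the dict-of-DataFrames return shape is an assumption, not a documented contract):

```python
import validere

# Discover what can be built, run one target for a given year, and render the DAG.
targets = validere.list_targets()
results = validere.run(["operator_petrinex_emissions"], target_year=2023)

# Assumption: run() returns results keyed by target name.
emissions = results["operator_petrinex_emissions"]
validere.visualize_graph("artifacts/graphs/dag.png")
```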
3. Data flow and layers
- Bronze
  - Purpose: raw ingestion with minimal transforms.
  - Input: external source data.
  - Output: partitioned parquet under `data/bronze/`.
- Silver
  - Purpose: cleaning, standardization, aggregation, metadata joins.
  - Input: bronze tables.
  - Output: standardized parquet under `data/silver/`.
- Gold
  - Purpose: emissions, features, clustering, rankings, decision metrics, ML models.
  - Input: silver tables.
  - Output: analytics tables and views under `data/gold/`.
- Analysis
  - Purpose: scenarios, sensitivities, uncertainty, risk adjustments.
  - Input: gold tables and metrics.
  - Output: analysis tables and figures.
- Sinks
  - Purpose: materialization.
  - Data sinks: parquet exports.
  - Visualization sinks: PNG figures in `docs/assets/figures/`.
4. Node modules (nodes/)
- Bronze IO
  - `bronze_io.py`: load raw volumes, NGL, infrastructure, business associates, links, licences.
- Silver core
  - `silver_core.py`: activities, production, NGL, facility production, flows, metadata, operator aggregates.
- Gold emissions
  - `gold_emissions.py`: facility and operator emissions tables.
- Gold segmentation
  - `gold_segmentation.py`: features, clusters, cluster summaries, cluster opportunity metrics.
- Gold ranking
  - `gold_ranking.py`: baseline rankings, composite rankings, comparisons, key differences.
- Gold decision
  - `gold_decision.py`: `operator_decision_metrics` and related investment, operations, and compliance outputs.
- Gold ML
  - `gold_ml_panel.py`: loads the pre-built multi-year operator panel from disk.
  - `gold_ml_reduction.py`: reduction prediction nodes (features, labels, model, predictions, metrics).
  - `gold_ml_predictions.py`: loads pre-computed ML predictions from parquet (production path).
  - Panel architecture:
    - The main DAG operates on single-year cross-sections (parameterized by `target_year`); ML requires multi-year panels for temporal feature engineering and time-series splits.
    - Pattern: the panel is pre-built outside Hamilton using `scripts/ml/build_operator_panel.py`.
    - Node behavior: `operator_panel_multiyear` loads the panel from `data/gold/operator_panel_2022_2023.parquet`.
    - Rationale: avoids the anti-pattern of calling `_build_driver()` from inside nodes (see CLAUDE.md).
    - Benefits: clean separation of concerns, no recursive DAG execution, faster ML iteration.
- ML Helpers (`layers/gold`)
  - `ml_panel.py`: panel construction utilities.
  - `ml_reduction.py`: feature engineering, label construction, model training, evaluation.
  - `ml_targets.py`: target variable definitions for supervised learning.
  - `ml_models.py`: scikit-learn pipelines and model utilities.
- Gold API views
  - `gold_api_views.py`: `operator_viz_view` and related views for visualizations.
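Node modules follow the signature-as-dependency convention from the overview: a node's parameters name the upstream nodes it needs. A hypothetical silver-layer node might look like this sketch (function and column names are illustrative, not the real node set):

```python
import pandas as pd


def facility_production(raw_volumes: pd.DataFrame, facility_metadata: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical silver node: parameter names declare the upstream bronze nodes."""
    cleaned = (
        raw_volumes
        .dropna(subset=["facility_id", "volume"])           # minimal cleaning
        .assign(volume=lambda df: df["volume"].astype(float))
    )
    # Join metadata so downstream gold nodes see a standardized facility-month table.
    return cleaned.merge(facility_metadata, on="facility_id", how="left")
```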
ML architecture and status
Offline training + DAG loader pattern
We treat ML training as an offline concern and only load predictions inside the Hamilton DAG.
- Training
  - Reduction models are trained by scripts, not from inside nodes.
  - `scripts/ml/build_operator_panel.py` builds the multi-year operator panel.
  - `scripts/ml/run_reduction_ml_panel.py` trains the reduction model on that panel and writes:
    - `data/gold/reduction_predictions_{year}.parquet`
    - `data/gold/reduction_metrics_{year}.json`
    - `data/gold/reduction_feature_importance_{year}.parquet`
- Consumption
  - `operator_reduction_predictions_from_parquet` (in `nodes/gold_ml_predictions.py`) loads predictions if they exist.
  - `operator_viz_view` attaches `predicted_reduction_t` when available.
  - All visualizations and API views remain functional when predictions are missing.
- Graceful degradation (sketched below)
  - If the prediction parquet files do not exist, the loader returns an empty DataFrame with the correct schema.
  - `predicted_reduction_t` becomes `NaN`; charts and tables fall back to purely heuristic scores.
  - The core physics-based and financial pipeline is never blocked by missing ML artifacts.
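A minimal sketch of that fallback (file naming follows the training outputs listed above; the key column is an assumption):

```python
from pathlib import Path

import pandas as pd

# Assumed prediction schema: operator key plus the predicted reduction column.
PREDICTION_SCHEMA = {"operator_baid": "string", "predicted_reduction_t": "float64"}


def load_reduction_predictions(year: int, root: str = "data/gold") -> pd.DataFrame:
    """Load predictions if present; otherwise return an empty, correctly-typed frame."""
    path = Path(root) / f"reduction_predictions_{year}.parquet"
    if not path.exists():
        # Downstream left-joins then yield NaN for predicted_reduction_t.
        return pd.DataFrame({col: pd.Series(dtype=t) for col, t in PREDICTION_SCHEMA.items()})
    return pd.read_parquet(path)
```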
What the reduction ML pipeline does
The reduction ML pipeline is architecturally complete but intentionally treated as exploratory with 2 years of data.
- Panel construction
  - `operator_panel_multiyear` (from `nodes/gold_ml_panel.py`) loads `operator_panel_2022_2023.parquet`.
  - Grain: (operator_baid, year) with emissions, production, financial metrics, and scores.
- Feature and label building
  - `build_reduction_features_from_panel` constructs ML features from year t values (scale, intensity, NPV, risk, trends).
  - `build_reduction_labels_from_panel` constructs `realized_reduction_t = E_total_t − E_total_{t+1}` for operators with both years.
  - Training set: 366 operator-years (2022 with 2023 as t+1).
- Model training (offline, sketched below)
  - RandomForestRegressor with numeric + categorical preprocessing (SimpleImputer, StandardScaler, OneHotEncoder).
  - Time-aware train/test split: train on years < max(year), test on max(year).
  - Evaluation metrics (R², MAE, RMSE, bias, correlation) saved to JSON for inspection.
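A condensed sketch of that offline training step (feature columns are placeholders; the real feature set lives in `layers/gold/ml_reduction.py`):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["e_total_t", "production_t", "opportunity_score"]   # placeholder feature names
CATEGORICAL = ["cluster_label"]                                # placeholder feature name
LABEL = "realized_reduction_t"                                 # E_total_t - E_total_{t+1}


def train_reduction_model(panel: pd.DataFrame) -> Pipeline:
    """Time-aware split: fit on years < max(year); max(year) is held out for evaluation."""
    train = panel[panel["year"] < panel["year"].max()]

    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer()), ("scale", StandardScaler())]), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    model = Pipeline([("prep", preprocess), ("rf", RandomForestRegressor(random_state=0))])
    model.fit(train[NUMERIC + CATEGORICAL], train[LABEL])
    return model
```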
With only 2 years of data, these models are positioned as validation tools, not production forecasters. The primary value today is checking alignment between heuristic opportunity scores and realized behaviour.
When ML would become production-grade
To make reduction ML a first-class production input we would need:
- more years of data (4–5 years, e.g. 2020–2024) to capture both growth and contraction cycles
- explicit production scenario controls (growth vs stable vs decline) and interaction terms (e.g. opportunity_score × production_growth_pct)
- external drivers (commodity prices, regulatory changes, capital cycles) as features
- time-series cross-validation and out-of-sample validation (e.g. train on 2020–2023, test on 2024)
- clear business rules: ML refines ranking and segmentation but does not replace physics-based emissions or regulatory calculations
In the current state, the architecture is in place and battle-tested. The pipeline can start producing production-grade ML outputs once the data horizon and validation strategy are extended.
Repro steps
The end-to-end workflow is:
```bash
# 1. Build the multi-year operator panel (required for ML and reduction exports)
python scripts/ml/build_operator_panel.py

# 2. Export the reduction analysis tables (descriptive validation)
python scripts/reporting/export_reduction_tables.py

# 3. Train the reduction ML model (optional, exploratory with 2 years of data)
python scripts/ml/run_reduction_ml_panel.py

# 4. Run the main pipeline and generate all visualizations
validere usable --year 2023
```

Notes:
- Step 2 writes the CSVs that back the “realized reductions” and segmentation findings.
- Step 3 writes ML predictions and diagnostics; the DAG consumes them if present and ignores them if not.
- Step 4 is the primary entry point for users; they do not need to run ML training to get a consistent decision pipeline.
- Analysis core
  - `analysis_core.py`: carbon price sensitivity, weight sensitivity, scenario results, uncertainty, risk-adjusted scores.
- Visualization sinks
  - `viz_working.py`: core charts (emissions decomposition, intensity rankings, production vs intensity).
  - `viz_decision.py`: decision charts (MAC curve, efficiency frontier, thresholds, risk return, scenario shift, reduction heatmap).
- IO sinks
  - `sinks_io.py`: parquet writers for emissions, rankings, clusters, analysis outputs.
5. Helpers and infrastructure
- Silver helpers (`layers/silver`)
  - `activities.py`, `production.py`, `ngl.py`, `flows.py`, `metadata.py`, `throughput.py`.
- Gold helpers (`layers/gold`)
  - `emissions.py`, `features.py`, `clustering.py`, `rankings.py`, `comparison.py`, `sensitivity.py`, `scenarios.py`, `uncertainty.py`, `risk.py`, `ml_targets.py`, `ml_models.py`.
- Infrastructure (`infrastructure/`)
  - `parquet_writer.py`: partitioned parquet with schema validation (sketched below).
  - `transformers.py`: shared transforms.
  - `validators.py`: data quality checks.
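A minimal sketch of what `parquet_writer.py` is responsible for (the validation rule and signature are illustrative):

```python
from pathlib import Path

import pandas as pd


def write_partitioned(df: pd.DataFrame, path: str, required: dict[str, str]) -> None:
    """Validate required columns/dtypes, then write Snappy parquet partitioned by year and month."""
    missing = set(required) - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    df = df.astype(required)  # coerce to the declared schema before writing
    Path(path).mkdir(parents=True, exist_ok=True)
    df.to_parquet(path, partition_cols=["year", "month"], compression="snappy")
```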
6. Execution and DAGs
- Typical usage
  - `targets = validere.list_targets()`
  - `results = validere.run(["operator_petrinex_emissions"], target_year=2023)`
  - `graph_path = validere.visualize_graph("artifacts/graphs/dag.png")`
- Graph behavior
  - Discover nodes from `validere.nodes.*`.
  - Build a dependency graph from function signatures (see the sketch after this list).
  - Validate for cycles.
  - Execute nodes in dependency order, with parallelism where possible.
- DAG images
  - Stored in `docs/assets/dag/`.
  - Include full pipeline and layer views (bronze, silver, gold emissions, gold analytics, rankings, analysis, viz).
  - Generated with `python -m validere.visualization.dag docs/assets/dag`.
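The graph-build step above can be sketched in a few lines (a hypothetical helper, not the framework's actual code): nodes are discovered as module functions, edges come from parameter names, and `graphlib` flags cycles.

```python
import inspect
from graphlib import CycleError, TopologicalSorter


def build_execution_order(module) -> list[str]:
    """Return node names in dependency order; raise if the graph has a cycle."""
    nodes = {
        name: fn
        for name, fn in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_")
    }
    # A parameter that matches another node's name is a dependency edge.
    deps = {
        name: {p for p in inspect.signature(fn).parameters if p in nodes}
        for name, fn in nodes.items()
    }
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        raise ValueError(f"dependency cycle detected: {exc.args[1]}") from exc
```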
7. Data schema and SCD2
- Silver tables
  - Production and NGL tables at facility-month grain.
  - Facility edges for flows and attribution.
  - `dim_facility` as a slowly changing dimension (SCD2).
- Gold tables
  - `facility_aggregated`: facility-month KPIs and intensities.
  - `emission_attribution`: facility-month per sender.
  - Additional gold outputs for operator and decision metrics.
- SCD2 pattern
  - Implemented in `silver.dim_facility` with `_valid_from`, `_valid_to`, `_is_current`, `version`.
  - Supports point-in-time joins and a history of facility attributes (see the sketch after this list).
- Storage
  - Parquet format with Snappy compression.
  - Partitioning by year and month where appropriate.
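Point-in-time joins against the SCD2 dimension can be expressed as below (a sketch; the fact-side `report_date` column and the open-ended `_valid_to` convention are assumptions):

```python
import pandas as pd


def point_in_time_join(facts: pd.DataFrame, dim_facility: pd.DataFrame) -> pd.DataFrame:
    """Attach facility attributes valid as of each fact row's report_date."""
    joined = facts.merge(dim_facility, on="facility_id", how="left")
    # Keep the dimension version whose validity window covers report_date;
    # current rows are assumed to have an open (null) _valid_to.
    in_window = (joined["report_date"] >= joined["_valid_from"]) & (
        (joined["report_date"] < joined["_valid_to"]) | joined["_valid_to"].isna()
    )
    return joined[in_window]
```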
8. Visualization API
- Purpose
  - Provide stable views for visualization.
  - Keep viz nodes free of joins and business rules.
- Operator view
  - Built in `layers/gold/api_views.py`, exposed as `operator_viz_view`.
  - Grain: one row per operator per year.
  - Contains ids, production, emissions, intensity, facility count, decision metrics, scenario ranks.
  - Guarantees: required columns exist, types are stable, data is already filtered by `target_year`.
- Facility view
  - Grain: one row per facility per year.
  - Contains facility and operator ids, production, activities, emissions.
  - Same guarantees on presence and types.
- Rules for viz nodes (see the sketch after this list)
  - Take `operator_viz_view` or the facility view as input, not raw gold tables.
  - Do not rejoin gold tables inside viz functions.
  - Assume view schemas are correct; avoid defensive column checks.
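Under those rules, a viz node is just a rendering function over the view, as in this sketch (the chart and the `emissions_intensity` column are illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd


def intensity_ranking_chart(operator_viz_view: pd.DataFrame) -> str:
    """Render a top-20 intensity ranking straight from the view; no joins, no business rules."""
    top = operator_viz_view.nlargest(20, "emissions_intensity")
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.barh(top["operator_baid"].astype(str), top["emissions_intensity"])
    ax.set_xlabel("Emissions intensity")
    out = "docs/assets/figures/intensity_ranking.png"
    fig.savefig(out, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return out
```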
9. Boundaries, runsets, docs
- Public API
  - `validere.list_targets`
  - `validere.run`
  - `validere.visualize_graph`
- Deprecated
  - `validere.domain` and anything under `_archive` are legacy and not used in new code.
- Runsets
  - `validere.runsets.USABLE_TARGETS`: standard end-to-end run.
  - `validere.runsets.DAG_LAYER_TARGETS`: layer DAG targets.
- Documentation
  - Manually maintained: main narrative and handoff files in `docs/` (problem framing, methods, results, implications, architecture, decision playbooks, technical_handoff, troubleshooting).
  - Auto-generated: everything under `docs/api/`, built from source docstrings with `quartodoc build`.