Technical Handoff

Environment and tooling

  • Python and libraries
    • Python 3.13+, managed with pixi or uv
    • Core: pandas, pyarrow, polars, scikit-learn, quarto
  • Dependency management
    • Install: pixi install or uv sync
    • Run pipeline: validere quickstart --year <YEAR> --viz or validere run --year <YEAR>
    • Render docs: pixi run quarto render or uv run quarto render
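
A quick way to confirm the environment matches the list above is a small preflight script. The sketch below is illustrative only and is not the implementation of validere doctor; the function name check_environment and the tuple CORE_MODULES are hypothetical. It assumes scikit-learn is detected by its import name sklearn and that quarto is available as a command-line executable.

```python
import importlib.util
import shutil
import sys

# Import names for the core libraries listed above. scikit-learn is imported
# as "sklearn"; quarto is checked as a command-line tool instead of a module.
CORE_MODULES = ("pandas", "pyarrow", "polars", "sklearn")


def check_environment(min_python=(3, 13)):
    """Return a dict mapping each requirement to True (met) or False (missing)."""
    report = {"python": sys.version_info >= min_python}
    for name in CORE_MODULES:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    # shutil.which returns None when the executable is not on PATH.
    report["quarto"] = shutil.which("quarto") is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement}: {'ok' if ok else 'MISSING'}")
```

Run it inside the managed environment (for example through pixi run or uv run) so that it inspects the same interpreter the pipeline uses.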

Data flow and graphs

  • Layers
    • Bronze: raw ingestion, minimal transforms
    • Silver: aggregation, unit conversion, activity extraction
    • Gold: emissions, features, clustering, ranking
    • Analysis: scenarios, sensitivity, risk metrics
    • Visualization: figure generation from gold views
  • Graphs
    • DAG PNGs live in docs/assets/dag/
    • Generate: python -m validere.visualization.dag docs/assets/dag
    • Views: bronze, silver, gold emissions, gold analytics, rankings, analysis, visualization, full pipeline
  • Architecture reference
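
The layer ordering above can be sketched as a small dependency graph. This is a simplified model, not the real Hamilton DAG: the edges are assumptions inferred from the layer descriptions (analysis and visualization are both taken to read from gold), and the names LAYER_DEPENDENCIES, execution_order, and upstream_of are hypothetical.

```python
from graphlib import TopologicalSorter

# Each layer maps to the layers it reads from. The edges are an assumption
# inferred from the layer descriptions, not read from the real pipeline.
LAYER_DEPENDENCIES = {
    "bronze": set(),
    "silver": {"bronze"},
    "gold": {"silver"},
    "analysis": {"gold"},
    "visualization": {"gold"},
}


def execution_order(dependencies):
    """Return one valid run order in which every layer follows its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


def upstream_of(layer, dependencies):
    """Return every layer that must run before `layer`, directly or not."""
    seen = set()
    stack = list(dependencies[layer])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dependencies[current])
    return seen
```

For example, upstream_of("visualization", LAYER_DEPENDENCIES) lists the layers that have to be materialized before any figure can be generated.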

Modules overview

  • Public API
    • File: validere/__init__.py
    • Functions: list_targets, run, visualize_graph
  • Nodes (pipeline layers)
    • Bronze: validere/nodes/bronze_io.py
    • Silver: validere/nodes/silver_core.py
    • Gold emissions: validere/nodes/gold_emissions.py
    • Gold segmentation: validere/nodes/gold_segmentation.py
    • Gold ranking: validere/nodes/gold_ranking.py
    • Gold decision: validere/nodes/gold_decision.py
    • Gold API views: validere/nodes/gold_api_views.py
    • Analysis: validere/nodes/analysis_core.py
    • Visualizations: validere/nodes/viz_working.py, validere/nodes/viz_decision.py
    • Sinks: validere/nodes/sinks_io.py
  • Emissions and features
    • Emissions calculators: domain/emissions/petrinex.py
    • Emissions metrics and costs: domain/emissions/metrics.py
    • Feature engineering: layers/gold/features.py
    • Clustering: layers/gold/clustering.py
  • Infrastructure
    • Parquet writer: infrastructure/parquet_writer.py
    • Schemas: schemas.py
    • Transformers: infrastructure/transformers.py
    • Units: utils/units.py

Running the pipeline

  • Bootstrap data
    • Year: validere bootstrap --year <YEAR>
    • Range: validere bootstrap --start YYYY-MM --end YYYY-MM
    • Output: partitioned parquet under data/bronze/
  • Main runs
    • Quick start: validere quickstart --year <YEAR> --viz
    • Standard: validere run --year <YEAR>
  • Programmatic use
    • List targets: validere.list_targets()
    • Run targets: validere.run([...], target_year=<YEAR>)
    • Save DAG: validere.visualize_graph("artifacts/graphs/dag.png")
  • CLI utilities
    • DAG image: validere viz --output artifacts/graphs/dag.png
    • Feasible targets: validere usable --year <YEAR>
    • Health check: validere doctor
  • ML panel building (for multi-year ML workflows)
    • Build panel: python scripts/ml/build_operator_panel.py
      • Runs DAG for years 2022-2023 using public run() API
      • Saves multi-year panel to data/gold/operator_panel_2022_2023.parquet
      • Required before running ML reduction nodes
    • Validate ML: python scripts/ml/run_reduction_ml_panel.py
      • Standalone test of ML pipeline outside Hamilton
      • Loads panel, builds features/labels, trains model, prints metrics
      • Use to verify ML logic before wiring into DAG
    • Run ML nodes: validere run operator_reduction_model operator_reduction_predictions ...
      • Requires the panel produced by scripts/ml/build_operator_panel.py
      • Uses operator_panel_multiyear node which loads panel from disk
    • Architecture note: Panel is built outside Hamilton to avoid the anti-pattern of calling _build_driver() from inside nodes (see CLAUDE.md)
  • Key outputs
    • Silver: facility production and NGL monthly parquet tables
    • Gold: operator emissions, features, clusters parquet tables
    • ML: operator panel, reduction models, predictions (when panel is built)
    • Figures: docs/assets/figures/*.png
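
The panel-building pattern described above (drive the pipeline once per year from a script, then stack the results) can be sketched as follows. This is not the code in scripts/ml/build_operator_panel.py. The run_year callable is a hypothetical stand-in for a wrapper around the public run() API, whose return type is not documented here, so plain lists of row dicts are used. The file-name pattern in panel_path is inferred from the single example path above and is also an assumption.

```python
from pathlib import PurePosixPath


def build_panel(run_year, years):
    """Stack per-year operator rows into one multi-year panel.

    `run_year` is a hypothetical callable standing in for a wrapper around
    the public run() API; it takes a year and returns a list of row dicts.
    """
    panel = []
    for year in years:
        for row in run_year(year):
            # Tag each row with its year so the panel can be keyed by
            # (operator, year) downstream.
            panel.append({**row, "year": year})
    return panel


def panel_path(years, root="data/gold"):
    """Build the output path from the first and last year (assumed pattern)."""
    first, last = min(years), max(years)
    return PurePosixPath(root) / f"operator_panel_{first}_{last}.parquet"


def fake_run_year(year):
    """Illustrative stub that returns made-up rows instead of running the DAG."""
    return [{"operator": "A", "emissions": float(year)}]
```

Because the loop lives in a script, no node ever has to construct a driver itself, which is the anti-pattern the architecture note warns about.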

Documentation

  • Build
    • API docs: pixi run quartodoc build or uv run quartodoc build
    • Site: quarto render
  • Location
    • Root: site/
    • Landing: site/index.html
    • Problem framing: site/01_problem_framing.html
    • API reference: site/api/

Configuration

  • Core config
    • File: validere/config.py
    • Examples: carbon price, emission thresholds, random seed
  • Emission factors
    • File: domain/emissions/factors.py
    • Content: factors per component, units, source notes
  • Overrides
    • Edit config directly or set env vars such as CARBON_PRICE
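
One common way to implement environment-variable overrides is sketched below. It does not reproduce validere/config.py: the PipelineConfig class, its field names, and the zero defaults are placeholders, and only CARBON_PRICE is named in this document (RANDOM_SEED is an assumed variable name used for illustration).

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PipelineConfig:
    # Placeholder defaults for illustration only; the real values live in
    # validere/config.py.
    carbon_price: float = 0.0
    random_seed: int = 0


def load_config(environ=None):
    """Build a config, letting upper-case env vars override the defaults."""
    environ = os.environ if environ is None else environ
    overrides = {}
    for field in fields(PipelineConfig):
        raw = environ.get(field.name.upper())
        if raw is not None:
            # Environment values are strings; cast using the default's type.
            overrides[field.name] = type(field.default)(raw)
    return PipelineConfig(**overrides)
```

Passing the environment as an argument keeps the function easy to test without mutating os.environ.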

Operations

  • Scale and performance
    • Typical sizes: parquet tables shrink from bronze to silver to gold
    • Typical times: bronze, silver, gold layer runtimes on a standard laptop
    • Memory: rough bands for small, medium, large datasets
    • Storage: approximate total footprint across all layers
  • Known patterns
    • Date and key quirks: handle missing or variant columns
    • Duplicate columns: drop via de-duplication on column names
    • Large runs: prefer chunking and efficient joins
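
The three patterns above can be illustrated with plain Python helpers. These are sketches, not the project's transformers: the function names are hypothetical, the column names in the usage note are made up, and keeping the first occurrence of a duplicated column is an assumed policy.

```python
def first_occurrences(columns):
    """Return the indices of the first occurrence of each column name."""
    seen = set()
    keep = []
    for index, name in enumerate(columns):
        if name not in seen:
            seen.add(name)
            keep.append(index)
    return keep


def resolve_column(columns, candidates):
    """Return the first candidate name present in `columns`, else None."""
    available = set(columns)
    for name in candidates:
        if name in available:
            return name
    return None


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

For instance, resolve_column(cols, ["production_month", "month"]) picks whichever spelling a given file uses. With pandas, the usual equivalent of first_occurrences is df.loc[:, ~df.columns.duplicated()].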

Extensibility and tests

  • Adding new features or metrics
    • Implement feature in layers/gold/features.py
    • Wire into validere/nodes/gold_segmentation.py or related gold nodes
    • Update layers/gold/clustering.py if clustering should use the new feature
  • Tests
    • Unit tests: tests/test_*.py
    • Integration tests: tests/test_integration.py
    • Data validation: infrastructure-level checks and validators
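
A new feature and its unit test might look like the sketch below. The feature itself (emissions per unit of production), its name, its units, and the NaN policy for zero production are all hypothetical; they only illustrate the shape of a function that could live in layers/gold/features.py alongside tests in tests/test_*.py.

```python
import math


def emissions_intensity(emissions_t, production_m3):
    """Hypothetical feature: tonnes emitted per unit of production.

    Returns NaN when production is zero or missing so downstream clustering
    can drop or impute the row explicitly instead of dividing by zero.
    """
    if production_m3 is None or production_m3 == 0:
        return math.nan
    return emissions_t / production_m3


def test_emissions_intensity_basic():
    assert emissions_intensity(50.0, 200.0) == 0.25


def test_emissions_intensity_zero_production():
    assert math.isnan(emissions_intensity(50.0, 0.0))
```

Keeping the feature a pure function of its inputs makes it testable without running the pipeline; the gold node then only needs to call it.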