Technical Handoff

Environment and tooling

Python and libraries
- Python 3.13+, managed with pixi or uv
- Core: pandas, pyarrow, polars, scikit-learn, quarto
Dependency management
- Install: pixi install or uv sync
- Run pipeline: validere quickstart --year <YEAR> --viz or validere run --year <YEAR>
- Render docs: pixi run quarto render or uv run quarto render

Data flow and graphs

Layers
- Bronze: raw ingestion, minimal transforms
- Silver: aggregation, unit conversion, activity extraction
- Gold: emissions, features, clustering, ranking
- Analysis: scenarios, sensitivity, risk metrics
- Visualization: figure generation from gold views
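
The layering above can be pictured as a chain of functions, each consuming the output of the layer before it. The sketch below is a toy illustration only: the function names, record fields, and the factor value are hypothetical and are not taken from the validere code base.

```python
# Toy illustration of the bronze -> silver -> gold flow.
# Every name and value here is hypothetical, not part of validere.
from collections import defaultdict


def bronze(raw_rows):
    """Raw ingestion with minimal transforms: drop rows that have no volume."""
    return [row for row in raw_rows if row.get("volume") is not None]


def silver(bronze_rows):
    """Aggregate volumes per (operator, month)."""
    totals = defaultdict(float)
    for row in bronze_rows:
        totals[(row["operator"], row["month"])] += float(row["volume"])
    return dict(totals)


def gold(silver_totals, factor):
    """Scale each aggregated volume by a caller-supplied placeholder factor."""
    return {key: volume * factor for key, volume in silver_totals.items()}


raw = [
    {"operator": "A", "month": "2023-01", "volume": 10.0},
    {"operator": "A", "month": "2023-01", "volume": 5.0},
    {"operator": "B", "month": "2023-01", "volume": None},
]
emissions = gold(silver(bronze(raw)), factor=2.0)
print(emissions)
```

In the real pipeline these steps are nodes in the DAG rather than direct function calls; the sketch only shows the direction of the dependencies.
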
Graphs
- DAG PNGs live in docs/assets/dag/
- Generate: python -m validere.visualization.dag docs/assets/dag
- Views: bronze, silver, gold emissions, gold analytics, rankings, analysis, visualization, full pipeline
Architecture reference
- See System Architecture for full design details

Modules overview

Public API
- File: validere/__init__.py
- Functions: list_targets, run, visualize_graph
Nodes (pipeline layers)
- Bronze: validere/nodes/bronze_io.py
- Silver: validere/nodes/silver_core.py
- Gold emissions: validere/nodes/gold_emissions.py
- Gold segmentation: validere/nodes/gold_segmentation.py
- Gold ranking: validere/nodes/gold_ranking.py
- Gold decision: validere/nodes/gold_decision.py
- Gold API views: validere/nodes/gold_api_views.py
- Analysis: validere/nodes/analysis_core.py
- Visualizations: validere/nodes/viz_working.py, validere/nodes/viz_decision.py
- Sinks: validere/nodes/sinks_io.py
Emissions and features
- Emissions calculators: domain/emissions/petrinex.py
- Emissions metrics and costs: domain/emissions/metrics.py
- Feature engineering: layers/gold/features.py
- Clustering: layers/gold/clustering.py
Infrastructure
- Parquet writer: infrastructure/parquet_writer.py
- Schemas: schemas.py
- Transformers: infrastructure/transformers.py
- Units: utils/units.py

Running the pipeline

Bootstrap data
- Year: validere bootstrap --year <YEAR>
- Range: validere bootstrap --start YYYY-MM --end YYYY-MM
- Output: partitioned parquet under data/bronze/
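
The range form takes YYYY-MM bounds. A minimal sketch of how such a range could expand into individual months is shown below; it assumes both bounds are inclusive, which the notes above do not state, and `month_range` is a hypothetical helper rather than part of the validere CLI.

```python
from datetime import date


def month_range(start: str, end: str) -> list[str]:
    """Expand 'YYYY-MM' bounds into a list of 'YYYY-MM' strings.

    Assumption: both bounds are inclusive.
    """
    start_year, start_month = (int(part) for part in start.split("-"))
    end_year, end_month = (int(part) for part in end.split("-"))

    # Building dates validates the month numbers (raises ValueError on e.g. 13).
    first = date(start_year, start_month, 1)
    last = date(end_year, end_month, 1)
    if first > last:
        raise ValueError(f"start {start} is after end {end}")

    months = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months
```
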
Main runs
- Quick start: validere quickstart --year <YEAR> --viz
- Standard: validere run --year <YEAR>
Programmatic use
- List targets: validere.list_targets()
- Run targets: validere.run([...], target_year=<YEAR>)
- Save DAG: validere.visualize_graph("artifacts/graphs/dag.png")
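
The three calls above can be combined into a small guard that checks target names before running them. This is a sketch under assumptions: `run_targets` is a hypothetical helper, `list_targets()` is assumed to return an iterable of target names, and the return value of `run()` is passed through unchanged because its type is not documented here.

```python
def run_targets(api, wanted, year):
    """Run the `wanted` targets for `year`, failing early on unknown names.

    `api` is any object exposing list_targets() and run(); in real use this
    would be the imported validere module. Assumption: list_targets() returns
    an iterable of target-name strings.
    """
    available = set(api.list_targets())
    unknown = sorted(set(wanted) - available)
    if unknown:
        raise ValueError(f"Unknown targets: {unknown}")
    # Signature taken from the notes above: run([...], target_year=<YEAR>)
    return api.run(list(wanted), target_year=year)
```

In real use, pass the imported module as `api`, for example `run_targets(validere, [...], <YEAR>)`.
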
CLI utilities
- DAG image: validere viz --output artifacts/graphs/dag.png
- Feasible targets: validere usable --year <YEAR>
- Health check: validere doctor
ML panel building (for multi-year ML workflows)
- Build panel: python scripts/ml/build_operator_panel.py
  - Runs the DAG for 2022 and 2023 through the public run() API
  - Saves multi-year panel to data/gold/operator_panel_2022_2023.parquet
  - Required before running ML reduction nodes
- Validate ML: python scripts/ml/run_reduction_ml_panel.py
  - Standalone test of ML pipeline outside Hamilton
  - Loads panel, builds features/labels, trains model, prints metrics
  - Use to verify ML logic before wiring into DAG
- Run ML nodes: validere run operator_reduction_model operator_reduction_predictions ...
  - Requires the pre-built panel from the Build panel step above
  - Uses operator_panel_multiyear node which loads panel from disk
- Architecture note: the panel is built outside Hamilton to avoid the anti-pattern of calling _build_driver() from inside nodes (see CLAUDE.md)
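
The idea of building the panel outside the DAG can be sketched as a plain loop over years that calls the public run function and stacks the results. This is not the contents of scripts/ml/build_operator_panel.py: `build_panel` is a hypothetical helper, and it assumes that the run function returns a mapping from target name to a pandas DataFrame.

```python
import pandas as pd


def build_panel(run_fn, target, years):
    """Run `target` once per year via `run_fn` and stack the yearly frames.

    Assumptions: run_fn(targets, target_year=...) returns a mapping from
    target name to a pandas DataFrame; `target` is a caller-chosen name.
    """
    frames = []
    for year in years:
        result = run_fn([target], target_year=year)
        frame = result[target].copy()
        frame["year"] = year  # tag rows so the years stay distinguishable
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Because the loop lives in a script, no node needs to construct a driver of its own; the DAG later reads the saved panel from disk.
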
Key outputs
- Silver: facility production and NGL monthly parquet tables
- Gold: operator emissions, features, clusters parquet tables
- ML: operator panel, reduction models, predictions (when panel is built)
- Figures: docs/assets/figures/*.png

Documentation

Build
- API docs: quartodoc build (via pixi or uv)
- Site: quarto render
Location
- Root: site/
- Landing: site/index.html
- Problem framing: site/01_problem_framing.html
- API reference: site/api/

Configuration

Core config
- File: validere/config.py
- Examples: carbon price, emission thresholds, random seed
Emission factors
- File: domain/emissions/factors.py
- Content: factors per component, units, source notes
Overrides
- Edit config directly or set env vars such as CARBON_PRICE
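
An environment override such as CARBON_PRICE could be read along the following lines. This is a sketch, not the logic in validere/config.py: `read_float_override` is a hypothetical helper, and treating the value as a float is an assumption.

```python
import os


def read_float_override(name, default, environ=os.environ):
    """Return float(environ[name]) if set and non-empty, else `default`.

    Assumption: the override is numeric. How validere/config.py actually
    parses its environment variables is not shown in these notes.
    """
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return float(raw)
    except ValueError as exc:
        raise ValueError(f"{name} must be numeric, got {raw!r}") from exc
```

Failing loudly on a non-numeric value is a deliberate choice here: a silently ignored override would change results without any visible signal.
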

Operations

Scale and performance
- Relative sizes: bronze parquet tables are the largest, followed by silver, then gold
- Typical times: bronze, silver, gold layer runtimes on a standard laptop
- Memory: rough bands for small, medium, large datasets
- Storage: approximate total footprint across all layers
Known patterns
- Date and key quirks: date and key columns can be missing or appear under variant names; handle both cases
- Duplicate columns: drop repeats by de-duplicating on column name
- Large runs: prefer chunking and efficient joins
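
The first two patterns can be sketched with pandas as follows. `normalize_columns`, the alias mapping, and the column names used with it are hypothetical; the project's own handling may differ.

```python
import pandas as pd


def normalize_columns(df, aliases, required):
    """Rename variant columns, drop duplicate names, add missing columns.

    aliases:  mapping from variant column name to canonical name
    required: canonical column names that must exist in the result
    """
    out = df.rename(columns=aliases)
    # Keep only the first occurrence of each column name.
    out = out.loc[:, ~out.columns.duplicated()].copy()
    # Add any required column that is still absent, filled with NA.
    for col in required:
        if col not in out.columns:
            out[col] = pd.NA
    return out
```

Renaming runs before de-duplication so that two variants of the same column collapse into a single canonical column.
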

Extensibility and tests

Adding new features or metrics
- Implement feature in layers/gold/features.py
- Wire into validere/nodes/gold_segmentation.py or related gold nodes
- Update layers/gold/clustering.py if clustering should use the new feature
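
As an illustration of the first step, a new feature is often a small function that takes a table and returns it with one extra column. The feature, the function name, and the column names below are hypothetical and only show the shape such a function might take in layers/gold/features.py.

```python
import pandas as pd


def add_emissions_intensity(
    operators, emissions_col="emissions", production_col="production"
):
    """Return a copy of `operators` with an `emissions_intensity` column.

    Hypothetical feature: emissions divided by production. Rows whose
    production is zero or negative get NaN instead of an infinite ratio.
    """
    out = operators.copy()
    # where() keeps values that satisfy the condition and sets the rest to NaN.
    production = out[production_col].where(out[production_col] > 0)
    out["emissions_intensity"] = out[emissions_col] / production
    return out
```
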
Tests
- Unit tests: tests/test_*.py
- Integration tests: tests/test_integration.py
- Data validation: infrastructure-level checks and validators