Data flow and graphs
- Layers
- Bronze: raw ingestion, minimal transforms
- Silver: aggregation, unit conversion, activity extraction
- Gold: emissions, features, clustering, ranking
- Analysis: scenarios, sensitivity, risk metrics
- Visualization: figure generation from gold views
- Graphs
- DAG PNGs live in docs/assets/dag/
- Generate: python -m validere.visualization.dag docs/assets/dag (see the sketch after this list)
- Views: bronze, silver, gold emissions, gold analytics, rankings, analysis, visualization, full pipeline
- Architecture reference
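A minimal sketch of regenerating these PNGs from Python by wrapping the documented module command; run it from the repository root so the docs/assets/dag path resolves.

```python
# Minimal sketch: regenerate the DAG view PNGs via the documented entry point.
# Run from the repository root so docs/assets/dag resolves correctly.
import subprocess

subprocess.run(
    ["python", "-m", "validere.visualization.dag", "docs/assets/dag"],
    check=True,
)
```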
Modules overview
- Public API
- File: validere/__init__.py
- Functions: list_targets, run, visualize_graph
- Nodes (pipeline layers)
- Bronze: validere/nodes/bronze_io.py
- Silver: validere/nodes/silver_core.py
- Gold emissions: validere/nodes/gold_emissions.py
- Gold segmentation: validere/nodes/gold_segmentation.py
- Gold ranking: validere/nodes/gold_ranking.py
- Gold decision: validere/nodes/gold_decision.py
- Gold API views: validere/nodes/gold_api_views.py
- Analysis: validere/nodes/analysis_core.py
- Visualizations: validere/nodes/viz_working.py, validere/nodes/viz_decision.py
- Sinks: validere/nodes/sinks_io.py
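The node modules above follow the Hamilton convention (Hamilton is the DAG framework referenced under ML panel building below): each node is a plain function whose name is an output and whose parameter names refer to upstream outputs. The sketch below is hypothetical; the node and column names are assumptions, not the repository's actual code.

```python
# Hypothetical Hamilton-style node sketch; node and column names are assumptions.
import pandas as pd

def operator_emissions_total(facility_emissions_monthly: pd.DataFrame) -> pd.DataFrame:
    """Aggregate facility-level monthly emissions to one row per operator."""
    return (
        facility_emissions_monthly
        .groupby("operator_id", as_index=False)["co2e_tonnes"]
        .sum()
    )
```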
- Emissions and features
- Emissions calculators: domain/emissions/petrinex.py
- Emissions metrics and costs: domain/emissions/metrics.py
- Feature engineering: layers/gold/features.py
- Clustering: layers/gold/clustering.py
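A conceptual sketch of the factor-based pattern these modules implement: per-component activity volumes multiplied by emission factors, with costs applied at the configured carbon price. Function names, column names, and factor values are placeholders for illustration, not the petrinex.py or metrics.py API.

```python
# Conceptual sketch only; names and factor values are placeholders, not real data.
import pandas as pd

PLACEHOLDER_FACTORS = {"flaring": 2.5, "venting": 14.0}  # illustrative tonnes CO2e per e3m3

def estimate_emissions(activity: pd.DataFrame) -> pd.DataFrame:
    """Map per-component activity volumes to CO2e using per-component factors."""
    out = activity.copy()
    out["co2e_tonnes"] = out["volume_e3m3"] * out["component"].map(PLACEHOLDER_FACTORS)
    return out

def carbon_cost(co2e_tonnes: pd.Series, carbon_price: float) -> pd.Series:
    """Dollar cost at the configured carbon price ($ per tonne CO2e)."""
    return co2e_tonnes * carbon_price
```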
- Infrastructure
- Parquet writer: infrastructure/parquet_writer.py
- Schemas: schemas.py
- Transformers: infrastructure/transformers.py
- Units: utils/units.py
Running the pipeline
- Bootstrap data
- Year: validere bootstrap --year <YEAR>
- Range: validere bootstrap --start YYYY-MM --end YYYY-MM
- Output: partitioned parquet under data/bronze/
- Main runs
- Quick start: validere quickstart --year <YEAR> --viz
- Standard: validere run --year <YEAR>
- Programmatic use
- List targets: validere.list_targets()
- Run targets: validere.run([...], target_year=<YEAR>)
- Save DAG: validere.visualize_graph("artifacts/graphs/dag.png"); a combined sketch of these calls follows this list
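A combined sketch of the calls above. The target name is hypothetical, and run() is assumed to return a mapping from target name to result; check list_targets() output for real names.

```python
# Combined sketch of the public API. "operator_emissions" is a hypothetical
# target name; run() is assumed to return a {target_name: result} mapping.
import validere

print(validere.list_targets())  # discover real target names first

results = validere.run(["operator_emissions"], target_year=2023)
print(results["operator_emissions"])

validere.visualize_graph("artifacts/graphs/dag.png")  # save the DAG image
```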
- CLI utilities
- DAG image: validere viz --output artifacts/graphs/dag.png
- Feasible targets: validere usable --year <YEAR>
- Health check: validere doctor
- ML panel building (for multi-year ML workflows)
- Build panel: python scripts/ml/build_operator_panel.py (a sketch of the pattern follows this list)
- Runs the DAG for years 2022-2023 using the public run() API
- Saves the multi-year panel to data/gold/operator_panel_2022_2023.parquet
- Required before running the ML reduction nodes
- Validate ML: python scripts/ml/run_reduction_ml_panel.py
- Standalone test of the ML pipeline outside Hamilton
- Loads the panel, builds features and labels, trains the model, prints metrics
- Use to verify the ML logic before wiring it into the DAG
- Run ML nodes: validere run operator_reduction_model operator_reduction_predictions ...
- Requires the pre-built panel from the build step above
- Uses the operator_panel_multiyear node, which loads the panel from disk
- Architecture note: the panel is built outside Hamilton to avoid the anti-pattern of calling _build_driver() from inside nodes (see CLAUDE.md)
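An illustrative sketch of the panel-building pattern described above: call the public run() API once per year and write a single combined parquet. The target name and the assumption that run() returns a name-to-DataFrame mapping are illustrative; see scripts/ml/build_operator_panel.py for the real implementation.

```python
# Illustrative sketch, not the actual script. "operator_features" is a
# hypothetical target; run() is assumed to return {target_name: DataFrame}.
import pandas as pd
import validere

frames = []
for year in (2022, 2023):
    results = validere.run(["operator_features"], target_year=year)
    frames.append(results["operator_features"].assign(year=year))

panel = pd.concat(frames, ignore_index=True)
panel.to_parquet("data/gold/operator_panel_2022_2023.parquet", index=False)
```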
- Key outputs
- Silver: facility production and NGL monthly parquet tables
- Gold: operator emissions, features, clusters parquet tables
- ML: operator panel, reduction models, predictions (when panel is built)
- Figures: docs/assets/figures/*.png
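A short sketch of inspecting these outputs with pandas. Only the operator panel path is documented above; any other file name would be an assumption about your checkout.

```python
# Sketch of loading a key output; the panel path is documented above.
import pandas as pd

panel = pd.read_parquet("data/gold/operator_panel_2022_2023.parquet")
print(panel.shape)
print(list(panel.columns))
```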
Documentation
- Build
- API docs: quartodoc build (via pixi or uv)
- Site: quarto render
- Location
- Root: site/
- Landing: site/index.html
- Problem framing: site/01_problem_framing.html
- API reference: site/api/
Configuration
- Core config
- File: validere/config.py
- Examples: carbon price, emission thresholds, random seed
- Emission factors
- File: domain/emissions/factors.py
- Content: factors per component, units, source notes
- Overrides
- Edit the config directly or set environment variables such as CARBON_PRICE (see the sketch below)
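A minimal sketch of the environment-variable override, assuming CARBON_PRICE is read when the config module is imported and that the value is in dollars per tonne CO2e; check validere/config.py for the accepted format.

```python
# Minimal sketch, assuming CARBON_PRICE is read at config import time and
# is expressed in $ per tonne CO2e (check validere/config.py to confirm).
import os

os.environ["CARBON_PRICE"] = "95.0"  # set before importing the package

import validere
validere.run(["operator_emissions"], target_year=2023)  # hypothetical target name
```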
Operations
- Scale and performance
- Typical sizes: parquet footprint shrinks from bronze to silver to gold
- Typical runtimes: per-layer timings (bronze, silver, gold) on a standard laptop
- Memory: rough bands for small, medium, and large datasets
- Storage: approximate total footprint across all layers
- Known patterns
- Date and key quirks: handle missing or variant columns
- Duplicate columns: drop via de-duplication on column names
- Large runs: prefer chunking and efficient joins
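A minimal sketch of the duplicate-column pattern noted above: de-duplicate on column names, keeping the first occurrence.

```python
# Minimal sketch of dropping duplicated column names (keep first occurrence).
import pandas as pd

def drop_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """De-duplicate on column names, as described in the known patterns above."""
    return df.loc[:, ~df.columns.duplicated()]
```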
Extensibility and tests
- Adding new features or metrics
- Implement the feature in layers/gold/features.py (hypothetical sketch after this list)
- Wire it into validere/nodes/gold_segmentation.py or the related gold nodes
- Update layers/gold/clustering.py if clustering should use the new feature
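A hypothetical sketch of such a feature in layers/gold/features.py; the signature and column names are assumptions that show the shape of the change, not the module's actual API.

```python
# Hypothetical feature sketch; signature and column names are assumptions.
import numpy as np
import pandas as pd

def emissions_intensity(operator_features: pd.DataFrame) -> pd.Series:
    """Tonnes CO2e per unit of production, guarding against divide-by-zero."""
    production = operator_features["production_volume"].replace(0, np.nan)
    return operator_features["co2e_tonnes"] / production
```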
- Tests
- Unit tests: tests/test_*.py (example sketch below)
- Integration tests: tests/test_integration.py
- Data validation: infrastructure-level checks and validators
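A hypothetical unit-test sketch mirroring the feature example above; the test file name, import path, and column names are assumptions.

```python
# Hypothetical test sketch (e.g. a new tests/test_features.py); import path
# and column names are assumptions mirroring the feature sketch above.
import numpy as np
import pandas as pd

from layers.gold.features import emissions_intensity  # assumed import path

def test_emissions_intensity_handles_zero_production():
    df = pd.DataFrame({"co2e_tonnes": [10.0, 5.0], "production_volume": [2.0, 0.0]})
    result = emissions_intensity(df)
    assert result.iloc[0] == 5.0
    assert np.isnan(result.iloc[1])
```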