Data & Methodology

Pipeline and data

ingestion
- source system: regulatory production and activity platform.
- entry point: validere bootstrap pulls volumes, NGL, and infrastructure data.
- bronze storage:
  - partitioned parquet for volumetric and NGL records.
  - reference tables for business associates, facilities, and well to facility links.
medallion layers
- bronze
  - raw parquet with minimal transformation.
- silver
  - cleaned facility month fact tables for production, NGL, and flows.
  - facility dimension with history (SCD2 fields: valid_from, valid_to, is_current, version).
- gold
  - emissions, intensity, and decision metrics at facility and operator level.
- analysis
  - scenarios, risk views, sensitivities, clustering.
- visualization
  - derived figures and exports from standardized views.
keys and grain
- facility key: ReportingFacilityID mapped to facility_id.
- operator key: OperatorBAID mapped to canonical operator_name.
- time key: ProductionMonth plus year and month.
- silver grain: facility_month.
- gold grain: facility_month and operator_year.
core tables
- bronze:
  - vol_raw: production and activities at well and facility level.
  - ngl_raw: component NGL volumes.
  - infrastructure: facility and operator mapping; well to facility links.
- silver:
  - production_monthly: facility month production plus FLARE, VENT, FUEL volumes.
  - ngl_production: facility month NGL.
  - facility_edges_monthly: flows between facilities for attribution.
  - dim_facility: facility attributes with history.
- gold:
  - facility_aggregated: facility month emissions and intensity.
  - operator_emissions: operator year emissions and intensity.
  - operator_decision_metrics: operator year decision metrics.
  - operator_viz_view: stable view consumed by visualizations.
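
The SCD2 fields on dim_facility imply an as-of lookup whenever a facility month is attributed to an operator. A minimal sketch, assuming valid_from is inclusive, valid_to is exclusive, and open-ended rows carry a null valid_to; the sample rows and the facility_as_of helper are hypothetical, not taken from the pipeline.

```python
import pandas as pd

def facility_as_of(dim_facility: pd.DataFrame, month: pd.Timestamp) -> pd.DataFrame:
    """Return the dim_facility version that was valid during `month`."""
    # Assumption: valid_from inclusive, valid_to exclusive, NaT means still open.
    open_ended = dim_facility["valid_to"].isna()
    in_window = (dim_facility["valid_from"] <= month) & (
        open_ended | (dim_facility["valid_to"] > month)
    )
    return dim_facility.loc[in_window]

# Hypothetical history: facility F1 changes operator on 2023-01-01.
dim_facility = pd.DataFrame({
    "facility_id": ["F1", "F1", "F2"],
    "operator_name": ["Alpha", "Beta", "Gamma"],
    "valid_from": pd.to_datetime(["2021-01-01", "2023-01-01", "2022-06-01"]),
    "valid_to": pd.to_datetime(["2023-01-01", None, None]),
    "is_current": [False, True, True],
    "version": [1, 2, 1],
})

snapshot = facility_as_of(dim_facility, pd.Timestamp("2022-07-01"))
```

Filtering on the validity window rather than on is_current keeps historical months attached to the operator that held the facility at the time.
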

Emissions and metrics

emissions logic (summary)
- inputs:
  - gas_flared_m3, gas_vented_m3, gas_fuel_m3.
  - gas throughput, oil and water volumes, NGL mix, steam volumes.
- components:
  - venting, flaring, fuel, processing, oil lifting, water handling, steam, fugitives.
- steps at facility month:
  - compute component emissions from volumes and factors.
  - sum to E_total.
  - convert production to BOE and sum to total_boe.
  - compute intensity_kg_per_boe = (E_total * 1000) / total_boe, with E_total in tCO2e (the factor 1000 converts tonnes to kg).
  - set intensity to null for very small producers; cap extreme values for charts and keep the uncapped value in a separate column.
- steps at operator year:
  - group by operator_baid and year.
  - sum emissions and BOE first.
  - recompute intensity_kg_per_boe from the sums.
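
The facility month and operator year steps above can be sketched in pandas. This is an illustration only: MIN_TOTAL_BOE, CHART_CAP_KG_PER_BOE, and the intensity_kg_per_boe_uncapped column name are hypothetical placeholders, since the actual thresholds and column names are not stated here.

```python
import pandas as pd

MIN_TOTAL_BOE = 100.0          # hypothetical cutoff for "very small producers"
CHART_CAP_KG_PER_BOE = 500.0   # hypothetical cap applied for charts

def add_intensity(df: pd.DataFrame) -> pd.DataFrame:
    """Add intensity columns; E_total is assumed to be in tCO2e."""
    out = df.copy()
    raw = out["E_total"] * 1000.0 / out["total_boe"]
    # Null out intensity where production is too small to be meaningful.
    raw = raw.where(out["total_boe"] >= MIN_TOTAL_BOE)
    out["intensity_kg_per_boe_uncapped"] = raw
    out["intensity_kg_per_boe"] = raw.clip(upper=CHART_CAP_KG_PER_BOE)
    return out

def to_operator_year(facility_month: pd.DataFrame) -> pd.DataFrame:
    """Sum emissions and BOE first, then recompute intensity from the sums."""
    sums = facility_month.groupby(["operator_baid", "year"], as_index=False)[
        ["E_total", "total_boe"]
    ].sum()
    return add_intensity(sums)

facility_month = pd.DataFrame({
    "operator_baid": ["A", "A"],
    "year": [2023, 2023],
    "E_total": [10.0, 90.0],        # tCO2e
    "total_boe": [1000.0, 3000.0],
})
operator_year = to_operator_year(facility_month)
```

In this toy input the two facilities have intensities of 10 and 30 kg/BOE. Recomputing from the sums gives 100 t over 4000 BOE, or 25 kg/BOE, whereas averaging the facility intensities would give 20. That difference is why the rollup sums first.
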
calibrated emission factors
- methodology:
  - factors are calibrated against TIER/GHGRP facility-level reported emissions.
  - calibration uses non-negative least squares (nnls) regression segmented by facility type.
  - segments: SAGD, GasPlant, Conventional (assigned based on facility type flags).
  - calibration dataset: facility-year grain with Petrinex activity volumes and TIER/GHGRP reported emissions.
- factor calibration:
  - feature matrix: activity volumes (gas_vented_m3, gas_flared_m3, gas_fuel_m3, gas_processed_m3, oil_m3, water_m3, steam_proxy_m3, gas_production_m3).
  - target: E_total_tco2e_reported from TIER/GHGRP.
  - regression: scipy.optimize.nnls (non-negative constraints).
  - post-processing: factors clamped to physically plausible ranges (guardrails).
- guardrails (physical plausibility ranges):
  - venting_tco2e_per_m3: 0.01-0.03 (CH4 GWP 28, density ~0.68 kg/m3).
  - flaring_tco2e_per_m3: 0.001-0.003 (combustion efficiency 85-95%).
  - fuel_gas_tco2e_per_m3: 0.001-0.003 (similar to flaring).
  - gas_processing_tco2e_per_m3: 0.00005-0.0005 (processing overhead).
  - oil_lifting_light_tco2e_per_m3: 0.005-0.015 (conventional oil).
  - oil_lifting_heavy_tco2e_per_m3: 0.015-0.04 (heavy oil / SAGD).
  - water_handling_tco2e_per_m3: 0.001-0.005 (water treatment).
  - sagd_steam_tco2e_per_m3: 0.04-0.08 (natural gas boilers).
  - fugitive_leakage_rate: 0.005-0.03 (0.5% to 3% of gas production).
- evaluation metrics:
  - per segment: R², RMSE, MAE, bias (predicted vs reported emissions).
  - overall: aggregated metrics across all segments.
- execution:
  - calibrated factors available via separate Hamilton nodes (facilities_with_emissions_calibrated, operator_petrinex_emissions_calibrated).
  - raw (default) factors remain available via original nodes (no breaking changes).
  - calibrated factors are applied segment-specifically based on facility type detection.
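
The calibration described above (NNLS per segment, then clamping to guardrails) can be sketched as follows. The data is synthetic and noise-free, only three of the listed features are used, and calibrate_segment is a hypothetical helper rather than the pipeline's own function. The mapping from activity volume to guardrail range is an assumption based on the factor names.

```python
import numpy as np
from scipy.optimize import nnls

# Guardrail ranges in tCO2e per m3, copied from the list above (subset).
# Assumption: each activity volume maps to the like-named factor.
GUARDRAILS = {
    "gas_vented_m3": (0.01, 0.03),
    "gas_flared_m3": (0.001, 0.003),
    "gas_fuel_m3": (0.001, 0.003),
}

def calibrate_segment(X, y, feature_names):
    """Fit non-negative factors for one segment, then clamp to guardrails."""
    coefficients, _residual_norm = nnls(X, y)
    return {
        name: float(np.clip(coef, *GUARDRAILS[name]))
        for name, coef in zip(feature_names, coefficients)
    }

rng = np.random.default_rng(0)
names = list(GUARDRAILS)
# Synthetic "true" factors per segment; the Conventional fuel factor is
# deliberately outside its guardrail to show the clamp.
true_factors = {
    "Conventional": [0.02, 0.002, 0.05],
    "GasPlant": [0.015, 0.0025, 0.002],
}
segments = {}
for segment, factors in true_factors.items():
    X = rng.uniform(0.0, 1000.0, size=(40, 3))   # activity volumes (m3)
    y = X @ np.array(factors)                    # reported emissions (tCO2e)
    segments[segment] = calibrate_segment(X, y, names)
```

Clamping after the fit keeps the factors physically plausible, but the clamped factors no longer minimize the residual, so the evaluation metrics should be computed on the clamped values.
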

Calibration workflow (optional)

Calibration against GHGRP/NPRI facility-level reported emissions is available through the nodes in emission_calibration.py and gold_emissions_ml.py.

Setup:
- data sources:
  - ghgrp_emissions_annual.parquet in silver layer.
  - facility_mapping_ghgrp_petrinex.parquet for ID mapping.
- nodes:
  - ghgrp_facility_emissions (bronze): load GHGRP emissions mapped to Petrinex facility IDs.
  - calibration_dataset (gold): join Petrinex activity volumes with GHGRP reported emissions at facility-year grain.
  - calibrated_emission_factors (gold): fit factors using NNLS segmented by facility type.
  - facilities_with_emissions_calibrated (gold): calculate emissions with calibrated factors.
  - operator_petrinex_emissions_calibrated (gold): operator-level emissions with calibrated factors.
  - operator_emissions_active (gold): select calibrated vs baseline based on config toggle.

Calibration method:
- regression: non-negative least squares (scipy.optimize.nnls).
- feature matrix X: activity volumes (gas_vented_m3, gas_flared_m3, gas_fuel_m3, gas_production_m3, oil_m3, water_m3, steam_proxy_m3).
- target y: E_total_tco2e_reported from GHGRP.
- segmentation: fit separate factors for SAGD, GasPlant, Conventional.
- post-processing: clamp factors to physical guardrails (venting 0.01–0.03, flaring 0.001–0.003, etc.).

Evaluation:
- metrics per segment: R², RMSE, MAE, bias (predicted vs reported emissions).
- baseline vs calibrated comparison via emission_calibration_metrics node.
- segment-specific factor application in add_emissions_to_dataframe based on facility type flags.
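
The per-segment metrics can be computed directly from reported and predicted emissions. A minimal sketch, assuming bias is defined as the mean of (predicted minus reported); calibration_metrics is a hypothetical helper and the numbers are made up for illustration.

```python
import numpy as np

def calibration_metrics(reported, predicted):
    """R2, RMSE, MAE, and bias of predicted vs reported emissions."""
    reported = np.asarray(reported, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - reported
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((reported - reported.mean()) ** 2))
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        # Assumption: positive bias means the factors over-predict.
        "bias": float(np.mean(err)),
    }

# Made-up facility-year totals in tCO2e.
metrics = calibration_metrics(
    reported=[100.0, 200.0, 300.0],
    predicted=[110.0, 190.0, 330.0],
)
```

Running the same function on baseline and calibrated predictions for each segment gives the baseline vs calibrated comparison.
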

Toggle:
- config.USE_CALIBRATED_FACTORS = True to use calibrated factors instead of baseline.
- operator_emissions_active node switches automatically based on toggle.
- fallback: baseline factors if GHGRP data not available or calibration fails.
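
The toggle and fallback behaviour can be sketched as a plain selection function. This is not the actual operator_emissions_active node: select_operator_emissions and the factor_source column are hypothetical, and a missing or empty calibrated frame stands in for "GHGRP data not available or calibration fails".

```python
from typing import Optional

import pandas as pd

def select_operator_emissions(
    use_calibrated: bool,
    baseline: pd.DataFrame,
    calibrated: Optional[pd.DataFrame],
) -> pd.DataFrame:
    """Pick calibrated emissions when enabled and available, else baseline."""
    if use_calibrated and calibrated is not None and not calibrated.empty:
        return calibrated.assign(factor_source="calibrated")
    # Fallback: toggle off, or calibration output missing or empty.
    return baseline.assign(factor_source="baseline")

baseline = pd.DataFrame({"operator_baid": ["A"], "E_total_kt": [12.0]})
calibrated = pd.DataFrame({"operator_baid": ["A"], "E_total_kt": [10.5]})

active = select_operator_emissions(True, baseline, calibrated)
fallback = select_operator_emissions(True, baseline, None)
```

Recording which factor set produced each row makes it easier to spot a silent fallback downstream.
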

decision metrics (definitions)
- npv_mm: net present value of reduction opportunities over a fixed horizon and discount rate.
- reduction_potential_kt: addressable emissions volume.
- regulatory_risk_score: composite regulatory exposure.
- payback_years: simple payback.
- investment_score:
  - based on NPV to CAPEX ratio.
  - normalized to 0-100 by percentile.
- benefit_score (internal):
  - weighted combination of intensity, scale, financial, and regulatory scores.
  - used for scenario and sensitivity views; not used for main rankings.
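
The financial definitions above can be illustrated with a short sketch. The horizon, discount rate, annual_saving_mm and capex_mm inputs are hypothetical, since the actual parameters are not given here, and the percentile normalization uses a pandas percentile rank as one plausible reading of "normalized to 0-100 by percentile".

```python
import pandas as pd

HORIZON_YEARS = 10     # hypothetical fixed horizon
DISCOUNT_RATE = 0.08   # hypothetical discount rate

def npv_mm(annual_saving_mm, capex_mm, years=HORIZON_YEARS, rate=DISCOUNT_RATE):
    """NPV of a constant annual saving over a fixed horizon, net of CAPEX."""
    present_value = sum(
        annual_saving_mm / (1.0 + rate) ** t for t in range(1, years + 1)
    )
    return present_value - capex_mm

def payback_years(capex_mm, annual_saving_mm):
    """Simple (undiscounted) payback; infinite if there is no saving."""
    return capex_mm / annual_saving_mm if annual_saving_mm > 0 else float("inf")

ops = pd.DataFrame({
    "operator_name": ["A", "B", "C"],
    "capex_mm": [10.0, 20.0, 5.0],
    "annual_saving_mm": [3.0, 4.0, 2.0],
})
ops["npv_mm"] = [
    npv_mm(s, c) for s, c in zip(ops["annual_saving_mm"], ops["capex_mm"])
]
ops["payback_years"] = [
    payback_years(c, s) for c, s in zip(ops["capex_mm"], ops["annual_saving_mm"])
]
# Percentile rank of the NPV to CAPEX ratio, scaled to 0-100.
ops["investment_score"] = (ops["npv_mm"] / ops["capex_mm"]).rank(pct=True) * 100.0
```

With rank(pct=True) the best operator scores 100 and the lowest scores 100/n rather than 0, so a different percentile convention would shift the bottom of the scale.
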
clustering and segments
- features:
  - log emissions volume.
  - intensity percentile.
  - log facility count.
  - log production volume.
- method:
  - K-means with adaptive cluster count.
- labels:
  - heuristic rules on cluster centroids to produce simple names (for example high intensity thermal, large efficient conventional).
- outputs:
  - cluster_id, cluster_label, cluster level opportunity scores.
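
A sketch of the clustering step with scikit-learn. How the cluster count is adapted is not specified above; choosing k by silhouette score is an assumption made for this example, as are the helper names, the k range, and the synthetic operator data.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def build_cluster_features(ops: pd.DataFrame) -> np.ndarray:
    """Log-scale volumes, percentile-rank intensity, then standardize."""
    feats = pd.DataFrame({
        "log_emissions": np.log1p(ops["E_total_kt"]),
        "intensity_pct": ops["intensity_kg_per_boe"].rank(pct=True),
        "log_facility_count": np.log1p(ops["facility_count"]),
        "log_production": np.log1p(ops["total_boe"]),
    })
    return StandardScaler().fit_transform(feats)

def fit_adaptive_kmeans(X, k_min=2, k_max=6, random_state=0):
    """Fit K-means for each k in range and keep the best silhouette score."""
    best_score, best_model = -1.0, None
    for k in range(k_min, min(k_max, len(X) - 1) + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
        score = silhouette_score(X, model.labels_)
        if score > best_score:
            best_score, best_model = score, model
    return best_model

# Synthetic operators drawn from three rough archetypes.
rng = np.random.default_rng(0)
archetypes = [(5000.0, 80.0, 5, 5e7), (200.0, 30.0, 50, 5e6), (5.0, 50.0, 2, 1e5)]
rows = [
    {
        "E_total_kt": e * rng.uniform(0.8, 1.2),
        "intensity_kg_per_boe": i * rng.uniform(0.8, 1.2),
        "facility_count": f,
        "total_boe": b * rng.uniform(0.8, 1.2),
    }
    for e, i, f, b in archetypes
    for _ in range(20)
]
ops = pd.DataFrame(rows)
model = fit_adaptive_kmeans(build_cluster_features(ops))
ops["cluster_id"] = model.labels_
```

Labels such as "high intensity thermal" would then come from rules applied to model.cluster_centers_, which is left out of this sketch.
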
validation and guardrails
- hours and volumes:
  - drop records with impossible hours.
  - set negative volumes to zero.
  - flag extreme outliers for review.
- units:
  - enforce conversion from e3m3 to m3 for gas.
- duplicates:
  - handle facility month duplicates via a consistent rule.
- linking:
  - use most recent valid well to facility link.
  - maintain operator history through dim_facility.
- diagnostics:
  - compare aggregated production against external statistics.
  - track match rates for well to facility and NGL linking.
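
The cleaning rules above can be sketched as a single pandas function. The 744-hour ceiling (31 days), the column names, and the keep-last duplicate rule are assumptions for illustration; the actual duplicate rule is not stated here.

```python
import pandas as pd

MAX_HOURS_IN_MONTH = 744  # 31 days * 24 h; assumed ceiling for "impossible hours"

def clean_volumes(df: pd.DataFrame) -> pd.DataFrame:
    """Apply hours, unit, sign, and duplicate guardrails to raw volumes."""
    # Drop records with impossible hours.
    out = df[df["hours"].between(0, MAX_HOURS_IN_MONTH)].copy()
    # Enforce the e3m3 to m3 conversion for gas.
    out["gas_m3"] = out["gas_e3m3"] * 1000.0
    # Set negative volumes to zero.
    out["gas_m3"] = out["gas_m3"].clip(lower=0)
    # Assumed duplicate rule: keep the last record per facility month.
    out = out.drop_duplicates(["facility_id", "production_month"], keep="last")
    return out.reset_index(drop=True)

raw = pd.DataFrame({
    "facility_id": ["F1", "F1", "F2", "F3"],
    "production_month": ["2023-01"] * 4,
    "hours": [700, 720, 800, 744],
    "gas_e3m3": [1.0, 1.5, 2.0, -0.2],
})
clean = clean_volumes(raw)
```

Here F2 is dropped for reporting 800 hours, the F1 duplicate resolves to the later row, and the negative F3 volume becomes zero.
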

Machine learning models

purpose
- validate and refine existing heuristics using supervised learning.
- use multi-year data to predict future outcomes and classify operators.
- provide additive insights that complement existing decision metrics.
task A: next-year intensity prediction (regression)
- grain: one row per (operator_baid, year_t).
- label: intensity_kg_per_boe at year t+1.
- features: E_total_kt, intensity_kg_per_boe, reduction_potential_kt, npv_mm, regulatory_risk_score, total_boe, venting_reduction_potential_kt, flaring_reduction_potential_kt, fuel_reduction_potential_kt, facility_count.
- models: RandomForestRegressor (primary), TheilSenRegressor (baseline).
- preprocessing: SimpleImputer (median strategy), StandardScaler.
- evaluation: time-aware train/test split (train on early years, test on later years).
- metrics: R², MAE, RMSE.
- data requirements: multi-year data (need year t and t+1 for each operator).
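
Task A can be sketched as a label join followed by a time-aware split. The panel below is synthetic and spans three years so that both a training and a test year exist; only two of the listed features are used, and add_next_year_label is a hypothetical helper, not the function in ml_targets.py.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def add_next_year_label(panel: pd.DataFrame) -> pd.DataFrame:
    """Attach intensity at year t+1 to each (operator_baid, year_t) row."""
    nxt = panel[["operator_baid", "year", "intensity_kg_per_boe"]].copy()
    nxt["year"] -= 1  # align the t+1 row with year t
    nxt = nxt.rename(columns={"intensity_kg_per_boe": "label_next_year"})
    # Inner join drops operators without a t+1 observation.
    return panel.merge(nxt, on=["operator_baid", "year"], how="inner")

# Synthetic panel: 20 operators, 3 years.
rng = np.random.default_rng(0)
rows = []
for op in range(20):
    base = rng.uniform(20.0, 80.0)
    for year in (2021, 2022, 2023):
        rows.append({
            "operator_baid": f"OP{op:02d}",
            "year": year,
            "intensity_kg_per_boe": base * rng.uniform(0.9, 1.1),
            "total_boe": rng.uniform(1e5, 1e6),
        })
labeled = add_next_year_label(pd.DataFrame(rows))

features = ["intensity_kg_per_boe", "total_boe"]
cutoff = labeled["year"].max()
train, test = labeled[labeled["year"] < cutoff], labeled[labeled["year"] == cutoff]

model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(train[features], train["label_next_year"])
pred = model.predict(test[features])

r2 = r2_score(test["label_next_year"], pred)
mae = mean_absolute_error(test["label_next_year"], pred)
rmse = float(np.sqrt(mean_squared_error(test["label_next_year"], pred)))
```

Joining on a shifted year rather than using a positional shift keeps the label correct when an operator has a gap in its reporting years.
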
task B: high-opportunity operator classification
- grain: one row per (operator_baid, year).
- label: binary (1 if opportunity_score >= 75th percentile, 0 otherwise).
- features: same as Task A.
- models: RandomForestClassifier (primary), LogisticRegression (baseline).
- preprocessing: SimpleImputer (median strategy), StandardScaler.
- evaluation: KFold or stratified KFold (cross-sectional, not temporal).
- metrics: confusion matrix, accuracy, precision, recall, F1, ROC AUC.
- note: labels are derived from existing opportunity_score heuristics, so this is a surrogate classification task.
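
Task B's label construction and cross-validation can be sketched as below. The data is synthetic and the opportunity_score formula is a made-up stand-in for the real heuristic; only the 75th percentile threshold, the preprocessing, and the model family come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({
    "E_total_kt": rng.lognormal(3.0, 1.0, n),
    "intensity_kg_per_boe": rng.uniform(10.0, 90.0, n),
    "npv_mm": rng.normal(5.0, 2.0, n),
})
# Hypothetical stand-in for the existing opportunity_score heuristic.
df["opportunity_score"] = 50.0 * df["E_total_kt"].rank(pct=True) + 50.0 * df[
    "intensity_kg_per_boe"
].rank(pct=True)

# Label: 1 if opportunity_score is at or above the 75th percentile.
threshold = df["opportunity_score"].quantile(0.75)
df["label"] = (df["opportunity_score"] >= threshold).astype(int)

features = ["E_total_kt", "intensity_kg_per_boe", "npv_mm"]
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
# Stratification keeps the roughly 25% positive rate in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, df[features], df["label"], cv=cv, scoring="roc_auc")
```

Because the label is a function of the same inputs the model sees, high scores here mainly show that the classifier can reproduce the heuristic, not that it has found independent signal.
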
task C: realized emissions reduction prediction (regression)
- grain: one row per (operator_baid, year_t).
- label: realized_reduction_kt = E_total_t - E_total_{t+1} (year-over-year emissions change).
- features: reduction potentials (venting, flaring, fuel), financial metrics (NPV, payback, MAC), operational metrics (intensity, production, facility count), risk scores, cluster archetypes.
- models: RandomForestRegressor (primary), TheilSenRegressor (baseline).
- preprocessing: ColumnTransformer with SimpleImputer + StandardScaler for numeric features, OneHotEncoder for categorical features (cluster, performance category).
- evaluation: time-aware single holdout (train on years < max, test on max year). With only 2 years of data, this is exploratory, not robust backtesting.
- metrics: R², MAE, RMSE, bias, correlation. Small-sample warnings issued when n < 10.
- purpose: validates which operators with high opportunity_score actually reduce emissions (identifies “doers” vs “talkers”). Used to refine operator targeting and stress-test existing heuristics.
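
The Task C preprocessing and single holdout can be sketched with a ColumnTransformer. The feature subset, the synthetic data, and the build_model and holdout_eval helpers are hypothetical; this is not the code in ml_reduction.py.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["reduction_potential_kt", "npv_mm", "intensity_kg_per_boe"]
CATEGORICAL = ["cluster_label"]
LABEL = "realized_reduction_kt"

def build_model() -> Pipeline:
    preprocess = ColumnTransformer([
        ("num", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ])

def holdout_eval(df: pd.DataFrame) -> dict:
    """Train on years before the max year, test on the max year."""
    max_year = df["year"].max()
    train, test = df[df["year"] < max_year], df[df["year"] == max_year]
    if len(test) < 10:
        warnings.warn(f"small test sample (n={len(test)}); metrics are unstable")
    model = build_model().fit(train[NUMERIC + CATEGORICAL], train[LABEL])
    err = model.predict(test[NUMERIC + CATEGORICAL]) - test[LABEL].to_numpy()
    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "bias": float(np.mean(err)),
    }

# Synthetic labelled rows for two label years.
rng = np.random.default_rng(0)
n = 60
data = pd.DataFrame({
    "year": np.repeat([2021, 2022], n // 2),
    "reduction_potential_kt": rng.uniform(0.0, 100.0, n),
    "npv_mm": rng.normal(5.0, 2.0, n),
    "intensity_kg_per_boe": rng.uniform(10.0, 90.0, n),
    "cluster_label": rng.choice(["thermal", "conventional"], n),
})
data.loc[0, "npv_mm"] = np.nan  # exercised by the median imputer
data[LABEL] = 0.1 * data["reduction_potential_kt"] + rng.normal(0.0, 1.0, n)

metrics = holdout_eval(data)
```

Setting handle_unknown="ignore" on the encoder lets the holdout year contain a cluster label that never appeared in training without raising an error.
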

ML validation status

We treat the reduction ML work as descriptive validation, not production forecasting, given only 2 years of panel data (2022–2023).

panel coverage
- 798 operator-years, 432 unique operators, 2022–2023
- 366 operators with both 2022 and 2023 (used for realized reductions)
realized reductions (2022→2023)
- 80.3% of operators increased emissions (mean realized reduction −37.8 Mt, median −1.5 Mt; negative values indicate an increase)
- 19.7% reduced emissions
- 45% improved emissions intensity year-over-year; production growth swamped efficiency gains
correlation analysis
- opportunity_score vs realized_reduction_t shows a strong negative correlation (r ≈ −0.73)
- high-opportunity operators (large oil sands players) increased emissions more during a growth period because they grew production 30–140%
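
The realized reduction and its correlation with opportunity_score can be reproduced from the operator panel along these lines. The panel below is a toy example with made-up numbers, not the 2022–2023 data, and realized_reduction is a hypothetical helper. Pearson correlation is assumed.

```python
import pandas as pd

def realized_reduction(panel: pd.DataFrame, year_t: int) -> pd.DataFrame:
    """E_total at year t minus E_total at t+1, for operators with both years."""
    wide = panel.pivot(index="operator_baid", columns="year", values="E_total_kt")
    both = wide.dropna(subset=[year_t, year_t + 1])
    out = pd.DataFrame({"realized_reduction_kt": both[year_t] - both[year_t + 1]})
    scores = panel.loc[panel["year"] == year_t].set_index("operator_baid")[
        "opportunity_score"
    ]
    return out.join(scores)

# Toy panel: operator E reports only 2022 and is excluded.
panel = pd.DataFrame(
    [
        ("A", 2022, 1000.0, 90.0), ("A", 2023, 1400.0, 88.0),
        ("B", 2022, 500.0, 70.0), ("B", 2023, 600.0, 72.0),
        ("C", 2022, 100.0, 40.0), ("C", 2023, 95.0, 41.0),
        ("D", 2022, 50.0, 20.0), ("D", 2023, 45.0, 19.0),
        ("E", 2022, 30.0, 10.0),
    ],
    columns=["operator_baid", "year", "E_total_kt", "opportunity_score"],
)

result = realized_reduction(panel, 2022)
share_increased = (result["realized_reduction_kt"] < 0).mean()
r = result["opportunity_score"].corr(result["realized_reduction_kt"])
```

Under this sign convention a negative realized reduction is an emissions increase, so a negative r means higher-scored operators grew their emissions more.
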
exports and panel
- multi-year panel: data/gold/operator_panel_2022_2023.parquet (798 rows, 2022–2023)
- validation CSVs (all descriptive, no ML predictions):
  - data/gold/reduction_panel_2022_2023.csv (366 rows)
  - data/gold/reduction_segments_prime_alltalk_hidden.csv (124 rows)
  - data/gold/reduction_summary_stats.csv (4 rows)
interpretation
- physics-based and economic scoring (opportunity_score, composite_score, etc.) remains the primary targeting framework
- reduction ML is architecturally implemented but treated as optional and exploratory; the DAG degrades gracefully when ML artifacts are absent
feature engineering
- features extracted from operator_decision_metrics and operator_petrinex_emissions.
- missing columns handled gracefully (default to 0).
- features merged on operator_baid and year.
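
The merge and the default-to-zero handling for missing columns can be sketched with a reindex. FEATURES is a shortened, illustrative list, build_feature_frame is a hypothetical helper, and the inner join is an assumption. Only columns that are entirely absent are filled with 0; existing nulls are left for the imputer.

```python
import pandas as pd

KEYS = ["operator_baid", "year"]
FEATURES = ["E_total_kt", "intensity_kg_per_boe", "npv_mm", "facility_count"]

def build_feature_frame(
    decision_metrics: pd.DataFrame, emissions: pd.DataFrame
) -> pd.DataFrame:
    """Merge the two sources on the keys and guarantee every feature column."""
    merged = decision_metrics.merge(
        emissions, on=KEYS, how="inner", suffixes=("", "_emissions")
    )
    # reindex adds any absent feature column, filled with 0.
    return merged.reindex(columns=KEYS + FEATURES, fill_value=0)

decision_metrics = pd.DataFrame({
    "operator_baid": ["A", "B"],
    "year": [2023, 2023],
    "npv_mm": [1.0, 2.0],
})
emissions = pd.DataFrame({
    "operator_baid": ["A", "B"],
    "year": [2023, 2023],
    "E_total_kt": [10.0, 20.0],
    "intensity_kg_per_boe": [30.0, 40.0],
})

# facility_count is missing from both inputs, so it defaults to 0.
feature_frame = build_feature_frame(decision_metrics, emissions)
```

Defaulting an absent column to 0 keeps the DAG running, but a constant feature carries no signal, so it is worth logging which columns were filled.
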
model integration
- ML outputs are additive; they do not replace existing scores (opportunity_score, composite_score, etc.).
- predictions can be attached to operator_viz_view or kept in separate ML views.
- model training is explicit and configurable, not always-on in production paths.
implementation location
- label construction: layers/gold/ml_targets.py (Tasks A & B), layers/gold/ml_reduction.py (Task C).
- model logic: layers/gold/ml_models.py (Tasks A & B), layers/gold/ml_reduction.py (Task C).
- Hamilton nodes: nodes/gold_ml.py (Tasks A & B), nodes/gold_ml_reduction.py (Task C).