Data & Methodology

Pipeline and data

  • ingestion
    • source system: regulatory production and activity platform.
    • entry point: validere bootstrap pulls volumes, NGL, and infrastructure data.
    • bronze storage:
      • partitioned parquet for volumetric and NGL records.
      • reference tables for business associates, facilities, and well to facility links.
  • medallion layers
    • bronze
      • raw parquet with minimal transformation.
    • silver
      • cleaned facility month fact tables for production, NGL, and flows.
      • facility dimension with history (SCD2 fields: valid_from, valid_to, is_current, version).
    • gold
      • emissions, intensity, and decision metrics at facility and operator level.
    • analysis
      • scenarios, risk views, sensitivities, clustering.
    • visualization
      • derived figures and exports from standardized views.
  • keys and grain
    • facility key: ReportingFacilityID mapped to facility_id.
    • operator key: OperatorBAID mapped to canonical operator_name.
    • time key: ProductionMonth plus year and month.
    • silver grain: facility_month.
    • gold grain: facility_month and operator_year.
  • core tables
    • bronze:
      • vol_raw: production and activities at well and facility level.
      • ngl_raw: component NGL volumes.
      • infrastructure: facility and operator mapping; well to facility links.
    • silver:
      • production_monthly: facility month production plus FLARE, VENT, FUEL volumes.
      • ngl_production: facility month NGL.
      • facility_edges_monthly: flows between facilities for attribution.
      • dim_facility: facility attributes with history.
    • gold:
      • facility_aggregated: facility month emissions and intensity.
      • operator_emissions: operator year emissions and intensity.
      • operator_decision_metrics: operator year decision metrics.
      • operator_viz_view: stable view consumed by visualizations.
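
The history fields on dim_facility allow a facility month to be matched to the operator that was valid in that month. The following is a minimal pandas sketch of such a point-in-time join, not the pipeline's actual code. It assumes half-open validity windows (valid_from inclusive, valid_to exclusive, empty valid_to meaning "still open"); the column names production_month, operator_baid, and gas_m3, the helper attach_operator_as_of, and all sample values are hypothetical.

```python
import pandas as pd

# Hypothetical dim_facility rows; SCD2 field names follow the outline, values are made up.
dim_facility = pd.DataFrame({
    "facility_id": ["F1", "F1", "F2"],
    "operator_baid": ["A001", "B002", "C003"],
    "valid_from": pd.to_datetime(["2020-01-01", "2023-01-01", "2019-06-01"]),
    "valid_to": pd.to_datetime(["2023-01-01", None, None]),  # NaT = open-ended
    "is_current": [False, True, True],
    "version": [1, 2, 1],
})

# Hypothetical silver facts at facility_month grain.
production_monthly = pd.DataFrame({
    "facility_id": ["F1", "F1", "F2"],
    "production_month": pd.to_datetime(["2022-12-01", "2023-03-01", "2023-03-01"]),
    "gas_m3": [1000.0, 1200.0, 800.0],
})


def attach_operator_as_of(facts: pd.DataFrame, dim: pd.DataFrame) -> pd.DataFrame:
    """Join each facility month to the dim_facility version valid in that month."""
    merged = facts.merge(dim, on="facility_id", how="left")
    open_ended = merged["valid_to"].isna()
    # Assumed window convention: valid_from <= month < valid_to.
    in_window = (merged["valid_from"] <= merged["production_month"]) & (
        open_ended | (merged["production_month"] < merged["valid_to"])
    )
    keep = list(facts.columns) + ["operator_baid", "version"]
    return merged.loc[in_window, keep].reset_index(drop=True)


result = attach_operator_as_of(production_monthly, dim_facility)
print(result)
```

In this sample, facility F1 changes operator at the start of 2023, so its December 2022 row resolves to version 1 and its March 2023 row to version 2.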

Emissions and metrics

  • emissions logic (summary)
    • inputs:
      • gas_flared_m3, gas_vented_m3, gas_fuel_m3.
      • gas throughput, oil and water volumes, NGL mix, steam volumes.
    • components:
      • venting, flaring, fuel, processing, oil lifting, water handling, steam, fugitives.
    • steps at facility month:
      • compute component emissions from volumes and factors.
      • sum to E_total.
      • convert production to BOE and sum to total_boe.
      • compute intensity_kg_per_boe = (E_total * 1000) / total_boe.
      • set intensity to null for very small producers; for charts, cap extreme values and keep the uncapped value in a separate column.
    • steps at operator year:
      • group by operator_baid and year.
      • sum emissions and BOE first.
      • recompute intensity_kg_per_boe from the sums.
  • calibrated emission factors
    • methodology:
      • factors are calibrated against TIER/GHGRP facility-level reported emissions.
      • calibration uses non-negative least squares (nnls) regression segmented by facility type.
      • segments: SAGD, GasPlant, Conventional (assigned based on facility type flags).
      • calibration dataset: facility-year grain with Petrinex activity volumes and TIER/GHGRP reported emissions.
    • factor calibration:
      • feature matrix: activity volumes (gas_vented_m3, gas_flared_m3, gas_fuel_m3, gas_processed_m3, oil_m3, water_m3, steam_proxy_m3, gas_production_m3).
      • target: E_total_tco2e_reported from TIER/GHGRP.
      • regression: scipy.optimize.nnls (non-negative constraints).
      • post-processing: factors clamped to physically plausible ranges (guardrails).
    • guardrails (physical plausibility ranges):
      • venting_tco2e_per_m3: 0.01-0.03 (CH4 GWP 28, density ~0.68 kg/m3).
      • flaring_tco2e_per_m3: 0.001-0.003 (combustion efficiency 85-95%).
      • fuel_gas_tco2e_per_m3: 0.001-0.003 (similar to flaring).
      • gas_processing_tco2e_per_m3: 0.00005-0.0005 (processing overhead).
      • oil_lifting_light_tco2e_per_m3: 0.005-0.015 (conventional oil).
      • oil_lifting_heavy_tco2e_per_m3: 0.015-0.04 (heavy oil / SAGD).
      • water_handling_tco2e_per_m3: 0.001-0.005 (water treatment).
      • sagd_steam_tco2e_per_m3: 0.04-0.08 (natural gas boilers).
      • fugitive_leakage_rate: 0.005-0.03 (0.5% to 3% of gas production).
    • evaluation metrics:
      • per segment: R², RMSE, MAE, bias (predicted vs reported emissions).
      • overall: aggregated metrics across all segments.
    • execution:
      • calibrated factors available via separate Hamilton nodes (facilities_with_emissions_calibrated, operator_petrinex_emissions_calibrated).
      • raw (default) factors remain available via original nodes (no breaking changes).
      • calibrated factors are applied segment-specifically based on facility type detection.
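
The facility month and operator year steps listed under the emissions logic can be sketched as below. This is an illustration under stated assumptions, not the pipeline's implementation: E_total is assumed to be in tCO2e (which is what the factor of 1000 in the intensity formula implies), and the small-producer threshold MIN_BOE, the chart cap CHART_CAP, and the column name intensity_kg_per_boe_capped are hypothetical.

```python
import pandas as pd

MIN_BOE = 100.0    # hypothetical small-producer threshold
CHART_CAP = 500.0  # hypothetical cap applied only to the chart column

# Made-up facility month rows; E_total is assumed to be in tCO2e.
facility_month = pd.DataFrame({
    "operator_baid": ["A001", "A001", "B002"],
    "year": [2023, 2023, 2023],
    "E_total": [10.0, 90.0, 0.5],
    "total_boe": [1000.0, 3000.0, 10.0],
})


def add_intensity(df: pd.DataFrame) -> pd.DataFrame:
    """Facility month intensity: null for small producers, capped copy for charts."""
    out = df.copy()
    raw = out["E_total"] * 1000.0 / out["total_boe"]
    out["intensity_kg_per_boe"] = raw.where(out["total_boe"] >= MIN_BOE)
    out["intensity_kg_per_boe_capped"] = out["intensity_kg_per_boe"].clip(upper=CHART_CAP)
    return out


def operator_year(df: pd.DataFrame) -> pd.DataFrame:
    """Sum emissions and BOE first, then recompute intensity from the sums."""
    sums = df.groupby(["operator_baid", "year"], as_index=False)[["E_total", "total_boe"]].sum()
    sums["intensity_kg_per_boe"] = sums["E_total"] * 1000.0 / sums["total_boe"]
    return sums


fm = add_intensity(facility_month)
op = operator_year(facility_month)
print(fm)
print(op)
```

For operator A001 the two facility months have intensities of 10 and 30 kg/BOE, whose simple average is 20, while the ratio of the sums is 25 kg/BOE. Summing first weights each facility by its production, which is why the operator year step recomputes intensity rather than averaging.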

Calibration workflow (optional)

Calibration against GHGRP/NPRI facility-level reported emissions is available via emission_calibration.py and gold_emissions_ml.py nodes.

  • setup
    • data sources:
      • ghgrp_emissions_annual.parquet in silver layer.
      • facility_mapping_ghgrp_petrinex.parquet for ID mapping.
    • nodes:
      • ghgrp_facility_emissions (bronze): load GHGRP emissions mapped to Petrinex facility IDs.
      • calibration_dataset (gold): join Petrinex activity volumes with GHGRP reported emissions at facility-year grain.
      • calibrated_emission_factors (gold): fit factors using NNLS segmented by facility type.
      • facilities_with_emissions_calibrated (gold): calculate emissions with calibrated factors.
      • operator_petrinex_emissions_calibrated (gold): operator-level emissions with calibrated factors.
      • operator_emissions_active (gold): select calibrated vs baseline based on config toggle.

  • calibration method
    • regression: non-negative least squares (scipy.optimize.nnls).
    • feature matrix X: activity volumes (gas_vented_m3, gas_flared_m3, gas_fuel_m3, gas_production_m3, oil_m3, water_m3, steam_proxy_m3).
    • target y: E_total_tco2e_reported from GHGRP.
    • segmentation: fit separate factors for SAGD, GasPlant, Conventional.
    • post-processing: clamp factors to physical guardrails (venting 0.01–0.03, flaring 0.001–0.003, etc.).
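
The calibration method above can be illustrated for a single segment as follows. This is a sketch on synthetic, noise-free data, not the contents of emission_calibration.py: it uses only three of the listed features for brevity, the guardrail ranges are copied from the guardrails list in this document, and the function name calibrate_segment is hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

# Subset of the feature matrix, for brevity; the order defines the columns of X.
FEATURES = ["gas_vented_m3", "gas_flared_m3", "gas_fuel_m3"]

# Guardrail ranges in tCO2e per m3, taken from the guardrails list.
GUARDRAILS = {
    "gas_vented_m3": (0.01, 0.03),
    "gas_flared_m3": (0.001, 0.003),
    "gas_fuel_m3": (0.001, 0.003),
}


def calibrate_segment(X: np.ndarray, y: np.ndarray) -> dict:
    """Fit non-negative factors for one segment, then clamp them to the guardrails."""
    coef, _residual_norm = nnls(X, y)
    lower = np.array([GUARDRAILS[name][0] for name in FEATURES])
    upper = np.array([GUARDRAILS[name][1] for name in FEATURES])
    return dict(zip(FEATURES, np.clip(coef, lower, upper)))


# Synthetic facility-year rows for one segment (made up, noise-free).
rng = np.random.default_rng(0)
X = rng.uniform(1e3, 1e5, size=(50, len(FEATURES)))
true_factors = np.array([0.02, 0.002, 0.05])  # fuel factor deliberately implausible
y = X @ true_factors

factors = calibrate_segment(X, y)
print(factors)
```

Here the venting and flaring factors are recovered inside their ranges, while the implausible fuel factor is pulled down to the upper guardrail of 0.003. To segment the fit, the same function would be called once per facility type (SAGD, GasPlant, Conventional) on that segment's rows.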

  • evaluation
    • metrics per segment: R², RMSE, MAE, bias (predicted vs reported emissions).
    • baseline vs calibrated comparison via emission_calibration_metrics node.
    • segment-specific factor application in add_emissions_to_dataframe based on facility type flags.
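
For reference, the four per-segment metrics can be computed as in the sketch below. The function name segment_metrics and the sample values are hypothetical, and the sign convention for bias (predicted minus reported, so positive means over-prediction) is an assumption, since the source does not state it.

```python
import numpy as np


def segment_metrics(reported, predicted) -> dict:
    """R², RMSE, MAE, and bias of predicted vs reported emissions for one segment."""
    reported = np.asarray(reported, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - reported  # assumed convention: positive = over-prediction
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((reported - reported.mean()) ** 2))
    return {
        "r2": 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "bias": float(np.mean(err)),
    }


# Made-up reported vs predicted emissions (tCO2e) for three facility-years.
metrics = segment_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(metrics)
```

Running the same function on baseline and calibrated predictions for each segment gives a like-for-like comparison of the two factor sets.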

  • toggle
    • config.USE_CALIBRATED_FACTORS = True to use calibrated factors instead of baseline.
    • operator_emissions_active node switches automatically based on toggle.
    • fallback: baseline factors if GHGRP data not available or calibration fails.
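
The toggle and fallback behaviour might look like the plain function below. This is a hedged sketch: the real operator_emissions_active is a Hamilton node whose signature is not shown in this document, so the parameters here, the use of None or an empty frame to represent "calibration unavailable", and the factor_source column are all assumptions for illustration.

```python
from typing import Optional

import pandas as pd

USE_CALIBRATED_FACTORS = True  # stands in for config.USE_CALIBRATED_FACTORS


def operator_emissions_active(
    baseline: pd.DataFrame,
    calibrated: Optional[pd.DataFrame],
    use_calibrated: bool = USE_CALIBRATED_FACTORS,
) -> pd.DataFrame:
    """Return calibrated emissions when enabled and available, else the baseline."""
    if use_calibrated and calibrated is not None and not calibrated.empty:
        return calibrated.assign(factor_source="calibrated")
    # Fallback: toggle off, GHGRP data missing, or calibration failed upstream.
    return baseline.assign(factor_source="baseline")


# Made-up operator-level emissions.
baseline_df = pd.DataFrame({"operator_baid": ["A001"], "E_total_kt": [12.0]})
calibrated_df = pd.DataFrame({"operator_baid": ["A001"], "E_total_kt": [10.5]})

active = operator_emissions_active(baseline_df, calibrated_df)
fallback = operator_emissions_active(baseline_df, None)
print(active)
print(fallback)
```

Tagging the output with its factor source is a design choice added here so that downstream views can tell which factor set produced a row.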

  • decision metrics (definitions)
    • npv_mm: net present value of reduction opportunities over a fixed horizon and discount rate.
    • reduction_potential_kt: addressable emissions volume.
    • regulatory_risk_score: composite regulatory exposure.
    • payback_years: simple payback.
    • investment_score:
      • based on NPV to CAPEX ratio.
      • normalized to 0-100 by percentile.
    • benefit_score (internal):
      • weighted combination of intensity, scale, financial, and regulatory scores.
      • used for scenario and sensitivity views; not used for main rankings.
  • clustering and segments
    • features:
      • log emissions volume.
      • intensity percentile.
      • log facility count.
      • log production volume.
    • method:
      • K-means with adaptive cluster count.
    • labels:
      • heuristic rules on cluster centroids to produce simple names (for example high intensity thermal, large efficient conventional).
    • outputs:
      • cluster_id, cluster_label, cluster level opportunity scores.
  • validation and guardrails
    • hours and volumes:
      • drop records with impossible hours.
      • set negative volumes to zero.
      • flag extreme outliers for review.
    • units:
      • enforce conversion from e3m3 to m3 for gas.
    • duplicates:
      • handle facility month duplicates via a consistent rule.
    • linking:
      • use most recent valid well to facility link.
      • maintain operator history through dim_facility.
    • diagnostics:
      • compare aggregated production against external statistics.
      • track match rates for well to facility and NGL linking.
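
Several of the validation rules above can be expressed in a few lines of pandas. The sketch below is illustrative only and fills in details the outline leaves open: "impossible hours" is assumed to mean negative hours or more hours than the month contains, the duplicate rule is assumed to be "keep the last record", and the column names hours, gas_e3m3, and gas_m3 are hypothetical. Outlier flagging is not covered.

```python
import pandas as pd

# Made-up raw records at facility month grain.
raw = pd.DataFrame({
    "facility_id": ["F1", "F1", "F2", "F3"],
    "production_month": pd.to_datetime(
        ["2023-02-01", "2023-02-01", "2023-02-01", "2023-03-01"]
    ),
    "hours": [672.0, 600.0, 800.0, 100.0],
    "gas_e3m3": [1.5, 2.0, 3.0, -0.2],
})


def clean_volumes(df: pd.DataFrame) -> pd.DataFrame:
    """Apply hours, sign, unit, and duplicate guardrails to raw volumes."""
    hours_in_month = df["production_month"].dt.days_in_month * 24
    # Drop records with impossible hours (assumed: negative or above the month's hours).
    out = df[(df["hours"] >= 0) & (df["hours"] <= hours_in_month)].copy()
    # Set negative volumes to zero, then convert e3m3 to m3.
    out["gas_m3"] = out["gas_e3m3"].clip(lower=0) * 1000.0
    # Resolve facility month duplicates (assumed rule: keep the last record).
    out = out.drop_duplicates(subset=["facility_id", "production_month"], keep="last")
    return out.drop(columns="gas_e3m3").reset_index(drop=True)


clean = clean_volumes(raw)
print(clean)
```

In the sample, the F2 record is dropped because 800 hours exceeds the 672 hours in February 2023, the two F1 records collapse to one, and the negative F3 volume becomes zero.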

Machine learning models

  • purpose
    • validate and refine existing heuristics using supervised learning.
    • use multi-year data to predict future outcomes and classify operators.
    • provide additive insights that complement existing decision metrics.
  • task A: next-year intensity prediction (regression)
    • grain: one row per (operator_baid, year_t).
    • label: intensity_kg_per_boe at year t+1.
    • features: E_total_kt, intensity_kg_per_boe, reduction_potential_kt, npv_mm, regulatory_risk_score, total_boe, venting_reduction_potential_kt, flaring_reduction_potential_kt, fuel_reduction_potential_kt, facility_count.
    • models: RandomForestRegressor (primary), TheilSenRegressor (baseline).
    • preprocessing: SimpleImputer (median strategy), StandardScaler.
    • evaluation: time-aware train/test split (train on early years, test on later years).
    • metrics: R², MAE, RMSE.
    • data requirements: multi-year data (need year t and t+1 for each operator).
  • task B: high-opportunity operator classification
    • grain: one row per (operator_baid, year).
    • label: binary (1 if opportunity_score >= 75th percentile, 0 otherwise).
    • features: same as Task A.
    • models: RandomForestClassifier (primary), LogisticRegression (baseline).
    • preprocessing: SimpleImputer (median strategy), StandardScaler.
    • evaluation: KFold or StratifiedKFold (cross-sectional, not temporal).
    • metrics: confusion matrix, accuracy, precision, recall, F1, ROC AUC.
    • note: labels are derived from the existing opportunity_score heuristic, so this is a surrogate classification task: the model learns to reproduce the heuristic rather than an independent outcome.
  • task C: realized emissions reduction prediction (regression)
    • grain: one row per (operator_baid, year_t).
    • label: realized_reduction_kt = E_total_t - E_total_{t+1} (year-over-year emissions change).
    • features: reduction potentials (venting, flaring, fuel), financial metrics (NPV, payback, MAC), operational metrics (intensity, production, facility count), risk scores, cluster archetypes.
    • models: RandomForestRegressor (primary), TheilSenRegressor (baseline).
    • preprocessing: ColumnTransformer with SimpleImputer + StandardScaler for numeric features, OneHotEncoder for categorical features (cluster, performance category).
    • evaluation: time-aware single holdout (train on years < max, test on max year). With only 2 years of data, this is exploratory, not robust backtesting.
    • metrics: R², MAE, RMSE, bias, correlation. Small-sample warnings issued when n < 10.
    • purpose: validates which operators with high opportunity_score actually reduce emissions (identifies “doers” vs “talkers”). Used to refine operator targeting and stress-test existing heuristics.
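
The label construction and time-aware split described for Task A can be sketched in pandas as below. This is not the code in layers/gold/ml_targets.py: the helper names, the label column intensity_next_year, and the sample values are hypothetical. A self-join on (operator_baid, year) is used instead of a grouped shift so that an operator with a gap year does not receive a label from a non-adjacent year.

```python
import pandas as pd

# Made-up operator-year panel.
panel = pd.DataFrame({
    "operator_baid": ["A001", "A001", "A001", "B002", "B002"],
    "year": [2021, 2022, 2023, 2022, 2023],
    "intensity_kg_per_boe": [40.0, 38.0, 35.0, 60.0, 66.0],
})


def add_next_year_label(df: pd.DataFrame) -> pd.DataFrame:
    """Attach intensity at year t+1 to each (operator_baid, year_t) row."""
    nxt = df[["operator_baid", "year", "intensity_kg_per_boe"]].copy()
    nxt["year"] = nxt["year"] - 1  # align year t+1 values onto year t
    nxt = nxt.rename(columns={"intensity_kg_per_boe": "intensity_next_year"})
    # Inner join drops rows that have no following year (no label available).
    return df.merge(nxt, on=["operator_baid", "year"], how="inner")


def time_aware_split(df: pd.DataFrame, test_year: int):
    """Train on years before test_year, test on test_year."""
    return df[df["year"] < test_year], df[df["year"] == test_year]


labeled = add_next_year_label(panel)
train, test = time_aware_split(labeled, test_year=int(labeled["year"].max()))
print(labeled)
print(len(train), len(test))
```

Note that the last labeled year is one less than the last year in the panel, because each label needs a following year. The Task C label (E_total_t minus E_total_{t+1}) can be built with the same alignment.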

ML validation status

We treat the reduction ML work as descriptive validation, not production forecasting, given only 2 years of panel data (2022–2023).

  • panel coverage
    • 798 operator-years, 432 unique operators, 2022–2023
    • 366 operators with both 2022 and 2023 (used for realized reductions)
  • realized reductions (2022→2023)
    • 80.3% of operators increased emissions (mean realized reduction −37.8 Mt, median −1.5 Mt; negative values denote an increase)
    • 19.7% reduced emissions
    • 45% improved emissions intensity year-over-year; production growth swamped efficiency gains
  • correlation analysis
    • opportunity_score vs realized_reduction_t shows a strong negative correlation (r ≈ −0.73)
    • high-opportunity operators (large oil sands players) increased emissions more during a growth period because they grew production by 30–140%
  • exports and panel
    • multi-year panel: data/gold/operator_panel_2022_2023.parquet (798 rows, 2022–2023)
    • validation CSVs (all descriptive, no ML predictions):
      • data/gold/reduction_panel_2022_2023.csv (366 rows)
      • data/gold/reduction_segments_prime_alltalk_hidden.csv (124 rows)
      • data/gold/reduction_summary_stats.csv (4 rows)
  • interpretation
    • physics-based and economic scoring (opportunity_score, composite_score, etc.) remains the primary targeting framework
    • reduction ML is architecturally implemented but treated as optional and exploratory; the DAG degrades gracefully when ML artifacts are absent
  • feature engineering
    • features extracted from operator_decision_metrics and operator_petrinex_emissions.
    • missing columns handled gracefully (default to 0).
    • features merged on operator_baid and year.
  • model integration
    • ML outputs are additive; they do not replace existing scores (opportunity_score, composite_score, etc.).
    • predictions can be attached to operator_viz_view or kept in separate ML views.
    • model training is explicit and configurable, not always-on in production paths.
  • implementation location
    • label construction: layers/gold/ml_targets.py (Tasks A & B), layers/gold/ml_reduction.py (Task C).
    • model logic: layers/gold/ml_models.py (Tasks A & B), layers/gold/ml_reduction.py (Task C).
    • Hamilton nodes: nodes/gold_ml.py (Tasks A & B), nodes/gold_ml_reduction.py (Task C).
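
The descriptive statistics reported under realized reductions and correlation analysis can be reproduced from a two-year panel along the lines of the sketch below. It is an illustration on made-up rows, not the code in layers/gold/ml_reduction.py; the column E_total_kt follows the Task A feature list, and the helper realized_reductions is hypothetical. The sign convention follows the Task C label: E_total at year t minus E_total at year t+1, so a negative value is an increase.

```python
import pandas as pd

# Made-up operator-year panel; C003 has only one year and drops out.
panel = pd.DataFrame({
    "operator_baid": ["A001", "A001", "B002", "B002", "C003", "D004", "D004"],
    "year": [2022, 2023, 2022, 2023, 2023, 2022, 2023],
    "E_total_kt": [100.0, 130.0, 50.0, 45.0, 10.0, 20.0, 22.0],
    "opportunity_score": [80.0, 82.0, 40.0, 38.0, 20.0, 60.0, 61.0],
})


def realized_reductions(df: pd.DataFrame, year_t: int = 2022) -> pd.DataFrame:
    """Realized reduction per operator present in both year_t and year_t + 1."""
    t = df[df["year"] == year_t].set_index("operator_baid")
    t1 = df[df["year"] == year_t + 1].set_index("operator_baid")
    both = t.join(t1[["E_total_kt"]], how="inner", rsuffix="_next")
    both["realized_reduction_kt"] = both["E_total_kt"] - both["E_total_kt_next"]
    return both.reset_index()


reductions = realized_reductions(panel)
share_increased = (reductions["realized_reduction_kt"] < 0).mean()
r = reductions["opportunity_score"].corr(reductions["realized_reduction_kt"])
print(reductions[["operator_baid", "realized_reduction_kt"]])
print(share_increased, r)
```

In this toy panel the highest-scoring operator also shows the largest increase, so the correlation comes out negative, which mirrors the direction (not the magnitude) of the pattern described above.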