Data & Methodology
Pipeline and data
- ingestion
- source system: regulatory production and activity platform.
- entry point: validere bootstrap pulls volumes, NGL, and infrastructure data.
- bronze storage:
- partitioned parquet for volumetric and NGL records.
- reference tables for business associates, facilities, and well to facility links.
- medallion layers
- bronze
- raw parquet with minimal transformation.
- silver
- cleaned facility month fact tables for production, NGL, and flows.
- facility dimension with history (SCD2 fields: valid_from, valid_to, is_current, version).
- gold
- emissions, intensity, and decision metrics at facility and operator level.
- analysis
- scenarios, risk views, sensitivities, clustering.
- visualization
- derived figures and exports from standardized views.
- keys and grain
- facility key: ReportingFacilityID mapped to facility_id.
- operator key: OperatorBAID mapped to canonical operator_name.
- time key: ProductionMonth plus year and month.
- silver grain: facility_month.
- gold grain: facility_month and operator_year.
- core tables
- bronze:
- vol_raw: production and activities at well and facility level.
- ngl_raw: component NGL volumes.
- infrastructure: facility and operator mapping; well to facility links.
- silver:
- production_monthly: facility month production plus FLARE, VENT, FUEL volumes.
- ngl_production: facility month NGL.
- facility_edges_monthly: flows between facilities for attribution.
- dim_facility: facility attributes with history.
- gold:
- facility_aggregated: facility month emissions and intensity.
- operator_emissions: operator year emissions and intensity.
- operator_decision_metrics: operator year decision metrics.
- operator_viz_view: stable view consumed by visualizations.
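The silver facility_month grain can be sketched with pandas; the record values below are invented, and only the key mapping above (ReportingFacilityID to facility_id, ProductionMonth to year and month) is taken from the table definitions:

```python
import pandas as pd

# Invented bronze-style records; only the key columns follow the mapping above.
vol_raw = pd.DataFrame({
    "ReportingFacilityID": ["F1", "F1", "F2"],
    "ProductionMonth": ["2023-01", "2023-01", "2023-01"],
    "gas_flared_m3": [100.0, 50.0, 10.0],
    "oil_m3": [500.0, 250.0, 80.0],
})

silver = (
    vol_raw.rename(columns={"ReportingFacilityID": "facility_id"})
    .assign(
        year=lambda d: d["ProductionMonth"].str.slice(0, 4).astype(int),
        month=lambda d: d["ProductionMonth"].str.slice(5, 7).astype(int),
    )
    # silver grain: one row per facility_month
    .groupby(["facility_id", "year", "month"], as_index=False)[
        ["gas_flared_m3", "oil_m3"]
    ].sum()
)
print(silver)
```

Well-level rows for the same facility and month collapse into a single facility_month row.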
Emissions and metrics
- emissions logic (summary)
- inputs:
- gas_flared_m3, gas_vented_m3, gas_fuel_m3.
- gas throughput, oil and water volumes, NGL mix, steam volumes.
- components:
- venting, flaring, fuel, processing, oil lifting, water handling, steam, fugitives.
- steps at facility month:
- compute component emissions from volumes and factors.
- sum to E_total.
- convert production to BOE and sum to total_boe.
- compute intensity_kg_per_boe = (E_total * 1000) / total_boe.
- set intensity to null for very small producers; cap extreme values for charts and keep original in a separate column.
- steps at operator year:
- group by operator_baid and year.
- sum emissions and BOE first.
- recompute intensity_kg_per_boe from the sums.
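The sum-first, recompute-last ordering matters: averaging per-facility intensities would mis-weight small facilities. A minimal pandas sketch with invented numbers:

```python
import pandas as pd

# Facility-month rows (illustrative values): E_total in tCO2e, production in BOE.
fm = pd.DataFrame({
    "operator_baid": ["A", "A", "B"],
    "year": [2023, 2023, 2023],
    "E_total_tco2e": [10.0, 30.0, 5.0],
    "total_boe": [1000.0, 1000.0, 500.0],
})

# Sum emissions and BOE first, then recompute intensity from the sums.
op = fm.groupby(["operator_baid", "year"], as_index=False)[
    ["E_total_tco2e", "total_boe"]
].sum()
op["intensity_kg_per_boe"] = op["E_total_tco2e"] * 1000 / op["total_boe"]
print(op)
```

Operator A gets (10 + 30) * 1000 / 2000 = 20 kg/BOE, not the mean of the two facility ratios.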
- calibrated emission factors
- methodology:
- factors are calibrated against TIER/GHGRP facility-level reported emissions.
- calibration uses non-negative least squares (nnls) regression segmented by facility type.
- segments: SAGD, GasPlant, Conventional (assigned based on facility type flags).
- calibration dataset: facility-year grain with Petrinex activity volumes and TIER/GHGRP reported emissions.
- factor calibration:
- feature matrix: activity volumes (gas_vented_m3, gas_flared_m3, gas_fuel_m3, gas_processed_m3, oil_m3, water_m3, steam_proxy_m3, gas_production_m3).
- target: E_total_tco2e_reported from TIER/GHGRP.
- regression: scipy.optimize.nnls (non-negative constraints).
- post-processing: factors clamped to physically plausible ranges (guardrails).
- guardrails (physical plausibility ranges):
- venting_tco2e_per_m3: 0.01-0.03 (CH4 GWP 28, density ~0.68 kg/m3).
- flaring_tco2e_per_m3: 0.001-0.003 (combustion efficiency 85-95%).
- fuel_gas_tco2e_per_m3: 0.001-0.003 (similar to flaring).
- gas_processing_tco2e_per_m3: 0.00005-0.0005 (processing overhead).
- oil_lifting_light_tco2e_per_m3: 0.005-0.015 (conventional oil).
- oil_lifting_heavy_tco2e_per_m3: 0.015-0.04 (heavy oil / SAGD).
- water_handling_tco2e_per_m3: 0.001-0.005 (water treatment).
- sagd_steam_tco2e_per_m3: 0.04-0.08 (natural gas boilers).
- fugitive_leakage_rate: 0.005-0.03 (0.5% to 3% of gas production).
- evaluation metrics:
- per segment: R², RMSE, MAE, bias (predicted vs reported emissions).
- overall: aggregated metrics across all segments.
- execution:
- calibrated factors available via separate Hamilton nodes (facilities_with_emissions_calibrated, operator_petrinex_emissions_calibrated).
- raw (default) factors remain available via original nodes (no breaking changes).
- calibrated factors are applied segment-specifically based on facility type detection.
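The NNLS fit with guardrail clamping can be sketched on synthetic data. This sketch covers one segment and only the venting and flaring columns (the real feature matrix has all eight activity volumes):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Synthetic facility-year activity volumes for one segment;
# true factors chosen inside the guardrails above.
X = rng.uniform(1e3, 1e5, size=(40, 2))       # [gas_vented_m3, gas_flared_m3]
true = np.array([0.02, 0.002])                # tCO2e per m3
y = X @ true + rng.normal(0, 10.0, size=40)   # noisy reported E_total

factors, _ = nnls(X, y)                       # non-negative least squares

# Clamp to the physical plausibility ranges so a noisy fit
# cannot leave the guardrails.
lo = np.array([0.01, 0.001])
hi = np.array([0.03, 0.003])
factors = np.clip(factors, lo, hi)
print(factors)
```

With clean synthetic data the fit recovers factors close to the true values; the clamp only binds when the regression drifts outside the stated ranges.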
Calibration workflow (optional)
Calibration against GHGRP/NPRI facility-level reported emissions is available via emission_calibration.py and gold_emissions_ml.py nodes.
Setup:
- data sources:
- ghgrp_emissions_annual.parquet in silver layer
- facility_mapping_ghgrp_petrinex.parquet for ID mapping
- nodes:
- ghgrp_facility_emissions (bronze): load GHGRP emissions mapped to Petrinex facility IDs
- calibration_dataset (gold): join Petrinex activity volumes with GHGRP reported emissions at facility-year grain
- calibrated_emission_factors (gold): fit factors using NNLS segmented by facility type
- facilities_with_emissions_calibrated (gold): calculate emissions with calibrated factors
- operator_petrinex_emissions_calibrated (gold): operator-level emissions with calibrated factors
- operator_emissions_active (gold): select calibrated vs baseline based on config toggle
Calibration method:
- regression: non-negative least squares (scipy.optimize.nnls)
- feature matrix X: activity volumes (gas_vented_m3, gas_flared_m3, gas_fuel_m3, gas_production_m3, oil_m3, water_m3, steam_proxy_m3)
- target y: E_total_tco2e_reported from GHGRP
- segmentation: fit separate factors for SAGD, GasPlant, Conventional
- post-processing: clamp factors to physical guardrails (venting 0.01–0.03, flaring 0.001–0.003, etc.)
Evaluation:
- metrics per segment: R², RMSE, MAE, bias (predicted vs reported emissions)
- baseline vs calibrated comparison via emission_calibration_metrics node
- segment-specific factor application in add_emissions_to_dataframe based on facility type flags
Toggle:
- config.USE_CALIBRATED_FACTORS = True to use calibrated factors instead of baseline
- operator_emissions_active node switches automatically based on toggle
- fallback: baseline factors if GHGRP data is unavailable or calibration fails
- decision metrics (definitions)
- npv_mm: net present value of reduction opportunities over a fixed horizon and discount rate.
- reduction_potential_kt: addressable emissions volume.
- regulatory_risk_score: composite regulatory exposure.
- payback_years: simple payback.
- investment_score:
- based on NPV to CAPEX ratio.
- normalized to 0-100 by percentile.
- benefit_score (internal):
- weighted combination of intensity, scale, financial, and regulatory scores.
- used for scenario and sensitivity views; not used for main rankings.
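The percentile normalization of investment_score can be sketched as below. The capex_mm column is a hypothetical name, since only the NPV-to-CAPEX ratio itself is specified above:

```python
import pandas as pd

# Illustrative operator metrics; capex_mm is a hypothetical column name.
df = pd.DataFrame({
    "operator_baid": ["A", "B", "C", "D"],
    "npv_mm": [10.0, 4.0, 25.0, 1.0],
    "capex_mm": [5.0, 4.0, 5.0, 2.0],
})

# NPV-to-CAPEX ratio, normalized to 0-100 by percentile rank.
df["npv_capex_ratio"] = df["npv_mm"] / df["capex_mm"]
df["investment_score"] = df["npv_capex_ratio"].rank(pct=True) * 100
print(df[["operator_baid", "investment_score"]])
```

Percentile ranking makes the score robust to outliers in the raw ratio, at the cost of discarding the ratio's absolute scale.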
- clustering and segments
- features:
- log emissions volume.
- intensity percentile.
- log facility count.
- log production volume.
- method:
- K-means with adaptive cluster count.
- labels:
- heuristic rules on cluster centroids produce simple names (for example "high-intensity thermal", "large efficient conventional").
- outputs:
- cluster_id, cluster_label, cluster level opportunity scores.
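One way to implement an adaptive cluster count is to select k by silhouette score; this sketch assumes that approach and uses synthetic two-group data mirroring the feature list above:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic features: log emissions, intensity percentile,
# log facility count, log production volume.
X = np.vstack([
    rng.normal([2, 0.8, 1, 3], 0.2, size=(30, 4)),   # high-intensity group
    rng.normal([5, 0.2, 3, 7], 0.2, size=(30, 4)),   # large efficient group
])
X = StandardScaler().fit_transform(X)

# Adaptive cluster count: pick k by silhouette score over a small range.
best_k, best_score = None, -1.0
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score
print(best_k)
```

On two well-separated synthetic groups the silhouette criterion selects k = 2.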
- validation and guardrails
- hours and volumes:
- drop records with impossible hours.
- set negative volumes to zero.
- flag extreme outliers for review.
- units:
- enforce conversion from e3m3 to m3 for gas.
- duplicates:
- handle facility month duplicates via a consistent rule.
- linking:
- use most recent valid well to facility link.
- maintain operator history through dim_facility.
- diagnostics:
- compare aggregated production against external statistics.
- track match rates for well to facility and NGL linking.
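The hours, volume, and unit guardrails can be sketched as a few pandas operations. Column names here are illustrative, and 744 is used as the maximum hours in a calendar month:

```python
import pandas as pd

raw = pd.DataFrame({
    "facility_id": ["F1", "F2", "F3"],
    "hours": [720, 9000, 500],        # 9000 h is impossible for one month
    "gas_e3m3": [12.0, 3.0, -1.0],    # gas reported in e3m3; one negative value
})

clean = raw[raw["hours"] <= 744].copy()              # drop impossible hours
clean["gas_e3m3"] = clean["gas_e3m3"].clip(lower=0)  # negative volumes -> 0
clean["gas_m3"] = clean["gas_e3m3"] * 1000           # enforce e3m3 -> m3
print(clean)
```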
Machine learning models
- purpose
- validate and refine existing heuristics using supervised learning.
- use multi-year data to predict future outcomes and classify operators.
- provide additive insights that complement existing decision metrics.
- task A: next-year intensity prediction (regression)
- grain: one row per (operator_baid, year_t).
- label: intensity_kg_per_boe at year t+1.
- features: E_total_kt, intensity_kg_per_boe, reduction_potential_kt, npv_mm, regulatory_risk_score, total_boe, venting_reduction_potential_kt, flaring_reduction_potential_kt, fuel_reduction_potential_kt, facility_count.
- models: RandomForestRegressor (primary), TheilSenRegressor (baseline).
- preprocessing: SimpleImputer (median strategy), StandardScaler.
- evaluation: time-aware train/test split (train on early years, test on later years).
- metrics: R², MAE, RMSE.
- data requirements: multi-year data (need year t and t+1 for each operator).
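The Task A setup can be sketched with scikit-learn; the panel below is synthetic and uses only two of the listed features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic operator-year panel; label is next-year intensity.
panel = pd.DataFrame({
    "year": rng.choice([2020, 2021, 2022], size=n),
    "intensity_kg_per_boe": rng.uniform(5, 50, size=n),
    "E_total_kt": rng.uniform(1, 100, size=n),
})
panel["intensity_next_year"] = (
    panel["intensity_kg_per_boe"] * 0.95 + rng.normal(0, 1, n)
)

features = ["intensity_kg_per_boe", "E_total_kt"]
# Time-aware split: train on early years, test on the latest year.
train = panel[panel["year"] < 2022]
test = panel[panel["year"] == 2022]

model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    RandomForestRegressor(n_estimators=100, random_state=0),
)
model.fit(train[features], train["intensity_next_year"])
mae = mean_absolute_error(test["intensity_next_year"], model.predict(test[features]))
print(round(mae, 2))
```

The split by year, rather than a random split, prevents leakage of future observations into training.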
- task B: high-opportunity operator classification
- grain: one row per (operator_baid, year).
- label: binary (1 if opportunity_score >= 75th percentile, 0 otherwise).
- features: same as Task A.
- models: RandomForestClassifier (primary), LogisticRegression (baseline).
- preprocessing: SimpleImputer (median strategy), StandardScaler.
- evaluation: KFold or stratified KFold (cross-sectional, not temporal).
- metrics: confusion matrix, accuracy, precision, recall, F1, ROC AUC.
- note: labels derived from existing opportunity_score heuristics; used as surrogate classification task.
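The surrogate label construction can be sketched as follows (scores are invented):

```python
import pandas as pd

# Hypothetical opportunity scores at operator-year grain.
df = pd.DataFrame({
    "operator_baid": ["A", "B", "C", "D"],
    "opportunity_score": [10.0, 40.0, 70.0, 95.0],
})

# Binary surrogate label: 1 if opportunity_score >= 75th percentile.
threshold = df["opportunity_score"].quantile(0.75)
df["high_opportunity"] = (df["opportunity_score"] >= threshold).astype(int)
print(df)
```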
- task C: realized emissions reduction prediction (regression)
- grain: one row per (operator_baid, year_t).
- label: realized_reduction_kt = E_total_t - E_total_{t+1} (year-over-year emissions change).
- features: reduction potentials (venting, flaring, fuel), financial metrics (NPV, payback, MAC), operational metrics (intensity, production, facility count), risk scores, cluster archetypes.
- models: RandomForestRegressor (primary), TheilSenRegressor (baseline).
- preprocessing: ColumnTransformer with SimpleImputer + StandardScaler for numeric features, OneHotEncoder for categorical features (cluster, performance category).
- evaluation: time-aware single holdout (train on years < max, test on max year). With only 2 years of data, this is exploratory, not robust backtesting.
- metrics: R², MAE, RMSE, bias, correlation. Small-sample warnings issued when n < 10.
- purpose: validates which operators with high opportunity_score actually reduce emissions (identifies “doers” vs “talkers”). Used to refine operator targeting and stress-test existing heuristics.
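Constructing the realized_reduction label from an operator-year table can be sketched as a self-merge on a shifted year (values invented):

```python
import pandas as pd

# Operator-year emissions; label is the year-over-year drop:
# realized_reduction_kt = E_total_t - E_total_{t+1}.
df = pd.DataFrame({
    "operator_baid": ["A", "A", "B", "B"],
    "year": [2022, 2023, 2022, 2023],
    "E_total_kt": [100.0, 90.0, 50.0, 60.0],
})

cur = df.rename(columns={"E_total_kt": "E_total_t"})
# Shift year down by one so each row lines up with its successor year.
nxt = df.assign(year=df["year"] - 1).rename(columns={"E_total_kt": "E_total_t1"})
panel = cur.merge(nxt, on=["operator_baid", "year"])
panel["realized_reduction_kt"] = panel["E_total_t"] - panel["E_total_t1"]
print(panel[["operator_baid", "year", "realized_reduction_kt"]])
```

A positive label means the operator reduced emissions; operators without both years drop out of the panel, as in the 366-of-432 coverage noted below.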
ML validation status
We treat the reduction ML work as descriptive validation, not production forecasting, given only 2 years of panel data (2022–2023).
- panel coverage
- 798 operator-years, 432 unique operators, 2022–2023
- 366 operators with both 2022 and 2023 (used for realized reductions)
- realized reductions (2022→2023)
- 80.3% of operators increased emissions (mean realized reduction −37.8 Mt, median −1.5 Mt)
- 19.7% reduced emissions
- 45% improved emissions intensity year-over-year; production growth swamped efficiency gains
- correlation analysis
- opportunity_score vs realized_reduction_t has strong negative correlation (r ≈ −0.73)
- high-opportunity operators (large oil sands players) increased emissions more during a growth period because they grew production 30–140%
- exports and panel
- multi-year panel: data/gold/operator_panel_2022_2023.parquet (798 rows, 2022–2023)
- validation CSVs (all descriptive, no ML predictions):
- data/gold/reduction_panel_2022_2023.csv (366 rows)
- data/gold/reduction_segments_prime_alltalk_hidden.csv (124 rows)
- data/gold/reduction_summary_stats.csv (4 rows)
- interpretation
- physics-based and economic scoring (opportunity_score, composite_score, etc.) remains the primary targeting framework
- reduction ML is architecturally implemented but treated as optional and exploratory; the DAG degrades gracefully when ML artifacts are absent
- feature engineering
- features extracted from operator_decision_metrics and operator_petrinex_emissions.
- missing columns handled gracefully (default to 0).
- features merged on operator_baid and year.
- model integration
- ML outputs are additive; they do not replace existing scores (opportunity_score, composite_score, etc.).
- predictions can be attached to operator_viz_view or kept in separate ML views.
- model training is explicit and configurable, not always-on in production paths.
- implementation location
- label construction: layers/gold/ml_targets.py (Tasks A & B), layers/gold/ml_reduction.py (Task C).
- model logic: layers/gold/ml_models.py (Tasks A & B), layers/gold/ml_reduction.py (Task C).
- Hamilton nodes: nodes/gold_ml.py (Tasks A & B), nodes/gold_ml_reduction.py (Task C).