pdstools.utils.cdh_utils._metrics¶
Performance metrics: AUC, lift, log-odds, gains, feature importance.
Functions¶
- safe_range_auc – Internal helper to keep AUC a safe number between 0.5 and 1.0.
- auc_from_probs – Calculates AUC from an array of truth values and predictions.
- auc_from_bincounts – Calculates AUC from counts of positives and negatives directly.
- aucpr_from_probs – Calculates PR AUC (precision-recall) from an array of truth values and predictions.
- aucpr_from_bincounts – Calculates PR AUC (precision-recall) from counts of positives and negatives directly.
- auc_to_gini – Converts the AUC performance metric to GINI.
- z_ratio – Calculates the Z-ratio for predictor bins.
- lift – Calculates the lift for predictor bins.
- log_odds_polars – Calculates log odds per bin with correct Laplace smoothing.
- feature_importance – Calculates feature importance for Naive Bayes predictors.
- gains_table – Calculates cumulative gains from any data frame.
Module Contents¶
- safe_range_auc(auc: float) → float¶
Internal helper that keeps an AUC value in the safe range between 0.5 and 1.0.
- auc_from_probs(groundtruth: list[int], probs: list[float]) → float¶
Calculates the area under the ROC curve from an array of truth values and predictions. Always returns a value between 0.5 and 1.0, and returns 0.5 when there is just one ground-truth label.
- Parameters:
- Return type:
float
Examples
>>> auc_from_probs([1,1,0], [0.6,0.2,0.2])
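For intuition: the ROC AUC of raw predictions equals the probability that a randomly chosen positive is ranked above a randomly chosen negative, with ties counted as half. A minimal pure-Python sketch of that idea; the helper name and edge-case handling are illustrative, not the pdstools implementation:

```python
def auc_from_probs_sketch(groundtruth, probs):
    """Mann-Whitney style AUC: chance a random positive outranks a random negative."""
    pos = [p for y, p in zip(groundtruth, probs) if y == 1]
    neg = [p for y, p in zip(groundtruth, probs) if y == 0]
    if not pos or not neg:
        return 0.5  # only one class present
    # Count wins; ties contribute 0.5 each
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    auc = wins / (len(pos) * len(neg))
    # Mirror values below 0.5, similar in spirit to safe_range_auc
    return max(auc, 1.0 - auc)
```

For the doctest inputs above, this sketch yields 0.75: the 0.6 positive beats the 0.2 negative, and the 0.2 positive ties it.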
- auc_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) → float¶
Calculates AUC from counts of positives and negatives directly. This is an efficient calculation of the area under the ROC curve directly from arrays of positives and negatives. It always returns a value between 0.5 and 1.0, and returns 0.5 when there is just one ground-truth label.
- Parameters:
- Returns:
The AUC as a value between 0.5 and 1.
- Return type:
float
Examples
>>> auc_from_bincounts([3,1,0], [2,0,1])
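The binned variant can be sketched by walking the bins from highest to lowest propensity and integrating the resulting ROC curve with the trapezoid rule. This is an illustrative reimplementation under those assumptions, not the actual pdstools code:

```python
def auc_from_bincounts_sketch(pos, neg):
    """ROC AUC computed directly from per-bin positive/negative counts."""
    total_pos, total_neg = sum(pos), sum(neg)
    if total_pos == 0 or total_neg == 0:
        return 0.5  # only one class present
    # Visit bins from highest to lowest propensity
    order = sorted(
        range(len(pos)),
        key=lambda i: pos[i] / (pos[i] + neg[i]) if pos[i] + neg[i] else 0.0,
        reverse=True,
    )
    auc, tpr, fpr = 0.0, 0.0, 0.0
    for i in order:
        new_tpr = tpr + pos[i] / total_pos
        new_fpr = fpr + neg[i] / total_neg
        auc += (new_fpr - fpr) * (tpr + new_tpr) / 2  # trapezoid slice
        tpr, fpr = new_tpr, new_fpr
    return max(auc, 1.0 - auc)
```

For the doctest inputs above, this sketch yields 0.75.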
- aucpr_from_probs(groundtruth: list[int], probs: list[float]) → float¶
Calculates the area under the precision-recall (PR) curve from an array of truth values and predictions. Returns 0.0 when there is just one ground-truth label.
- Parameters:
- Return type:
float
Examples
>>> aucpr_from_probs([1,1,0], [0.6,0.2,0.2])
- aucpr_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) → float¶
Calculates PR AUC (precision-recall) from counts of positives and negatives directly. This is an efficient calculation of the area under the PR curve from arrays of positives and negatives. Returns 0.0 when there is just one ground-truth label.
- Parameters:
- Returns:
The PR AUC as a value between 0.0 and 1.
- Return type:
float
Examples
>>> aucpr_from_bincounts([3,1,0], [2,0,1])
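A step-wise sketch of PR AUC from bin counts, accumulating precision and recall bin by bin. This uses a common rectangle (step) approximation; the exact interpolation pdstools uses may differ:

```python
def aucpr_from_bincounts_sketch(pos, neg):
    """PR AUC from per-bin counts using a step-wise approximation."""
    total_pos = sum(pos)
    if total_pos == 0 or sum(neg) == 0:
        return 0.0  # only one class present
    # Visit bins from highest to lowest propensity
    order = sorted(
        range(len(pos)),
        key=lambda i: pos[i] / (pos[i] + neg[i]) if pos[i] + neg[i] else 0.0,
        reverse=True,
    )
    area, cum_pos, cum_neg, prev_recall = 0.0, 0, 0, 0.0
    for i in order:
        cum_pos += pos[i]
        cum_neg += neg[i]
        precision = cum_pos / (cum_pos + cum_neg)
        recall = cum_pos / total_pos
        area += (recall - prev_recall) * precision  # step-wise rectangle
        prev_recall = recall
    return area
```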
- auc_to_gini(auc: float) → float¶
Converts the AUC performance metric to GINI.
- Parameters:
auc (float) – The AUC (number between 0.5 and 1)
- Returns:
GINI metric, a number between 0 and 1
- Return type:
float
Examples
>>> auc_to_gini(0.8232)
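The conversion itself is a simple linear rescaling, which can be sketched as:

```python
def auc_to_gini_sketch(auc: float) -> float:
    """Gini = 2 * AUC - 1: AUC 0.5 (random) maps to 0, AUC 1.0 (perfect) to 1."""
    return 2 * auc - 1
```

So an AUC of 0.8232 corresponds to a GINI of 0.6464.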
- z_ratio(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) → polars.Expr¶
Calculates the Z-ratio for predictor bins.
The Z-ratio is a measure of how the propensity in a bin differs from the average, but takes the size of the bin into account, which makes it statistically more relevant. It represents the number of standard deviations from the average, so it centers around 0. The wider the spread, the better the predictor.
To recreate the out-of-the-box Z-ratios from the datamart, use this in a group_by. See the examples.
- Parameters:
- Return type:
polars.Expr
Examples
>>> df.group_by(['ModelID', 'PredictorName']).agg([z_ratio()]).explode()
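In pure Python, a per-bin Z-ratio can be sketched as the difference between a bin's share of all positives and its share of all negatives, divided by the standard error of that difference. The formula below is assumed from the datamart definition and may differ in detail from the actual implementation:

```python
import math

def z_ratio_sketch(pos, neg):
    """Per-bin Z-ratio sketch: (share of positives - share of negatives) / stderr."""
    total_pos, total_neg = sum(pos), sum(neg)
    ratios = []
    for p, n in zip(pos, neg):
        frac_pos = p / total_pos
        frac_neg = n / total_neg
        # Standard error of the difference between the two fractions
        stderr = math.sqrt(
            frac_pos * (1 - frac_pos) / total_pos
            + frac_neg * (1 - frac_neg) / total_neg
        )
        ratios.append((frac_pos - frac_neg) / stderr if stderr else 0.0)
    return ratios
```

A bin that attracts relatively many positives gets a positive Z-ratio, one that attracts relatively many negatives a negative one.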
- lift(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) → polars.Expr¶
Calculates the Lift for predictor bins.
The Lift is the ratio of the propensity in a particular bin over the average propensity. So a value of 1 is the average, larger than 1 means higher propensity, smaller means lower propensity.
- Parameters:
- Return type:
polars.Expr
Examples
>>> df.group_by(['ModelID', 'PredictorName']).agg([lift()]).explode()
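The lift calculation can be sketched in plain Python as the bin propensity divided by the overall propensity (illustrative only, not the Polars expression pdstools builds):

```python
def lift_sketch(pos, neg):
    """Per-bin lift: bin propensity relative to the overall propensity."""
    overall = sum(pos) / (sum(pos) + sum(neg))
    return [
        (p / (p + n)) / overall if p + n else 0.0
        for p, n in zip(pos, neg)
    ]
```

For example, with bins (10 pos, 90 neg) and (30 pos, 70 neg) the overall propensity is 0.2, so the lifts are 0.5 and 1.5.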
- log_odds_polars(positives: polars.Expr | str = pl.col('Positives'), negatives: polars.Expr | str = pl.col('ResponseCount') - pl.col('Positives')) → polars.Expr¶
Calculate log odds per bin with correct Laplace smoothing.
- Formula (per bin i in predictor p):
log(pos_i + 1/nBins) - log(sum(pos) + 1) - [log(neg_i + 1/nBins) - log(sum(neg) + 1)]
Laplace smoothing uses 1/nBins where nBins is the number of bins for that specific predictor. This matches the platform implementation in GroupedPredictor.java.
- Must be used with .over() to calculate nBins per predictor group:
.with_columns(log_odds_polars().over("PredictorName", "ModelID"))
- Parameters:
- Returns:
Log odds expression (use with .over() for correct grouping)
- Return type:
pl.Expr
See also
feature_importance – Calculate predictor importance from log odds
bin_log_odds – Pure Python version (reference implementation)
References
ADM Explained: Log Odds calculation section
Issue #263: https://github.com/pegasystems/pega-datascientist-tools/issues/263
Platform: GroupedPredictor.java lines 603-606
Examples
>>> # For propensity calculation in classifier
>>> df.with_columns(
...     log_odds_polars(
...         pl.col("BinPositives"),
...         pl.col("BinNegatives")
...     ).over("PredictorName", "ModelID")
... )
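The same formula can be sketched outside Polars in plain Python, which makes the smoothing explicit. This follows the formula stated above; the helper name is illustrative, not the actual bin_log_odds implementation:

```python
import math

def bin_log_odds_sketch(pos, neg):
    """Per-bin log odds with 1/nBins Laplace smoothing, per the formula above."""
    n_bins = len(pos)
    total_pos, total_neg = sum(pos), sum(neg)
    return [
        # log(pos_i + 1/nBins) - log(sum(pos) + 1)
        #   - [log(neg_i + 1/nBins) - log(sum(neg) + 1)]
        (math.log(p + 1 / n_bins) - math.log(total_pos + 1))
        - (math.log(n + 1 / n_bins) - math.log(total_neg + 1))
        for p, n in zip(pos, neg)
    ]
```

A perfectly balanced predictor (equal positive and negative shares in every bin) gets log odds of 0 everywhere; bins skewed toward positives get positive log odds.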
- feature_importance(over: list[str] | None = None, scaled: bool = True) → polars.Expr¶
Calculate feature importance for Naive Bayes predictors.
Feature importance represents the weighted average of absolute log odds values across all bins, weighted by bin response counts. This measures how strongly the predictor differentiates between positive and negative outcomes.
Algorithm (matches platform GroupedPredictor.calculatePredictorImportance()):
1. Calculate log odds per bin with Laplace smoothing (1/nBins)
2. Take the absolute value of each bin's log odds
3. Calculate the weighted average: Sum(|logOdds(bin)| × binResponses) / totalResponses
4. Optional: scale to the 0-100 range (scaled=True, the default)
This matches the Pega platform implementation in: adaptive-learning-core-lib/…/GroupedPredictor.java lines 371-382
- Formula:
Feature Importance = Σ |logOdds(bin)| × (binResponses / totalResponses)
- Parameters:
- Returns:
Feature importance expression
- Return type:
pl.Expr
Examples
>>> df.with_columns(
...     feature_importance().over("PredictorName", "ModelID")
... )
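The weighted-average calculation (steps 1-3 above) can be sketched in plain Python. The 0-100 scaling of step 4 is deliberately omitted here since its exact basis is not reproduced in this sketch:

```python
import math

def feature_importance_sketch(pos, neg):
    """Unscaled importance: response-weighted average of |log odds| across bins."""
    n_bins = len(pos)
    total_pos, total_neg = sum(pos), sum(neg)
    # Step 1: per-bin log odds with 1/nBins Laplace smoothing
    log_odds = [
        (math.log(p + 1 / n_bins) - math.log(total_pos + 1))
        - (math.log(n + 1 / n_bins) - math.log(total_neg + 1))
        for p, n in zip(pos, neg)
    ]
    total = total_pos + total_neg
    # Steps 2-3: weighted average of absolute log odds, weighted by bin responses
    return sum(
        abs(lo) * (p + n) / total
        for lo, p, n in zip(log_odds, pos, neg)
    )
```

A predictor whose bins all have the same propensity scores 0; the more the bins separate positives from negatives, the higher the importance.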
Notes
This implementation matches the platform calculation exactly. Issue #263 incorrectly suggested "diff from mean" based on the R implementation, but the platform actually uses the weighted average of absolute log odds.
See also
log_odds_polars – Calculate per-bin log odds
References
Issue #263: Calculation of Feature Importance incorrect
Issue #404: Add feature importance explanation to ADM Explained
Platform: GroupedPredictor.java calculatePredictorImportance()
ADM Explained: Feature Importance section
- gains_table(df, value: str, index=None, by=None)¶
Calculates cumulative gains from any data frame.
The cumulative gains are the cumulative values, expressed as a percentage of the total, plotted against the size of the population, also expressed as a percentage.
- Parameters:
df (pl.DataFrame) – The (Polars) dataframe with the raw values
value (str) – The name of the field with the values (plotted on y-axis)
index (str, optional) – Optional name of the field for the x-axis. If not passed in, all records are used and weighted equally.
by (str or list, optional) – Optional grouping field(s); can be None.
- Returns:
A (Polars) dataframe with cum_x and cum_y columns and optionally the grouping column(s). Values for cum_x and cum_y are relative, expressed as values between 0 and 1.
- Return type:
pl.DataFrame
Examples
>>> gains_data = gains_table(df, 'ResponseCount', by=['Channel','Direction'])
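For a single ungrouped series of values, the cumulative-gains calculation can be sketched in plain Python (illustrative only; gains_table itself operates on Polars dataframes with optional grouping):

```python
def gains_sketch(values):
    """Cumulative gains: sort descending, express position and cumulative
    sum as fractions between 0 and 1 (cum_x, cum_y pairs)."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    n = len(ordered)
    cum = 0.0
    table = []
    for i, v in enumerate(ordered, start=1):
        cum += v
        table.append((i / n, cum / total))  # (cum_x, cum_y)
    return table
```

For values [3, 1], the top half of the population already captures 75% of the total, so the table reads [(0.5, 0.75), (1.0, 1.0)].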