pdstools.utils.cdh_utils._metrics

Performance metrics: AUC, lift, log-odds, gains, feature importance.

Functions

safe_range_auc(→ float)

Internal helper that keeps AUC within the safe range of 0.5 to 1.0.

auc_from_probs(→ float)

Calculates AUC from an array of truth values and predictions.

auc_from_bincounts(→ float)

Calculates AUC from counts of positives and negatives directly.

aucpr_from_probs(→ float)

Calculates PR AUC (precision-recall) from an array of truth values and predictions.

aucpr_from_bincounts(→ float)

Calculates PR AUC (precision-recall) from counts of positives and negatives directly.

auc_to_gini(→ float)

Convert AUC performance metric to GINI

z_ratio(pos_col, neg_col)

Calculates the Z-Ratio for predictor bins.

lift(pos_col, neg_col)

Calculates the Lift for predictor bins.

bin_log_odds(→ list[float])

Pure Python reference implementation of per-bin log odds.

log_odds_polars(positives, negatives)

Calculate log odds per bin with correct Laplace smoothing.

feature_importance(→ polars.Expr)

Calculate feature importance for Naive Bayes predictors.

gains_table(df, value[, index, by])

Calculates cumulative gains from any data frame.

Module Contents

safe_range_auc(auc: float) float

Internal helper that keeps AUC within the safe range of 0.5 to 1.0.

Parameters:

auc (float) – The AUC (Area Under the Curve) score

Returns:

‘Safe’ AUC score, between 0.5 and 1.0

Return type:

float
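The clamping can be sketched in pure Python. The reflection convention below (values under 0.5 are mirrored, since an AUC below 0.5 is the AUC of the reversed ordering) is an assumption for illustration, not necessarily the library's exact implementation:

```python
def safe_range_auc(auc: float) -> float:
    """Clamp AUC to [0.5, 1.0] by reflecting values below 0.5.

    An AUC below 0.5 means the ranking is inverted, so 1 - auc is the
    AUC of the reversed ordering. This is a common convention; the
    library implementation may differ in detail.
    """
    return 0.5 + abs(0.5 - auc)
```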

auc_from_probs(groundtruth: list[int], probs: list[float]) float

Calculates AUC from an array of truth values and predictions. Calculates the area under the ROC curve, always returning a value between 0.5 and 1.0, and returning 0.5 when there is just one groundtruth label.

Parameters:
  • groundtruth (list[int]) – The ‘true’ values. Positive values must be represented as True or 1; negative values as False or 0.

  • probs (list[float]) – The predictions, as a numeric vector of the same length as groundtruth

Returns:

The AUC as a value between 0.5 and 1.

Return type:

float

Examples

>>> auc_from_probs([1,1,0], [0.6,0.2,0.2])
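The rank-based definition of ROC AUC can be sketched in pure Python. This is an illustrative O(n²) pairwise version, not the library's implementation:

```python
from itertools import product

def auc_from_probs_sketch(groundtruth: list[int], probs: list[float]) -> float:
    """ROC AUC as the probability that a random positive is ranked
    above a random negative; ties count as 0.5."""
    pos = [p for y, p in zip(groundtruth, probs) if y]
    neg = [p for y, p in zip(groundtruth, probs) if not y]
    if not pos or not neg:
        return 0.5  # only one groundtruth label present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    auc = wins / (len(pos) * len(neg))
    return 0.5 + abs(0.5 - auc)  # keep in the safe 0.5-1.0 range

auc_from_probs_sketch([1, 1, 0], [0.6, 0.2, 0.2])  # → 0.75
```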
auc_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) float

Calculates AUC from counts of positives and negatives directly. This is an efficient calculation of the area under the ROC curve from arrays of positives and negatives. It makes sure to always return a value between 0.5 and 1.0 and will return 0.5 when there is just one groundtruth label.

Parameters:
  • pos (list[int]) – Vector with counts of the positive responses

  • neg (list[int]) – Vector with counts of the negative responses

  • probs (list[float]) – Optional list with probabilities which will be used to set the order of the bins. If missing defaults to pos/(pos+neg).

Returns:

The AUC as a value between 0.5 and 1.

Return type:

float

Examples

>>> auc_from_bincounts([3,1,0], [2,0,1])
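The binned calculation can be sketched by ordering bins by propensity and integrating the ROC curve with the trapezoid rule. This is an illustrative sketch, not the library's implementation:

```python
def auc_from_bincounts_sketch(pos, neg, probs=None):
    """ROC AUC from binned counts: order bins by propensity
    (descending), then integrate the ROC curve with trapezoids."""
    if probs is None:
        probs = [p / (p + n) if p + n > 0 else 0.0
                 for p, n in zip(pos, neg)]
    order = sorted(range(len(pos)), key=lambda i: probs[i], reverse=True)
    total_pos, total_neg = sum(pos), sum(neg)
    auc, tpr_prev, fpr_prev, cum_pos, cum_neg = 0.0, 0.0, 0.0, 0, 0
    for i in order:
        cum_pos += pos[i]
        cum_neg += neg[i]
        tpr, fpr = cum_pos / total_pos, cum_neg / total_neg
        auc += (fpr - fpr_prev) * (tpr + tpr_prev) / 2  # trapezoid segment
        tpr_prev, fpr_prev = tpr, fpr
    return 0.5 + abs(0.5 - auc)  # keep within the safe 0.5-1.0 range

auc_from_bincounts_sketch([3, 1, 0], [2, 0, 1])  # ≈ 0.75
```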
aucpr_from_probs(groundtruth: list[int], probs: list[float]) float

Calculates PR AUC (precision-recall) from an array of truth values and predictions. Calculates the area under the PR curve from an array of truth values and predictions. Returns 0.0 when there is just one groundtruth label.

Parameters:
  • groundtruth (list[int]) – The ‘true’ values. Positive values must be represented as True or 1; negative values as False or 0.

  • probs (list[float]) – The predictions, as a numeric vector of the same length as groundtruth

Returns:

The PR AUC as a value between 0.0 and 1.

Return type:

float

Examples

>>> aucpr_from_probs([1,1,0], [0.6,0.2,0.2])
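The PR AUC can be sketched as average precision: step-wise integration of precision over recall. This sketch assumes distinct probabilities for simplicity and is not the library's implementation:

```python
def aucpr_from_probs_sketch(groundtruth, probs):
    """PR AUC via average precision: sum precision at each rank where
    recall increases. Assumes distinct probabilities for simplicity."""
    ranked = sorted(zip(probs, groundtruth), reverse=True)
    total_pos = sum(groundtruth)
    if total_pos == 0 or total_pos == len(groundtruth):
        return 0.0  # only one groundtruth label present
    ap, tp, recall_prev = 0.0, 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y
        recall = tp / total_pos
        ap += (recall - recall_prev) * (tp / k)  # precision at rank k
        recall_prev = recall
    return ap
```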
aucpr_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) float

Calculates PR AUC (precision-recall) from counts of positives and negatives directly. This is an efficient calculation of the area under the PR curve directly from an array of positives and negatives. Returns 0.0 when there is just one groundtruth label.

Parameters:
  • pos (list[int]) – Vector with counts of the positive responses

  • neg (list[int]) – Vector with counts of the negative responses

  • probs (list[float]) – Optional list with probabilities which will be used to set the order of the bins. If missing defaults to pos/(pos+neg).

Returns:

The PR AUC as a value between 0.0 and 1.

Return type:

float

Examples

>>> aucpr_from_bincounts([3,1,0], [2,0,1])
auc_to_gini(auc: float) float

Convert AUC performance metric to GINI

Parameters:

auc (float) – The AUC (number between 0.5 and 1)

Returns:

GINI metric, a number between 0 and 1

Return type:

float

Examples

>>> auc_to_gini(0.8232)
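The conversion is the conventional linear rescaling of AUC onto the 0-1 GINI range, consistent with the ranges stated above:

```python
def auc_to_gini_sketch(auc: float) -> float:
    """GINI = 2*AUC - 1: maps the 0.5-1.0 AUC range onto 0-1."""
    return 2 * auc - 1

auc_to_gini_sketch(0.8232)  # ≈ 0.6464
```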
z_ratio(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) polars.Expr

Calculates the Z-Ratio for predictor bins.

The Z-ratio is a measure of how the propensity in a bin differs from the average, taking the size of the bin into account, which makes it statistically more relevant. It represents the number of standard deviations from the average, so it centers around 0. The wider the spread, the better the predictor.

To recreate the OOTB ZRatios from the datamart, use in a group_by. See examples.

Parameters:
  • pos_col (str | polars.Expr) – The (Polars) column of the bin positives

  • neg_col (str | polars.Expr) – The (Polars) column of the bin negatives

Return type:

polars.Expr

Examples

>>> df.group_by(['ModelID', 'PredictorName']).agg([z_ratio()]).explode()
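A common formulation of the Z-ratio, sketched in pure Python: the difference between a bin's share of positives and its share of negatives, divided by the standard error of that difference. This is an assumption for illustration; the library's exact expression may differ in detail:

```python
import math

def z_ratio_sketch(pos, neg):
    """Per-bin Z-ratio: (bin's share of positives - bin's share of
    negatives) / standard error of that difference. Centers around 0."""
    total_pos, total_neg = sum(pos), sum(neg)
    out = []
    for p, n in zip(pos, neg):
        pos_fr, neg_fr = p / total_pos, n / total_neg
        var = (pos_fr * (1 - pos_fr) / total_pos
               + neg_fr * (1 - neg_fr) / total_neg)
        out.append((pos_fr - neg_fr) / math.sqrt(var) if var > 0 else 0.0)
    return out
```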
lift(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) polars.Expr

Calculates the Lift for predictor bins.

The Lift is the ratio of the propensity in a particular bin over the average propensity. So a value of 1 is the average, larger than 1 means higher propensity, smaller means lower propensity.

Parameters:
  • pos_col (str | polars.Expr) – The (Polars) column of the bin positives

  • neg_col (str | polars.Expr) – The (Polars) column of the bin negatives

Return type:

polars.Expr

Examples

>>> df.group_by(['ModelID', 'PredictorName']).agg([lift()]).explode()
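The lift definition above can be sketched in pure Python (an illustrative sketch, not the library's Polars implementation):

```python
def lift_sketch(pos, neg):
    """Lift per bin: bin propensity divided by the overall propensity,
    so 1.0 is average."""
    overall = sum(pos) / (sum(pos) + sum(neg))
    return [(p / (p + n)) / overall if p + n > 0 else 0.0
            for p, n in zip(pos, neg)]

lift_sketch([3, 1, 0], [2, 0, 1])  # ≈ [1.05, 1.75, 0.0]
```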
bin_log_odds(bin_pos: list[float], bin_neg: list[float]) list[float]

Pure Python reference implementation of per-bin log odds with Laplace smoothing.

Parameters:
  • bin_pos (list[float]) – Counts of positive responses per bin

  • bin_neg (list[float]) – Counts of negative responses per bin

Return type:

list[float]

log_odds_polars(positives: polars.Expr | str = pl.col('Positives'), negatives: polars.Expr | str = pl.col('ResponseCount') - pl.col('Positives')) polars.Expr

Calculate log odds per bin with correct Laplace smoothing.

Formula (per bin i in predictor p):

log(pos_i + 1/nBins) - log(sum(pos) + 1) - [log(neg_i + 1/nBins) - log(sum(neg) + 1)]

Laplace smoothing uses 1/nBins where nBins is the number of bins for that specific predictor. This matches the platform implementation in GroupedPredictor.java.

Must be used with .over() to calculate nBins per predictor group:

.with_columns(log_odds_polars().over("PredictorName", "ModelID"))

Parameters:
  • positives (pl.Expr or str) – Column with positive response counts per bin

  • negatives (pl.Expr or str) – Column with negative response counts per bin

Returns:

Log odds expression (use with .over() for correct grouping)

Return type:

pl.Expr

See also

feature_importance

Calculate predictor importance from log odds

bin_log_odds

Pure Python version (reference implementation)

Examples

>>> # For propensity calculation in classifier
>>> df.with_columns(
...     log_odds_polars(
...         pl.col("BinPositives"),
...         pl.col("BinNegatives")
...     ).over("PredictorName", "ModelID")
... )
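The smoothed log-odds formula documented above can be written out directly in pure Python (a reference sketch for a single predictor's bins, not the Polars expression itself):

```python
import math

def smoothed_log_odds(pos, neg):
    """Per-bin log odds with 1/nBins Laplace smoothing, following the
    documented formula:
    log(pos_i + 1/nBins) - log(sum(pos) + 1)
      - [log(neg_i + 1/nBins) - log(sum(neg) + 1)]"""
    n_bins = len(pos)
    sum_pos, sum_neg = sum(pos), sum(neg)
    return [
        math.log(p + 1 / n_bins) - math.log(sum_pos + 1)
        - (math.log(q + 1 / n_bins) - math.log(sum_neg + 1))
        for p, q in zip(pos, neg)
    ]
```

When positives and negatives are distributed identically across bins, every bin's log odds is zero, which is a useful sanity check.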
feature_importance(over: list[str] | None = None, scaled: bool = True) polars.Expr

Calculate feature importance for Naive Bayes predictors.

Feature importance represents the weighted average of absolute log odds values across all bins, weighted by bin response counts. This measures how strongly the predictor differentiates between positive and negative outcomes.

Algorithm (matches platform GroupedPredictor.calculatePredictorImportance()):

  1. Calculate log odds per bin with Laplace smoothing (1/nBins)

  2. Take the absolute value of each bin’s log odds

  3. Calculate the weighted average: Sum(|logOdds(bin)| × binResponses) / totalResponses

  4. Optionally, scale to the 0-100 range (scaled=True, default)

This matches the Pega platform implementation in: adaptive-learning-core-lib/…/GroupedPredictor.java lines 371-382

Formula:

Feature Importance = Σ |logOdds(bin)| × (binResponses / totalResponses)

Parameters:
  • over (list[str], optional) – Grouping columns. Defaults to ["PredictorName", "ModelID"].

  • scaled (bool, default True) – If True, scale importance to 0-100 where max predictor = 100

Returns:

Feature importance expression

Return type:

pl.Expr

Examples

>>> df.with_columns(
...     feature_importance().over("PredictorName", "ModelID")
... )
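The unscaled importance for a single predictor can be sketched in pure Python, combining the documented log-odds and weighting formulas (an illustrative sketch, not the Polars expression):

```python
import math

def feature_importance_sketch(pos, neg):
    """Weighted average of absolute per-bin log odds (1/nBins Laplace
    smoothing), weighted by bin responses, per the formula above.
    Unscaled version for one predictor."""
    n_bins, sum_pos, sum_neg = len(pos), sum(pos), sum(neg)
    total = sum_pos + sum_neg
    importance = 0.0
    for p, q in zip(pos, neg):
        lo = (math.log(p + 1 / n_bins) - math.log(sum_pos + 1)
              - (math.log(q + 1 / n_bins) - math.log(sum_neg + 1)))
        importance += abs(lo) * (p + q) / total  # weight by bin responses
    return importance
```

A predictor whose bins do not differentiate positives from negatives gets importance 0.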

Notes

This implementation matches the platform calculation exactly. Issue #263 incorrectly suggested “diff from mean” based on R implementation, but the platform actually uses weighted average of absolute log odds.

See also

log_odds_polars

Calculate per-bin log odds

References

  • Issue #263: Calculation of Feature Importance incorrect

  • Issue #404: Add feature importance explanation to ADM Explained

  • Platform: GroupedPredictor.java calculatePredictorImportance()

  • ADM Explained: Feature Importance section

gains_table(df, value: str, index=None, by=None)

Calculates cumulative gains from any data frame.

The cumulative gains are the cumulative values expressed as a percentage vs the size of the population, also expressed as a percentage.

Parameters:
  • df (pl.DataFrame) – The (Polars) dataframe with the raw values

  • value (str) – The name of the field with the values (plotted on y-axis)

  • index (str, optional) – Optional name of the field for the x-axis. If not passed in, all records are used and weighted equally.

  • by (optional) – Grouping field(s), can also be None

Returns:

A (Polars) dataframe with cum_x and cum_y columns and optionally the grouping column(s). Values for cum_x and cum_y are relative so expressed as values 0-1.

Return type:

pl.DataFrame

Examples

>>> gains_data = gains_table(df, 'ResponseCount', by=['Channel','Direction'])
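The underlying cumulative-gains calculation can be sketched in pure Python for the simplest case (no index or grouping fields). Sorting values descending is an assumption about the ordering; the library's Polars implementation is not shown here:

```python
def gains_curve_sketch(values):
    """Cumulative gains: sort values descending (assumed ordering),
    then express cumulative value (cum_y) and cumulative population
    (cum_x) as 0-1 fractions."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    n = len(ordered)
    cum, cum_x, cum_y = 0.0, [], []
    for i, v in enumerate(ordered, start=1):
        cum += v
        cum_x.append(i / n)
        cum_y.append(cum / total)
    return cum_x, cum_y

gains_curve_sketch([4, 3, 2, 1])
# → ([0.25, 0.5, 0.75, 1.0], [0.4, 0.7, 0.9, 1.0])
```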