pdstools.utils.cdh_utils._metrics¶
Performance metrics: AUC, lift, log-odds, gains, feature importance.
Functions¶
- safe_range_auc – Internal helper to keep AUC a safe number between 0.5 and 1.0.
- auc_from_probs – Calculates AUC from an array of truth values and predictions.
- auc_from_bincounts – Calculates AUC from counts of positives and negatives directly.
- aucpr_from_probs – Calculates PR AUC (precision-recall) from an array of truth values and predictions.
- aucpr_from_bincounts – Calculates PR AUC (precision-recall) from counts of positives and negatives directly.
- auc_to_gini – Converts the AUC performance metric to GINI.
- z_ratio – Calculates the Z-ratio for predictor bins.
- lift – Calculates the lift for predictor bins.
- log_odds_polars – Calculates log odds per bin with correct Laplace smoothing.
- feature_importance – Calculates feature importance for Naive Bayes predictors.
- gains_table – Calculates cumulative gains from any data frame.
Module Contents¶
- safe_range_auc(auc: float) → float¶
Internal helper that keeps an AUC value in the safe range between 0.5 and 1.0.
- auc_from_probs(groundtruth: list[int], probs: list[float]) → float¶
Calculates the area under the ROC curve from an array of truth values and predictions. Always returns a value between 0.5 and 1.0, and returns 0.5 when there is just one ground-truth label.
- Parameters:
- Return type:
float
Examples
>>> auc_from_probs([1,1,0], [0.6,0.2,0.2])
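For intuition: the ROC AUC of raw predictions equals the probability that a randomly chosen positive is ranked above a randomly chosen negative, with ties counted as half. A minimal pure-Python sketch of that idea; the helper name and edge-case handling are illustrative, not the pdstools implementation:

```python
def auc_from_probs_sketch(groundtruth, probs):
    """Mann-Whitney style AUC: chance a random positive outranks a random negative."""
    pos = [p for y, p in zip(groundtruth, probs) if y == 1]
    neg = [p for y, p in zip(groundtruth, probs) if y == 0]
    if not pos or not neg:
        return 0.5  # only one class present
    # Count wins; ties contribute 0.5 each
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    auc = wins / (len(pos) * len(neg))
    # Mirror values below 0.5, similar in spirit to safe_range_auc
    return max(auc, 1.0 - auc)
```

For the doctest inputs above, this sketch yields 0.75: the 0.6 positive beats the 0.2 negative, and the 0.2 positive ties it.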
- auc_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) → float¶
Calculates AUC from counts of positives and negatives directly. This is an efficient calculation of the area under the ROC curve directly from arrays of positives and negatives. It always returns a value between 0.5 and 1.0, and returns 0.5 when there is just one ground-truth label.
- Parameters:
- Returns:
The AUC as a value between 0.5 and 1.
- Return type:
float
Examples
>>> auc_from_bincounts([3,1,0], [2,0,1])
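The binned variant can be sketched by walking the bins from highest to lowest propensity and integrating the resulting ROC curve with the trapezoid rule. This is an illustrative reimplementation under those assumptions, not the actual pdstools code:

```python
def auc_from_bincounts_sketch(pos, neg):
    """ROC AUC computed directly from per-bin positive/negative counts."""
    total_pos, total_neg = sum(pos), sum(neg)
    if total_pos == 0 or total_neg == 0:
        return 0.5  # only one class present
    # Visit bins from highest to lowest propensity
    order = sorted(
        range(len(pos)),
        key=lambda i: pos[i] / (pos[i] + neg[i]) if pos[i] + neg[i] else 0.0,
        reverse=True,
    )
    auc, tpr, fpr = 0.0, 0.0, 0.0
    for i in order:
        new_tpr = tpr + pos[i] / total_pos
        new_fpr = fpr + neg[i] / total_neg
        auc += (new_fpr - fpr) * (tpr + new_tpr) / 2  # trapezoid slice
        tpr, fpr = new_tpr, new_fpr
    return max(auc, 1.0 - auc)
```

For the doctest inputs above, this sketch yields 0.75.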
- aucpr_from_probs(groundtruth: list[int], probs: list[float]) → float¶
Calculates the area under the precision-recall (PR) curve from an array of truth values and predictions. Returns 0.0 when there is just one ground-truth label.
- Parameters:
- Return type:
float
Examples
>>> aucpr_from_probs([1,1,0], [0.6,0.2,0.2])
- aucpr_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) → float¶
Calculates PR AUC (precision-recall) from counts of positives and negatives directly. This is an efficient calculation of the area under the PR curve from arrays of positives and negatives. Returns 0.0 when there is just one ground-truth label.
- Parameters:
- Returns:
The PR AUC as a value between 0.0 and 1.
- Return type:
float
Examples
>>> aucpr_from_bincounts([3,1,0], [2,0,1])
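A step-wise sketch of PR AUC from bin counts, accumulating precision and recall bin by bin. This uses a common rectangle (step) approximation; the exact interpolation pdstools uses may differ:

```python
def aucpr_from_bincounts_sketch(pos, neg):
    """PR AUC from per-bin counts using a step-wise approximation."""
    total_pos = sum(pos)
    if total_pos == 0 or sum(neg) == 0:
        return 0.0  # only one class present
    # Visit bins from highest to lowest propensity
    order = sorted(
        range(len(pos)),
        key=lambda i: pos[i] / (pos[i] + neg[i]) if pos[i] + neg[i] else 0.0,
        reverse=True,
    )
    area, cum_pos, cum_neg, prev_recall = 0.0, 0, 0, 0.0
    for i in order:
        cum_pos += pos[i]
        cum_neg += neg[i]
        precision = cum_pos / (cum_pos + cum_neg)
        recall = cum_pos / total_pos
        area += (recall - prev_recall) * precision  # step-wise rectangle
        prev_recall = recall
    return area
```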
- auc_to_gini(auc: float) → float¶
Converts the AUC performance metric to GINI.
- Parameters:
auc (float) – The AUC (number between 0.5 and 1)
- Returns:
GINI metric, a number between 0 and 1
- Return type:
float
Examples
>>> auc_to_gini(0.8232)
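The conversion itself is a simple linear rescaling, which can be sketched as:

```python
def auc_to_gini_sketch(auc: float) -> float:
    """Gini = 2 * AUC - 1: AUC 0.5 (random) maps to 0, AUC 1.0 (perfect) to 1."""
    return 2 * auc - 1
```

So an AUC of 0.8232 corresponds to a GINI of 0.6464.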
- z_ratio(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) → polars.Expr¶
Calculates the Z-ratio for predictor bins.
The Z-ratio is a measure of how the propensity in a bin differs from the average, but takes the size of the bin into account, which makes it statistically more relevant. It represents the number of standard deviations from the average, so it centers around 0. The wider the spread, the better the predictor.
To recreate the out-of-the-box Z-ratios from the datamart, use this in a group_by. See the examples.
- Parameters:
- Return type:
polars.Expr
Examples
>>> df.group_by(['ModelID', 'PredictorName']).agg([z_ratio()]).explode()
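In pure Python, a per-bin Z-ratio can be sketched as the difference between a bin's share of all positives and its share of all negatives, divided by the standard error of that difference. The formula below is assumed from the datamart definition and may differ in detail from the actual implementation:

```python
import math

def z_ratio_sketch(pos, neg):
    """Per-bin Z-ratio sketch: (share of positives - share of negatives) / stderr."""
    total_pos, total_neg = sum(pos), sum(neg)
    ratios = []
    for p, n in zip(pos, neg):
        frac_pos = p / total_pos
        frac_neg = n / total_neg
        # Standard error of the difference between the two fractions
        stderr = math.sqrt(
            frac_pos * (1 - frac_pos) / total_pos
            + frac_neg * (1 - frac_neg) / total_neg
        )
        ratios.append((frac_pos - frac_neg) / stderr if stderr else 0.0)
    return ratios
```

A bin that attracts relatively many positives gets a positive Z-ratio, one that attracts relatively many negatives a negative one.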
- lift(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) → polars.Expr¶
Calculates the Lift for predictor bins.
The Lift is the ratio of the propensity in a particular bin over the average propensity. So a value of 1 is the average, larger than 1 means higher propensity, smaller means lower propensity.
- Parameters:
- Return type:
polars.Expr
Examples
>>> df.group_by(['ModelID', 'PredictorName']).agg([lift()]).explode()
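The lift calculation can be sketched in plain Python as the bin propensity divided by the overall propensity (illustrative only, not the Polars expression pdstools builds):

```python
def lift_sketch(pos, neg):
    """Per-bin lift: bin propensity relative to the overall propensity."""
    overall = sum(pos) / (sum(pos) + sum(neg))
    return [
        (p / (p + n)) / overall if p + n else 0.0
        for p, n in zip(pos, neg)
    ]
```

For example, with bins (10 pos, 90 neg) and (30 pos, 70 neg) the overall propensity is 0.2, so the lifts are 0.5 and 1.5.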
- log_odds_polars(positives: polars.Expr | str = pl.col('Positives'), negatives: polars.Expr | str = pl.col('ResponseCount') - pl.col('Positives')) → polars.Expr¶
Calculate log odds per bin with correct Laplace smoothing.
- Formula (per bin i in predictor p):
log(pos_i + 1/nBins) - log(sum(pos) + 1) - [log(neg_i + 1/nBins) - log(sum(neg) + 1)]
Laplace smoothing uses 1/nBins where nBins is the number of bins for that specific predictor. This matches the platform implementation in GroupedPredictor.java.
- Must be used with .over() to calculate nBins per predictor group:
.with_columns(log_odds_polars().over("PredictorName", "ModelID"))
- Parameters:
- Returns:
Log odds expression (use with .over() for correct grouping)
- Return type:
pl.Expr
See also
feature_importance – Calculate predictor importance from log odds
bin_log_odds – Pure Python version (reference implementation)
References
ADM Explained: Log Odds calculation section
Issue #263: https://github.com/pegasystems/pega-datascientist-tools/issues/263
Platform: GroupedPredictor.java lines 603-606
Examples
>>> # For propensity calculation in classifier
>>> df.with_columns(
...     log_odds_polars(
...         pl.col("BinPositives"),
...         pl.col("BinNegatives")
...     ).over("PredictorName", "ModelID")
... )
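The same formula can be sketched outside Polars in plain Python, which makes the smoothing explicit. This follows the formula stated above; the helper name is illustrative, not the actual bin_log_odds implementation:

```python
import math

def bin_log_odds_sketch(pos, neg):
    """Per-bin log odds with 1/nBins Laplace smoothing, per the formula above."""
    n_bins = len(pos)
    total_pos, total_neg = sum(pos), sum(neg)
    return [
        # log(pos_i + 1/nBins) - log(sum(pos) + 1)
        #   - [log(neg_i + 1/nBins) - log(sum(neg) + 1)]
        (math.log(p + 1 / n_bins) - math.log(total_pos + 1))
        - (math.log(n + 1 / n_bins) - math.log(total_neg + 1))
        for p, n in zip(pos, neg)
    ]
```

A perfectly balanced predictor (equal positive and negative shares in every bin) gets log odds of 0 everywhere; bins skewed toward positives get positive log odds.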
- feature_importance(over: list[str] | None = None, scaled: bool = True) → polars.Expr¶
Calculate feature importance for Naive Bayes predictors.
Feature importance represents the weighted average of absolute log odds values across all bins, weighted by bin response counts. This measures how strongly the predictor differentiates between positive and negative outcomes.
Algorithm (matches platform GroupedPredictor.calculatePredictorImportance()):
1. Calculate log odds per bin with Laplace smoothing (1/nBins)
2. Take the absolute value of each bin's log odds
3. Calculate the weighted average: Sum(|logOdds(bin)| × binResponses) / totalResponses
4. Optional: scale to the 0-100 range (scaled=True, the default)
This matches the Pega platform implementation in: adaptive-learning-core-lib/…/GroupedPredictor.java lines 371-382
- Formula:
Feature Importance = Σ |logOdds(bin)| × (binResponses / totalResponses)
- Parameters:
- Returns:
Feature importance expression
- Return type:
pl.Expr
Examples
>>> df.with_columns(
...     feature_importance().over("PredictorName", "ModelID")
... )
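The weighted-average calculation (steps 1-3 above) can be sketched in plain Python. The 0-100 scaling of step 4 is deliberately omitted here since its exact basis is not reproduced in this sketch:

```python
import math

def feature_importance_sketch(pos, neg):
    """Unscaled importance: response-weighted average of |log odds| across bins."""
    n_bins = len(pos)
    total_pos, total_neg = sum(pos), sum(neg)
    # Step 1: per-bin log odds with 1/nBins Laplace smoothing
    log_odds = [
        (math.log(p + 1 / n_bins) - math.log(total_pos + 1))
        - (math.log(n + 1 / n_bins) - math.log(total_neg + 1))
        for p, n in zip(pos, neg)
    ]
    total = total_pos + total_neg
    # Steps 2-3: weighted average of absolute log odds, weighted by bin responses
    return sum(
        abs(lo) * (p + n) / total
        for lo, p, n in zip(log_odds, pos, neg)
    )
```

A predictor whose bins all have the same propensity scores 0; the more the bins separate positives from negatives, the higher the importance.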
Notes
This implementation matches the platform calculation exactly. Issue #263 incorrectly suggested "diff from mean" based on the R implementation, but the platform actually uses the weighted average of absolute log odds.
See also
log_odds_polars – Calculate per-bin log odds
References
Issue #263: Calculation of Feature Importance incorrect
Issue #404: Add feature importance explanation to ADM Explained
Platform: GroupedPredictor.java calculatePredictorImportance()
ADM Explained: Feature Importance section
- gains_table(df, value: str, index=None, by=None)¶
Calculates cumulative gains from any data frame.
The cumulative gains are the cumulative values, expressed as a percentage of the total, plotted against the size of the population, also expressed as a percentage.
- Parameters:
df (pl.DataFrame) – The (Polars) dataframe with the raw values
value (str) – The name of the field with the values (plotted on y-axis)
index (str, optional) – Optional name of the field for the x-axis. If not passed in, all records are used and weighted equally.
by (str or list, optional) – Optional grouping field(s); can be None.
- Returns:
A (Polars) dataframe with cum_x and cum_y columns and optionally the grouping column(s). Values for cum_x and cum_y are relative, expressed as values between 0 and 1.
- Return type:
pl.DataFrame
Examples
>>> gains_data = gains_table(df, 'ResponseCount', by=['Channel','Direction'])
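For a single ungrouped series of values, the cumulative-gains calculation can be sketched in plain Python (illustrative only; gains_table itself operates on Polars dataframes with optional grouping):

```python
def gains_sketch(values):
    """Cumulative gains: sort descending, express position and cumulative
    sum as fractions between 0 and 1 (cum_x, cum_y pairs)."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    n = len(ordered)
    cum = 0.0
    table = []
    for i, v in enumerate(ordered, start=1):
        cum += v
        table.append((i / n, cum / total))  # (cum_x, cum_y)
    return table
```

For values [3, 1], the top half of the population already captures 75% of the total, so the table reads [(0.5, 0.75), (1.0, 1.0)].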