pdstools.utils.cdh_utils

Helpers for working with Pega CDH-style data.

This package preserves the public surface of the previous cdh_utils module while splitting the implementation across several focused private submodules:

  • _dates — Pega date-time parsing and start/end-date resolution.

  • _namespacing — Pega field-name normalisation (_capitalize) and predictor-categorisation defaults.

  • _polars — Polars expression / frame helpers (queries, sampling, schema casting, list-overlap utilities, weighted averages).

  • _metrics — Performance metrics: AUC, lift, log-odds, gains tables and feature importance.

  • _io — File, temp-directory, logger setup and version-check helpers.

  • _misc — Small standalone helpers (list flattening, plot legend colors).

Submodule names are underscore-prefixed; only this __init__ is the supported import surface. Imports such as from pdstools.utils.cdh_utils import safe_int continue to resolve unchanged.

Submodules

Attributes

Functions

from_prpc_date_time(→ datetime.datetime | str)

Convert from a Pega date-time string.

parse_pega_date_time_formats(→ polars.Expr)

Parses Pega DateTime formats.

to_prpc_date_time(→ str)

Convert to a Pega date-time string

create_working_and_temp_dir(→ tuple[pathlib.Path, pathlib.Path])

Creates a working directory for saving files and a temp_dir

get_latest_pdstools_version()

process_files_to_bytes(→ tuple[bytes, str])

Processes a list of file paths, returning file content as bytes and a corresponding file name.

setup_logger()

Return the pdstools logger and a log buffer it streams into.

auc_from_bincounts(→ float)

Calculates AUC from counts of positives and negatives directly

auc_from_probs(→ float)

Calculates AUC from an array of truth values and predictions.

auc_to_gini(→ float)

Convert AUC performance metric to GINI

aucpr_from_bincounts(→ float)

Calculates PR AUC (precision-recall) from counts of positives and negatives directly.

aucpr_from_probs(→ float)

Calculates PR AUC (precision-recall) from an array of truth values and predictions.

bin_log_odds(→ list[float])

Pure Python reference implementation of per-bin log odds.

feature_importance(→ polars.Expr)

Calculate feature importance for Naive Bayes predictors.

gains_table(df, value[, index, by])

Calculates cumulative gains from any data frame.

lift(pos_col, neg_col)

Calculates the Lift for predictor bins.

log_odds_polars(positives, negatives)

Calculate log odds per bin with correct Laplace smoothing.

safe_range_auc(→ float)

Internal helper to keep auc a safe number between 0.5 and 1.0 always.

z_ratio(pos_col, neg_col)

Calculates the Z-Ratio for predictor bins.

legend_color_order(fig)

Orders legend colors alphabetically to provide Pega color consistency among categories.

safe_flatten_list(→ list | None)

Flatten one level of alist, drop None entries, and prepend extras.

default_predictor_categorization(→ polars.Expr)

Function to determine the 'category' of a predictor.

is_valid_polars_duration(→ bool)

Validate Polars duration syntax.

lazy_sample(→ pdstools.utils.cdh_utils._common.F)

overlap_lists_polars(→ polars.Series)

Calculate the average overlap ratio of each list element with all other list elements into a single Series.

overlap_matrix(→ polars.DataFrame)

Calculate the overlap of a list element with all other list elements returning a full matrix.

weighted_average_polars(→ polars.Expr)

weighted_performance_polars(→ polars.Expr)

Polars function to return a weighted performance

Package Contents

type QUERY = pl.Expr | Iterable[pl.Expr] | dict[str, list]
F
logger
from_prpc_date_time(x: str, return_string: bool = False, use_timezones: bool = True) datetime.datetime | str

Convert from a Pega date-time string.

Parameters:
  • x (str) – String of Pega date-time

  • return_string (bool, default=False) – If True it will return the date in string format. If False it will return in datetime type

  • use_timezones (bool)

Returns:

The converted date in datetime format or string.

Return type:

datetime.datetime | str

Examples

>>> from_prpc_date_time("20180316T134127.847 GMT")
>>> from_prpc_date_time("20180316T134127.847 GMT", True)
>>> from_prpc_date_time("20180316T184127.846")
>>> from_prpc_date_time("20180316T184127.846", True)
parse_pega_date_time_formats(timestamp_col='SnapshotTime', timestamp_fmt: str | None = None, timestamp_dtype: polars._typing.PolarsTemporalType = pl.Datetime) polars.Expr

Parses Pega DateTime formats.

Supports commonly used formats:

  • “%Y-%m-%d %H:%M:%S”

  • “%Y%m%dT%H%M%S.%f %Z”

  • “%d-%b-%y”

  • “%d%b%Y:%H:%M:%S”

  • “%Y%m%d”

Removes timezones, and rounds to seconds, with a ‘ns’ time unit.

In the implementation, the final expression falls back to timestamp_fmt or %Y. This is a deliberate workaround: passing None makes Polars try to infer the format automatically, and inference raises an error when it cannot find an appropriate format.

Parameters:
  • timestamp_col (str, default = 'SnapshotTime') – The column to parse

  • timestamp_fmt (str, default = None) – An optional format to use rather than the default formats

  • timestamp_dtype (PolarsTemporalType, default = pl.Datetime) – The data type to convert into. Can be either Date, Datetime, or Time.

Return type:

polars.Expr

to_prpc_date_time(dt: datetime.datetime) str

Convert to a Pega date-time string

Parameters:
Returns:

A string representation in the format used by Pega

Return type:

str

Examples

>>> to_prpc_date_time(datetime.datetime.now())
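The conversion can be sketched in plain Python, assuming the format is yyyymmddThhmmss.mmm plus a timezone label, as in the from_prpc_date_time examples above (hypothetical helper name):

```python
from datetime import datetime

def to_prpc_sketch(dt: datetime) -> str:
    # %f produces microseconds (6 digits); drop the last three for milliseconds
    return dt.strftime("%Y%m%dT%H%M%S.%f")[:-3] + " GMT"
```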
create_working_and_temp_dir(name: str | None = None, working_dir: os.PathLike | None = None) tuple[pathlib.Path, pathlib.Path]

Creates a working directory for saving files and a temp_dir

Parameters:
Return type:

tuple[pathlib.Path, pathlib.Path]

get_latest_pdstools_version()
process_files_to_bytes(file_paths: list[str | pathlib.Path], base_file_name: str | pathlib.Path) tuple[bytes, str]

Processes a list of file paths, returning file content as bytes and a corresponding file name. Useful for zipping multiple model reports; the bytes object is used for downloading files in the Streamlit app.

This function handles three scenarios:

  1. Single file: Returns the file’s content as bytes and the provided base file name.

  2. Multiple files: Creates a zip file containing all files, returns the zip file’s content as bytes and a generated zip file name.

  3. No files: Returns empty bytes and an empty string.

Parameters:
  • file_paths (list[str | Path]) – A list of file paths to process. Can be empty, contain a single path, or multiple paths.

  • base_file_name (str | Path) – The base name to use for the output file. For a single file, this name is returned as is. For multiple files, this is used as part of the generated zip file name.

Returns:

A tuple containing: - bytes: The content of the single file or the created zip file, or empty bytes if no files. - str: The file name (either base_file_name or a generated zip file name), or an empty string if no files.

Return type:

tuple[bytes, str]

setup_logger()

Return the pdstools logger and a log buffer it streams into.

Targets the named pdstools logger rather than the root logger so we don’t clobber the host application’s logging config (Streamlit, Quarto, Jupyter, etc.). Idempotent: repeated calls return the same buffer instead of stacking new handlers, so re-running a notebook cell or bouncing a Streamlit page doesn’t produce duplicated log lines.
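The idempotent setup can be sketched with a module-level guard (names hypothetical; the actual implementation may differ):

```python
import io
import logging

_buffer = None  # module-level guard: created once, reused on later calls

def setup_logger_sketch():
    # Attach a StreamHandler to the *named* logger exactly once, leaving
    # the root logger (and any host-app handlers) untouched.
    global _buffer
    logger = logging.getLogger("pdstools")
    if _buffer is None:
        _buffer = io.StringIO()
        handler = logging.StreamHandler(_buffer)
        handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger, _buffer
```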

auc_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) float

Calculates AUC from counts of positives and negatives directly. This is an efficient calculation of the area under the ROC curve from arrays of positive and negative counts. It always returns a value between 0.5 and 1.0, and returns 0.5 when there is just one ground-truth label.

Parameters:
  • pos (list[int]) – Vector with counts of the positive responses

  • neg (list[int]) – Vector with counts of the negative responses

  • probs (list[float]) – Optional list with probabilities which will be used to set the order of the bins. If missing defaults to pos/(pos+neg).

Returns:

The AUC as a value between 0.5 and 1.

Return type:

float

Examples

>>> auc_from_bincounts([3,1,0], [2,0,1])
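A minimal NumPy sketch of the bin-count AUC: order the bins by propensity, build the implied ROC points, take the trapezoidal area, and apply the same fold-to-[0.5, 1.0] safeguard the docstring describes:

```python
import numpy as np

def auc_from_bincounts_sketch(pos, neg, probs=None):
    pos, neg = np.asarray(pos, dtype=float), np.asarray(neg, dtype=float)
    if probs is None:
        probs = pos / (pos + neg)      # default bin ordering: observed propensity
    order = np.argsort(probs)[::-1]    # best bins first
    tpr = np.concatenate([[0.0], np.cumsum(pos[order]) / pos.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(neg[order]) / neg.sum()])
    # Trapezoidal area under the ROC curve
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
    return 0.5 + abs(0.5 - auc)        # keep the result in [0.5, 1.0]
```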
auc_from_probs(groundtruth: list[int], probs: list[float]) float

Calculates AUC from an array of truth values and predictions. Computes the area under the ROC curve, always returning a value between 0.5 and 1.0; returns 0.5 when there is just one ground-truth label.

Parameters:
  • groundtruth (list[int]) – The ‘true’ values. Positive values must be represented as True or 1; negative values as False or 0.

  • probs (list[float]) – The predictions, as a numeric vector of the same length as groundtruth

Returns:

The AUC as a value between 0.5 and 1.

Return type:

float

Examples

>>> auc_from_probs([1, 1, 0], [0.6, 0.2, 0.2])
auc_to_gini(auc: float) float

Convert AUC performance metric to GINI

Parameters:

auc (float) – The AUC (number between 0.5 and 1)

Returns:

GINI metric, a number between 0 and 1

Return type:

float

Examples

>>> auc_to_gini(0.8232)
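The conversion is a linear rescaling of AUC from [0.5, 1] onto [0, 1]:

```python
def auc_to_gini_sketch(auc: float) -> float:
    # GINI = 2 * AUC - 1, so 0.5 (random) maps to 0 and 1.0 (perfect) to 1
    return 2 * auc - 1
```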
aucpr_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) float

Calculates PR AUC (precision-recall) from counts of positives and negatives directly. This is an efficient calculation of the area under the PR curve directly from an array of positives and negatives. Returns 0.0 when there is just one groundtruth label.

Parameters:
  • pos (list[int]) – Vector with counts of the positive responses

  • neg (list[int]) – Vector with counts of the negative responses

  • probs (list[float]) – Optional list with probabilities which will be used to set the order of the bins. If missing defaults to pos/(pos+neg).

Returns:

The PR AUC as a value between 0.0 and 1.

Return type:

float

Examples

>>> aucpr_from_bincounts([3,1,0], [2,0,1])
aucpr_from_probs(groundtruth: list[int], probs: list[float]) float

Calculates PR AUC (precision-recall) from an array of truth values and predictions. Computes the area under the precision-recall curve; returns 0.0 when there is just one ground-truth label.

Parameters:
  • groundtruth (list[int]) – The ‘true’ values. Positive values must be represented as True or 1; negative values as False or 0.

  • probs (list[float]) – The predictions, as a numeric vector of the same length as groundtruth

Returns:

The PR AUC as a value between 0.0 and 1.

Return type:

float

Examples

>>> aucpr_from_probs([1, 1, 0], [0.6, 0.2, 0.2])
bin_log_odds(bin_pos: list[float], bin_neg: list[float]) list[float]

Pure Python reference implementation of per-bin log odds (see log_odds_polars).

Parameters:
Return type:

list[float]
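The pure-Python computation can be sketched directly from the formula given under log_odds_polars below, using the 1/nBins Laplace smoothing term (hypothetical helper name):

```python
import math

def bin_log_odds_sketch(bin_pos, bin_neg):
    # Per-bin log odds with 1/nBins Laplace smoothing, where nBins is the
    # number of bins of this predictor.
    n = len(bin_pos)
    tot_pos, tot_neg = sum(bin_pos), sum(bin_neg)
    return [
        math.log(p + 1 / n) - math.log(tot_pos + 1)
        - (math.log(q + 1 / n) - math.log(tot_neg + 1))
        for p, q in zip(bin_pos, bin_neg)
    ]
```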

feature_importance(over: list[str] | None = None, scaled: bool = True) polars.Expr

Calculate feature importance for Naive Bayes predictors.

Feature importance represents the weighted average of absolute log odds values across all bins, weighted by bin response counts. This measures how strongly the predictor differentiates between positive and negative outcomes.

Algorithm (matches platform GroupedPredictor.calculatePredictorImportance()):

  1. Calculate log odds per bin with Laplace smoothing (1/nBins)

  2. Take the absolute value of each bin’s log odds

  3. Calculate the weighted average: Sum(|logOdds(bin)| × binResponses) / totalResponses

  4. Optionally scale to the 0-100 range (scaled=True, default)

This matches the Pega platform implementation in: adaptive-learning-core-lib/…/GroupedPredictor.java lines 371-382

Formula:

Feature Importance = Σ |logOdds(bin)| × (binResponses / totalResponses)

Parameters:
  • over (list[str], optional) – Grouping columns. Defaults to ["PredictorName", "ModelID"].

  • scaled (bool, default True) – If True, scale importance to 0-100 where max predictor = 100

Returns:

Feature importance expression

Return type:

pl.Expr

Examples

>>> df.with_columns(
...     feature_importance().over("PredictorName", "ModelID")
... )

Notes

This implementation matches the platform calculation exactly. Issue #263 incorrectly suggested “diff from mean” based on R implementation, but the platform actually uses weighted average of absolute log odds.

See also

log_odds_polars

Calculate per-bin log odds

References

  • Issue #263: Calculation of Feature Importance incorrect

  • Issue #404: Add feature importance explanation to ADM Explained

  • Platform: GroupedPredictor.java calculatePredictorImportance()

  • ADM Explained: Feature Importance section

gains_table(df, value: str, index=None, by=None)

Calculates cumulative gains from any data frame.

The cumulative gains are the cumulative values expressed as a percentage vs the size of the population, also expressed as a percentage.

Parameters:
  • df (pl.DataFrame) – The (Polars) dataframe with the raw values

  • value (str) – The name of the field with the values (plotted on y-axis)

  • index (str, default = None) – Optional name of the field for the x-axis. If not passed in, all records are used and weighted equally.

  • by (str | list[str], default = None) – Optional grouping field(s).

Returns:

A (Polars) dataframe with cum_x and cum_y columns and optionally the grouping column(s). Values for cum_x and cum_y are relative so expressed as values 0-1.

Return type:

pl.DataFrame

Examples

>>> gains_data = gains_table(df, 'ResponseCount', by=['Channel','Direction'])
lift(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) polars.Expr

Calculates the Lift for predictor bins.

The Lift is the ratio of the propensity in a particular bin over the average propensity. So a value of 1 is the average, larger than 1 means higher propensity, smaller means lower propensity.

Parameters:
  • pos_col (str | polars.Expr, default = pl.col('BinPositives')) – The (Polars) column of the bin positives

  • neg_col (str | polars.Expr, default = pl.col('BinNegatives')) – The (Polars) column of the bin negatives

Return type:

polars.Expr

Examples

>>> df.group_by(['ModelID', 'PredictorName']).agg([lift()]).explode()
log_odds_polars(positives: polars.Expr | str = pl.col('Positives'), negatives: polars.Expr | str = pl.col('ResponseCount') - pl.col('Positives')) polars.Expr

Calculate log odds per bin with correct Laplace smoothing.

Formula (per bin i in predictor p):

log(pos_i + 1/nBins) - log(sum(pos) + 1) - [log(neg_i + 1/nBins) - log(sum(neg) + 1)]

Laplace smoothing uses 1/nBins where nBins is the number of bins for that specific predictor. This matches the platform implementation in GroupedPredictor.java.

Must be used with .over() to calculate nBins per predictor group:

.with_columns(log_odds_polars().over("PredictorName", "ModelID"))

Parameters:
  • positives (pl.Expr or str) – Column with positive response counts per bin

  • negatives (pl.Expr or str) – Column with negative response counts per bin

Returns:

Log odds expression (use with .over() for correct grouping)

Return type:

pl.Expr

See also

feature_importance

Calculate predictor importance from log odds

bin_log_odds

Pure Python version (reference implementation)

Examples

>>> # For propensity calculation in classifier
>>> df.with_columns(
...     log_odds_polars(
...         pl.col("BinPositives"),
...         pl.col("BinNegatives")
...     ).over("PredictorName", "ModelID")
... )
safe_range_auc(auc: float) float

Internal helper to keep auc a safe number between 0.5 and 1.0 always.

Parameters:

auc (float) – The AUC (Area Under the Curve) score

Returns:

‘Safe’ AUC score, between 0.5 and 1.0

Return type:

float
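A sketch of the safeguard: values below 0.5 are mirrored above it, since a consistently inverted predictor carries the same information (the NaN handling shown is an assumption):

```python
import math

def safe_range_auc_sketch(auc: float) -> float:
    # Fold sub-0.5 values back above 0.5; treat NaN as "no signal" (assumption)
    return 0.5 if math.isnan(auc) else 0.5 + abs(0.5 - auc)
```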

z_ratio(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) polars.Expr

Calculates the Z-Ratio for predictor bins.

The Z-ratio is a measure of how the propensity in a bin differs from the average, but takes into account the size of the bin and thus is statistically more relevant. It represents the number of standard deviations from the average, so centers around 0. The wider the spread, the better the predictor is.

To recreate the OOTB ZRatios from the datamart, use in a group_by. See examples.

Parameters:
  • pos_col (str | polars.Expr, default = pl.col('BinPositives')) – The (Polars) column of the bin positives

  • neg_col (str | polars.Expr, default = pl.col('BinNegatives')) – The (Polars) column of the bin negatives

Return type:

polars.Expr

Examples

>>> df.group_by(['ModelID', 'PredictorName']).agg([z_ratio()]).explode()
legend_color_order(fig)

Orders legend colors alphabetically to provide Pega color consistency among different categories.

safe_flatten_list(alist: list | None, extras: list | None = None) list | None

Flatten one level of alist, drop None entries, and prepend extras.

The result is order-preserving and de-duplicated. Strings are treated as atoms (not iterated). Both alist and extras are read-only — the caller’s lists are never mutated. Returns None when the result would be empty so callers can use the truthiness as a “no grouping” signal.

Parameters:
  • alist (list | None)

  • extras (list | None)

Return type:

list | None
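The described semantics can be sketched in plain Python (hypothetical helper name; de-duplication via dict.fromkeys keeps first-seen order):

```python
def safe_flatten_list_sketch(alist, extras=None):
    # Prepend extras, flatten one level (strings stay atomic), drop Nones.
    # Neither input list is mutated.
    items = []
    for item in (extras or []) + (alist or []):
        if isinstance(item, (list, tuple)):
            items.extend(i for i in item if i is not None)
        elif item is not None:
            items.append(item)
    deduped = list(dict.fromkeys(items))  # order-preserving de-dupe
    return deduped or None  # None signals "no grouping" to callers
```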

default_predictor_categorization(x: str | polars.Expr = pl.col('PredictorName')) polars.Expr

Function to determine the ‘category’ of a predictor.

It is possible to supply a custom function: it can accept an optional column as input and should return a Polars expression. The most straightforward way to implement this is with pl.when().then().otherwise(), which you can chain.

By default, this function returns “Primary” whenever there is no ‘.’ anywhere in the name string, and otherwise returns the substring before the first period.

Parameters:

x (str | pl.Expr, default = pl.col('PredictorName')) – The column to parse

Return type:

polars.Expr

POLARS_DURATION_PATTERN
is_valid_polars_duration(value: str, max_length: int = 30) bool

Validate Polars duration syntax.

Checks if a string is a valid Polars duration (e.g., “1d”, “1w”, “1mo”, “1h30m”). Used to validate user input before passing to Polars methods like dt.truncate() or group_by_dynamic().

Parameters:
  • value (str) – The duration string to validate.

  • max_length (int, default 30) – Maximum allowed string length (prevents excessive input).

Returns:

True if the string is a valid Polars duration, False otherwise.

Return type:

bool

Examples

>>> is_valid_polars_duration("1d")
True
>>> is_valid_polars_duration("1w")
True
>>> is_valid_polars_duration("1h30m")
True
>>> is_valid_polars_duration("invalid")
False
>>> is_valid_polars_duration("")
False
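The check can be sketched as a length guard plus an anchored regular expression; the pattern below is an assumption built from the duration units listed in the Polars documentation (ns, us, ms, s, m, h, d, w, mo, q, y, i), not the actual POLARS_DURATION_PATTERN:

```python
import re

# Assumed pattern: one or more <integer><unit> tokens; "mo" must be tried
# before the single-letter "m" so "1mo" is not split as "1m" + "o".
_DURATION_SKETCH = re.compile(r"^(\d+(ns|us|ms|mo|[smhdwqyi]))+$")

def is_valid_polars_duration_sketch(value: str, max_length: int = 30) -> bool:
    return 0 < len(value) <= max_length and _DURATION_SKETCH.match(value) is not None
```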
lazy_sample(df: pdstools.utils.cdh_utils._common.F, n_rows: int, with_replacement: bool = True) pdstools.utils.cdh_utils._common.F

Sample n_rows rows from a frame, optionally with replacement.

Parameters:
Return type:

pdstools.utils.cdh_utils._common.F

overlap_lists_polars(col: polars.Series) polars.Series

Calculate the average overlap ratio of each list element with all other list elements into a single Series.

For each list in the input Series, this function calculates the average overlap (intersection) with all other lists, normalized by the size of the original list. The overlap ratio represents how much each list has in common with all other lists on average.

Parameters:

col (pl.Series) – A Polars Series where each element is a list. The function will calculate the overlap between each list and all other lists in the Series.

Returns:

A Polars Series of float values representing the average overlap ratio for each list. Each value is calculated as: (sum of intersection sizes with all other lists) / (number of other lists) / (size of original list)

Return type:

pl.Series

Examples

>>> import polars as pl
>>> data = pl.Series([
...     [1, 2, 3],
...     [2, 3, 4, 6],
...     [3, 5, 7, 8]
... ])
>>> overlap_lists_polars(data)
shape: (3,)
Series: '' [f64]
[
    0.5
    0.375
    0.25
]
>>> df = pl.DataFrame({"Channel" : ["Mobile", "Web", "Email"], "Actions" : pl.Series([
...     [1, 2, 3],
...     [2, 3, 4, 6],
...     [3, 5, 7, 8]
... ])})
>>> df.with_columns(pl.col("Actions").map_batches(overlap_lists_polars))
shape: (3, 2)
┌─────────┬─────────┐
│ Channel │ Actions │
│ ---     │ ---     │
│ str     │ f64     │
╞═════════╪═════════╡
│ Mobile  │ 0.5     │
│ Web     │ 0.375   │
│ Email   │ 0.25    │
└─────────┴─────────┘
overlap_matrix(df: polars.DataFrame, list_col: str, by: str, show_fraction: bool = True) polars.DataFrame

Calculate the overlap of a list element with all other list elements returning a full matrix.

For each list in the specified column, this function calculates the overlap ratio (intersection size divided by the original list size) with every other list in the column, including itself. The result is a matrix where each row represents the overlap ratios for one list with all others.

Parameters:
  • df (pl.DataFrame) – The Polars DataFrame containing the list column and grouping column.

  • list_col (str) – The name of the column containing the lists. Each element in this column should be a list.

  • by (str) – The name of the column to use for grouping and labeling the rows in the result matrix.

  • show_fraction (bool)

Returns:

A DataFrame where: - Each row represents the overlap ratios for one list with all others - Each column (except the last) represents the overlap ratio with a specific list - Column names are formatted as “Overlap_{list_col_name}_{by}” - The last column contains the original values from the ‘by’ column

Return type:

pl.DataFrame

Examples

>>> import polars as pl
>>> df = pl.DataFrame({
...     "Channel": ["Mobile", "Web", "Email"],
...     "Actions": [
...         [1, 2, 3],
...         [2, 3, 4, 6],
...         [3, 5, 7, 8]
...     ]
... })
>>> overlap_matrix(df, "Actions", "Channel")
shape: (3, 4)
┌───────────────────┬───────────────┬───────────────┬─────────┐
│ Overlap_Actions_M… │ Overlap_Actio… │ Overlap_Actio… │ Channel │
│ ---               │ ---           │ ---           │ ---     │
│ f64               │ f64           │ f64           │ str     │
╞═══════════════════╪═══════════════╪═══════════════╪═════════╡
│ 1.0               │ 0.6666667     │ 0.3333333     │ Mobile  │
│ 0.5               │ 1.0           │ 0.25          │ Web     │
│ 0.25              │ 0.25          │ 1.0           │ Email   │
└───────────────────┴───────────────┴───────────────┴─────────┘
weighted_average_polars(vals: str | polars.Expr, weights: str | polars.Expr) polars.Expr

Polars expression for the average of vals, weighted by weights.

Parameters:
  • vals (str | polars.Expr)

  • weights (str | polars.Expr)

Return type:

polars.Expr

weighted_performance_polars(vals: str | polars.Expr = 'Performance', weights: str | polars.Expr = 'ResponseCount') polars.Expr

Polars function to return a weighted performance

Parameters:
  • vals (str | polars.Expr)

  • weights (str | polars.Expr)

Return type:

polars.Expr