pdstools.utils.cdh_utils

cdhtools: Data Science add-ons for Pega.

Various utilities to access and manipulate data from Pega for purposes of data analysis, reporting and monitoring.

Module Contents

Functions

defaultPredictorCategorization(→ polars.Expr)

Function to determine the 'category' of a predictor.

_extract_keys(→ pdstools.utils.types.any_frame)

Extracts keys out of the pyName column

parsePegaDateTimeFormats([timestampCol, ...])

Parses Pega DateTime formats.

getTypeMapping(df, definition[, verbose])

This function is used to convert the data types of columns in a DataFrame to the desired types.

set_types(df[, table, verbose])

inferTableDefinition(df)

safe_range_auc(→ float)

Internal helper to keep the AUC a safe number between 0.5 and 1.0.

auc_from_probs(→ List[float])

Calculates AUC from an array of truth values and predictions.

auc_from_bincounts(→ float)

Calculates AUC from counts of positives and negatives directly

aucpr_from_probs(→ List[float])

Calculates PR AUC (precision-recall) from an array of truth values and predictions.

aucpr_from_bincounts(→ float)

Calculates PR AUC (precision-recall) from counts of positives and negatives directly.

auc2GINI(→ float)

Convert AUC performance metric to GINI

_capitalize(→ list)

Applies automatic capitalization, aligned with the R counterpart.

_polarsCapitalize(df)

fromPRPCDateTime(→ Union[datetime.datetime, str])

Convert from a Pega date-time string.

toPRPCDateTime(→ str)

Convert to a Pega date-time string

weighted_average_polars(→ polars.Expr)

weighted_performance_polars(→ polars.Expr)

Polars function to return a weighted performance

zRatio([posCol, negCol])

Calculates the Z-Ratio for predictor bins.

lift([posCol, negCol])

Calculates the Lift for predictor bins.

LogOdds([Positives, Negatives])

featureImportance([over])

gains_table(df, value[, index, by])

Calculates cumulative gains from any data frame.

legend_color_order(fig)

Orders legend colors alphabetically to provide Pega color consistency

sync_reports([checkOnly, autoUpdate])

Compares the report files in your local directory to the repo

defaultPredictorCategorization(x: str | polars.Expr = pl.col('PredictorName')) → polars.Expr

Function to determine the ‘category’ of a predictor.

It is possible to supply a custom function. Such a function can accept an optional column as input and should return a Polars expression; the most straightforward way to implement this is with pl.when().then().otherwise(), which you can chain.

By default, this function returns “Primary” whenever there is no ‘.’ anywhere in the name string, and otherwise returns the substring before the first period.

Parameters:

x (Union[str, pl.Expr], default = pl.col('PredictorName')) – The column to parse

Return type:

polars.Expr
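As an illustration of the documented default rule (not the library's actual implementation, which returns a Polars expression), the categorization can be sketched in plain Python:

```python
def categorize(name: str) -> str:
    """Hypothetical sketch of the documented default categorization."""
    # "Primary" when there is no '.' anywhere in the name,
    # otherwise the substring before the first period.
    return "Primary" if "." not in name else name.split(".", 1)[0]

print(categorize("Age"))           # -> Primary
print(categorize("Customer.Age"))  # -> Customer
```

The real function expresses the same logic as a chained pl.when().then().otherwise() expression, as described above.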

_extract_keys(df: pdstools.utils.types.any_frame, col='Name', capitalize=True, import_strategy='eager') → pdstools.utils.types.any_frame

Extracts keys out of the pyName column

This is not a lazy operation as we don’t know the possible keys in advance. For that reason, we select only the pyName column, extract the keys from that, and then collect the resulting dataframe. This dataframe is then joined back to the original dataframe.

This is relatively efficient, but we still do need the whole pyName column in memory to do this, so it won’t work completely lazily from e.g. s3. That’s why it only works with eager mode.

Parameters:

df (Union[pl.DataFrame, pl.LazyFrame]) – The dataframe to extract the keys from

Return type:

pdstools.utils.types.any_frame

parsePegaDateTimeFormats(timestampCol='SnapshotTime', timestamp_fmt: str = None, strict_conversion: bool = True)

Parses Pega DateTime formats.

Supports the two most commonly used formats:

  • “%Y-%m-%d %H:%M:%S”

  • “%Y%m%dT%H%M%S.%f %Z”

To parse a different format or timezone, supply a custom format through the timestamp_fmt argument.

Removes timezones and rounds to seconds, with a ‘ns’ time unit.

Parameters:
  • timestampCol (str, default = 'SnapshotTime') – The column to parse

  • timestamp_fmt (str, default = None) – An optional format to use rather than the default formats

  • strict_conversion (bool, default = True) – Whether to error on incorrect parses or just return Null values
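To make the two supported formats concrete, here is a plain-Python sketch (parse_pega_ts is a hypothetical helper, not part of pdstools; the real function operates on Polars columns):

```python
from datetime import datetime

def parse_pega_ts(s: str) -> datetime:
    """Hypothetical sketch: try the two documented Pega formats in turn."""
    # The " GMT" suffix is stripped first, since strptime's %Z handling varies.
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y%m%dT%H%M%S.%f"):
        try:
            return datetime.strptime(s.removesuffix(" GMT"), fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized Pega timestamp: {s!r}")

print(parse_pega_ts("2018-03-16 13:41:27"))
print(parse_pega_ts("20180316T134127.847 GMT"))
```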

getTypeMapping(df, definition, verbose=False, **timestamp_opts)

This function is used to convert the data types of columns in a DataFrame to the desired types. The desired types are defined in a PegaDefaultTables class.

Parameters:
  • df (pl.LazyFrame) – The DataFrame whose columns’ data types need to be converted.

  • definition (PegaDefaultTables) – A PegaDefaultTables object that contains the desired data types for the columns.

  • verbose (bool) – If True, the function will print a message when a column is not in the default table schema.

  • timestamp_opts (str) – Additional arguments for timestamp parsing.

Returns:

A list with polars expressions for casting data types.

Return type:

List

set_types(df, table='infer', verbose=False, **timestamp_opts)
inferTableDefinition(df)
safe_range_auc(auc: float) → float

Internal helper to keep the AUC a safe number between 0.5 and 1.0.

Parameters:

auc (float) – The AUC (Area Under the Curve) score

Returns:

‘Safe’ AUC score, between 0.5 and 1.0

Return type:

float

auc_from_probs(groundtruth: List[int], probs: List[float]) → List[float]

Calculates AUC from an array of truth values and predictions. Computes the area under the ROC curve, making sure to always return a value between 0.5 and 1.0; returns 0.5 when there is just one groundtruth label.

Parameters:
  • groundtruth (List[int]) – The ‘true’ values. Positive values must be represented as True or 1; negative values as False or 0.

  • probs (List[float]) – The predictions, as a numeric vector of the same length as groundtruth

Returns:

The AUC as a value between 0.5 and 1.

Examples

>>> auc_from_probs([1,1,0], [0.6,0.2,0.2])

Return type:

List[float]
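Conceptually this is the rank-based (Mann-Whitney) AUC. The sketch below is a plain-Python illustration of the documented behavior, not the library's implementation:

```python
def safe_range(auc: float) -> float:
    # Mirror safe_range_auc: the result is always between 0.5 and 1.0.
    return 0.5 + abs(0.5 - auc)

def auc_sketch(groundtruth, probs):
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    pos = [p for g, p in zip(groundtruth, probs) if g]
    neg = [p for g, p in zip(groundtruth, probs) if not g]
    if not pos or not neg:  # only one class present
        return 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return safe_range(wins / (len(pos) * len(neg)))

print(auc_sketch([1, 1, 0], [0.6, 0.2, 0.2]))  # -> 0.75
```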

auc_from_bincounts(pos: List[int], neg: List[int], probs: List[float] = None) → float

Calculates AUC from counts of positives and negatives directly. This is an efficient calculation of the area under the ROC curve directly from an array of positives and negatives. It makes sure to always return a value between 0.5 and 1.0 and will return 0.5 when there is just one groundtruth label.

Parameters:
  • pos (List[int]) – Vector with counts of the positive responses

  • neg (List[int]) – Vector with counts of the negative responses

  • probs (List[float]) – Optional list with probabilities which will be used to set the order of the bins. If missing defaults to pos/(pos+neg).

Returns:

The AUC as a value between 0.5 and 1.

Examples

>>> auc_from_bincounts([3,1,0], [2,0,1])

Return type:

float
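The bin-count calculation can be sketched as a trapezoidal sum under the ROC curve, ordering bins by their propensity pos/(pos+neg) as documented for the probs default. This is an illustrative sketch, not the library's code:

```python
def auc_bins_sketch(pos, neg):
    # Order bins by propensity pos/(pos+neg), descending, then sum
    # trapezoids under the ROC curve; clamp into the 'safe' 0.5-1.0 range.
    order = sorted(
        range(len(pos)),
        key=lambda i: pos[i] / (pos[i] + neg[i]) if pos[i] + neg[i] else 0.0,
        reverse=True,
    )
    P, N = sum(pos), sum(neg)
    tpr = area = 0.0
    for i in order:
        d_tpr, d_fpr = pos[i] / P, neg[i] / N
        area += d_fpr * (tpr + d_tpr / 2)  # trapezoid under this ROC step
        tpr += d_tpr
    return 0.5 + abs(0.5 - area)

print(auc_bins_sketch([3, 1, 0], [2, 0, 1]))
```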

aucpr_from_probs(groundtruth: List[int], probs: List[float]) → List[float]

Calculates PR AUC (precision-recall) from an array of truth values and predictions. Computes the area under the precision-recall curve; returns 0.0 when there is just one groundtruth label.

Parameters:
  • groundtruth (List[int]) – The ‘true’ values. Positive values must be represented as True or 1; negative values as False or 0.

  • probs (List[float]) – The predictions, as a numeric vector of the same length as groundtruth

Returns:

The PR AUC as a value between 0.0 and 1.

Examples

>>> aucpr_from_probs([1,1,0], [0.6,0.2,0.2])

Return type:

List[float]

aucpr_from_bincounts(pos: List[int], neg: List[int], probs: List[float] = None) → float

Calculates PR AUC (precision-recall) from counts of positives and negatives directly. This is an efficient calculation of the area under the PR curve directly from an array of positives and negatives. Returns 0.0 when there is just one groundtruth label.

Parameters:
  • pos (List[int]) – Vector with counts of the positive responses

  • neg (List[int]) – Vector with counts of the negative responses

  • probs (List[float]) – Optional list with probabilities which will be used to set the order of the bins. If missing defaults to pos/(pos+neg).

Returns:

The PR AUC as a value between 0.0 and 1.

Examples

>>> aucpr_from_bincounts([3,1,0], [2,0,1])

Return type:

float
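For intuition, one plausible formulation of the PR-curve area from bin counts is sketched below (the exact formulation used by the library is an assumption here):

```python
def aucpr_bins_sketch(pos, neg):
    # Assumed formulation: order bins by propensity, accumulate
    # recall/precision per bin, and sum trapezoids under the PR curve.
    order = sorted(
        range(len(pos)),
        key=lambda i: pos[i] / (pos[i] + neg[i]) if pos[i] + neg[i] else 0.0,
        reverse=True,
    )
    P = sum(pos)
    tp = fp = 0
    prev_recall, prev_prec, area = 0.0, 1.0, 0.0
    for i in order:
        tp, fp = tp + pos[i], fp + neg[i]
        recall, prec = tp / P, tp / (tp + fp)
        area += (recall - prev_recall) * (prec + prev_prec) / 2
        prev_recall, prev_prec = recall, prec
    return area

print(aucpr_bins_sketch([3, 1, 0], [2, 0, 1]))
```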

auc2GINI(auc: float) → float

Convert AUC performance metric to GINI

Parameters:

auc (float) – The AUC (number between 0.5 and 1)

Returns:

GINI metric, a number between 0 and 1

Examples

>>> auc2GINI(0.8232)

Return type:

float
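The conversion is the standard linear rescaling of (safe-ranged) AUC from [0.5, 1] to [0, 1], sketched here for illustration:

```python
def auc2gini(auc: float) -> float:
    # GINI = 2 * AUC - 1, after clamping AUC into the 'safe' 0.5-1.0 range.
    safe = 0.5 + abs(0.5 - auc)
    return 2 * safe - 1

print(round(auc2gini(0.8232), 4))  # -> 0.6464
```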

_capitalize(fields: list) → list

Applies automatic capitalization, aligned with the R counterpart.

Parameters:

fields (list) – A list of names

Returns:

fields – The input list, but each value properly capitalized

Return type:

list

_polarsCapitalize(df: polars.LazyFrame)
Parameters:

df (polars.LazyFrame)

fromPRPCDateTime(x: str, return_string: bool = False) → datetime.datetime | str

Convert from a Pega date-time string.

Parameters:
  • x (str) – String of Pega date-time

  • return_string (bool, default=False) – If True, returns the date as a string; if False, returns a datetime object

Returns:

The converted date, as a datetime object or string.

Examples

>>> fromPRPCDateTime("20180316T134127.847 GMT")
>>> fromPRPCDateTime("20180316T134127.847 GMT", True)
>>> fromPRPCDateTime("20180316T184127.846")
>>> fromPRPCDateTime("20180316T184127.846", True)

Return type:

Union[datetime.datetime, str]

toPRPCDateTime(dt: datetime.datetime) → str

Convert to a Pega date-time string

Parameters:

dt (datetime.datetime) – A datetime object

Returns:

A string representation in the format used by Pega

Examples

>>> toPRPCDateTime(datetime.datetime.now())

Return type:

str
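Both converters can be sketched in plain Python, assuming the layout shown in the examples (compact date-time with milliseconds and a GMT suffix); this is illustrative, not the library code:

```python
from datetime import datetime

def from_prpc(x: str) -> datetime:
    # Drop the optional " GMT" suffix, then parse the compact Pega layout.
    return datetime.strptime(x.removesuffix(" GMT"), "%Y%m%dT%H%M%S.%f")

def to_prpc(dt: datetime) -> str:
    # Pega keeps milliseconds (3 digits), not microseconds.
    return dt.strftime("%Y%m%dT%H%M%S.") + f"{dt.microsecond // 1000:03d} GMT"

ts = from_prpc("20180316T134127.847 GMT")
print(to_prpc(ts))  # -> 20180316T134127.847 GMT
```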

weighted_average_polars(vals: str | polars.Expr, weights: str | polars.Expr) → polars.Expr
Parameters:
  • vals (Union[str, polars.Expr])

  • weights (Union[str, polars.Expr])

Return type:

polars.Expr

weighted_performance_polars() → polars.Expr

Polars function to return a weighted performance

Return type:

polars.Expr
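The weighted average these helpers compute can be illustrated in plain Python (a sketch, not the Polars implementation; the real functions return Polars expressions):

```python
def weighted_average(vals, weights):
    # Plain-Python equivalent of a weighted mean: sum(v * w) / sum(w).
    return sum(v * w for v, w in zip(vals, weights)) / sum(weights)

# e.g. model performance weighted by response counts
print(round(weighted_average([0.55, 0.70], [100, 300]), 4))  # -> 0.6625
```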

zRatio(posCol: polars.Expr = pl.col('BinPositives'), negCol: polars.Expr = pl.col('BinNegatives')) → polars.Expr

Calculates the Z-Ratio for predictor bins.

The Z-ratio is a measure of how the propensity in a bin differs from the average, but it takes into account the size of the bin and is thus statistically more relevant. It represents the number of standard deviations from the average, so it centers around 0. The wider the spread, the better the predictor is.

To recreate the OOTB ZRatios from the datamart, use in a group_by. See examples.

Parameters:
  • posCol (pl.Expr) – The (Polars) column of the bin positives

  • negCol (pl.Expr) – The (Polars) column of the bin negatives

Return type:

polars.Expr

Examples

>>> df.group_by(['ModelID', 'PredictorName']).agg([zRatio()]).explode()
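For intuition, one common formulation of a Z-ratio is sketched below in plain Python; the exact formula the library uses is an assumption here, and the real function returns a Polars expression:

```python
import math

def z_ratio_sketch(pos, neg):
    # Assumed formulation: compare each bin's share of positives to its
    # share of negatives, scaled by a binomial standard error, so the
    # values center around 0.
    sp, sn = sum(pos), sum(neg)
    out = []
    for p, n in zip(pos, neg):
        fp, fn = p / sp, n / sn
        se = math.sqrt(fp * (1 - fp) / sp + fn * (1 - fn) / sn)
        out.append(0.0 if se == 0 else (fp - fn) / se)
    return out

# A bin with above-average propensity gets a positive Z-ratio.
print(z_ratio_sketch([10, 5], [5, 10]))
```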

lift(posCol: polars.Expr = pl.col('BinPositives'), negCol: polars.Expr = pl.col('BinNegatives')) → polars.Expr

Calculates the Lift for predictor bins.

The Lift is the ratio of the propensity in a particular bin over the average propensity. So a value of 1 is the average, larger than 1 means higher propensity, smaller means lower propensity.

Parameters:
  • posCol (pl.Expr) – The (Polars) column of the bin positives

  • negCol (pl.Expr) – The (Polars) column of the bin negatives

Return type:

polars.Expr

Examples

>>> df.group_by(['ModelID', 'PredictorName']).agg([lift()]).explode()
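The ratio described above can be sketched in plain Python (illustrative only; the library function returns a Polars expression):

```python
def lift_per_bin(pos, neg):
    # Bin propensity divided by the overall propensity across all bins;
    # 1.0 means average, above 1 means higher-than-average propensity.
    overall = sum(pos) / (sum(pos) + sum(neg))
    return [(p / (p + n)) / overall for p, n in zip(pos, neg)]

# Overall propensity here is 30/200 = 0.15.
print(lift_per_bin([20, 10], [80, 90]))
```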

LogOdds(Positives=pl.col('Positives'), Negatives=pl.col('ResponseCount') - pl.col('Positives'))
featureImportance(over=['PredictorName', 'ModelID'])
gains_table(df, value: str, index=None, by=None)

Calculates cumulative gains from any data frame.

The cumulative gains are the cumulative values expressed as a percentage vs the size of the population, also expressed as a percentage.

Parameters:
  • df (pl.DataFrame) – The (Polars) dataframe with the raw values

  • value (str) – The name of the field with the values (plotted on y-axis)

  • index (default = None) – Optional name of the field for the x-axis. If not passed in, all records are used and weighted equally.

  • by (default = None) – Optional grouping field(s); can also be None.

Returns:

A (Polars) dataframe with cum_x and cum_y columns and optionally the grouping column(s). Values for cum_x and cum_y are relative so expressed as values 0-1.

Return type:

pl.DataFrame

Examples

>>> gains_data = gains_table(df, 'ResponseCount', by=['Channel','Direction'])
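To make cum_x and cum_y concrete, here is a sketch of ungrouped cumulative gains in plain Python, assuming records are ranked by value in descending order (the library's exact ordering and grouping logic may differ):

```python
def gains(values):
    # Rank by value descending, then report (share of population,
    # share of cumulative total) per record, both scaled 0-1.
    vals = sorted(values, reverse=True)
    total, n = sum(vals), len(vals)
    cum, out = 0.0, []
    for i, v in enumerate(vals, start=1):
        cum += v
        out.append((i / n, cum / total))
    return out

# Top third of the population captures half of the total here.
print(gains([50, 30, 20]))
```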

legend_color_order(fig)

Orders legend colors alphabetically to provide Pega color consistency among different categories

sync_reports(checkOnly: bool = False, autoUpdate: bool = False)

Compares the report files in your local directory to the repo

If any of the files differ from the ones on GitHub, it will prompt you to update them.

Parameters:
  • checkOnly (bool, default = False) – If True, only checks, does not prompt to update

  • autoUpdate (bool, default = False) – If True, doesn’t prompt for updates and applies them automatically