pdstools.utils.cdh_utils

Helpers for working with Pega CDH-style data.

This package preserves the public surface of the previous cdh_utils module while splitting the implementation across several focused private submodules:

  • _dates — Pega date-time parsing and start/end-date resolution.

  • _namespacing — Pega field-name normalisation (_capitalize) and predictor-categorisation defaults.

  • _polars — Polars expression / frame helpers (queries, sampling, schema casting, list-overlap utilities, weighted averages).

  • _metrics — Performance metrics: AUC, lift, log-odds, gains tables and feature importance.

  • _io — File, temp-directory, logger setup and version-check helpers.

  • _misc — Small standalone helpers (list flattening, plot legend colors).

Submodule names are underscore-prefixed; only this __init__ is the supported import surface. Imports such as from pdstools.utils.cdh_utils import safe_int continue to resolve unchanged.

Submodules

Attributes

Functions

from_prpc_date_time(→ datetime.datetime | str)

Convert from a Pega date-time string.

parse_pega_date_time_formats(→ polars.Expr)

Parses Pega DateTime formats.

to_prpc_date_time(→ str)

Convert to a Pega date-time string

create_working_and_temp_dir(→ tuple[pathlib.Path, pathlib.Path])

Creates a working directory for saving files and a temp_dir

get_latest_pdstools_version()

process_files_to_bytes(→ tuple[bytes, str])

Processes a list of file paths, returning file content as bytes and a corresponding file name.

setup_logger()

Return the pdstools logger and a log buffer it streams into.

auc_from_bincounts(→ float)

Calculates AUC from counts of positives and negatives directly

auc_from_probs(→ float)

Calculates AUC from an array of truth values and predictions.

auc_to_gini(→ float)

Convert AUC performance metric to GINI

aucpr_from_bincounts(→ float)

Calculates PR AUC (precision-recall) from counts of positives and negatives directly.

aucpr_from_probs(→ float)

Calculates PR AUC (precision-recall) from an array of truth values and predictions.

bin_log_odds(→ list[float])

Pure Python reference implementation of per-bin log odds.

feature_importance(→ polars.Expr)

Calculate feature importance for Naive Bayes predictors.

gains_table(df, value[, index, by])

Calculates cumulative gains from any data frame.

lift(pos_col, neg_col)

Calculates the Lift for predictor bins.

log_odds_polars(positives, negatives)

Calculate log odds per bin with correct Laplace smoothing.

safe_range_auc(→ float)

Internal helper to keep auc a safe number between 0.5 and 1.0 always.

z_ratio(pos_col, neg_col)

Calculates the Z-Ratio for predictor bins.

legend_color_order(fig)

Orders legend colors alphabetically to provide Pega color consistency among categories.

safe_flatten_list(→ list | None)

Flatten one level of alist, drop None entries, and prepend extras.

default_predictor_categorization(→ polars.Expr)

Function to determine the 'category' of a predictor.

is_valid_polars_duration(→ bool)

Validate Polars duration syntax.

lazy_sample(→ pdstools.utils.cdh_utils._common.F)

overlap_lists_polars(→ polars.Series)

Calculate the average overlap ratio of each list element with all other list elements into a single Series.

overlap_matrix(→ polars.DataFrame)

Calculate the overlap of a list element with all other list elements returning a full matrix.

weighted_average_polars(→ polars.Expr)

weighted_performance_polars(→ polars.Expr)

Polars function to return a weighted performance

Package Contents

type QUERY = pl.Expr | Iterable[pl.Expr] | dict[str, list]
F
logger
from_prpc_date_time(x: str, return_string: bool = False, use_timezones: bool = True) datetime.datetime | str

Convert from a Pega date-time string.

Parameters:
  • x (str) – String of Pega date-time

  • return_string (bool, default=False) – If True it will return the date in string format. If False it will return in datetime type

  • use_timezones (bool)

Returns:

The converted date in datetime format or string.

Return type:

datetime.datetime | str

Examples

>>> from_prpc_date_time("20180316T134127.847 GMT")
>>> from_prpc_date_time("20180316T134127.847 GMT", True)
>>> from_prpc_date_time("20180316T184127.846")
>>> from_prpc_date_time("20180316T184127.846", True)
parse_pega_date_time_formats(timestamp_col='SnapshotTime', timestamp_fmt: str | None = None, timestamp_dtype: polars._typing.PolarsTemporalType = pl.Datetime) polars.Expr

Parses Pega DateTime formats.

Supports commonly used formats:

  • “%Y-%m-%d %H:%M:%S”

  • “%Y%m%dT%H%M%S.%f %Z”

  • “%d-%b-%y”

  • “%d%b%Y:%H:%M:%S”

  • “%Y%m%d”

Removes timezones, and rounds to seconds, with a ‘ns’ time unit.

In the implementation, the final expression falls back to timestamp_fmt or %Y. This is a deliberate workaround: passing None makes Polars try to infer the format automatically, and inference raises an error when it cannot find an appropriate format.

Parameters:
  • timestamp_col (str, default = 'SnapshotTime') – The column to parse

  • timestamp_fmt (str, default = None) – An optional format to use rather than the default formats

  • timestamp_dtype (PolarsTemporalType, default = pl.Datetime) – The data type to convert into. Can be either Date, Datetime, or Time.

Return type:

polars.Expr

to_prpc_date_time(dt: datetime.datetime) str

Convert to a Pega date-time string

Parameters:
Returns:

A string representation in the format used by Pega

Return type:

str

Examples

>>> to_prpc_date_time(datetime.datetime.now())
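The conversion can be sketched in plain Python, assuming the format is yyyymmddThhmmss.mmm plus a timezone label, as in the from_prpc_date_time examples above (hypothetical helper name):

```python
from datetime import datetime

def to_prpc_sketch(dt: datetime) -> str:
    # %f produces microseconds (6 digits); drop the last three for milliseconds
    return dt.strftime("%Y%m%dT%H%M%S.%f")[:-3] + " GMT"
```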
create_working_and_temp_dir(name: str | None = None, working_dir: os.PathLike | None = None) tuple[pathlib.Path, pathlib.Path]

Creates a working directory for saving files and a temp_dir

Parameters:
Return type:

tuple[pathlib.Path, pathlib.Path]

get_latest_pdstools_version()
process_files_to_bytes(file_paths: list[str | pathlib.Path], base_file_name: str | pathlib.Path) tuple[bytes, str]

Processes a list of file paths, returning file content as bytes and a corresponding file name. Useful for zipping multiple model reports; the bytes object is used for downloading files in the Streamlit app.

This function handles three scenarios:

  1. Single file: Returns the file’s content as bytes and the provided base file name.

  2. Multiple files: Creates a zip file containing all files, returns the zip file’s content as bytes and a generated zip file name.

  3. No files: Returns empty bytes and an empty string.

Parameters:
  • file_paths (list[str | Path]) – A list of file paths to process. Can be empty, contain a single path, or multiple paths.

  • base_file_name (str | Path) – The base name to use for the output file. For a single file, this name is returned as is. For multiple files, this is used as part of the generated zip file name.

Returns:

A tuple containing: - bytes: The content of the single file or the created zip file, or empty bytes if no files. - str: The file name (either base_file_name or a generated zip file name), or an empty string if no files.

Return type:

tuple[bytes, str]

setup_logger()

Return the pdstools logger and a log buffer it streams into.

Targets the named pdstools logger rather than the root logger so we don’t clobber the host application’s logging config (Streamlit, Quarto, Jupyter, etc.). Idempotent: repeated calls return the same buffer instead of stacking new handlers, so re-running a notebook cell or bouncing a Streamlit page doesn’t produce duplicated log lines.
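The idempotent setup can be sketched with a module-level guard (names hypothetical; the actual implementation may differ):

```python
import io
import logging

_buffer = None  # module-level guard: created once, reused on later calls

def setup_logger_sketch():
    # Attach a StreamHandler to the *named* logger exactly once, leaving
    # the root logger (and any host-app handlers) untouched.
    global _buffer
    logger = logging.getLogger("pdstools")
    if _buffer is None:
        _buffer = io.StringIO()
        handler = logging.StreamHandler(_buffer)
        handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger, _buffer
```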

auc_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) float

Calculates AUC from counts of positives and negatives directly. This is an efficient calculation of the area under the ROC curve from arrays of positive and negative counts. It always returns a value between 0.5 and 1.0, and returns 0.5 when there is just one ground-truth label.

Parameters:
  • pos (list[int]) – Vector with counts of the positive responses

  • neg (list[int]) – Vector with counts of the negative responses

  • probs (list[float]) – Optional list with probabilities which will be used to set the order of the bins. If missing defaults to pos/(pos+neg).

Returns:

The AUC as a value between 0.5 and 1.

Return type:

float

Examples

>>> auc_from_bincounts([3,1,0], [2,0,1])
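A minimal NumPy sketch of the bin-count AUC: order the bins by propensity, build the implied ROC points, take the trapezoidal area, and apply the same fold-to-[0.5, 1.0] safeguard the docstring describes:

```python
import numpy as np

def auc_from_bincounts_sketch(pos, neg, probs=None):
    pos, neg = np.asarray(pos, dtype=float), np.asarray(neg, dtype=float)
    if probs is None:
        probs = pos / (pos + neg)      # default bin ordering: observed propensity
    order = np.argsort(probs)[::-1]    # best bins first
    tpr = np.concatenate([[0.0], np.cumsum(pos[order]) / pos.sum()])
    fpr = np.concatenate([[0.0], np.cumsum(neg[order]) / neg.sum()])
    # Trapezoidal area under the ROC curve
    auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
    return 0.5 + abs(0.5 - auc)        # keep the result in [0.5, 1.0]
```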
auc_from_probs(groundtruth: list[int], probs: list[float]) float

Calculates AUC from an array of truth values and predictions. Computes the area under the ROC curve, always returning a value between 0.5 and 1.0; returns 0.5 when there is just one ground-truth label.

Parameters:
  • groundtruth (list[int]) – The ‘true’ values. Positive values must be represented as True or 1; negative values as False or 0.

  • probs (list[float]) – The predictions, as a numeric vector of the same length as groundtruth

Returns:

The AUC as a value between 0.5 and 1.

Return type:

float

Examples

>>> auc_from_probs([1, 1, 0], [0.6, 0.2, 0.2])
auc_to_gini(auc: float) float

Convert AUC performance metric to GINI

Parameters:

auc (float) – The AUC (number between 0.5 and 1)

Returns:

GINI metric, a number between 0 and 1

Return type:

float

Examples

>>> auc_to_gini(0.8232)
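The conversion is a linear rescaling of AUC from [0.5, 1] onto [0, 1]:

```python
def auc_to_gini_sketch(auc: float) -> float:
    # GINI = 2 * AUC - 1, so 0.5 (random) maps to 0 and 1.0 (perfect) to 1
    return 2 * auc - 1
```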
aucpr_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) float

Calculates PR AUC (precision-recall) from counts of positives and negatives directly. This is an efficient calculation of the area under the PR curve directly from an array of positives and negatives. Returns 0.0 when there is just one groundtruth label.

Parameters:
  • pos (list[int]) – Vector with counts of the positive responses

  • neg (list[int]) – Vector with counts of the negative responses

  • probs (list[float]) – Optional list with probabilities which will be used to set the order of the bins. If missing defaults to pos/(pos+neg).

Returns:

The PR AUC as a value between 0.0 and 1.

Return type:

float

Examples

>>> aucpr_from_bincounts([3,1,0], [2,0,1])
aucpr_from_probs(groundtruth: list[int], probs: list[float]) float

Calculates PR AUC (precision-recall) from an array of truth values and predictions. Computes the area under the precision-recall curve; returns 0.0 when there is just one ground-truth label.

Parameters:
  • groundtruth (list[int]) – The ‘true’ values. Positive values must be represented as True or 1; negative values as False or 0.

  • probs (list[float]) – The predictions, as a numeric vector of the same length as groundtruth

Returns:

The PR AUC as a value between 0.0 and 1.

Return type:

float

Examples

>>> aucpr_from_probs([1, 1, 0], [0.6, 0.2, 0.2])
bin_log_odds(bin_pos: list[float], bin_neg: list[float]) list[float]

Pure Python reference implementation of per-bin log odds (see log_odds_polars).

Parameters:
Return type:

list[float]
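The pure-Python computation can be sketched directly from the formula given under log_odds_polars below, using the 1/nBins Laplace smoothing term (hypothetical helper name):

```python
import math

def bin_log_odds_sketch(bin_pos, bin_neg):
    # Per-bin log odds with 1/nBins Laplace smoothing, where nBins is the
    # number of bins of this predictor.
    n = len(bin_pos)
    tot_pos, tot_neg = sum(bin_pos), sum(bin_neg)
    return [
        math.log(p + 1 / n) - math.log(tot_pos + 1)
        - (math.log(q + 1 / n) - math.log(tot_neg + 1))
        for p, q in zip(bin_pos, bin_neg)
    ]
```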

feature_importance(over: list[str] | None = None, scaled: bool = True) polars.Expr

Calculate feature importance for Naive Bayes predictors.

Feature importance represents the weighted average of absolute log odds values across all bins, weighted by bin response counts. This measures how strongly the predictor differentiates between positive and negative outcomes.

Algorithm (matches platform GroupedPredictor.calculatePredictorImportance()):

  1. Calculate log odds per bin with Laplace smoothing (1/nBins)

  2. Take the absolute value of each bin’s log odds

  3. Calculate the weighted average: Sum(|logOdds(bin)| × binResponses) / totalResponses

  4. Optionally scale to the 0-100 range (scaled=True, default)

This matches the Pega platform implementation in: adaptive-learning-core-lib/…/GroupedPredictor.java lines 371-382

Formula:

Feature Importance = Σ |logOdds(bin)| × (binResponses / totalResponses)

Parameters:
  • over (list[str], optional) – Grouping columns. Defaults to ["PredictorName", "ModelID"].

  • scaled (bool, default True) – If True, scale importance to 0-100 where max predictor = 100

Returns:

Feature importance expression

Return type:

pl.Expr

Examples

>>> df.with_columns(
...     feature_importance().over("PredictorName", "ModelID")
... )

Notes

This implementation matches the platform calculation exactly. Issue #263 incorrectly suggested “diff from mean” based on R implementation, but the platform actually uses weighted average of absolute log odds.

See also

log_odds_polars

Calculate per-bin log odds

References

  • Issue #263: Calculation of Feature Importance incorrect

  • Issue #404: Add feature importance explanation to ADM Explained

  • Platform: GroupedPredictor.java calculatePredictorImportance()

  • ADM Explained: Feature Importance section

gains_table(df, value: str, index=None, by=None)

Calculates cumulative gains from any data frame.

The cumulative gains are the cumulative values expressed as a percentage vs the size of the population, also expressed as a percentage.

Parameters:
  • df (pl.DataFrame) – The (Polars) dataframe with the raw values

  • value (str) – The name of the field with the values (plotted on y-axis)

  • index (str, default = None) – Optional name of the field for the x-axis. If not passed in, all records are used and weighted equally.

  • by (str | list[str], default = None) – Optional grouping field(s).

Returns:

A (Polars) dataframe with cum_x and cum_y columns and optionally the grouping column(s). Values for cum_x and cum_y are relative so expressed as values 0-1.

Return type:

pl.DataFrame

Examples

>>> gains_data = gains_table(df, 'ResponseCount', by=['Channel','Direction'])
lift(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) polars.Expr

Calculates the Lift for predictor bins.

The Lift is the ratio of the propensity in a particular bin over the average propensity. So a value of 1 is the average, larger than 1 means higher propensity, smaller means lower propensity.

Parameters:
  • pos_col (str | polars.Expr, default = pl.col('BinPositives')) – The (Polars) column of the bin positives

  • neg_col (str | polars.Expr, default = pl.col('BinNegatives')) – The (Polars) column of the bin negatives

Return type:

polars.Expr

Examples

>>> df.group_by(['ModelID', 'PredictorName']).agg([lift()]).explode()
log_odds_polars(positives: polars.Expr | str = pl.col('Positives'), negatives: polars.Expr | str = pl.col('ResponseCount') - pl.col('Positives')) polars.Expr

Calculate log odds per bin with correct Laplace smoothing.

Formula (per bin i in predictor p):

log(pos_i + 1/nBins) - log(sum(pos) + 1) - [log(neg_i + 1/nBins) - log(sum(neg) + 1)]

Laplace smoothing uses 1/nBins where nBins is the number of bins for that specific predictor. This matches the platform implementation in GroupedPredictor.java.

Must be used with .over() to calculate nBins per predictor group:

.with_columns(log_odds_polars().over("PredictorName", "ModelID"))

Parameters:
  • positives (pl.Expr or str) – Column with positive response counts per bin

  • negatives (pl.Expr or str) – Column with negative response counts per bin

Returns:

Log odds expression (use with .over() for correct grouping)

Return type:

pl.Expr

See also

feature_importance

Calculate predictor importance from log odds

bin_log_odds

Pure Python version (reference implementation)

Examples

>>> # For propensity calculation in classifier
>>> df.with_columns(
...     log_odds_polars(
...         pl.col("BinPositives"),
...         pl.col("BinNegatives")
...     ).over("PredictorName", "ModelID")
... )
safe_range_auc(auc: float) float

Internal helper to keep auc a safe number between 0.5 and 1.0 always.

Parameters:

auc (float) – The AUC (Area Under the Curve) score

Returns:

‘Safe’ AUC score, between 0.5 and 1.0

Return type:

float
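A sketch of the safeguard: values below 0.5 are mirrored above it, since a consistently inverted predictor carries the same information (the NaN handling shown is an assumption):

```python
import math

def safe_range_auc_sketch(auc: float) -> float:
    # Fold sub-0.5 values back above 0.5; treat NaN as "no signal" (assumption)
    return 0.5 if math.isnan(auc) else 0.5 + abs(0.5 - auc)
```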

z_ratio(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) polars.Expr

Calculates the Z-Ratio for predictor bins.

The Z-ratio is a measure of how the propensity in a bin differs from the average, but takes into account the size of the bin and thus is statistically more relevant. It represents the number of standard deviations from the average, so centers around 0. The wider the spread, the better the predictor is.

To recreate the OOTB ZRatios from the datamart, use in a group_by. See examples.

Parameters:
  • pos_col (str | polars.Expr, default = pl.col('BinPositives')) – The (Polars) column of the bin positives

  • neg_col (str | polars.Expr, default = pl.col('BinNegatives')) – The (Polars) column of the bin negatives

Return type:

polars.Expr

Examples

>>> df.group_by(['ModelID', 'PredictorName']).agg([z_ratio()]).explode()
legend_color_order(fig)

Orders legend colors alphabetically to provide Pega color consistency among different categories.

safe_flatten_list(alist: list | None, extras: list | None = None) list | None

Flatten one level of alist, drop None entries, and prepend extras.

The result is order-preserving and de-duplicated. Strings are treated as atoms (not iterated). Both alist and extras are read-only — the caller’s lists are never mutated. Returns None when the result would be empty so callers can use the truthiness as a “no grouping” signal.

Parameters:
  • alist (list | None)

  • extras (list | None)

Return type:

list | None
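The described semantics can be sketched in plain Python (hypothetical helper name; de-duplication via dict.fromkeys keeps first-seen order):

```python
def safe_flatten_list_sketch(alist, extras=None):
    # Prepend extras, flatten one level (strings stay atomic), drop Nones.
    # Neither input list is mutated.
    items = []
    for item in (extras or []) + (alist or []):
        if isinstance(item, (list, tuple)):
            items.extend(i for i in item if i is not None)
        elif item is not None:
            items.append(item)
    deduped = list(dict.fromkeys(items))  # order-preserving de-dupe
    return deduped or None  # None signals "no grouping" to callers
```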

default_predictor_categorization(x: str | polars.Expr = pl.col('PredictorName')) polars.Expr

Function to determine the ‘category’ of a predictor.

It is possible to supply a custom function: it can accept an optional column as input and should return a Polars expression. The most straightforward way to implement this is with pl.when().then().otherwise(), which you can chain.

By default, this function returns “Primary” whenever there is no ‘.’ anywhere in the name string, and otherwise returns the substring before the first period.

Parameters:

x (str | pl.Expr, default = pl.col('PredictorName')) – The column to parse

Return type:

polars.Expr

POLARS_DURATION_PATTERN
is_valid_polars_duration(value: str, max_length: int = 30) bool

Validate Polars duration syntax.

Checks if a string is a valid Polars duration (e.g., “1d”, “1w”, “1mo”, “1h30m”). Used to validate user input before passing to Polars methods like dt.truncate() or group_by_dynamic().

Parameters:
  • value (str) – The duration string to validate.

  • max_length (int, default 30) – Maximum allowed string length (prevents excessive input).

Returns:

True if the string is a valid Polars duration, False otherwise.

Return type:

bool

Examples

>>> is_valid_polars_duration("1d")
True
>>> is_valid_polars_duration("1w")
True
>>> is_valid_polars_duration("1h30m")
True
>>> is_valid_polars_duration("invalid")
False
>>> is_valid_polars_duration("")
False
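The check can be sketched as a length guard plus an anchored regular expression; the pattern below is an assumption built from the duration units listed in the Polars documentation (ns, us, ms, s, m, h, d, w, mo, q, y, i), not the actual POLARS_DURATION_PATTERN:

```python
import re

# Assumed pattern: one or more <integer><unit> tokens; "mo" must be tried
# before the single-letter "m" so "1mo" is not split as "1m" + "o".
_DURATION_SKETCH = re.compile(r"^(\d+(ns|us|ms|mo|[smhdwqyi]))+$")

def is_valid_polars_duration_sketch(value: str, max_length: int = 30) -> bool:
    return 0 < len(value) <= max_length and _DURATION_SKETCH.match(value) is not None
```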
lazy_sample(df: pdstools.utils.cdh_utils._common.F, n_rows: int, with_replacement: bool = True) pdstools.utils.cdh_utils._common.F

Sample n_rows rows from a frame, optionally with replacement.

Parameters:
Return type:

pdstools.utils.cdh_utils._common.F

overlap_lists_polars(col: polars.Series) polars.Series

Calculate the average overlap ratio of each list element with all other list elements into a single Series.

For each list in the input Series, this function calculates the average overlap (intersection) with all other lists, normalized by the size of the original list. The overlap ratio represents how much each list has in common with all other lists on average.

Parameters:

col (pl.Series) – A Polars Series where each element is a list. The function will calculate the overlap between each list and all other lists in the Series.

Returns:

A Polars Series of float values representing the average overlap ratio for each list. Each value is calculated as: (sum of intersection sizes with all other lists) / (number of other lists) / (size of original list)

Return type:

pl.Series

Examples

>>> import polars as pl
>>> data = pl.Series([
...     [1, 2, 3],
...     [2, 3, 4, 6],
...     [3, 5, 7, 8]
... ])
>>> overlap_lists_polars(data)
shape: (3,)
Series: '' [f64]
[
    0.5
    0.375
    0.25
]
>>> df = pl.DataFrame({"Channel" : ["Mobile", "Web", "Email"], "Actions" : pl.Series([
...     [1, 2, 3],
...     [2, 3, 4, 6],
...     [3, 5, 7, 8]
... ])})
>>> df.with_columns(pl.col("Actions").map_batches(overlap_lists_polars))
shape: (3, 2)
┌─────────┬─────────┐
│ Channel │ Actions │
│ ---     │ ---     │
│ str     │ f64     │
╞═════════╪═════════╡
│ Mobile  │ 0.5     │
│ Web     │ 0.375   │
│ Email   │ 0.25    │
└─────────┴─────────┘
overlap_matrix(df: polars.DataFrame, list_col: str, by: str, show_fraction: bool = True) polars.DataFrame

Calculate the overlap of a list element with all other list elements returning a full matrix.

For each list in the specified column, this function calculates the overlap ratio (intersection size divided by the original list size) with every other list in the column, including itself. The result is a matrix where each row represents the overlap ratios for one list with all others.

Parameters:
  • df (pl.DataFrame) – The Polars DataFrame containing the list column and grouping column.

  • list_col (str) – The name of the column containing the lists. Each element in this column should be a list.

  • by (str) – The name of the column to use for grouping and labeling the rows in the result matrix.

  • show_fraction (bool)

Returns:

A DataFrame where: - Each row represents the overlap ratios for one list with all others - Each column (except the last) represents the overlap ratio with a specific list - Column names are formatted as “Overlap_{list_col_name}_{by}” - The last column contains the original values from the ‘by’ column

Return type:

pl.DataFrame

Examples

>>> import polars as pl
>>> df = pl.DataFrame({
...     "Channel": ["Mobile", "Web", "Email"],
...     "Actions": [
...         [1, 2, 3],
...         [2, 3, 4, 6],
...         [3, 5, 7, 8]
...     ]
... })
>>> overlap_matrix(df, "Actions", "Channel")
shape: (3, 4)
┌───────────────────┬───────────────┬───────────────┬─────────┐
│ Overlap_Actions_M… │ Overlap_Actio… │ Overlap_Actio… │ Channel │
│ ---               │ ---           │ ---           │ ---     │
│ f64               │ f64           │ f64           │ str     │
╞═══════════════════╪═══════════════╪═══════════════╪═════════╡
│ 1.0               │ 0.6666667     │ 0.3333333     │ Mobile  │
│ 0.5               │ 1.0           │ 0.25          │ Web     │
│ 0.25              │ 0.25          │ 1.0           │ Email   │
└───────────────────┴───────────────┴───────────────┴─────────┘
weighted_average_polars(vals: str | polars.Expr, weights: str | polars.Expr) polars.Expr

Polars expression for the average of vals, weighted by weights.

Parameters:
  • vals (str | polars.Expr)

  • weights (str | polars.Expr)

Return type:

polars.Expr

weighted_performance_polars(vals: str | polars.Expr = 'Performance', weights: str | polars.Expr = 'ResponseCount') polars.Expr

Polars function to return a weighted performance

Parameters:
  • vals (str | polars.Expr)

  • weights (str | polars.Expr)

Return type:

polars.Expr