pdstools

Python pdstools

Subpackages

Package Contents

Classes

ADMDatamart

Main class for importing, preprocessing and structuring Pega ADM Datamart.

ADMTrees

MultiTrees

BinAggregator

A class to generate rolled up insights from ADM predictor binning.

CDHLimits

A singleton container for best practice limits for CDH.

Config

Configuration file for the data anonymizer.

DataAnonymization

Anonymize a historical dataset.

PegaDefaultTables

ValueFinder

Class to analyze Value Finder datasets.

Sample

Functions

get_token(credentialFile[, verify])

Get API credentials to a Pega Platform instance.

readDSExport(→ polars.LazyFrame)

Read a Pega dataset export file.

setupAzureOpenAI(api_base, api_version, ...)

Convenience function to automagically setup Azure AD-based authentication

defaultPredictorCategorization(→ polars.Expr)

Function to determine the 'category' of a predictor.

show_versions(→ None)

Print out version of pdstools and dependencies to stdout.

CDHSample([plotting_engine, query])

SampleTrees()

SampleValueFinder([verbose])

Attributes

__version__ = '3.4.3'
class ADMDatamart(path: str | pathlib.Path = Path('.'), import_strategy: Literal[eager, lazy] = 'eager', *, model_filename: str | None = 'modelData', predictor_filename: str | None = 'predictorData', model_df: pdstools.utils.types.any_frame | None = None, predictor_df: pdstools.utils.types.any_frame | None = None, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, subset: bool = True, drop_cols: list | None = None, include_cols: list | None = None, context_keys: list = ['Channel', 'Direction', 'Issue', 'Group'], extract_keys: bool = False, predictorCategorization: polars.Expr = cdh_utils.defaultPredictorCategorization, plotting_engine: str | Any = 'plotly', verbose: bool = False, **reading_opts)

Bases: pdstools.plots.plot_base.Plots, pdstools.adm.Tables.Tables

Main class for importing, preprocessing and structuring Pega ADM Datamart. Gets all available data, properly names and merges into one main dataframe.

It’s also possible to import directly from S3. Please refer to pdstools.pega_io.S3.S3Data.get_ADMDatamart().

Parameters:
  • path (str, default = ".") – The path of the data files

  • import_strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory, or to scan the file. When data fits into memory, 'eager' is typically more efficient. However, when data does not fit, the lazy methods typically still allow you to use the data.

  • model_filename (Optional[str])

  • predictor_filename (Optional[str])

  • model_df (Optional[pdstools.utils.types.any_frame])

  • predictor_df (Optional[pdstools.utils.types.any_frame])

  • query (Optional[Union[polars.Expr, List[polars.Expr], str, Dict[str, list]]])

  • subset (bool)

  • drop_cols (Optional[list])

  • include_cols (Optional[list])

  • context_keys (list)

  • extract_keys (bool)

  • predictorCategorization (polars.Expr)

  • plotting_engine (Union[str, Any])

  • verbose (bool)

Keyword Arguments:
  • model_filename (Optional[str]) – The name, or extended filepath, towards the model file

  • predictor_filename (Optional[str]) – The name, or extended filepath, towards the predictors file

  • model_df (Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame]) – Optional override to supply a dataframe instead of a file

  • predictor_df (Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame]) – Optional override to supply a dataframe instead of a file

  • query (Union[pl.Expr, str, Dict[str, list]], default = None) – Please refer to _apply_query()

  • plotting_engine (str, default = "plotly") – Please refer to get_engine()

  • subset (bool, default = True) – Whether to only keep a subset of columns for efficiency purposes. Refer to _available_columns() for the default list of columns.

  • drop_cols (Optional[list]) – Columns to exclude from reading

  • include_cols (Optional[list]) – Additional columns to include when reading

  • context_keys (list, default = ["Channel", "Direction", "Issue", "Group"]) – Which columns to use as context keys

  • extract_keys (bool, default = False) – Extra keys, particularly pyTreatment, can be hidden within the pyName column. Set extract_keys to True to expand that cell and also show these values.

  • verbose (bool, default = False) – Whether to print out information during importing

  • **reading_opts – Additional parameters used while reading. Refer to pdstools.pega_io.File.import_file() for more info.

modelData

If available, holds the preprocessed data about the models

Type:

pl.LazyFrame

predictorData

If available, holds the preprocessed data about the predictor binning

Type:

pl.LazyFrame

combinedData

If both modelData and predictorData are available, holds the merged data about the models and predictors

Type:

pl.LazyFrame

import_strategy

See the import_strategy parameter

query

See the query parameter

context_keys

See the context_keys parameter

verbose

See the verbose parameter

Examples

>>> Data =  ADMDatamart("/CDHSample")
>>> Data =  ADMDatamart("Data/Adaptive Models & Predictors Export",
            model_filename = "Data-Decision-ADM-ModelSnapshot_AdaptiveModelSnapshotRepo20201110T085543_GMT/data.json",
            predictor_filename = "Data-Decision-ADM-PredictorBinningSnapshot_PredictorBinningSnapshotRepo20201110T084825_GMT/data.json")
>>> Data =  ADMDatamart("Data/files",
            model_filename = "ModelData.csv",
            predictor_filename = "PredictorData.csv")
property is_available: bool
Return type:

bool

standardChannelGroups = ['Web', 'Mobile', 'E-mail', 'Push', 'SMS', 'Retail', 'Call Center', 'IVR']
standardDirections = ['Inbound', 'Outbound']
NBAD_model_configurations
static get_engine(plotting_engine)

Which engine to use for creating the plots.

By supplying a custom class here, you can re-use the pdstools functions but create visualisations to your own specifications, in any library.

import_data(path: str | pathlib.Path | None = Path('.'), *, model_filename: str | None = 'modelData', predictor_filename: str | None = 'predictorData', model_df: pdstools.utils.types.any_frame | None = None, predictor_df: pdstools.utils.types.any_frame | None = None, subset: bool = True, drop_cols: list | None = None, include_cols: list | None = None, extract_keys: bool = False, verbose: bool = False, **reading_opts) Tuple[polars.LazyFrame | None, polars.LazyFrame | None]

Method to import & format the relevant data.

The method first imports the model data, and then the predictor data. If model_df or predictor_df is supplied, it will use those instead. If any filters are included in the query argument of the ADMDatamart, those are applied to the model data, and the predictor data is filtered so that it only contains the model IDs remaining after filtering. After reading, some additional values (such as success rate) are automatically computed. Lastly, if columns are missing from both datasets, this is printed to the user if verbose is True.

Parameters:
  • path (Path) – The path of the data files. Default = current path ('.')

  • subset (bool, default = True) – Whether to only select the renamed columns, set to False to keep all columns

  • model_df (pd.DataFrame) – Optional override to supply a dataframe instead of a file

  • predictor_df (pd.DataFrame) – Optional override to supply a dataframe instead of a file

  • drop_cols (Optional[list]) – Columns to exclude from reading

  • include_cols (Optional[list]) – Additional columns to include when reading

  • extract_keys (bool, default = False) – Extra keys, particularly pyTreatment, can be hidden within the pyName column. Set extract_keys to True to expand that cell and also show these values.

  • verbose (bool, default = False) – Whether to print out information during importing

  • model_filename (Optional[str])

  • predictor_filename (Optional[str])

Returns:

The model data and predictor binning data as LazyFrames

Return type:

(polars.LazyFrame, polars.LazyFrame)
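
A minimal sketch of calling import_data on an existing ADMDatamart instance; the folder and file names below are hypothetical:

>>> dm = ADMDatamart("Data/files",
...                  model_filename="ModelData.csv",
...                  predictor_filename="PredictorData.csv")
>>> model_data, predictor_data = dm.import_data(
...     "Data/files",
...     model_filename="ModelData.csv",
...     predictor_filename="PredictorData.csv",
...     verbose=True)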

_import_utils(name: str | pdstools.utils.types.any_frame, path: str | None = None, *, subset: bool = True, extract_keys: bool = False, drop_cols: list | None = None, include_cols: list | None = None, **reading_opts) Tuple[polars.LazyFrame, dict, dict]

Handler function to interface to the cdh_utils methods

Parameters:
  • name (Union[str, pl.DataFrame]) – One of {modelData, predictorData} or a dataframe

  • path (str, default = None) – The path of the data file

  • subset (bool)

  • extract_keys (bool)

  • drop_cols (Optional[list])

  • include_cols (Optional[list])

Keyword Arguments:
  • subset (bool, default = True) – Whether to only select the renamed columns, set to False to keep all columns

  • drop_cols (list) – Supply columns to drop from the dataframe

  • include_cols (list) – Supply columns to include with the dataframe

  • extract_keys (bool) – Treatments are typically hidden within the pyName column, extract_keys can expand that cell to also show these values.

  • **reading_opts – Additional keyword arguments. See pdstools.pega_io.File.readDSExport().

Return type:

Tuple[polars.LazyFrame, dict, dict]

Returns:

  • The requested dataframe,

  • The renamed columns

  • The columns missing in both dataframes)

Return type:

(pl.LazyFrame, dict, dict)

Parameters:
  • name (Union[str, pdstools.utils.types.any_frame])

  • path (Optional[str])

  • subset (bool)

  • extract_keys (bool)

  • drop_cols (Optional[list])

  • include_cols (Optional[list])

_available_columns(df: polars.LazyFrame, include_cols: list | None = None, drop_cols: list | None = None) Tuple[set, set]

Based on the default names for variables, renames the available data to the proper formatting

Parameters:
  • df (pl.LazyFrame) – Input dataframe

  • include_cols (list) – Supply columns to include with the dataframe

  • drop_cols (list) – Supply columns to not import at all

Returns:

  • The original dataframe, but renamed for the found columns

  • The original and updated names for all renamed columns

  • The variables that were not found in the table

Return type:

Tuple[set, set]

_set_types(df: pdstools.utils.types.any_frame, table: str = 'infer', *, timestamp_fmt: str = None, strict_conversion: bool = True) pdstools.utils.types.any_frame

A method to change columns to their proper type

Parameters:
  • df (Union[pl.DataFrame, pl.LazyFrame]) – The input dataframe

  • table (str) – The table to set types for. Default is infer, in which case it infers the table type from the columns in it.

  • timestamp_fmt (str)

  • strict_conversion (bool)

Keyword Arguments:
  • timestamp_fmt (str) – The format of Date type columns

  • strict_conversion (bool) – Raises an error if timestamp conversion to the given/default date format (timestamp_fmt) fails. See https://strftime.org/ for timestamp formats.

Returns:

The input dataframe, but the proper typing applied

Return type:

Union[pl.DataFrame, pl.LazyFrame]

last(table='modelData', strategy: Literal[eager, lazy] = 'eager') pdstools.utils.types.any_frame

Convenience function to get the last values for a table

Parameters:
  • table (str, default = modelData) – Which table to get the last values for. One of {modelData, predictorData, combinedData}

  • strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory, or to scan the file. When data fits into memory, 'eager' is typically more efficient. However, when data does not fit, the lazy methods typically still allow you to use the data.

Returns:

The last snapshot for each model

Return type:

Union[pl.DataFrame, pl.LazyFrame]
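
For instance, assuming dm is an ADMDatamart instance:

>>> last_models = dm.last(table="modelData", strategy="eager")
>>> last_combined = dm.last(table="combinedData", strategy="lazy").collect()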

static _last(df: pdstools.utils.types.any_frame) pdstools.utils.types.any_frame
Parameters:

df (pdstools.utils.types.any_frame)

Return type:

pdstools.utils.types.any_frame

static _last_timestamp(col: Literal[ResponseCount, Positives]) polars.Expr

Add a column to indicate the last timestamp a column has changed.

Parameters:

col (Literal['ResponseCount', 'Positives']) – The column to calculate the diff for

Return type:

polars.Expr

_get_combined_data(last=True, strategy: Literal[eager, lazy] = 'eager') pdstools.utils.types.any_frame

Combines the model data and predictor data into one dataframe.

Parameters:
  • last (bool, default=True) – Whether to only use the last snapshot for each table

  • strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory, or to scan the file. When data fits into memory, 'eager' is typically more efficient. However, when data does not fit, the lazy methods typically still allow you to use the data.

Returns:

The combined dataframe

Return type:

Union[pl.DataFrame, pl.LazyFrame]

processTables(query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None) ADMDatamart

Processes modelData, predictorData and combinedData tables.

Can take in a query, which it will apply to modelData. If a query is given, it joins predictorData to only retain the ModelIDs the modelData was filtered on. If both modelData and predictorData are present, it joins them together into combinedData.

If import_strategy is eager, which is the default, this method also collects the tables and then sets them back to lazy.

Parameters:

query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]], default = None) – An optional query to apply to the modelData table. See: _apply_query()

Return type:

ADMDatamart

save_data(path: str = '.') Tuple[os.PathLike, os.PathLike]

Cache modelData and predictorData to files.

Parameters:

path (str) – Where to place the files

Returns:

The paths to the model and predictor data files

Return type:

(os.PathLike, os.PathLike)

_apply_query(df: pdstools.utils.types.any_frame, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None) polars.LazyFrame

Given an input Polars dataframe, it filters the dataframe based on input query

Parameters:
  • df (Union[pl.DataFrame, pl.LazyFrame]) – The input dataframe

  • query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – If a Polars Expression, passes the expression into Polars’ filter function. If a list of Polars Expressions, applies each of the expressions as filters. If a string, uses the Pandas query function (works only in eager mode, not recommended). Else, a dict of lists where the key is column name in the dataframe and the corresponding value is a list of values to keep in the dataframe

Returns:

The filtered Polars LazyFrame

Return type:

pl.LazyFrame
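
A sketch of the accepted query forms, assuming dm is an ADMDatamart instance; the column values below are hypothetical:

>>> import polars as pl
>>> filtered = dm._apply_query(dm.modelData,
...     query=pl.col("ResponseCount") > 100)
>>> filtered = dm._apply_query(dm.modelData,
...     query=[pl.col("Channel") == "Web", pl.col("Positives") > 0])
>>> filtered = dm._apply_query(dm.modelData,
...     query={"Channel": ["Web", "Mobile"], "Direction": ["Outbound"]})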

discover_modelTypes(df: polars.LazyFrame, by: str = 'Configuration', allow_collect=False) Dict

Discovers the type of model embedded in the pyModelData column.

By default, we do a group_by Configuration, because a model rule can only contain one type of model. Then, for each configuration, we look into the pyModelData blob and find the _serialClass, returning it in a dict.

Parameters:
  • df (pl.LazyFrame) – The dataframe to search for model types

  • by (str) – The column to look for types in. Configuration is recommended.

  • allow_collect (bool, default = False) – Set to True to allow discovering modelTypes, even if in lazy strategy. It will fetch one modelData string per configuration.

Return type:

Dict

get_AGB_models(last: bool = False, by: str = 'Configuration', n_threads: int = 1, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, verbose: bool = True, **kwargs) Dict

Method to automatically extract AGB models.

Recommended to subset using the querying functionality to cut down on execution time, because it checks for each model ID. If you only have AGB models remaining after the query, it will only return proper AGB models.

Parameters:
  • last (bool, default = False) – Whether to only look at the last snapshot for each model

  • by (str, default = 'Configuration') – Which column to determine unique models with

  • n_threads (int, default = 1) – The number of threads to use for extracting the models. Since we use multithreading, setting this to a reasonable value helps speed up the import.

  • query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – Please refer to _apply_query()

  • verbose (bool, default = True) – Whether to print out information while importing

Return type:

Dict
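
A sketch, assuming dm contains AGB (gradient boosting) models; the configuration name used in the query is hypothetical:

>>> import polars as pl
>>> agb_models = dm.get_AGB_models(
...     last=True,
...     n_threads=4,
...     query=pl.col("Configuration") == "OmniAdaptiveModel")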

static _create_sign_df(df: polars.LazyFrame, by: str = 'Name', *, what: str = 'ResponseCount', every: str = '1d', pivot: bool = True, mask: bool = True) polars.LazyFrame

Generates dataframe to show whether responses decreased/increased from day to day

For a given dataframe where columns are dates and rows are model names (the by parameter), subtracts each day's value from the previous day's value per model, then masks the data. If the value increased (the desired situation), it puts 1 in the cell; if there was no change, 0; and if it decreased, -1. This dataframe can then be used in the heatmap

Parameters:
  • df (pl.LazyFrame) – This is typically pivoted ModelData

  • by (str, default = Name) – Column to calculate the daily change for.

  • what (str)

  • every (str)

  • pivot (bool)

  • mask (bool)

Keyword Arguments:
  • what (str, default = ResponseCount) – Column that contains response counts

  • every (str, default = 1d) – Interval of the change window

  • pivot (bool, default = True) – Returns a pivoted table with signs as values if set to True

  • mask (bool, default = True) – Drops SnapshotTime and returns the direction of change (sign).

Returns:

The dataframe with signs indicating the day-to-day increase or decrease

Return type:

pl.LazyFrame

model_summary(by: str = 'ModelID', query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, **kwargs) polars.LazyFrame

Convenience method to automatically generate a summary over models

By default, it summarizes ResponseCount, Performance, SuccessRate & Positives by model ID. It also adds weighted means for Performance and SuccessRate, and adds the count of models without responses and their percentage.

Parameters:
  • by (str, default = ModelID) – By what column to summarize the models

  • query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – Please refer to _apply_query()

Returns:

group_by dataframe over all models

Return type:

pl.LazyFrame
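
For example, to summarize per configuration instead of per model (a sketch, assuming dm is an ADMDatamart instance):

>>> summary = dm.model_summary(by="Configuration").collect()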

pivot_df(df: polars.LazyFrame, by: str | list = 'Name', *, allow_collect: bool = True, top_n: int = 0) polars.DataFrame

Simple function to extract pivoted information

Parameters:
  • df (pl.LazyFrame) – The input DataFrame.

  • by (Union[str, list], default = Name) – The column(s) to pivot the DataFrame by. If a list is provided, only the first element is used.

  • allow_collect (bool, default = True) – Whether to allow eager computation. If set to False and the import strategy is “lazy”, an error will be raised.

  • top_n (int, optional (default=0)) – The number of rows to include in the pivoted DataFrame. If set to 0, all rows are included.

Returns:

The pivoted DataFrame.

Return type:

pl.DataFrame

static response_gain_df(df: pdstools.utils.types.any_frame, by: str = 'Channel') pdstools.utils.types.any_frame

Simple function to extract the response gain per model

Parameters:
  • df (pdstools.utils.types.any_frame)

  • by (str)

Return type:

pdstools.utils.types.any_frame

models_by_positives_df(df: polars.LazyFrame, by: str = 'Channel', allow_collect=True) polars.LazyFrame

Compute statistics on the dataframe by grouping it by a given column and computing the count of unique ModelIDs and the cumulative percentage of unique models with regard to the number of positive answers.

Parameters:
  • df (pl.LazyFrame) – The input DataFrame

  • by (str, default = Channel) – The column name to group the DataFrame by, by default “Channel”

  • allow_collect (bool, default = True) – Whether to allow eager computation. If set to False and the import strategy is “lazy”, an error will be raised.

Returns:

DataFrame with PositivesBin column and model count statistics

Return type:

pl.LazyFrame

get_model_stats(last: bool = True) dict

Returns a dictionary containing various statistics for the model data.

Parameters:

last (bool) – Whether to compute statistics only on the last snapshot. Defaults to True.

Returns:

A dictionary containing the following keys:

  • 'models_n_snapshots': The number of distinct snapshot times in the data.

  • 'models_total': The total number of models in the data.

  • 'models_empty': The models with no responses.

  • 'models_nopositives': The models with responses but no positive responses.

  • 'models_isimmature': The models with less than 200 positive responses.

  • 'models_noperformance': The models with at least 200 positive responses but a performance of 50.

  • 'models_n_nonperforming': The total number of models that are not performing well.

  • 'models_missing_{key}': The number of models with missing values for each context key.

  • 'models_bottom_left': The models with a performance of 50 and a success rate of 0.

Return type:

Dict

describe_models(**kwargs) NoReturn

Convenience method to quickly summarize the models

Return type:

NoReturn

applyGlobalQuery(query: polars.Expr | List[polars.Expr] | str | Dict[str, list]) ADMDatamart

Convenience method to further query the datamart

It’s possible to give this query to the initial ADMDatamart class directly, but this method is more explicit. Filters on the model data (query is put in a polars.filter() method), filters the predictorData on the ModelIDs remaining after the query, and recomputes combinedData.

Only works with Polars expressions.

Parameters:

query (Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]) – The query to apply, see _apply_query()

Return type:

ADMDatamart
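
A short sketch, assuming dm is an ADMDatamart instance; the channel and issue values are hypothetical:

>>> import polars as pl
>>> dm = dm.applyGlobalQuery(pl.col("Channel") == "Web")
>>> dm = dm.applyGlobalQuery([pl.col("ResponseCount") > 0,
...                           pl.col("Issue") == "Sales"])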

fillMissing() ADMDatamart

Convenience method to fill missing values

  • Fills categorical, string and null type columns with “NA”

  • Fills SuccessRate, Performance and ResponseCount columns with 0

  • Replaces empty string values in context keys with the “NA” string

Return type:

ADMDatamart

summary_by_channel(custom_channels: Dict[str, str] = None, keep_lists: bool = False)
Parameters:
  • custom_channels (Dict[str, str])

  • keep_lists (bool)

overall_summary(custom_channels: Dict[str, str] = None)
Parameters:

custom_channels (Dict[str, str])

generateReport(name: str | None = None, working_dir: pathlib.Path = Path('.'), *, modelid: str | None = '', delete_temp_files: bool = True, output_type: str = 'html', allow_collect: bool = True, cached_data: bool = False, predictordetails_activeonly: bool = False, **kwargs)

Generates a report based on the provided parameters. If modelid is provided, a model report will be generated. If not, an overall HealthCheck report will be generated.

Parameters:
  • name (Optional[str], default = None) – The name of the report.

  • working_dir (Path, default = Path(".")) – The working directory. Cached files will be written here.

  • modelid (Optional[str])

  • delete_temp_files (bool)

  • output_type (str)

  • allow_collect (bool)

  • cached_data (bool)

  • predictordetails_activeonly (bool)

Keyword Arguments:
  • modelid (Optional[str], default = "") – The model id of the model to report on.

  • delete_temp_files (bool, default = True) – Whether to delete temporary files.

  • output_type (str, default = "html") – The type of the output file.

  • allow_collect (bool, default = True) – Whether to allow collection of data.

  • cached_data (bool, default = False) – Whether to use cached data.

  • del_cache (bool, default = True) – Whether to delete cache.

  • predictordetails_activeonly (bool, default = False) – Whether to only include active predictor details.

  • **kwargs – Additional keyword arguments.
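
A sketch of both report types, assuming dm is an ADMDatamart instance and the report tooling (Quarto) is installed; the report names and model id are placeholders:

>>> from pathlib import Path
>>> dm.generateReport(name="HealthCheck", working_dir=Path("reports"))
>>> dm.generateReport(name="ModelReport",
...                   working_dir=Path("reports"),
...                   modelid="<model id>",
...                   output_type="html")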

exportTables(file: pathlib.Path = 'Tables.xlsx', predictorBinning=False)

Exports all tables from pdstools.adm.Tables into one Excel file.

Parameters:
  • file (Path, default = 'Tables.xlsx') – The file name of the exported Excel file

  • predictorBinning (bool, default = False) – If False, the ‘predictorbinning’ table will not be created

class ADMTrees
static getMultiTrees(file: polars.DataFrame, n_threads=1, verbose=True, **kwargs)
Parameters:

file (polars.DataFrame)

class MultiTrees
property first
property last
trees: dict
model_name: str
context_keys: list
__repr__()

Return repr(self).

__getitem__(index)
__len__()
__add__(other)
computeOverTime(predictorCategorization=None)
plotSplitsPerVariableType(predictorCategorization=None, **kwargs)
class BinAggregator(dm: pdstools.ADMDatamart, query: polars.Expr = None)

A class to generate rolled up insights from ADM predictor binning.

Parameters:
  • dm (pdstools.ADMDatamart)

  • query (polars.Expr)

roll_up(predictors: str | list, n: int = 10, distribution: Literal[lin, log] = 'lin', boundaries: float | list | None = None, symbols: str | list | None = None, minimum: float | None = None, maximum: float | None = None, aggregation: str | None = None, as_numeric: bool | None = None, return_df: bool = False, verbose: bool = False) polars.DataFrame | plotly.graph_objects.Figure

Roll up a predictor across all the models defined when creating the class.

Predictors can be both numeric and symbolic (also called ‘categorical’). You can aggregate the same predictor across different sets of models by specifying a column name in the aggregation argument.

Parameters:
  • predictors (str | list) – Name of the predictor to roll up. Multiple predictors can be passed in as a list.

  • n (int, optional) – Number of bins (intervals or symbols) to generate, by default 10. Any custom intervals or symbols specified with the ‘musthave’ argument will count towards this number as well. For symbolic predictors this can be None, which means unlimited.

  • distribution (str, optional) – For numeric predictors: the way the intervals are constructed. By default “lin” for an evenly-spaced distribution, can be set to “log” for a long tailed distribution (for fields like income).

  • boundaries (float | list, optional) – For numeric predictors: one value, or a list of the numeric values to include as interval boundaries. They will be used at the front of the automatically created intervals. By default None, all intervals are created automatically.

  • symbols (str | list, optional) – For symbolic predictors, any symbol(s) that must be included in the symbol list in the generated binning. By default None.

  • minimum (float, optional) – Minimum value for numeric predictors, by default None. When None the minimum is taken from the binning data of the models.

  • maximum (float, optional) – Maximum value for numeric predictors, by default None. When None the maximum is taken from the binning data of the models.

  • aggregation (str, optional) – Optional column name in the data to aggregate over, creating separate aggregations for each of the different values. By default None.

  • as_numeric (bool, optional) – Optional override for the type of the predictor, so to be able to override in the (exceptional) situation that a predictor with the same name is numeric in some and symbolic in some other models. By default None which means the type is taken from the first predictor in the data.

  • return_df (bool, optional) – Return the underlying binning instead of a plot.

  • verbose (bool, optional) – Show detailed debug information while executing, by default False

Returns:

By default returns a nicely formatted plot. When ‘return_df’ is set to True, it returns the actual binning with the lift aggregated over all the models, optionally per predictor and per set of models.

Return type:

pl.DataFrame | Figure
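
A sketch of rolling up predictors, assuming dm is an ADMDatamart that includes predictor binning data; the predictor names are hypothetical:

>>> from pdstools import ADMDatamart, BinAggregator
>>> dm = ADMDatamart("Data/files")
>>> agg = BinAggregator(dm)
>>> fig = agg.roll_up("Customer.Age", n=10, distribution="lin")
>>> lift_df = agg.roll_up(["Customer.Age", "Customer.Income"],
...                       aggregation="Channel", return_df=True)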

accumulate_num_binnings(predictor, modelids, target_binning, verbose=False) polars.DataFrame
Return type:

polars.DataFrame

create_symbol_list(predictor, n_symbols, musthave_symbols) list
Return type:

list

accumulate_sym_binnings(predictor, modelids, symbollist, verbose=False) polars.DataFrame
Return type:

polars.DataFrame

normalize_all_binnings(combined_dm: polars.LazyFrame) polars.LazyFrame

Prepare all predictor binning

Fix up the boundaries for numeric bins and parse the bin labels into clean lists for symbolics.

Parameters:

combined_dm (polars.LazyFrame)

Return type:

polars.LazyFrame

create_empty_numbinning(predictor: str, n: int, distribution: str = 'lin', boundaries: list | None = None, minimum: float | None = None, maximum: float | None = None) polars.DataFrame
Parameters:
  • predictor (str)

  • n (int)

  • distribution (str)

  • boundaries (Optional[list])

  • minimum (Optional[float])

  • maximum (Optional[float])

Return type:

polars.DataFrame

get_source_numbinning(predictor: str, modelid: str) polars.DataFrame
Parameters:
  • predictor (str)

  • modelid (str)

Return type:

polars.DataFrame

combine_two_numbinnings(source: polars.DataFrame, target: polars.DataFrame, verbose=False) polars.DataFrame
Parameters:
  • source (polars.DataFrame)

  • target (polars.DataFrame)

Return type:

polars.DataFrame

plot_binning_attribution(source: polars.DataFrame, target: polars.DataFrame) plotly.graph_objects.Figure
Parameters:
  • source (polars.DataFrame)

  • target (polars.DataFrame)

Return type:

plotly.graph_objects.Figure

plotBinningLift(binning, col_facet=None, row_facet=None, custom_data=['PredictorName', 'BinSymbol'], return_df=False) polars.DataFrame | plotly.graph_objects.Figure
Return type:

Union[polars.DataFrame, plotly.graph_objects.Figure]

plot_lift_binning(binning: polars.DataFrame) plotly.graph_objects.Figure
Parameters:

binning (polars.DataFrame)

Return type:

plotly.graph_objects.Figure

get_token(credentialFile: str, verify: bool = True, **kwargs)

Get API credentials to a Pega Platform instance.

After setting up OAuth2 authentication in Dev Studio, you should be able to download a credential file. Simply point this method to that file, and it’ll read the relevant properties and give you your access token.

Parameters:
  • credentialFile (str) – The credential file downloaded after setting up OAuth in a Pega system

  • verify (bool, default = True) – Whether to only allow safe SSL requests. In case you’re connecting to an unsecured API endpoint, you need to explicitly set verify to False, otherwise Python will yell at you.

Keyword Arguments:

url (str) – An optional override of the URL to connect to. This is also extracted out of the credential file, but you may want to customize this (to a different port, etc).
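
A short sketch, assuming a credential file downloaded from Dev Studio is stored locally; the file name and URL are placeholders:

>>> from pdstools import get_token
>>> token = get_token("CredentialFile.txt")
>>> token = get_token("CredentialFile.txt", verify=False,
...                   url="https://my-pega-instance.example.com")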

readDSExport(filename: pandas.DataFrame | polars.DataFrame | str, path: str = '.', verbose: bool = True, **reading_opts) polars.LazyFrame

Read a Pega dataset export file. Can accept either a Pandas DataFrame or one of the following formats:

  • .csv

  • .json

  • .zip (zipped JSON or CSV)

  • .feather

  • .ipc

  • .parquet

It automatically infers the default file names for both model data as well as predictor data. If you supply either ‘modelData’ or ‘predictorData’ as the ‘filename’ argument, it will search for them. If you supply the full name of the file in the ‘path’ directory, it will import that instead. Since pdstools V3.x, returns a Polars LazyFrame. Simply call .collect() to get an eager frame.

Parameters:
  • filename ([pd.DataFrame, pl.DataFrame, str]) – Either a Pandas/Polars DataFrame with the source data (for compatibility), or a string, in which case it can either be: - The name of the file (if a custom name) or - Whether we want to look for ‘modelData’ or ‘predictorData’ in the path folder.

  • path (str, default = '.') – The location of the file

  • verbose (bool, default = True) – Whether to print out which file will be imported

Keyword Arguments:

Any – Any arguments to plug into the scan_* function from Polars.

Returns:

The (lazy) dataframe

Return type:

polars.LazyFrame

Examples

>>> df = readDSExport(filename='modelData', path='./datamart')
>>> df = readDSExport(filename='ModelSnapshot.json', path='data/ADMData')
>>> df = pd.read_csv('file.csv')
>>> df = readDSExport(filename=df)

setupAzureOpenAI(api_base: str = 'https://aze-openai-01.openai.azure.com/', api_version: Literal[2022-12-01, 2023-03-15-preview, 2023-05-15, 2023-06-01-preview, 2023-07-01-preview, 2023-09-15-preview, 2023-10-01-preview, 2023-12-01-preview] = '2023-12-01-preview')

Convenience function to automagically setup Azure AD-based authentication for the Azure OpenAI service. Mostly meant as an internal tool within Pega, but can of course also be used beyond.

Prerequisites (you should only need to do this once!):

  • Download the Azure CLI (https://learn.microsoft.com/en-us/cli/azure/install-azure-cli)

  • Once installed, run ‘az login’ in your terminal

  • Additional dependencies: pip install azure-identity and pip install openai

Running this function automatically sets, among others:

  • openai.api_key

  • os.environ["OPENAI_API_KEY"]

This should ensure that you don’t need to pass tokens and/or api_keys around. The key that’s set has a lifetime, typically of one hour. Therefore, if you get an error message like ‘invalid token’, you may need to run this method again to refresh the token for another hour.

Parameters:
  • api_base (str) – The URL of the Azure service you’d like to connect to. If you have access to the Azure OpenAI playground (https://oai.azure.com/portal), you can easily find this URL by clicking ‘view code’ in one of the playgrounds. If you have access to the Azure portal directly (https://portal.azure.com), it is listed under ‘endpoint’. Otherwise, ask your system administrator for the correct URL.

  • api_version (str) – The version of the api to use

Usage

>>> from pdstools import setupAzureOpenAI
>>> setupAzureOpenAI()

class CDHLimits

Bases: object

A singleton container for best practice limits for CDH.

LimitStatus
Metrics
_instance
num_limit
lims
get_limits(metric: Metrics) num_limit | None
Parameters:

metric (Metrics)

Return type:

Union[num_limit, None]

check_limits(metric: Metrics, value: object) LimitStatus
Parameters:
  • metric (Metrics)

  • value (object)

Return type:

LimitStatus

defaultPredictorCategorization(x: str | polars.Expr = pl.col('PredictorName')) polars.Expr

Function to determine the ‘category’ of a predictor.

It is possible to supply a custom function. This function can accept an optional column as input, and its output should be a Polars expression. The most straightforward way to implement this is with pl.when().then().otherwise(), which you can chain.

By default, this function returns “Primary” whenever there is no ‘.’ anywhere in the name string, and otherwise returns the first string before the first period.

Parameters:

x (Union[str, pl.Expr], default = pl.col('PredictorName')) – The column to parse

Return type:

polars.Expr
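
A sketch of a custom categorization built with pl.when().then().otherwise(); the prefixes are hypothetical, and exactly how the expression is wired into ADMDatamart (via the predictorCategorization argument) is an assumption:

>>> import polars as pl
>>> def my_categorization(x: pl.Expr = pl.col("PredictorName")) -> pl.Expr:
...     # Hypothetical prefixes; anything not matched is labelled "Primary"
...     return (
...         pl.when(x.str.starts_with("Customer"))
...         .then(pl.lit("Customer"))
...         .when(x.str.starts_with("Account"))
...         .then(pl.lit("Account"))
...         .otherwise(pl.lit("Primary"))
...         .alias("PredictorCategory")
...     )
>>> dm = ADMDatamart("Data/files", predictorCategorization=my_categorization())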

show_versions() None

Print out version of pdstools and dependencies to stdout.

Examples

>>> from pdstools import show_versions
>>> show_versions()
---Version info---
pdstools: 3.1.0
Platform: macOS-12.6.4-x86_64-i386-64bit
Python: 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)]
---Dependencies---
plotly: 5.13.1
requests: 2.28.1
pydot: 1.4.2
polars: 0.17.0
pyarrow: 11.0.0.dev52
tqdm: 4.64.1
pyyaml: <not installed>
aioboto3: 11.0.1
---Streamlit app dependencies---
streamlit: 1.20.0
quarto: 0.1.0
papermill: 2.4.0
itables: 1.5.1
pandas: 1.5.3
jinja2: 3.1.2
xlsxwriter: 3.0
Return type:

None

CDHSample(plotting_engine='plotly', query=None, **kwargs)
SampleTrees()
SampleValueFinder(verbose=True)
class Config(config_file: str | None = None, hds_folder: pathlib.Path = '.', use_datamart: bool = False, datamart_folder: pathlib.Path = 'datamart', output_format: Literal[ndjson, parquet, arrow, csv] = 'ndjson', output_folder: pathlib.Path = 'output', mapping_file: str = 'mapping.map', mask_predictor_names: bool = True, mask_context_key_names: bool = False, mask_ih_names: bool = True, mask_outcome_name: bool = False, mask_predictor_values: bool = True, mask_context_key_values: bool = True, mask_ih_values: bool = True, mask_outcome_values: bool = True, context_key_label: str = 'Context_*', ih_label: str = 'IH_*', outcome_column: str = 'Decision_Outcome', positive_outcomes: list = ['Accepted', 'Clicked'], negative_outcomes: list = ['Rejected', 'Impression'], special_predictors: list = ['Decision_DecisionTime', 'Decision_OutcomeTime', 'Decision_Rank'], sample_percentage_schema_inferencing: float = 0.01)

Configuration file for the data anonymizer.

Parameters:
  • config_file (str = None) – An optional path to a config file

  • hds_folder (Path = ".") – The path to the hds files

  • use_datamart (bool = False) – Whether to use the datamart to infer predictor types

  • datamart_folder (Path = "datamart") – The folder of the datamart files

  • output_format (Literal["ndjson", "parquet", "arrow", "csv"] = "ndjson") – The output format to write the files in

  • output_folder (Path = "output") – The location to write the files to

  • mapping_file (str = "mapping.map") – The name of the predictor mapping file

  • mask_predictor_names (bool = True) – Whether to mask the names of regular predictors

  • mask_context_key_names (bool = False) – Whether to mask the names of context key predictors

  • mask_ih_names (bool = True) – Whether to mask the name of Interaction History summary predictors

  • mask_outcome_name (bool = False) – Whether to mask the name of the outcome column

  • mask_predictor_values (bool = True) – Whether to mask the values of regular predictors

  • mask_context_key_values (bool = True) – Whether to mask the values of context key predictors

  • mask_ih_values (bool = True) – Whether to mask the values of Interaction History summary predictors

  • mask_outcome_values (bool = True) – Whether to mask the values of the outcomes to binary

  • context_key_label (str = "Context_*") – The pattern of names for context key predictors

  • ih_label (str = "IH_*") – The pattern of names for Interaction History summary predictors

  • outcome_column (str = "Decision_Outcome") – The name of the outcome column

  • positive_outcomes (list = ["Accepted", "Clicked"]) – Which positive outcomes to map to True

  • negative_outcomes (list = ["Rejected", "Impression"]) – Which negative outcomes to map to False

  • special_predictors (list = ["Decision_DecisionTime", "Decision_OutcomeTime", "Decision_Rank"]) – A list of special predictors which are not touched

  • sample_percentage_schema_inferencing (float) – The percentage of records to sample to infer the column type. In case you’re getting casting errors, it may be useful to increase this percentage to check a larger portion of data.
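
A sketch of constructing and saving a configuration; the folder names and configuration file name are hypothetical:

>>> from pdstools import Config
>>> config = Config(hds_folder="hds",
...                 output_folder="anonymised",
...                 output_format="parquet",
...                 mask_outcome_values=False)
>>> config.save_to_config_file("my_config")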

load_from_config_file(config_file: pathlib.Path)

Load the configurations from a file.

Parameters:

config_file (Path) – The path to the configuration file

save_to_config_file(file_name: str = None)

Save the configurations to a file.

Parameters:

file_name (str) – The name of the configuration file

validate_paths()

Validate the outcome folder exists.

class DataAnonymization(config: Config | None = None, df: polars.LazyFrame | None = None, datamart: pdstools.ADMDatamart | None = None, **config_args)

Anonymize a historical dataset.

Parameters:
  • config (Optional[Config]) – Override the default configurations with the Config class

  • df (Optional[polars.LazyFrame]) – Manually supply a Polars lazyframe to anonymize

  • datamart (Optional[pdstools.ADMDatamart]) – Manually supply a Datamart file to infer predictor types

Keyword Arguments:

**config_args – See Config

Example

See https://pegasystems.github.io/pega-datascientist-tools/Python/articles/Example_Data_Anonymization.html
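
A minimal sketch of the flow, assuming the historical dataset files sit in a local folder; the folder names are hypothetical:

>>> from pdstools import DataAnonymization
>>> anonymizer = DataAnonymization(hds_folder="hds", output_folder="anonymised")
>>> anonymizer.create_mapping_file()
>>> anonymizer.write_to_output(ext="parquet")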

write_to_output(df: polars.DataFrame | None = None, ext: Literal[ndjson, parquet, arrow, csv] = None, mode: Literal[optimized, robust] = 'optimized')

Write the processed dataframe to an output file.

Parameters:
  • df (Optional[pl.DataFrame]) – Dataframe to write. If not provided, runs self.process()

  • ext (Literal["ndjson", "parquet", "arrow", "csv"]) – What extension to write the file to

  • mode (Literal['optimized', 'robust'], default = 'optimized') – Whether to output a single file (optimized) or maintain the same file structure as the original files (robust). Optimized should be faster, but robust should allow for bigger data as we don’t need all data in memory at the same time.

create_mapping_file()

Create a file to write the column mapping

load_hds_files()

Load the historical dataset files from the config.hds_folder location.

read_predictor_type_from_file(df: polars.LazyFrame)

Infer the types of the predictors from the data.

This is non-trivial, as it’s not ideal to pull in all data to memory for this. For this reason, we sample 1% of data, or all data if less than 50 rows, and try to cast it to numeric. If that fails, we set it to categorical, else we set it to numeric.

It is technically supported to manually override this, by just overriding the symbolic_predictors_to_mask & numeric_predictors_to_mask properties.

Parameters:

df (pl.LazyFrame) – The lazyframe to infer the types with

static read_predictor_type_from_datamart(datamart_folder: pathlib.Path, datamart: pdstools.ADMDatamart = None)

The datamart contains type information about each predictor. This function extracts that information to infer types for the HDS.

Parameters:
  • datamart_folder (Path) – The path to the datamart files

  • datamart (ADMDatamart) – The direct ADMDatamart object

get_columns_by_type()

Get a list of columns for each type.

get_predictors_mapping()

Map the predictor names to their anonymized form.

getHasher(cols, algorithm='xxhash', seed='random', seed_1=None, seed_2=None, seed_3=None)
process(strategy='eager', **kwargs)

Anonymize the dataset.

class PegaDefaultTables
class ADMModelSnapshot
pxApplication
pyAppliesToClass
pyModelID
pyConfigurationName
pySnapshotTime
pyIssue
pyGroup
pyName
pyChannel
pyDirection
pyTreatment
pyPerformance
pySuccessRate
pyResponseCount
pxObjClass
pzInsKey
pxInsName
pxSaveDateTime
pxCommitDateTime
pyExtension
pyActivePredictors
pyTotalPredictors
pyNegatives
pyPositives
pyRelativeNegatives
pyRelativePositives
pyRelativeResponseCount
pyMemory
pyPerformanceThreshold
pyCorrelationThreshold
pyPerformanceError
pyModelData
pyModelVersion
pyFactoryUpdatetime
class ADMPredictorBinningSnapshot
pxCommitDateTime
pxSaveDateTime
pyModelID
pxObjClass
pzInsKey
pxInsName
pyPredictorName
pyContents
pyPerformance
pyPositives
pyNegatives
pyType
pyTotalBins
pyResponseCount
pyRelativePositives
pyRelativeNegatives
pyRelativeResponseCount
pyBinNegatives
pyBinPositives
pyBinType
pyBinNegativesPercentage
pyBinPositivesPercentage
pyBinSymbol
pyBinLowerBound
pyBinUpperBound
pyRelativeBinPositives
pyRelativeBinNegatives
pyBinResponseCount
pyRelativeBinResponseCount
pyBinResponseCountPercentage
pySnapshotTime
pyBinIndex
pyLift
pyZRatio
pyEntryType
pyExtension
pyGroupIndex
pyCorrelationPredictor
class pyValueFinder
pyDirection
pySubjectType
ModelPositives
pyGroup
pyPropensity
FinalPropensity
pyStage
pxRank
pxPriority
pyModelPropensity
pyChannel
Value
pyName
StartingEvidence
pySubjectID
DecisionTime
pyTreatment
pyIssue
class ValueFinder(path: str | None = None, df: pandas.DataFrame | polars.DataFrame | polars.LazyFrame | None = None, verbose: bool = True, import_strategy: Literal[eager, lazy] = 'eager', ncust: int = None, **kwargs)

Class to analyze Value Finder datasets.

Relies heavily on polars for faster reading and transformations. See https://pola-rs.github.io/polars/py-polars/html/index.html

Requires either df or a path to be supplied. If a path is supplied, the ‘filename’ keyword argument is optional; if no filename is given, it will look for the most recent file.

Parameters:
  • path (Optional[str]) – Path to the ValueFinder data files

  • df (Optional[DataFrame]) – Override to supply a dataframe instead of a file. Supports pandas or polars dataframes

  • import_strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory, or to scan the file. When data fits into memory, 'eager' is typically more efficient. However, when data does not fit, the lazy methods typically still allow you to use the data.

  • verbose (bool) – Whether to print out information during importing

  • ncust (int)

Keyword Arguments:
  • th (float) – An optional keyword argument to override the propensity threshold

  • filename (Optional[str]) – The name, or extended filepath, towards the file

  • subset (bool) – Whether to select only a subset of columns. Will speed up analysis and reduce unused information
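
A sketch of a typical analysis flow, assuming a Value Finder export is available locally; the path is hypothetical:

>>> from pdstools import ValueFinder
>>> vf = ValueFinder(path="Data/ValueFinder")
>>> fig = vf.plotPropensityDistribution()
>>> fig = vf.plotPieCharts(start=0.01, stop=0.05, step=0.01, method="quantile")
>>> fig = vf.plotFunnelChart()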

save_data(path: str = '.') os.PathLike

Cache the ValueFinder dataset to a file

Parameters:

path (str) – Where to place the file

Returns:

The paths to the file

Return type:

PathLike

getCustomerSummary(th: float | None = None) polars.DataFrame

Computes the summary of propensities for all customers

Parameters:

th (Optional[float]) – The threshold to consider an action ‘good’. If a customer has actions with propensity above this, the customer has at least one relevant action. If not given, will default to 5th quantile.

Return type:

polars.DataFrame

getCountsPerStage(customersummary: polars.DataFrame | None = None) polars.DataFrame

Generates an aggregated view per stage.

Parameters:

customersummary (Optional[pl.DataFrame]) – Optional override of the customer summary, which can be generated by getCustomerSummary().

Return type:

polars.DataFrame

getThFromQuantile(quantile: float) float

Return the propensity threshold corresponding to a given quantile

If the threshold is already in self._thMap, it is simply retrieved from there. Otherwise, the threshold is computed and then added to the map.

Parameters:

quantile (float) – The quantile to get the threshold for

Return type:

float

getCountsPerThreshold(th, return_df=False) polars.LazyFrame | None
Return type:

Optional[polars.LazyFrame]

addCountsForThresholdRange(start, stop, step, method: Literal['threshold', 'quantile']) None

Adds the counts per stage for a range of quantiles or thresholds.

Once computed, the values are added to .countsPerThreshold so we only need to compute each value once.

Parameters:
  • start (float) – The starting of the range

  • stop (float) – The end of the range

  • step (float) – The steps to compute between start and stop

  • method (Literal["threshold", "quantile"]:) – Whether to get a range of thresholds directly or compute the thresholds from their quantiles

Return type:

None

plotPropensityDistribution(sampledN: int = 10000) plotly.graph_objects.Figure

Plots the distribution of the different propensities.

For optimization reasons (storage for all points in a boxplot and time complexity for computing the distribution plot), we have to sample to a reasonable amount of data points.

Parameters:

sampledN (int, default = 10_000) – The number of datapoints to sample

Return type:

plotly.graph_objects.Figure

plotPropensityThreshold(sampledN=10000, stage='Eligibility') plotly.graph_objects.Figure

Plots the propensity threshold vs the different propensities.

Parameters:

sampledN (int, default = 10_000) – The number of datapoints to sample

Return type:

plotly.graph_objects.Figure

plotPieCharts(start: float = None, stop: float = None, step: float = None, *, method: Literal['threshold', 'quantile'] = 'threshold', rounding: int = 3, th: float | None = None) plotly.graph_objects.FigureWidget

Plots pie charts showing the distribution of customers

The pie charts each represent the fraction of customers with the color indicating whether they have sufficient relevant actions in that stage of the NBAD arbitration.

If no values are provided for start, stop or step, the pie charts are shown using the default propensity threshold, as part of the Value Finder class.

Parameters:
  • start (float) – The starting of the range

  • stop (float) – The end of the range

  • step (float) – The steps to compute between start and stop

  • method (Literal['threshold', 'quantile'])

  • rounding (int)

  • th (Optional[float])

Keyword Arguments:
  • method (Literal['threshold', 'quantile'], default='threshold') – Whether the range is computed based on the threshold directly or based on the quantile of the propensity

  • rounding (int) – The number of digits to round the values by

  • th (Optional[float]) – Choose a specific propensity threshold to plot

Return type:

plotly.graph_objects.FigureWidget

plotDistributionPerThreshold(start: float = None, stop: float = None, step: float = None, *, method: Literal['threshold', 'quantile'] = 'threshold', rounding=3) plotly.graph_objects.FigureWidget

Plots the distribution of customers per threshold, per stage.

Based on the precomputed data in self.countsPerThreshold, this function will plot the distribution per stage.

To add more data points between a given range, simply pass all three arguments to this function: start, stop and step.

Parameters:
  • start (float) – The starting of the range

  • stop (float) – The end of the range

  • step (float) – The steps to compute between start and stop

  • method (Literal['threshold', 'quantile'])

Keyword Arguments:
  • method (Literal['threshold', 'quantile'], default='threshold') – Whether the range is computed based on the threshold directly or based on the quantile of the propensity

  • rounding (int) – The number of digits to round the values by

Return type:

plotly.graph_objects.FigureWidget

plotFunnelChart(level: str = 'Action', query=None, return_df=False, **kwargs)

Plots the funnel of actions or issues per stage.

Parameters:

level (str, default = 'Action') – Which element to plot: if ‘Actions’, plots the distribution of actions; if ‘Issues’, plots the distribution of issues.

class Sample(ldf: polars.LazyFrame)
Parameters:

ldf (polars.LazyFrame)

sample(n)
height()
shape()
item()
__reports__