pdstools.adm¶

Submodules¶

Classes¶

ADMDatamart

Monitor and analyze ADM data from the Pega Datamart.

Package Contents¶

class ADMDatamart(model_df: polars.LazyFrame | None = None, predictor_df: polars.LazyFrame | None = None, *, query: pdstools.utils.types.QUERY | None = None, extract_pyname_keys: bool = True)¶

Monitor and analyze ADM data from the Pega Datamart.

To initialize this class, either 1. Initialize directly with the model_df and predictor_df polars LazyFrames 2. Use one of the class methods: from_ds_export, from_s3, from_dataflow_export etc.

This class will read in the data from different sources, properly structure them from further analysis, and apply correct typing and useful renaming.

There is also a few “namespaces” that you can call from this class:

.plot contains ready-made plots to analyze the data with
.aggregates contains mostly internal data aggregations queries
.agb contains analysis utilities for Adaptive Gradient Boosting models
.generate leads to some ready-made reports, such as the Health Check
.bin_aggregator allows you to compare the bins across various models

Parameters:

model_df (pl.LazyFrame, optional) – The Polars LazyFrame representation of the model snapshot table.
predictor_df (pl.LazyFrame, optional) – The Polars LazyFrame represenation of the predictor binning table.
query (QUERY, optional) – An optional query to apply to the input data. For details, see pdstools.utils.cdh_utils._apply_query().
extract_pyname_keys (bool, default = True) – Whether to extract extra keys from the pyName column. In older Pega versions, this contained pyTreatment among other (customizable) fields. By default True

Examples

>>> from pdstools import ADMDatamart
>>> from glob import glob
>>> dm = ADMDatamart(
         model_df = pl.scan_parquet('models.parquet'),
         predictor_df = pl.scan_parquet('predictors.parquet')
         query = {"Configuration":["Web_Click_Through"]}
         )
>>> dm = ADMDatamart.from_ds_export(base_path='/my_export_folder')
>>> dm = ADMDatamart.from_s3("pega_export")
>>> dm = ADMDatamart.from_dataflow_export(glob("data/models*"), glob("data/preds*"))

Note

This class depends on two datasets:

pyModelSnapshots corresponds to the model_data attribute
pyADMPredictorSnapshots corresponds to the predictor_data attribute

For instructions on how to download these datasets, please refer to the following article: https://docs.pega.com/bundle/platform/page/platform/decision-management/exporting-monitoring-database.html

See also

pdstools.adm.Plots: The out of the box plots on the Datamart data
pdstools.adm.Reports: Methods to generate the Health Check and Model Report
pdstools.utils.cdh_utils._apply_query: How to query the ADMDatamart class and methods

model_data: polars.LazyFrame | None¶

predictor_data: polars.LazyFrame | None¶

combined_data: polars.LazyFrame | None¶

plot: pdstools.adm.Plots.Plots¶

aggregates: pdstools.adm.Aggregates.Aggregates¶

agb: pdstools.adm.ADMTrees.AGB¶

generate: pdstools.adm.Reports.Reports¶

cdh_guidelines: pdstools.adm.CDH_Guidelines.CDHGuidelines¶

bin_aggregator: pdstools.adm.BinAggregator.BinAggregator¶

first_action_dates: polars.LazyFrame | None¶

context_keys: List[str] = ['Channel', 'Direction', 'Issue', 'Group', 'Name']¶

_get_first_action_dates(df: polars.LazyFrame | None) → polars.LazyFrame¶

Parameters:: df (Optional[polars.LazyFrame])
Return type:: polars.LazyFrame

classmethod from_ds_export(model_filename: str | None = None, predictor_filename: str | None = None, base_path: os.PathLike | str = '.', *, query: pdstools.utils.types.QUERY | None = None, extract_pyname_keys: bool = True)¶

Import the ADMDatamart class from a Pega Dataset Export

Parameters:

model_filename (Optional[str], optional) – The full path or name (if base_path is given) to the model snapshot files, by default None
predictor_filename (Optional[str], optional) – The full path or name (if base_path is given) to the predictor binning snapshot files, by default None
base_path (Union[os.PathLike, str], optional) – A base path to provide so that we can automatically find the most recent files for both the model and predictor snapshots, if model_filename and predictor_filename are not given as full paths, by default “.”
query (Optional[QUERY], optional) – An optional argument to filter out selected data, by default None
extract_pyname_keys (bool, optional) – Whether to extract additional keys from the pyName column, by default True

Returns:

The properly initialized ADMDatamart class

Return type:

ADMDatamart

Examples

>>> from pdstools import ADMDatamart

>>> # To automatically find the most recent files in the 'my_export_folder' dir:
>>> dm = ADMDatamart.from_ds_export(base_path='/my_export_folder')

>>> # To specify individual files:
>>> dm = ADMDatamart.from_ds_export(
        model_df='/Downloads/model_snapshots.parquet',
        predictor_df = '/Downloads/predictor_snapshots.parquet'
        )

Note

By default, the dataset export in Infinity returns a zip file per table. You do not need to open up this zip file! You can simply point to the zip, and this method will be able to read in the underlying data.

See also

pdstools.pega_io.File.read_ds_export: More information on file compatibility
pdstools.utils.cdh_utils._apply_query: How to query the ADMDatamart class and methods

classmethod from_s3()¶: Not implemented yet. Please let us know if you would like this functionality!

classmethod from_dataflow_export(model_data_files: Iterable[str] | str, predictor_data_files: Iterable[str] | str, *, query: pdstools.utils.types.QUERY | None = None, extract_pyname_keys: bool = True, cache_file_prefix: str = '', extension: Literal['json'] = 'json', compression: Literal['gzip'] = 'gzip', cache_directory: os.PathLike | str = 'cache')¶

Read in data generated by a data flow, such as the Prediction Studio export.

Dataflows are able to export data from and to various sources. As they are meant to be used in production, they are highly resiliant. For every partition and every node, a dataflow will output a small json file every few seconds. While this is great for production loads, it can be a bit more tricky to read in the data for smaller-scale and ad-hoc analyses.

This method aims to make the ingestion of such highly partitioned data easier. It reads in every individual small json file that the dataflow has output, and caches them to a parquet file in the cache_directory folder. As such, if you re-run this method later with more data added since the last export, we will not read in from the (slow) dataflow files, but rather from the (much faster) cache.

Parameters:

model_data_files (Union[Iterable[str], str]) – A list of files to read in as the model snapshots
predictor_data_files (Union[Iterable[str], str]) – A list of files to read in as the predictor snapshots
query (Optional[QUERY], optional) – A, by default None
extract_pyname_keys (bool, optional) – Whether to extract extra keys from the pyName column, by default True
cache_file_prefix (str, optional) – An optional prefix for the cache files, by default “”
extension (Literal["json"], optional) – The extension of the source data, by default “json”
compression (Literal["gzip"], optional) – The compression of the source files, by default “gzip”
cache_directory (Union[os.PathLike, str], optional) – Where to store the cached files, by default “cache”

Returns:

An initialized instance of the datamart class

Return type:

ADMDatamart

Examples

>>> from pdstools import ADMDatamart
>>> import glob
>>> dm = ADMDatamart.from_dataflow_export(glob("data/models*"), glob("data/preds*"))

See also

pdstools.utils.cdh_utils._apply_query: How to query the ADMDatamart class and methods
glob: Makes creating lists of files much easier

classmethod from_pdc(df: polars.LazyFrame, return_df=False)¶

Parameters:: df (polars.LazyFrame)

_validate_model_data(df: polars.LazyFrame | None, extract_pyname_keys: bool = True) → polars.LazyFrame | None¶

Internal method to validate model data

Parameters:

df (Optional[polars.LazyFrame])
extract_pyname_keys (bool)

Return type:

Optional[polars.LazyFrame]

_validate_predictor_data(df: polars.LazyFrame | None) → polars.LazyFrame | None¶

Internal method to validate predictor data

Parameters:: df (Optional[polars.LazyFrame])
Return type:: Optional[polars.LazyFrame]

apply_predictor_categorization(categorization: polars.Expr | Callable[Ellipsis, polars.Expr] | Dict[str, str | List[str]] = cdh_utils.default_predictor_categorization, *, use_regexp: bool = False, df: polars.LazyFrame | None = None)¶

Apply a new predictor categorization to the datamart tables

In certain plots, we use the predictor categorization to indicate what ‘kind’ a certain predictor is, such as IH, Customer, etc. Call this method with a custom Polars Expression (or a method that returns one) or a simple mapping and it will be applied to the predictor data (and the combined dataset too). When the categorization provides no match, the existing categories are kept as they are.

For a reference implementation of a custom predictor categorization, refer to pdstools.utils.cdh_utils.default_predictor_categorization.

Parameters:

categorization (Union[pl.Expr, Callable[..., pl.Expr], Dict[str, Union[str, List[str]]]]) – A Polars Expression (or method that returns one) that returns the predictor categories. Should be based on Polars’ when.then.otherwise syntax. Alternatively can be a dictionary of categories to (list of) string matches which can be either exact (the default) or regular expressions. By default, pdstools.utils.cdh_utils.default_predictor_categorization is used.
use_regexp (bool, optional) – Treat the mapping patterns in the categorization dictionary as regular expressions rather than plain strings. When treated as regular expressions, they will be interpreted in non-strict mode, so invalid expressions will return in no match. See https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.contains.html for exact behavior of the regular expressions. By default, False
df (Optional[pl.LazyFrame], optional) – A Polars Lazyframe to apply the categorization to. If not provided, applies it over the predictor data and combined datasets. By default, None

See also

pdstools.utils.cdh_utils.default_predictor_categorization: The default method

Examples

>>> dm = ADMDatamart(my_data) #uses the OOTB predictor categorization

>>> # Uses a custom Polars expression to set the categories
>>> dm.apply_predictor_categorization(categorization=pl.when(
>>> pl.col("PredictorName").cast(pl.Utf8).str.contains("Propensity")
>>> ).then(pl.lit("External Model")
>>> )

>>> # Uses a simple dictionary to set the categories
>>> dm.apply_predictor_categorization(categorization={
>>> "External Model" : ["Score", "Propensity"]}
>>> )

save_data(path: os.PathLike | str = '.', selected_model_ids: List[str] | None = None) → Tuple[pathlib.Path | None, pathlib.Path | None]¶

Caches model_data and predictor_data to files.

Parameters:

path (str) – Where to place the files
selected_model_ids (List[str]) – Optional list of model IDs to restrict to

Returns:

The paths to the model and predictor data files

Return type:

(Optional[Path], Optional[Path])

property unique_channels¶

A consistently ordered set of unique channels in the data

Used for making the color schemes in different plots consistent

property unique_configurations¶

A consistently ordered set of unique configurations in the data

Used for making the color schemes in different plots consistent

property unique_channel_direction¶: A consistently ordered set of unique channel+direction combos in the data Used for making the color schemes in different plots consistent

property unique_configuration_channel_direction¶: A consistently ordered set of unique configuration+channel+direction Used for making the color schemes in different plots consistent

property unique_predictor_categories¶: A consistently ordered set of unique predictor categories in the data Used for making the color schemes in different plots consistent

classmethod _minMaxScoresPerModel(bin_data: polars.LazyFrame) → polars.LazyFrame¶

Parameters:: bin_data (polars.LazyFrame)
Return type:: polars.LazyFrame

active_ranges(model_ids: str | List[str] | None = None) → polars.LazyFrame¶

Calculate the active, reachable bins in classifiers.

The classifiers exported by Pega contain (in certain product versions) more than the bins that can be reached given the current state of the predictors. This method first calculates the min and max score range from the predictor log odds, then maps that to the interval boundaries of the classifier(s) to find the min and max index.

It returns a LazyFrame with the score min/max, the min/max index, as well as the AUC as reported in the datamart data, when calculated from the full range, and when calculated from the reachable bins only.

This information can be used in the Health Check documents or when verifying the AUC numbers from the datamart.

Parameters:

model_ids (Optional[Union[str, List[str]]], optional) – An optional list of model id’s, or just a single one, to report on. When not given, the information is returned for all models.

Returns:

A table with all the index and AUC information for all the models with the following fields:

Model Identification: - ModelID - The unique identifier for the model

AUC Metrics: - AUC_Datamart - The AUC value as reported in the datamart - AUC_FullRange - The AUC calculated from the full range of bins in the classifier - AUC_ActiveRange - The AUC calculated from only the active/reachable bins

Classifier Information: - Bins - The total number of bins in the classifier - nActivePredictors - The number of active predictors in the model

Log Odds Information (mostly for internal use): - classifierLogOffset - The log offset of the classifier (baseline log odds) - sumMinLogOdds - The sum of minimum log odds across all active predictors - sumMaxLogOdds - The sum of maximum log odds across all active predictors - score_min - The minimum score (normalized sum of log odds including classifier offset) - score_max - The maximum score (normalized sum of log odds including classifier offset)

Active Range Information: - idx_min - The minimum bin index that can be reached given the current binning of all predictors - idx_max - The maximum bin index that can be reached given the current binning of all predictors

Return type:

pl.LazyFrame