pdstools.adm.ADMDatamart

Module Contents

Classes

ADMDatamart

Main class for importing, preprocessing and structuring the Pega ADM Datamart.

class ADMDatamart(path: str | pathlib.Path = Path('.'), import_strategy: Literal[eager, lazy] = 'eager', *, model_filename: str | None = 'modelData', predictor_filename: str | None = 'predictorData', model_df: pdstools.utils.types.any_frame | None = None, predictor_df: pdstools.utils.types.any_frame | None = None, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, subset: bool = True, drop_cols: list | None = None, include_cols: list | None = None, context_keys: list = ['Channel', 'Direction', 'Issue', 'Group'], extract_keys: bool = False, predictorCategorization: polars.Expr = cdh_utils.defaultPredictorCategorization, plotting_engine: str | Any = 'plotly', verbose: bool = False, **reading_opts)

Bases: pdstools.plots.plot_base.Plots, pdstools.adm.Tables.Tables

Main class for importing, preprocessing and structuring the Pega ADM Datamart. It reads all available data, applies proper naming, and merges it into one main dataframe.

It’s also possible to import directly from S3. Please refer to pdstools.pega_io.S3.S3Data.get_ADMDatamart().

Parameters:
  • path (str, default = ".") – The path of the data files

  • import_strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory, or to scan it lazily. When the data fits into memory, 'eager' is typically more efficient; when it does not, the lazy methods typically still allow you to use the data.

  • model_filename (Optional[str])

  • predictor_filename (Optional[str])

  • model_df (Optional[pdstools.utils.types.any_frame])

  • predictor_df (Optional[pdstools.utils.types.any_frame])

  • query (Optional[Union[polars.Expr, List[polars.Expr], str, Dict[str, list]]])

  • subset (bool)

  • drop_cols (Optional[list])

  • include_cols (Optional[list])

  • context_keys (list)

  • extract_keys (bool)

  • predictorCategorization (polars.Expr)

  • plotting_engine (Union[str, Any])

  • verbose (bool)

Keyword Arguments:
  • model_filename (Optional[str]) – The name, or extended filepath, of the model file

  • predictor_filename (Optional[str]) – The name, or extended filepath, of the predictors file

  • model_df (Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame]) – Optional override to supply a dataframe instead of a file

  • predictor_df (Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame]) – Optional override to supply a dataframe instead of a file

  • query (Union[pl.Expr, List[pl.Expr], str, Dict[str, list]], default = None) – Please refer to _apply_query()

  • plotting_engine (str, default = "plotly") – Please refer to get_engine()

  • subset (bool, default = True) – Whether to only keep a subset of columns for efficiency purposes. Refer to _available_columns() for the default list of columns.

  • drop_cols (Optional[list]) – Columns to exclude from reading

  • include_cols (Optional[list]) – Additional columns to include when reading

  • context_keys (list, default = ["Channel", "Direction", "Issue", "Group"]) – Which columns to use as context keys

  • extract_keys (bool, default = False) – Extra keys, most notably pyTreatment, can be embedded in the pyName column. Set extract_keys to True to expand those values into additional columns.

  • verbose (bool, default = False) – Whether to print out information during importing

  • **reading_opts – Additional parameters used while reading. Refer to pdstools.pega_io.File.import_file() for more info.

modelData

If available, holds the preprocessed data about the models

Type:

pl.LazyFrame

predictorData

If available, holds the preprocessed data about the predictor binning

Type:

pl.LazyFrame

combinedData

If both modelData and predictorData are available, holds the merged data about the models and predictors

Type:

pl.LazyFrame

import_strategy

See the import_strategy parameter

query

See the query parameter

context_keys

See the context_keys parameter

verbose

See the verbose parameter

Examples

>>> Data = ADMDatamart("/CDHSample")
>>> Data = ADMDatamart("Data/Adaptive Models & Predictors Export",
...     model_filename="Data-Decision-ADM-ModelSnapshot_AdaptiveModelSnapshotRepo20201110T085543_GMT/data.json",
...     predictor_filename="Data-Decision-ADM-PredictorBinningSnapshot_PredictorBinningSnapshotRepo20201110T084825_GMT/data.json")
>>> Data = ADMDatamart("Data/files",
...     model_filename="ModelData.csv",
...     predictor_filename="PredictorData.csv")
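
A lazy import with an upfront Polars filter is also possible; a hedged sketch (the folder name and channel value are illustrative):

>>> import polars as pl
>>> Data = ADMDatamart("Data/files",
...     import_strategy="lazy",
...     query=pl.col("Channel") == "Web")
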
property is_available: bool

Return type:

bool

standardChannelGroups = ['Web', 'Mobile', 'E-mail', 'Push', 'SMS', 'Retail', 'Call Center', 'IVR']

standardDirections = ['Inbound', 'Outbound']

NBAD_model_configurations
static get_engine(plotting_engine)

Which engine to use for creating the plots.

By supplying a custom class here, you can re-use the pdstools functions but create visualisations to your own specifications, in any library.

import_data(path: str | pathlib.Path | None = Path('.'), *, model_filename: str | None = 'modelData', predictor_filename: str | None = 'predictorData', model_df: pdstools.utils.types.any_frame | None = None, predictor_df: pdstools.utils.types.any_frame | None = None, subset: bool = True, drop_cols: list | None = None, include_cols: list | None = None, extract_keys: bool = False, verbose: bool = False, **reading_opts) Tuple[polars.LazyFrame | None, polars.LazyFrame | None]

Method to import & format the relevant data.

The method first imports the model data, and then the predictor data. If model_df or predictor_df is supplied, it will use those instead. If any filters are included in the query argument of the ADMDatamart, those are applied to the model data, and the predictor data is filtered so that it only contains the model IDs left over after filtering. After reading, some additional values (such as success rate) are automatically computed. Lastly, if columns are missing from both datasets, this is printed to the user if verbose is True.

Parameters:
  • path (Path) – The path of the data files. Default = current path ('.')

  • subset (bool, default = True) – Whether to only select the renamed columns; set to False to keep all columns

  • model_df (Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame]) – Optional override to supply a dataframe instead of a file

  • predictor_df (Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame]) – Optional override to supply a dataframe instead of a file

  • drop_cols (Optional[list]) – Columns to exclude from reading

  • include_cols (Optional[list]) – Additional columns to include when reading

  • extract_keys (bool, default = False) – Extra keys, most notably pyTreatment, can be embedded in the pyName column. Set extract_keys to True to expand those values into additional columns.

  • verbose (bool, default = False) – Whether to print out information during importing

  • model_filename (Optional[str]) – The name, or extended filepath, of the model file

  • predictor_filename (Optional[str]) – The name, or extended filepath, of the predictors file

Returns:

The model data and predictor binning data as LazyFrames

Return type:

(polars.LazyFrame, polars.LazyFrame)
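
A minimal usage sketch, assuming an already-constructed ADMDatamart instance dm and a hypothetical /CDHSample path:

>>> model_data, predictor_data = dm.import_data(
...     "/CDHSample",
...     model_filename="modelData",
...     predictor_filename="predictorData",
...     verbose=True)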

_import_utils(name: str | pdstools.utils.types.any_frame, path: str | None = None, *, subset: bool = True, extract_keys: bool = False, drop_cols: list | None = None, include_cols: list | None = None, **reading_opts) Tuple[polars.LazyFrame, dict, dict]

Handler function to interface to the cdh_utils methods

Parameters:
  • name (Union[str, pl.DataFrame]) – One of {modelData, predictorData} or a dataframe

  • path (str, default = None) – The path of the data file

  • subset (bool)

  • extract_keys (bool)

  • drop_cols (Optional[list])

  • include_cols (Optional[list])

Keyword Arguments:
  • subset (bool, default = True) – Whether to only select the renamed columns; set to False to keep all columns

  • drop_cols (list) – Supply columns to drop from the dataframe

  • include_cols (list) – Supply columns to include with the dataframe

  • extract_keys (bool) – Treatments are typically hidden within the pyName column; extract_keys can expand that cell to also show these values.

  • **reading_opts – Additional keyword arguments. See pdstools.pega_io.File.readDSExport().

Returns:

  • The requested dataframe

  • The renamed columns

  • The columns missing in both dataframes

Return type:

Tuple[pl.LazyFrame, dict, dict]


_available_columns(df: polars.LazyFrame, include_cols: list | None = None, drop_cols: list | None = None) Tuple[set, set]

Based on the default names for variables, rename available data to proper formatting

Parameters:
  • df (pl.LazyFrame) – Input dataframe

  • include_cols (list) – Supply columns to include with the dataframe

  • drop_cols (list) – Supply columns to not import at all

Returns:

The original dataframe, but renamed for the found columns; the original and updated names for all renamed columns; and the variables that were not found in the table

Return type:

Tuple[set, set]

_set_types(df: pdstools.utils.types.any_frame, table: str = 'infer', *, timestamp_fmt: str = None, strict_conversion: bool = True) pdstools.utils.types.any_frame

A method to change columns to their proper type

Parameters:
  • df (Union[pl.DataFrame, pl.LazyFrame]) – The input dataframe

  • table (str) – The table to set types for. Default is infer, in which case it infers the table type from the columns in it.

  • timestamp_fmt (str)

  • strict_conversion (bool)

Keyword Arguments:
  • timestamp_fmt (str) – The format of Date type columns

  • strict_conversion (bool) – Raises an error if timestamp conversion to the given/default date format (timestamp_fmt) fails. See https://strftime.org/ for timestamp formats.

Returns:

The input dataframe, but the proper typing applied

Return type:

Union[pl.DataFrame, pl.LazyFrame]

last(table='modelData', strategy: Literal[eager, lazy] = 'eager') pdstools.utils.types.any_frame

Convenience function to get the last values for a table

Parameters:
  • table (str, default = modelData) – Which table to get the last values for. One of {modelData, predictorData, combinedData}

  • strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory, or to scan it lazily. When the data fits into memory, 'eager' is typically more efficient; when it does not, the lazy methods typically still allow you to use the data.

Returns:

The last snapshot for each model

Return type:

Union[pl.DataFrame, pl.LazyFrame]
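
Usage sketch (dm is an assumed ADMDatamart instance):

>>> last_models = dm.last("modelData")                        # eager: returns a pl.DataFrame
>>> last_binning = dm.last("predictorData", strategy="lazy")  # stays a pl.LazyFrame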

static _last(df: pdstools.utils.types.any_frame) pdstools.utils.types.any_frame
Parameters:

df (pdstools.utils.types.any_frame)

Return type:

pdstools.utils.types.any_frame

static _last_timestamp(col: Literal[ResponseCount, Positives]) polars.Expr

Add a column indicating the last timestamp at which a column changed.

Parameters:

col (Literal['ResponseCount', 'Positives']) – The column to calculate the diff for

Return type:

polars.Expr

_get_combined_data(last=True, strategy: Literal[eager, lazy] = 'eager') pdstools.utils.types.any_frame

Combines the model data and predictor data into one dataframe.

Parameters:
  • last (bool, default=True) – Whether to only use the last snapshot for each table

  • strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory, or to scan it lazily. When the data fits into memory, 'eager' is typically more efficient; when it does not, the lazy methods typically still allow you to use the data.

Returns:

The combined dataframe

Return type:

Union[pl.DataFrame, pl.LazyFrame]

processTables(query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None) ADMDatamart

Processes modelData, predictorData and combinedData tables.

Can take in a query, which it will apply to modelData. If a query is given, predictorData is filtered to only retain the ModelIDs left in modelData after filtering. If both modelData and predictorData are present, they are joined together into combinedData.

If import_strategy is eager, which is the default, this method also collects the tables and then sets them back to lazy.

Parameters:

query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]], default = None) – An optional query to apply to the modelData table. See: _apply_query()

Return type:

ADMDatamart
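
For illustration, filtering out models without responses (a sketch; dm is an assumed ADMDatamart instance):

>>> import polars as pl
>>> dm = dm.processTables(query=pl.col("ResponseCount") > 0)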

save_data(path: str = '.') Tuple[os.PathLike, os.PathLike]

Cache modelData and predictorData to files.

Parameters:

path (str) – Where to place the files

Returns:

The paths to the model and predictor data files

Return type:

(os.PathLike, os.PathLike)
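
Usage sketch (dm is an assumed instance; the "cache" folder is hypothetical):

>>> model_path, predictor_path = dm.save_data("cache")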

_apply_query(df: pdstools.utils.types.any_frame, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None) polars.LazyFrame

Given an input Polars dataframe, filters it based on the input query.

Parameters:
  • df (Union[pl.DataFrame, pl.LazyFrame]) – The input dataframe

  • query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – If a Polars Expression, passes the expression into Polars’ filter function. If a list of Polars Expressions, applies each of the expressions as filters. If a string, uses the Pandas query function (works only in eager mode, not recommended). Else, a dict of lists where the key is column name in the dataframe and the corresponding value is a list of values to keep in the dataframe

Returns:

Filtered Polars LazyFrame

Return type:

pl.LazyFrame
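
The four accepted query forms, sketched with illustrative column values (dm is an assumed ADMDatamart instance):

>>> import polars as pl
>>> out = dm._apply_query(dm.modelData, pl.col("Positives") > 100)    # single expression
>>> out = dm._apply_query(dm.modelData,
...     [pl.col("Positives") > 100, pl.col("Channel") == "Web"])      # list of expressions
>>> out = dm._apply_query(dm.modelData, {"Channel": ["Web", "SMS"]})  # dict of values to keep
>>> out = dm._apply_query(dm.modelData, "Positives > 100")            # string query (eager mode only)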

discover_modelTypes(df: polars.LazyFrame, by: str = 'Configuration', allow_collect=False) Dict

Discovers the type of model embedded in the pyModelData column.

By default, we do a group_by Configuration, because a model rule can only contain one type of model. Then, for each configuration, we look into the pyModelData blob and find the _serialClass, returning it in a dict.

Parameters:
  • df (pl.LazyFrame) – The dataframe to search for model types

  • by (str) – The column to look for types in. Configuration is recommended.

  • allow_collect (bool, default = False) – Set to True to allow discovering modelTypes, even if in lazy strategy. It will fetch one modelData string per configuration.

Return type:

Dict

get_AGB_models(last: bool = False, by: str = 'Configuration', n_threads: int = 1, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, verbose: bool = True, **kwargs) Dict

Method to automatically extract AGB models.

Recommended to subset using the querying functionality to cut down on execution time, because the method checks each model ID individually. If only AGB models remain after the query, only proper AGB models will be returned.

Parameters:
  • last (bool, default = False) – Whether to only look at the last snapshot for each model

  • by (str, default = 'Configuration') – Which column to determine unique models with

  • n_threads (int, default = 1) – The number of threads to use for extracting the models. Since we use multithreading, setting this to a reasonable value helps speed up the import.

  • query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – Please refer to _apply_query()

  • verbose (bool, default = True) – Whether to print out information while importing

Return type:

Dict
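
Usage sketch ("MyAGBConfig" is a hypothetical configuration name; dm is an assumed ADMDatamart instance):

>>> import polars as pl
>>> agb_models = dm.get_AGB_models(
...     last=True,
...     query=pl.col("Configuration") == "MyAGBConfig",
...     n_threads=4)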

static _create_sign_df(df: polars.LazyFrame, by: str = 'Name', *, what: str = 'ResponseCount', every: str = '1d', pivot: bool = True, mask: bool = True) polars.LazyFrame

Generates a dataframe to show whether responses decreased or increased from day to day.

For a given dataframe where columns are dates and rows are model names (the by parameter), subtracts each day's value from the previous day's value per model, then masks the data: if a value increased (the desired situation) the cell gets 1, if unchanged 0, and if decreased -1. The resulting dataframe can then be used in a heatmap.

Parameters:
  • df (pl.LazyFrame) – This is typically pivoted ModelData

  • by (str, default = Name) – Column to calculate the daily change for.

  • what (str)

  • every (str)

  • pivot (bool)

  • mask (bool)

Keyword Arguments:
  • what (str, default = ResponseCount) – Column that contains response counts

  • every (str, default = 1d) – Interval of the change window

  • pivot (bool, default = True) – Returns a pivoted table with signs as values if set to True

  • mask (bool, default = True) – Drops SnapshotTime and returns the direction of change (sign).

Returns:

The dataframe with signs indicating day-to-day increases or decreases

Return type:

pl.LazyFrame

model_summary(by: str = 'ModelID', query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, **kwargs) polars.LazyFrame

Convenience method to automatically generate a summary over models

By default, it summarizes ResponseCount, Performance, SuccessRate & Positives by model ID. It also adds weighted means for Performance and SuccessRate, and adds the count and percentage of models without responses.

Parameters:
  • by (str, default = ModelID) – By what column to summarize the models

  • query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – Please refer to _apply_query()

Returns:

A grouped dataframe summarizing all models

Return type:

pl.LazyFrame
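
For example, to summarize per configuration rather than per model (a sketch; dm is an assumed ADMDatamart instance):

>>> summary = dm.model_summary(by="Configuration").collect()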

pivot_df(df: polars.LazyFrame, by: str | list = 'Name', *, allow_collect: bool = True, top_n: int = 0) polars.DataFrame

Simple function to extract pivoted information

Parameters:
  • df (pl.LazyFrame) – The input DataFrame.

  • by (Union[str, list], default = Name) – The column(s) to pivot the DataFrame by. If a list is provided, only the first element is used.

  • allow_collect (bool, default = True) – Whether to allow eager computation. If set to False and the import strategy is “lazy”, an error will be raised.

  • top_n (int, optional (default=0)) – The number of rows to include in the pivoted DataFrame. If set to 0, all rows are included.

Returns:

The pivoted DataFrame.

Return type:

pl.DataFrame
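
Usage sketch (dm is an assumed ADMDatamart instance):

>>> pivoted = dm.pivot_df(dm.modelData, by="Name", top_n=20)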

static response_gain_df(df: pdstools.utils.types.any_frame, by: str = 'Channel') pdstools.utils.types.any_frame

Simple function to extract the response gain per model

Parameters:
  • df (pdstools.utils.types.any_frame)

  • by (str)

Return type:

pdstools.utils.types.any_frame

models_by_positives_df(df: polars.LazyFrame, by: str = 'Channel', allow_collect=True) polars.LazyFrame

Computes statistics on the dataframe by grouping it by a given column (by) and computing the count of unique ModelIDs and the cumulative percentage of unique models with regard to the number of positive answers.

Parameters:
  • df (pl.LazyFrame) – The input DataFrame

  • by (str, default = Channel) – The column name to group the DataFrame by, by default “Channel”

  • allow_collect (bool, default = True) – Whether to allow eager computation. If set to False and the import strategy is “lazy”, an error will be raised.

Returns:

DataFrame with PositivesBin column and model count statistics

Return type:

pl.LazyFrame

get_model_stats(last: bool = True) dict

Returns a dictionary containing various statistics for the model data.

Parameters:

last (bool) – Whether to compute statistics only on the last snapshot. Defaults to True.

Returns:

A dictionary containing the following keys:

  • 'models_n_snapshots': The number of distinct snapshot times in the data.

  • 'models_total': The total number of models in the data.

  • 'models_empty': The models with no responses.

  • 'models_nopositives': The models with responses but no positive responses.

  • 'models_isimmature': The models with less than 200 positive responses.

  • 'models_noperformance': The models with at least 200 positive responses but a performance of 50.

  • 'models_n_nonperforming': The total number of models that are not performing well.

  • 'models_missing_{key}': The number of models with missing values for each context key.

  • 'models_bottom_left': The models with a performance of 50 and a success rate of 0.

Return type:

Dict
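
Usage sketch (dm is an assumed ADMDatamart instance):

>>> stats = dm.get_model_stats(last=True)
>>> stats["models_total"]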

describe_models(**kwargs) NoReturn

Convenience method to quickly summarize the models

Return type:

NoReturn

applyGlobalQuery(query: polars.Expr | List[polars.Expr] | str | Dict[str, list]) ADMDatamart

Convenience method to further query the datamart

It’s possible to give this query to the initial ADMDatamart class directly, but this method is more explicit. Filters on the model data (query is put in a polars.filter() method), filters the predictorData on the ModelIDs remaining after the query, and recomputes combinedData.

Only works with Polars expressions.

Parameters:

query (Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]) – The query to apply, see _apply_query()

Return type:

ADMDatamart
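
For example, to restrict the datamart to two channels (a sketch; dm is an assumed ADMDatamart instance):

>>> import polars as pl
>>> dm = dm.applyGlobalQuery(pl.col("Channel").is_in(["Web", "Mobile"]))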

fillMissing() ADMDatamart

Convenience method to fill missing values

  • Fills categorical, string and null type columns with “NA”

  • Fills SuccessRate, Performance and ResponseCount columns with 0

  • When context keys have empty string values, replaces them with the "NA" string

Return type:

ADMDatamart
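
Usage sketch (dm is an assumed ADMDatamart instance):

>>> dm = dm.fillMissing()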

summary_by_channel(custom_channels: Dict[str, str] = None, keep_lists: bool = False)
Parameters:
  • custom_channels (Dict[str, str])

  • keep_lists (bool)

overall_summary(custom_channels: Dict[str, str] = None)
Parameters:

custom_channels (Dict[str, str])

generateReport(name: str | None = None, working_dir: pathlib.Path = Path('.'), *, modelid: str | None = '', delete_temp_files: bool = True, output_type: str = 'html', allow_collect: bool = True, cached_data: bool = False, predictordetails_activeonly: bool = False, **kwargs)

Generates a report based on the provided parameters. If modelid is provided, a model report will be generated. If not, an overall HealthCheck report will be generated.

Parameters:
  • name (Optional[str], default = None) – The name of the report.

  • working_dir (Path, default = Path(".")) – The working directory. Cached files will be written here.

  • modelid (Optional[str])

  • delete_temp_files (bool)

  • output_type (str)

  • allow_collect (bool)

  • cached_data (bool)

  • predictordetails_activeonly (bool)

Keyword Arguments:
  • modelid (Optional[str], default = "") – The model ID to generate a report for

  • delete_temp_files (bool, default = True) – Whether to delete temporary files.

  • output_type (str, default = "html") – The type of the output file.

  • allow_collect (bool, default = True) – Whether to allow collection of data.

  • cached_data (bool, default = False) – Whether to use cached data.

  • del_cache (bool, default = True) – Whether to delete cache.

  • predictordetails_activeonly (bool, default = False) – Whether to only include active predictor details.

  • **kwargs – Additional keyword arguments.
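
Usage sketch (the model ID is hypothetical; dm is an assumed ADMDatamart instance):

>>> from pathlib import Path
>>> dm.generateReport(name="HealthCheck", working_dir=Path("reports"))  # overall HealthCheck report
>>> dm.generateReport(name="MyModelReport", modelid="abc-123-def")      # single-model report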

exportTables(file: pathlib.Path = 'Tables.xlsx', predictorBinning=False)

Exports all tables from pdstools.adm.Tables into one Excel file.

Parameters:
  • file (Path, default = 'Tables.xlsx') – The file name of the exported Excel file

  • predictorBinning (bool, default = False) – If False, the 'predictorbinning' table will not be created
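
Usage sketch (dm is an assumed ADMDatamart instance):

>>> dm.exportTables("Tables.xlsx", predictorBinning=True)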