pdstools.adm.ADMDatamart
Module Contents
Classes
Main class for importing, preprocessing and structuring Pega ADM Datamart.
- class ADMDatamart(path: str | pathlib.Path = Path('.'), import_strategy: Literal[eager, lazy] = 'eager', *, model_filename: str | None = 'modelData', predictor_filename: str | None = 'predictorData', model_df: pdstools.utils.types.any_frame | None = None, predictor_df: pdstools.utils.types.any_frame | None = None, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, subset: bool = True, drop_cols: list | None = None, include_cols: list | None = None, context_keys: list = ['Channel', 'Direction', 'Issue', 'Group'], extract_keys: bool = False, predictorCategorization: polars.Expr = cdh_utils.defaultPredictorCategorization, plotting_engine: str | Any = 'plotly', verbose: bool = False, **reading_opts)¶
Bases: pdstools.plots.plot_base.Plots, pdstools.adm.Tables.Tables
Main class for importing, preprocessing and structuring Pega ADM Datamart. Gets all available data, names it properly, and merges it into one main dataframe.
It’s also possible to import directly from S3. Please refer to
pdstools.pega_io.S3.S3Data.get_ADMDatamart().
- Parameters:
path (str, default = ".") – The path of the data files
import_strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory or to scan the file. When the data fits into memory, ‘eager’ is typically more efficient. When it does not, the lazy methods typically still allow you to use the data.
model_filename (Optional[str])
predictor_filename (Optional[str])
model_df (Optional[pdstools.utils.types.any_frame])
predictor_df (Optional[pdstools.utils.types.any_frame])
query (Optional[Union[polars.Expr, List[polars.Expr], str, Dict[str, list]]])
subset (bool)
drop_cols (Optional[list])
include_cols (Optional[list])
context_keys (list)
extract_keys (bool)
predictorCategorization (polars.Expr)
plotting_engine (Union[str, Any])
verbose (bool)
- Keyword Arguments:
model_filename (Optional[str]) – The name, or extended filepath, towards the model file
predictor_filename (Optional[str]) – The name, or extended filepath, towards the predictors file
model_df (Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame]) – Optional override to supply a dataframe instead of a file
predictor_df (Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame]) – Optional override to supply a dataframe instead of a file
query (Union[pl.Expr, str, Dict[str, list]], default = None) – Please refer to
_apply_query()
plotting_engine (str, default = "plotly") – Please refer to
get_engine()
subset (bool, default = True) – Whether to keep only a subset of columns for efficiency. Refer to
_available_columns()
for the default list of columns.
drop_cols (Optional[list]) – Columns to exclude from reading
include_cols (Optional[list]) – Additional columns to include when reading
context_keys (list, default = ["Channel", "Direction", "Issue", "Group"]) – Which columns to use as context keys
extract_keys (bool, default = False) – Extra keys, particularly pyTreatment, are hidden within the pyName column. extract_keys can expand that cell to also show these values. To extract these extra keys, set extract_keys to True.
verbose (bool, default = False) – Whether to print out information during importing
**reading_opts – Additional parameters used while reading. Refer to
pdstools.pega_io.File.import_file()
for more info.
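The effect of extract_keys can be pictured with a small plain-Python sketch. It assumes that a pyName cell holds a JSON object when extra keys are present and a plain name otherwise; that encoding, the helper name expand_name and the sample values are illustrative assumptions, not the pdstools implementation.

```python
import json


def expand_name(cell: str) -> dict:
    """Split a pyName cell into its keys (hypothetical helper)."""
    # Assumption: cells carrying extra keys are JSON objects such as
    # {"pyName": "...", "pyTreatment": "..."}; anything else is a plain name.
    if cell.lstrip().startswith("{"):
        try:
            parsed = json.loads(cell)
        except json.JSONDecodeError:
            return {"pyName": cell}
        if isinstance(parsed, dict):
            return parsed
    return {"pyName": cell}


plain = expand_name("BasicOffer")
nested = expand_name('{"pyName": "BasicOffer", "pyTreatment": "WebBanner"}')
```

With extract_keys left at False, the cell would stay as a single string; the sketch only shows what expanding it into separate columns means.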
- modelData¶
If available, holds the preprocessed data about the models
- Type:
pl.LazyFrame
- predictorData¶
If available, holds the preprocessed data about the predictor binning
- Type:
pl.LazyFrame
- combinedData¶
If both modelData and predictorData are available, holds the merged data about the models and predictors
- Type:
pl.LazyFrame
- import_strategy¶
See the import_strategy parameter
- query¶
See the query parameter
- context_keys¶
See the context_keys parameter
- verbose¶
See the verbose parameter
Examples
>>> Data = ADMDatamart("/CDHSample")
>>> Data = ADMDatamart(
...     "Data/Adaptive Models & Predictors Export",
...     model_filename = "Data-Decision-ADM-ModelSnapshot_AdaptiveModelSnapshotRepo20201110T085543_GMT/data.json",
...     predictor_filename = "Data-Decision-ADM-PredictorBinningSnapshot_PredictorBinningSnapshotRepo20201110T084825_GMT/data.json",
... )
>>> Data = ADMDatamart("Data/files", model_filename = "ModelData.csv", predictor_filename = "PredictorData.csv")
- property is_available: bool¶
- Return type:
bool
- standardChannelGroups = ['Web', 'Mobile', 'E-mail', 'Push', 'SMS', 'Retail', 'Call Center', 'IVR']¶
- standardDirections = ['Inbound', 'Outbound']¶
- NBAD_model_configurations¶
- static get_engine(plotting_engine)¶
Which engine to use for creating the plots.
By supplying a custom class here, you can re-use the pdstools functions but create visualisations to your own specifications, in any library.
- import_data(path: str | pathlib.Path | None = Path('.'), *, model_filename: str | None = 'modelData', predictor_filename: str | None = 'predictorData', model_df: pdstools.utils.types.any_frame | None = None, predictor_df: pdstools.utils.types.any_frame | None = None, subset: bool = True, drop_cols: list | None = None, include_cols: list | None = None, extract_keys: bool = False, verbose: bool = False, **reading_opts) Tuple[polars.LazyFrame | None, polars.LazyFrame | None] ¶
Method to import & format the relevant data.
The method first imports the model data, and then the predictor data. If model_df or predictor_df is supplied, it uses those instead. If any filters are included in the query argument of the ADMDatamart, those are applied to the model data, and the predictor data is filtered so that it contains only the model IDs left over after filtering. After reading, some additional values (such as success rate) are computed automatically. Lastly, if any columns are missing from both datasets, this is printed to the user if verbose is True.
- Parameters:
path (Path) – The path of the data files. Default = current path (‘.’)
subset (bool, default = True) – Whether to only select the renamed columns, set to False to keep all columns
model_df (pd.DataFrame) – Optional override to supply a dataframe instead of a file
predictor_df (pd.DataFrame) – Optional override to supply a dataframe instead of a file
drop_cols (Optional[list]) – Columns to exclude from reading
include_cols (Optional[list]) – Additional columns to include when reading
extract_keys (bool, default = False) – Extra keys, particularly pyTreatment, are hidden within the pyName column. extract_keys can expand that cell to also show these values. To extract these extra keys, set extract_keys to True.
verbose (bool, default = False) – Whether to print out information during importing
model_filename (Optional[str])
predictor_filename (Optional[str])
- Returns:
The model data and predictor binning data as LazyFrames
- Return type:
(polars.LazyFrame, polars.LazyFrame)
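The order of operations described for import_data can be sketched in plain Python. The rows, the filter on Channel and the success-rate formula (Positives divided by ResponseCount, 0 when there are no responses) are assumptions made for illustration; the sketch does not call pdstools.

```python
models = [
    {"ModelID": "m1", "Channel": "Web", "Positives": 20, "ResponseCount": 400},
    {"ModelID": "m2", "Channel": "SMS", "Positives": 0, "ResponseCount": 0},
    {"ModelID": "m3", "Channel": "Web", "Positives": 5, "ResponseCount": 50},
]
predictors = [
    {"ModelID": "m1", "PredictorName": "Age"},
    {"ModelID": "m2", "PredictorName": "Age"},
    {"ModelID": "m3", "PredictorName": "Income"},
]

# Step 1: apply the query to the model data only.
kept_models = [row for row in models if row["Channel"] == "Web"]

# Step 2: keep predictor rows whose ModelID survived the filter.
kept_ids = {row["ModelID"] for row in kept_models}
kept_predictors = [row for row in predictors if row["ModelID"] in kept_ids]

# Step 3: derive a success rate per model. Treating it as
# Positives / ResponseCount is an assumption of this sketch.
for row in kept_models:
    responses = row["ResponseCount"]
    row["SuccessRate"] = row["Positives"] / responses if responses else 0.0
```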
- _import_utils(name: str | pdstools.utils.types.any_frame, path: str | None = None, *, subset: bool = True, extract_keys: bool = False, drop_cols: list | None = None, include_cols: list | None = None, **reading_opts) Tuple[polars.LazyFrame, dict, dict] ¶
Handler function to interface to the cdh_utils methods
- Parameters:
name (Union[str, pl.DataFrame]) – One of {modelData, predictorData} or a dataframe
path (str, default = None) – The path of the data file
subset (bool)
extract_keys (bool)
drop_cols (Optional[list])
include_cols (Optional[list])
- Keyword Arguments:
subset (bool, default = True) – Whether to only select the renamed columns, set to False to keep all columns
drop_cols (list) – Supply columns to drop from the dataframe
include_cols (list) – Supply columns to include with the dataframe
extract_keys (bool) – Treatments are typically hidden within the pyName column, extract_keys can expand that cell to also show these values.
**reading_opts – Additional keyword arguments. See
pdstools.pega_io.File.readDSExport()
- Returns:
The requested dataframe, the renamed columns, and the columns missing in both dataframes
- Return type:
(pl.LazyFrame, dict, dict)
- _available_columns(df: polars.LazyFrame, include_cols: list | None = None, drop_cols: list | None = None) Tuple[set, set] ¶
Based on the default names for variables, rename available data to proper formatting
- Parameters:
df (pl.LazyFrame) – Input dataframe
include_cols (list) – Supply columns to include with the dataframe
drop_cols (list) – Supply columns to not import at all
- Returns:
The original dataframe, but renamed for the found columns & The original and updated names for all renamed columns & The variables that were not found in the table
- Return type:
Tuple[set, set]
- _set_types(df: pdstools.utils.types.any_frame, table: str = 'infer', *, timestamp_fmt: str = None, strict_conversion: bool = True) pdstools.utils.types.any_frame ¶
A method to change columns to their proper type
- Parameters:
df (Union[pl.DataFrame, pl.LazyFrame]) – The input dataframe
table (str) – The table to set types for. Default is infer, in which case it infers the table type from the columns in it.
timestamp_fmt (str)
strict_conversion (bool)
- Keyword Arguments:
timestamp_fmt (str) – The format of Date type columns. See ‘https://strftime.org/’ for timestamp formats
strict_conversion (bool) – Raises an error if timestamp conversion to the given/default date format (timestamp_fmt) fails
- Returns:
The input dataframe, but with the proper typing applied
- Return type:
Union[pl.DataFrame, pl.LazyFrame]
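How timestamp_fmt and strict_conversion interact can be illustrated with the standard library. This is a minimal sketch, not the pdstools code: the helper parse_snapshot, the format string and the choice to return None on a non-strict failure are all assumptions.

```python
from datetime import datetime
from typing import Optional


def parse_snapshot(
    value: str, timestamp_fmt: str, strict_conversion: bool = True
) -> Optional[datetime]:
    """Parse one timestamp string with a strftime-style format (hypothetical helper)."""
    try:
        return datetime.strptime(value, timestamp_fmt)
    except ValueError:
        if strict_conversion:
            # Strict mode: surface the conversion failure to the caller.
            raise
        # Non-strict mode (assumed behaviour): represent the failure as a missing value.
        return None


fmt = "%Y%m%dT%H%M%S"  # hypothetical format, chosen for this example only
ok = parse_snapshot("20201110T085543", fmt)
lenient = parse_snapshot("not a timestamp", fmt, strict_conversion=False)
```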
- last(table='modelData', strategy: Literal[eager, lazy] = 'eager') pdstools.utils.types.any_frame ¶
Convenience function to get the last values for a table
- Parameters:
table (str, default = modelData) – Which table to get the last values for. One of {modelData, predictorData, combinedData}
strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory or to scan the file. When the data fits into memory, ‘eager’ is typically more efficient. When it does not, the lazy methods typically still allow you to use the data.
- Returns:
The last snapshot for each model
- Return type:
Union[pl.DataFrame, pl.LazyFrame]
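Taking "the last snapshot for each model" amounts to keeping, per model, the row with the most recent snapshot time. A minimal plain-Python sketch with made-up rows follows; the use of ISO date strings is an assumption of the example.

```python
snapshots = [
    {"ModelID": "m1", "SnapshotTime": "2024-01-01", "ResponseCount": 100},
    {"ModelID": "m1", "SnapshotTime": "2024-01-02", "ResponseCount": 150},
    {"ModelID": "m2", "SnapshotTime": "2024-01-01", "ResponseCount": 30},
]

latest = {}
for row in snapshots:
    current = latest.get(row["ModelID"])
    # ISO-formatted date strings sort chronologically, so string comparison is enough here.
    if current is None or row["SnapshotTime"] > current["SnapshotTime"]:
        latest[row["ModelID"]] = row

last_rows = list(latest.values())
```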
- static _last(df: pdstools.utils.types.any_frame) pdstools.utils.types.any_frame ¶
- Parameters:
df (pdstools.utils.types.any_frame)
- Return type:
pdstools.utils.types.any_frame
- static _last_timestamp(col: Literal[ResponseCount, Positives]) polars.Expr ¶
Add a column to indicate the last timestamp a column has changed.
- Parameters:
col (Literal['ResponseCount', 'Positives']) – The column to calculate the diff for
- Return type:
polars.Expr
- _get_combined_data(last=True, strategy: Literal[eager, lazy] = 'eager') pdstools.utils.types.any_frame ¶
Combines the model data and predictor data into one dataframe.
- Parameters:
last (bool, default=True) – Whether to only use the last snapshot for each table
strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory or to scan the file. When the data fits into memory, ‘eager’ is typically more efficient. When it does not, the lazy methods typically still allow you to use the data.
- Returns:
The combined dataframe
- Return type:
Union[pl.DataFrame, pl.LazyFrame]
- processTables(query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None) ADMDatamart ¶
Processes modelData, predictorData and combinedData tables.
Can take in a query, which it applies to modelData. If a query is given, it joins predictorData so that only the ModelIDs that modelData was filtered on are retained. If both modelData and predictorData are present, it joins them together into combinedData.
If import_strategy is eager, which is the default, this method also collects the tables and then sets them back to lazy.
- Parameters:
query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]], default = None) – An optional query to apply to the modelData table. See:
_apply_query()
- Return type:
ADMDatamart
- save_data(path: str = '.') Tuple[os.PathLike, os.PathLike] ¶
Cache modelData and predictorData to files.
- Parameters:
path (str) – Where to place the files
- Returns:
The paths to the model and predictor data files
- Return type:
(os.PathLike, os.PathLike)
- _apply_query(df: pdstools.utils.types.any_frame, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None) polars.LazyFrame ¶
Given an input Polars dataframe, it filters the dataframe based on input query
- Parameters:
df (Union[pl.DataFrame, pl.LazyFrame]) – The input dataframe
query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – If a Polars Expression, passes the expression into Polars’ filter function. If a list of Polars Expressions, applies each of the expressions as filters. If a string, uses the Pandas query function (works only in eager mode, not recommended). Else, a dict of lists where the key is column name in the dataframe and the corresponding value is a list of values to keep in the dataframe
- Returns:
Filtered Polars DataFrame
- Return type:
pl.DataFrame
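The dict form of the query (column name mapped to a list of values to keep) can be sketched in plain Python. The helper filter_rows is hypothetical, and combining several keys with a logical AND is an assumption of this sketch rather than documented behaviour.

```python
from typing import Dict, List


def filter_rows(rows: List[dict], query: Dict[str, list]) -> List[dict]:
    """Keep rows whose value in every queried column is in the allowed list."""
    return [
        row
        for row in rows
        # Assumption: all column conditions must hold at the same time.
        if all(row.get(column) in allowed for column, allowed in query.items())
    ]


rows = [
    {"ModelID": "m1", "Channel": "Web", "Direction": "Inbound"},
    {"ModelID": "m2", "Channel": "SMS", "Direction": "Outbound"},
    {"ModelID": "m3", "Channel": "Web", "Direction": "Outbound"},
]

web_only = filter_rows(rows, {"Channel": ["Web"]})
web_inbound = filter_rows(rows, {"Channel": ["Web", "Mobile"], "Direction": ["Inbound"]})
```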
- discover_modelTypes(df: polars.LazyFrame, by: str = 'Configuration', allow_collect=False) Dict ¶
Discovers the type of model embedded in the pyModelData column.
By default, we do a group_by Configuration, because a model rule can only contain one type of model. Then, for each configuration, we look into the pyModelData blob and find the _serialClass, returning it in a dict.
- Parameters:
df (pl.LazyFrame) – The dataframe to search for model types
by (str) – The column to look for types in. Configuration is recommended.
allow_collect (bool, default = False) – Set to True to allow discovering modelTypes, even if in lazy strategy. It will fetch one modelData string per configuration.
- Return type:
Dict
- get_AGB_models(last: bool = False, by: str = 'Configuration', n_threads: int = 1, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, verbose: bool = True, **kwargs) Dict ¶
Method to automatically extract AGB models.
Recommended to subset using the querying functionality to cut down on execution time, because it checks for each model ID. If you only have AGB models remaining after the query, it will only return proper AGB models.
- Parameters:
last (bool, default = False) – Whether to only look at the last snapshot for each model
by (str, default = 'Configuration') – Which column to determine unique models with
n_threads (int, default = 1) – The number of threads to use for extracting the models. Since we use multithreading, setting this to a reasonable value helps speed up the import.
query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – Please refer to
_apply_query()
verbose (bool, default = True) – Whether to print out information while importing
- Return type:
Dict
- static _create_sign_df(df: polars.LazyFrame, by: str = 'Name', *, what: str = 'ResponseCount', every: str = '1d', pivot: bool = True, mask: bool = True) polars.LazyFrame ¶
Generates dataframe to show whether responses decreased/increased from day to day
For a given dataframe where columns are dates and rows are model names (the by parameter), subtracts the previous day’s value from each day’s value per model, then masks the data. If the value increased (the desired situation), it puts 1 in the cell; if there was no change, it puts 0; and if it decreased, it puts -1. This dataframe can then be used in the heatmap.
- Parameters:
df (pl.LazyFrame) – This is typically pivoted ModelData
by (str, default = Name) – Column to calculate the daily change for.
what (str)
every (str)
pivot (bool)
mask (bool)
- Keyword Arguments:
what (str, default = ResponseCount) – Column that contains response counts
every (str, default = 1d) – Interval of the change window
pivot (bool, default = True) – Returns a pivoted table with signs as values if set to True
mask (bool, default = True) – Drops SnapshotTime and returns direction of change(sign).
- Returns:
The dataframe with signs for increase or decrease in day to day
- Return type:
pl.LazyFrame
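The masking step described above reduces each day-to-day difference to its sign. A minimal plain-Python sketch with invented response counts (the model names and numbers are illustrative only):

```python
def sign(delta: int) -> int:
    """Return 1 for an increase, 0 for no change and -1 for a decrease."""
    return (delta > 0) - (delta < 0)


# One list of daily response counts per model, ordered by day.
daily_responses = {
    "OfferA": [100, 120, 120, 90],
    "OfferB": [10, 10, 25, 40],
}

# Compare every day with the day before it.
signs = {
    name: [sign(today - yesterday) for yesterday, today in zip(counts, counts[1:])]
    for name, counts in daily_responses.items()
}
```

Each model ends up with one fewer value than it has days, because the first day has no previous day to compare against.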
- model_summary(by: str = 'ModelID', query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, **kwargs) polars.LazyFrame ¶
Convenience method to automatically generate a summary over models
By default, it summarizes ResponseCount, Performance, SuccessRate & Positives by model ID. It also adds weighted means for Performance and SuccessRate, as well as the count and percentage of models without responses.
- Parameters:
by (str, default = ModelID) – By what column to summarize the models
query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – Please refer to
_apply_query()
- Returns:
group_by dataframe over all models
- Return type:
pl.LazyFrame
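The kind of summary described here can be sketched in plain Python. Weighting Performance by ResponseCount, the output key names and the sample rows are assumptions of this sketch; the grouping column is set to Channel only to make the groups visible.

```python
from collections import defaultdict

models = [
    {"Channel": "Web", "Performance": 60.0, "ResponseCount": 300},
    {"Channel": "Web", "Performance": 70.0, "ResponseCount": 100},
    {"Channel": "SMS", "Performance": 55.0, "ResponseCount": 0},
]

# Group the rows by the chosen column.
groups = defaultdict(list)
for row in models:
    groups[row["Channel"]].append(row)

summary = {}
for channel, rows in groups.items():
    total = sum(r["ResponseCount"] for r in rows)
    empty = sum(1 for r in rows if r["ResponseCount"] == 0)
    summary[channel] = {
        "ResponseCount": total,
        # Weighted mean (assumed weights: ResponseCount); undefined without responses.
        "Performance_weighted": (
            sum(r["Performance"] * r["ResponseCount"] for r in rows) / total
            if total
            else None
        ),
        "models_without_responses": empty,
        "pct_without_responses": 100 * empty / len(rows),
    }
```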
- pivot_df(df: polars.LazyFrame, by: str | list = 'Name', *, allow_collect: bool = True, top_n: int = 0) polars.DataFrame ¶
Simple function to extract pivoted information
- Parameters:
df (pl.LazyFrame) – The input DataFrame.
by (Union[str, list], default = Name) – The column(s) to pivot the DataFrame by. If a list is provided, only the first element is used.
allow_collect (bool, default = True) – Whether to allow eager computation. If set to False and the import strategy is “lazy”, an error will be raised.
top_n (int, optional (default=0)) – The number of rows to include in the pivoted DataFrame. If set to 0, all rows are included.
- Returns:
The pivoted DataFrame.
- Return type:
pl.DataFrame
- static response_gain_df(df: pdstools.utils.types.any_frame, by: str = 'Channel') pdstools.utils.types.any_frame ¶
Simple function to extract the response gain per model
- Parameters:
df (pdstools.utils.types.any_frame)
by (str)
- Return type:
pdstools.utils.types.any_frame
- models_by_positives_df(df: polars.LazyFrame, by: str = 'Channel', allow_collect=True) polars.LazyFrame ¶
Computes statistics on the dataframe by grouping it by the given column and computing the count of unique ModelIDs and the cumulative percentage of unique models with regard to the number of positive answers.
- Parameters:
df (pl.LazyFrame) – The input DataFrame
by (str, default = Channel) – The column name to group the DataFrame by, by default “Channel”
allow_collect (bool, default = True) – Whether to allow eager computation. If set to False and the import strategy is “lazy”, an error will be raised.
- Returns:
DataFrame with PositivesBin column and model count statistics
- Return type:
pl.LazyFrame
- get_model_stats(last: bool = True) dict ¶
Returns a dictionary containing various statistics for the model data.
- Parameters:
last (bool) – Whether to compute statistics only on the last snapshot. Defaults to True.
- Returns:
A dictionary containing the following keys:
‘models_n_snapshots’: The number of distinct snapshot times in the data.
‘models_total’: The total number of models in the data.
‘models_empty’: The models with no responses.
‘models_nopositives’: The models with responses but no positive responses.
‘models_isimmature’: The models with less than 200 positive responses.
‘models_noperformance’: The models with at least 200 positive responses but a performance of 50.
‘models_n_nonperforming’: The total number of models that are not performing well.
‘models_missing_{key}’: The number of models with missing values for each context key.
‘models_bottom_left’: The models with a performance of 50 and a success rate of 0.
- Return type:
Dict
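The category thresholds listed above can be made concrete with a small classifier. This is a sketch in plain Python: the function classify, the label "performing" and the decision to treat the categories as mutually exclusive and check them in this order are assumptions, not the pdstools logic.

```python
def classify(model: dict) -> str:
    """Assign one model to a single category (hypothetical helper)."""
    if model["ResponseCount"] == 0:
        return "empty"
    if model["Positives"] == 0:
        return "nopositives"
    if model["Positives"] < 200:
        return "isimmature"
    if model["Performance"] == 50:
        return "noperformance"
    return "performing"


models = [
    {"ModelID": "m1", "ResponseCount": 0, "Positives": 0, "Performance": 50},
    {"ModelID": "m2", "ResponseCount": 500, "Positives": 0, "Performance": 50},
    {"ModelID": "m3", "ResponseCount": 900, "Positives": 120, "Performance": 58},
    {"ModelID": "m4", "ResponseCount": 5000, "Positives": 250, "Performance": 50},
    {"ModelID": "m5", "ResponseCount": 8000, "Positives": 600, "Performance": 71},
]

labels = {m["ModelID"]: classify(m) for m in models}
```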
- describe_models(**kwargs) NoReturn ¶
Convenience method to quickly summarize the models
- Return type:
NoReturn
- applyGlobalQuery(query: polars.Expr | List[polars.Expr] | str | Dict[str, list]) ADMDatamart ¶
Convenience method to further query the datamart
It’s possible to give this query to the initial ADMDatamart class directly, but this method is more explicit. Filters on the model data (query is put in a
polars.filter()
method), filters the predictorData on the ModelIDs remaining after the query, and recomputes combinedData. Only works with Polars expressions.
- Parameters:
query (Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]) – The query to apply, see
_apply_query()
- Return type:
ADMDatamart
- fillMissing() ADMDatamart ¶
Convenience method to fill missing values
Fills categorical, string and null type columns with “NA”
Fills SuccessRate, Performance and ResponseCount columns with 0
When context keys have empty string values, replaces them with the “NA” string
- Return type:
ADMDatamart
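The fill rules listed for fillMissing can be sketched on a single row in plain Python. The helper fill_missing is hypothetical, and treating every other missing value as belonging to a string-like column is a simplification made for this example.

```python
NUMERIC_FILL = ("SuccessRate", "Performance", "ResponseCount")
CONTEXT_KEYS = ("Channel", "Direction", "Issue", "Group")


def fill_missing(row: dict) -> dict:
    """Apply the fill rules to one row (hypothetical helper)."""
    filled = {}
    for column, value in row.items():
        if column in NUMERIC_FILL:
            # Numeric metrics: missing values become 0.
            filled[column] = 0 if value is None else value
        elif value is None or (column in CONTEXT_KEYS and value == ""):
            # Missing values, and empty context keys, become the string "NA".
            filled[column] = "NA"
        else:
            filled[column] = value
    return filled


row = {"Channel": "", "Issue": None, "Name": "OfferA", "Performance": None, "ResponseCount": 12}
result = fill_missing(row)
```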
- summary_by_channel(custom_channels: Dict[str, str] = None, keep_lists: bool = False)¶
- Parameters:
custom_channels (Dict[str, str])
keep_lists (bool)
- overall_summary(custom_channels: Dict[str, str] = None)¶
- Parameters:
custom_channels (Dict[str, str])
- generateReport(name: str | None = None, working_dir: pathlib.Path = Path('.'), *, modelid: str | None = '', delete_temp_files: bool = True, output_type: str = 'html', allow_collect: bool = True, cached_data: bool = False, predictordetails_activeonly: bool = False, **kwargs)¶
Generates a report based on the provided parameters. If modelid is provided, a model report will be generated. If not, an overall HealthCheck report will be generated.
- Parameters:
name (Optional[str], default = None) – The name of the report.
working_dir (Path, default = Path(".")) – The working directory. Cached files will be written here.
modelid (Optional[str])
delete_temp_files (bool)
output_type (str)
allow_collect (bool)
cached_data (bool)
predictordetails_activeonly (bool)
- Keyword Arguments:
modelid (Optional[str], default = "") – The model ID.
delete_temp_files (bool, default = True) – Whether to delete temporary files.
output_type (str, default = "html") – The type of the output file.
allow_collect (bool, default = True) – Whether to allow collection of data.
cached_data (bool, default = False) – Whether to use cached data.
del_cache (bool, default = True) – Whether to delete cache.
predictordetails_activeonly (bool, default = False) – Whether to only include active predictor details.
**kwargs – Additional keyword arguments.
- exportTables(file: pathlib.Path = 'Tables.xlsx', predictorBinning=False)¶
Exports all tables from pdstools.adm.Tables into one Excel file.
- Parameters:
file (Path, default = 'Tables.xlsx') – The file name of the exported Excel file
predictorBinning (bool, default = False) – If False, the ‘predictorbinning’ table will not be created