pdstools
Python pdstools
Subpackages¶
pdstools.adm
pdstools.app
pdstools.ih
pdstools.pega_io
pdstools.plots
pdstools.prediction
pdstools.reports
pdstools.utils
pdstools.utils.CDHLimits
pdstools.utils.NBAD
pdstools.utils.cdh_utils
pdstools.utils.datasets
pdstools.utils.errors
pdstools.utils.hds_utils
pdstools.utils.hds_utils_experimental
pdstools.utils.pega_template
pdstools.utils.polars_ext
pdstools.utils.show_versions
pdstools.utils.streamlit_utils
pdstools.utils.table_definitions
pdstools.utils.types
pdstools.valuefinder
Package Contents¶
Classes¶
- ADMDatamart – Main class for importing, preprocessing and structuring Pega ADM Datamart.
- BinAggregator – A class to generate rolled up insights from ADM predictor binning.
- CDHLimits – A singleton container for best practice limits for CDH.
- Config – Configuration file for the data anonymizer.
- DataAnonymization – Anonymize a historical dataset.
- ValueFinder – Class to analyze Value Finder datasets.
Functions¶
- get_token – Get API credentials to a Pega Platform instance.
- readDSExport – Read a Pega dataset export file.
- setupAzureOpenAI – Convenience function to automagically set up Azure AD-based authentication.
- defaultPredictorCategorization – Function to determine the ‘category’ of a predictor.
- show_versions – Print out version of pdstools and dependencies to stdout.
Attributes¶
- __version__ = '3.4.3'¶
- class ADMDatamart(path: str | pathlib.Path = Path('.'), import_strategy: Literal[eager, lazy] = 'eager', *, model_filename: str | None = 'modelData', predictor_filename: str | None = 'predictorData', model_df: pdstools.utils.types.any_frame | None = None, predictor_df: pdstools.utils.types.any_frame | None = None, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, subset: bool = True, drop_cols: list | None = None, include_cols: list | None = None, context_keys: list = ['Channel', 'Direction', 'Issue', 'Group'], extract_keys: bool = False, predictorCategorization: polars.Expr = cdh_utils.defaultPredictorCategorization, plotting_engine: str | Any = 'plotly', verbose: bool = False, **reading_opts)¶
Bases: pdstools.plots.plot_base.Plots, pdstools.adm.Tables.Tables
Main class for importing, preprocessing and structuring the Pega ADM Datamart. Gets all available data, properly names it and merges it into one main dataframe.
It’s also possible to import directly from S3. Please refer to pdstools.pega_io.S3.S3Data.get_ADMDatamart().
- Parameters:
path (str, default = ".") – The path of the data files
import_strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory, or to scan the file. When data fits into memory, ‘eager’ is typically more efficient; when it does not, the lazy strategy typically still allows you to use the data.
model_filename (Optional[str])
predictor_filename (Optional[str])
model_df (Optional[pdstools.utils.types.any_frame])
predictor_df (Optional[pdstools.utils.types.any_frame])
query (Optional[Union[polars.Expr, List[polars.Expr], str, Dict[str, list]]])
subset (bool)
drop_cols (Optional[list])
include_cols (Optional[list])
context_keys (list)
extract_keys (bool)
predictorCategorization (polars.Expr)
plotting_engine (Union[str, Any])
verbose (bool)
- Keyword Arguments:
model_filename (Optional[str]) – The name, or extended filepath, towards the model file
predictor_filename (Optional[str]) – The name, or extended filepath, towards the predictors file
model_df (Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame]) – Optional override to supply a dataframe instead of a file
predictor_df (Union[pl.DataFrame, pl.LazyFrame, pd.DataFrame]) – Optional override to supply a dataframe instead of a file
query (Union[pl.Expr, str, Dict[str, list]], default = None) – Please refer to _apply_query()
plotting_engine (str, default = "plotly") – Please refer to get_engine()
subset (bool, default = True) – Whether to only keep a subset of columns for efficiency purposes. Refer to _available_columns() for the default list of columns.
drop_cols (Optional[list]) – Columns to exclude from reading
include_cols (Optional[list]) – Additional columns to include when reading
context_keys (list, default = ["Channel", "Direction", "Issue", "Group"]) – Which columns to use as context keys
extract_keys (bool, default = False) – Extra keys, particularly pyTreatment, are hidden within the pyName column. Set extract_keys to True to expand that column and also show these values.
verbose (bool, default = False) – Whether to print out information during importing
**reading_opts – Additional parameters used while reading. Refer to
pdstools.pega_io.File.import_file()
for more info.
- modelData¶
If available, holds the preprocessed data about the models
- Type:
pl.LazyFrame
- predictorData¶
If available, holds the preprocessed data about the predictor binning
- Type:
pl.LazyFrame
- combinedData¶
If both modelData and predictorData are available, holds the merged data about the models and predictors
- Type:
pl.LazyFrame
- import_strategy¶
See the import_strategy parameter
- query¶
See the query parameter
- context_keys¶
See the context_keys parameter
- verbose¶
See the verbose parameter
Examples
>>> Data = ADMDatamart("/CDHSample")
>>> Data = ADMDatamart("Data/Adaptive Models & Predictors Export",
...     model_filename="Data-Decision-ADM-ModelSnapshot_AdaptiveModelSnapshotRepo20201110T085543_GMT/data.json",
...     predictor_filename="Data-Decision-ADM-PredictorBinningSnapshot_PredictorBinningSnapshotRepo20201110T084825_GMT/data.json")
>>> Data = ADMDatamart("Data/files",
...     model_filename="ModelData.csv",
...     predictor_filename="PredictorData.csv")
- property is_available: bool¶
- Return type:
bool
- standardChannelGroups = ['Web', 'Mobile', 'E-mail', 'Push', 'SMS', 'Retail', 'Call Center', 'IVR']¶
- standardDirections = ['Inbound', 'Outbound']¶
- NBAD_model_configurations¶
- static get_engine(plotting_engine)¶
Which engine to use for creating the plots.
By supplying a custom class here, you can re-use the pdstools functions but create visualisations to your own specifications, in any library.
- import_data(path: str | pathlib.Path | None = Path('.'), *, model_filename: str | None = 'modelData', predictor_filename: str | None = 'predictorData', model_df: pdstools.utils.types.any_frame | None = None, predictor_df: pdstools.utils.types.any_frame | None = None, subset: bool = True, drop_cols: list | None = None, include_cols: list | None = None, extract_keys: bool = False, verbose: bool = False, **reading_opts) Tuple[polars.LazyFrame | None, polars.LazyFrame | None] ¶
Method to import & format the relevant data.
The method first imports the model data, and then the predictor data. If model_df or predictor_df is supplied, it will use those instead. If any filters are included in the query argument of the ADMDatamart, those will be applied to the model data, and the predictor data will be filtered so that it only contains the model IDs left over after filtering. After reading, some additional values (such as success rate) are automatically computed. Lastly, if columns are missing from both datasets, this will be printed to the user if verbose is True.
- Parameters:
path (Path, default = '.') – The path of the data files
subset (bool, default = True) – Whether to only select the renamed columns, set to False to keep all columns
model_df (pd.DataFrame) – Optional override to supply a dataframe instead of a file
predictor_df (pd.DataFrame) – Optional override to supply a dataframe instead of a file
drop_cols (Optional[list]) – Columns to exclude from reading
include_cols (Optional[list]) – Additional columns to include when reading
extract_keys (bool, default = False) – Extra keys, particularly pyTreatment, are hidden within the pyName column. Set extract_keys to True to expand that column and also show these values.
verbose (bool, default = False) – Whether to print out information during importing
model_filename (Optional[str])
predictor_filename (Optional[str])
- Returns:
The model data and predictor binning data as LazyFrames
- Return type:
(polars.LazyFrame, polars.LazyFrame)
- _import_utils(name: str | pdstools.utils.types.any_frame, path: str | None = None, *, subset: bool = True, extract_keys: bool = False, drop_cols: list | None = None, include_cols: list | None = None, **reading_opts) Tuple[polars.LazyFrame, dict, dict] ¶
Handler function to interface to the cdh_utils methods
- Parameters:
name (Union[str, pl.DataFrame]) – One of {modelData, predictorData} or a dataframe
path (str, default = None) – The path of the data file
subset (bool)
extract_keys (bool)
drop_cols (Optional[list])
include_cols (Optional[list])
- Keyword Arguments:
subset (bool, default = True) – Whether to only select the renamed columns, set to False to keep all columns
drop_cols (list) – Supply columns to drop from the dataframe
include_cols (list) – Supply columns to include with the dataframe
extract_keys (bool) – Treatments are typically hidden within the pyName column; extract_keys can expand that cell to also show these values.
Additional keyword arguments – See pdstools.pega_io.File.readDSExport()
- Returns:
The requested dataframe, the renamed columns, and the columns missing in both dataframes
- Return type:
Tuple[polars.LazyFrame, dict, dict]
- Parameters:
name (Union[str, pdstools.utils.types.any_frame])
path (Optional[str])
subset (bool)
extract_keys (bool)
drop_cols (Optional[list])
include_cols (Optional[list])
- _available_columns(df: polars.LazyFrame, include_cols: list | None = None, drop_cols: list | None = None) Tuple[set, set] ¶
Based on the default names for variables, rename available data to proper formatting
- Parameters:
df (pl.LazyFrame) – Input dataframe
include_cols (list) – Supply columns to include with the dataframe
drop_cols (list) – Supply columns to not import at all
- Returns:
The original dataframe, but renamed for the found columns & The original and updated names for all renamed columns & The variables that were not found in the table
- Return type:
Tuple[set, set]
- _set_types(df: pdstools.utils.types.any_frame, table: str = 'infer', *, timestamp_fmt: str = None, strict_conversion: bool = True) pdstools.utils.types.any_frame ¶
A method to change columns to their proper type
- Parameters:
df (Union[pl.DataFrame, pl.LazyFrame]) – The input dataframe
table (str) – The table to set types for. Default is infer, in which case it infers the table type from the columns in it.
timestamp_fmt (str)
strict_conversion (bool)
- Keyword Arguments:
timestamp_fmt (str) – The format of Date type columns
strict_conversion (bool) – Raises an error if timestamp conversion to the given/default date format (timestamp_fmt) fails. See https://strftime.org/ for timestamp formats.
- Returns:
The input dataframe, but the proper typing applied
- Return type:
Union[pl.DataFrame, pl.LazyFrame]
- last(table='modelData', strategy: Literal[eager, lazy] = 'eager') pdstools.utils.types.any_frame ¶
Convenience function to get the last values for a table
- Parameters:
table (str, default = modelData) – Which table to get the last values for One of {modelData, predictorData, combinedData}
strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory, or to scan the file. When data fits into memory, ‘eager’ is typically more efficient; when it does not, the lazy strategy typically still allows you to use the data.
- Returns:
The last snapshot for each model
- Return type:
Union[pl.DataFrame, pl.LazyFrame]
- static _last(df: pdstools.utils.types.any_frame) pdstools.utils.types.any_frame ¶
- Parameters:
df (pdstools.utils.types.any_frame)
- Return type:
pdstools.utils.types.any_frame
- static _last_timestamp(col: Literal[ResponseCount, Positives]) polars.Expr ¶
Add a column to indicate the last timestamp a column has changed.
- Parameters:
col (Literal['ResponseCount', 'Positives']) – The column to calculate the diff for
- Return type:
polars.Expr
- _get_combined_data(last=True, strategy: Literal[eager, lazy] = 'eager') pdstools.utils.types.any_frame ¶
Combines the model data and predictor data into one dataframe.
- Parameters:
last (bool, default=True) – Whether to only use the last snapshot for each table
strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory, or to scan the file. When data fits into memory, ‘eager’ is typically more efficient; when it does not, the lazy strategy typically still allows you to use the data.
- Returns:
The combined dataframe
- Return type:
Union[pl.DataFrame, pl.LazyFrame]
- processTables(query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None) ADMDatamart ¶
Processes modelData, predictorData and combinedData tables.
Can take in a query, which it will apply to modelData If a query is given, it joins predictorData to only retain the modelIDs the modelData was filtered on. If both modelData and predictorData are present, it joins them together into combinedData.
If import_strategy is eager, which is the default, this method also collects the tables and then sets them back to lazy.
- Parameters:
query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]], default = None) – An optional query to apply to the modelData table. See:
_apply_query()
- Return type:
ADMDatamart
- save_data(path: str = '.') Tuple[os.PathLike, os.PathLike] ¶
Cache modelData and predictorData to files.
- Parameters:
path (str) – Where to place the files
- Returns:
The paths to the model and predictor data files
- Return type:
(os.PathLike, os.PathLike)
- _apply_query(df: pdstools.utils.types.any_frame, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None) polars.LazyFrame ¶
Given an input Polars dataframe, it filters the dataframe based on input query
- Parameters:
df (Union[pl.DataFrame, pl.LazyFrame]) – The input dataframe
query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – If a Polars Expression, passes the expression into Polars’ filter function. If a list of Polars Expressions, applies each of the expressions as filters. If a string, uses the Pandas query function (works only in eager mode, not recommended). Else, a dict of lists where the key is column name in the dataframe and the corresponding value is a list of values to keep in the dataframe
- Returns:
Filtered Polars LazyFrame
- Return type:
pl.LazyFrame
- discover_modelTypes(df: polars.LazyFrame, by: str = 'Configuration', allow_collect=False) Dict ¶
Discovers the type of model embedded in the pyModelData column.
By default, we do a group_by Configuration, because a model rule can only contain one type of model. Then, for each configuration, we look into the pyModelData blob and find the _serialClass, returning it in a dict.
- Parameters:
df (pl.LazyFrame) – The dataframe to search for model types
by (str) – The column to look for types in. Configuration is recommended.
allow_collect (bool, default = False) – Set to True to allow discovering modelTypes, even if in lazy strategy. It will fetch one modelData string per configuration.
- Return type:
Dict
- get_AGB_models(last: bool = False, by: str = 'Configuration', n_threads: int = 1, query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, verbose: bool = True, **kwargs) Dict ¶
Method to automatically extract AGB models.
Recommended to subset using the querying functionality to cut down on execution time, because it checks for each model ID. If you only have AGB models remaining after the query, it will only return proper AGB models.
- Parameters:
last (bool, default = False) – Whether to only look at the last snapshot for each model
by (str, default = 'Configuration') – Which column to determine unique models with
n_threads (int, default = 1) – The number of threads to use for extracting the models. Since we use multithreading, setting this to a reasonable value helps speed up the import.
query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – Please refer to
_apply_query()
verbose (bool, default = True) – Whether to print out information while importing
- Return type:
Dict
- static _create_sign_df(df: polars.LazyFrame, by: str = 'Name', *, what: str = 'ResponseCount', every: str = '1d', pivot: bool = True, mask: bool = True) polars.LazyFrame ¶
Generates dataframe to show whether responses decreased/increased from day to day
For a given dataframe where columns are dates and rows are model names (the by parameter), subtracts each day’s value from the previous day’s value per model, then masks the data: if the value increased (the desired situation), the cell contains 1; if there was no change, 0; and if it decreased, -1. This dataframe can then be used in a heatmap.
- Parameters:
df (pl.LazyFrame) – This is typically pivoted ModelData
by (str, default = Name) – Column to calculate the daily change for.
what (str)
every (str)
pivot (bool)
mask (bool)
- Keyword Arguments:
what (str, default = ResponseCount) – Column that contains response counts
every (str, default = 1d) – Interval of the change window
pivot (bool, default = True) – Returns a pivoted table with signs as value if set to true
mask (bool, default = True) – Drops SnapshotTime and returns direction of change (sign).
- Returns:
The dataframe with signs indicating day-to-day increases or decreases
- Return type:
pl.LazyFrame
- model_summary(by: str = 'ModelID', query: polars.Expr | List[polars.Expr] | str | Dict[str, list] | None = None, **kwargs) polars.LazyFrame ¶
Convenience method to automatically generate a summary over models
By default, it summarizes ResponseCount, Performance, SuccessRate & Positives by model ID. It also adds weighted means for Performance and SuccessRate, as well as the count and percentage of models without responses.
- Parameters:
by (str, default = ModelID) – By what column to summarize the models
query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – Please refer to
_apply_query()
- Returns:
group_by dataframe over all models
- Return type:
pl.LazyFrame
- pivot_df(df: polars.LazyFrame, by: str | list = 'Name', *, allow_collect: bool = True, top_n: int = 0) polars.DataFrame ¶
Simple function to extract pivoted information
- Parameters:
df (pl.LazyFrame) – The input DataFrame.
by (Union[str, list], default = Name) – The column(s) to pivot the DataFrame by. If a list is provided, only the first element is used.
allow_collect (bool, default = True) – Whether to allow eager computation. If set to False and the import strategy is “lazy”, an error will be raised.
top_n (int, optional (default=0)) – The number of rows to include in the pivoted DataFrame. If set to 0, all rows are included.
- Returns:
The pivoted DataFrame.
- Return type:
pl.DataFrame
- static response_gain_df(df: pdstools.utils.types.any_frame, by: str = 'Channel') pdstools.utils.types.any_frame ¶
Simple function to extract the response gain per model
- Parameters:
df (pdstools.utils.types.any_frame)
by (str)
- Return type:
pdstools.utils.types.any_frame
- models_by_positives_df(df: polars.LazyFrame, by: str = 'Channel', allow_collect=True) polars.LazyFrame ¶
Compute statistics on the dataframe by grouping it by a given column by and computing the count of unique ModelIDs and the cumulative percentage of unique models with regard to the number of positive answers.
- Parameters:
df (pl.LazyFrame) – The input DataFrame
by (str, default = Channel) – The column name to group the DataFrame by, by default “Channel”
allow_collect (bool, default = True) – Whether to allow eager computation. If set to False and the import strategy is “lazy”, an error will be raised.
- Returns:
DataFrame with PositivesBin column and model count statistics
- Return type:
pl.LazyFrame
- get_model_stats(last: bool = True) dict ¶
Returns a dictionary containing various statistics for the model data.
- Parameters:
last (bool) – Whether to compute statistics only on the last snapshot. Defaults to True.
- Returns:
A dictionary containing the following keys:
- ‘models_n_snapshots’: The number of distinct snapshot times in the data.
- ‘models_total’: The total number of models in the data.
- ‘models_empty’: The models with no responses.
- ‘models_nopositives’: The models with responses but no positive responses.
- ‘models_isimmature’: The models with less than 200 positive responses.
- ‘models_noperformance’: The models with at least 200 positive responses but a performance of 50.
- ‘models_n_nonperforming’: The total number of models that are not performing well.
- ‘models_missing_{key}’: The number of models with missing values for each context key.
- ‘models_bottom_left’: The models with a performance of 50 and a success rate of 0.
- Return type:
Dict
- describe_models(**kwargs) NoReturn ¶
Convenience method to quickly summarize the models
- Return type:
NoReturn
- applyGlobalQuery(query: polars.Expr | List[polars.Expr] | str | Dict[str, list]) ADMDatamart ¶
Convenience method to further query the datamart
It’s possible to give this query to the initial ADMDatamart class directly, but this method is more explicit. Filters the model data (the query is put in a polars.filter() method), filters the predictorData on the ModelIDs remaining after the query, and recomputes combinedData. Only works with Polars expressions.
- Parameters:
query (Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]) – The query to apply, see _apply_query()
- Return type:
ADMDatamart
- fillMissing() ADMDatamart ¶
Convenience method to fill missing values
Fills categorical, string and null type columns with “NA”
Fills SuccessRate, Performance and ResponseCount columns with 0
When context keys have empty string values, replaces them
with “NA” string
- Return type:
ADMDatamart
- summary_by_channel(custom_channels: Dict[str, str] = None, keep_lists: bool = False)¶
- Parameters:
custom_channels (Dict[str, str])
keep_lists (bool)
- overall_summary(custom_channels: Dict[str, str] = None)¶
- Parameters:
custom_channels (Dict[str, str])
- generateReport(name: str | None = None, working_dir: pathlib.Path = Path('.'), *, modelid: str | None = '', delete_temp_files: bool = True, output_type: str = 'html', allow_collect: bool = True, cached_data: bool = False, predictordetails_activeonly: bool = False, **kwargs)¶
Generates a report based on the provided parameters. If modelid is provided, a model report will be generated. If not, an overall HealthCheck report will be generated.
- Parameters:
name (Optional[str], default = None) – The name of the report.
working_dir (Path, default = Path(".")) – The working directory. Cached files will be written here.
modelid (Optional[str])
delete_temp_files (bool)
output_type (str)
allow_collect (bool)
cached_data (bool)
predictordetails_activeonly (bool)
- Keyword Arguments:
modelid (Optional[str], default = "") – The model id, if generating a model report.
delete_temp_files (bool, default = True) – Whether to delete temporary files.
output_type (str, default = "html") – The type of the output file.
allow_collect (bool, default = True) – Whether to allow collection of data.
cached_data (bool, default = False) – Whether to use cached data.
del_cache (bool, default = True) – Whether to delete cache.
predictordetails_activeonly (bool, default = False) – Whether to only include active predictor details.
**kwargs – Additional keyword arguments.
- exportTables(file: pathlib.Path = 'Tables.xlsx', predictorBinning=False)¶
Exports all tables from pdstools.adm.Tables into one Excel file.
- Parameters:
file (Path, default = 'Tables.xlsx') – The file name of the exported Excel file
predictorBinning (bool, default = False) – If False, the ‘predictorbinning’ table will not be created
- class ADMTrees¶
- static getMultiTrees(file: polars.DataFrame, n_threads=1, verbose=True, **kwargs)¶
- Parameters:
file (polars.DataFrame)
- class MultiTrees¶
- property first¶
- property last¶
- trees: dict¶
- model_name: str¶
- context_keys: list¶
- __repr__()¶
Return repr(self).
- __getitem__(index)¶
- __len__()¶
- __add__(other)¶
- computeOverTime(predictorCategorization=None)¶
- plotSplitsPerVariableType(predictorCategorization=None, **kwargs)¶
- class BinAggregator(dm: pdstools.ADMDatamart, query: polars.Expr = None)¶
A class to generate rolled up insights from ADM predictor binning.
- Parameters:
dm (pdstools.ADMDatamart)
query (polars.Expr)
- roll_up(predictors: str | list, n: int = 10, distribution: Literal[lin, log] = 'lin', boundaries: float | list | None = None, symbols: str | list | None = None, minimum: float | None = None, maximum: float | None = None, aggregation: str | None = None, as_numeric: bool | None = None, return_df: bool = False, verbose: bool = False) polars.DataFrame | plotly.graph_objects.Figure ¶
Roll up a predictor across all the models defined when creating the class.
Predictors can be both numeric and symbolic (also called ‘categorical’). You can aggregate the same predictor across different sets of models by specifying a column name in the aggregation argument.
- Parameters:
predictors (str | list) – Name of the predictor to roll up. Multiple predictors can be passed in as a list.
n (int, optional) – Number of bins (intervals or symbols) to generate, by default 10. Any custom intervals or symbols specified with the ‘musthave’ argument will count towards this number as well. For symbolic predictors can be None, which means unlimited.
distribution (str, optional) – For numeric predictors: the way the intervals are constructed. By default “lin” for an evenly-spaced distribution, can be set to “log” for a long tailed distribution (for fields like income).
boundaries (float | list, optional) – For numeric predictors: one value, or a list of the numeric values to include as interval boundaries. They will be used at the front of the automatically created intervals. By default None, all intervals are created automatically.
symbols (str | list, optional) – For symbolic predictors, any symbol(s) that must be included in the symbol list in the generated binning. By default None.
minimum (float, optional) – Minimum value for numeric predictors, by default None. When None the minimum is taken from the binning data of the models.
maximum (float, optional) – Maximum value for numeric predictors, by default None. When None the maximum is taken from the binning data of the models.
aggregation (str, optional) – Optional column name in the data to aggregate over, creating separate aggregations for each of the different values. By default None.
as_numeric (bool, optional) – Optional override for the type of the predictor, so to be able to override in the (exceptional) situation that a predictor with the same name is numeric in some and symbolic in some other models. By default None which means the type is taken from the first predictor in the data.
return_df (bool, optional) – Return the underlying binning instead of a plot.
verbose (bool, optional) – Show detailed debug information while executing, by default False
- Returns:
By default returns a nicely formatted plot. When ‘return_df’ is set to True, it returns the actual binning with the lift aggregated over all the models, optionally per predictor and per set of models.
- Return type:
pl.DataFrame | Figure
- accumulate_num_binnings(predictor, modelids, target_binning, verbose=False) polars.DataFrame ¶
- Return type:
polars.DataFrame
- create_symbol_list(predictor, n_symbols, musthave_symbols) list ¶
- Return type:
list
- accumulate_sym_binnings(predictor, modelids, symbollist, verbose=False) polars.DataFrame ¶
- Return type:
polars.DataFrame
- normalize_all_binnings(combined_dm: polars.LazyFrame) polars.LazyFrame ¶
Prepare all predictor binning
Fix up the boundaries for numeric bins and parse the bin labels into clean lists for symbolics.
- Parameters:
combined_dm (polars.LazyFrame)
- Return type:
polars.LazyFrame
- create_empty_numbinning(predictor: str, n: int, distribution: str = 'lin', boundaries: list | None = None, minimum: float | None = None, maximum: float | None = None) polars.DataFrame ¶
- Parameters:
predictor (str)
n (int)
distribution (str)
boundaries (Optional[list])
minimum (Optional[float])
maximum (Optional[float])
- Return type:
polars.DataFrame
- get_source_numbinning(predictor: str, modelid: str) polars.DataFrame ¶
- Parameters:
predictor (str)
modelid (str)
- Return type:
polars.DataFrame
- combine_two_numbinnings(source: polars.DataFrame, target: polars.DataFrame, verbose=False) polars.DataFrame ¶
- Parameters:
source (polars.DataFrame)
target (polars.DataFrame)
- Return type:
polars.DataFrame
- plot_binning_attribution(source: polars.DataFrame, target: polars.DataFrame) plotly.graph_objects.Figure ¶
- Parameters:
source (polars.DataFrame)
target (polars.DataFrame)
- Return type:
plotly.graph_objects.Figure
- plotBinningLift(binning, col_facet=None, row_facet=None, custom_data=['PredictorName', 'BinSymbol'], return_df=False) polars.DataFrame | plotly.graph_objects.Figure ¶
- Return type:
Union[polars.DataFrame, plotly.graph_objects.Figure]
- plot_lift_binning(binning: polars.DataFrame) plotly.graph_objects.Figure ¶
- Parameters:
binning (polars.DataFrame)
- Return type:
plotly.graph_objects.Figure
- get_token(credentialFile: str, verify: bool = True, **kwargs)¶
Get API credentials to a Pega Platform instance.
After setting up OAuth2 authentication in Dev Studio, you should be able to download a credential file. Simply point this method to that file, and it’ll read the relevant properties and give you your access token.
- Parameters:
credentialFile (str) – The credential file downloaded after setting up OAuth in a Pega system
verify (bool, default = True) – Whether to only allow safe SSL requests. In case you’re connecting to an unsecured API endpoint, you need to explicitly set verify to False, otherwise Python will yell at you.
- Keyword Arguments:
url (str) – An optional override of the URL to connect to. This is also extracted out of the credential file, but you may want to customize this (to a different port, etc).
- readDSExport(filename: pandas.DataFrame | polars.DataFrame | str, path: str = '.', verbose: bool = True, **reading_opts) polars.LazyFrame ¶
Read a Pega dataset export file. Can accept either a Pandas DataFrame or one of the following formats: - .csv - .json - .zip (zipped json or CSV) - .feather - .ipc - .parquet
It automatically infers the default file names for both model data as well as predictor data. If you supply either ‘modelData’ or ‘predictorData’ as the ‘file’ argument, it will search for them. If you supply the full name of the file in the ‘path’ directory, it will import that instead. Since pdstools V3.x, returns a Polars LazyFrame. Simply call .collect() to get an eager frame.
- Parameters:
filename ([pd.DataFrame, pl.DataFrame, str]) – Either a Pandas/Polars DataFrame with the source data (for compatibility), or a string, in which case it can either be: - The name of the file (if a custom name) or - Whether we want to look for ‘modelData’ or ‘predictorData’ in the path folder.
path (str, default = '.') – The location of the file
verbose (bool, default = True) – Whether to print out which file will be imported
- Keyword Arguments:
Any – Any arguments to plug into the scan_* function from Polars.
- Returns:
The (lazy) dataframe
Examples
>>> df = readDSExport(filename='modelData', path='./datamart')
>>> df = readDSExport(filename='ModelSnapshot.json', path='data/ADMData')
>>> df = pd.read_csv('file.csv')
>>> df = readDSExport(filename=df)
- Return type:
polars.LazyFrame
- setupAzureOpenAI(api_base: str = 'https://aze-openai-01.openai.azure.com/', api_version: Literal[2022-12-01, 2023-03-15-preview, 2023-05-15, 2023-06-01-preview, 2023-07-01-preview, 2023-09-15-preview, 2023-10-01-preview, 2023-12-01-preview] = '2023-12-01-preview')¶
Convenience function to automagically setup Azure AD-based authentication for the Azure OpenAI service. Mostly meant as an internal tool within Pega, but can of course also be used beyond.
Prerequisites (you should only need to do this once!):
- Download the Azure CLI (https://learn.microsoft.com/en-us/cli/azure/install-azure-cli)
- Once installed, run 'az login' in your terminal
- Install the additional dependencies: pip install azure-identity openai
Running this function automatically sets, among others:
- openai.api_key
- os.environ["OPENAI_API_KEY"]
This should ensure that you don’t need to pass tokens and/or api_keys around. The key that’s set has a lifetime, typically of one hour. Therefore, if you get an error message like ‘invalid token’, you may need to run this method again to refresh the token for another hour.
- Parameters:
api_base (str) – The url of the Azure service name you’d like to connect to If you have access to the Azure OpenAI playground (https://oai.azure.com/portal), you can easily find this url by clicking ‘view code’ in one of the playgrounds. If you have access to the Azure portal directly (https://portal.azure.com), this will be found under ‘endpoint’. Else, ask your system administrator for the correct url.
api_version (str) – The version of the api to use
Usage
-----
>>> from pdstools import setupAzureOpenAI
>>> setupAzureOpenAI()
- class CDHLimits¶
Bases:
object
A singleton container for best practice limits for CDH.
- LimitStatus¶
- Metrics¶
- _instance¶
- num_limit¶
- lims¶
- get_limits(metric: Metrics) num_limit | None ¶
- Parameters:
metric (Metrics)
- Return type:
Union[num_limit, None]
- check_limits(metric: Metrics, value: object) LimitStatus ¶
- Parameters:
metric (Metrics)
value (object)
- Return type:
LimitStatus
- defaultPredictorCategorization(x: str | polars.Expr = pl.col('PredictorName')) polars.Expr ¶
Function to determine the ‘category’ of a predictor.
It is possible to supply a custom function. Such a function can accept an optional column as input, and its output should be a Polars expression. The most straightforward way to implement this is with pl.when().then().otherwise(), which you can chain.
By default, this function returns 'Primary' whenever there is no '.' anywhere in the name string, and otherwise returns the substring before the first period.
- Parameters:
x (Union[str, pl.Expr], default = pl.col('PredictorName')) – The column to parse
- Return type:
polars.Expr
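As a sketch, the default rule can be mirrored in plain Python (illustrative only; the real implementation is a Polars expression, and the function name below is hypothetical):

```python
def categorize_predictor(name: str) -> str:
    """Mirror of the default rule described above: 'Primary' when the
    name contains no '.', else the substring before the first period."""
    return name.split(".", 1)[0] if "." in name else "Primary"

# A custom categorization in Polars would follow the same shape, with
# chained pl.when(...).then(...).otherwise(...) expressions instead.
print(categorize_predictor("Customer.Age"))   # -> Customer
print(categorize_predictor("Age"))            # -> Primary
print(categorize_predictor("IH.Web.Clicks"))  # -> IH
```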
- show_versions() None ¶
Print out version of pdstools and dependencies to stdout.
Examples
>>> from pdstools import show_versions
>>> show_versions()
---Version info---
pdstools: 3.1.0
Platform: macOS-12.6.4-x86_64-i386-64bit
Python: 3.11.0 (v3.11.0:deaf509e8f, Oct 24 2022, 14:43:23) [Clang 13.0.0 (clang-1300.0.29.30)]
---Dependencies---
plotly: 5.13.1
requests: 2.28.1
pydot: 1.4.2
polars: 0.17.0
pyarrow: 11.0.0.dev52
tqdm: 4.64.1
pyyaml: <not installed>
aioboto3: 11.0.1
---Streamlit app dependencies---
streamlit: 1.20.0
quarto: 0.1.0
papermill: 2.4.0
itables: 1.5.1
pandas: 1.5.3
jinja2: 3.1.2
xlsxwriter: 3.0
- Return type:
None
- CDHSample(plotting_engine='plotly', query=None, **kwargs)¶
- SampleTrees()¶
- SampleValueFinder(verbose=True)¶
- class Config(config_file: str | None = None, hds_folder: pathlib.Path = '.', use_datamart: bool = False, datamart_folder: pathlib.Path = 'datamart', output_format: Literal[ndjson, parquet, arrow, csv] = 'ndjson', output_folder: pathlib.Path = 'output', mapping_file: str = 'mapping.map', mask_predictor_names: bool = True, mask_context_key_names: bool = False, mask_ih_names: bool = True, mask_outcome_name: bool = False, mask_predictor_values: bool = True, mask_context_key_values: bool = True, mask_ih_values: bool = True, mask_outcome_values: bool = True, context_key_label: str = 'Context_*', ih_label: str = 'IH_*', outcome_column: str = 'Decision_Outcome', positive_outcomes: list = ['Accepted', 'Clicked'], negative_outcomes: list = ['Rejected', 'Impression'], special_predictors: list = ['Decision_DecisionTime', 'Decision_OutcomeTime', 'Decision_Rank'], sample_percentage_schema_inferencing: float = 0.01)¶
Configuration file for the data anonymizer.
- Parameters:
config_file (str = None) – An optional path to a config file
hds_folder (Path = ".") – The path to the hds files
use_datamart (bool = False) – Whether to use the datamart to infer predictor types
datamart_folder (Path = "datamart") – The folder of the datamart files
output_format (Literal["ndjson", "parquet", "arrow", "csv"] = "ndjson") – The output format to write the files in
output_folder (Path = "output") – The location to write the files to
mapping_file (str = "mapping.map") – The name of the predictor mapping file
mask_predictor_names (bool = True) – Whether to mask the names of regular predictors
mask_context_key_names (bool = False) – Whether to mask the names of context key predictors
mask_ih_names (bool = True) – Whether to mask the names of Interaction History summary predictors
mask_outcome_name (bool = False) – Whether to mask the name of the outcome column
mask_predictor_values (bool = True) – Whether to mask the values of regular predictors
mask_context_key_values (bool = True) – Whether to mask the values of context key predictors
mask_ih_values (bool = True) – Whether to mask the values of Interaction History summary predictors
mask_outcome_values (bool = True) – Whether to mask the values of the outcomes to binary
context_key_label (str = "Context_*") – The pattern of names for context key predictors
ih_label (str = "IH_*") – The pattern of names for Interaction History summary predictors
outcome_column (str = "Decision_Outcome") – The name of the outcome column
positive_outcomes (list = ["Accepted", "Clicked"]) – Which positive outcomes to map to True
negative_outcomes (list = ["Rejected", "Impression"]) – Which negative outcomes to map to False
special_predictors (list = ["Decision_DecisionTime", "Decision_OutcomeTime", "Decision_Rank"]) – A list of special predictors which are not touched
sample_percentage_schema_inferencing (float) – The percentage of records to sample to infer the column type. In case you’re getting casting errors, it may be useful to increase this percentage to check a larger portion of data.
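As an illustrative sketch of how the positive_outcomes and negative_outcomes lists drive outcome masking when mask_outcome_values is enabled (the function below is hypothetical, not part of pdstools):

```python
# Defaults taken from the Config signature above.
POSITIVE = ["Accepted", "Clicked"]
NEGATIVE = ["Rejected", "Impression"]

def mask_outcome(outcome: str):
    """Map a raw outcome label to a binary value: positive labels become
    True, negative labels become False. Labels in neither list are left
    as None so they can be inspected or filtered separately."""
    if outcome in POSITIVE:
        return True
    if outcome in NEGATIVE:
        return False
    return None

print([mask_outcome(o) for o in ["Clicked", "Impression", "NoResponse"]])
# -> [True, False, None]
```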
- load_from_config_file(config_file: pathlib.Path)¶
Load the configurations from a file.
- Parameters:
config_file (Path) – The path to the configuration file
- save_to_config_file(file_name: str = None)¶
Save the configurations to a file.
- Parameters:
file_name (str) – The name of the configuration file
- validate_paths()¶
Validate the outcome folder exists.
- class DataAnonymization(config: Config | None = None, df: polars.LazyFrame | None = None, datamart: pdstools.ADMDatamart | None = None, **config_args)¶
Anonymize a historical dataset.
- Parameters:
config (Optional[Config]) – Override the default configurations with the Config class
df (Optional[polars.LazyFrame]) – Manually supply a Polars lazyframe to anonymize
datamart (Optional[pdstools.ADMDatamart]) – Manually supply a Datamart file to infer predictor types
- Keyword Arguments:
**config_args – See
Config
- write_to_output(df: polars.DataFrame | None = None, ext: Literal[ndjson, parquet, arrow, csv] = None, mode: Literal[optimized, robust] = 'optimized')¶
Write the processed dataframe to an output file.
- Parameters:
df (Optional[pl.DataFrame]) – Dataframe to write. If not provided, runs self.process()
ext (Literal["ndjson", "parquet", "arrow", "csv"]) – What extension to write the file to
mode (Literal['optimized', 'robust'], default = 'optimized') – Whether to output a single file (optimized) or maintain the same file structure as the original files (robust). Optimized should be faster, but robust should allow for bigger data as we don’t need all data in memory at the same time.
- create_mapping_file()¶
Create a file to write the column mapping
- load_hds_files()¶
Load the historical dataset files from the config.hds_folder location.
- read_predictor_type_from_file(df: polars.LazyFrame)¶
Infer the types of the predictors from the data.
This is non-trivial, as it is not ideal to pull all data into memory for this. For this reason, we sample 1% of the data (or all of it if there are fewer than 50 rows) and try to cast it to numeric. If that fails, we set the type to categorical; otherwise we set it to numeric.
It is technically supported to manually override this, by just overriding the symbolic_predictors_to_mask & numeric_predictors_to_mask properties.
- Parameters:
df (pl.LazyFrame) – The lazyframe to infer the types with
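The sampling heuristic described above can be sketched in plain Python (illustrative; the hypothetical function below approximates, rather than reproduces, the library's Polars-based implementation):

```python
import random

def infer_predictor_type(values, sample_frac=0.01, min_rows=50, seed=0):
    """Sample ~1% of the values (or all of them when there are fewer
    than 50) and try casting each to float. Any failure means the
    predictor is treated as symbolic; otherwise it is numeric."""
    random.seed(seed)
    if len(values) < min_rows:
        sample = values
    else:
        k = max(min_rows, int(len(values) * sample_frac))
        sample = random.sample(values, k)
    try:
        for v in sample:
            float(v)
        return "numeric"
    except (TypeError, ValueError):
        return "symbolic"

print(infer_predictor_type(["1.5", "2", "3.0"]))  # -> numeric
print(infer_predictor_type(["a", "b", "1"]))      # -> symbolic
```

Increasing sample_frac corresponds to raising sample_percentage_schema_inferencing in the Config when casting errors occur on rare non-numeric values.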
- static read_predictor_type_from_datamart(datamart_folder: pathlib.Path, datamart: pdstools.ADMDatamart = None)¶
The datamart contains type information about each predictor. This function extracts that information to infer types for the HDS.
- Parameters:
datamart_folder (Path) – The path to the datamart files
datamart (ADMDatamart) – The direct ADMDatamart object
- get_columns_by_type()¶
Get a list of columns for each type.
- get_predictors_mapping()¶
Map the predictor names to their anonymized form.
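A hedged sketch of what such a mapping could look like. The real implementation uses a configurable hasher (e.g. xxhash with seeds, see getHasher below); this hypothetical version uses stdlib hashlib to stay dependency-free:

```python
import hashlib

def build_predictor_mapping(names, seed="example-seed"):
    """Map each predictor name to an anonymized token. Deterministic for
    a fixed seed, so the same mapping file can be written out and reused
    to trace anonymized columns back to their originals."""
    mapping = {}
    for name in names:
        digest = hashlib.sha256((seed + name).encode()).hexdigest()[:12]
        mapping[name] = f"PREDICTOR_{digest}"
    return mapping

mapping = build_predictor_mapping(["Age", "Income"])
# Each original name now points at a stable, non-identifying token.
```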
- getHasher(cols, algorithm='xxhash', seed='random', seed_1=None, seed_2=None, seed_3=None)¶
- process(strategy='eager', **kwargs)¶
Anonymize the dataset.
- class PegaDefaultTables¶
- class ADMModelSnapshot¶
- pxApplication¶
- pyAppliesToClass¶
- pyModelID¶
- pyConfigurationName¶
- pySnapshotTime¶
- pyIssue¶
- pyGroup¶
- pyName¶
- pyChannel¶
- pyDirection¶
- pyTreatment¶
- pyPerformance¶
- pySuccessRate¶
- pyResponseCount¶
- pxObjClass¶
- pzInsKey¶
- pxInsName¶
- pxSaveDateTime¶
- pxCommitDateTime¶
- pyExtension¶
- pyActivePredictors¶
- pyTotalPredictors¶
- pyNegatives¶
- pyPositives¶
- pyRelativeNegatives¶
- pyRelativePositives¶
- pyRelativeResponseCount¶
- pyMemory¶
- pyPerformanceThreshold¶
- pyCorrelationThreshold¶
- pyPerformanceError¶
- pyModelData¶
- pyModelVersion¶
- pyFactoryUpdatetime¶
- class ADMPredictorBinningSnapshot¶
- pxCommitDateTime¶
- pxSaveDateTime¶
- pyModelID¶
- pxObjClass¶
- pzInsKey¶
- pxInsName¶
- pyPredictorName¶
- pyContents¶
- pyPerformance¶
- pyPositives¶
- pyNegatives¶
- pyType¶
- pyTotalBins¶
- pyResponseCount¶
- pyRelativePositives¶
- pyRelativeNegatives¶
- pyRelativeResponseCount¶
- pyBinNegatives¶
- pyBinPositives¶
- pyBinType¶
- pyBinNegativesPercentage¶
- pyBinPositivesPercentage¶
- pyBinSymbol¶
- pyBinLowerBound¶
- pyBinUpperBound¶
- pyRelativeBinPositives¶
- pyRelativeBinNegatives¶
- pyBinResponseCount¶
- pyRelativeBinResponseCount¶
- pyBinResponseCountPercentage¶
- pySnapshotTime¶
- pyBinIndex¶
- pyLift¶
- pyZRatio¶
- pyEntryType¶
- pyExtension¶
- pyGroupIndex¶
- pyCorrelationPredictor¶
- class ValueFinder(path: str | None = None, df: pandas.DataFrame | polars.DataFrame | polars.LazyFrame | None = None, verbose: bool = True, import_strategy: Literal[eager, lazy] = 'eager', ncust: int = None, **kwargs)¶
Class to analyze Value Finder datasets.
Relies heavily on polars for faster reading and transformations. See https://pola-rs.github.io/polars/py-polars/html/index.html
Requires either df or path to be supplied. If a path is supplied, the 'filename' keyword argument is optional; if no filename is given, it will look for the most recent file.
- Parameters:
path (Optional[str]) – Path to the ValueFinder data files
df (Optional[DataFrame]) – Override to supply a dataframe instead of a file. Supports pandas or polars dataframes
import_strategy (Literal['eager', 'lazy'], default = 'eager') – Whether to import the file fully into memory, or scan the file. When data fits into memory, 'eager' is typically more efficient; however, when it does not fit, the lazy methods typically still allow you to use the data.
verbose (bool) – Whether to print out information during importing
ncust (int)
- Keyword Arguments:
th (float) – An optional keyword argument to override the propensity threshold
filename (Optional[str]) – The name, or extended filepath, towards the file
subset (bool) – Whether to select only a subset of columns. Will speed up analysis and reduce unused information
- save_data(path: str = '.') os.PathLike ¶
Cache the ValueFinder dataset to a file
- Parameters:
path (str) – Where to place the file
- Returns:
The path to the file
- Return type:
PathLike
- getCustomerSummary(th: float | None = None) polars.DataFrame ¶
Computes the summary of propensities for all customers
- Parameters:
th (Optional[float]) – The threshold to consider an action ‘good’. If a customer has actions with propensity above this, the customer has at least one relevant action. If not given, will default to 5th quantile.
- Return type:
polars.DataFrame
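The per-customer aggregation described above can be sketched as follows (illustrative, hypothetical function; the real method returns a Polars DataFrame):

```python
def customer_summary(rows, th):
    """rows are (customer_id, propensity) pairs across all actions.
    A customer has a relevant action when at least one action's
    propensity exceeds the threshold th."""
    best = {}
    for cust, prop in rows:
        best[cust] = max(best.get(cust, 0.0), prop)
    return {
        cust: {"max_propensity": p, "has_relevant_action": p > th}
        for cust, p in best.items()
    }

rows = [("c1", 0.02), ("c1", 0.30), ("c2", 0.01)]
print(customer_summary(rows, th=0.05))
```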
- getCountsPerStage(customersummary: polars.DataFrame | None = None) polars.DataFrame ¶
Generates an aggregated view per stage.
- Parameters:
customersummary (Optional[pl.DataFrame]) – Optional override of the customer summary, which can be generated by getCustomerSummary().
- Return type:
polars.DataFrame
- getThFromQuantile(quantile: float) float ¶
Return the propensity threshold corresponding to a given quantile
If the threshold is already in self._thMap, it is simply retrieved from there; otherwise, it is computed and then added to the map.
- Parameters:
quantile (float) – The quantile to get the threshold for
- Return type:
float
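The caching behaviour can be sketched as follows (illustrative; a simplified nearest-rank quantile rather than the library's exact method, and the class below is hypothetical):

```python
class ThresholdCache:
    """Memoize quantile-to-threshold lookups, mirroring the _thMap
    behaviour described above: each quantile is computed once and
    reused on subsequent calls."""

    def __init__(self, propensities):
        self.propensities = sorted(propensities)
        self._th_map = {}

    def th_from_quantile(self, quantile):
        if quantile not in self._th_map:
            # Nearest-rank quantile: index into the sorted propensities.
            idx = int(quantile * (len(self.propensities) - 1))
            self._th_map[quantile] = self.propensities[idx]
        return self._th_map[quantile]
```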
- getCountsPerThreshold(th, return_df=False) polars.LazyFrame | None ¶
- Return type:
Optional[polars.LazyFrame]
- addCountsForThresholdRange(start, stop, step, method: Literal['threshold', 'quantile']) None ¶
Adds the counts per stage for a range of quantiles or thresholds.
Once computed, the values are added to .countsPerThreshold so we only need to compute each value once.
- Parameters:
start (float) – The starting of the range
stop (float) – The end of the range
step (float) – The steps to compute between start and stop
method (Literal["threshold", "quantile"]:) – Whether to get a range of thresholds directly or compute the thresholds from their quantiles
- Return type:
None
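Generating the list of thresholds for such a sweep can be sketched as follows (illustrative helper, not part of pdstools; with method='quantile' each value would additionally pass through getThFromQuantile):

```python
def threshold_range(start, stop, step):
    """Build the inclusive list of values from start to stop in the
    given step size, counting steps to avoid float drift."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 10) for i in range(n + 1)]

print(threshold_range(0.01, 0.05, 0.01))  # -> [0.01, 0.02, 0.03, 0.04, 0.05]
```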
- plotPropensityDistribution(sampledN: int = 10000) plotly.graph_objects.Figure ¶
Plots the distribution of the different propensities.
For optimization reasons (storage for all points in a boxplot, and time complexity for computing the distribution plot), we have to sample down to a reasonable number of data points.
- Parameters:
sampledN (int, default = 10_000) – The number of datapoints to sample
- Return type:
plotly.graph_objects.Figure
- plotPropensityThreshold(sampledN=10000, stage='Eligibility') plotly.graph_objects.Figure ¶
Plots the propensity threshold vs the different propensities.
- Parameters:
sampledN (int, default = 10_000) – The number of datapoints to sample
- Return type:
plotly.graph_objects.Figure
- plotPieCharts(start: float = None, stop: float = None, step: float = None, *, method: Literal['threshold', 'quantile'] = 'threshold', rounding: int = 3, th: float | None = None) plotly.graph_objects.FigureWidget ¶
Plots pie charts showing the distribution of customers
The pie charts each represent the fraction of customers with the color indicating whether they have sufficient relevant actions in that stage of the NBAD arbitration.
If no values are provided for start, stop or step, the pie charts are shown using the default propensity threshold, as part of the Value Finder class.
- Parameters:
start (float) – The starting of the range
stop (float) – The end of the range
step (float) – The steps to compute between start and stop
method (Literal['threshold', 'quantile'])
rounding (int)
th (Optional[float])
- Keyword Arguments:
method (Literal['threshold', 'quantile'], default='threshold') – Whether the range is computed based on the threshold directly or based on the quantile of the propensity
rounding (int) – The number of digits to round the values by
th (Optional[float]) – Choose a specific propensity threshold to plot
- Return type:
plotly.graph_objects.FigureWidget
- plotDistributionPerThreshold(start: float = None, stop: float = None, step: float = None, *, method: Literal['threshold', 'quantile'] = 'threshold', rounding=3) plotly.graph_objects.FigureWidget ¶
Plots the distribution of customers per threshold, per stage.
Based on the precomputed data in self.countsPerThreshold, this function will plot the distribution per stage.
To add more data points between a given range, simply pass all three arguments to this function: start, stop and step.
- Parameters:
start (float) – The starting of the range
stop (float) – The end of the range
step (float) – The steps to compute between start and stop
method (Literal['threshold', 'quantile'])
- Keyword Arguments:
method (Literal['threshold', 'quantile'], default='threshold') – Whether the range is computed based on the threshold directly or based on the quantile of the propensity
rounding (int) – The number of digits to round the values by
- Return type:
plotly.graph_objects.FigureWidget
- plotFunnelChart(level: str = 'Action', query=None, return_df=False, **kwargs)¶
Plots the funnel of actions or issues per stage.
- Parameters:
level (str, default = 'Action') – Which element to plot: - If 'Action', plots the distribution of actions. - If 'Issue', plots the distribution of issues
- class Sample(ldf: polars.LazyFrame)¶
- Parameters:
ldf (polars.LazyFrame)
- sample(n)¶
- height()¶
- shape()¶
- item()¶
- __reports__¶