pdstools ======== .. py:module:: pdstools .. autoapi-nested-parse:: Pega Data Scientist Tools Python library Submodules ---------- .. toctree:: :maxdepth: 1 /autoapi/pdstools/adm/index /autoapi/pdstools/app/index /autoapi/pdstools/cli/index /autoapi/pdstools/decision_analyzer/index /autoapi/pdstools/ih/index /autoapi/pdstools/infinity/index /autoapi/pdstools/pega_io/index /autoapi/pdstools/prediction/index /autoapi/pdstools/reports/index /autoapi/pdstools/utils/index /autoapi/pdstools/valuefinder/index Classes ------- .. autoapisummary:: pdstools.ADMDatamart pdstools.Prediction pdstools.ValueFinder Functions --------- .. autoapisummary:: pdstools.read_ds_export pdstools.default_predictor_categorization pdstools.cdh_sample pdstools.sample_value_finder pdstools.show_versions Package Contents ---------------- .. py:class:: ADMDatamart(model_df: Optional[polars.LazyFrame] = None, predictor_df: Optional[polars.LazyFrame] = None, *, query: Optional[pdstools.utils.types.QUERY] = None, extract_pyname_keys: bool = True) Monitor and analyze ADM data from the Pega Datamart. To initialize this class, either 1. Initialize directly with the model_df and predictor_df polars LazyFrames 2. Use one of the class methods: `from_ds_export`, `from_s3` or `from_dataflow_export` This class reads in the data from different sources, properly structures it for further analysis, and applies correct typing and useful renaming. There are also a few "namespaces" that you can call from this class: - `.plot` contains ready-made plots to analyze the data with - `.aggregates` contains mostly internal data aggregation queries - `.agb` contains analysis utilities for Adaptive Gradient Boosting models - `.generate` leads to some ready-made reports, such as the Health Check - `.bin_aggregator` allows you to compare the bins across various models :param model_df: The Polars LazyFrame representation of the model snapshot table.
:type model_df: pl.LazyFrame, optional :param predictor_df: The Polars LazyFrame representation of the predictor binning table. :type predictor_df: pl.LazyFrame, optional :param query: An optional query to apply to the input data. For details, see :meth:`pdstools.utils.cdh_utils._apply_query`. :type query: QUERY, optional :param extract_pyname_keys: Whether to extract extra keys from the `pyName` column. In older Pega versions, this contained pyTreatment among other (customizable) fields. By default True :type extract_pyname_keys: bool, default = True .. rubric:: Examples >>> import polars as pl >>> from pdstools import ADMDatamart >>> from glob import glob >>> dm = ADMDatamart( model_df = pl.scan_parquet('models.parquet'), predictor_df = pl.scan_parquet('predictors.parquet'), query = {"Configuration":["Web_Click_Through"]} ) >>> dm = ADMDatamart.from_ds_export(base_path='/my_export_folder') >>> dm = ADMDatamart.from_s3("pega_export") >>> dm = ADMDatamart.from_dataflow_export(glob("data/models*"), glob("data/preds*")) .. note:: This class depends on two datasets: - `pyModelSnapshots` corresponds to the `model_data` attribute - `pyADMPredictorSnapshots` corresponds to the `predictor_data` attribute For instructions on how to download these datasets, please refer to the following article: https://docs.pega.com/bundle/platform/page/platform/decision-management/exporting-monitoring-database.html .. seealso:: :obj:`pdstools.adm.Plots` The out of the box plots on the Datamart data :obj:`pdstools.adm.Reports` Methods to generate the Health Check and Model Report :obj:`pdstools.utils.cdh_utils._apply_query` How to query the ADMDatamart class and methods .. py:attribute:: model_data :type: Optional[polars.LazyFrame] .. py:attribute:: predictor_data :type: Optional[polars.LazyFrame] .. py:attribute:: combined_data :type: Optional[polars.LazyFrame] .. py:attribute:: plot :type: pdstools.adm.Plots.Plots .. py:attribute:: aggregates :type: pdstools.adm.Aggregates.Aggregates .. 
py:attribute:: agb :type: pdstools.adm.ADMTrees.AGB .. py:attribute:: generate :type: pdstools.adm.Reports.Reports .. py:attribute:: cdh_guidelines :type: pdstools.adm.CDH_Guidelines.CDHGuidelines .. py:attribute:: bin_aggregator :type: pdstools.adm.BinAggregator.BinAggregator .. py:attribute:: context_keys :type: List[str] :value: ['Channel', 'Direction', 'Issue', 'Group', 'Name'] .. py:method:: from_ds_export(model_filename: Optional[str] = None, predictor_filename: Optional[str] = None, base_path: Union[os.PathLike, str] = '.', *, query: Optional[pdstools.utils.types.QUERY] = None, extract_pyname_keys: bool = True) :classmethod: Import the ADMDatamart class from a Pega Dataset Export :param model_filename: The full path or name (if base_path is given) to the model snapshot files, by default None :type model_filename: Optional[str], optional :param predictor_filename: The full path or name (if base_path is given) to the predictor binning snapshot files, by default None :type predictor_filename: Optional[str], optional :param base_path: A base path to provide so that we can automatically find the most recent files for both the model and predictor snapshots, if model_filename and predictor_filename are not given as full paths, by default "." :type base_path: Union[os.PathLike, str], optional :param query: An optional argument to filter out selected data, by default None :type query: Optional[QUERY], optional :param extract_pyname_keys: Whether to extract additional keys from the `pyName` column, by default True :type extract_pyname_keys: bool, optional :returns: The properly initialized ADMDatamart class :rtype: ADMDatamart .. 
rubric:: Examples >>> from pdstools import ADMDatamart >>> # To automatically find the most recent files in the 'my_export_folder' dir: >>> dm = ADMDatamart.from_ds_export(base_path='/my_export_folder') >>> # To specify individual files: >>> dm = ADMDatamart.from_ds_export( model_filename='/Downloads/model_snapshots.parquet', predictor_filename='/Downloads/predictor_snapshots.parquet' ) .. note:: By default, the dataset export in Infinity returns a zip file per table. You do not need to open up this zip file! You can simply point to the zip, and this method will be able to read in the underlying data. .. seealso:: :obj:`pdstools.pega_io.File.read_ds_export` More information on file compatibility :obj:`pdstools.utils.cdh_utils._apply_query` How to query the ADMDatamart class and methods .. py:method:: from_s3() :classmethod: Not implemented yet. Please let us know if you would like this functionality! .. py:method:: from_dataflow_export(model_data_files: Union[Iterable[str], str], predictor_data_files: Union[Iterable[str], str], *, query: Optional[pdstools.utils.types.QUERY] = None, extract_pyname_keys: bool = True, cache_file_prefix: str = '', extension: Literal['json'] = 'json', compression: Literal['gzip'] = 'gzip', cache_directory: Union[os.PathLike, str] = 'cache') :classmethod: Read in data generated by a data flow, such as the Prediction Studio export. Dataflows are able to export data from and to various sources. As they are meant to be used in production, they are highly resilient. For every partition and every node, a dataflow will output a small json file every few seconds. While this is great for production loads, it can be a bit trickier to read in the data for smaller-scale and ad-hoc analyses. This method aims to make the ingestion of such highly partitioned data easier. It reads in every individual small json file that the dataflow has output, and caches them to a parquet file in the `cache_directory` folder.
As such, if you re-run this method later with more data added since the last export, we will not read in from the (slow) dataflow files, but rather from the (much faster) cache. :param model_data_files: A list of files to read in as the model snapshots :type model_data_files: Union[Iterable[str], str] :param predictor_data_files: A list of files to read in as the predictor snapshots :type predictor_data_files: Union[Iterable[str], str] :param query: An optional query to filter out selected data, by default None :type query: Optional[QUERY], optional :param extract_pyname_keys: Whether to extract extra keys from the pyName column, by default True :type extract_pyname_keys: bool, optional :param cache_file_prefix: An optional prefix for the cache files, by default "" :type cache_file_prefix: str, optional :param extension: The extension of the source data, by default "json" :type extension: Literal["json"], optional :param compression: The compression of the source files, by default "gzip" :type compression: Literal["gzip"], optional :param cache_directory: Where to store the cached files, by default "cache" :type cache_directory: Union[os.PathLike, str], optional :returns: An initialized instance of the datamart class :rtype: ADMDatamart .. rubric:: Examples >>> from pdstools import ADMDatamart >>> import glob >>> dm = ADMDatamart.from_dataflow_export(glob("data/models*"), glob("data/preds*")) .. seealso:: :obj:`pdstools.utils.cdh_utils._apply_query` How to query the ADMDatamart class and methods :obj:`glob` Makes creating lists of files much easier .. py:method:: _validate_model_data(df: Optional[polars.LazyFrame], query: Optional[pdstools.utils.types.QUERY] = None, extract_pyname_keys: bool = True) -> Optional[polars.LazyFrame] Internal method to validate model data .. py:method:: _validate_predictor_data(df: Optional[polars.LazyFrame]) -> Optional[polars.LazyFrame] Internal method to validate predictor data .. 
py:method:: apply_predictor_categorization(df: Optional[polars.LazyFrame] = None, categorization: Union[polars.Expr, Callable[Ellipsis, polars.Expr]] = cdh_utils.default_predictor_categorization) Apply a new predictor categorization to the datamart tables In certain plots, we use the predictor categorization to indicate what 'kind' a certain predictor is, such as IH, Customer, etc. Call this method with a custom Polars Expression (or a method that returns one) - and it will be applied to the predictor data (and the combined dataset too). For a reference implementation of a custom predictor categorization, refer to `pdstools.utils.cdh_utils.default_predictor_categorization`. :param df: A Polars LazyFrame to apply the categorization to. If not provided, applies it over the predictor data and combined datasets. By default, None :type df: Optional[pl.LazyFrame], optional :param categorization: A polars Expression (or method that returns one) to apply the mapping with. Should be based on Polars' when.then.otherwise syntax. By default, `pdstools.utils.cdh_utils.default_predictor_categorization` :type categorization: Union[pl.Expr, Callable[..., pl.Expr]] .. seealso:: :obj:`pdstools.utils.cdh_utils.default_predictor_categorization` The default method .. rubric:: Examples >>> dm = ADMDatamart(my_data) #uses the OOTB predictor categorization >>> dm.apply_predictor_categorization(categorization=pl.when( >>> pl.col("PredictorName").cast(pl.Utf8).str.contains("Propensity") >>> ).then(pl.lit("External Model") >>> ).otherwise(pl.lit("Adaptive Model"))) >>> # Now, every subsequent plot will use the custom categorization .. py:method:: save_data(path: Union[os.PathLike, str] = '.', selected_model_ids: Optional[List[str]] = None) -> Tuple[Optional[pathlib.Path], Optional[pathlib.Path]] Caches model_data and predictor_data to files.
:param path: Where to place the files :type path: str :param selected_model_ids: Optional list of model IDs to restrict to :type selected_model_ids: List[str] :returns: The paths to the model and predictor data files :rtype: (Optional[Path], Optional[Path]) .. py:property:: unique_channels A consistently ordered set of unique channels in the data Used for making the color schemes in different plots consistent .. py:property:: unique_configurations A consistently ordered set of unique configurations in the data Used for making the color schemes in different plots consistent .. py:property:: unique_channel_direction A consistently ordered set of unique channel+direction combos in the data Used for making the color schemes in different plots consistent .. py:property:: unique_configuration_channel_direction A consistently ordered set of unique configuration+channel+direction combos in the data Used for making the color schemes in different plots consistent .. py:property:: unique_predictor_categories A consistently ordered set of unique predictor categories in the data Used for making the color schemes in different plots consistent .. py:function:: read_ds_export(filename: Union[str, io.BytesIO], path: Union[str, os.PathLike] = '.', verbose: bool = False, **reading_opts) -> Optional[polars.LazyFrame] Read in most out of the box Pega dataset export formats Accepts one of the following formats: - .csv - .json - .zip (zipped JSON or CSV) - .feather - .ipc - .parquet It automatically infers the default file names for both model data as well as predictor data. If you supply either 'modelData' or 'predictorData' as the 'filename' argument, it will search for them. If you supply the full name of the file in the 'path' directory, it will import that instead. Since pdstools V3.x, returns a Polars LazyFrame. Simply call `.collect()` to get an eager frame.
:param filename: Can be one of the following: - A string with the full path to the file - A string with the name of the file (to be searched in the given path) - A BytesIO object containing the file data (e.g., from an uploaded file in a webapp) :type filename: Union[str, BytesIO] :param path: The location of the file :type path: str, default = '.' :param verbose: Whether to print out which file will be imported :type verbose: bool, default = False :keyword Any: Any arguments to plug into the scan_* function from Polars. :returns: The (lazy) dataframe :rtype: pl.LazyFrame .. rubric:: Examples >>> df = read_ds_export(filename='full/path/to/ModelSnapshot.json') >>> df = read_ds_export(filename='ModelSnapshot.json', path='data/ADMData') >>> df = read_ds_export(filename=uploaded_file) # Where uploaded_file is a BytesIO object .. py:class:: Prediction(df: polars.LazyFrame) Monitor Pega Prediction Studio Predictions .. py:attribute:: predictions :type: polars.LazyFrame .. py:attribute:: plot :type: PredictionPlots .. py:attribute:: prediction_validity_expr .. py:attribute:: cdh_guidelines .. py:method:: from_mock_data(days=70) :staticmethod: .. py:property:: is_available :type: bool .. py:property:: is_valid :type: bool .. py:method:: summary_by_channel(custom_predictions: Optional[List[List]] = None, by_period: str = None) -> polars.LazyFrame Summarize predictions per channel :param custom_predictions: Optional list with custom prediction name to channel mappings. Defaults to None. :type custom_predictions: Optional[List[CDH_Guidelines.NBAD_Prediction]], optional :param by_period: Optional grouping by time period. Format string as in polars.Expr.dt.truncate (https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.truncate.html), for example "1mo", "1w", "1d" for calendar month, week, or day. If provided, creates a new Period column with the truncated date/time. Defaults to None.
:type by_period: str, optional :returns: Dataframe with prediction summary (validity, numbers in test, control, etc.) :rtype: pl.LazyFrame .. py:method:: overall_summary(custom_predictions: Optional[List[List]] = None, by_period: str = None) -> polars.LazyFrame Overall prediction summary. Only valid prediction data is included. :param custom_predictions: Optional list with custom prediction name to channel mappings. Defaults to None. :type custom_predictions: Optional[List[CDH_Guidelines.NBAD_Prediction]], optional :param by_period: Optional grouping by time period. Format string as in polars.Expr.dt.truncate (https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.truncate.html), for example "1mo", "1w", "1d" for calendar month, week, or day. If provided, creates a new Period column with the truncated date/time. Defaults to None. :type by_period: str, optional :returns: Summary across all valid predictions as a dataframe :rtype: pl.LazyFrame .. py:function:: default_predictor_categorization(x: Union[str, polars.Expr] = pl.col('PredictorName')) -> polars.Expr Function to determine the 'category' of a predictor. It is possible to supply a custom function. This function can accept an optional column as input, and should return a Polars expression. The most straightforward way to implement this is with pl.when().then().otherwise(), which you can chain. By default, this function returns "Primary" whenever there is no '.' anywhere in the name string, otherwise returns the first string before the first period :param x: The column to parse :type x: Union[str, pl.Expr], default = pl.col('PredictorName') .. 
py:function:: cdh_sample(query: Optional[pdstools.utils.types.QUERY] = None) -> pdstools.adm.ADMDatamart.ADMDatamart Import a sample dataset from the CDH Sample application :param query: An optional query to apply to the data, by default None :type query: Optional[QUERY], optional :returns: The ADM Datamart class populated with CDH Sample data :rtype: ADMDatamart .. py:function:: sample_value_finder(threshold: Optional[float] = None) -> pdstools.valuefinder.ValueFinder.ValueFinder Import a sample dataset of a Value Finder simulation This simulation was run on a stock CDH Sample system. :param threshold: Optional override of the propensity threshold in the system, by default None :type threshold: Optional[float], optional :returns: The Value Finder class populated with the Value Finder simulation data :rtype: ValueFinder .. py:function:: show_versions(print_output: Literal[True] = True) -> None show_versions(print_output: Literal[False] = False) -> str Get a list of currently installed versions of pdstools and its dependencies. :param print_output: If True, print the version information to stdout. If False, return the version information as a string. Default is True. :type print_output: bool, optional :returns: Version information as a string if print_output is False, else None. :rtype: Optional[str] .. rubric:: Examples >>> from pdstools import show_versions >>> show_versions() --- Version info --- pdstools: 4.0.0-alpha Platform: macOS-14.7-arm64-arm-64bit Python: 3.12.4 (main, Jun 6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)] --- Dependencies --- typing_extensions: 4.12.2 polars>=1.9: 1.9.0 --- Dependency group: adm --- plotly>=5.5.0: 5.24.1 --- Dependency group: api --- pydantic: 2.9.2 httpx: 0.27.2 .. py:class:: ValueFinder(df: polars.LazyFrame, *, query: Optional[pdstools.utils.types.QUERY] = None, n_customers: Optional[int] = None, threshold: Optional[float] = None) Analyze the Value Finder dataset for detailed insights .. 
py:attribute:: df :type: polars.LazyFrame .. py:attribute:: n_customers :type: int .. py:attribute:: nbad_stages :value: ['Eligibility', 'Applicability', 'Suitability', 'Arbitration'] .. py:attribute:: aggregates .. py:attribute:: plot .. py:method:: from_ds_export(filename: Optional[str] = None, base_path: Union[os.PathLike, str] = '.', *, query: Optional[pdstools.utils.types.QUERY] = None, n_customers: Optional[int] = None, threshold: Optional[float] = None) :classmethod: .. py:method:: from_dataflow_export(files: Union[Iterable[str], str], *, query: Optional[pdstools.utils.types.QUERY] = None, n_customers: Optional[int] = None, threshold: Optional[float] = None, cache_file_prefix: str = '', extension: Literal['json'] = 'json', compression: Literal['gzip'] = 'gzip', cache_directory: Union[os.PathLike, str] = 'cache') :classmethod: .. py:method:: set_threshold(new_threshold: Optional[float] = None) .. py:property:: threshold .. py:method:: save_data(path: Union[os.PathLike, str] = '.') -> Optional[pathlib.Path] Cache the pyValueFinder dataset to a Parquet file :param path: Where to place the file :type path: str :returns: The path to the cached Parquet file :rtype: Optional[Path]