pdstools
========

.. py:module:: pdstools

.. autoapi-nested-parse::

   Pega Data Scientist Tools Python library


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/pdstools/adm/index
   /autoapi/pdstools/app/index
   /autoapi/pdstools/cli/index
   /autoapi/pdstools/decision_analyzer/index
   /autoapi/pdstools/ih/index
   /autoapi/pdstools/impactanalyzer/index
   /autoapi/pdstools/infinity/index
   /autoapi/pdstools/pega_io/index
   /autoapi/pdstools/prediction/index
   /autoapi/pdstools/reports/index
   /autoapi/pdstools/utils/index
   /autoapi/pdstools/valuefinder/index


Classes
-------

.. autoapisummary::

   pdstools.ADMDatamart
   pdstools.IH
   pdstools.ImpactAnalyzer
   pdstools.Prediction
   pdstools.ValueFinder


Functions
---------

.. autoapisummary::

   pdstools.read_ds_export
   pdstools.default_predictor_categorization
   pdstools.cdh_sample
   pdstools.sample_value_finder
   pdstools.show_versions


Package Contents
----------------

.. py:class:: ADMDatamart(model_df: Optional[polars.LazyFrame] = None, predictor_df: Optional[polars.LazyFrame] = None, *, query: Optional[pdstools.utils.types.QUERY] = None, extract_pyname_keys: bool = True)

   Monitor and analyze ADM data from the Pega Datamart.

   To initialize this class, either
   1. Initialize directly with the model_df and predictor_df polars LazyFrames
   2. Use one of the class methods: `from_ds_export`, `from_s3`, `from_dataflow_export` etc.

   This class will read in the data from different sources, properly structure them
   from further analysis, and apply correct typing and useful renaming.

   There is also a few "namespaces" that you can call from this class:

   - `.plot` contains ready-made plots to analyze the data with
   - `.aggregates` contains mostly internal data aggregations queries
   - `.agb` contains analysis utilities for Adaptive Gradient Boosting models
   - `.generate` leads to some ready-made reports, such as the Health Check
   - `.bin_aggregator` allows you to compare the bins across various models

   :param model_df: The Polars LazyFrame representation of the model snapshot table.
   :type model_df: pl.LazyFrame, optional
   :param predictor_df: The Polars LazyFrame represenation of the predictor binning table.
   :type predictor_df: pl.LazyFrame, optional
   :param query: An optional query to apply to the input data.
                 For details, see :meth:`pdstools.utils.cdh_utils._apply_query`.
   :type query: QUERY, optional
   :param extract_pyname_keys: Whether to extract extra keys from the `pyName` column.
                               In older Pega versions, this contained pyTreatment among other
                               (customizable) fields. By default True
   :type extract_pyname_keys: bool, default = True

   .. rubric:: Examples

   >>> from pdstools import ADMDatamart
   >>> from glob import glob
   >>> dm = ADMDatamart(
            model_df = pl.scan_parquet('models.parquet'),
            predictor_df = pl.scan_parquet('predictors.parquet')
            query = {"Configuration":["Web_Click_Through"]}
            )
   >>> dm = ADMDatamart.from_ds_export(base_path='/my_export_folder')
   >>> dm = ADMDatamart.from_s3("pega_export")
   >>> dm = ADMDatamart.from_dataflow_export(glob("data/models*"), glob("data/preds*"))

   .. note::

      This class depends on two datasets:
      
      - `pyModelSnapshots` corresponds to the `model_data` attribute
      - `pyADMPredictorSnapshots` corresponds to the `predictor_data` attribute
      
      For instructions on how to download these datasets, please refer to the following
      article: https://docs.pega.com/bundle/platform/page/platform/decision-management/exporting-monitoring-database.html

   .. seealso::

      :obj:`pdstools.adm.Plots`
          The out of the box plots on the Datamart data
      
      :obj:`pdstools.adm.Reports`
          Methods to generate the Health Check and Model Report
      
      :obj:`pdstools.utils.cdh_utils._apply_query`
          How to query the ADMDatamart class and methods


   .. py:attribute:: model_data
      :type:  Optional[polars.LazyFrame]


   .. py:attribute:: predictor_data
      :type:  Optional[polars.LazyFrame]


   .. py:attribute:: combined_data
      :type:  Optional[polars.LazyFrame]


   .. py:attribute:: plot
      :type:  pdstools.adm.Plots.Plots


   .. py:attribute:: aggregates
      :type:  pdstools.adm.Aggregates.Aggregates


   .. py:attribute:: agb
      :type:  pdstools.adm.ADMTrees.AGB


   .. py:attribute:: generate
      :type:  pdstools.adm.Reports.Reports


   .. py:attribute:: cdh_guidelines
      :type:  pdstools.adm.CDH_Guidelines.CDHGuidelines


   .. py:attribute:: bin_aggregator
      :type:  pdstools.adm.BinAggregator.BinAggregator


   .. py:attribute:: first_action_dates
      :type:  Optional[polars.LazyFrame]


   .. py:attribute:: context_keys
      :type:  List[str]
      :value: ['Channel', 'Direction', 'Issue', 'Group', 'Name']


   .. py:method:: _get_first_action_dates(df: Optional[polars.LazyFrame]) -> polars.LazyFrame


   .. py:method:: from_ds_export(model_filename: Optional[str] = None, predictor_filename: Optional[str] = None, base_path: Union[os.PathLike, str] = '.', *, query: Optional[pdstools.utils.types.QUERY] = None, extract_pyname_keys: bool = True)
      :classmethod:


      Import the ADMDatamart class from a Pega Dataset Export

      :param model_filename: The full path or name (if base_path is given) to the model snapshot files,
                             by default None
      :type model_filename: Optional[str], optional
      :param predictor_filename: The full path or name (if base_path is given) to the predictor binning
                                 snapshot files, by default None
      :type predictor_filename: Optional[str], optional
      :param base_path: A base path to provide so that we can automatically find the most recent
                        files for both the model and predictor snapshots, if model_filename and
                        predictor_filename are not given as full paths, by default "."
      :type base_path: Union[os.PathLike, str], optional
      :param query: An optional argument to filter out selected data, by default None
      :type query: Optional[QUERY], optional
      :param extract_pyname_keys: Whether to extract additional keys from the `pyName` column, by default True
      :type extract_pyname_keys: bool, optional

      :returns: The properly initialized ADMDatamart class
      :rtype: ADMDatamart

      .. rubric:: Examples

      >>> from pdstools import ADMDatamart

      >>> # To automatically find the most recent files in the 'my_export_folder' dir:
      >>> dm = ADMDatamart.from_ds_export(base_path='/my_export_folder')

      >>> # To specify individual files:
      >>> dm = ADMDatamart.from_ds_export(
              model_df='/Downloads/model_snapshots.parquet',
              predictor_df = '/Downloads/predictor_snapshots.parquet'
              )

      .. note::

         By default, the dataset export in Infinity returns a zip file per table.
         You do not need to open up this zip file! You can simply point to the zip,
         and this method will be able to read in the underlying data.

      .. seealso::

         :obj:`pdstools.pega_io.File.read_ds_export`
             More information on file compatibility
         
         :obj:`pdstools.utils.cdh_utils._apply_query`
             How to query the ADMDatamart class and methods


   .. py:method:: from_s3()
      :classmethod:


      Not implemented yet. Please let us know if you would like this functionality!


   .. py:method:: from_dataflow_export(model_data_files: Union[Iterable[str], str], predictor_data_files: Union[Iterable[str], str], *, query: Optional[pdstools.utils.types.QUERY] = None, extract_pyname_keys: bool = True, cache_file_prefix: str = '', extension: Literal['json'] = 'json', compression: Literal['gzip'] = 'gzip', cache_directory: Union[os.PathLike, str] = 'cache')
      :classmethod:


      Read in data generated by a data flow, such as the Prediction Studio export.

      Dataflows are able to export data from and to various sources.
      As they are meant to be used in production, they are highly resiliant.
      For every partition and every node, a dataflow will output a small json file
      every few seconds. While this is great for production loads, it can be a bit
      more tricky to read in the data for smaller-scale and ad-hoc analyses.

      This method aims to make the ingestion of such highly partitioned data easier.
      It reads in every individual small json file that the dataflow has output,
      and caches them to a parquet file in the `cache_directory` folder.
      As such, if you re-run this method later with more data added since the last
      export, we will not read in from the (slow) dataflow files, but rather from the
      (much faster) cache.

      :param model_data_files: A list of files to read in as the model snapshots
      :type model_data_files: Union[Iterable[str], str]
      :param predictor_data_files: A list of files to read in as the predictor snapshots
      :type predictor_data_files: Union[Iterable[str], str]
      :param query: A, by default None
      :type query: Optional[QUERY], optional
      :param extract_pyname_keys: Whether to extract extra keys from the pyName column, by default True
      :type extract_pyname_keys: bool, optional
      :param cache_file_prefix: An optional prefix for the cache files, by default ""
      :type cache_file_prefix: str, optional
      :param extension: The extension of the source data, by default "json"
      :type extension: Literal[&quot;json&quot;], optional
      :param compression: The compression of the source files, by default "gzip"
      :type compression: Literal[&quot;gzip&quot;], optional
      :param cache_directory: Where to store the cached files, by default "cache"
      :type cache_directory: Union[os.PathLike, str], optional

      :returns: An initialized instance of the datamart class
      :rtype: ADMDatamart

      .. rubric:: Examples

      >>> from pdstools import ADMDatamart
      >>> import glob
      >>> dm = ADMDatamart.from_dataflow_export(glob("data/models*"), glob("data/preds*"))

      .. seealso::

         :obj:`pdstools.utils.cdh_utils._apply_query`
             How to query the ADMDatamart class and methods
         
         :obj:`glob`
             Makes creating lists of files much easier


   .. py:method:: from_pdc(df: polars.LazyFrame, return_df=False)
      :classmethod:


   .. py:method:: _validate_model_data(df: Optional[polars.LazyFrame], extract_pyname_keys: bool = True) -> Optional[polars.LazyFrame]

      Internal method to validate model data


   .. py:method:: _validate_predictor_data(df: Optional[polars.LazyFrame]) -> Optional[polars.LazyFrame]

      Internal method to validate predictor data


   .. py:method:: apply_predictor_categorization(categorization: Union[polars.Expr, Callable[Ellipsis, polars.Expr], Dict[str, Union[str, List[str]]]] = cdh_utils.default_predictor_categorization, *, use_regexp: bool = False, df: Optional[polars.LazyFrame] = None)

      Apply a new predictor categorization to the datamart tables

      In certain plots, we use the predictor categorization to indicate what 'kind'
      a certain predictor is, such as IH, Customer, etc. Call this method with a
      custom Polars Expression (or a method that returns one) or a simple mapping
      and it will be applied to the predictor data (and the combined dataset too).
      When the categorization provides no match, the existing categories are kept
      as they are.

      For a reference implementation of a custom predictor categorization,
      refer to `pdstools.utils.cdh_utils.default_predictor_categorization`.

      :param categorization: A Polars Expression (or method that returns one) that returns the
                             predictor categories. Should be based on Polars' when.then.otherwise syntax.
                             Alternatively can be a dictionary of categories to (list of) string matches
                             which can be either exact (the default) or regular expressions.
                             By default, `pdstools.utils.cdh_utils.default_predictor_categorization` is used.
      :type categorization: Union[pl.Expr, Callable[..., pl.Expr], Dict[str, Union[str, List[str]]]]
      :param use_regexp: Treat the mapping patterns in the `categorization` dictionary as regular expressions
                         rather than plain strings. When treated as regular expressions, they will be
                         interpreted in non-strict mode, so invalid expressions will return in no match. See
                         https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.contains.html
                         for exact behavior of the regular expressions.
                         By default, False
      :type use_regexp: bool, optional
      :param df: A Polars Lazyframe to apply the categorization to.
                 If not provided, applies it over the predictor data and combined datasets.
                 By default, None
      :type df: Optional[pl.LazyFrame], optional

      .. seealso::

         :obj:`pdstools.utils.cdh_utils.default_predictor_categorization`
             The default method

      .. rubric:: Examples

      >>> dm = ADMDatamart(my_data) #uses the OOTB predictor categorization

      >>> # Uses a custom Polars expression to set the categories
      >>> dm.apply_predictor_categorization(categorization=pl.when(
      >>> pl.col("PredictorName").cast(pl.Utf8).str.contains("Propensity")
      >>> ).then(pl.lit("External Model")
      >>> )

      >>> # Uses a simple dictionary to set the categories
      >>> dm.apply_predictor_categorization(categorization={
      >>> "External Model" : ["Score", "Propensity"]}
      >>> )


   .. py:method:: save_data(path: Union[os.PathLike, str] = '.', selected_model_ids: Optional[List[str]] = None) -> Tuple[Optional[pathlib.Path], Optional[pathlib.Path]]

      Caches model_data and predictor_data to files.

      :param path: Where to place the files
      :type path: str
      :param selected_model_ids: Optional list of model IDs to restrict to
      :type selected_model_ids: List[str]

      :returns: The paths to the model and predictor data files
      :rtype: (Optional[Path], Optional[Path])


   .. py:property:: unique_channels

      A consistently ordered set of unique channels in the data

      Used for making the color schemes in different plots consistent


   .. py:property:: unique_configurations

      A consistently ordered set of unique configurations in the data

      Used for making the color schemes in different plots consistent


   .. py:property:: unique_channel_direction

      A consistently ordered set of unique channel+direction combos in the data
      Used for making the color schemes in different plots consistent


   .. py:property:: unique_configuration_channel_direction

      A consistently ordered set of unique configuration+channel+direction
      Used for making the color schemes in different plots consistent


   .. py:property:: unique_predictor_categories

      A consistently ordered set of unique predictor categories in the data
      Used for making the color schemes in different plots consistent


   .. py:method:: _minMaxScoresPerModel(bin_data: polars.LazyFrame) -> polars.LazyFrame
      :classmethod:


   .. py:method:: active_ranges(model_ids: Optional[Union[str, List[str]]] = None) -> polars.LazyFrame

      Calculate the active, reachable bins in classifiers.

      The classifiers exported by Pega contain (in certain product versions) more than
      the bins that can be reached given the current state of the predictors. This method
      first calculates the min and max score range from the predictor log odds, then maps
      that to the interval boundaries of the classifier(s) to find the min and max index.

      It returns a LazyFrame with the score min/max, the min/max index, as well as the
      AUC as reported in the datamart data, when calculated from the full range, and when
      calculated from the reachable bins only.

      This information can be used in the Health Check documents or when verifying the
      AUC numbers from the datamart.

      :param model_ids: An optional list of model id's, or just a single one, to report on. When
                        not given, the information is returned for all models.
      :type model_ids: Optional[Union[str, List[str]]], optional

      :returns: A table with all the index and AUC information for all the models with the following fields:

                Model Identification:
                - ModelID - The unique identifier for the model

                AUC Metrics:
                - AUC_Datamart - The AUC value as reported in the datamart
                - AUC_FullRange - The AUC calculated from the full range of bins in the classifier
                - AUC_ActiveRange - The AUC calculated from only the active/reachable bins

                Classifier Information:
                - Bins - The total number of bins in the classifier
                - nActivePredictors - The number of active predictors in the model

                Log Odds Information (mostly for internal use):
                - classifierLogOffset - The log offset of the classifier (baseline log odds)
                - sumMinLogOdds - The sum of minimum log odds across all active predictors
                - sumMaxLogOdds - The sum of maximum log odds across all active predictors
                - score_min - The minimum score (normalized sum of log odds including classifier offset)
                - score_max - The maximum score (normalized sum of log odds including classifier offset)

                Active Range Information:
                - idx_min - The minimum bin index that can be reached given the current binning of all predictors
                - idx_max - The maximum bin index that can be reached given the current binning of all predictors
      :rtype: pl.LazyFrame


.. py:class:: IH(data: polars.LazyFrame)

   .. py:attribute:: data
      :type:  polars.LazyFrame


   .. py:attribute:: positive_outcome_labels
      :type:  Dict[str, List[str]]


   .. py:attribute:: aggregates


   .. py:attribute:: plot


   .. py:attribute:: negative_outcome_labels


   .. py:method:: from_ds_export(ih_filename: Union[os.PathLike, str], query: Optional[pdstools.utils.types.QUERY] = None)
      :classmethod:


      Create an IH instance from a file with Pega Dataset Export

      :param ih_filename: The full path to the dataset files
      :type ih_filename: Union[os.PathLike, str]
      :param query: An optional argument to filter out selected data, by default None
      :type query: Optional[QUERY], optional

      :returns: The properly initialized IH object
      :rtype: IH


   .. py:method:: from_s3()
      :classmethod:


      Not implemented yet. Please let us know if you would like this functionality!


   .. py:method:: from_mock_data(days=90, n=100000)
      :classmethod:


      Initialize an IH instance with sample data

      :param days:
      :type days: number of days, defaults to 90 days
      :param n:
      :type n: number of interaction data records, defaults to 100k

      :returns: The properly initialized IH object
      :rtype: IH


   .. py:method:: get_sequences(positive_outcome_label: str, level: str, outcome_column: str, customerid_column: str) -> tuple[list[tuple[str, Ellipsis]], list[tuple[int, Ellipsis]], list[collections.defaultdict[tuple[str], int]], list[collections.defaultdict[tuple[str, Ellipsis], int]]]

      Generates customer sequences, outcome labels, counts
      needed for PMI (Pointwise Mutual Information) calculations.

      This function processes customer interaction data to produce:
      1. Action sequences per customer.
      2. Corresponding binary outcome sequences (1 for positive outcome, 0 otherwise).
      3. Counts of bigrams and ≥3-grams that end with a positive outcome.
      4. Counts of all possible bigrams within that corpus.

      :param positive_outcome_label: The outcome label that marks the final event in a sequence.
      :type positive_outcome_label: str
      :param level: Column name that contains the action (offer / treatment).
      :type level: str
      :param outcome_column: Column name that contains the outcome label.
      :type outcome_column: str
      :param customerid_column: Column name that identifies a unique customer / subject.
      :type customerid_column: str

      :returns: * **customer_sequences** (*list[tuple[str, ...]]*) -- Sequences of actions per customer.
                * **customer_outcomes** (*list[tuple[int, ...]]*) -- Binary outcomes (0 or 1) for each customer action sequence.
                * **count_actions** (*list[defaultdict[tuple[str], int]]*) -- Actions frequency counts.
                  Index 0 = count of first element in all bigrams
                  Index 1 = count of second element in all bigrams
                * **count_sequences** (*list[defaultdict[tuple[str, …], int]]*) -- Sequence frequency counts.
                  Index 0 = bigrams (all)
                  Index 1 = ≥3-grams that end with positive outcome
                  Index 2 = bigrams that end with positive outcome
                  Index 3 = unique ngrams per customer


   .. py:method:: calculate_pmi(count_actions: list[collections.defaultdict[tuple[str], int]], count_sequences: list[collections.defaultdict[tuple[str, Ellipsis], int]]) -> tuple[dict[tuple[str, str], float], dict[tuple[str, Ellipsis], float]]
      :staticmethod:


      Computes PMI scores for n-grams (n ≥ 2) in customer action sequences.
      Returns an unsorted dictionary mapping sequences to their PMI values, providing insights into significant action associations.

      Bigrams values are calculated by PMI.
      N-gram values are computed by averaging the PMI of their constituent bigrams.
      Higher values indicate more informative or surprising paths.

      :param count_actions: Actions frequency counts.
                            Index 0 = count of first element in all bigrams
                            Index 1 = count of second element in all bigrams
      :type count_actions: list[defaultdict[tuple[str], int]]
      :param count_sequences: Sequence frequency counts.
                              Index 0 = bigrams (all)
                              Index 1 = ≥3-grams that end with positive outcome
                              Index 2 = bigrams that end with positive outcome
                              Index 3 = unique ngrams per customer
      :type count_sequences: list[defaultdict[tuple[str, …], int]]

      :returns: **ngrams_pmi** -- Dictionary containing PMI information for bigrams and n-grams.
                For bigrams, the value is a float representing the PMI value.
                For higher-order n-grams, the value is a dictionary with:
                    - 'average_pmi: The average PMI value.
                    - 'links': A dictionary mapping each constituent bigram to its PMI value.
      :rtype: dict[tuple[str, ...], float | dict[str, float | dict[tuple[str, str], float]]]


   .. py:method:: pmi_overview(ngrams_pmi: Dict[str, Dict[str, Union[Dict[str, float], float]]], count_sequences: list[collections.defaultdict[tuple[str, Ellipsis], int]], customer_sequences: list[tuple[str, Ellipsis]], customer_outcomes: list[tuple[int, Ellipsis]]) -> polars.DataFrame
      :staticmethod:


      Analyzes customer sequences to identify patterns linked to positive outcomes. Returns a sorted Polars DataFrame of significant n-grams

      :param ngrams_pmi: Dictionary containing PMI information for bigrams and n-grams.
                         For bigrams, the value is a float representing the PMI value.
                         For higher-order n-grams, the value is a dictionary with:
                             - 'average_pmi: The average PMI value.
                             - 'links': A dictionary mapping each constituent bigram to its PMI value.
      :type ngrams_pmi: dict[tuple[str, ...], float | dict[str, float | dict[tuple[str, str], float]]]
      :param count_sequences: Sequence frequency counts.
                              Index 1 = ≥3-grams ending in positive outcome.
                              Index 2 = bigrams ending in positive outcome.
      :type count_sequences: list[defaultdict[tuple[str, ...], int]]
      :param customer_sequences: Sequences of actions per customer.
      :type customer_sequences: list[tuple[str, ...]]
      :param customer_outcomes: Binary outcomes (0 or 1) for each customer action sequence.
      :type customer_outcomes: list[tuple[int, ...]]

      :returns: DataFrame containing:
                - 'Sequence': the action sequence
                - 'Length': number of actions
                - 'Avg PMI': average PMI value
                - 'Frequency': number of times the sequence appears
                - 'Unique freq': number of unique customers who had this sequence ending in a positive outcome
                - 'Score': Avg PMI x log(Frequency), sorted descending
      :rtype: pl.DataFrame


.. py:class:: ImpactAnalyzer(raw_data: polars.LazyFrame)

   .. py:attribute:: ia_data
      :type:  polars.LazyFrame


   .. py:attribute:: default_ia_experiments


   .. py:attribute:: default_ia_controlgroups


   .. py:attribute:: plot


   .. py:method:: from_pdc(pdc_source: Union[os.PathLike, str, dict], *, query: Optional[pdstools.utils.types.QUERY] = None, return_input_df: Optional[bool] = False, return_df: Optional[bool] = False)
      :classmethod:


      Create an ImpactAnalyzer instance from a PDC file

      :param pdc_filename: The full path to the PDC file
      :type pdc_filename: Union[os.PathLike, str]
      :param query: An optional argument to filter out selected data, by default None
      :type query: Optional[QUERY], optional
      :param return_input_df: Debugging option to return the wide data from the raw JSON file as a DataFrame, by default False
      :type return_input_df: Optional[QUERY], optional
      :param return_df: Returns the processed input data as a DataFrame. Multiple of these can be stacked up and used to initialize the ImpactAnalyzer class, by default False
      :type return_df: Optional[QUERY], optional

      :returns: The properly initialized ImpactAnalyzer object
      :rtype: ImpactAnalyzer


   .. py:method:: _from_pdc_json(json_data: dict, *, query: Optional[pdstools.utils.types.QUERY] = None, return_input_df: Optional[bool] = False, return_df: Optional[bool] = False)
      :classmethod:


      Internal method to create an ImpactAnalyzer instance from PDC JSON data

      The PDC data is really structured as a list of expriments: control group A vs control group B. There
      is no explicit indicator whether the B's are really the same customers or not. The PDC data also contains
      a lot of UI related information that is not necessary.

      We turn this data into a series of control groups with just counts of impressions and accepts. This
      does need to assume a few implicit assumptions.


   .. py:method:: summary_by_channel() -> polars.LazyFrame

      Summarization of the experiments in Impact Analyzer split by Channel.


      :returns: Summary across all running Impact Analyzer experiments as a dataframe with the following fields:

                Channel Identification:
                - Channel: The channel name

                Performance Metrics:
                - CTR_Lift Adaptive Models vs Random Propensity: Lift in Engagement when testing prioritization with just Adaptive Models vs just Random Propensity
                - CTR_Lift NBA vs No Levers: Lift in Engagement for the full NBA Framework as configured vs prioritization without levers (only p, V and C)
                - CTR_Lift NBA vs Only Eligibility Rules: Lift in Engagement for the full NBA Framework as configured vs Only Eligibility policies applied (no Applicability or Suitability, and prioritized with pVCL)
                - CTR_Lift NBA vs Propensity Only: Lift in Engagement for the full NBA Framework as configured vs prioritization with model propensity only (no V, C or L)
                - CTR_Lift NBA vs Random: Lift in Engagement for the full NBA Framework as configured vs a Random eligible action (all engagement policies but randomly prioritized)
                - Value_Lift Adaptive Models vs Random Propensity: Lift in Expected Value when testing prioritization with just Adaptive Models vs just Random Propensity
                - Value_Lift NBA vs No Levers: Lift in Expected Value for the full NBA Framework as configured vs prioritization without levers (only p, V and C)
                - Value_Lift NBA vs Only Eligibility Rules: Lift in Expected Value for the full NBA Framework as configured vs Only Eligibility policies applied (no Applicability or Suitability, and prioritized with pVCL)
                - Value_Lift NBA vs Propensity Only: Lift in Expected Value for the full NBA Framework as configured vs prioritization with model propensity only (no V, C or L)
                - Value_Lift NBA vs Random: Lift in Expected Value for the full NBA Framework as configured vs a Random eligible action (all engagement policies but randomly prioritized)
      :rtype: pl.LazyFrame


   .. py:method:: overall_summary() -> polars.LazyFrame

      Summarization of the experiments in Impact Analyzer.


      :returns: Summary across all running Impact Analyzer experiments as a dataframe with the following fields:

                Performance Metrics:
                - CTR_Lift Adaptive Models vs Random Propensity: Lift in Engagement when testing prioritization with just Adaptive Models vs just Random Propensity
                - CTR_Lift NBA vs No Levers: Lift in Engagement for the full NBA Framework as configured vs prioritization without levers (only p, V and C)
                - CTR_Lift NBA vs Only Eligibility Rules: Lift in Engagement for the full NBA Framework as configured vs Only Eligibility policies applied (no Applicability or Suitability, and prioritized with pVCL)
                - CTR_Lift NBA vs Propensity Only: Lift in Engagement for the full NBA Framework as configured vs prioritization with model propensity only (no V, C or L)
                - CTR_Lift NBA vs Random: Lift in Engagement for the full NBA Framework as configured vs a Random eligible action (all engagement policies but randomly prioritized)
                - Value_Lift Adaptive Models vs Random Propensity: Lift in Expected Value when testing prioritization with just Adaptive Models vs just Random Propensity
                - Value_Lift NBA vs No Levers: Lift in Expected Value for the full NBA Framework as configured vs prioritization without levers (only p, V and C)
                - Value_Lift NBA vs Only Eligibility Rules: Lift in Expected Value for the full NBA Framework as configured vs Only Eligibility policies applied (no Applicability or Suitability, and prioritized with pVCL)
                - Value_Lift NBA vs Propensity Only: Lift in Expected Value for the full NBA Framework as configured vs prioritization with model propensity only (no V, C or L)
                - Value_Lift NBA vs Random: Lift in Expected Value for the full NBA Framework as configured vs a Random eligible action (all engagement policies but randomly prioritized)
      :rtype: pl.LazyFrame


   .. py:method:: summarize_control_groups(by: Optional[Union[List[str], str]] = None, drop_internal_cols=True) -> polars.LazyFrame


   .. py:method:: summarize_experiments(by: Optional[Union[List[str], str]] = None) -> polars.LazyFrame


.. py:function:: read_ds_export(filename: Union[str, io.BytesIO], path: Union[str, os.PathLike] = '.', verbose: bool = False, **reading_opts) -> Optional[polars.LazyFrame]

   Read in most out of the box Pega dataset export formats
   Accepts one of the following formats:
   - .csv
   - .json
   - .zip (zipped json or CSV)
   - .feather
   - .ipc
   - .parquet

   It automatically infers the default file names for both model data as well as predictor data.
   If you supply either 'modelData' or 'predictorData' as the 'file' argument, it will search for them.
   If you supply the full name of the file in the 'path' directory, it will import that instead.
   Since pdstools V3.x, returns a Polars LazyFrame. Simply call `.collect()` to get an eager frame.

   :param filename: Can be one of the following:
                    - A string with the full path to the file
                    - A string with the name of the file (to be searched in the given path)
                    - A BytesIO object containing the file data (e.g., from an uploaded file in a webapp)
   :type filename: Union[str, BytesIO]
   :param path: The location of the file
   :type path: str, default = '.'
   :param verbose: Whether to print out which file will be imported
   :type verbose: bool, default = True

   :keyword Any: Any arguments to plug into the scan_* function from Polars.

   :returns: * *pl.LazyFrame* -- The (lazy) dataframe
             * *Examples* -- >>> df = read_ds_export(filename='full/path/to/ModelSnapshot.json')
               >>> df = read_ds_export(filename='ModelSnapshot.json', path='data/ADMData')
               >>> df = read_ds_export(filename=uploaded_file)  # Where uploaded_file is a BytesIO object


.. py:class:: Prediction(df: polars.LazyFrame, *, query: Optional[pdstools.utils.types.QUERY] = None)

   Monitor Pega Prediction Studio Predictions


   .. py:attribute:: predictions
      :type:  polars.LazyFrame


   .. py:attribute:: plot
      :type:  PredictionPlots


   .. py:attribute:: prediction_validity_expr


   .. py:attribute:: cdh_guidelines


   .. py:method:: from_pdc(df: polars.LazyFrame, return_df=False)
      :classmethod:


   .. py:method:: from_mock_data(days=70)
      :staticmethod:


   .. py:property:: is_available
      :type: bool


   .. py:property:: is_valid
      :type: bool


   .. py:method:: summary_by_channel(custom_predictions: Optional[List[List]] = None, *, start_date: Optional[datetime.datetime] = None, end_date: Optional[datetime.datetime] = None, window: Optional[Union[int, datetime.timedelta]] = None, by_period: Optional[str] = None, debug: bool = False) -> polars.LazyFrame

      Summarize prediction per channel

      :param custom_predictions: Optional list with custom prediction name to channel mappings. Defaults to None.
      :type custom_predictions: Optional[List[CDH_Guidelines.NBAD_Prediction]], optional
      :param start_date: Start date of the summary period. If None (default) uses the end date minus the window, or if both absent, the earliest date in the data
      :type start_date: datetime.datetime, optional
      :param end_date: End date of the summary period. If None (default) uses the start date plus the window, or if both absent, the latest date in the data
      :type end_date: datetime.datetime, optional
      :param window: Number of days to use for the summary period or an explicit timedelta. If None (default) uses the whole period. Can't be given if start and end date are also given.
      :type window: int or datetime.timedelta, optional
      :param by_period: Optional additional grouping by time period. Format string as in polars.Expr.dt.truncate (https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.truncate.html), for example "1mo", "1w", "1d" for calendar month, week day. Defaults to None.
      :type by_period: str, optional
      :param debug: If True, enables debug mode for additional logging or outputs. Defaults to False.
      :type debug: bool, optional

      :returns: Summary across all Predictions as a dataframe with the following fields:

                Time and Configuration Fields:
                - DateRange Min - The minimum date in the summary time range
                - DateRange Max - The maximum date in the summary time range
                - Duration - The duration in seconds between the minimum and maximum snapshot times
                - Prediction: The prediction name
                - Channel: The channel name
                - Direction: The direction (e.g., Inbound, Outbound)
                - ChannelDirectionGroup: Combined Channel/Direction identifier
                - isValid: Boolean indicating if the prediction data is valid
                - isStandardNBADPrediction: Boolean indicating if this is a standard NBAD prediction
                - isMultiChannelPrediction: Boolean indicating if this is a multichannel prediction
                - ControlPercentage: Percentage of responses in control group
                - TestPercentage: Percentage of responses in test group

                Performance Metrics:
                - Performance: Weighted model performance (AUC)
                - Positives: Sum of positive responses
                - Negatives: Sum of negative responses
                - Responses: Sum of all responses
                - Positives_Test: Sum of positive responses in test group
                - Positives_Control: Sum of positive responses in control group
                - Positives_NBA: Sum of positive responses in NBA group
                - Negatives_Test: Sum of negative responses in test group
                - Negatives_Control: Sum of negative responses in control group
                - Negatives_NBA: Sum of negative responses in NBA group
                - CTR: Clickthrough rate (Positives over Positives + Negatives)
                - CTR_Test: Clickthrough rate for test group (model propensitities)
                - CTR_Control: Clickthrough rate for control group (random propensities)
                - CTR_NBA: Clickthrough rate for NBA group (available only when Impact Analyzer is used)
                - Lift: Lift in Engagement when testing prioritization with just Adaptive Models vs just Random Propensity

                Technology Usage Indicators:
                - usesImpactAnalyzer: Boolean indicating if Impact Analyzer is used
      :rtype: pl.LazyFrame


   .. py:method:: overall_summary(custom_predictions: Optional[List[List]] = None, *, start_date: Optional[datetime.datetime] = None, end_date: Optional[datetime.datetime] = None, window: Optional[Union[int, datetime.timedelta]] = None, by_period: Optional[str] = None, debug: bool = False) -> polars.LazyFrame

      Overall prediction summary. Only valid prediction data is included.

      :param custom_predictions: Optional list with custom prediction name to channel mappings. Defaults to None.
      :type custom_predictions: Optional[List[CDH_Guidelines.NBAD_Prediction]], optional
      :param start_date: Start date of the summary period. If None (default) uses the end date minus the window, or if both absent, the earliest date in the data
      :type start_date: datetime.datetime, optional
      :param end_date: End date of the summary period. If None (default) uses the start date plus the window, or if both absent, the latest date in the data
      :type end_date: datetime.datetime, optional
      :param window: Number of days to use for the summary period or an explicit timedelta. If None (default) uses the whole period. Can't be given if start and end date are also given.
      :type window: int or datetime.timedelta, optional
      :param by_period: Optional additional grouping by time period. Format string as in polars.Expr.dt.truncate (https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.truncate.html), for example "1mo", "1w", "1d" for calendar month, week day. Defaults to None.
      :type by_period: str, optional
      :param debug: If True, enables debug mode for additional logging or outputs. Defaults to False.
      :type debug: bool, optional

      :returns: Summary across all Predictions as a dataframe with the following fields:

                Time and Configuration Fields:
                - DateRange Min - The minimum date in the summary time range
                - DateRange Max - The maximum date in the summary time range
                - Duration - The duration in seconds between the minimum and maximum snapshot times
                - ControlPercentage: Weighted average percentage of control group responses
                - TestPercentage: Weighted average percentage of test group responses

                Performance Metrics:
                - Performance: Weighted average performance across all valid channels
                - Positives Inbound: Sum of positive responses across all valid inbound channels
                - Positives Outbound: Sum of positive responses across all valid outbound channels
                - Responses Inbound: Sum of all responses across all valid inbound channels
                - Responses Outbound: Sum of all responses across all valid outbound channels
                - Overall Lift: Weighted average lift across all valid channels
                - Minimum Negative Lift: The lowest negative lift value found

                Channel Statistics:
                - Number of Valid Channels: Count of unique valid channel/direction combinations
                - Channel with Minimum Negative Lift: Channel with the lowest negative lift value

                Technology Usage Indicators:
                - usesImpactAnalyzer: Boolean indicating if any channel uses Impact Analyzer
      :rtype: pl.LazyFrame


.. py:function:: default_predictor_categorization(x: Union[str, polars.Expr] = pl.col('PredictorName')) -> polars.Expr

   Function to determine the 'category' of a predictor.

   It is possible to supply a custom function.
   This function can accept an optional column as input
   And as output should be a Polars expression.
   The most straight-forward way to implement this is with
   pl.when().then().otherwise(), which you can chain.

   By default, this function returns "Primary" whenever
   there is no '.' anywhere in the name string,
   otherwise returns the first string before the first period

   :param x: The column to parse
   :type x: Union[str, pl.Expr], default = pl.col('PredictorName')


.. py:function:: cdh_sample(query: Optional[pdstools.utils.types.QUERY] = None) -> pdstools.adm.ADMDatamart.ADMDatamart

   Import a sample dataset from the CDH Sample application

   :param query: An optional query to apply to the data, by default None
   :type query: Optional[QUERY], optional

   :returns: The ADM Datamart class populated with CDH Sample data
   :rtype: ADMDatamart


.. py:function:: sample_value_finder(threshold: Optional[float] = None) -> pdstools.valuefinder.ValueFinder.ValueFinder

   Import a sample dataset of a Value Finder simulation

   This simulation was ran on a stock CDH Sample system.

   :param threshold: Optional override of the propensity threshold in the system, by default None
   :type threshold: Optional[float], optional

   :returns: The Value Finder class populated with the Value Finder simulation data
   :rtype: ValueFinder


.. py:function:: show_versions(print_output: Literal[True] = True) -> None
                 show_versions(print_output: Literal[False] = False) -> str

   Get a list of currently installed versions of pdstools and its dependencies.

   :param print_output: If True, print the version information to stdout.
                        If False, return the version information as a string.
                        Default is True.
   :type print_output: bool, optional

   :returns: Version information as a string if print_output is False, else None.
   :rtype: Optional[str]

   .. rubric:: Examples

   >>> from pdstools import show_versions
   >>> show_versions()
   --- Version info ---
   pdstools: 4.0.0-alpha
   Platform: macOS-14.7-arm64-arm-64bit
   Python: 3.12.4 (main, Jun  6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)]

   --- Dependencies ---
   typing_extensions: 4.12.2
   polars>=1.9: 1.9.0

   --- Dependency group: adm ---
   plotly>=5.5.0: 5.24.1

   --- Dependency group: api ---
   pydantic: 2.9.2
   httpx: 0.27.2


.. py:class:: ValueFinder(df: polars.LazyFrame, *, query: Optional[pdstools.utils.types.QUERY] = None, n_customers: Optional[int] = None, threshold: Optional[float] = None)

   Analyze the Value Finder dataset for detailed insights


   .. py:attribute:: df
      :type:  polars.LazyFrame


   .. py:attribute:: n_customers
      :type:  int


   .. py:attribute:: nbad_stages
      :value: ['Eligibility', 'Applicability', 'Suitability', 'Arbitration']


   .. py:attribute:: aggregates


   .. py:attribute:: plot


   .. py:method:: from_ds_export(filename: Optional[str] = None, base_path: Union[os.PathLike, str] = '.', *, query: Optional[pdstools.utils.types.QUERY] = None, n_customers: Optional[int] = None, threshold: Optional[float] = None)
      :classmethod:


   .. py:method:: from_dataflow_export(files: Union[Iterable[str], str], *, query: Optional[pdstools.utils.types.QUERY] = None, n_customers: Optional[int] = None, threshold: Optional[float] = None, cache_file_prefix: str = '', extension: Literal['json'] = 'json', compression: Literal['gzip'] = 'gzip', cache_directory: Union[os.PathLike, str] = 'cache')
      :classmethod:


   .. py:method:: set_threshold(new_threshold: Optional[float] = None)


   .. py:property:: threshold


   .. py:method:: save_data(path: Union[os.PathLike, str] = '.') -> Optional[pathlib.Path]

      Cache the pyValueFinder dataset to a Parquet file

      :param path: Where to place the file
      :type path: str

      :returns: The paths to the model and predictor data files
      :rtype: (Optional[Path], Optional[Path])