pdstools
========

.. py:module:: pdstools

.. autoapi-nested-parse::

   Pega Data Scientist Tools Python library

Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/pdstools/adm/index
   /autoapi/pdstools/cli/index
   /autoapi/pdstools/decision_analyzer/index
   /autoapi/pdstools/explanations/index
   /autoapi/pdstools/ih/index
   /autoapi/pdstools/impactanalyzer/index
   /autoapi/pdstools/infinity/index
   /autoapi/pdstools/pega_io/index
   /autoapi/pdstools/prediction/index
   /autoapi/pdstools/reports/index
   /autoapi/pdstools/resources/index
   /autoapi/pdstools/utils/index
   /autoapi/pdstools/valuefinder/index

Classes
-------

.. autoapisummary::

   pdstools.ADMDatamart
   pdstools.IH
   pdstools.ImpactAnalyzer
   pdstools.Prediction
   pdstools.ValueFinder

Functions
---------

.. autoapisummary::

   pdstools.read_ds_export
   pdstools.default_predictor_categorization
   pdstools.cdh_sample
   pdstools.sample_value_finder
   pdstools.show_versions

Package Contents
----------------

.. py:class:: ADMDatamart(model_df: polars.LazyFrame | None = None, predictor_df: polars.LazyFrame | None = None, *, query: pdstools.utils.types.QUERY | None = None, extract_pyname_keys: bool = True)

   Monitor and analyze ADM data from the Pega Datamart.

   To initialize this class, either

   1. Initialize directly with the model_df and predictor_df polars LazyFrames
   2. Use one of the class methods: `from_ds_export`, `from_s3`, `from_dataflow_export` etc.

   This class will read in the data from different sources, properly structure them
   for further analysis, and apply correct typing and useful renaming.
   There are also a few "namespaces" that you can call from this class:

   - `.plot` contains ready-made plots to analyze the data with
   - `.aggregates` contains mostly internal data aggregation queries
   - `.agb` contains analysis utilities for Adaptive Gradient Boosting models
   - `.generate` leads to some ready-made reports, such as the Health Check
   - `.bin_aggregator` allows you to compare the bins across various models

   :param model_df: The Polars LazyFrame representation of the model snapshot table.
   :type model_df: pl.LazyFrame, optional
   :param predictor_df: The Polars LazyFrame representation of the predictor binning table.
   :type predictor_df: pl.LazyFrame, optional
   :param query: An optional query to apply to the input data.
       For details, see :meth:`pdstools.utils.cdh_utils._apply_query`.
   :type query: QUERY, optional
   :param extract_pyname_keys: Whether to extract extra keys from the `pyName` column.
       In older Pega versions, this contained pyTreatment among other (customizable) fields.
       By default True
   :type extract_pyname_keys: bool, default = True

   .. rubric:: Examples

   >>> import polars as pl
   >>> from pdstools import ADMDatamart
   >>> from glob import glob
   >>> dm = ADMDatamart(
           model_df=pl.scan_parquet('models.parquet'),
           predictor_df=pl.scan_parquet('predictors.parquet'),
           query={"Configuration": ["Web_Click_Through"]},
       )
   >>> dm = ADMDatamart.from_ds_export(base_path='/my_export_folder')
   >>> dm = ADMDatamart.from_s3("pega_export")
   >>> dm = ADMDatamart.from_dataflow_export(glob("data/models*"), glob("data/preds*"))

   .. note::

      This class depends on two datasets:

      - `pyModelSnapshots` corresponds to the `model_data` attribute
      - `pyADMPredictorSnapshots` corresponds to the `predictor_data` attribute

      For instructions on how to download these datasets, please refer to the following article:
      https://docs.pega.com/bundle/platform/page/platform/decision-management/exporting-monitoring-database.html

   .. seealso::

      :obj:`pdstools.adm.Plots`
          The out of the box plots on the Datamart data
      :obj:`pdstools.adm.Reports`
          Methods to generate the Health Check and Model Report
      :obj:`pdstools.utils.cdh_utils._apply_query`
          How to query the ADMDatamart class and methods

   .. py:attribute:: model_data
      :type: polars.LazyFrame | None

   .. py:attribute:: predictor_data
      :type: polars.LazyFrame | None

   .. py:attribute:: combined_data
      :type: polars.LazyFrame | None

   .. py:attribute:: plot
      :type: pdstools.adm.Plots.Plots

   .. py:attribute:: aggregates
      :type: pdstools.adm.Aggregates.Aggregates

   .. py:attribute:: agb
      :type: pdstools.adm.ADMTrees.AGB

   .. py:attribute:: generate
      :type: pdstools.adm.Reports.Reports

   .. py:attribute:: bin_aggregator
      :type: pdstools.adm.BinAggregator.BinAggregator

   .. py:attribute:: first_action_dates
      :type: polars.LazyFrame | None

   .. py:attribute:: context_keys
      :type: list[str]
      :value: ['Channel', 'Direction', 'Issue', 'Group', 'Name']

   .. py:method:: _get_first_action_dates(df: polars.LazyFrame | None) -> polars.LazyFrame | None

   .. py:method:: from_ds_export(model_filename: str | None = None, predictor_filename: str | None = None, base_path: os.PathLike | str = '.', *, query: pdstools.utils.types.QUERY | None = None, extract_pyname_keys: bool = True, infer_schema_length: int = 10000)
      :classmethod:

      Import the ADMDatamart class from a Pega Dataset Export

      :param model_filename: The full path or name (if base_path is given) to the model
          snapshot files, by default None
      :type model_filename: Optional[str], optional
      :param predictor_filename: The full path or name (if base_path is given) to the
          predictor binning snapshot files, by default None
      :type predictor_filename: Optional[str], optional
      :param base_path: A base path to provide so that we can automatically find the most
          recent files for both the model and predictor snapshots, if model_filename and
          predictor_filename are not given as full paths, by default "."
      :type base_path: Union[os.PathLike, str], optional
      :param query: An optional argument to filter out selected data, by default None
      :type query: Optional[QUERY], optional
      :param extract_pyname_keys: Whether to extract additional keys from the `pyName`
          column, by default True
      :type extract_pyname_keys: bool, optional
      :param infer_schema_length: Number of rows to scan when inferring the schema for
          CSV/JSON files. For large production datasets, increase this value (e.g., 200000)
          if columns are not being detected correctly. Higher values use more memory but
          provide more accurate schema detection. By default 10000
      :type infer_schema_length: int, optional
      :returns: The properly initialized ADMDatamart class
      :rtype: ADMDatamart

      .. rubric:: Examples

      >>> from pdstools import ADMDatamart
      >>> # To automatically find the most recent files in the 'my_export_folder' dir:
      >>> dm = ADMDatamart.from_ds_export(base_path='/my_export_folder')
      >>> # To specify individual files:
      >>> dm = ADMDatamart.from_ds_export(
              model_filename='/Downloads/model_snapshots.parquet',
              predictor_filename='/Downloads/predictor_snapshots.parquet',
          )
      >>> # To use a higher schema inference length for large datasets:
      >>> dm = ADMDatamart.from_ds_export(
              base_path='/my_export_folder',
              infer_schema_length=200000,
          )

      .. note::

         By default, the dataset export in Infinity returns a zip file per table.
         You do not need to open up this zip file! You can simply point to the zip,
         and this method will be able to read in the underlying data.

      .. seealso::

         :obj:`pdstools.pega_io.File.read_ds_export`
             More information on file compatibility
         :obj:`pdstools.utils.cdh_utils._apply_query`
             How to query the ADMDatamart class and methods

   .. py:method:: from_s3()
      :classmethod:

      Not implemented yet. Please let us know if you would like this functionality!

   .. py:method:: from_dataflow_export(model_data_files: collections.abc.Iterable[str] | str, predictor_data_files: collections.abc.Iterable[str] | str, *, query: pdstools.utils.types.QUERY | None = None, extract_pyname_keys: bool = True, cache_file_prefix: str = '', extension: Literal['json'] = 'json', compression: Literal['gzip'] = 'gzip', cache_directory: os.PathLike | str = 'cache')
      :classmethod:

      Read in data generated by a data flow, such as the Prediction Studio export.

      Dataflows are able to export data from and to various sources. As they are meant
      to be used in production, they are highly resilient. For every partition and every
      node, a dataflow will output a small json file every few seconds. While this is
      great for production loads, it can be a bit more tricky to read in the data for
      smaller-scale and ad-hoc analyses.

      This method aims to make the ingestion of such highly partitioned data easier.
      It reads in every individual small json file that the dataflow has output, and
      caches them to a parquet file in the `cache_directory` folder. As such, if you
      re-run this method later with more data added since the last export, we will not
      read in from the (slow) dataflow files, but rather from the (much faster) cache.
      :param model_data_files: A list of files to read in as the model snapshots
      :type model_data_files: Union[Iterable[str], str]
      :param predictor_data_files: A list of files to read in as the predictor snapshots
      :type predictor_data_files: Union[Iterable[str], str]
      :param query: An optional argument to filter out selected data, by default None
      :type query: Optional[QUERY], optional
      :param extract_pyname_keys: Whether to extract extra keys from the pyName column,
          by default True
      :type extract_pyname_keys: bool, optional
      :param cache_file_prefix: An optional prefix for the cache files, by default ""
      :type cache_file_prefix: str, optional
      :param extension: The extension of the source data, by default "json"
      :type extension: Literal["json"], optional
      :param compression: The compression of the source files, by default "gzip"
      :type compression: Literal["gzip"], optional
      :param cache_directory: Where to store the cached files, by default "cache"
      :type cache_directory: Union[os.PathLike, str], optional
      :returns: An initialized instance of the datamart class
      :rtype: ADMDatamart

      .. rubric:: Examples

      >>> from pdstools import ADMDatamart
      >>> from glob import glob
      >>> dm = ADMDatamart.from_dataflow_export(glob("data/models*"), glob("data/preds*"))

      .. seealso::

         :obj:`pdstools.utils.cdh_utils._apply_query`
             How to query the ADMDatamart class and methods
         :obj:`glob`
             Makes creating lists of files much easier

   .. py:method:: from_pdc(df: polars.LazyFrame, return_df=False)
      :classmethod:

   .. py:method:: _validate_model_data(df: polars.LazyFrame | None, extract_pyname_keys: bool = True) -> polars.LazyFrame | None

      Internal method to validate model data

   .. py:method:: _validate_predictor_data(df: polars.LazyFrame | None) -> polars.LazyFrame | None

      Internal method to validate predictor data

   .. py:method:: apply_predictor_categorization(categorization: polars.Expr | collections.abc.Callable[Ellipsis, polars.Expr] | dict[str, str | list[str]] = cdh_utils.default_predictor_categorization, *, use_regexp: bool = False, df: polars.LazyFrame | None = None)

      Apply a new predictor categorization to the datamart tables

      In certain plots, we use the predictor categorization to indicate what 'kind'
      a certain predictor is, such as IH, Customer, etc. Call this method with a
      custom Polars Expression (or a method that returns one) or a simple mapping,
      and it will be applied to the predictor data (and the combined dataset too).
      When the categorization provides no match, the existing categories are kept
      as they are.

      For a reference implementation of a custom predictor categorization, refer to
      `pdstools.utils.cdh_utils.default_predictor_categorization`.

      :param categorization: A Polars Expression (or method that returns one) that
          returns the predictor categories. Should be based on Polars'
          when.then.otherwise syntax. Alternatively can be a dictionary of categories
          to (list of) string matches, which can be either exact (the default) or
          regular expressions. By default,
          `pdstools.utils.cdh_utils.default_predictor_categorization` is used.
      :type categorization: Union[pl.Expr, Callable[..., pl.Expr], dict[str, Union[str, list[str]]]]
      :param use_regexp: Treat the mapping patterns in the `categorization` dictionary
          as regular expressions rather than plain strings. When treated as regular
          expressions, they will be interpreted in non-strict mode, so invalid
          expressions will result in no match. See
          https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.contains.html
          for the exact behavior of the regular expressions. By default, False
      :type use_regexp: bool, optional
      :param df: A Polars LazyFrame to apply the categorization to. If not provided,
          applies it over the predictor data and combined datasets. By default, None
      :type df: Optional[pl.LazyFrame], optional

      .. seealso::

         :obj:`pdstools.utils.cdh_utils.default_predictor_categorization`
             The default method

      .. rubric:: Examples

      >>> dm = ADMDatamart(my_data)  # uses the OOTB predictor categorization
      >>> # Uses a custom Polars expression to set the categories
      >>> dm.apply_predictor_categorization(categorization=pl.when(
      >>>     pl.col("PredictorName").cast(pl.Utf8).str.contains("Propensity")
      >>> ).then(pl.lit("External Model"))
      >>> )
      >>> # Uses a simple dictionary to set the categories
      >>> dm.apply_predictor_categorization(categorization={
      >>>     "External Model": ["Score", "Propensity"]}
      >>> )

   .. py:method:: save_data(path: os.PathLike | str = '.', selected_model_ids: list[str] | None = None) -> tuple[pathlib.Path | None, pathlib.Path | None]

      Caches model_data and predictor_data to files.

      :param path: Where to place the files
      :type path: str
      :param selected_model_ids: Optional list of model IDs to restrict to
      :type selected_model_ids: list[str]
      :returns: The paths to the model and predictor data files
      :rtype: (Optional[Path], Optional[Path])

   .. py:property:: unique_channels

      A consistently ordered set of unique channels in the data

      Used for making the color schemes in different plots consistent

   .. py:property:: unique_configurations

      A consistently ordered set of unique configurations in the data

      Used for making the color schemes in different plots consistent

   .. py:property:: unique_channel_direction

      A consistently ordered set of unique channel+direction combos in the data

      Used for making the color schemes in different plots consistent

   .. py:property:: unique_configuration_channel_direction

      A consistently ordered set of unique configuration+channel+direction combos in the data

      Used for making the color schemes in different plots consistent

   .. py:property:: unique_predictor_categories

      A consistently ordered set of unique predictor categories in the data

      Used for making the color schemes in different plots consistent

   .. py:method:: get_last_data_for_report() -> polars.DataFrame

      Get the last snapshot of data formatted for report display.

      This method provides a standardized view of the most recent model data with
      formatting suitable for Health Check reports and other documents. It handles
      null values, type conversions, and creates useful combined columns like
      "Channel/Direction".

      :returns: Collected DataFrame with the following transformations applied:

                - Categorical columns cast to strings
                - String and Null columns filled with "NA"
                - SuccessRate and Performance filled with 0 for nulls/NaNs
                - ResponseCount filled with 0 for nulls
                - Channel/Direction combined column created
      :rtype: pl.DataFrame

      .. rubric:: Examples

      >>> datamart = ADMDatamart.from_ds_export(model_filename="models.csv")
      >>> last_data = datamart.get_last_data_for_report()
      >>> # Use in reports without additional processing
      >>> active_models = last_data.filter(pl.col("ResponseCount") > 1000)

   .. py:method:: _minMaxScoresPerModel(bin_data: polars.LazyFrame) -> polars.LazyFrame
      :classmethod:

   .. py:method:: active_ranges(model_ids: str | list[str] | None = None) -> polars.LazyFrame

      Calculate the active, reachable bins in classifiers.

      The classifiers exported by Pega contain (in certain product versions) more
      than the bins that can be reached given the current state of the predictors.
      This method first calculates the min and max score range from the predictor
      log odds, then maps that to the interval boundaries of the classifier(s) to
      find the min and max index.

      It returns a LazyFrame with the score min/max, the min/max index, as well as
      the AUC as reported in the datamart data, when calculated from the full range,
      and when calculated from the reachable bins only.

      This information can be used in the Health Check documents or when verifying
      the AUC numbers from the datamart.

      :param model_ids: An optional list of model id's, or just a single one, to
          report on. When not given, the information is returned for all models.
      :type model_ids: Optional[Union[str, list[str]]], optional
      :returns: A table with all the index and AUC information for all the models,
                with the following fields:

                Model Identification:

                - ModelID - The unique identifier for the model

                AUC Metrics:

                - AUC_Datamart - The AUC value as reported in the datamart
                - AUC_FullRange - The AUC calculated from the full range of bins in the classifier
                - AUC_ActiveRange - The AUC calculated from only the active/reachable bins

                Classifier Information:

                - Bins - The total number of bins in the classifier
                - nActivePredictors - The number of active predictors in the model

                Log Odds Information (mostly for internal use):

                - classifierLogOffset - The log offset of the classifier (baseline log odds)
                - sumMinLogOdds - The sum of minimum log odds across all active predictors
                - sumMaxLogOdds - The sum of maximum log odds across all active predictors
                - score_min - The minimum score (normalized sum of log odds including classifier offset)
                - score_max - The maximum score (normalized sum of log odds including classifier offset)

                Active Range Information:

                - idx_min - The minimum bin index that can be reached given the current binning of all predictors
                - idx_max - The maximum bin index that can be reached given the current binning of all predictors
      :rtype: pl.LazyFrame

.. py:class:: IH(data: polars.LazyFrame)

   Analyze Interaction History data from Pega CDH.

   The IH class provides analysis and visualization capabilities for customer
   interaction data from Pega's Customer Decision Hub. It supports engagement,
   conversion, and open rate metrics through customizable outcome label mappings.

   .. attribute:: data

      The underlying interaction history data.

      :type: pl.LazyFrame

   .. attribute:: aggregates

      Aggregation methods accessor.

      :type: Aggregates

   .. attribute:: plot

      Plot accessor for visualization methods.

      :type: Plots

   .. attribute:: positive_outcome_labels

      Mapping of metric types to positive outcome labels.

      :type: dict

   .. attribute:: negative_outcome_labels

      Mapping of metric types to negative outcome labels.

      :type: dict

   .. seealso::

      :obj:`pdstools.adm.ADMDatamart`
          For ADM model analysis.
      :obj:`pdstools.impactanalyzer.ImpactAnalyzer`
          For Impact Analyzer experiments.

   .. rubric:: Examples

   >>> from pdstools import IH
   >>> ih = IH.from_ds_export("interaction_history.zip")
   >>> ih.aggregates.summary_by_channel().collect()
   >>> ih.plot.response_count_trend()

   .. py:attribute:: data
      :type: polars.LazyFrame

   .. py:attribute:: positive_outcome_labels
      :type: dict[str, list[str]]

      Mapping of metric types to positive outcome labels.

   .. py:attribute:: negative_outcome_labels
      :type: dict[str, list[str]]

      Mapping of metric types to negative outcome labels.

   .. py:attribute:: aggregates

   .. py:attribute:: plot

   .. py:method:: from_ds_export(ih_filename: os.PathLike | str, query: pdstools.utils.types.QUERY | None = None) -> IH
      :classmethod:

      Create an IH instance from a Pega Dataset Export.

      :param ih_filename: Path to the dataset export file (parquet, csv, ndjson, or zip).
      :type ih_filename: Union[os.PathLike, str]
      :param query: Polars expression to filter the data. Default is None.
      :type query: Optional[QUERY], optional
      :returns: Initialized IH instance.
      :rtype: IH

      .. rubric:: Examples

      >>> ih = IH.from_ds_export("Data-pxStrategyResult_pxInteractionHistory.zip")
      >>> ih.data.collect_schema()

   .. py:method:: from_s3() -> IH
      :classmethod:
      :abstractmethod:

      Create an IH instance from S3 data.

      .. note::

         Not implemented yet. Please let us know if you would like this!

      :raises NotImplementedError: This method is not yet implemented.

   .. py:method:: from_mock_data(days: int = 90, n: int = 100000) -> IH
      :classmethod:

      Create an IH instance with synthetic sample data.

      Generates realistic interaction history data for testing and demonstration
      purposes. Includes inbound (Web) and outbound (Email) channels with
      configurable propensities and model noise.

      :param days: Number of days of data to generate.
      :type days: int, default 90
      :param n: Number of interaction records to generate.
      :type n: int, default 100000
      :returns: IH instance with synthetic data.
      :rtype: IH

      .. rubric:: Examples

      >>> ih = IH.from_mock_data(days=30, n=10000)
      >>> ih.data.select("pyChannel").collect().unique()

   .. py:method:: get_sequences(positive_outcome_label: str, level: str, outcome_column: str, customerid_column: str) -> tuple[list[tuple[str, Ellipsis]], list[tuple[int, Ellipsis]], list[collections.defaultdict], list[collections.defaultdict]]

      Extract customer action sequences for PMI analysis.

      Processes customer interaction data to produce action sequences, outcome
      labels, and frequency counts needed for Pointwise Mutual Information (PMI)
      calculations.

      :param positive_outcome_label: Outcome label marking the target event (e.g., "Conversion").
      :type positive_outcome_label: str
      :param level: Column name containing the action/offer/treatment.
      :type level: str
      :param outcome_column: Column name containing the outcome label.
      :type outcome_column: str
      :param customerid_column: Column name identifying unique customers.
      :type customerid_column: str
      :returns: * **customer_sequences** (*list[tuple[str, ...]]*) -- Action sequences per customer.
                * **customer_outcomes** (*list[tuple[int, ...]]*) -- Binary outcomes (1=positive, 0=other) per sequence position.
                * **count_actions** (*list[defaultdict]*) -- Action frequency counts:

                  - [0]: First element counts in bigrams
                  - [1]: Second element counts in bigrams
                * **count_sequences** (*list[defaultdict]*) -- Sequence frequency counts:

                  - [0]: All bigrams
                  - [1]: ≥3-grams ending with positive outcome
                  - [2]: Bigrams ending with positive outcome
                  - [3]: Unique n-grams per customer

      .. seealso::

         :obj:`calculate_pmi`
             Compute PMI scores from sequence counts.
         :obj:`pmi_overview`
             Generate PMI analysis summary.

   .. py:method:: calculate_pmi(count_actions: list[collections.defaultdict], count_sequences: list[collections.defaultdict]) -> dict[tuple[str, Ellipsis], float | dict[str, float | dict]]
      :staticmethod:

      Compute PMI scores for action sequences.

      Calculates Pointwise Mutual Information scores for bigrams and higher-order
      n-grams. Higher values indicate more informative or surprising action
      sequences.

      :param count_actions: Action frequency counts from :meth:`get_sequences`.
      :type count_actions: list[defaultdict]
      :param count_sequences: Sequence frequency counts from :meth:`get_sequences`.
      :type count_sequences: list[defaultdict]
      :returns: PMI scores for sequences:

                - Bigrams: Direct PMI value (float)
                - N-grams (n≥3): dict with 'average_pmi' and 'links' (constituent bigram PMIs)
      :rtype: dict[tuple[str, ...], Union[float, dict]]

      .. seealso::

         :obj:`get_sequences`
             Extract sequences for PMI analysis.
         :obj:`pmi_overview`
             Generate PMI analysis summary.

      .. rubric:: Notes

      Bigram PMI is calculated as:

      .. math:: PMI(a, b) = \log_2 \frac{P(a, b)}{P(a) \cdot P(b)}

      N-gram PMI is the average of constituent bigram PMIs.

   .. py:method:: pmi_overview(ngrams_pmi: dict[tuple[str, Ellipsis], float | dict], count_sequences: list[collections.defaultdict], customer_sequences: list[tuple[str, Ellipsis]], customer_outcomes: list[tuple[int, Ellipsis]]) -> polars.DataFrame
      :staticmethod:

      Generate PMI analysis summary DataFrame.

      Creates a summary of action sequences ranked by their significance in
      predicting positive outcomes.

      :param ngrams_pmi: PMI scores from :meth:`calculate_pmi`.
      :type ngrams_pmi: dict[tuple[str, ...], Union[float, dict]]
      :param count_sequences: Sequence frequency counts from :meth:`get_sequences`.
      :type count_sequences: list[defaultdict]
      :param customer_sequences: Customer action sequences from :meth:`get_sequences`.
      :type customer_sequences: list[tuple[str, ...]]
      :param customer_outcomes: Customer outcome sequences from :meth:`get_sequences`.
      :type customer_outcomes: list[tuple[int, ...]]
      :returns: Summary DataFrame with columns:

                - **Sequence**: Action sequence tuple
                - **Length**: Number of actions in sequence
                - **Avg PMI**: Average PMI value
                - **Frequency**: Total occurrence count
                - **Unique freq**: Unique customer count
                - **Score**: PMI × log(Frequency), sorted descending
      :rtype: pl.DataFrame

      .. seealso::

         :obj:`get_sequences`
             Extract sequences for analysis.
         :obj:`calculate_pmi`
             Compute PMI scores.

      .. rubric:: Examples

      >>> seqs, outs, actions, counts = ih.get_sequences(
      ...     "Conversion", "pyName", "pyOutcome", "pxInteractionID"
      ... )
      >>> pmi = IH.calculate_pmi(actions, counts)
      >>> IH.pmi_overview(pmi, counts, seqs, outs)

.. py:class:: ImpactAnalyzer(raw_data: polars.LazyFrame)

   Analyze and visualize Impact Analyzer experiment results from Pega CDH.

   The ImpactAnalyzer class provides analysis and visualization capabilities for
   NBA (Next-Best-Action) Impact Analyzer experiments. It processes experiment data
   from Pega's Customer Decision Hub to compare the effectiveness of different NBA
   strategies including adaptive models, propensity prioritization, lever usage,
   and engagement policies.

   Data can be loaded from three sources:

   - **PDC exports** via :meth:`from_pdc`: Uses pre-aggregated experiment data from
     PDC JSON exports. Value Lift is copied from PDC data as it cannot be
     re-calculated from the available numbers.
   - **VBD exports** via :meth:`from_vbd`: Reconstructs experiment metrics from raw
     VBD Actuals or Scenario Planner Actuals data. Allows flexible time ranges and
     data selection. Value Lift is calculated from ValuePerImpression.
   - **Interaction History** via :meth:`from_ih`: Loads experiment metrics from
     Interaction History data. Not yet implemented.

   .. math:: \text{Engagement Lift} = \frac{\text{SuccessRate}_{test} - \text{SuccessRate}_{control}}{\text{SuccessRate}_{control}}

   .. math:: \text{Value Lift} = \frac{\text{ValueCapture}_{test} - \text{ValueCapture}_{control}}{\text{ValueCapture}_{control}}

   .. attribute:: ia_data

      The underlying experiment data containing control group metrics.

      :type: pl.LazyFrame

   .. attribute:: plot

      Plot accessor for visualization methods.

      :type: Plots

   .. seealso::

      :obj:`pdstools.adm.ADMDatamart`
          For ADM model analysis.
      :obj:`pdstools.ih.IH`
          For Interaction History analysis.

   .. rubric:: Examples

   >>> from pdstools import ImpactAnalyzer
   >>> ia = ImpactAnalyzer.from_pdc("impact_analyzer_export.json")
   >>> ia.overall_summary().collect()
   >>> ia.plot.overview()

   .. py:attribute:: ia_data
      :type: polars.LazyFrame

   .. py:attribute:: default_ia_experiments

      Default experiments mapping experiment names to (control, test) group tuples.

   .. py:attribute:: outcome_labels

      Mapping of metric names to outcome labels used for aggregation.

   .. py:attribute:: default_ia_controlgroups

   .. py:attribute:: plot

   .. py:method:: from_pdc(pdc_source: str | pathlib.Path | os.PathLike | list[str] | list[pathlib.Path] | list[os.PathLike], *, reader: collections.abc.Callable | None = None, query: pdstools.utils.types.QUERY | None = None, return_wide_df: Literal[True], return_df: bool = ...) -> polars.LazyFrame
                  from_pdc(pdc_source: str | pathlib.Path | os.PathLike | list[str] | list[pathlib.Path] | list[os.PathLike], *, reader: collections.abc.Callable | None = None, query: pdstools.utils.types.QUERY | None = None, return_wide_df: Literal[False] = ..., return_df: Literal[True]) -> polars.LazyFrame
                  from_pdc(pdc_source: str | pathlib.Path | os.PathLike | list[str] | list[pathlib.Path] | list[os.PathLike], *, reader: collections.abc.Callable | None = None, query: pdstools.utils.types.QUERY | None = None) -> ImpactAnalyzer
      :classmethod:

      Create an ImpactAnalyzer instance from PDC JSON export(s).

      Loads pre-aggregated experiment data from Pega Decision Central JSON exports.
      Value Lift metrics are copied directly from the PDC data.
      :param pdc_source: Path to PDC JSON file, or a list of paths to concatenate.
      :type pdc_source: Union[Path, str, os.PathLike, list[Union[Path, str, os.PathLike]]]
      :param reader: Custom function to read source data into a dict. If None, uses
          the standard JSON file reader. Default is None.
      :type reader: Optional[Callable], optional
      :param query: Polars expression to filter the data. Default is None.
      :type query: Optional[QUERY], optional
      :param return_wide_df: If True, return the raw wide-format data as a LazyFrame
          for debugging. Default is False.
      :type return_wide_df: bool, optional
      :param return_df: If True, return the processed data as a LazyFrame instead of
          an ImpactAnalyzer instance. Default is False.
      :type return_df: bool, optional
      :returns: ImpactAnalyzer instance, or LazyFrame if return_df or return_wide_df is True.
      :rtype: ImpactAnalyzer or pl.LazyFrame
      :raises ValueError: If an empty list of source files is provided.

      .. rubric:: Examples

      >>> ia = ImpactAnalyzer.from_pdc("CDH_Metrics_ImpactAnalyzer.json")
      >>> ia.overall_summary().collect()

   .. py:method:: from_vbd(vbd_source: os.PathLike | str, *, return_df: Literal[True]) -> polars.LazyFrame | None
                  from_vbd(vbd_source: os.PathLike | str) -> Optional[ImpactAnalyzer]
      :classmethod:

      Create an ImpactAnalyzer instance from VBD data.

      Processes VBD Actuals or Scenario Planner Actuals data to reconstruct Impact
      Analyzer experiment metrics. Provides more flexible time ranges and data
      selection compared to PDC exports. Value Lift is calculated from
      ValuePerImpression since raw value data is available in VBD exports.

      :param vbd_source: Path to VBD export file (parquet, csv, ndjson, or zip).
      :type vbd_source: Union[os.PathLike, str]
      :param return_df: If True, return processed data as LazyFrame instead of an
          ImpactAnalyzer instance. Default is False.
      :type return_df: bool, optional
      :returns: ImpactAnalyzer instance, LazyFrame if return_df is True, or None if
                the source contains no data.
      :rtype: ImpactAnalyzer or pl.LazyFrame or None

      .. rubric:: Examples

      >>> ia = ImpactAnalyzer.from_vbd("ScenarioPlannerActuals.zip")
      >>> ia.summary_by_channel().collect()

   .. py:method:: from_ih(ih_source: os.PathLike | str, *, return_df: Literal[True]) -> polars.LazyFrame | None
                  from_ih(ih_source: os.PathLike | str) -> Optional[ImpactAnalyzer]
      :classmethod:

      Create an ImpactAnalyzer instance from Interaction History data.

      .. note::

         This method is not yet implemented.

      Reconstructs experiment metrics from Interaction History data, allowing
      analysis of experiments using detailed interaction-level records.

      :param ih_source: Path to Interaction History export file.
      :type ih_source: Union[os.PathLike, str]
      :param return_df: If True, return processed data as LazyFrame instead of an
          ImpactAnalyzer instance. Default is False.
      :type return_df: bool, optional
      :returns: ImpactAnalyzer instance, LazyFrame if return_df is True, or None if
                the source contains no data.
      :rtype: ImpactAnalyzer or pl.LazyFrame or None
      :raises NotImplementedError: This method is not yet implemented.

   .. py:method:: _normalize_pdc_ia_data(json_data: dict, *, query: pdstools.utils.types.QUERY | None = None, return_wide_df: bool = False) -> polars.LazyFrame
      :classmethod:

      Transform PDC Impact Analyzer JSON into normalized long format.

      Converts the hierarchical PDC JSON structure (organized by experiments) into a
      flat structure organized by control groups with impression and accept counts.

      :param json_data: Parsed JSON data from PDC export.
      :type json_data: dict
      :param query: Polars expression to filter the data. Default is None.
      :type query: Optional[QUERY], optional
      :param return_wide_df: If True, return intermediate wide-format data. Default is False.
      :type return_wide_df: bool, optional
      :returns: Normalized data with columns: SnapshotTime, Channel, ControlGroup,
                Impressions, Accepts, ValuePerImpression, Pega_ValueLift.
      :rtype: pl.LazyFrame

   .. py:method:: summary_by_channel() -> polars.LazyFrame

      Get experiment summary pivoted by channel.

      Returns experiment lift metrics (CTR_Lift and Value_Lift) for each experiment,
      with one row per channel.

      :returns: Wide-format summary with columns:

                - **Channel**: Channel name
                - **CTR_Lift <experiment>**: Engagement lift for each experiment
                - **Value_Lift <experiment>**: Value lift for each experiment
      :rtype: pl.LazyFrame

      .. seealso::

         :obj:`overall_summary`
             Summary without channel breakdown.
         :obj:`summarize_experiments`
             Long-format experiment summary.

      .. rubric:: Examples

      >>> ia.summary_by_channel().collect()

   .. py:method:: overall_summary() -> polars.LazyFrame

      Get overall experiment summary aggregated across all channels.

      Returns experiment lift metrics (CTR_Lift and Value_Lift) for each experiment,
      aggregated across all data.

      :returns: Single-row wide-format summary with columns:

                - **CTR_Lift <experiment>**: Engagement lift for each experiment
                - **Value_Lift <experiment>**: Value lift for each experiment
      :rtype: pl.LazyFrame

      .. seealso::

         :obj:`summary_by_channel`
             Summary with channel breakdown.
         :obj:`summarize_experiments`
             Long-format experiment summary.

      .. rubric:: Examples

      >>> ia.overall_summary().collect()

   .. py:method:: summarize_control_groups(by: collections.abc.Sequence[str | polars.Expr] | str | polars.Expr | None = None, drop_internal_cols: bool = True) -> polars.LazyFrame

      Aggregate metrics by control group.

      Summarizes impressions, accepts, CTR, and value metrics for each control
      group, optionally grouped by additional dimensions.

      :param by: Column name(s) or expression(s) to group by in addition to
          ControlGroup. Default is None (aggregate all data).
      :type by: Optional[Union[list[str], list[pl.Expr], str, pl.Expr]], optional
      :param drop_internal_cols: If True, drop internal columns prefixed with 'Pega_'.
          Default is True.
      :type drop_internal_cols: bool, optional
      :returns: Aggregated metrics with columns: ControlGroup, Impressions, Accepts,
                CTR, ValuePerImpression, plus any grouping columns.
:rtype: pl.LazyFrame .. rubric:: Examples >>> ia.summarize_control_groups().collect() >>> ia.summarize_control_groups(by="Channel").collect() .. py:method:: summarize_experiments(by: collections.abc.Sequence[str | polars.Expr] | str | polars.Expr | None = None) -> polars.LazyFrame Summarize experiment metrics comparing test vs control groups. Computes lift metrics for each defined experiment by comparing test and control group performance. .. note:: Returns all default experiments regardless of whether they are active in the data. Experiments without data will have null values for all metrics (Impressions, Accepts, CTR_Lift, Value_Lift, etc.). :param by: Column name(s) or expression(s) to group by. Default is None (aggregate all data). :type by: Optional[Union[list[str], list[pl.Expr], str, pl.Expr]], optional :returns: Experiment summary with columns: - **Experiment**: Experiment name - **Test**, **Control**: Control group names for the experiment - **Impressions_Test**, **Impressions_Control**: Impression counts (null if not active) - **Accepts_Test**, **Accepts_Control**: Accept counts (null if not active) - **CTR_Test**, **CTR_Control**: Click-through rates (null if not active) - **Control_Fraction**: Fraction of impressions in control group - **CTR_Lift**: Engagement lift (null if experiment not active) - **Value_Lift**: Value lift (null if experiment not active) :rtype: pl.LazyFrame .. seealso:: :obj:`summarize_control_groups` Lower-level control group aggregation. :obj:`overall_summary` Pivoted overall summary. :obj:`summary_by_channel` Pivoted summary by channel. .. rubric:: Examples >>> ia.summarize_experiments().collect() >>> ia.summarize_experiments(by="Channel").collect() .. py:function:: read_ds_export(filename: str | os.PathLike | io.BytesIO, path: str | os.PathLike = '.', verbose: bool = False, **reading_opts) -> polars.LazyFrame | None Read Pega dataset exports with additional capabilities. 
This function extends read_data() with: - Smart file finding: accepts 'modelData' or 'predictorData' and searches for matching files (ADM-specific) - URL downloads: fetches remote files when local paths are not found (useful for demos and examples) - Schema overrides: applies Pega-specific type corrections (e.g., PYMODELID as string) For simple file reading without these features, use read_data() instead. :param filename: File identifier. Can be: - Full file path - Generic name like 'modelData' or 'predictorData' (triggers smart search) - BytesIO object (delegates to read_data) :type filename: str, os.PathLike, or BytesIO :param path: Directory to search for files (ignored for BytesIO or full paths) :type path: str or os.PathLike, default='.' :param verbose: Print file selection details :type verbose: bool, default=False :param \*\*reading_opts: Additional Polars scan_* options. Common options include: - infer_schema_length (int, default=10000): Rows to scan for schema inference - separator (str): CSV delimiter - ignore_errors (bool): Continue on parse errors :returns: Lazy dataframe, or None if file not found :rtype: pl.LazyFrame or None .. rubric:: Examples Smart file finding: >>> df = read_ds_export('modelData', path='data/ADMData') Specific file: >>> df = read_ds_export('ModelSnapshot_20210101.json', path='data') URL download: >>> df = read_ds_export('ModelSnapshot.zip', path='https://example.com/exports') Schema control: >>> df = read_ds_export('export.csv', infer_schema_length=200000) .. py:class:: Prediction(df: polars.LazyFrame, *, query: pdstools.utils.types.QUERY | None = None) Monitor and analyze Pega Prediction Studio Predictions. To initialize this class, either 1. Initialize directly with the df polars LazyFrame 2. Use one of the class methods This class will read in the data from different sources, properly structure them for further analysis, and apply correct typing and useful renaming. 
There is also a "namespace" that you can call from this class: - `.plot` contains ready-made plots to analyze the prediction data with :param df: The Polars LazyFrame representation of the prediction data. :type df: pl.LazyFrame :param query: An optional query to apply to the input data. For details, see :meth:`pdstools.utils.cdh_utils._apply_query`. :type query: QUERY, optional .. rubric:: Examples >>> pred = Prediction.from_ds_export('/my_export_folder/predictions.zip') >>> pred = Prediction.from_mock_data(days=70) >>> from pdstools import Prediction >>> import polars as pl >>> pred = Prediction( df = pl.scan_parquet('predictions.parquet'), query = {"Class":["DATA-DECISION-REQUEST-CUSTOMER-CDH"]} ) .. seealso:: :obj:`pdstools.prediction.PredictionPlots` The out of the box plots on the Prediction data :obj:`pdstools.utils.cdh_utils._apply_query` How to query the Prediction class and methods .. py:attribute:: predictions :type: polars.LazyFrame .. py:attribute:: plot :type: PredictionPlots .. py:attribute:: prediction_validity_expr .. py:method:: from_ds_export(predictions_filename: os.PathLike | str, base_path: os.PathLike | str = '.', *, query: pdstools.utils.types.QUERY | None = None, infer_schema_length: int = 10000) :classmethod: Import from a Pega Dataset Export of the PR_DATA_DM_SNAPSHOTS table. :param predictions_filename: The full path or name (if base_path is given) to the prediction snapshot files :type predictions_filename: Union[os.PathLike, str] :param base_path: A base path to provide if predictions_filename is not given as a full path, by default "." :type base_path: Union[os.PathLike, str], optional :param query: An optional argument to filter out selected data, by default None :type query: Optional[QUERY], optional :param infer_schema_length: Number of rows to scan when inferring the schema for CSV/JSON files. For large production datasets, increase this value (e.g., 200000) if columns are not being detected correctly. 
Higher values use more memory but provide more accurate schema detection. By default 10000 :type infer_schema_length: int, optional :returns: The properly initialized Prediction class :rtype: Prediction .. rubric:: Examples >>> from pdstools import Prediction >>> pred = Prediction.from_ds_export('predictions.zip', '/my_export_folder') >>> # For large datasets with schema detection issues: >>> pred = Prediction.from_ds_export( 'predictions.zip', '/my_export_folder', infer_schema_length=200000 ) .. note:: By default, the dataset export in Infinity returns a zip file per table. You do not need to open up this zip file! You can simply point to the zip, and this method will be able to read in the underlying data. .. seealso:: :obj:`pdstools.pega_io.File.read_ds_export` More information on file compatibility :obj:`pdstools.utils.cdh_utils._apply_query` How to query the Prediction class and methods .. py:method:: from_s3() :classmethod: Not implemented yet. Please let us know if you would like this functionality! :returns: The properly initialized Prediction class :rtype: Prediction .. py:method:: from_dataflow_export() :classmethod: Import from a data flow, such as the Prediction Studio export. Not implemented yet. Please let us know if you would like this functionality! :returns: The properly initialized Prediction class :rtype: Prediction .. py:method:: from_pdc(df: polars.LazyFrame, *, return_df=False, query: pdstools.utils.types.QUERY | None = None) :classmethod: Import from (Pega-internal) PDC data, which is a combination of the PR_DATA_DM_SNAPSHOTS and PR_DATA_DM_ADMMART_MDL_FACT tables. 
:param df: The Polars LazyFrame containing the PDC data :type df: pl.LazyFrame :param return_df: If True, returns the processed DataFrame instead of initializing the class, by default False :type return_df: bool, optional :param query: An optional query to apply to the input data, by default None :type query: Optional[QUERY], optional :returns: Either the initialized Prediction class or the processed DataFrame if return_df is True :rtype: Union[Prediction, pl.LazyFrame] .. seealso:: :obj:`pdstools.utils.cdh_utils._read_pdc` More information on PDC data processing :obj:`pdstools.utils.cdh_utils._apply_query` How to query the Prediction class and methods .. py:method:: save_data(path: os.PathLike | str = '.') -> os.PathLike | None Cache predictions to a file. :param path: Where to place the file :type path: Union[os.PathLike, str] :returns: The path to the cached prediction data file, or None if no data available :rtype: Optional[os.PathLike] .. py:method:: from_processed_data(df: polars.LazyFrame) :classmethod: Load a Prediction from already-processed data (e.g., from cache). This bypasses the normal data transformation pipeline and directly assigns the data to self.predictions. Use this when loading data that has already been processed by the Prediction class constructor, such as data saved via save_data(). :param df: A LazyFrame containing already-processed prediction data with columns like 'Positives', 'CTR', 'Performance', etc. rather than the raw 'pyPositives', 'pyModelType', etc. :type df: pl.LazyFrame :returns: A Prediction instance with the processed data loaded :rtype: Prediction .. rubric:: Examples >>> # Load from a cached file >>> cached_data = pl.scan_parquet('cached_predictions.parquet') >>> pred = Prediction.from_processed_data(cached_data) .. py:method:: from_mock_data(days=70) :classmethod: Create a Prediction instance with mock data for testing and demonstration purposes. 
:param days: Number of days of mock data to generate, by default 70 :type days: int, optional :returns: The initialized Prediction class with mock data :rtype: Prediction .. rubric:: Examples >>> from pdstools import Prediction >>> pred = Prediction.from_mock_data(days=30) >>> pred.plot.performance_trend() .. py:property:: is_available :type: bool Check if prediction data is available. :returns: True if prediction data is available, False otherwise :rtype: bool .. py:property:: is_valid :type: bool Check if prediction data is valid. A valid prediction meets the criteria defined in prediction_validity_expr, which requires positive and negative responses in both test and control groups. :returns: True if prediction data is valid, False otherwise :rtype: bool .. py:method:: summary_by_channel(custom_predictions: list[list] | None = None, *, start_date: datetime.datetime | None = None, end_date: datetime.datetime | None = None, window: int | datetime.timedelta | None = None, every: str | None = None, debug: bool = False) -> polars.LazyFrame Summarize prediction per channel :param custom_predictions: Optional list with custom prediction name to channel mappings. Each item should be [PredictionName, Channel, Direction, isMultiChannel]. Defaults to None. :type custom_predictions: Optional[list[list]], optional :param start_date: Start date of the summary period. If None (default) uses the end date minus the window, or if both absent, the earliest date in the data :type start_date: datetime.datetime, optional :param end_date: End date of the summary period. If None (default) uses the start date plus the window, or if both absent, the latest date in the data :type end_date: datetime.datetime, optional :param window: Number of days to use for the summary period or an explicit timedelta. If None (default) uses the whole period. Can't be given if start and end date are also given. 
:type window: int or datetime.timedelta, optional :param every: Optional additional grouping by time period. Format string as in polars.Expr.dt.truncate (https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.truncate.html), for example "1mo", "1w", "1d" for calendar month, week, or day. Defaults to None. :type every: str, optional :param debug: If True, include the Period column in output when `every` is specified. If False, the Period column is dropped from the results. This parameter affects the return value structure, not logging output. For debug logging, use logging.basicConfig(level=logging.DEBUG). :type debug: bool, default False :returns: Summary across all Predictions as a dataframe with the following fields: Time and Configuration Fields: - DateRange Min - The minimum date in the summary time range - DateRange Max - The maximum date in the summary time range - Duration - The duration in seconds between the minimum and maximum snapshot times - Prediction: The prediction name - Channel: The channel name - Direction: The direction (e.g., Inbound, Outbound) - ChannelDirectionGroup: Combined Channel/Direction identifier - isValid: Boolean indicating if the prediction data is valid - usesNBAD: Boolean indicating if this is a standard NBAD prediction - isMultiChannel: Boolean indicating if this is a multichannel prediction - ControlPercentage: Percentage of responses in control group - TestPercentage: Percentage of responses in test group Performance Metrics: - Performance: Weighted model performance (AUC) in range 0.5-1.0 - Positives: Sum of positive responses - Negatives: Sum of negative responses - Responses: Sum of all responses - Positives_Test: Sum of positive responses in test group - Positives_Control: Sum of positive responses in control group - 
Negatives_NBA: Sum of negative responses in NBA group - CTR: Clickthrough rate (Positives over Positives + Negatives) - CTR_Test: Clickthrough rate for test group (model propensities) - CTR_Control: Clickthrough rate for control group (random propensities) - CTR_NBA: Clickthrough rate for NBA group (available only when Impact Analyzer is used) - Lift: Lift in Engagement when testing prioritization with just Adaptive Models vs just Random Propensity Technology Usage Indicators: - usesImpactAnalyzer: Boolean indicating if Impact Analyzer is used :rtype: pl.LazyFrame .. py:method:: overall_summary(custom_predictions: list[list] | None = None, *, start_date: datetime.datetime | None = None, end_date: datetime.datetime | None = None, window: int | datetime.timedelta | None = None, every: str | None = None, debug: bool = False) -> polars.LazyFrame Overall prediction summary. Only valid prediction data is included. :param custom_predictions: Optional list with custom prediction name to channel mappings. Each item should be [PredictionName, Channel, Direction, isMultiChannel]. Defaults to None. :type custom_predictions: Optional[list[list]], optional :param start_date: Start date of the summary period. If None (default) uses the end date minus the window, or if both absent, the earliest date in the data :type start_date: datetime.datetime, optional :param end_date: End date of the summary period. If None (default) uses the start date plus the window, or if both absent, the latest date in the data :type end_date: datetime.datetime, optional :param window: Number of days to use for the summary period or an explicit timedelta. If None (default) uses the whole period. Can't be given if start and end date are also given. :type window: int or datetime.timedelta, optional :param every: Optional additional grouping by time period. 
Format string as in polars.Expr.dt.truncate (https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.truncate.html), for example "1mo", "1w", "1d" for calendar month, week, or day. Defaults to None. :type every: str, optional :param debug: If True, include the Period column in output when `every` is specified. If False, the Period column is dropped from the results. This parameter affects the return value structure, not logging output. For debug logging, use logging.basicConfig(level=logging.DEBUG). :type debug: bool, default False :returns: Summary across all Predictions as a dataframe with the following fields: Time and Configuration Fields: - DateRange Min - The minimum date in the summary time range - DateRange Max - The maximum date in the summary time range - Duration - The duration in seconds between the minimum and maximum snapshot times - ControlPercentage: Weighted average percentage of control group responses - TestPercentage: Weighted average percentage of test group responses - usesNBAD: Boolean indicating if any of the predictions is a standard NBAD prediction Performance Metrics: - Performance: Weighted average performance (AUC) across all valid channels in range 0.5-1.0 - Positives Inbound: Sum of positive responses across all valid inbound channels - Positives Outbound: Sum of positive responses across all valid outbound channels - Responses Inbound: Sum of all responses across all valid inbound channels - Responses Outbound: Sum of all responses across all valid outbound channels - Overall Lift: Weighted average lift across all valid channels - Minimum Negative Lift: The lowest negative lift value found Channel Statistics: - Number of Valid Channels: Count of unique valid channel/direction combinations - Channel with Minimum Negative Lift: Channel with the lowest negative lift value Technology Usage Indicators: - usesImpactAnalyzer: Boolean indicating if any channel uses Impact Analyzer :rtype: pl.LazyFrame .. 
py:function:: default_predictor_categorization(x: str | polars.Expr = pl.col('PredictorName')) -> polars.Expr Function to determine the 'category' of a predictor. It is possible to supply a custom function. This function can accept an optional column as input, and its output should be a Polars expression. The most straightforward way to implement this is with pl.when().then().otherwise(), which you can chain. By default, this function returns "Primary" whenever there is no '.' anywhere in the name string, and otherwise returns the part of the name before the first period. :param x: The column to parse :type x: Union[str, pl.Expr], default = pl.col('PredictorName') .. py:function:: cdh_sample(query: pdstools.utils.types.QUERY | None = None) -> pdstools.adm.ADMDatamart.ADMDatamart Import a sample dataset from the CDH Sample application :param query: An optional query to apply to the data, by default None :type query: Optional[QUERY], optional :returns: The ADM Datamart class populated with CDH Sample data :rtype: ADMDatamart .. py:function:: sample_value_finder(threshold: float | None = None) -> pdstools.valuefinder.ValueFinder.ValueFinder Import a sample dataset of a Value Finder simulation This simulation was run on a stock CDH Sample system. :param threshold: Optional override of the propensity threshold in the system, by default None :type threshold: Optional[float], optional :returns: The Value Finder class populated with the Value Finder simulation data :rtype: ValueFinder .. py:function:: show_versions(print_output: Literal[True] = True) -> None show_versions(print_output: Literal[False] = False) -> str Get a list of currently installed versions of pdstools and its dependencies. :param print_output: If True, print the version information to stdout. If False, return the version information as a string. Default is True. :type print_output: bool, optional :param include_dependencies: If True, include the versions of dependencies in the output. 
If False, only include the pdstools version and system information. Default is True. :type include_dependencies: bool, optional :returns: Version information as a string if print_output is False, else None. :rtype: Optional[str] .. rubric:: Examples >>> from pdstools import show_versions >>> show_versions() --- Version info --- pdstools: 4.0.0-alpha Platform: macOS-14.7-arm64-arm-64bit Python: 3.12.4 (main, Jun 6 2024, 18:26:44) [Clang 15.0.0 (clang-1500.3.9.4)] --- Dependencies --- typing_extensions: 4.12.2 polars>=1.9: 1.9.0 --- Dependency group: adm --- plotly>=5.5.0: 5.24.1 --- Dependency group: api --- pydantic: 2.9.2 httpx: 0.27.2 .. py:class:: ValueFinder(df: polars.LazyFrame, *, query: pdstools.utils.types.QUERY | None = None, n_customers: int | None = None, threshold: float | None = None) Analyze the Value Finder dataset for detailed insights .. py:attribute:: df :type: polars.LazyFrame .. py:attribute:: n_customers :type: int .. py:attribute:: nbad_stages :value: ['Eligibility', 'Applicability', 'Suitability', 'Arbitration'] .. py:attribute:: aggregates .. py:attribute:: plot .. py:method:: from_ds_export(filename: str | None = None, base_path: os.PathLike | str = '.', *, query: pdstools.utils.types.QUERY | None = None, n_customers: int | None = None, threshold: float | None = None) :classmethod: .. py:method:: from_dataflow_export(files: collections.abc.Iterable[str] | str, *, query: pdstools.utils.types.QUERY | None = None, n_customers: int | None = None, threshold: float | None = None, cache_file_prefix: str = '', extension: Literal['json'] = 'json', compression: Literal['gzip'] = 'gzip', cache_directory: os.PathLike | str = 'cache') :classmethod: .. py:method:: set_threshold(new_threshold: float | None = None) .. py:property:: threshold .. 
py:method:: save_data(path: os.PathLike | str = '.') -> pathlib.Path | None Cache the pyValueFinder dataset to a Parquet file :param path: Where to place the file :type path: Union[os.PathLike, str] :returns: The path to the cached Value Finder data file, or None if no data is available :rtype: Optional[pathlib.Path]