pdstools.utils.cdh_utils
========================

.. py:module:: pdstools.utils.cdh_utils


Attributes
----------

.. autoapisummary::

   pdstools.utils.cdh_utils.F
   pdstools.utils.cdh_utils.Figure


Functions
---------

.. autoapisummary::

   pdstools.utils.cdh_utils._apply_query
   pdstools.utils.cdh_utils._combine_queries
   pdstools.utils.cdh_utils.default_predictor_categorization
   pdstools.utils.cdh_utils._extract_keys
   pdstools.utils.cdh_utils.parse_pega_date_time_formats
   pdstools.utils.cdh_utils.safe_range_auc
   pdstools.utils.cdh_utils.auc_from_probs
   pdstools.utils.cdh_utils.auc_from_bincounts
   pdstools.utils.cdh_utils.aucpr_from_probs
   pdstools.utils.cdh_utils.aucpr_from_bincounts
   pdstools.utils.cdh_utils.auc_to_gini
   pdstools.utils.cdh_utils._capitalize
   pdstools.utils.cdh_utils._polars_capitalize
   pdstools.utils.cdh_utils.from_prpc_date_time
   pdstools.utils.cdh_utils.to_prpc_date_time
   pdstools.utils.cdh_utils.weighted_average_polars
   pdstools.utils.cdh_utils.weighted_performance_polars
   pdstools.utils.cdh_utils.overlap_lists_polars
   pdstools.utils.cdh_utils.z_ratio
   pdstools.utils.cdh_utils.lift
   pdstools.utils.cdh_utils.log_odds
   pdstools.utils.cdh_utils.feature_importance
   pdstools.utils.cdh_utils._apply_schema_types
   pdstools.utils.cdh_utils.gains_table
   pdstools.utils.cdh_utils.lazy_sample
   pdstools.utils.cdh_utils.legend_color_order
   pdstools.utils.cdh_utils.process_files_to_bytes
   pdstools.utils.cdh_utils.get_latest_pdstools_version
   pdstools.utils.cdh_utils.setup_logger
   pdstools.utils.cdh_utils.create_working_and_temp_dir


Module Contents
---------------

.. py:data:: F

.. py:data:: Figure

.. py:function:: _apply_query(df: F, query: Optional[pdstools.utils.types.QUERY] = None) -> F

.. py:function:: _combine_queries(existing_query: pdstools.utils.types.QUERY, new_query: polars.Expr) -> pdstools.utils.types.QUERY

.. py:function:: default_predictor_categorization(x: Union[str, polars.Expr] = pl.col('PredictorName')) -> polars.Expr

   Function to determine the 'category' of a predictor.

   It is possible to supply a custom function. Such a function can accept an
   optional column as input and should return a Polars expression. The most
   straightforward way to implement this is with
   ``pl.when().then().otherwise()``, which you can chain; see the sketch below.

   By default, this function returns "Primary" whenever there is no '.'
   anywhere in the name string, and otherwise returns the first string before
   the first period.

   :param x: The column to parse
   :type x: Union[str, pl.Expr], default = pl.col('PredictorName')
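A minimal sketch of such a custom categorization function, chaining
``pl.when().then().otherwise()``. The category names and prefixes used here
("Customer", "Account") are hypothetical:

.. code-block:: python

   import polars as pl

   def my_predictor_categorization(x: pl.Expr = pl.col("PredictorName")) -> pl.Expr:
       # Map predictors to a category based on their (hypothetical) name prefix
       return (
           pl.when(x.str.starts_with("Customer."))
           .then(pl.lit("Customer"))
           .when(x.str.starts_with("Account."))
           .then(pl.lit("Account"))
           .otherwise(pl.lit("Primary"))
       )

   df = pl.DataFrame({"PredictorName": ["Customer.Age", "Account.Balance", "Context"]})
   df.with_columns(my_predictor_categorization().alias("PredictorCategory"))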
.. py:function:: _extract_keys(df: F, key: str = 'Name', capitalize: bool = True) -> F

   Extracts keys out of the pyName column.

   This is not a lazy operation, as we don't know the possible keys in
   advance. For that reason, we select only the key column, extract the keys
   from that, and then collect the resulting dataframe. This dataframe is then
   joined back to the original dataframe. This is relatively efficient, but we
   still need the whole pyName column in memory to do this, so it won't work
   completely lazily from e.g. s3. That's why it only works in eager mode.

   The data in the column from which the JSON is extracted is normalized a
   little by taking out non-space, non-printable characters (not just ASCII).
   This may be relatively expensive.

   JSON extraction only happens on the unique values, which saves a lot of
   time with multiple snapshots of the same models. It also only processes
   rows for which the key column appears to be valid JSON; it will break when
   you "trick" it with malformed JSON.

   Column values for columns that are also encoded in the key column will be
   overwritten with values from the key column, but only for rows that are
   JSON. In previous versions all values were overwritten, resulting in many
   nulls.

   :param df: The dataframe to extract the keys from
   :type df: Union[pl.DataFrame, pl.LazyFrame]
   :param key: The column with embedded JSON
   :type key: str
   :param capitalize: If True (default), normalizes the names of the embedded
                      columns, otherwise keeps the names as-is.
   :type capitalize: bool

.. py:function:: parse_pega_date_time_formats(timestamp_col='SnapshotTime', timestamp_fmt: Optional[str] = None)

   Parses Pega DateTime formats.

   Supports commonly used formats:

   - "%Y-%m-%d %H:%M:%S"
   - "%Y%m%dT%H%M%S.%f %Z"
   - "%d-%b-%y"

   Removes timezones and rounds to seconds, with a 'ns' time unit.

   In the implementation, the third expression uses timestamp_fmt or "%Y".
   This is a bit of a hack: if we pass None, it tries to infer the format
   automatically, and inferring raises an error when it can't find an
   appropriate format.

   :param timestamp_col: The column to parse
   :type timestamp_col: str, default = 'SnapshotTime'
   :param timestamp_fmt: An optional format to use rather than the default formats
   :type timestamp_fmt: str, default = None

.. py:function:: safe_range_auc(auc: float) -> float

   Internal helper to keep the AUC a safe number, always between 0.5 and 1.0.

   :param auc: The AUC (Area Under the Curve) score
   :type auc: float
   :returns: 'Safe' AUC score, between 0.5 and 1.0
   :rtype: float

.. py:function:: auc_from_probs(groundtruth: List[int], probs: List[float]) -> List[float]

   Calculates AUC from an array of truth values and predictions.

   Calculates the area under the ROC curve from an array of truth values and
   predictions, making sure to always return a value between 0.5 and 1.0. It
   returns 0.5 when there is just one groundtruth label.

   :param groundtruth: The 'true' values. Positive values must be represented
                       as True or 1; negative values as False or 0.
   :type groundtruth: List[int]
   :param probs: The predictions, as a numeric vector of the same length as groundtruth
   :type probs: List[float]
   :returns: The AUC as a value between 0.5 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> auc_from_probs([1, 1, 0], [0.6, 0.2, 0.2])

.. py:function:: auc_from_bincounts(pos: List[int], neg: List[int], probs: List[float] = None) -> float

   Calculates AUC from counts of positives and negatives directly.

   This is an efficient calculation of the area under the ROC curve directly
   from an array of positives and negatives. It makes sure to always return a
   value between 0.5 and 1.0 and will return 0.5 when there is just one
   groundtruth label.

   :param pos: Vector with counts of the positive responses
   :type pos: List[int]
   :param neg: Vector with counts of the negative responses
   :type neg: List[int]
   :param probs: Optional list with probabilities, used to set the order of
                 the bins. If missing, defaults to pos/(pos+neg).
   :type probs: List[float]
   :returns: The AUC as a value between 0.5 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> auc_from_bincounts([3, 1, 0], [2, 0, 1])
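A minimal usage sketch of the bin-count AUC helper, using the counts from the
example above and converting the result to a Gini coefficient with
``auc_to_gini`` (documented further below):

.. code-block:: python

   from pdstools.utils import cdh_utils

   # Counts of positive and negative responses per predictor bin
   pos = [3, 1, 0]
   neg = [2, 0, 1]

   auc = cdh_utils.auc_from_bincounts(pos, neg)  # between 0.5 and 1.0
   gini = cdh_utils.auc_to_gini(auc)             # between 0 and 1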
.. py:function:: aucpr_from_probs(groundtruth: List[int], probs: List[float]) -> List[float]

   Calculates PR AUC (precision-recall) from an array of truth values and predictions.

   Calculates the area under the PR curve from an array of truth values and
   predictions. Returns 0.0 when there is just one groundtruth label.

   :param groundtruth: The 'true' values. Positive values must be represented
                       as True or 1; negative values as False or 0.
   :type groundtruth: List[int]
   :param probs: The predictions, as a numeric vector of the same length as groundtruth
   :type probs: List[float]
   :returns: The PR AUC as a value between 0.0 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> aucpr_from_probs([1, 1, 0], [0.6, 0.2, 0.2])

.. py:function:: aucpr_from_bincounts(pos: List[int], neg: List[int], probs: List[float] = None) -> float

   Calculates PR AUC (precision-recall) from counts of positives and negatives directly.

   This is an efficient calculation of the area under the PR curve directly
   from an array of positives and negatives. Returns 0.0 when there is just
   one groundtruth label.

   :param pos: Vector with counts of the positive responses
   :type pos: List[int]
   :param neg: Vector with counts of the negative responses
   :type neg: List[int]
   :param probs: Optional list with probabilities, used to set the order of
                 the bins. If missing, defaults to pos/(pos+neg).
   :type probs: List[float]
   :returns: The PR AUC as a value between 0.0 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> aucpr_from_bincounts([3, 1, 0], [2, 0, 1])

.. py:function:: auc_to_gini(auc: float) -> float

   Convert the AUC performance metric to GINI.

   :param auc: The AUC (number between 0.5 and 1)
   :type auc: float
   :returns: GINI metric, a number between 0 and 1
   :rtype: float

   .. rubric:: Examples

   >>> auc_to_gini(0.8232)

.. py:function:: _capitalize(fields: Union[str, Iterable[str]]) -> List[str]

   Applies automatic capitalization, aligned with the R counterpart.

   :param fields: A list of names
   :type fields: list
   :returns: **fields** -- The input list, but with each value properly capitalized
   :rtype: list

.. py:function:: _polars_capitalize(df: F) -> F

.. py:function:: from_prpc_date_time(x: str, return_string: bool = False) -> Union[datetime.datetime, str]

   Convert from a Pega date-time string.

   :param x: String of Pega date-time
   :type x: str
   :param return_string: If True, returns the date in string format; if False,
                         returns a datetime object.
   :type return_string: bool, default=False
   :returns: The converted date, as a datetime object or a string.
   :rtype: Union[datetime.datetime, str]

   .. rubric:: Examples

   >>> from_prpc_date_time("20180316T134127.847 GMT")
   >>> from_prpc_date_time("20180316T134127.847 GMT", True)
   >>> from_prpc_date_time("20180316T184127.846")
   >>> from_prpc_date_time("20180316T184127.846", True)

.. py:function:: to_prpc_date_time(dt: datetime.datetime) -> str

   Convert to a Pega date-time string.

   :param dt: A datetime object
   :type dt: datetime.datetime
   :returns: A string representation in the format used by Pega
   :rtype: str

   .. rubric:: Examples

   >>> to_prpc_date_time(datetime.datetime.now())

.. py:function:: weighted_average_polars(vals: Union[str, polars.Expr], weights: Union[str, polars.Expr]) -> polars.Expr

.. py:function:: weighted_performance_polars() -> polars.Expr

   Polars function to return a weighted performance.
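``weighted_average_polars`` returns a Polars expression, so it can be used
directly inside a ``group_by``/``agg``. A minimal sketch with hypothetical
column data:

.. code-block:: python

   import polars as pl
   from pdstools.utils import cdh_utils

   df = pl.DataFrame(
       {
           "Channel": ["Web", "Web", "Mobile"],
           "Performance": [0.55, 0.75, 0.65],
           "ResponseCount": [100, 300, 200],
       }
   )

   # Average performance per channel, weighted by the number of responses
   df.group_by("Channel").agg(
       cdh_utils.weighted_average_polars("Performance", "ResponseCount").alias(
           "WeightedPerformance"
       )
   )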
.. py:function:: overlap_lists_polars(col: polars.Series, row_validity: polars.Series) -> List[float]

   Calculate the overlap of each of the elements (must be a list) with all the others.

.. py:function:: z_ratio(pos_col: polars.Expr = pl.col('BinPositives'), neg_col: polars.Expr = pl.col('BinNegatives')) -> polars.Expr

   Calculates the Z-Ratio for predictor bins.

   The Z-Ratio is a measure of how the propensity in a bin differs from the
   average, but it takes into account the size of the bin and is thus
   statistically more relevant. It represents the number of standard
   deviations from the average, so it centers around 0. The wider the spread,
   the better the predictor is.

   To recreate the OOTB Z-Ratios from the datamart, use it in a group_by; see
   the examples.

   :param pos_col: The (Polars) column of the bin positives
   :type pos_col: pl.Expr
   :param neg_col: The (Polars) column of the bin negatives
   :type neg_col: pl.Expr

   .. rubric:: Examples

   >>> df.group_by(['ModelID', 'PredictorName']).agg([z_ratio()]).explode()

.. py:function:: lift(pos_col: polars.Expr = pl.col('BinPositives'), neg_col: polars.Expr = pl.col('BinNegatives')) -> polars.Expr

   Calculates the Lift for predictor bins.

   The Lift is the ratio of the propensity in a particular bin over the
   average propensity. So a value of 1 is the average, larger than 1 means
   higher propensity, and smaller means lower propensity.

   :param pos_col: The (Polars) column of the bin positives
   :type pos_col: pl.Expr
   :param neg_col: The (Polars) column of the bin negatives
   :type neg_col: pl.Expr

   .. rubric:: Examples

   >>> df.group_by(['ModelID', 'PredictorName']).agg([lift()]).explode()

.. py:function:: log_odds(positives=pl.col('Positives'), negatives=pl.col('ResponseCount') - pl.col('Positives'))

.. py:function:: feature_importance(over=['PredictorName', 'ModelID'])

.. py:function:: _apply_schema_types(df: F, definition, verbose=False, **timestamp_opts) -> F

   Converts the data types of columns in a DataFrame to the desired types.
   The desired types are defined in a `PegaDefaultTables` class.

   :param df: The DataFrame whose columns' data types need to be converted
   :type df: pl.LazyFrame
   :param definition: A `PegaDefaultTables` object that contains the desired
                      data types for the columns
   :type definition: PegaDefaultTables
   :param verbose: If True, the function will print a message when a column is
                   not in the default table schema
   :type verbose: bool
   :param timestamp_opts: Additional arguments for timestamp parsing
   :type timestamp_opts: str
   :returns: A list with Polars expressions for casting data types
   :rtype: List

.. py:function:: gains_table(df, value: str, index=None, by=None)

   Calculates cumulative gains from any data frame.

   The cumulative gains are the cumulative values expressed as a percentage
   vs. the size of the population, also expressed as a percentage.

   :param df: The (Polars) dataframe with the raw values
   :type df: pl.DataFrame
   :param value: The name of the field with the values (plotted on the y-axis)
   :type value: str
   :param index: Optional name of the field for the x-axis. If not passed in,
                 all records are used and weighted equally.
   :param by: Grouping field(s); can also be None
   :returns: A (Polars) dataframe with cum_x and cum_y columns and optionally
             the grouping column(s). Values for cum_x and cum_y are relative,
             so expressed as values between 0 and 1.
   :rtype: pl.DataFrame

   .. rubric:: Examples

   >>> gains_data = gains_table(df, 'ResponseCount', by=['Channel', 'Direction'])
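A minimal, self-contained sketch of ``gains_table`` on hypothetical data,
grouped per channel:

.. code-block:: python

   import polars as pl
   from pdstools.utils import cdh_utils

   df = pl.DataFrame(
       {
           "Channel": ["Web", "Web", "Mobile", "Mobile"],
           "ResponseCount": [100, 50, 30, 20],
       }
   )

   # Cumulative gains of responses per channel; cum_x and cum_y are in 0-1
   gains = cdh_utils.gains_table(df, "ResponseCount", by=["Channel"])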
.. py:function:: lazy_sample(df: F, n_rows: int, with_replacement: bool = True) -> F

.. py:function:: legend_color_order(fig)

   Orders legend colors alphabetically to provide Pega color consistency
   among different categories.

.. py:function:: process_files_to_bytes(file_paths: List[Union[str, pathlib.Path]], base_file_name: Union[str, pathlib.Path]) -> Tuple[bytes, str]

   Processes a list of file paths, returning file content as bytes and a
   corresponding file name. Useful for zipping multiple model reports; the
   bytes object is used for downloading files in a Streamlit app.

   This function handles three scenarios:

   1. Single file: Returns the file's content as bytes and the provided base
      file name.
   2. Multiple files: Creates a zip file containing all files, and returns
      the zip file's content as bytes and a generated zip file name.
   3. No files: Returns empty bytes and an empty string.

   :param file_paths: A list of file paths to process. Can be empty, or
                      contain a single path or multiple paths.
   :type file_paths: List[Union[str, Path]]
   :param base_file_name: The base name to use for the output file. For a
                          single file, this name is returned as-is. For
                          multiple files, it is used as part of the generated
                          zip file name.
   :type base_file_name: Union[str, Path]
   :returns: A tuple containing:

             - bytes: The content of the single file or the created zip file,
               or empty bytes if there are no files.
             - str: The file name (either base_file_name or a generated zip
               file name), or an empty string if there are no files.
   :rtype: Tuple[bytes, str]

.. py:function:: get_latest_pdstools_version()

.. py:function:: setup_logger()

   Returns a logger and log buffer at root level.

.. py:function:: create_working_and_temp_dir(name: Optional[str] = None, working_dir: Optional[os.PathLike] = None) -> Tuple[pathlib.Path, pathlib.Path]

   Creates a working directory for saving files and a temporary directory.
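A minimal usage sketch of ``create_working_and_temp_dir``. The return order
(working directory first, then the temporary directory) is an assumption
based on the docstring wording:

.. code-block:: python

   from pdstools.utils import cdh_utils

   # Assumption: returns (working_dir, temp_dir), both as pathlib.Path
   working_dir, temp_dir = cdh_utils.create_working_and_temp_dir(name="model_report")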