pdstools.utils.cdh_utils._metrics
=================================

.. py:module:: pdstools.utils.cdh_utils._metrics

.. autoapi-nested-parse::

   Performance metrics: AUC, lift, log-odds, gains, feature importance.


Functions
---------

.. autoapisummary::

   pdstools.utils.cdh_utils._metrics.safe_range_auc
   pdstools.utils.cdh_utils._metrics.auc_from_probs
   pdstools.utils.cdh_utils._metrics.auc_from_bincounts
   pdstools.utils.cdh_utils._metrics.aucpr_from_probs
   pdstools.utils.cdh_utils._metrics.aucpr_from_bincounts
   pdstools.utils.cdh_utils._metrics.auc_to_gini
   pdstools.utils.cdh_utils._metrics.z_ratio
   pdstools.utils.cdh_utils._metrics.lift
   pdstools.utils.cdh_utils._metrics.bin_log_odds
   pdstools.utils.cdh_utils._metrics.log_odds_polars
   pdstools.utils.cdh_utils._metrics.feature_importance
   pdstools.utils.cdh_utils._metrics.gains_table


Module Contents
---------------

.. py:function:: safe_range_auc(auc: float) -> float

   Internal helper that keeps the AUC a safe number between 0.5 and 1.0.

   :param auc: The AUC (Area Under the Curve) score
   :type auc: float

   :returns: 'Safe' AUC score, between 0.5 and 1.0
   :rtype: float


.. py:function:: auc_from_probs(groundtruth: list[int], probs: list[float]) -> float

   Calculates AUC from an array of truth values and predictions.

   Calculates the area under the ROC curve from an array of truth values and
   predictions, making sure to always return a value between 0.5 and 1.0.
   Returns 0.5 when there is just one groundtruth label.

   :param groundtruth: The 'true' values. Positive values must be represented
                       as True or 1; negative values as False or 0.
   :type groundtruth: list[int]
   :param probs: The predictions, as a numeric vector of the same length as groundtruth
   :type probs: list[float]

   :returns: The AUC as a value between 0.5 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> auc_from_probs([1, 1, 0], [0.6, 0.2, 0.2])


.. py:function:: auc_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) -> float

   Calculates AUC from counts of positives and negatives directly.

   This is an efficient calculation of the area under the ROC curve directly
   from arrays of positive and negative counts. It always returns a value
   between 0.5 and 1.0 and returns 0.5 when there is just one groundtruth
   label.

   :param pos: Vector with counts of the positive responses
   :type pos: list[int]
   :param neg: Vector with counts of the negative responses
   :type neg: list[int]
   :param probs: Optional list of probabilities used to set the order of the
                 bins. Defaults to pos/(pos+neg) when missing.
   :type probs: list[float]

   :returns: The AUC as a value between 0.5 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> auc_from_bincounts([3, 1, 0], [2, 0, 1])


.. py:function:: aucpr_from_probs(groundtruth: list[int], probs: list[float]) -> float

   Calculates PR AUC (precision-recall) from an array of truth values and predictions.

   Calculates the area under the PR curve from an array of truth values and
   predictions. Returns 0.0 when there is just one groundtruth label.

   :param groundtruth: The 'true' values. Positive values must be represented
                       as True or 1; negative values as False or 0.
   :type groundtruth: list[int]
   :param probs: The predictions, as a numeric vector of the same length as groundtruth
   :type probs: list[float]

   :returns: The PR AUC as a value between 0.0 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> aucpr_from_probs([1, 1, 0], [0.6, 0.2, 0.2])


.. py:function:: aucpr_from_bincounts(pos: collections.abc.Sequence[int] | polars.Series, neg: collections.abc.Sequence[int] | polars.Series, probs: collections.abc.Sequence[float] | polars.Series | None = None) -> float

   Calculates PR AUC (precision-recall) from counts of positives and negatives directly.

   This is an efficient calculation of the area under the PR curve directly
   from arrays of positive and negative counts. Returns 0.0 when there is just
   one groundtruth label.

   :param pos: Vector with counts of the positive responses
   :type pos: list[int]
   :param neg: Vector with counts of the negative responses
   :type neg: list[int]
   :param probs: Optional list of probabilities used to set the order of the
                 bins. Defaults to pos/(pos+neg) when missing.
   :type probs: list[float]

   :returns: The PR AUC as a value between 0.0 and 1.
   :rtype: float

   .. rubric:: Examples

   >>> aucpr_from_bincounts([3, 1, 0], [2, 0, 1])


.. py:function:: auc_to_gini(auc: float) -> float

   Convert the AUC performance metric to GINI.

   :param auc: The AUC (number between 0.5 and 1)
   :type auc: float

   :returns: GINI metric, a number between 0 and 1
   :rtype: float

   .. rubric:: Examples

   >>> auc_to_gini(0.8232)


.. py:function:: z_ratio(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) -> polars.Expr

   Calculates the Z-Ratio for predictor bins.

   The Z-Ratio is a measure of how the propensity in a bin differs from the
   average, but it takes the size of the bin into account and is thus
   statistically more relevant. It represents the number of standard deviations
   from the average, so it centers around 0. The wider the spread, the better
   the predictor is. To recreate the OOTB Z-Ratios from the datamart, use it in
   a ``group_by``. See examples.

   :param pos_col: The (Polars) column of the bin positives
   :type pos_col: pl.Expr
   :param neg_col: The (Polars) column of the bin negatives
   :type neg_col: pl.Expr

   .. rubric:: Examples

   >>> df.group_by(['ModelID', 'PredictorName']).agg([z_ratio()]).explode()


.. py:function:: lift(pos_col: str | polars.Expr = pl.col('BinPositives'), neg_col: str | polars.Expr = pl.col('BinNegatives')) -> polars.Expr

   Calculates the Lift for predictor bins.

   The Lift is the ratio of the propensity in a particular bin over the average
   propensity. A value of 1 is the average; larger than 1 means higher
   propensity, smaller means lower propensity.

   :param pos_col: The (Polars) column of the bin positives
   :type pos_col: pl.Expr
   :param neg_col: The (Polars) column of the bin negatives
   :type neg_col: pl.Expr

   .. rubric:: Examples

   >>> df.group_by(['ModelID', 'PredictorName']).agg([lift()]).explode()


.. py:function:: bin_log_odds(bin_pos: list[float], bin_neg: list[float]) -> list[float]


.. py:function:: log_odds_polars(positives: polars.Expr | str = pl.col('Positives'), negatives: polars.Expr | str = pl.col('ResponseCount') - pl.col('Positives')) -> polars.Expr

   Calculate log odds per bin with correct Laplace smoothing.

   Formula (per bin i in predictor p)::

       log(pos_i + 1/nBins) - log(sum(pos) + 1)
       - [log(neg_i + 1/nBins) - log(sum(neg) + 1)]

   Laplace smoothing uses 1/nBins, where nBins is the number of bins for that
   specific predictor. This matches the platform implementation in
   GroupedPredictor.java.

   Must be used with ``.over()`` to calculate nBins per predictor group::

       .with_columns(log_odds_polars().over("PredictorName", "ModelID"))

   :param positives: Column with positive response counts per bin
   :type positives: pl.Expr or str
   :param negatives: Column with negative response counts per bin
   :type negatives: pl.Expr or str

   :returns: Log odds expression (use with ``.over()`` for correct grouping)
   :rtype: pl.Expr

   .. seealso::

      :py:obj:`feature_importance`
          Calculate predictor importance from log odds
      :py:obj:`bin_log_odds`
          Pure Python version (reference implementation)

   .. rubric:: References

   - ADM Explained: Log Odds calculation section
   - Issue #263: https://github.com/pegasystems/pega-datascientist-tools/issues/263
   - Platform: GroupedPredictor.java lines 603-606

   .. rubric:: Examples

   >>> # For propensity calculation in the classifier
   >>> df.with_columns(
   ...     log_odds_polars(
   ...         pl.col("BinPositives"),
   ...         pl.col("BinNegatives")
   ...     ).over("PredictorName", "ModelID")
   ... )


.. py:function:: feature_importance(over: list[str] | None = None, scaled: bool = True) -> polars.Expr

   Calculate feature importance for Naive Bayes predictors.

   Feature importance is the weighted average of the absolute log odds values
   across all bins, weighted by bin response counts. It measures how strongly
   the predictor differentiates between positive and negative outcomes.

   Algorithm (matches platform GroupedPredictor.calculatePredictorImportance()):

   1. Calculate log odds per bin with Laplace smoothing (1/nBins)
   2. Take the absolute value of each bin's log odds
   3. Calculate the weighted average: Sum(|logOdds(bin)| × binResponses) / totalResponses
   4. Optionally scale to the 0-100 range (``scaled=True``, the default)

   This matches the Pega platform implementation in
   adaptive-learning-core-lib/.../GroupedPredictor.java lines 371-382.

   Formula::

       Feature Importance = Σ |logOdds(bin)| × (binResponses / totalResponses)

   :param over: Grouping columns. Defaults to ``["PredictorName", "ModelID"]``.
   :type over: list[str], optional
   :param scaled: If True, scale importance to 0-100 where the strongest predictor = 100
   :type scaled: bool, default True

   :returns: Feature importance expression
   :rtype: pl.Expr

   .. rubric:: Examples

   >>> df.with_columns(
   ...     feature_importance().over("PredictorName", "ModelID")
   ... )

   .. rubric:: Notes

   This implementation matches the platform calculation exactly. Issue #263
   incorrectly suggested "diff from mean" based on the R implementation, but
   the platform actually uses the weighted average of absolute log odds.

   .. seealso::

      :py:obj:`log_odds_polars`
          Calculate per-bin log odds

   .. rubric:: References

   - Issue #263: Calculation of Feature Importance incorrect
   - Issue #404: Add feature importance explanation to ADM Explained
   - Platform: GroupedPredictor.java calculatePredictorImportance()
   - ADM Explained: Feature Importance section


.. py:function:: gains_table(df, value: str, index=None, by=None)

   Calculates cumulative gains from any data frame.

   The cumulative gains are the cumulative values, expressed as a percentage of
   their total, plotted against the size of the population, also expressed as a
   percentage.

   :param df: The (Polars) dataframe with the raw values
   :type df: pl.DataFrame
   :param value: The name of the field with the values (plotted on the y-axis)
   :type value: str
   :param index: Optional name of the field for the x-axis. If not passed in,
                 all records are used and weighted equally.
   :type index: str, optional
   :param by: Optional grouping field(s)
   :type by: str or list[str], optional

   :returns: A (Polars) dataframe with cum_x and cum_y columns and optionally
             the grouping column(s). Values for cum_x and cum_y are relative,
             expressed as values 0-1.
   :rtype: pl.DataFrame

   .. rubric:: Examples

   >>> gains_data = gains_table(df, 'ResponseCount', by=['Channel', 'Direction'])
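For intuition about the bin-count AUC functions above, here is a minimal pure-Python sketch: order the bins by empirical propensity, accumulate the ROC curve bin by bin, and integrate with the trapezoid rule. This is an illustrative reimplementation, not the pdstools code; the function name ``roc_auc_from_bins`` is made up for this sketch, and ties between bins get no special treatment.

```python
def roc_auc_from_bins(pos, neg):
    """Illustrative AUC from bin counts (not the pdstools implementation)."""
    # Order bins by empirical propensity pos/(pos+neg), best bins first.
    order = sorted(
        range(len(pos)),
        key=lambda i: pos[i] / (pos[i] + neg[i]) if pos[i] + neg[i] else 0.0,
        reverse=True,
    )
    total_pos, total_neg = sum(pos), sum(neg)
    tpr = fpr = area = 0.0
    for i in order:
        new_tpr = tpr + pos[i] / total_pos
        new_fpr = fpr + neg[i] / total_neg
        area += (new_fpr - fpr) * (tpr + new_tpr) / 2  # trapezoid slice
        tpr, fpr = new_tpr, new_fpr
    # Like safe_range_auc: fold scores below 0.5 back into the 0.5-1.0 range.
    return max(area, 1.0 - area)
```

With the bins from the docstring example, ``roc_auc_from_bins([3, 1, 0], [2, 0, 1])`` evaluates to 0.75; a perfectly separating predictor such as ``([5, 0], [0, 5])`` gives 1.0.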
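The ``lift`` and ``z_ratio`` expressions can also be sketched in plain Python. Lift is simply bin propensity over overall propensity. For the Z-ratio, the sketch below uses one common formulation (the difference between a bin's share of positives and its share of negatives, divided by the standard error of that difference), which is my reading of what the Polars expression computes; treat the exact formula as an assumption, and the helper names as hypothetical.

```python
import math

def lift_per_bin(pos, neg):
    # Lift = bin propensity / overall propensity; 1.0 means exactly average.
    overall = sum(pos) / (sum(pos) + sum(neg))
    return [(p / (p + n)) / overall for p, n in zip(pos, neg)]

def z_ratio_per_bin(pos, neg):
    # Z-ratio sketch (assumed formula): difference between the bin's share
    # of all positives and its share of all negatives, divided by the
    # standard error of that difference. Centers around 0 by construction.
    total_pos, total_neg = sum(pos), sum(neg)
    out = []
    for p, n in zip(pos, neg):
        fp, fn = p / total_pos, n / total_neg
        se = math.sqrt(fp * (1 - fp) / total_pos + fn * (1 - fn) / total_neg)
        out.append((fp - fn) / se if se else 0.0)
    return out
```

For symmetric bins such as ``pos=[10, 20, 30], neg=[30, 20, 10]``, the lifts come out as 0.5, 1.0, 1.5 and the Z-ratios as a negative value, zero, and the mirrored positive value.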
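The log-odds and feature-importance formulas documented for ``log_odds_polars`` and ``feature_importance`` can be written out directly. The sketch below follows those stated formulas exactly (Laplace smoothing of 1/nBins, weighted average of absolute log odds); the function names are made up here, and the unscaled variant is shown, so no 0-100 scaling is applied.

```python
import math

def smoothed_log_odds(pos, neg):
    # Per-bin log odds with Laplace smoothing of 1/nBins, per the documented
    # formula: log(pos_i + 1/nBins) - log(sum(pos) + 1)
    #          - [log(neg_i + 1/nBins) - log(sum(neg) + 1)]
    n = len(pos)
    sum_pos, sum_neg = sum(pos) + 1, sum(neg) + 1
    return [
        (math.log(p + 1 / n) - math.log(sum_pos))
        - (math.log(q + 1 / n) - math.log(sum_neg))
        for p, q in zip(pos, neg)
    ]

def unscaled_importance(pos, neg):
    # Weighted average of |log odds|, weighted by each bin's share of the
    # total responses: sum(|logOdds(bin)| * binResponses) / totalResponses.
    total = sum(pos) + sum(neg)
    return sum(
        abs(lo) * (p + q) / total
        for lo, p, q in zip(smoothed_log_odds(pos, neg), pos, neg)
    )
```

A predictor whose bins all have identical positive and negative counts gets importance 0, while a strongly separating predictor such as ``pos=[18, 2], neg=[2, 18]`` scores well above it, which matches the intent of the metric.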
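Finally, the cumulative-gains idea behind ``gains_table`` can be sketched without Polars: sort the values in descending order, then express the cumulative value and the cumulative population share as fractions between 0 and 1, like the ``cum_y`` and ``cum_x`` columns. This is an illustration of the concept, not the pdstools implementation, and the name ``cumulative_gains`` is hypothetical.

```python
def cumulative_gains(values):
    # Sort descending so the "best" records come first, then accumulate.
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    n = len(ordered)
    rows, cum = [], 0.0
    for i, v in enumerate(ordered, start=1):
        cum += v
        rows.append((i / n, cum / total))  # (cum_x, cum_y), both in 0-1
    return rows
```

For ``[30, 50, 20]`` this yields points showing that the top third of records carries half the total, the top two thirds carries 80%, and the full population carries everything.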