pdstools.utils.report_utils._polars_helpers

Small Polars helpers and aggregations used by Quarto reports.

Functions

polars_col_exists(df, col)

polars_subset_to_existing_cols(all_columns, cols)

n_unique_values(dm, all_dm_cols, fld)

max_by_hierarchy(dm, all_dm_cols, fld, grouping)

avg_by_hierarchy(dm, all_dm_cols, fld, grouping)

sample_values(dm, all_dm_cols, fld[, n])

gains_table(df, value[, index, by]) → polars.DataFrame

Calculate cumulative gains for visualization.

Module Contents

polars_col_exists(df, col)
polars_subset_to_existing_cols(all_columns, cols)
n_unique_values(dm, all_dm_cols, fld)
max_by_hierarchy(dm, all_dm_cols, fld, grouping)
avg_by_hierarchy(dm, all_dm_cols, fld, grouping)
sample_values(dm, all_dm_cols, fld, n=6)
gains_table(df: polars.LazyFrame | polars.DataFrame, value: str, index: str | None = None, by: str | list[str] | None = None) → polars.DataFrame

Calculate cumulative gains for visualization.

Computes the cumulative distribution of a value metric, with rows sorted by the ratio of value to index (or by value alone if no index is given). Used in gains charts to show how skewed model responses are.

Parameters:
  • df (pl.LazyFrame | pl.DataFrame) – Input data containing the value and optional index columns

  • value (str) – Column name containing the metric to compute gains for (e.g., "ResponseCount")

  • index (str, optional) – Column name to normalize by (e.g., population size). If None, uses row count.

  • by (str | list[str], optional) – Column(s) to group by for separate gain curves. If None, computes single curve.

Returns:

DataFrame with columns:

  • cum_x – Cumulative proportion of index (or of rows, i.e. models, when index is None)

  • cum_y – Cumulative proportion of value

  • by columns – Included only if by is specified

Return type:

pl.DataFrame

Examples

>>> # Single gains curve for response count
>>> gains = gains_table(df, value="ResponseCount")
>>> # Gains curves by channel, normalized by population
>>> gains = gains_table(df, value="Positives", index="Population", by="Channel")
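The calculation described above can be sketched in plain Python to make the cum_x and cum_y columns concrete. This is a minimal illustration under stated assumptions, not the pdstools implementation: the helper name cumulative_gains is hypothetical, rows are assumed to be sorted in descending order, index values are assumed to be positive, and the curve is assumed to start at the origin (0, 0). None of these details is confirmed by the documentation above.

```python
def cumulative_gains(values, index=None):
    """Return (cum_x, cum_y) as lists of cumulative proportions.

    Hypothetical helper for illustration only; not part of pdstools.
    """
    n = len(values)
    if index is None:
        # No index: every row has equal weight, sort by value alone.
        weights = [1.0] * n
        order = sorted(range(n), key=lambda i: values[i], reverse=True)
    else:
        # With an index: sort by the ratio of value to index.
        # Assumption: all index entries are positive.
        weights = list(index)
        order = sorted(
            range(n), key=lambda i: values[i] / weights[i], reverse=True
        )

    total_x = sum(weights)
    total_y = sum(values)

    # Assumption: the curve starts at the origin.
    cum_x, cum_y = [0.0], [0.0]
    running_x = running_y = 0.0
    for i in order:
        running_x += weights[i]
        running_y += values[i]
        cum_x.append(running_x / total_x)
        cum_y.append(running_y / total_y)
    return cum_x, cum_y


# Three models with made-up response counts and population sizes.
responses = [10, 30, 60]
population = [100, 100, 50]

x_rows, y_rows = cumulative_gains(responses)
x_pop, y_pop = cumulative_gains(responses, index=population)
```

With by specified, the same calculation would run once per group, giving one curve per group value.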