pdstools.utils.report_utils._polars_helpers

Small Polars helpers and aggregations used by Quarto reports.

Functions

`polars_col_exists`(df, col)
`polars_subset_to_existing_cols`(all_columns, cols)
`n_unique_values`(dm, all_dm_cols, fld)
`max_by_hierarchy`(dm, all_dm_cols, fld, grouping)
`avg_by_hierarchy`(dm, all_dm_cols, fld, grouping)
`sample_values`(dm, all_dm_cols, fld[, n])
`gains_table`(df, value[, index, by]) → polars.DataFrame	Calculate cumulative gains for visualization.

Module Contents

polars_col_exists(df, col)

polars_subset_to_existing_cols(all_columns, cols)

n_unique_values(dm, all_dm_cols, fld)

max_by_hierarchy(dm, all_dm_cols, fld, grouping)

avg_by_hierarchy(dm, all_dm_cols, fld, grouping)

sample_values(dm, all_dm_cols, fld, n=6)

gains_table(df: polars.LazyFrame | polars.DataFrame, value: str, index: str | None = None, by: str | list[str] | None = None) → polars.DataFrame

Calculate cumulative gains for visualization.

Computes the cumulative distribution of a value metric, with rows sorted by the ratio of value to index (or by value alone if no index is given). Used in gains charts to show how skewed model responses are.

Parameters:

df (pl.LazyFrame | pl.DataFrame) – Input data containing the value and optional index columns
value (str) – Column name containing the metric to compute gains for (e.g., “ResponseCount”)
index (str, optional) – Column name to normalize by (e.g., population size). If None, uses row count.
by (str | list[str], optional) – Column(s) to group by for separate gain curves. If None, computes single curve.

Returns:

DataFrame with columns:

- cum_x: Cumulative proportion of index (or models)
- cum_y: Cumulative proportion of value
- by columns: if by is specified

Return type:

pl.DataFrame

Examples

>>> # Single gains curve for response count
>>> gains = gains_table(df, value="ResponseCount")

>>> # Gains curves by channel, normalized by population
>>> gains = gains_table(df, value="Positives", index="Population", by="Channel")
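
To make the meaning of cum_x and cum_y concrete, the following is a minimal plain-Python sketch of the calculation described above. It is not the library's implementation: `gains_sketch` is a hypothetical helper, the descending sort order and the leading (0, 0) origin point are assumptions based on how gains charts are usually drawn, and grouping via `by` is left out. Rows with a zero index value are not handled.

```python
def gains_sketch(rows, value, index=None):
    """Hypothetical helper: return (cum_x, cum_y) points for a list of dict rows."""
    if index is None:
        # No index: rank rows by the value itself; each row has equal weight on the x-axis.
        ordered = sorted(rows, key=lambda r: r[value], reverse=True)
        weights = [1.0] * len(ordered)
    else:
        # With an index: rank rows by value per unit of index (assumed descending).
        ordered = sorted(rows, key=lambda r: r[value] / r[index], reverse=True)
        weights = [float(r[index]) for r in ordered]

    total_x = sum(weights)
    total_y = sum(r[value] for r in ordered)

    points = [(0.0, 0.0)]  # assumption: the curve starts at the origin
    run_x = run_y = 0.0
    for row, weight in zip(ordered, weights):
        run_x += weight
        run_y += row[value]
        points.append((run_x / total_x, run_y / total_y))
    return points


# Illustrative data only: four models with made-up response counts.
rows = [
    {"Model": "A", "ResponseCount": 60},
    {"Model": "B", "ResponseCount": 30},
    {"Model": "C", "ResponseCount": 10},
    {"Model": "D", "ResponseCount": 0},
]
curve = gains_sketch(rows, value="ResponseCount")
for cum_x, cum_y in curve:
    print(f"{cum_x:.2f} {cum_y:.2f}")
```

In this toy data the top quarter of models accounts for 60% of all responses, which is the kind of skew a gains chart is meant to expose. A curve close to the diagonal would indicate responses spread evenly across models.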