pdstools.explanations.Aggregate

Module Contents

class Aggregate(explanations: pdstools.explanations.Explanations.Explanations)

Bases: pdstools.utils.namespaces.LazyNamespace

Parameters:

explanations (pdstools.explanations.Explanations.Explanations)

dependencies = ['polars']
dependency_group = 'explanations'
explanations
data_folderpath
data_pattern = None
df_contextual: polars.LazyFrame | None = None
df_overall: polars.LazyFrame | None = None
context_operations
initialized = False
get_df_contextual() → polars.LazyFrame

Get the contextual dataframe, loading it if not already loaded.

Return type:

polars.LazyFrame

get_df_overall() → polars.LazyFrame

Get the overall dataframe, loading it if not already loaded.

Return type:

polars.LazyFrame

get_predictor_contributions(context: dict[str, str] | None = None, top_n: int = 20, *, sort_by: pdstools.explanations.ExplanationsUtils.SortBy = 'contribution_abs', descending: bool = True, missing: bool = True, remaining: bool = True, include_numeric_single_bin: bool = False) → polars.DataFrame

Get the top-n predictor contributions for a given context or overall.

Args:
context (dict[str, str] | None):

The context to filter contributions by. If None, contributions for all contexts will be returned.

top_n (int):

Number of top predictors.

sort_by (str, keyword-only):

Column to rank/select top predictors. One of contribution, contribution_abs, contribution_weighted, contribution_weighted_abs. Default: "contribution_abs".

descending (bool, keyword-only):

Sort order. If True, the most impactful contributions come first; if False, the least impactful come first. Default: True.

missing (bool, keyword-only):

Include missing-value bins. Default: True.

remaining (bool, keyword-only):

Include an aggregated “remaining” row for predictors outside the top-n. Default: True.

include_numeric_single_bin (bool, keyword-only):

Include numeric predictors that have only a single bin. Default: False.

Return type:

polars.DataFrame

get_predictor_value_contributions(predictors: list[str], context: dict[str, str] | None = None, top_k: int = 20, *, sort_by: pdstools.explanations.ExplanationsUtils.SortBy = 'contribution_abs', descending: bool = True, missing: bool = True, remaining: bool = True, include_numeric_single_bin: bool = False) → polars.DataFrame

Get the top-k predictor value contributions for a given context or overall.

Args:
predictors (list[str]): Required.

List of predictors to get the contributions for.

context (dict[str, str] | None):

The context to filter contributions by. If None, contributions for all contexts will be returned.

top_k (int):

Number of unique categorical predictor values to return.

sort_by (str, keyword-only):

Column to rank/select top predictors. One of contribution, contribution_abs, contribution_weighted, contribution_weighted_abs. Default: "contribution_abs".

descending (bool, keyword-only):

Sort order. If True, the most impactful contributions come first; if False, the least impactful come first. Default: True.

missing (bool, keyword-only):

Include missing-value bins. Default: True.

remaining (bool, keyword-only):

Include an aggregated “remaining” row for values outside the top-k. Default: True.

include_numeric_single_bin (bool, keyword-only):

Include numeric predictors that have only a single bin. Default: False.

Return type:

polars.DataFrame

validate_folder()

Check that the aggregates folder exists and is not empty.

Raises:

FileNotFoundError: If the aggregates folder does not exist or is empty.
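
A check of this kind can be sketched with pathlib. The function name validate_aggregates_folder and its folder argument are hypothetical, chosen for illustration. This is not the library's implementation, only a minimal example of the documented behavior (raise when the folder is missing or empty).

```python
from __future__ import annotations

from pathlib import Path


def validate_aggregates_folder(folder: str | Path) -> None:
    """Raise FileNotFoundError if `folder` is missing or has no entries.

    Hypothetical helper: the name and argument are assumptions.
    """
    path = Path(folder)
    if not path.is_dir():
        raise FileNotFoundError(f"Aggregates folder not found: {path}")
    # any() stops at the first entry, so a large folder is not fully listed.
    if not any(path.iterdir()):
        raise FileNotFoundError(f"Aggregates folder is empty: {path}")
```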

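get_unique_contexts_list(context_infos: list[pdstools.explanations.ExplanationsUtils.ContextInfo] | None = None, with_partition_col: bool = False) → list[pdstools.explanations.ExplanationsUtils.ContextInfo]
Parameters:
  • context_infos (list[pdstools.explanations.ExplanationsUtils.ContextInfo] | None)

  • with_partition_col (bool)
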
Return type:

list[pdstools.explanations.ExplanationsUtils.ContextInfo]

_load_data()
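_get_predictor_contributions(contexts: list[pdstools.explanations.ExplanationsUtils.ContextInfo] | None = None, predictors: list[str] | None = None, limit: int = 20, descending: bool = True, missing: bool = True, remaining: bool = True, include_numeric_single_bin: bool = False, sort_by: str = 'contribution_abs') → polars.DataFrame
Parameters:
  • contexts (list[pdstools.explanations.ExplanationsUtils.ContextInfo] | None)

  • predictors (list[str] | None)

  • limit (int)

  • descending (bool)

  • missing (bool)

  • remaining (bool)

  • include_numeric_single_bin (bool)

  • sort_by (str)
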
Return type:

polars.DataFrame

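_get_predictor_value_contributions(contexts: list[pdstools.explanations.ExplanationsUtils.ContextInfo] | None = None, predictors: list[str] | None = None, limit: int = 20, descending: bool = True, missing: bool = True, remaining: bool = True, include_numeric_single_bin: bool = False, sort_by: str = 'contribution_abs') → polars.DataFrame
Parameters:
  • contexts (list[pdstools.explanations.ExplanationsUtils.ContextInfo] | None)

  • predictors (list[str] | None)

  • limit (int)

  • descending (bool)

  • missing (bool)

  • remaining (bool)

  • include_numeric_single_bin (bool)

  • sort_by (str)
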
Return type:

polars.DataFrame

_get_df_with_sort_info(df: polars.LazyFrame, sort_by: str = 'contribution_abs') → polars.LazyFrame

Add a sort column and value to the dataframe based on the predictor type.

Sort logic:
  • numeric predictors are sorted by bin order

  • symbolic predictors are sorted by contribution type

Parameters:
  • df (polars.LazyFrame)

  • sort_by (str)

Return type:

polars.LazyFrame
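
The per-type sort rule can be illustrated in plain Python on a list of row dicts. The keys predictor_type, bin_order, sort_column and sort_value, and the literal "NUMERIC", are assumptions made for this sketch. The actual column names and type labels used by the library are not stated here.

```python
def add_sort_info(rows, sort_by="contribution_abs"):
    """Return copies of `rows` with `sort_column` and `sort_value` added.

    Assumed row keys: `predictor_type`, `bin_order`, and the `sort_by` column.
    """
    out = []
    for row in rows:
        # Numeric predictors keep their natural bin order; every other
        # predictor type is ordered by the chosen contribution column.
        column = "bin_order" if row["predictor_type"] == "NUMERIC" else sort_by
        out.append({**row, "sort_column": column, "sort_value": row[column]})
    return out
```

Sorting the result by sort_value within each predictor then keeps numeric bins in their natural order while ranking symbolic values by contribution.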

_filter_for_predictors(df: polars.LazyFrame, predictors: list[str]) → polars.LazyFrame
Parameters:
  • df (polars.LazyFrame)

  • predictors (list[str])

Return type:

polars.LazyFrame

_get_df_with_top_limit(df: polars.LazyFrame, over: list[str], sort_by: str = 'contribution_abs', limit: int = 20, descending: bool = True) → polars.LazyFrame

Return the top limit rows per group, ranked by sort_by.

For each unique combination of values in over, keeps only the limit rows with the highest (or lowest) value in sort_by.

When descending=True (the default), the rows with the largest values are kept, so the most impactful contributions rise to the top. When descending=False, the rows with the smallest values are kept instead, which is useful when selecting the least influential predictors.

Note: Polars’ top_k_by uses a reverse parameter whose semantics are the opposite of descending. reverse=False returns the k largest values, while reverse=True returns the k smallest. To keep the caller-facing API intuitive (descending=True → largest values), we pass reverse=not descending to Polars internally.

Parameters:
  • df (polars.LazyFrame)

  • over (list[str])

  • sort_by (str)

  • limit (int)

  • descending (bool)

Return type:

polars.LazyFrame
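
The selection semantics of descending can be sketched in plain Python, using heapq in place of Polars so that the example runs without extra dependencies. The function name top_limit_per_group and the dict-based rows are hypothetical. The sketch mirrors the documented behavior and is not the method's actual code.

```python
import heapq
from collections import defaultdict


def top_limit_per_group(rows, over, sort_by="contribution_abs",
                        limit=20, descending=True):
    """Keep at most `limit` rows per group, ranked by `sort_by`."""
    groups = defaultdict(list)
    for row in rows:
        # The group key is the combination of values in the `over` columns.
        groups[tuple(row[col] for col in over)].append(row)

    # descending=True keeps the largest values, descending=False the smallest.
    pick = heapq.nlargest if descending else heapq.nsmallest
    result = []
    for group_rows in groups.values():
        result.extend(pick(limit, group_rows, key=lambda r: r[sort_by]))
    return result
```

Here the choice between nlargest and nsmallest plays the role that reverse=not descending plays in the Polars call described in the note.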

_get_missing_predictor_values_df(df: polars.LazyFrame) → polars.LazyFrame
Parameters:

df (polars.LazyFrame)

Return type:

polars.LazyFrame

_get_df(contexts: list[pdstools.explanations.ExplanationsUtils.ContextInfo] | None = None)
Parameters:

contexts (list[pdstools.explanations.ExplanationsUtils.ContextInfo] | None)

_get_base_df(df_filtered_contexts: polars.DataFrame | None = None) → polars.LazyFrame
Parameters:

df_filtered_contexts (polars.DataFrame | None)

Return type:

polars.LazyFrame

_get_group_by_columns(predictors: list[str] | None = None) → list[str]
Parameters:

predictors (list[str] | None)

Return type:

list[str]

_get_sort_over_columns(predictors: list[str] | None = None) → list[str]
Parameters:

predictors (list[str] | None)

Return type:

list[str]

_calculate_remaining_aggregates(df_all: polars.LazyFrame, df_anti: polars.LazyFrame, anti_on: list[str], frequency_over: list[str], aggregate_over: list[str]) → polars.LazyFrame

Anti-join to isolate non-top rows, aggregate, and label as ‘remaining’.

Parameters:
  • df_all (polars.LazyFrame)

  • df_anti (polars.LazyFrame)

  • anti_on (list[str])

  • frequency_over (list[str])

  • aggregate_over (list[str])

Return type:

polars.LazyFrame
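
The anti-join step can be sketched in plain Python. The function name remaining_aggregate, the row keys predictor_name, contribution and frequency, and the choice of a simple mean for the aggregated contribution are all assumptions for illustration. The library's actual aggregation may differ.

```python
def remaining_aggregate(all_rows, top_rows, anti_on):
    """Collapse rows that are not in `top_rows` into one 'remaining' row.

    Assumed row keys: `predictor_name`, `contribution`, `frequency`.
    Returns None when every row is already among the top rows.
    """
    top_keys = {tuple(r[c] for c in anti_on) for r in top_rows}
    # Anti-join: keep rows whose key does not appear among the top rows.
    rest = [r for r in all_rows
            if tuple(r[c] for c in anti_on) not in top_keys]
    if not rest:
        return None
    return {
        "predictor_name": "remaining",
        "frequency": sum(r["frequency"] for r in rest),
        # Plain mean, chosen here only to keep the sketch short.
        "contribution": sum(r["contribution"] for r in rest) / len(rest),
    }
```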

static _label_remaining(df: polars.LazyFrame, aggregate_over: list[str]) → polars.LazyFrame

Add ‘remaining’ labels based on aggregation granularity.

Parameters:
  • df (polars.LazyFrame)

  • aggregate_over (list[str])

Return type:

polars.LazyFrame

_calculate_aggregates(df: polars.LazyFrame, frequency_over: list[str], aggregate_over: list[str]) → polars.LazyFrame

Enrich with total_frequency at frequency_over level, then aggregate at aggregate_over level.

Parameters:
  • df (polars.LazyFrame)

  • frequency_over (list[str])

  • aggregate_over (list[str])

Return type:

polars.LazyFrame

static _add_total_frequency_to_df(df, group_by)
add_frequency_pct_to_df(df, group_by)

Add a frequency percentage column to the dataframe based on the total frequency per group.
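
A per-group frequency percentage can be sketched in plain Python as follows. The name add_frequency_pct, the frequency and frequency_pct keys, and the 0 to 100 scale are assumptions. The library may use different column names or express the value as a fraction.

```python
from collections import defaultdict


def add_frequency_pct(rows, group_by):
    """Return copies of `rows` with a `frequency_pct` key (0 to 100)."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[c] for c in group_by)] += row["frequency"]

    out = []
    for row in rows:
        total = totals[tuple(row[c] for c in group_by)]
        # Guard against groups whose frequencies sum to zero.
        pct = 100 * row["frequency"] / total if total else 0.0
        out.append({**row, "frequency_pct": pct})
    return out
```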

static _get_mean_aggregates()

Get mean contribution aggregates.

static _get_weighted_aggregates()

Get frequency-weighted contribution aggregates normalized by total frequency.
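
One common reading of "frequency-weighted, normalized by total frequency" is sum(contribution × frequency) / total_frequency. The sketch below assumes that formula, along with the hypothetical name weighted_contribution and the row keys contribution and frequency. It is not taken from the library's source.

```python
def weighted_contribution(rows, total_frequency):
    """Frequency-weighted contribution, normalized by `total_frequency`."""
    if total_frequency == 0:
        return 0.0
    weighted_sum = sum(r["contribution"] * r["frequency"] for r in rows)
    return weighted_sum / total_frequency
```

With contributions 0.5 and -0.25 at frequencies 10 and 30, the weighted value is negative while the plain mean is positive, which shows why the weighted and unweighted columns can rank predictors differently.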

static _get_frequency_aggregate()

Get frequency sum aggregate.

static _get_bounds_aggregates()

Get min and max contribution bounds.

_agg_over_columns_in_df(df, group_by)

Aggregate contribution metrics over specified columns.

static _filter_single_bin_numeric_predictors(df: polars.LazyFrame) → polars.LazyFrame

Remove numeric predictors that have only a single non-missing bin.

Parameters:

df (polars.LazyFrame)

Return type:

polars.LazyFrame
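
The filter can be sketched in plain Python on row dicts. The name drop_single_bin_numeric, the keys predictor_name, predictor_type and is_missing, and the literal "NUMERIC" are assumptions for illustration, not the library's schema.

```python
from collections import Counter


def drop_single_bin_numeric(rows):
    """Drop every row of numeric predictors with exactly one non-missing bin."""
    non_missing_bins = Counter(
        r["predictor_name"]
        for r in rows
        if r["predictor_type"] == "NUMERIC" and not r["is_missing"]
    )
    return [
        r for r in rows
        # Non-numeric predictors always pass; numeric ones pass unless they
        # have exactly one non-missing bin.
        if r["predictor_type"] != "NUMERIC"
        or non_missing_bins[r["predictor_name"]] != 1
    ]
```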