pdstools.decision_analyzer.utils

Attributes

Classes

ColumnResolver

Resolves column mappings between raw data and a standardized schema.

Functions

apply_filter(df[, filters])

Apply a global set of filters.

area_under_curve(df, col_x, col_y)

gini_coefficient(df, col_x, col_y)

get_first_level_stats(interaction_data[, filters])

Returns first-level stats of a dataframe for the filter summary.

resolve_aliases(→ polars.LazyFrame)

Rename alias columns to their canonical raw key names before validation.

determine_extract_type(raw_data)

Detect whether the data is a Decision Analyzer (v2) or Explainability Extract (v1).

rename_and_cast_types(→ polars.LazyFrame)

Rename columns and cast data types based on table definition.

_cast_columns(→ polars.LazyFrame)

Cast columns to their target types.

get_table_definition(table)

create_hierarchical_selectors(→ dict[str, dict[str, ...)

Create hierarchical filter options and calculate indices for selectbox widgets.

get_scope_config(→ dict[str, ...)

Generate scope configuration for lever application and plotting based on user selections.

_get_interaction_id_candidates(→ list[str])

Build the set of possible interaction ID column names from the schema.

_find_interaction_id_column(→ str)

Return the first matching interaction ID column name from the data.

sample_interactions(→ polars.LazyFrame)

Sample interactions from a LazyFrame before ingestion.

sample_and_save(→ polars.LazyFrame)

Sample interactions and persist the result as a parquet file.

parse_sample_flag(→ dict[str, int | float])

Parse the --sample CLI flag value into keyword arguments.

Module Contents

class ColumnResolver

Resolves column mappings between raw data and a standardized schema.

Raw decision data can come from multiple sources with different schemas:
- Explainability Extract vs Decision Analyzer exports
- Inbound vs Outbound channel data

For example, channel information may appear as:
- ‘Channel’ (already using the display name)
- ‘pyChannel’ (an alias for the display name)
- ‘Primary_ContainerPayload_Channel’ (raw name needing rename)
- Both raw key and display_name present (conflict requiring resolution)

This class normalizes these variations by:
- Mapping raw column names to standardized display names
- Resolving conflicts when both raw and display_name columns exist
- Building the final schema with consistent column names

table_definition

Column definitions with ‘display_name’, ‘default’, and ‘type’ keys

Type:

dict

raw_columns

Column names present in the raw data

Type:

set[str]

table_definition: dict
raw_columns: set[str]
rename_mapping: dict[str, str]
type_mapping: dict[str, type[polars.DataType]]
columns_to_drop: list[str] = []
final_columns: list[str] = []
_resolved: bool = False
__post_init__()
resolve() ColumnResolver

Resolve all column mappings and conflicts.

Returns:

Self, for method chaining

Return type:

ColumnResolver

get_missing_columns() list[str]

Get list of required columns missing from the raw data.

Returns:

Column names that are marked as default but not found in raw data

Return type:

list[str]
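
The resolution rules described above can be sketched with a minimal, dependency-free stand-in. This is an illustration only: the real class is a dataclass operating on polars schemas, and the table-definition shape here (`display_name`/`default` keys) follows the attribute documentation above, not the actual implementation.

```python
def resolve_columns(table_definition: dict, raw_columns: set[str]) -> dict:
    """Sketch of the resolution pass: build a raw-key -> display-name rename
    mapping, drop raw keys that conflict with an existing display-name column,
    and collect required (default) columns that are missing entirely."""
    rename_mapping: dict[str, str] = {}
    columns_to_drop: list[str] = []
    for raw_key, spec in table_definition.items():
        display = spec["display_name"]
        if raw_key in raw_columns and display in raw_columns:
            # Conflict: both raw and display-name columns exist;
            # keep the display-name column and drop the raw one.
            columns_to_drop.append(raw_key)
        elif raw_key in raw_columns:
            rename_mapping[raw_key] = display
    missing = [
        spec["display_name"]
        for raw_key, spec in table_definition.items()
        if spec.get("default")
        and raw_key not in raw_columns
        and spec["display_name"] not in raw_columns
    ]
    return {"rename": rename_mapping, "drop": columns_to_drop, "missing": missing}
```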

SCOPE_HIERARCHY = ['Issue', 'Group', 'Action']
PRIO_FACTORS = ['Propensity', 'Value', 'Context Weight', 'Levers']
PRIO_COMPONENTS = ['Propensity', 'Value', 'Context Weight', 'Levers', 'Priority']
apply_filter(df: polars.LazyFrame, filters: polars.Expr | list[polars.Expr] | None = None)

Apply a global set of filters. Kept outside of the DecisionData class because it is a general utility function, not bound to that class.

Parameters:
  • df (polars.LazyFrame)

  • filters (polars.Expr | list[polars.Expr] | None)

area_under_curve(df: polars.DataFrame, col_x: str, col_y: str)
Parameters:
  • df (polars.DataFrame)

  • col_x (str)

  • col_y (str)

gini_coefficient(df: polars.DataFrame, col_x: str, col_y: str)
Parameters:
  • df (polars.DataFrame)

  • col_x (str)

  • col_y (str)
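
Neither helper is documented beyond its signature. A common construction, assumed here rather than taken from the source, integrates the (col_x, col_y) curve with the trapezoidal rule and derives the Gini coefficient as 2 · AUC − 1 (zero on the line of equality):

```python
def area_under_curve(xs: list[float], ys: list[float]) -> float:
    """Trapezoidal area under the curve defined by (x, y) pairs, sorted by x."""
    pairs = sorted(zip(xs, ys))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

def gini_coefficient(xs: list[float], ys: list[float]) -> float:
    """Gini coefficient relative to the line of equality (which has AUC 0.5)."""
    return 2.0 * area_under_curve(xs, ys) - 1.0
```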

get_first_level_stats(interaction_data: polars.LazyFrame, filters: list[polars.Expr] | None = None)

Returns first-level stats of a dataframe for the filter summary.

Shows unique actions (Issue/Group/Action combinations), unique interactions (decisions), and total rows so users understand the impact of their filters.

Parameters:
  • interaction_data (polars.LazyFrame)

  • filters (list[polars.Expr] | None)
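
The three statistics can be sketched dependency-free; the real function aggregates a polars LazyFrame, and the field names below are illustrative:

```python
def get_first_level_stats(rows: list[dict]) -> dict:
    """Count unique actions (Issue/Group/Action combinations), unique
    interactions (decisions), and total rows."""
    actions = {(r["Issue"], r["Group"], r["Action"]) for r in rows}
    interactions = {r["InteractionID"] for r in rows}
    return {
        "unique_actions": len(actions),
        "unique_interactions": len(interactions),
        "total_rows": len(rows),
    }
```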

resolve_aliases(df: polars.LazyFrame, *table_definitions: dict) polars.LazyFrame

Rename alias columns to their canonical raw key names before validation.

Scans all table definitions for alias entries. If an alias is found in the data but neither the raw key nor the display_name is present, the column is renamed to the raw key so downstream processing can find it.

Parameters:
  • df (pl.LazyFrame) – Raw data that may use alternative column names.

  • *table_definitions (dict) – One or more table definition dicts (DecisionAnalyzer, ExplainabilityExtract).

Returns:

Data with alias columns renamed to canonical raw key names.

Return type:

pl.LazyFrame
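
The rename rule can be sketched on plain column lists (the real function renames columns of a LazyFrame; the table-definition shape is assumed from the descriptions above):

```python
def resolve_aliases(columns: list[str], *table_definitions: dict) -> dict[str, str]:
    """Build an alias -> raw-key rename mapping. An alias is renamed only
    when neither the raw key nor the display name is already present."""
    present = set(columns)
    renames: dict[str, str] = {}
    for table_definition in table_definitions:
        for raw_key, spec in table_definition.items():
            if raw_key in present or spec["display_name"] in present:
                continue  # canonical column already there; leave the alias alone
            for alias in spec.get("aliases", []):
                if alias in present:
                    renames[alias] = raw_key
                    break
    return renames
```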

determine_extract_type(raw_data)

Detect whether the data is a Decision Analyzer (v2) or Explainability Extract (v1).

The heuristic is: if any column name matches the raw key, display name, or aliases for the pxStrategyName entry in the DecisionAnalyzer table definition, the data is v2.
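
The heuristic reduces to a set intersection; a sketch, with the return labels and the spec shape chosen for illustration:

```python
def determine_extract_type(columns: set[str], strategy_spec: dict) -> str:
    """v2 (Decision Analyzer) if any column matches the raw key, display
    name, or an alias of the pxStrategyName entry; otherwise v1."""
    candidates = {"pxStrategyName",
                  strategy_spec["display_name"],
                  *strategy_spec.get("aliases", [])}
    return "decision_analyzer" if columns & candidates else "explainability_extract"
```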

rename_and_cast_types(df: polars.LazyFrame, table_definition: dict) polars.LazyFrame

Rename columns and cast data types based on table definition.

Performs a single-pass rename from raw column keys to display names, then casts types for default columns.

Parameters:
  • df (pl.LazyFrame) – The input dataframe to process

  • table_definition (dict) – Dictionary containing column definitions with ‘display_name’, ‘default’, and ‘type’ keys

Returns:

Processed dataframe with renamed columns and cast types

Return type:

pl.LazyFrame

_cast_columns(df: polars.LazyFrame, type_mapping: dict[str, type[polars.DataType]]) polars.LazyFrame

Cast columns to their target types.

Parameters:
  • df (pl.LazyFrame) – The dataframe to process

  • type_mapping (dict[str, type[pl.DataType]]) – Mapping of column names to their target types

Returns:

Dataframe with columns cast to target types

Return type:

pl.LazyFrame
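
The rename-then-cast pass can be sketched on dict rows; the real functions do this with `LazyFrame.rename` and `Expr.cast`, and the `type` values here are plain Python callables standing in for polars data types:

```python
def rename_and_cast_types(rows: list[dict], table_definition: dict) -> list[dict]:
    """Single-pass rename from raw keys to display names, then cast the
    values of default columns to their target type."""
    rename = {raw: spec["display_name"] for raw, spec in table_definition.items()}
    types = {spec["display_name"]: spec["type"]
             for spec in table_definition.values() if spec.get("default")}
    out = []
    for row in rows:
        renamed = {rename.get(col, col): value for col, value in row.items()}
        out.append({col: types[col](value) if col in types else value
                    for col, value in renamed.items()})
    return out
```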

get_table_definition(table: str)
Parameters:

table (str)

create_hierarchical_selectors(data: polars.LazyFrame, selected_issue: str | None = None, selected_group: str | None = None, selected_action: str | None = None) dict[str, dict[str, list[str] | int]]

Create hierarchical filter options and calculate indices for selectbox widgets.

Args:

data: LazyFrame with hierarchical data (should be pre-filtered to the desired stage)
selected_issue: Currently selected issue (optional)
selected_group: Currently selected group (optional)
selected_action: Currently selected action (optional)

Returns:

dict with structure:

{
    "issues": {"options": […], "index": 0},
    "groups": {"options": ["All", …], "index": 0},
    "actions": {"options": ["All", …], "index": 0}
}

Parameters:
  • data (polars.LazyFrame)

  • selected_issue (str | None)

  • selected_group (str | None)

  • selected_action (str | None)

Return type:

dict[str, dict[str, list[str] | int]]

get_scope_config(selected_issue: str, selected_group: str, selected_action: str) dict[str, str | polars.Expr | list[str]]

Generate scope configuration for lever application and plotting based on user selections.

Parameters:
  • selected_issue (str) – Selected issue value from dropdown (can be “All”)

  • selected_group (str) – Selected group value from dropdown (can be “All”)

  • selected_action (str) – Selected action value from dropdown (can be “All”)

Returns:

Configuration dictionary containing:
- level: “Action”, “Group”, or “Issue” indicating scope level
- lever_condition: Polars expression for filtering selected actions
- group_cols: List of column names for grouping operations
- x_col: Column name to use for x-axis in plots
- selected_value: The actual selected value for highlighting
- plot_title_prefix: Prefix for plot titles

Return type:

dict[str, str | pl.Expr | list[str]]
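
The level-selection logic can be sketched as picking the most specific non-“All” dropdown (the polars `lever_condition` expression and plot-title handling are omitted; key choices are illustrative):

```python
def get_scope_config(selected_issue: str,
                     selected_group: str,
                     selected_action: str) -> dict:
    """Pick the most specific selected level and the grouping columns,
    following the Issue > Group > Action hierarchy."""
    if selected_action != "All":
        level, group_cols, selected = "Action", ["Issue", "Group", "Action"], selected_action
    elif selected_group != "All":
        level, group_cols, selected = "Group", ["Issue", "Group"], selected_group
    else:
        level, group_cols, selected = "Issue", ["Issue"], selected_issue
    return {"level": level, "group_cols": group_cols,
            "x_col": level, "selected_value": selected}
```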

logger
_INTERACTION_ID_RAW_KEY = 'pxInteractionID'
_get_interaction_id_candidates() list[str]

Build the set of possible interaction ID column names from the schema.

Collects the raw key, display name, and aliases from both table definitions so this stays in sync with column_schema.py.

Return type:

list[str]

_find_interaction_id_column(columns: set[str]) str

Return the first matching interaction ID column name from the data.

Parameters:

columns (set[str])

Return type:

str
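
A sketch of the first-match lookup; in the real code the candidate list is built from the table definitions (raw key `pxInteractionID`, its display name, and aliases), whereas here it is passed in explicitly:

```python
def find_interaction_id_column(columns: set[str], candidates: list[str]) -> str:
    """Return the first candidate column name present in the data;
    raise if none of the candidates match."""
    for name in candidates:
        if name in columns:
            return name
    raise ValueError(f"No interaction ID column found among {candidates}")
```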

sample_interactions(df: polars.LazyFrame, n: int | None = None, fraction: float | None = None, id_column: str | None = None) polars.LazyFrame

Sample interactions from a LazyFrame before ingestion.

Uses deterministic hash-based filtering so the same data and limit always produce the same sample. All rows belonging to a selected interaction are kept (stratified on interaction ID).

Exactly one of n or fraction must be provided.

Parameters:
  • df (pl.LazyFrame) – Raw data to sample from.

  • n (int, optional) – Maximum number of unique interactions to keep.

  • fraction (float, optional) – Fraction of interactions to keep (0.0–1.0).

  • id_column (str, optional) – Name of the interaction ID column. Auto-detected if not given.

Returns:

Filtered LazyFrame containing only the sampled interactions.

Return type:

pl.LazyFrame
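
The deterministic, stratified sampling described above can be sketched dependency-free. The real function hashes the ID column inside polars; here SHA-256 stands in as the deterministic hash, and only the `fraction` path is shown:

```python
import hashlib

def sample_interactions(rows: list[dict], fraction: float,
                        id_column: str = "pxInteractionID") -> list[dict]:
    """Keep every row whose interaction ID hashes into the selected bucket
    range, so the same data and fraction always yield the same sample and
    rows of one interaction are never split apart."""
    def bucket(interaction_id) -> float:
        digest = hashlib.sha256(str(interaction_id).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return [row for row in rows if bucket(row[id_column]) < fraction]
```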

sample_and_save(df: polars.LazyFrame, n: int | None = None, fraction: float | None = None, output_dir: str | None = None) polars.LazyFrame

Sample interactions and persist the result as a parquet file.

Writes decision_analyzer_sample.parquet into output_dir (defaults to the current working directory). Returns a LazyFrame scanning the written file so downstream code benefits from a fast parquet scan.

If the data is smaller than the requested sample, sampling is skipped and the original LazyFrame is returned unchanged (no file is written).

Parameters:
  • df (pl.LazyFrame) – Raw data to sample from.

  • n (int, optional) – Maximum number of unique interactions to keep.

  • fraction (float, optional) – Fraction of interactions to keep (0.0–1.0).

  • output_dir (str, optional) – Directory for the sample parquet file. Defaults to ".".

Returns:

Either a scan of the written parquet file, or the original LazyFrame when sampling was skipped.

Return type:

pl.LazyFrame

parse_sample_flag(value: str) dict[str, int | float]

Parse the --sample CLI flag value into keyword arguments.

Supports absolute counts ("100000") and percentages ("10%").

Parameters:

value (str)

Returns:

Either {"n": <int>} or {"fraction": <float>}.

Return type:

dict
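
A sketch of the documented parsing behavior, absolute counts versus percentages (error handling is assumption on my part):

```python
def parse_sample_flag(value: str) -> dict:
    """Parse a --sample value: "100000" -> {"n": 100000},
    "10%" -> {"fraction": 0.1}."""
    value = value.strip()
    if value.endswith("%"):
        return {"fraction": float(value[:-1]) / 100.0}
    return {"n": int(value)}
```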