pdstools.explanations.Schema

Polars schemas for explanations input parquet files and aggregate outputs.

Mirrors the pattern used by pdstools.adm.Schema: each class is a collection of class-level attributes naming the expected columns and their polars dtypes. Apply with cdh_utils._apply_schema_types.
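The class-attribute pattern can be sketched in plain Python. The class name and dtype strings below are stand-ins (the real schemas declare actual polars dtypes such as `pl.Utf8`); only the column names come from this page:

```python
class RawExplanationDataSketch:
    """Hypothetical sketch of the Schema pattern: class-level attributes
    map expected column names to dtypes. The string dtype values are
    stand-ins for the polars dtypes the real module uses."""

    pySubjectID = "Utf8"
    pyInteractionID = "Utf8"
    predictor_name = "Utf8"
    predictor_type = "Utf8"
    symbolic_value = "Utf8"
    numeric_value = "Float64"
    shap_coeff = "Float64"
    score = "Float64"
    partition = "Utf8"


def schema_columns(schema_cls: type) -> dict[str, str]:
    """Collect the column -> dtype mapping from the class attributes,
    skipping the dunder entries Python adds to every class."""
    return {
        name: dtype
        for name, dtype in vars(schema_cls).items()
        if not name.startswith("_")
    }
```

A helper in the spirit of cdh_utils._apply_schema_types would iterate such a mapping and cast each matching column of a frame to its declared dtype.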

The raw explanation parquet schema is the public contract between Pega and the Explanations module. Validating against it up front (in Preprocess.generate) turns malformed inputs into a clear ValueError instead of a cryptic DuckDB error mid-processing.
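The up-front validation amounts to a set comparison against the required columns. A minimal sketch, assuming a standalone function (the function name is hypothetical; the column tuple is REQUIRED_RAW_COLUMNS as documented below):

```python
REQUIRED_RAW_COLUMNS = (
    "pyInteractionID",
    "predictor_name",
    "predictor_type",
    "shap_coeff",
    "partition",
)


def check_required_columns(columns: list[str]) -> None:
    """Raise a clear ValueError listing every missing required column,
    rather than letting a later DuckDB query fail cryptically."""
    present = set(columns)
    missing = [c for c in REQUIRED_RAW_COLUMNS if c not in present]
    if missing:
        raise ValueError(
            f"Raw explanation parquet is missing required columns: {missing}"
        )
```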

Attributes

REQUIRED_RAW_COLUMNS

Columns that must be present in every raw explanation parquet file.

Classes

RawExplanationData

Schema for a single explanation parquet file produced by Pega.

ContextualAggregate

Schema for the per-context aggregate parquet (*_BATCH_*.parquet).

OverallAggregate

Schema for the per-model aggregate parquet (*_OVERALL.parquet).

Module Contents

class RawExplanationData

Schema for a single explanation parquet file produced by Pega.

Each row is one (sample, predictor) SHAP-coefficient observation. Context columns (pyChannel, pyDirection, pyIssue, pyGroup, pyName, pyTreatment) are user-configurable and not all of them need be present, so they are excluded from the strict required-columns check. The partition column (a JSON-encoded context dict) is required because every downstream SQL aggregation groups by it.
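Since partition is a JSON-encoded context dict, it decodes with the standard library. The context values below are made up for illustration; only the column names come from this page:

```python
import json

# A hypothetical partition value: the JSON-encoded dict of
# user-configured context columns for one row.
partition = '{"pyChannel": "Web", "pyIssue": "Sales", "pyGroup": "CreditCards"}'

context = json.loads(partition)

# Re-encoding with sorted keys yields a canonical grouping key,
# stable regardless of the original key order.
grouping_key = json.dumps(context, sort_keys=True)
```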

pySubjectID
pyInteractionID
predictor_name
predictor_type
symbolic_value
numeric_value
shap_coeff
score
partition
REQUIRED_RAW_COLUMNS: tuple[str, ...] = ('pyInteractionID', 'predictor_name', 'predictor_type', 'shap_coeff', 'partition')

Columns that must be present in every raw explanation parquet file.

symbolic_value and numeric_value are optional on a per-row basis (one is null depending on predictor_type), but at least one of them must exist as a column or the SQL queries fail. This is checked separately in _validate_raw_data.
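A sketch of that separate check. The helper name _validate_raw_data comes from the text above; this standalone function and its exact message are assumptions:

```python
def check_value_columns(columns: list[str]) -> None:
    """At least one of the two value columns must be present as a
    column, even though either may be null on any given row."""
    if not {"symbolic_value", "numeric_value"} & set(columns):
        raise ValueError(
            "Raw explanation parquet must contain at least one of "
            "'symbolic_value' or 'numeric_value'"
        )
```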

class ContextualAggregate

Schema for the per-context aggregate parquet (*_BATCH_*.parquet).

Produced by Preprocess._parquet_in_batches from resources/queries/numeric.sql or symbolic.sql.

partition
predictor_name
predictor_type
bin_contents
bin_order
contribution_abs
contribution
contribution_min
contribution_max
frequency
class OverallAggregate

Schema for the per-model aggregate parquet (*_OVERALL.parquet).

Same shape as ContextualAggregate but partition is always the literal string 'whole_model'.

partition
predictor_name
predictor_type
bin_contents
bin_order
contribution_abs
contribution
contribution_min
contribution_max
frequency