pdstools.explanations.Schema
============================

.. py:module:: pdstools.explanations.Schema

.. autoapi-nested-parse::

   Polars schemas for explanations input parquet files and aggregate outputs.

   Mirrors the pattern used by ``pdstools.adm.Schema``: each class is a
   collection of class-level attributes naming the expected columns and their
   polars dtypes. Apply with ``cdh_utils._apply_schema_types``.

   The raw explanation parquet schema is the public contract between Pega and
   the Explanations module. Validating against it up front (in
   ``Preprocess.generate``) turns malformed inputs into a clear ``ValueError``
   instead of a cryptic DuckDB error mid-processing.

Attributes
----------

.. autoapisummary::

   pdstools.explanations.Schema.REQUIRED_RAW_COLUMNS

Classes
-------

.. autoapisummary::

   pdstools.explanations.Schema.RawExplanationData
   pdstools.explanations.Schema.ContextualAggregate
   pdstools.explanations.Schema.OverallAggregate

Module Contents
---------------

.. py:class:: RawExplanationData

   Schema for a single explanation parquet file produced by Pega.

   Each row is one (sample, predictor) SHAP-coefficient observation. Context
   columns (``pyChannel``, ``pyDirection``, ``pyIssue``, ``pyGroup``,
   ``pyName``, ``pyTreatment``) are user-configurable and not all of them are
   required to be present, so they are not part of the strict required-columns
   check. The ``partition`` column (a JSON-encoded context dict) is required
   because every downstream SQL aggregation groups by it.

   .. py:attribute:: pySubjectID

   .. py:attribute:: pyInteractionID

   .. py:attribute:: predictor_name

   .. py:attribute:: predictor_type

   .. py:attribute:: symbolic_value

   .. py:attribute:: numeric_value

   .. py:attribute:: shap_coeff

   .. py:attribute:: score

   .. py:attribute:: partition

.. py:data:: REQUIRED_RAW_COLUMNS
   :type: tuple[str, ...]
   :value: ('pyInteractionID', 'predictor_name', 'predictor_type', 'shap_coeff', 'partition')

   Columns that must be present in every raw explanation parquet file.
   ``symbolic_value`` and ``numeric_value`` are technically optional per row
   (one is null depending on ``predictor_type``), but at least one must exist
   as a column or the SQL queries fail. This is checked separately in
   ``_validate_raw_data``.

.. py:class:: ContextualAggregate

   Schema for the per-context aggregate parquet (``*_BATCH_*.parquet``).

   Produced by ``Preprocess._parquet_in_batches`` from
   ``resources/queries/numeric.sql`` or ``symbolic.sql``.

   .. py:attribute:: partition

   .. py:attribute:: predictor_name

   .. py:attribute:: predictor_type

   .. py:attribute:: bin_contents

   .. py:attribute:: bin_order

   .. py:attribute:: contribution_abs

   .. py:attribute:: contribution

   .. py:attribute:: contribution_min

   .. py:attribute:: contribution_max

   .. py:attribute:: frequency

.. py:class:: OverallAggregate

   Schema for the per-model aggregate parquet (``*_OVERALL.parquet``).

   Same shape as ``ContextualAggregate`` but ``partition`` is always the
   literal string ``'whole_model'``.

   .. py:attribute:: partition

   .. py:attribute:: predictor_name

   .. py:attribute:: predictor_type

   .. py:attribute:: bin_contents

   .. py:attribute:: bin_order

   .. py:attribute:: contribution_abs

   .. py:attribute:: contribution

   .. py:attribute:: contribution_min

   .. py:attribute:: contribution_max

   .. py:attribute:: frequency
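The up-front validation described above can be sketched as follows. This is an
illustrative standalone version only: ``validate_raw_columns`` is a hypothetical
helper, not the actual ``Preprocess._validate_raw_data`` implementation (whose
signature is not shown in this reference). The column tuple matches
``REQUIRED_RAW_COLUMNS`` above.

```python
# Hypothetical sketch of the required-columns check; the real check lives in
# Preprocess._validate_raw_data and may differ in detail.

REQUIRED_RAW_COLUMNS = (
    "pyInteractionID",
    "predictor_name",
    "predictor_type",
    "shap_coeff",
    "partition",
)


def validate_raw_columns(columns: list) -> None:
    """Raise a clear ValueError when a raw explanation parquet is malformed.

    ``columns`` is the file's column-name list, e.g. obtained via
    ``pl.scan_parquet(path).collect_schema().names()``.
    """
    missing = [c for c in REQUIRED_RAW_COLUMNS if c not in columns]
    if missing:
        raise ValueError(f"Raw explanation data is missing required columns: {missing}")
    # At least one value column must exist or the SQL aggregations fail.
    if not {"symbolic_value", "numeric_value"} & set(columns):
        raise ValueError("Need at least one of 'symbolic_value' or 'numeric_value'")
```

Checking the schema before any DuckDB query runs is what turns a malformed
input into an immediate, readable error rather than a failure mid-aggregation.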