pdstools.adm.trees
==================

.. py:module:: pdstools.adm.trees

.. autoapi-nested-parse::

   ADM Gradient Boosting (AGB) model parsing, scoring, and diagnostics.

   This package provides:

   - :class:`Split` and :class:`Node` — small dataclasses describing a parsed
     split condition and a tree node.
   - :class:`ADMTreesModel` — load and analyse a single AGB model.
   - :class:`MultiTrees` — collection of snapshots of the same configuration
     over time.
   - :class:`AGB` — Datamart helper for discovering and extracting AGB models.

   Construction uses explicit factory classmethods (``ADMTreesModel.from_file``,
   ``from_url``, ``from_datamart_blob``, ``from_dict``, and
   ``MultiTrees.from_datamart``). The legacy ``ADMTrees(file, ...)`` polymorphic
   factory was removed in v5; see ``docs/migration-v4-to-v5.md``.

Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/pdstools/adm/trees/_agb/index
   /autoapi/pdstools/adm/trees/_model/index
   /autoapi/pdstools/adm/trees/_multi/index
   /autoapi/pdstools/adm/trees/_nodes/index

Attributes
----------

.. autoapisummary::

   pdstools.adm.trees.SplitOperator

Classes
-------

.. autoapisummary::

   pdstools.adm.trees.AGB
   pdstools.adm.trees.ADMTreesModel
   pdstools.adm.trees.MultiTrees
   pdstools.adm.trees.Node
   pdstools.adm.trees.Split

Functions
---------

.. autoapisummary::

   pdstools.adm.trees.parse_split

Package Contents
----------------

.. py:class:: AGB(datamart: pdstools.adm.ADMDatamart.ADMDatamart)

   Datamart helper for discovering and extracting AGB models.

   Reachable as ``ADMDatamart.agb``; not intended to be instantiated directly.

   .. py:attribute:: datamart

   .. py:method:: discover_model_types(df: polars.LazyFrame, by: str = 'Configuration') -> dict[str, str]

      Discover the type of model embedded in the ``Modeldata`` column.

      Groups by ``by`` (typically Configuration, since one model rule contains
      one model type) and decodes the first ``Modeldata`` blob per group to
      extract its ``_serialClass``.

      :param df: Datamart slice including ``Modeldata``.
         Collected internally.
      :type df: pl.LazyFrame
      :param by: Grouping column. ``Configuration`` is recommended.
      :type by: str

   .. py:method:: get_agb_models(last: bool = False, n_threads: int = 6, query: pdstools.utils.types.QUERY | None = None) -> dict[str, pdstools.adm.trees._multi.MultiTrees]

      Get all AGB models in the datamart, indexed by Configuration.

      Filters down to models whose ``_serialClass`` ends with ``GbModel`` and
      decodes them via :class:`MultiTrees`.

      :param last: If True, use only the latest snapshot per model.
      :type last: bool
      :param n_threads: Worker count for parallel blob decoding.
      :type n_threads: int
      :param query: Optional pre-filter applied before discovery.
      :type query: QUERY | None

.. py:class:: ADMTreesModel(trees: dict, model: list[dict], *, raw_input: Any = None, properties: dict[str, Any] | None = None, learning_rate: float | None = None, context_keys: list | None = None)

   Functions for ADM Gradient Boosting models.

   ADM Gradient Boosting models consist of multiple trees, which build upon
   each other in a 'boosting' fashion. This class provides functions to
   extract data from these trees: the features on which the trees split,
   important values for these features, statistics about the trees, and
   visualisations of each individual tree.

   Construct via :meth:`from_file`, :meth:`from_url`,
   :meth:`from_datamart_blob`, or :meth:`from_dict`.

   .. rubric:: Notes

   The "save model" action in Prediction Studio exports a JSON file that this
   class can load directly. The Datamart's ``pyModelData`` column also
   contains this information, but compressed and with encoded split values;
   the "save model" button decompresses and decodes that data.

   .. py:attribute:: trees
      :type: dict

      The full parsed model JSON.

   .. py:attribute:: model
      :type: list[dict]

      The list of boosted trees (each a nested dict).

   .. py:attribute:: raw_input
      :type: Any

      The raw input used to construct this instance (path, bytes, or dict).

   .. py:attribute:: learning_rate
      :type: float | None
      :value: None

   .. py:attribute:: context_keys
      :type: list | None
      :value: None

   .. py:attribute:: _properties
      :type: dict[str, Any]

   .. py:method:: from_dict(data: dict, *, context_keys: list | None = None) -> ADMTreesModel
      :classmethod:

      Build from an already-parsed model dict.

   .. py:method:: from_file(path: str | pathlib.Path, *, context_keys: list | None = None) -> ADMTreesModel
      :classmethod:

      Load a model from a local JSON file (Prediction Studio "save model" output).

   .. py:method:: from_url(url: str, *, timeout: float = 30.0, context_keys: list | None = None) -> ADMTreesModel
      :classmethod:

      Load a model from a URL pointing at the JSON export.

      ``timeout`` is the per-request timeout in seconds (default 30).

   .. py:method:: from_datamart_blob(blob: str | bytes, *, context_keys: list | None = None) -> ADMTreesModel
      :classmethod:

      Load from a base64-encoded, zlib-compressed datamart ``Modeldata`` blob.

   .. py:method:: _decode_trees()

   .. py:method:: _post_import_cleanup(decode: bool, *, context_keys: list | None = None)

   .. py:method:: _locate_boosters() -> list[dict]

      Find the boosters/trees list in the model JSON.

      Different Pega versions place the boosters at different paths; try them
      in order.

   .. py:method:: _safe_numeric_compare(left: float, operator: str, right: float) -> bool
      :staticmethod:

      Safely compare two numeric values without using ``eval()``.

   .. py:attribute:: _safe_eval_seen_errors
      :type: set[tuple[str, str]]

   .. py:attribute:: _safe_eval_lock
      :type: threading.Lock

   .. py:method:: _safe_condition_evaluate(value: Any, operator: str, comparison_set: set | float | str | frozenset) -> bool
      :staticmethod:

      Safely evaluate split conditions without using ``eval()``.

      Returns ``False`` on type-conversion errors after logging the first
      occurrence per (operator, error-type) pair at INFO level. Subsequent
      matching failures log at DEBUG only — we don't want per-row scoring to
      swamp the application logs, but the first failure for each error class
      is worth surfacing.

   .. py:property:: metrics
      :type: dict[str, Any]

      Compute CDH_ADM005-style diagnostic metrics for this model.

      Returns a flat dictionary of key/value pairs aligned with the
      CDH_ADM005 telemetry event specification. Metrics that cannot be
      computed from an exported model (e.g. saturation counts that require
      bin-level data) are omitted. See :meth:`metric_descriptions` for
      human-readable descriptions of every key.

   .. py:method:: metric_descriptions() -> dict[str, str]
      :staticmethod:

      Return a dictionary mapping metric names to human-readable descriptions.

   .. py:method:: _classify_predictor(name: str) -> str
      :staticmethod:

      Classify a predictor as 'ih', 'context_key', or 'other'.

   .. py:method:: _compute_metrics() -> dict[str, Any]

      Walk all trees once and assemble the metrics dictionary.

   .. py:attribute:: _ENCODER_PATHS
      :type: tuple[tuple[str, Ellipsis], Ellipsis]
      :value: (('model', 'inputsEncoder', 'encoders'), ('model', 'model', 'inputsEncoder', 'encoders'))

   .. py:method:: _get_encoder_info() -> dict[str, dict[str, Any]] | None

      Extract predictor metadata from the inputsEncoder if present.

      Returns ``None`` when no encoder metadata is available (e.g. for
      exported/decoded models).

   .. py:property:: predictors
      :type: dict[str, str] | None

   .. py:property:: tree_stats
      :type: polars.DataFrame

   .. py:property:: splits_per_tree
      :type: dict[int, list[str]]

   .. py:property:: gains_per_tree
      :type: dict[int, list[float]]

   .. py:property:: gains_per_split
      :type: polars.DataFrame

   .. py:property:: grouped_gains_per_split
      :type: polars.DataFrame

   .. py:property:: all_values_per_split
      :type: dict[str, set]

   .. py:property:: splits_per_variable_type
      :type: tuple[list[collections.Counter], list[float]]

      Per-tree counts of splits grouped by predictor category.

      Equivalent to calling :meth:`compute_categorization_over_time` with no
      arguments.

   .. py:method:: get_predictors() -> dict[str, str] | None

      Extract predictor names and types from model metadata.
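For intuition, the split-based fallback used when no explicit metadata is present can be sketched standalone. The ``infer_predictor_types`` helper and its ``'numeric'``/``'symbolic'`` labels below are illustrative assumptions, not the library's implementation:

```python
def infer_predictor_types(splits: list[str]) -> dict[str, str]:
    """Sketch: label each predictor from the operators seen in its splits."""
    types: dict[str, str] = {}
    for raw in splits:
        # Probe word operators with surrounding spaces first so "in"/"is"
        # inside identifiers (e.g. "Missing") never match.
        for op in (" in ", " is ", "==", "<", ">"):
            if op in raw:
                variable = raw.partition(op)[0].strip()
                kind = "numeric" if op.strip() in ("<", ">") else "symbolic"
                # First sighting wins; later splits on the same predictor
                # are assumed to agree on its type.
                types.setdefault(variable, kind)
                break
    return types
```
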
      Tries explicit metadata first (``configuration.predictors`` then
      ``predictors``); falls back to inferring from tree splits when neither
      is present.

   .. py:method:: _infer_predictors_from_splits() -> dict[str, str] | None

      Infer predictor names and types by walking all tree splits.

   .. py:property:: _splits_and_gains
      :type: tuple[dict[int, list[str]], dict[int, list[float]], polars.DataFrame]

      Compute (splits_per_tree, gains_per_tree, gains_per_split) once.

      Backs the public ``splits_per_tree`` / ``gains_per_tree`` /
      ``gains_per_split`` properties via a single tree-walk per tree.
      Implemented as a ``cached_property`` rather than ``@lru_cache`` because
      ``lru_cache`` holds a strong reference to ``self`` and would leak the
      entire ADMTreesModel instance for the lifetime of the cache.

      Zero-gain splits are kept (with ``gains == 0.0``) in ``gains_per_split``
      so the per-split DataFrame is always aligned with ``splits_per_tree``.
      ``gains_per_tree`` continues to keep only positive gains for backward
      compatibility.

   .. py:method:: get_grouped_gains_per_split() -> polars.DataFrame

      Gains per split, grouped by split string with helpful aggregates.

   .. py:method:: plot_splits_per_variable(subset: set | None = None, show: bool = True)

      Box plot of gains per split for each variable.

   .. py:method:: get_tree_stats() -> polars.DataFrame

      Generate a DataFrame with useful stats for each tree.

   .. py:method:: get_all_values_per_split() -> dict[str, set]

      All distinct split values seen for each predictor.

   .. py:method:: get_tree_representation(tree_number: int) -> dict[int, dict]

      Build a flat, node-id-keyed representation of one tree.

      Walks ``self.model[tree_number]`` in pre-order (left subtree before
      right) and returns a dict keyed by 1-based node id. Each entry has
      ``score``; internal nodes additionally carry ``split``, ``gain``,
      ``left_child`` and ``right_child``; non-root nodes carry
      ``parent_node``.
      This replaces an earlier implementation that mutated three accumulator
      parameters and relied on a final ``del`` to drop a spurious trailing
      entry.

   .. py:method:: plot_tree(tree_number: int, highlighted: dict | list | None = None, show: bool = True) -> pydot.Graph

      Plot the chosen decision tree.

   .. py:method:: get_visited_nodes(treeID: int, x: dict, save_all: bool = False) -> tuple[list, float, list]

      Trace the path through one tree for the given feature values.

   .. py:method:: get_all_visited_nodes(x: dict) -> polars.DataFrame

      Score every tree against ``x`` and return per-tree visit info.

   .. py:method:: score(x: dict) -> float

      Compute the (sigmoid-normalised) propensity score for ``x``.

      Calls :meth:`get_visited_nodes` per tree and sums the resulting leaf
      scores; avoids building the full per-tree DataFrame that
      :meth:`get_all_visited_nodes` would produce.

   .. py:method:: plot_contribution_per_tree(x: dict, show: bool = True)

      Plot the per-tree contribution toward the final propensity.

   .. py:method:: predictor_categorization(x: str, context_keys: list | None = None) -> str

      Default predictor categorisation function.

   .. py:method:: compute_categorization_over_time(predictor_categorization: collections.abc.Callable | None = None, context_keys: list | None = None) -> tuple[list[collections.Counter], list[float]]

      Per-tree categorisation counts plus per-tree absolute scores.

   .. py:method:: plot_splits_per_variable_type(predictor_categorization: collections.abc.Callable | None = None, **kwargs)

      Stacked-area chart of categorised split counts per tree.

.. py:class:: MultiTrees

   A collection of :class:`ADMTreesModel` snapshots indexed by timestamp.

   Construct via :meth:`from_datamart`.

   .. py:attribute:: trees
      :type: dict[str, pdstools.adm.trees._model.ADMTreesModel]

   .. py:attribute:: model_name
      :type: str | None
      :value: None

   .. py:attribute:: context_keys
      :type: list | None
      :value: None

   .. py:method:: __repr__() -> str

   .. py:method:: __getitem__(index: int | str) -> pdstools.adm.trees._model.ADMTreesModel

      Return the :class:`ADMTreesModel` at ``index``.

      Integer indices select by insertion order; string indices select by
      snapshot timestamp. Use :meth:`items` if you need both keys and values
      together.

   .. py:method:: __len__() -> int

   .. py:method:: items()

      Iterate ``(timestamp, model)`` pairs in insertion order.

   .. py:method:: values()

      Iterate :class:`ADMTreesModel` instances in insertion order.

   .. py:method:: keys()

      Iterate snapshot timestamps in insertion order.

   .. py:method:: __iter__()

   .. py:method:: __add__(other: MultiTrees | pdstools.adm.trees._model.ADMTreesModel) -> MultiTrees

   .. py:property:: first
      :type: pdstools.adm.trees._model.ADMTreesModel

   .. py:property:: last
      :type: pdstools.adm.trees._model.ADMTreesModel

   .. py:method:: from_datamart(df: polars.DataFrame, n_threads: int = 1, configuration: str | None = None) -> MultiTrees
      :classmethod:

      Decode every Modeldata blob in ``df`` for a single configuration.

      Returns one :class:`MultiTrees` containing one :class:`ADMTreesModel`
      per snapshot.

      :param df: Datamart slice. Must contain ``Modeldata``, ``SnapshotTime``
         and ``Configuration`` columns and cover exactly one Configuration.
         Use :meth:`from_datamart_grouped` if ``df`` spans multiple
         configurations.
      :type df: pl.DataFrame
      :param n_threads: Worker count for parallel base64+zlib decoding.
      :type n_threads: int
      :param configuration: Optional explicit Configuration name; required if
         ``df`` doesn't already contain a single Configuration.
      :type configuration: str | None

   .. py:method:: from_datamart_grouped(df: polars.DataFrame, n_threads: int = 1) -> dict[str, MultiTrees]
      :classmethod:

      Decode every Modeldata blob in ``df``, grouped by Configuration.

      Returns a mapping of configuration name to :class:`MultiTrees`. Use
      :meth:`from_datamart` instead when the input has only one
      configuration.

   .. py:method:: _decode_datamart_frame(df: polars.DataFrame, n_threads: int = 1) -> list[tuple[str, str, pdstools.adm.trees._model.ADMTreesModel]]
      :staticmethod:

      Decode every blob in ``df`` and return ``(config, timestamp, model)``
      rows.

   .. py:method:: compute_over_time(predictor_categorization: collections.abc.Callable | None = None) -> polars.DataFrame

      Return per-tree categorisation counts across snapshots, with a
      ``SnapshotTime`` column per row.

.. py:class:: Node

   A single node in an AGB tree.

   All nodes carry a ``score`` (the leaf prediction or root prior). Internal
   nodes additionally carry a parsed :class:`Split` and a ``gain``. Leaves
   have ``split=None`` and ``gain=0.0``.

   .. py:attribute:: depth
      :type: int

   .. py:attribute:: score
      :type: float

   .. py:attribute:: is_leaf
      :type: bool

   .. py:attribute:: split
      :type: Split | None

   .. py:attribute:: gain
      :type: float

.. py:class:: Split

   A parsed tree split condition.

   .. attribute:: variable

      Predictor name being split on.

      :type: str

   .. attribute:: operator

      Comparison operator: ``"<"`` and ``">"`` for numeric thresholds,
      ``"=="`` for single-category equality, ``"in"`` for set membership,
      ``"is"`` for missing-value checks.

      :type: SplitOperator

   .. attribute:: value

      Right-hand side of the split. ``float`` for numeric thresholds,
      ``tuple[str, ...]`` for ``in``-splits, ``str`` for ``==``/``is``.

      :type: float | str | tuple[str, ...]

   .. attribute:: raw

      Original split string, useful for diagnostics or display.

      :type: str

   .. py:attribute:: variable
      :type: str

   .. py:attribute:: operator
      :type: SplitOperator

   .. py:attribute:: value
      :type: float | str | tuple[str, Ellipsis]

   .. py:attribute:: raw
      :type: str

   .. py:property:: is_numeric
      :type: bool

   .. py:property:: is_symbolic
      :type: bool

.. py:data:: SplitOperator

.. py:function:: parse_split(raw: str) -> Split

   Parse a tree-split string into a :class:`Split`.

   .. rubric:: Examples

   >>> parse_split("Age < 42.5").operator
   '<'
   >>> sorted(parse_split("Color in { red, blue }").value)
   ['blue', 'red']
   >>> parse_split("Status is Missing").value
   'Missing'
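The grammar these examples exercise can be sketched as a standalone parser. This is an illustration only: the real :class:`Split` carries a ``SplitOperator``, while this sketch substitutes a plain ``NamedTuple`` with string operators:

```python
from typing import NamedTuple


class Split(NamedTuple):
    variable: str
    operator: str
    value: object
    raw: str


def parse_split(raw: str) -> Split:
    """Sketch of the split grammar from the examples; not the library code."""
    text = raw.strip()
    # Probe word operators with surrounding spaces first, so "in"/"is"
    # inside values such as "Missing" never match.
    for op in (" in ", " is ", "==", "<", ">"):
        if op in text:
            lhs, _, rhs = text.partition(op)
            break
    else:
        raise ValueError(f"unrecognised split: {raw!r}")
    operator, variable, rhs = op.strip(), lhs.strip(), rhs.strip()
    value: object
    if operator == "in":
        # "{ red, blue }" -> ("red", "blue")
        value = tuple(v.strip() for v in rhs.strip("{}").split(","))
    elif operator in ("<", ">"):
        value = float(rhs)
    else:  # "==" or "is"
        value = rhs
    return Split(variable, operator, value, raw)
```

Keeping ``raw`` alongside the parsed fields mirrors the documented dataclass, so diagnostics can always show the original split string.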