pdstools.adm.trees
==================

.. py:module:: pdstools.adm.trees

.. autoapi-nested-parse::

   ADM Gradient Boosting (AGB) model parsing, scoring, and diagnostics.

   This package provides:

   - :class:`Split` and :class:`Node` — small dataclasses describing a parsed
     split condition and a tree node.
   - :class:`ADMTreesModel` — load and analyse a single AGB model.
   - :class:`MultiTrees` — collection of snapshots of the same configuration
     over time.
   - :class:`AGB` — Datamart helper for discovering and extracting AGB models.

   Construction uses explicit factory classmethods (``ADMTreesModel.from_file``,
   ``from_url``, ``from_datamart_blob``, ``from_dict``, and
   ``MultiTrees.from_datamart``). The legacy ``ADMTrees(file, ...)`` polymorphic
   factory was removed in v5; see ``docs/migration-v4-to-v5.md``.

Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/pdstools/adm/trees/_agb/index
   /autoapi/pdstools/adm/trees/_model/index
   /autoapi/pdstools/adm/trees/_multi/index
   /autoapi/pdstools/adm/trees/_nodes/index

Attributes
----------

.. autoapisummary::

   pdstools.adm.trees.SplitOperator

Classes
-------

.. autoapisummary::

   pdstools.adm.trees.AGB
   pdstools.adm.trees.ADMTreesModel
   pdstools.adm.trees.MultiTrees
   pdstools.adm.trees.Node
   pdstools.adm.trees.Split

Functions
---------

.. autoapisummary::

   pdstools.adm.trees.parse_split

Package Contents
----------------

.. py:class:: AGB(datamart: pdstools.adm.ADMDatamart.ADMDatamart)

   Datamart helper for discovering and extracting AGB models.

   Reachable as ``ADMDatamart.agb``; not intended to be instantiated directly.

   .. py:attribute:: datamart

   .. py:method:: discover_model_types(df: polars.LazyFrame, by: str = 'Configuration') -> dict[str, str]

      Discover the type of model embedded in the ``Modeldata`` column.

      Groups by ``by`` (typically Configuration, since one model rule contains
      one model type) and decodes the first ``Modeldata`` blob per group to
      extract its ``_serialClass``.

      :param df: Datamart slice including ``Modeldata``.
         Collected internally.
      :type df: pl.LazyFrame
      :param by: Grouping column. ``Configuration`` is recommended.
      :type by: str

   .. py:method:: get_agb_models(last: bool = False, n_threads: int = 6, query: pdstools.utils.types.QUERY | None = None) -> dict[str, pdstools.adm.trees._multi.MultiTrees]

      Get all AGB models in the datamart, indexed by Configuration.

      Filters down to models whose ``_serialClass`` ends with ``GbModel`` and
      decodes them via :class:`MultiTrees`.

      :param last: If True, use only the latest snapshot per model.
      :type last: bool
      :param n_threads: Worker count for parallel blob decoding.
      :type n_threads: int
      :param query: Optional pre-filter applied before discovery.
      :type query: QUERY | None

.. py:class:: ADMTreesModel(trees: dict, model: list[dict], *, raw_input: Any = None, properties: dict[str, Any] | None = None, learning_rate: float | None = None, context_keys: list | None = None)

   Functions for ADM Gradient Boosting models.

   ADM Gradient Boosting models consist of multiple trees, which build upon
   each other in a 'boosting' fashion. This class provides functions to
   extract data from these trees: the features on which the trees split,
   important values for these features, statistics about the trees, and
   visualisations of each individual tree.

   Construct via :meth:`from_file`, :meth:`from_url`,
   :meth:`from_datamart_blob`, or :meth:`from_dict`.

   .. rubric:: Notes

   The "save model" action in Prediction Studio exports a JSON file that this
   class can load directly. The Datamart's ``pyModelData`` column also
   contains this information, but compressed and with encoded split values;
   the "save model" button decompresses and decodes that data.

   .. py:attribute:: trees
      :type: dict

      The full parsed model JSON.

   .. py:attribute:: model
      :type: list[dict]

      The list of boosted trees (each a nested dict).

   .. py:attribute:: raw_input
      :type: Any

      The raw input used to construct this instance (path, bytes, or dict).

   .. py:attribute:: learning_rate
      :type: float | None
      :value: None

   .. py:attribute:: context_keys
      :type: list | None
      :value: None

   .. py:attribute:: _properties
      :type: dict[str, Any]

   .. py:method:: from_dict(data: dict, *, context_keys: list | None = None) -> ADMTreesModel
      :classmethod:

      Build from an already-parsed model dict.

   .. py:method:: from_file(path: str | pathlib.Path, *, context_keys: list | None = None) -> ADMTreesModel
      :classmethod:

      Load a model from a local JSON file (Prediction Studio "save model" output).

   .. py:method:: from_url(url: str, *, timeout: float = 30.0, context_keys: list | None = None) -> ADMTreesModel
      :classmethod:

      Load a model from a URL pointing at the JSON export.

      ``timeout`` is the per-request timeout in seconds (default 30).

   .. py:method:: from_datamart_blob(blob: str | bytes, *, context_keys: list | None = None) -> ADMTreesModel
      :classmethod:

      Load from a base64-encoded, zlib-compressed datamart ``Modeldata`` blob.

   .. py:method:: _decode_trees()

   .. py:method:: _post_import_cleanup(decode: bool, *, context_keys: list | None = None)

   .. py:method:: _locate_boosters() -> list[dict]

      Find the boosters/trees list in the model JSON.

      Different Pega versions place the boosters at different paths; try them
      in order.

   .. py:method:: _safe_numeric_compare(left: float, operator: str, right: float) -> bool
      :staticmethod:

      Safely compare two numeric values without using ``eval()``.

   .. py:attribute:: _safe_eval_seen_errors
      :type: set[tuple[str, str]]

   .. py:attribute:: _safe_eval_lock
      :type: threading.Lock

   .. py:method:: _safe_condition_evaluate(value: Any, operator: str, comparison_set: set | float | str | frozenset) -> bool
      :staticmethod:

      Safely evaluate split conditions without using ``eval()``.

      Returns ``False`` on type-conversion errors after logging the first
      occurrence per (operator, error-type) pair at INFO level. Subsequent
      matching failures log at DEBUG only — we don't want per-row scoring to
      swamp the application logs, but the first failure for each error class
      is worth surfacing.

   .. py:property:: metrics
      :type: dict[str, Any]

      Compute CDH_ADM005-style diagnostic metrics for this model.

      Returns a flat dictionary of key/value pairs aligned with the
      CDH_ADM005 telemetry event specification. Metrics that cannot be
      computed from an exported model (e.g. saturation counts that require
      bin-level data) are omitted. See :meth:`metric_descriptions` for
      human-readable descriptions of every key.

   .. py:method:: metric_descriptions() -> dict[str, str]
      :staticmethod:

      Return a dictionary mapping metric names to human-readable descriptions.

   .. py:method:: _classify_predictor(name: str) -> str
      :staticmethod:

      Classify a predictor as 'ih', 'context_key', or 'other'.

   .. py:method:: _compute_metrics() -> dict[str, Any]

      Walk all trees once and assemble the metrics dictionary.

   .. py:attribute:: _ENCODER_PATHS
      :type: tuple[tuple[str, Ellipsis], Ellipsis]
      :value: (('model', 'inputsEncoder', 'encoders'), ('model', 'model', 'inputsEncoder', 'encoders'))

   .. py:method:: _get_encoder_info() -> dict[str, dict[str, Any]] | None

      Extract predictor metadata from the inputsEncoder if present.

      Returns ``None`` when no encoder metadata is available (e.g. for
      exported/decoded models).

   .. py:property:: predictors
      :type: dict[str, str] | None

   .. py:property:: tree_stats
      :type: polars.DataFrame

   .. py:property:: splits_per_tree
      :type: dict[int, list[str]]

   .. py:property:: gains_per_tree
      :type: dict[int, list[float]]

   .. py:property:: gains_per_split
      :type: polars.DataFrame

   .. py:property:: grouped_gains_per_split
      :type: polars.DataFrame

   .. py:property:: all_values_per_split
      :type: dict[str, set]

   .. py:property:: splits_per_variable_type
      :type: tuple[list[collections.Counter], list[float]]

      Per-tree counts of splits grouped by predictor category.

      Equivalent to calling :meth:`compute_categorization_over_time` with no
      arguments.

   .. py:method:: get_predictors() -> dict[str, str] | None

      Extract predictor names and types from model metadata.
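For intuition, the split-based fallback used when no explicit metadata is present can be sketched standalone. The ``infer_predictor_types`` helper and its ``'numeric'``/``'symbolic'`` labels below are illustrative assumptions, not the library's implementation:

```python
def infer_predictor_types(splits: list[str]) -> dict[str, str]:
    """Sketch: label each predictor from the operators seen in its splits."""
    types: dict[str, str] = {}
    for raw in splits:
        # Probe word operators with surrounding spaces first so "in"/"is"
        # inside identifiers (e.g. "Missing") never match.
        for op in (" in ", " is ", "==", "<", ">"):
            if op in raw:
                variable = raw.partition(op)[0].strip()
                kind = "numeric" if op.strip() in ("<", ">") else "symbolic"
                # First sighting wins; later splits on the same predictor
                # are assumed to agree on its type.
                types.setdefault(variable, kind)
                break
    return types
```
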
      Tries explicit metadata first (``configuration.predictors`` then
      ``predictors``); falls back to inferring from tree splits when neither
      is present.

   .. py:method:: _infer_predictors_from_splits() -> dict[str, str] | None

      Infer predictor names and types by walking all tree splits.

   .. py:property:: _splits_and_gains
      :type: tuple[dict[int, list[str]], dict[int, list[float]], polars.DataFrame]

      Compute (splits_per_tree, gains_per_tree, gains_per_split) once.

      Backs the public ``splits_per_tree`` / ``gains_per_tree`` /
      ``gains_per_split`` properties via a single tree-walk per tree.
      Implemented as a ``cached_property`` rather than ``@lru_cache`` because
      ``lru_cache`` holds a strong reference to ``self`` and would leak the
      entire ADMTreesModel instance for the lifetime of the cache.

      Zero-gain splits are kept (with ``gains == 0.0``) in ``gains_per_split``
      so the per-split DataFrame is always aligned with ``splits_per_tree``.
      ``gains_per_tree`` continues to keep only positive gains for backward
      compatibility.

   .. py:method:: get_grouped_gains_per_split() -> polars.DataFrame

      Gains per split, grouped by split string with helpful aggregates.

   .. py:method:: plot_splits_per_variable(subset: set | None = None, show: bool = True)

      Box plot of gains per split for each variable.

   .. py:method:: get_tree_stats() -> polars.DataFrame

      Generate a DataFrame with useful stats for each tree.

   .. py:method:: get_all_values_per_split() -> dict[str, set]

      All distinct split values seen for each predictor.

   .. py:method:: get_tree_representation(tree_number: int) -> dict[int, dict]

      Build a flat, node-id-keyed representation of one tree.

      Walks ``self.model[tree_number]`` in pre-order (left subtree before
      right) and returns a dict keyed by 1-based node id. Each entry has
      ``score``; internal nodes additionally carry ``split``, ``gain``,
      ``left_child`` and ``right_child``; non-root nodes carry
      ``parent_node``.
      This replaces an earlier implementation that mutated three accumulator
      parameters and relied on a final ``del`` to drop a spurious trailing
      entry.

   .. py:method:: plot_tree(tree_number: int, highlighted: dict | list | None = None, show: bool = True) -> pydot.Graph

      Plot the chosen decision tree.

   .. py:method:: get_visited_nodes(treeID: int, x: dict, save_all: bool = False) -> tuple[list, float, list]

      Trace the path through one tree for the given feature values.

   .. py:method:: get_all_visited_nodes(x: dict) -> polars.DataFrame

      Score every tree against ``x`` and return per-tree visit info.

   .. py:method:: score(x: dict) -> float

      Compute the (sigmoid-normalised) propensity score for ``x``.

      Calls :meth:`get_visited_nodes` per tree and sums the resulting leaf
      scores; avoids building the full per-tree DataFrame that
      :meth:`get_all_visited_nodes` would produce.

   .. py:method:: plot_contribution_per_tree(x: dict, show: bool = True)

      Plot the per-tree contribution toward the final propensity.

   .. py:method:: predictor_categorization(x: str, context_keys: list | None = None) -> str

      Default predictor categorisation function.

   .. py:method:: compute_categorization_over_time(predictor_categorization: collections.abc.Callable | None = None, context_keys: list | None = None) -> tuple[list[collections.Counter], list[float]]

      Per-tree categorisation counts plus per-tree absolute scores.

   .. py:method:: plot_splits_per_variable_type(predictor_categorization: collections.abc.Callable | None = None, **kwargs)

      Stacked-area chart of categorised split counts per tree.

.. py:class:: MultiTrees

   A collection of :class:`ADMTreesModel` snapshots indexed by timestamp.

   Construct via :meth:`from_datamart`.

   .. py:attribute:: trees
      :type: dict[str, pdstools.adm.trees._model.ADMTreesModel]

   .. py:attribute:: model_name
      :type: str | None
      :value: None

   .. py:attribute:: context_keys
      :type: list | None
      :value: None

   .. py:method:: __repr__() -> str

   .. py:method:: __getitem__(index: int | str) -> pdstools.adm.trees._model.ADMTreesModel

      Return the :class:`ADMTreesModel` at ``index``.

      Integer indices select by insertion order; string indices select by
      snapshot timestamp. Use :meth:`items` if you need both keys and values
      together.

   .. py:method:: __len__() -> int

   .. py:method:: items()

      Iterate ``(timestamp, model)`` pairs in insertion order.

   .. py:method:: values()

      Iterate :class:`ADMTreesModel` instances in insertion order.

   .. py:method:: keys()

      Iterate snapshot timestamps in insertion order.

   .. py:method:: __iter__()

   .. py:method:: __add__(other: MultiTrees | pdstools.adm.trees._model.ADMTreesModel) -> MultiTrees

   .. py:property:: first
      :type: pdstools.adm.trees._model.ADMTreesModel

   .. py:property:: last
      :type: pdstools.adm.trees._model.ADMTreesModel

   .. py:method:: from_datamart(df: polars.DataFrame, n_threads: int = 1, configuration: str | None = None) -> MultiTrees
      :classmethod:

      Decode every Modeldata blob in ``df`` for a single configuration.

      Returns one :class:`MultiTrees` containing one :class:`ADMTreesModel`
      per snapshot.

      :param df: Datamart slice. Must contain ``Modeldata``, ``SnapshotTime``
         and ``Configuration`` columns and cover exactly one Configuration.
         Use :meth:`from_datamart_grouped` if ``df`` spans multiple
         configurations.
      :type df: pl.DataFrame
      :param n_threads: Worker count for parallel base64+zlib decoding.
      :type n_threads: int
      :param configuration: Optional explicit Configuration name; required if
         ``df`` doesn't already contain a single Configuration.
      :type configuration: str | None

   .. py:method:: from_datamart_grouped(df: polars.DataFrame, n_threads: int = 1) -> dict[str, MultiTrees]
      :classmethod:

      Decode every Modeldata blob in ``df``, grouped by Configuration.

      Returns a mapping of configuration name to :class:`MultiTrees`. Use
      :meth:`from_datamart` instead when the input has only one
      configuration.

   .. py:method:: _decode_datamart_frame(df: polars.DataFrame, n_threads: int = 1) -> list[tuple[str, str, pdstools.adm.trees._model.ADMTreesModel]]
      :staticmethod:

      Decode every blob in ``df`` and return ``(config, timestamp, model)``
      rows.

   .. py:method:: compute_over_time(predictor_categorization: collections.abc.Callable | None = None) -> polars.DataFrame

      Return per-tree categorisation counts across snapshots, with a
      ``SnapshotTime`` column per row.

.. py:class:: Node

   A single node in an AGB tree.

   All nodes carry a ``score`` (the leaf prediction or root prior). Internal
   nodes additionally carry a parsed :class:`Split` and a ``gain``. Leaves
   have ``split=None`` and ``gain=0.0``.

   .. py:attribute:: depth
      :type: int

   .. py:attribute:: score
      :type: float

   .. py:attribute:: is_leaf
      :type: bool

   .. py:attribute:: split
      :type: Split | None

   .. py:attribute:: gain
      :type: float

.. py:class:: Split

   A parsed tree split condition.

   .. attribute:: variable

      Predictor name being split on.

      :type: str

   .. attribute:: operator

      Comparison operator: ``"<"`` and ``">"`` for numeric thresholds,
      ``"=="`` for single-category equality, ``"in"`` for set membership,
      ``"is"`` for missing-value checks.

      :type: SplitOperator

   .. attribute:: value

      Right-hand side of the split. ``float`` for numeric thresholds,
      ``tuple[str, ...]`` for ``in``-splits, ``str`` for ``==``/``is``.

      :type: float | str | tuple[str, ...]

   .. attribute:: raw

      Original split string, useful for diagnostics or display.

      :type: str

   .. py:attribute:: variable
      :type: str

   .. py:attribute:: operator
      :type: SplitOperator

   .. py:attribute:: value
      :type: float | str | tuple[str, Ellipsis]

   .. py:attribute:: raw
      :type: str

   .. py:property:: is_numeric
      :type: bool

   .. py:property:: is_symbolic
      :type: bool

.. py:data:: SplitOperator

.. py:function:: parse_split(raw: str) -> Split

   Parse a tree-split string into a :class:`Split`.

   .. rubric:: Examples

   >>> parse_split("Age < 42.5").operator
   '<'
   >>> sorted(parse_split("Color in { red, blue }").value)
   ['blue', 'red']
   >>> parse_split("Status is Missing").value
   'Missing'
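The grammar these examples exercise can be sketched as a standalone parser. This is an illustration only: the real :class:`Split` carries a ``SplitOperator``, while this sketch substitutes a plain ``NamedTuple`` with string operators:

```python
from typing import NamedTuple


class Split(NamedTuple):
    variable: str
    operator: str
    value: object
    raw: str


def parse_split(raw: str) -> Split:
    """Sketch of the split grammar from the examples; not the library code."""
    text = raw.strip()
    # Probe word operators with surrounding spaces first, so "in"/"is"
    # inside values such as "Missing" never match.
    for op in (" in ", " is ", "==", "<", ">"):
        if op in text:
            lhs, _, rhs = text.partition(op)
            break
    else:
        raise ValueError(f"unrecognised split: {raw!r}")
    operator, variable, rhs = op.strip(), lhs.strip(), rhs.strip()
    value: object
    if operator == "in":
        # "{ red, blue }" -> ("red", "blue")
        value = tuple(v.strip() for v in rhs.strip("{}").split(","))
    elif operator in ("<", ">"):
        value = float(rhs)
    else:  # "==" or "is"
        value = rhs
    return Split(variable, operator, value, raw)
```

Keeping ``raw`` alongside the parsed fields mirrors the documented dataclass, so diagnostics can always show the original split string.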