pdstools.adm.trees

ADM Gradient Boosting (AGB) model parsing, scoring, and diagnostics.

This package provides:

  • Split and Node — small dataclasses describing a parsed split condition and a tree node.

  • ADMTreesModel — load and analyse a single AGB model.

  • MultiTrees — collection of snapshots of the same configuration over time.

  • AGB — Datamart helper for discovering and extracting AGB models.

Construction uses explicit factory classmethods (ADMTreesModel.from_file, from_url, from_datamart_blob, from_dict, and MultiTrees.from_datamart). The legacy ADMTrees(file, ...) polymorphic factory was removed in v5; see docs/migration-v4-to-v5.md.

Classes

AGB

Datamart helper for discovering and extracting AGB models.

ADMTreesModel

Load and analyse a single AGB model.

MultiTrees

A collection of ADMTreesModel snapshots indexed by timestamp.

Node

A single node in an AGB tree.

Split

A parsed tree split condition.

Functions

parse_split(raw) → Split

Parse a tree-split string into a Split.

Package Contents

class AGB(datamart: pdstools.adm.ADMDatamart.ADMDatamart)

Datamart helper for discovering and extracting AGB models.

Reachable as ADMDatamart.agb; not intended to be instantiated directly.

Parameters:

datamart (pdstools.adm.ADMDatamart.ADMDatamart)

datamart
discover_model_types(df: polars.LazyFrame, by: str = 'Configuration') → dict[str, str]

Discover the type of model embedded in the Modeldata column.

Groups by the by column (typically Configuration, since one model rule contains one model type) and decodes the first Modeldata blob per group to extract its _serialClass.

Parameters:
  • df (pl.LazyFrame) – Datamart slice including Modeldata. Collected internally.

  • by (str) – Grouping column. Configuration is recommended.

Return type:

dict[str, str]

get_agb_models(last: bool = False, n_threads: int = 6, query: pdstools.utils.types.QUERY | None = None) → dict[str, pdstools.adm.trees._multi.MultiTrees]

Get all AGB models in the datamart, indexed by Configuration.

Filters down to models whose _serialClass ends with GbModel and decodes them via MultiTrees.

Parameters:
  • last (bool) – If True, use only the latest snapshot per model.

  • n_threads (int) – Worker count for parallel blob decoding.

  • query (QUERY | None) – Optional pre-filter applied before discovery.

Return type:

dict[str, pdstools.adm.trees._multi.MultiTrees]

class ADMTreesModel(trees: dict, model: list[dict], *, raw_input: Any = None, properties: dict[str, Any] | None = None, learning_rate: float | None = None, context_keys: list | None = None)

Load and analyse a single AGB model.

ADM Gradient Boosting models consist of multiple trees, which build upon each other in a ‘boosting’ fashion. This class provides functions to extract data from these trees — the features on which the trees split, important values for those features, and statistics about the trees — as well as to visualise each individual tree.

Construct via from_file(), from_url(), from_datamart_blob(), or from_dict().

Notes

The “save model” action in Prediction Studio exports a JSON file that this class can load directly. The Datamart’s pyModelData column also contains this information, but compressed and with encoded split values; the “save model” button decompresses and decodes that data.

Parameters:
  • trees (dict)

  • model (list[dict])

  • raw_input (Any)

  • properties (dict[str, Any] | None)

  • learning_rate (float | None)

  • context_keys (list | None)

trees: dict

The full parsed model JSON.

model: list[dict]

The list of boosted trees (each a nested dict).

raw_input: Any

The raw input used to construct this instance (path, bytes, or dict).

learning_rate: float | None = None
context_keys: list | None = None
_properties: dict[str, Any]
classmethod from_dict(data: dict, *, context_keys: list | None = None) → ADMTreesModel

Build from an already-parsed model dict.

Parameters:
  • data (dict)

  • context_keys (list | None)

Return type:

ADMTreesModel

classmethod from_file(path: str | pathlib.Path, *, context_keys: list | None = None) → ADMTreesModel

Load a model from a local JSON file (Prediction Studio “save model” output).

Parameters:
Return type:

ADMTreesModel

classmethod from_url(url: str, *, timeout: float = 30.0, context_keys: list | None = None) → ADMTreesModel

Load a model from a URL pointing at the JSON export.

timeout is the per-request timeout in seconds (default 30).

Parameters:
Return type:

ADMTreesModel

classmethod from_datamart_blob(blob: str | bytes, *, context_keys: list | None = None) → ADMTreesModel

Load from a base64-encoded zlib-compressed datamart Modeldata blob.

Parameters:
Return type:

ADMTreesModel

_decode_trees()
_post_import_cleanup(decode: bool, *, context_keys: list | None = None)
Parameters:
  • decode (bool)

  • context_keys (list | None)

_locate_boosters() → list[dict]

Find the boosters/trees list in the model JSON.

Different Pega versions place the boosters at different paths; try them in order.

Return type:

list[dict]

static _safe_numeric_compare(left: float, operator: str, right: float) → bool

Safely compare two numeric values without using eval().

Parameters:
Return type:

bool
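An eval()-free numeric comparison can be sketched with the operator module; the exact operator set supported by the real method is an assumption here:

```python
import operator

# Map operator strings to functions instead of eval()-ing expressions.
_NUMERIC_OPS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def safe_numeric_compare(left: float, op: str, right: float) -> bool:
    try:
        return _NUMERIC_OPS[op](left, right)
    except KeyError:
        raise ValueError(f"unsupported operator: {op!r}")
```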

_safe_eval_seen_errors: set[tuple[str, str]]
_safe_eval_lock: threading.Lock
static _safe_condition_evaluate(value: Any, operator: str, comparison_set: set | float | str | frozenset) → bool

Safely evaluate split conditions without using eval().

Returns False on type-conversion errors after logging the first occurrence per (operator, error-type) pair at INFO level. Subsequent matching failures log at DEBUG only — we don’t want per-row scoring to swamp the application logs, but the first failure for each error class is worth surfacing.

Parameters:
Return type:

bool
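The documented behaviour — no eval(), fail closed on conversion errors, log the first failure per (operator, error-type) pair at INFO and the rest at DEBUG — can be sketched as follows. The operator handling here is a simplified assumption:

```python
import logging
import threading

logger = logging.getLogger(__name__)
_seen_errors: set[tuple[str, str]] = set()
_lock = threading.Lock()


def safe_condition_evaluate(value, op, comparison) -> bool:
    try:
        if op == "in":
            return value in comparison
        if op in ("is", "=="):  # equality / missing-value checks
            return str(value) == str(comparison)
        if op == "<":
            return float(value) < float(comparison)
        if op == ">":
            return float(value) > float(comparison)
        return False  # unknown operator: fail closed
    except (TypeError, ValueError) as exc:
        key = (op, type(exc).__name__)
        with _lock:
            first = key not in _seen_errors
            _seen_errors.add(key)
        # First failure of each error class is surfaced; later ones
        # drop to DEBUG so per-row scoring can't swamp the logs.
        logger.log(logging.INFO if first else logging.DEBUG,
                   "condition evaluation failed: %s", exc)
        return False
```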

property metrics: dict[str, Any]

Compute CDH_ADM005-style diagnostic metrics for this model.

Returns a flat dictionary of key/value pairs aligned with the CDH_ADM005 telemetry event specification. Metrics that cannot be computed from an exported model (e.g. saturation counts that require bin-level data) are omitted.

See metric_descriptions() for human-readable descriptions of every key.

Return type:

dict[str, Any]

static metric_descriptions() → dict[str, str]

Return a dictionary mapping metric names to human-readable descriptions.

Return type:

dict[str, str]

static _classify_predictor(name: str) → str

Classify a predictor as ‘ih’, ‘context_key’, or ‘other’.

Parameters:

name (str)

Return type:

str

_compute_metrics() → dict[str, Any]

Walk all trees once and assemble the metrics dictionary.

Return type:

dict[str, Any]

_ENCODER_PATHS: tuple[tuple[str, ...], ...] = (('model', 'inputsEncoder', 'encoders'), ('model', 'model', 'inputsEncoder', 'encoders'))
_get_encoder_info() → dict[str, dict[str, Any]] | None

Extract predictor metadata from the inputsEncoder if present.

Returns None when no encoder metadata is available (e.g. for exported/decoded models).

Return type:

dict[str, dict[str, Any]] | None

property predictors: dict[str, str] | None
Return type:

dict[str, str] | None

property tree_stats: polars.DataFrame
Return type:

polars.DataFrame

property splits_per_tree: dict[int, list[str]]
Return type:

dict[int, list[str]]

property gains_per_tree: dict[int, list[float]]
Return type:

dict[int, list[float]]

property gains_per_split: polars.DataFrame
Return type:

polars.DataFrame

property grouped_gains_per_split: polars.DataFrame
Return type:

polars.DataFrame

property all_values_per_split: dict[str, set]
Return type:

dict[str, set]

property splits_per_variable_type: tuple[list[collections.Counter], list[float]]

Per-tree counts of splits grouped by predictor category.

Equivalent to calling compute_categorization_over_time() with no arguments.

Return type:

tuple[list[collections.Counter], list[float]]

get_predictors() → dict[str, str] | None

Extract predictor names and types from model metadata.

Tries explicit metadata first (configuration.predictors then predictors); falls back to inferring from tree splits when neither is present.

Return type:

dict[str, str] | None

_infer_predictors_from_splits() → dict[str, str] | None

Infer predictor names + types by walking all tree splits.

Return type:

dict[str, str] | None

property _splits_and_gains: tuple[dict[int, list[str]], dict[int, list[float]], polars.DataFrame]

Compute (splits_per_tree, gains_per_tree, gains_per_split) once.

Backs the public splits_per_tree / gains_per_tree / gains_per_split properties via a single tree-walk per tree. Implemented as a cached_property rather than @lru_cache because lru_cache holds a strong reference to self and would leak the entire ADMTreesModel instance for the lifetime of the cache.

Zero-gain splits are kept (with gains == 0.0) in gains_per_split so the per-split DataFrame is always aligned with splits_per_tree. gains_per_tree continues to keep only positive gains for backward compatibility.

Return type:

tuple[dict[int, list[str]], dict[int, list[float]], polars.DataFrame]
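The cached_property rationale given above can be illustrated in isolation. The field names below are simplified stand-ins; the point is that cached_property stores its result on the instance, so the cache dies with the instance, whereas @lru_cache on a method would pin self in a module-level cache:

```python
from functools import cached_property


class Walker:
    def __init__(self, trees):
        self.trees = trees
        self.walks = 0  # counts how often the expensive walk runs

    @cached_property
    def _splits_and_gains(self):
        # One walk over all trees backs several public properties.
        self.walks += 1
        return ([t["split"] for t in self.trees],
                [t["gain"] for t in self.trees])

    @property
    def splits(self):
        return self._splits_and_gains[0]

    @property
    def gains(self):
        return self._splits_and_gains[1]
```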

get_grouped_gains_per_split() → polars.DataFrame

Gains per split, grouped by split string with helpful aggregates.

Return type:

polars.DataFrame

plot_splits_per_variable(subset: set | None = None, show: bool = True)

Box-plot of gains per split for each variable.

Parameters:
get_tree_stats() → polars.DataFrame

Generate a dataframe with useful stats for each tree.

Return type:

polars.DataFrame

get_all_values_per_split() → dict[str, set]

All distinct split values seen for each predictor.

Return type:

dict[str, set]

get_tree_representation(tree_number: int) → dict[int, dict]

Build a flat node-id-keyed representation of one tree.

Walks self.model[tree_number] in pre-order (left subtree before right) and returns a dict keyed by 1-based node id.

Each entry has score; internal nodes additionally carry split, gain, left_child and right_child; non-root nodes carry parent_node.

This replaces an earlier implementation that mutated three accumulator parameters and relied on a final del to drop a spurious trailing entry.

Parameters:

tree_number (int)

Return type:

dict[int, dict]

plot_tree(tree_number: int, highlighted: dict | list | None = None, show: bool = True) → pydot.Graph

Plot the chosen decision tree.

Parameters:
Return type:

pydot.Graph

get_visited_nodes(treeID: int, x: dict, save_all: bool = False) → tuple[list, float, list]

Trace the path through one tree for the given feature values.

Parameters:
Return type:

tuple[list, float, list]

get_all_visited_nodes(x: dict) → polars.DataFrame

Score every tree against x and return per-tree visit info.

Parameters:

x (dict)

Return type:

polars.DataFrame

score(x: dict) → float

Compute the (sigmoid-normalised) propensity score for x.

Calls get_visited_nodes() per tree and sums the resulting leaf scores; avoids building the full per-tree DataFrame that get_all_visited_nodes() would produce.

Parameters:

x (dict)

Return type:

float
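The documented scoring rule — sum the per-tree leaf scores, then squash the total through a sigmoid to get a propensity in (0, 1) — can be sketched as:

```python
import math


def propensity(leaf_scores: list[float]) -> float:
    # The boosted margin is the sum of leaf contributions across trees;
    # the sigmoid maps it to a propensity between 0 and 1.
    margin = sum(leaf_scores)
    return 1.0 / (1.0 + math.exp(-margin))
```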

plot_contribution_per_tree(x: dict, show: bool = True)

Plot the per-tree contribution toward the final propensity.

Parameters:
predictor_categorization(x: str, context_keys: list | None = None) → str

Default predictor categorisation function.

Parameters:
  • x (str)

  • context_keys (list | None)

Return type:

str

compute_categorization_over_time(predictor_categorization: collections.abc.Callable | None = None, context_keys: list | None = None) → tuple[list[collections.Counter], list[float]]

Per-tree categorisation counts plus per-tree absolute scores.

Parameters:
Return type:

tuple[list[collections.Counter], list[float]]

plot_splits_per_variable_type(predictor_categorization: collections.abc.Callable | None = None, **kwargs)

Stacked-area chart of categorised split counts per tree.

Parameters:

predictor_categorization (collections.abc.Callable | None)

class MultiTrees

A collection of ADMTreesModel snapshots indexed by timestamp.

Construct via from_datamart().

trees: dict[str, pdstools.adm.trees._model.ADMTreesModel]
model_name: str | None = None
context_keys: list | None = None
__repr__() → str
Return type:

str

__getitem__(index: int | str) → pdstools.adm.trees._model.ADMTreesModel

Return the ADMTreesModel at index.

Integer indices select by insertion order; string indices select by snapshot timestamp. Use items() if you need both keys and values together.

Parameters:

index (int | str)

Return type:

pdstools.adm.trees._model.ADMTreesModel
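The dual int/str indexing can be sketched on a plain dict, relying on insertion order; this is an illustration of the documented behaviour, not MultiTrees itself:

```python
class Snapshots:
    def __init__(self, trees: dict[str, object]):
        # Python dicts preserve insertion order, so integer indexing
        # can be layered on top of timestamp-keyed lookup.
        self.trees = trees

    def __getitem__(self, index):
        if isinstance(index, int):
            return list(self.trees.values())[index]
        return self.trees[index]  # string: lookup by snapshot timestamp

    def __len__(self) -> int:
        return len(self.trees)
```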

__len__() → int
Return type:

int

items()

Iterate (timestamp, model) pairs in insertion order.

values()

Iterate ADMTreesModel instances in insertion order.

keys()

Iterate snapshot timestamps in insertion order.

__iter__()
__add__(other: MultiTrees | pdstools.adm.trees._model.ADMTreesModel) → MultiTrees
Parameters:

other (MultiTrees | pdstools.adm.trees._model.ADMTreesModel)

Return type:

MultiTrees

property first: pdstools.adm.trees._model.ADMTreesModel
Return type:

pdstools.adm.trees._model.ADMTreesModel

property last: pdstools.adm.trees._model.ADMTreesModel
Return type:

pdstools.adm.trees._model.ADMTreesModel

classmethod from_datamart(df: polars.DataFrame, n_threads: int = 1, configuration: str | None = None) → MultiTrees

Decode every Modeldata blob in df for a single configuration.

Returns one MultiTrees containing one ADMTreesModel per snapshot.

Parameters:
  • df (pl.DataFrame) – Datamart slice. Must contain Modeldata, SnapshotTime and Configuration columns and cover exactly one Configuration. Use from_datamart_grouped() if df spans multiple configurations.

  • n_threads (int) – Worker count for parallel base64+zlib decoding.

  • configuration (str | None) – Optional explicit Configuration name; required if df doesn’t already contain a single Configuration.

Return type:

MultiTrees
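The parallel decoding step can be sketched with a thread pool; this shows the documented base64 + zlib fan-out across n_threads workers, with the ADMTreesModel construction and SnapshotTime keying omitted:

```python
import base64
import json
import zlib
from concurrent.futures import ThreadPoolExecutor


def decode_blobs(blobs: list, n_threads: int = 1) -> list[dict]:
    def decode(blob) -> dict:
        # Each blob is base64-encoded, zlib-compressed JSON.
        return json.loads(zlib.decompress(base64.b64decode(blob)))

    # base64/zlib release the GIL, so threads give real parallelism here.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(decode, blobs))
```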

classmethod from_datamart_grouped(df: polars.DataFrame, n_threads: int = 1) → dict[str, MultiTrees]

Decode every Modeldata blob in df, grouped by Configuration.

Returns a mapping of configuration name to MultiTrees. Use from_datamart() instead when the input has only one configuration.

Parameters:
  • df (polars.DataFrame)

  • n_threads (int)

Return type:

dict[str, MultiTrees]

static _decode_datamart_frame(df: polars.DataFrame, n_threads: int = 1) → list[tuple[str, str, pdstools.adm.trees._model.ADMTreesModel]]

Decode every blob in df and return (config, timestamp, model) rows.

Parameters:
  • df (polars.DataFrame)

  • n_threads (int)

Return type:

list[tuple[str, str, pdstools.adm.trees._model.ADMTreesModel]]

compute_over_time(predictor_categorization: collections.abc.Callable | None = None) → polars.DataFrame

Return per-tree categorisation counts across snapshots, with a SnapshotTime column per row.

Parameters:

predictor_categorization (collections.abc.Callable | None)

Return type:

polars.DataFrame

class Node

A single node in an AGB tree.

All nodes carry a score (the leaf prediction or root prior). Internal nodes additionally carry a parsed Split and a gain. Leaves have split=None and gain=0.0.

depth: int
score: float
is_leaf: bool
split: Split | None
gain: float
class Split

A parsed tree split condition.

variable

Predictor name being split on.

Type:

str

operator

Comparison operator: "<" and ">" for numeric thresholds, "==" for single-category equality, "in" for set membership, "is" for missing-value checks.

Type:

SplitOperator

value

Right-hand side of the split. float for numeric thresholds, tuple[str, ...] for in-splits, str for ==/is.

Type:

float | str | tuple[str, …]

raw

Original split string, useful for diagnostics or display.

Type:

str

variable: str
operator: SplitOperator
value: float | str | tuple[str, ...]
raw: str
property is_numeric: bool
Return type:

bool

property is_symbolic: bool
Return type:

bool

SplitOperator
parse_split(raw: str) → Split

Parse a tree-split string into a Split.

Examples

>>> parse_split("Age < 42.5").operator
'<'
>>> sorted(parse_split("Color in { red, blue }").value)
['blue', 'red']
>>> parse_split("Status is Missing").value
'Missing'
Parameters:

raw (str)

Return type:

Split
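A simplified parser matching the doctest examples above might look like this; the real parse_split handles more operator and quoting variants, and this sketch's regex and fallback tokenisation are illustrative assumptions:

```python
import re
from dataclasses import dataclass


@dataclass
class SplitSketch:
    variable: str
    operator: str
    value: object
    raw: str


def parse_split_sketch(raw: str) -> SplitSketch:
    # Set membership: "Color in { red, blue }"
    m = re.match(r"(\S+)\s+in\s*\{(.*)\}", raw)
    if m:
        values = tuple(v.strip() for v in m.group(2).split(","))
        return SplitSketch(m.group(1), "in", values, raw)
    # Otherwise split into variable, operator, right-hand side;
    # numeric thresholds become floats, everything else stays a string.
    var, op, val = raw.split(maxsplit=2)
    try:
        value: object = float(val)
    except ValueError:
        value = val
    return SplitSketch(var, op, value, raw)
```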