pdstools.adm.ADMTrees

Classes

  • AGB

  • ADMTrees

  • ADMTreesModel – Functions for ADM Gradient boosting

  • MultiTrees

Module Contents

class AGB(datamart: pdstools.adm.ADMDatamart.ADMDatamart)
Parameters:

datamart (pdstools.adm.ADMDatamart.ADMDatamart)

datamart
discover_model_types(df: polars.LazyFrame, by: str = 'Configuration') Dict

Discovers the type of model embedded in the pyModelData column.

By default, the data is grouped by Configuration, because a model rule can only contain one type of model. For each configuration, the method then looks into the pyModelData blob, finds the _serialClass, and returns it in a dict.

Parameters:
  • df (pl.LazyFrame) – The dataframe to search for model types

  • by (str) – The column to look for types in. Configuration is recommended.

  • allow_collect (bool, default = False) – Set to True to allow discovering model types even when using a lazy strategy; one modelData string is fetched per configuration.

Return type:

Dict
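A minimal usage sketch (not taken from the library documentation): it assumes `dm` is an ADMDatamart loaded elsewhere and that its model snapshots are available as a LazyFrame attribute named `model_data`, which is an assumption.

    from pdstools.adm.ADMTrees import AGB

    # Assumptions: `dm` is an ADMDatamart loaded elsewhere, and `dm.model_data`
    # is the pl.LazyFrame holding the model snapshots (attribute name assumed).
    agb = AGB(dm)
    model_types = agb.discover_model_types(dm.model_data, by="Configuration")
    print(model_types)  # e.g. {"WebClickthrough": "GbModel"} -- illustrative only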

get_agb_models(last: bool = False, by: str = 'Configuration', n_threads: int = 1, query: pdstools.utils.types.QUERY | None = None, verbose: bool = True, **kwargs) ADMTrees

Method to automatically extract AGB models.

It is recommended to subset the data with the query functionality to cut down on execution time, because the method checks every model ID. If only AGB models remain after the query, only proper AGB models are returned.

Parameters:
  • last (bool, default = False) – Whether to only look at the last snapshot for each model

  • by (str, default = 'Configuration') – Which column to determine unique models with

  • n_threads (int, default = 1) – The number of threads to use for extracting the models. Because the extraction is multithreaded, raising this value can speed up the import.

  • query (Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]) – Please refer to _apply_query()

  • verbose (bool, default = True) – Whether to print out information while importing

Return type:

ADMTrees
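A hedged extraction sketch, continuing from the `agb` object above; the configuration name used in the query is hypothetical.

    import polars as pl

    # Restrict to one configuration first (recommended); the name is hypothetical.
    trees = agb.get_agb_models(
        last=True,
        by="Configuration",
        n_threads=4,
        query=pl.col("Configuration") == "WebClickthrough",
        verbose=True,
    )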

class ADMTrees
static get_multi_trees(file: polars.DataFrame, n_threads=1, verbose=True, **kwargs)
Parameters:

file (polars.DataFrame)

class ADMTreesModel(file: str, **kwargs)

Functions for ADM Gradient boosting

ADM Gradient boosting models consist of multiple trees, which build upon each other in a ‘boosting’ fashion. This class provides functions to extract data from these trees, such as the features on which the trees split, important values for these features, and statistics about the trees, as well as functions to visualise each individual tree.

Parameters:

file (str) – The input file as a JSON file (see Notes)

trees
Type:

Dict

properties
Type:

Dict

learning_rate
Type:

float

model
Type:

Dict

treeStats
Type:

Dict

splitsPerTree
Type:

Dict

gainsPerTree
Type:

Dict

gainsPerSplit
Type:

pl.DataFrame

groupedGainsPerSplit
Type:

Dict

predictors
Type:

Set

allValuesPerSplit
Type:

Dict

Notes

The input file is the JSON file exported by the ‘save model’ action in Prediction Studio. The Datamart column ‘pyModelData’ also contains this information, but there it is compressed and the values for each split are encoded. Only the ‘save model’ export contains this data in decompressed and decoded form.
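A minimal loading sketch; the file name is a placeholder for your own export from the ‘save model’ action.

    from pdstools.adm.ADMTrees import ADMTreesModel

    # "exported_model.json" is a placeholder path to the Prediction Studio export.
    model = ADMTreesModel("exported_model.json")

    print(model.learning_rate)  # boosting learning rate (float)
    print(len(model.trees))     # number of boosted trees (trees is a Dict)
    print(model.predictors)     # predictors the trees split on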

nospaces = True
_read_model(file, **kwargs)
_decode_trees()
_post_import_cleanup(decode, **kwargs)
_depth(d: Dict) int

Calculates the depth of the tree, used in TreeStats.

Parameters:

d (Dict)

Return type:

int

property predictors
property tree_stats
property splits_per_tree
property gains_per_tree
property gains_per_split
property grouped_gains_per_split
property all_values_per_split
property splits_per_variable_type
parse_split_values(value) Tuple[str, str, str]

Parses the raw ‘split’ string into its three components.

Once the split is parsed, it can be evaluated in Python.

Parameters:

value (str) – The raw ‘split’ string

Returns:

The variable on which the split is done, the direction of the split (‘<’ or ‘in’), and the value on which to split

Return type:

Tuple[str, str, str]
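An illustrative call; the raw split string below is a made-up example, not taken from an actual model.

    # Hypothetical raw split string; real strings come from the decoded tree nodes.
    variable, direction, value = model.parse_split_values("Age < 42")
    # variable -> the predictor name, direction -> '<' or 'in',
    # value -> the value on which to split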

static parse_split_values_with_spaces(value) Tuple[str, str, str]
Return type:

Tuple[str, str, str]

get_predictors() Dict | None
Return type:

Optional[Dict]

get_gains_per_split() Tuple[Dict, Dict, polars.DataFrame]

Function to compute the gains of each split in each tree.

Return type:

Tuple[Dict, Dict, polars.DataFrame]

get_grouped_gains_per_split() polars.DataFrame

Function to get the gains per split, grouped by split.

It adds some additional information, such as the possible values, the mean gains, and the number of times the split is performed.

Return type:

polars.DataFrame
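A short inspection sketch; the assumption here is that the three returned values line up with the splitsPerTree, gainsPerTree and gainsPerSplit attributes documented above.

    # Assumption: the returned tuple corresponds to the documented attributes.
    splits_per_tree, gains_per_tree, gains_per_split = model.get_gains_per_split()

    grouped = model.get_grouped_gains_per_split()  # pl.DataFrame, one row per split
    print(grouped.head())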

get_splits_recursively(tree: Dict, splits: List, gains: List) Tuple[List, List]

Recursively finds splits and their gains for each node.

Because Python lists are mutable, the easiest way to achieve this is to explicitly supply the function with empty lists. Therefore, the ‘splits’ and ‘gains’ parameters expect empty lists when initially called.

Parameters:
  • tree (Dict)

  • splits (List)

  • gains (List)

Returns:

Each split, and its corresponding gain

Return type:

Tuple[List, List]
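A call-pattern sketch for the note above about passing empty lists; treating one value of `model.trees` as a single tree dictionary is an assumption about the internal layout.

    # Assumption: each value in the `trees` Dict is one tree structure.
    some_tree = next(iter(model.trees.values()))

    # Per the docstring, pass explicit empty lists on the initial call.
    splits, gains = model.get_splits_recursively(some_tree, splits=[], gains=[])
    print(len(splits), len(gains))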

plot_splits_per_variable(subset: Set | None = None, show=True)

Plots the splits for each variable in the tree.

Parameters:
  • subset (Optional[Set]) – Optional parameter to subset the variables to plot

  • show (bool) – Whether to display each plot

Return type:

plt.figure

get_tree_stats() polars.DataFrame

Generate a dataframe with useful stats for each tree

Return type:

polars.DataFrame

get_all_values_per_split() Dict

Generate a dictionary with the possible values for each split

Return type:

Dict

get_nodes_recursively(tree: Dict, nodelist: Dict, counter: List, childs: Dict) Tuple[Dict, Dict]

Recursively walks through each node; used for the tree representation.

As with get_splits_recursively(), the nodelist, counter and childs parameters expect empty containers (an empty dict, list and dict, respectively) when initially called.

Parameters:
  • tree (Dict)

  • nodelist (Dict)

  • counter (List)

  • childs (Dict)

Returns:

The dictionary of nodes and the child node information

Return type:

Tuple[Dict, Dict]

static _fill_child_node_ids(nodeinfo: Dict, childs: Dict) Dict

Utility function to add child info to nodes

Parameters:
  • nodeinfo (Dict)

  • childs (Dict)

Return type:

Dict

get_tree_representation(tree_number: int) Dict

Generates a more usable tree representation.

In this tree representation, each node has an ID and carries its original attributes, with its parent and child nodes added as well.

Parameters:

tree_number (int) – The number of the tree, in order of the original JSON

Return type:

Dict
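A small inspection sketch; the assumption that the returned Dict is keyed by node ID follows the description above.

    representation = model.get_tree_representation(0)  # first tree; 0-based index assumed
    for node_id, node in representation.items():
        print(node_id, node)  # attributes plus parent/child references per node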

plot_tree(tree_number: int, highlighted: Dict | List | None = None, show=True) pydot.Graph

Plots the chosen decision tree.

Parameters:
  • tree_number (int) – The number of the tree to visualise

  • highlighted (Optional[Union[Dict, List]]) – Optional parameter to highlight nodes in green. If a dictionary, it expects an ‘x’: i.e., features with their corresponding values. If a list, it expects a list of node IDs for that tree.

Return type:

pydot.Graph
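A plotting sketch; the feature names and values passed to `highlighted` are hypothetical.

    # Highlight the path an (invented) feature dictionary takes through tree 0.
    graph = model.plot_tree(
        tree_number=0,
        highlighted={"Age": 42, "Channel": "Web"},  # hypothetical 'x'
        show=True,
    )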

get_visited_nodes(treeID: int, x: Dict, save_all: bool = False) Tuple[List, float, List]

Finds all visited nodes for a given tree, given a feature dictionary x.

Parameters:
  • treeID (int) – The ID of the tree

  • x (Dict) – Features to split on, with their values

  • save_all (bool, default = False) – Whether to save all gains for each individual split

Returns:

The list of visited nodes, the score of the final leaf node, and the gains for each split in the visited nodes

Return type:

Tuple[List, float, List]

get_all_visited_nodes(x: Dict) polars.DataFrame

Loops through each tree and records the scoring info.

Parameters:

x (Dict) – Features to split on, with their values

Return type:

pl.DataFrame

score(x: Dict) float

Computes the score for a given x

Parameters:

x (Dict)

Return type:

float
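A scoring sketch tying the three methods above together; the feature dictionary `x` is hypothetical.

    x = {"Age": 42, "Channel": "Web"}  # hypothetical features and values

    visited, leaf_score, split_gains = model.get_visited_nodes(0, x)  # tree 0 only
    per_tree = model.get_all_visited_nodes(x)  # pl.DataFrame, one row per tree
    propensity = model.score(x)                # final propensity over all trees
    print(propensity)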

plot_contribution_per_tree(x: Dict, show=True)

Plots the contribution of each tree towards the final propensity.

Parameters:

x (Dict)

predictor_categorization(x: str, context_keys=None)
Parameters:

x (str)

compute_categorization_over_time(predictorCategorization=None, context_keys=None)
plot_splits_per_variable_type(predictor_categorization=None, **kwargs)
class MultiTrees
trees: dict
model_name: str | None = None
context_keys: list | None = None
__repr__()
__getitem__(index)
__len__()
__add__(other)
property first
property last
compute_over_time(predictor_categorization=None)
plot_splits_per_variable_type(predictor_categorization=None, **kwargs)
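A container-usage sketch, assuming `trees` is the MultiTrees instance returned by get_agb_models() above; integer indexing via __getitem__ is an assumption.

    print(len(trees))        # __len__: number of extracted models
    first_model = trees.first
    last_model = trees.last
    one_model = trees[0]     # __getitem__; integer indexing assumed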