pdstools.adm.ADMTrees
=====================

.. py:module:: pdstools.adm.ADMTrees


Classes
-------

.. autoapisummary::

   pdstools.adm.ADMTrees.AGB
   pdstools.adm.ADMTrees.ADMTrees
   pdstools.adm.ADMTrees.ADMTreesModel
   pdstools.adm.ADMTrees.MultiTrees


Module Contents
---------------

.. py:class:: AGB(datamart: pdstools.adm.ADMDatamart.ADMDatamart)

   .. py:attribute:: datamart


   .. py:method:: discover_model_types(df: polars.LazyFrame, by: str = 'Configuration') -> Dict

      Discovers the type of model embedded in the pyModelData column.

      By default, we do a group_by Configuration, because a model rule can
      only contain one type of model. Then, for each configuration, we look
      into the pyModelData blob and find the _serialClass, returning it in
      a dict.

      :param df: The dataframe to search for model types
      :type df: pl.LazyFrame
      :param by: The column to look for types in. Configuration is recommended.
      :type by: str
      :param allow_collect: Set to True to allow discovering modelTypes, even if
          in lazy strategy. It will fetch one modelData string per configuration.
      :type allow_collect: bool, default = False


   .. py:method:: get_agb_models(last: bool = False, by: str = 'Configuration', n_threads: int = 1, query: Optional[pdstools.utils.types.QUERY] = None, verbose: bool = True, **kwargs) -> ADMTrees

      Method to automatically extract AGB models.

      Recommended to subset using the querying functionality to cut down on
      execution time, because it checks for each model ID. If you only have
      AGB models remaining after the query, it will only return proper AGB
      models.

      :param last: Whether to only look at the last snapshot for each model
      :type last: bool, default = False
      :param by: Which column to determine unique models with
      :type by: str, default = 'Configuration'
      :param n_threads: The number of threads to use for extracting the models.
          Since we use multithreading, setting this to a reasonable value
          helps speed up the import.
      :type n_threads: int, default = 1
      :param query: Please refer to :meth:`._apply_query`
      :type query: Optional[Union[pl.Expr, List[pl.Expr], str, Dict[str, list]]]
      :param verbose: Whether to print out information while importing
      :type verbose: bool, default = True


.. py:class:: ADMTrees

   .. py:method:: get_multi_trees(file: polars.DataFrame, n_threads=1, verbose=True, **kwargs)
      :staticmethod:


.. py:class:: ADMTreesModel(file: str, **kwargs)

   Functions for ADM Gradient boosting

   ADM Gradient boosting models consist of multiple trees, which build upon
   each other in a 'boosting' fashion. This class provides some functions
   to extract data from these trees, such as the features on which the
   trees split, important values for these features, statistics about the
   trees, or visualising each individual tree.

   :param file: The input file as a json (see notes)
   :type file: str

   .. attribute:: trees
      :type: Dict

   .. attribute:: properties
      :type: Dict

   .. attribute:: learning_rate
      :type: float

   .. attribute:: model
      :type: Dict

   .. attribute:: treeStats
      :type: Dict

   .. attribute:: splitsPerTree
      :type: Dict

   .. attribute:: gainsPerTree
      :type: Dict

   .. attribute:: gainsPerSplit
      :type: pl.DataFrame

   .. attribute:: groupedGainsPerSplit
      :type: Dict

   .. attribute:: predictors
      :type: Set

   .. attribute:: allValuesPerSplit
      :type: Dict

   .. rubric:: Notes

   The input file is the extracted json file of the 'save model' action in
   Prediction Studio. The Datamart column 'pyModelData' also contains this
   information, but it is compressed and the values for each split are
   encoded. Using the 'save model' button, only that data is decompressed
   and decoded.

   .. py:attribute:: nospaces
      :value: True


   .. py:method:: _read_model(file, **kwargs)


   .. py:method:: _decode_trees()


   .. py:method:: _post_import_cleanup(decode, **kwargs)


   .. py:method:: _depth(d: Dict) -> int

      Calculates the depth of the tree, used in TreeStats.


   .. py:property:: predictors


   .. py:property:: tree_stats


   .. py:property:: splits_per_tree
   .. py:property:: gains_per_tree


   .. py:property:: gains_per_split


   .. py:property:: grouped_gains_per_split


   .. py:property:: all_values_per_split


   .. py:property:: splits_per_variable_type


   .. py:method:: parse_split_values(value) -> Tuple[str, str, str]

      Parses the raw 'split' string into its three components.

      Once the split is parsed, Python can use it to evaluate.

      :param value: The raw 'split' string
      :type value: str
      :returns: The variable on which the split is done,
          the direction of the split (< or 'in'),
          and the value on which to split
      :rtype: Tuple[str, str, str]


   .. py:method:: parse_split_values_with_spaces(value) -> Tuple[str, str, str]
      :staticmethod:


   .. py:method:: get_predictors() -> Optional[Dict]


   .. py:method:: get_gains_per_split() -> Tuple[Dict, Dict, polars.DataFrame]

      Function to compute the gains of each split in each tree.


   .. py:method:: get_grouped_gains_per_split() -> polars.DataFrame

      Function to get the gains per split, grouped by split.

      It adds some additional information, such as the possible values,
      the mean gains, and the number of times the split is performed.


   .. py:method:: get_splits_recursively(tree: Dict, splits: List, gains: List) -> Tuple[List, List]

      Recursively finds splits and their gains for each node.

      Because Python lists are mutable, the easiest way to achieve this is
      to explicitly supply the function with empty lists. Therefore, the
      'splits' and 'gains' parameters expect empty lists when initially
      called.

      :param tree:
      :type tree: Dict
      :param splits:
      :type splits: List
      :param gains:
      :type gains: List
      :returns: Each split, and its corresponding gain
      :rtype: Tuple[List, List]


   .. py:method:: plot_splits_per_variable(subset: Optional[Set] = None, show=True)

      Plots the splits for each variable in the tree.

      :param subset: Optional parameter to subset the variables to plot
      :type subset: Optional[Set]
      :param show: Whether to display each plot
      :type show: bool
      :rtype: plt.figure
   .. py:method:: get_tree_stats() -> polars.DataFrame

      Generate a dataframe with useful stats for each tree.


   .. py:method:: get_all_values_per_split() -> Dict

      Generate a dictionary with the possible values for each split.


   .. py:method:: get_nodes_recursively(tree: Dict, nodelist: Dict, counter: List, childs: Dict) -> Tuple[Dict, Dict]

      Recursively walks through each node, used for tree representation.

      Again, nodelist, counter and childs expect empty dict, list and dict
      parameters.

      :param tree:
      :type tree: Dict
      :param nodelist:
      :type nodelist: Dict
      :param counter:
      :type counter: List
      :param childs:
      :type childs: Dict
      :returns: The dictionary of nodes and the dictionary of child nodes
      :rtype: Tuple[Dict, Dict]


   .. py:method:: _fill_child_node_ids(nodeinfo: Dict, childs: Dict) -> Dict
      :staticmethod:

      Utility function to add child info to nodes.


   .. py:method:: get_tree_representation(tree_number: int) -> Dict

      Generates a more usable tree representation.

      In this tree representation, each node has an ID, and its attributes
      are the attributes, with parent and child nodes added as well.

      :param tree_number: The number of the tree, in order of the original json
      :type tree_number: int
      :returns: The tree representation
      :rtype: Dict


   .. py:method:: plot_tree(tree_number: int, highlighted: Optional[Union[Dict, List]] = None, show=True) -> pydot.Graph

      Plots the chosen decision tree.

      :param tree_number: The number of the tree to visualise
      :type tree_number: int
      :param highlighted: Optional parameter to highlight nodes in green.
          If a dictionary, it expects an 'x': i.e., features with their
          corresponding values. If a list, it expects a list of node IDs
          for that tree.
      :type highlighted: Optional[Union[Dict, List]]
      :rtype: pydot.Graph
   .. py:method:: get_visited_nodes(treeID: int, x: Dict, save_all: bool = False) -> Tuple[List, float, List]

      Finds all visited nodes for a given tree, given a feature dict x.

      :param treeID: The ID of the tree
      :type treeID: int
      :param x: Features to split on, with their values
      :type x: Dict
      :param save_all: Whether to save all gains for each individual split
      :type save_all: bool, default = False
      :returns: The list of visited nodes,
          the score of the final leaf node,
          and the gains for each split in the visited nodes
      :rtype: Tuple[List, float, List]


   .. py:method:: get_all_visited_nodes(x: Dict) -> polars.DataFrame

      Loops through each tree and records the scoring info.

      :param x: Features to split on, with their values
      :type x: Dict
      :rtype: pl.DataFrame


   .. py:method:: score(x: Dict) -> float

      Computes the score for a given x.


   .. py:method:: plot_contribution_per_tree(x: Dict, show=True)

      Plots the contribution of each tree towards the final propensity.


   .. py:method:: predictor_categorization(x: str, context_keys=None)


   .. py:method:: compute_categorization_over_time(predictorCategorization=None, context_keys=None)


   .. py:method:: plot_splits_per_variable_type(predictor_categorization=None, **kwargs)


.. py:class:: MultiTrees

   .. py:attribute:: trees
      :type: dict


   .. py:attribute:: model_name
      :type: Optional[str]
      :value: None


   .. py:attribute:: context_keys
      :type: Optional[list]
      :value: None


   .. py:method:: __repr__()


   .. py:method:: __getitem__(index)


   .. py:method:: __len__()


   .. py:method:: __add__(other)


   .. py:property:: first


   .. py:property:: last


   .. py:method:: compute_over_time(predictor_categorization=None)


   .. py:method:: plot_splits_per_variable_type(predictor_categorization=None, **kwargs)
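The ``_depth`` helper above recursively measures how deep a tree goes, which feeds the per-tree statistics. A minimal sketch of that idea, assuming nodes are plain dicts with ``left``/``right`` child keys (the real ADM json layout differs in detail):

```python
from typing import Dict, Optional


def tree_depth(node: Optional[Dict]) -> int:
    """Depth of a nested tree dict; a non-dict (missing child) contributes 0."""
    if not isinstance(node, dict):
        return 0
    # A node counts itself, plus the deeper of its two subtrees.
    return 1 + max(tree_depth(node.get("left")), tree_depth(node.get("right")))


toy_tree = {
    "split": "Age < 30",
    "left": {"score": 0.2},
    "right": {
        "split": "Income < 50000",
        "left": {"score": 0.1},
        "right": {"score": 0.4},
    },
}
print(tree_depth(toy_tree))  # → 3
```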
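``parse_split_values`` breaks a raw split string into a variable, a direction, and a value. A simplified sketch covering only the two directions named in the docstring, ``<`` and ``in``; the split-string formats are an assumption, and the real parser also handles values containing spaces (see ``parse_split_values_with_spaces``):

```python
from typing import Tuple


def parse_split(value: str) -> Tuple[str, str, str]:
    """Parse a split like 'Age < 30' or 'Color in { red, blue }'."""
    if " in " in value:
        # Categorical split: variable membership in a set of values.
        variable, values = value.split(" in ", 1)
        return variable.strip(), "in", values.strip(" {}")
    # Numeric split: 'variable sign threshold'.
    variable, sign, threshold = value.split(" ", 2)
    return variable.strip(), sign, threshold.strip()


print(parse_split("Age < 30"))                # → ('Age', '<', '30')
print(parse_split("Color in { red, blue }"))  # → ('Color', 'in', 'red, blue')
```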
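``get_nodes_recursively`` and ``get_tree_representation`` turn the nested json into a flat, ID-keyed map where every node carries its parent and child IDs. A hypothetical equivalent over the same toy ``left``/``right`` node layout, not the library's actual traversal:

```python
from typing import Dict, Optional


def flatten_tree(node: Dict, nodes: Optional[Dict] = None, parent: Optional[int] = None) -> Dict[int, Dict]:
    """Flatten a nested tree into {id: node attributes + parent/child links}."""
    if nodes is None:
        nodes = {}
    node_id = len(nodes)  # IDs are assigned in visit (pre-)order
    info = {k: v for k, v in node.items() if k not in ("left", "right")}
    info["parent_id"] = parent
    info["child_ids"] = []
    nodes[node_id] = info
    for child in ("left", "right"):
        if child in node:
            # The child's ID is whatever the counter says next.
            info["child_ids"].append(len(nodes))
            flatten_tree(node[child], nodes, node_id)
    return nodes


nodes = flatten_tree(
    {"split": "Age < 30", "left": {"score": 0.2}, "right": {"score": 0.4}}
)
print(nodes[0]["child_ids"])  # → [1, 2]
```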
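``get_visited_nodes`` and ``score`` walk every tree for a feature dict ``x`` and combine the leaf scores into a propensity. A toy sketch of that boosting-style scoring, which only evaluates numeric ``<`` splits and a plain logistic squash; the real implementation also applies the learning rate and handles ``in`` splits:

```python
import math
from typing import Dict, List


def traverse(node: Dict, x: Dict) -> float:
    """Follow one tree's splits for x and return the leaf score."""
    while "score" not in node:
        variable, _, threshold = node["split"].split(" ", 2)
        # Go left when the split condition holds, right otherwise.
        node = node["left"] if x[variable] < float(threshold) else node["right"]
    return node["score"]


def score(trees: List[Dict], x: Dict) -> float:
    """Sum the per-tree leaf contributions, then squash to a propensity."""
    margin = sum(traverse(tree, x) for tree in trees)
    return 1 / (1 + math.exp(-margin))


stump = {"split": "Age < 30", "left": {"score": 0.2}, "right": {"score": -0.1}}
print(score([stump, stump], {"Age": 25}))  # → ≈ 0.599
```

This mirrors how ``plot_contribution_per_tree`` can attribute the final propensity back to individual trees: each tree contributes one leaf score to the margin.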