{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# ADM NB Explained\n", "\n", "This notebook shows exactly how all the values in an ADM model report\n", "are calculated. It also shows how the propensity is calculated for a\n", "particular customer.\n", "\n", "We use one of the shipped datamart exports for the example. This is a\n", "model very similar to one used in some of the ADM PowerPoint/Excel deep\n", "dive examples. You can change this notebook to apply to your own data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "# These lines are only for rendering in the docs, and are hidden through Jupyter tags\n", "# Do not run if you're running the notebook seperately\n", "\n", "import plotly.io as pio\n", "\n", "pio.renderers.default = \"notebook_connected\"\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import polars as pl\n", "import numpy as np\n", "\n", "import plotly.express as px\n", "from math import log\n", "from great_tables import GT\n", "from pdstools import datasets\n", "from pdstools.utils import cdh_utils" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_name = \"AutoNew84Months\"\n", "predictor_name = \"Customer.NetWealth\"\n", "channel = \"Web\"" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For the example we pick one particular model over a channel.\n", "To explain the ADM model report, we use one of the active predictors as an\n", "example. Swap for any other predictor when using different data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dm = datasets.cdh_sample()\n", "\n", "model = dm.combined_data.filter(\n", " (pl.col(\"Name\") == model_name) & (pl.col(\"Channel\") == channel)\n", ")\n", "\n", "modelpredictors = (\n", " dm.combined_data.join(\n", " model.select(pl.col(\"ModelID\").unique()), on=\"ModelID\", how=\"inner\"\n", " )\n", " .filter(pl.col(\"EntryType\") != \"Inactive\")\n", " .with_columns(\n", " Action=pl.concat_str([\"Issue\", \"Group\"], separator=\"/\"),\n", " PredictorName=pl.col(\"PredictorName\").cast(pl.Utf8),\n", " )\n", " .collect()\n", ")\n", "\n", "predictorbinning = modelpredictors.filter(\n", " pl.col(\"PredictorName\") == predictor_name\n", ").sort(\"BinIndex\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbsphinx": "hidden", "tags": [ "remove_input" ] }, "outputs": [], "source": [ "model_id = None\n", "\n", "if (modelpredictors.select(pl.col(\"ModelID\").unique()).shape[0] > 1) and (\n", " model_id is None\n", "):\n", " display(\n", " model.group_by(\"ModelID\")\n", " .agg(\n", " number_of_predictors=pl.col(\"PredictorName\").n_unique(),\n", " model_performance=cdh_utils.weighted_performance_polars() * 100,\n", " response_count=pl.sum(\"ResponseCount\"),\n", " )\n", " .collect()\n", " )\n", " raise Exception(\n", " f\"**{model_name}** model has multiple instances.\"\n", " \"\\nThis could be due to the same model name being used in different configurations, directions, issues, or having multiple treatments.\"\n", " \"\\nTo ensure the selection of a unique model, please choose a model_id from the table above and update the `model_id` variable at the top of this cell.\"\n", " \"\\nAfterward, rerun this cell.\"\n", " f\"\\nSee model IDs in {model_name} model above:\"\n", " )\n", "if model_id is not None:\n", " if (\n", " model_id\n", " not in modelpredictors.select(pl.col(\"ModelID\").unique())\n", " .get_column(\"ModelID\")\n", " .to_list()\n", " ):\n", " raise Exception(\n", " f\"The {model_name} model does not have a model ID: {model_id}.\"\n", " f\"Please ensure that the spelling of the model ID is correct.\"\n", " f\"You can run `modelpredictors.select(pl.col('ModelID').unique().implode()).row(0)` to see the exact spellings of your IDs.\"\n", " \"After updating the `model_id`, you can restart the notebook from the beginning.\"\n", " )\n", "\n", " predictors_in_selected_model = (\n", " modelpredictors.filter(pl.col(\"ModelID\") == model_id)\n", " .select(pl.col(\"PredictorName\").unique())\n", " .get_column(\"PredictorName\")\n", " .to_list()\n", " )\n", " if predictor_name not in predictors_in_selected_model:\n", " raise Exception(\n", " f\"{predictor_name} is not a predictor of the model with ID: {model_id}.\"\n", " \"Please choose one of the available predictors below and update the **predictor_name** variable in the cell above:\"\n", " f\"\\nAvailable Predictors:\\n{predictors_in_selected_model}.\"\n", " )\n", "\n", " modelpredictors = modelpredictors.filter(pl.col(\"ModelID\") == model_id)\n", " predictorbinning = predictorbinning.filter(pl.col(\"ModelID\") == model_id)\n", " print(f\"{model_name} model with **{model_id}** model ID is selected successfully.\")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Model Overview" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The selected model is shown below. Only the currently active predictors are used for the propensity calculation, so only showing those.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "from great_tables import loc, style\n", "\n", "GT(\n", " modelpredictors.select(\n", " pl.col(\"Action\").unique(),\n", " pl.col(\"Channel\").unique(),\n", " pl.col(\"Name\").unique(),\n", " pl.col(\"PredictorName\")\n", " .unique()\n", " .sort()\n", " .implode()\n", " .list.join(\", \")\n", " .alias(\"Active Predictors\"),\n", " (pl.col(\"Performance\").unique() * 100).alias(\"Model Performance (AUC)\"),\n", " ).transpose(include_header=True)\n", ").tab_header(\"Overview\").tab_options(column_labels_hidden=True).tab_style(\n", " style=style.text(weight=\"bold\"), locations=loc.body(columns=\"column\")\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Binning of the selected Predictor" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The Model Report in Prediction Studio for this model will have a predictor binning plot like below.\n", "\n", "All numbers can be derived from just the number of positives and negatives in each bin that are stored in the ADM Data Mart. The next sections will show exactly how that is done." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "display(\n", " GT(\n", " (\n", " predictorbinning.group_by(\"PredictorName\")\n", " .agg(\n", " pl.first(\"ResponseCount\").cast(pl.Utf8).alias(\"# Responses\"),\n", " pl.n_unique(\"BinIndex\").cast(pl.Utf8).alias(\"# Bins\"),\n", " (pl.first(\"PredictorPerformance\") * 100)\n", " .cast(pl.Utf8)\n", " .alias(\"Predictor Performance(AUC)\"),\n", " )\n", " .rename({\"PredictorName\": \"Predictor Name\"})\n", " .transpose(include_header=True)\n", " )\n", " )\n", " .tab_header(\"Predictor information\")\n", " .tab_options(column_labels_hidden=True)\n", " .tab_style(style=style.text(weight=\"bold\"), locations=loc.body(columns=\"column\"))\n", " .tab_options(table_margin_left=0)\n", ")\n", "\n", "fig = dm.plot.predictor_binning(\n", " model_id=modelpredictors.get_column(\"ModelID\").unique().to_list()[0],\n", " predictor_name=predictor_name,\n", ")\n", "fig.update_layout(width=600, height=400)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "BinPositives = pl.col(\"BinPositives\")\n", "BinNegatives = pl.col(\"BinNegatives\")\n", "sumPositives = pl.sum(\"BinPositives\")\n", "sumNegatives = pl.sum(\"BinNegatives\")\n", "\n", "binstats = predictorbinning.select(\n", " pl.col(\"BinSymbol\").alias(\"Range/Symbol\"),\n", " ((BinPositives + BinNegatives) / (sumPositives + sumNegatives)).alias(\n", " \"Responses (%)\"\n", " ),\n", " BinPositives.cast(pl.Int32).alias(\"Positives\"),\n", " (BinPositives / sumPositives).alias(\"Positives (%)\"),\n", " BinNegatives.cast(pl.Int32).alias(\"Negatives\"),\n", " (BinNegatives / sumNegatives).alias(\"Negatives (%)\"),\n", " (BinPositives / (BinPositives + BinNegatives)).alias(\"Propensity (%)\"),\n", " cdh_utils.z_ratio(neg_col=BinNegatives, pos_col=BinPositives),\n", " (\n", " (BinPositives / (BinPositives + BinNegatives))\n", " / (sumPositives / (BinPositives + BinNegatives).sum())\n", " ).alias(\"Lift\"),\n", ")\n", "total_positives = binstats.select(pl.sum(\"Positives\")).item()\n", "total_negatives = binstats.select(pl.sum(\"Negatives\")).item()\n", "total_row = {\n", " \"Range/Symbol\": \"Total\",\n", " \"Responses (%)\": binstats.select(pl.sum(\"Responses (%)\")).item(),\n", " \"Positives\": total_positives,\n", " \"Positives (%)\": binstats.select(pl.sum(\"Positives (%)\")).item(),\n", " \"Negatives\": total_negatives,\n", " \"Negatives (%)\": binstats.select(pl.sum(\"Negatives (%)\")).item(),\n", " \"Propensity (%)\": total_positives / (total_positives + total_negatives),\n", " \"ZRatio\": 0.0,\n", " \"Lift\": 1.0,\n", "}\n", "\n", "total_df = pl.DataFrame(total_row, schema=binstats.schema)\n", "binstats = binstats.vstack(total_df)\n", "\n", "\n", "GT(binstats).tab_header(\"Binning statistics\").tab_style(\n", " style=style.text(weight=\"bold\"), locations=loc.body(columns=\"Range/Symbol\")\n", ").fmt_percent(pl.selectors.ends_with(\"(%)\")).fmt_number([\"ZRatio\", \"Lift\"]).tab_options(\n", " table_margin_left=0\n", ").tab_options(table_margin_left=0)\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Bin Statistics" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Positive and Negative ratios" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Internally, ADM only keeps track of the total counts of positive and negative responses in each bin. Everything else is derived from those numbers. The percentages and totals are trivially derived, and the propensity is just the number of positives divided by the total. The numbers calculated here match the numbers from the datamart table exactly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "binning_derived = predictorbinning.select(\n", " pl.col(\"BinSymbol\").alias(\"Range/Symbol\"),\n", " BinPositives.alias(\"Positives\"),\n", " BinNegatives.alias(\"Negatives\"),\n", " ((BinPositives + BinNegatives) / (sumPositives + sumNegatives)).alias(\n", " \"Responses %\"\n", " ),\n", " (BinPositives / sumPositives).alias(\"Positives %\"),\n", " (BinNegatives / sumNegatives).alias(\"Negatives %\"),\n", " (BinPositives / (BinPositives + BinNegatives)).round(4).alias(\"Propensity\"),\n", ")\n", "\n", "pcts = [\"Responses %\", \"Positives %\", \"Negatives %\", \"Propensity\"]\n", "GT(binning_derived).tab_header(\"Derived binning statistics\").tab_style(\n", " style=style.text(weight=\"bold\"), locations=loc.body(columns=\"Range/Symbol\")\n", ").tab_style(\n", " style=style.text(color=\"blue\"),\n", " locations=loc.body(columns=pcts),\n", ").fmt_percent(pcts).tab_options(table_margin_left=0)\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Lift" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Lift is the ratio of the propensity in a particular bin over the average propensity. So a value of 1 is the average, larger than 1 means higher propensity, smaller means lower propensity:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "positives = pl.col(\"Positives\")\n", "negatives = pl.col(\"Negatives\")\n", "sumPositives = pl.sum(\"Positives\")\n", "sumNegatives = pl.sum(\"Negatives\")\n", "GT(\n", " binning_derived.select(\n", " \"Range/Symbol\",\n", " \"Positives\",\n", " \"Negatives\",\n", " (\n", " (positives / (positives + negatives))\n", " / (sumPositives / (positives + negatives).sum())\n", " )\n", " .round(4)\n", " .alias(\"Lift\"),\n", " )\n", ").tab_style(\n", " style=style.text(weight=\"bold\"), locations=loc.body(columns=\"Range/Symbol\")\n", ").tab_style(\n", " style=style.text(color=\"blue\"), locations=loc.body(columns=[\"Lift\"])\n", ").tab_options(table_margin_left=0)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Z-Ratio" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The Z-Ratio is also a measure of the how the propensity in a bin differs from the average, but takes into account the size of the bin and thus is statistically more relevant. It represents the number of standard deviations from the average, so centers around 0. The wider the spread, the better the predictor is.\n", "$$\\frac{posFraction-negFraction}{\\sqrt(\\frac{posFraction*(1-posFraction)}{\\sum positives}+\\frac{negFraction*(1-negFraction)}{\\sum negatives})}$$ \n", "\n", "See the calculation here, which is also included in [cdh_utils' z_ratio()](https://pegasystems.github.io/pega-datascientist-tools/autoapi/pdstools/utils/cdh_utils/index.html#pdstools.utils.cdh_utils.z_ratio)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def z_ratio(\n", " pos_col: pl.Expr = pl.col(\"BinPositives\"), neg_col: pl.Expr = pl.col(\"BinNegatives\")\n", ") -> pl.Expr:\n", " def get_fracs(pos_col=pl.col(\"BinPositives\"), neg_col=pl.col(\"BinNegatives\")):\n", " return pos_col / pos_col.sum(), neg_col / neg_col.sum()\n", "\n", " def z_ratio_impl(\n", " pos_fraction_col=pl.col(\"posFraction\"),\n", " neg_fraction_col=pl.col(\"negFraction\"),\n", " positives_col=pl.sum(\"BinPositives\"),\n", " negatives_col=pl.sum(\"BinNegatives\"),\n", " ):\n", " return (\n", " (pos_fraction_col - neg_fraction_col)\n", " / (\n", " (pos_fraction_col * (1 - pos_fraction_col) / positives_col)\n", " + (neg_fraction_col * (1 - neg_fraction_col) / negatives_col)\n", " ).sqrt()\n", " ).alias(\"ZRatio\")\n", "\n", " return z_ratio_impl(*get_fracs(pos_col, neg_col), pos_col.sum(), neg_col.sum())\n", "\n", "\n", "GT(\n", " binning_derived.select(\n", " \"Range/Symbol\", \"Positives\", \"Negatives\", \"Positives %\", \"Negatives %\"\n", " ).with_columns(z_ratio(positives, negatives).round(4))\n", ").tab_style(\n", " style=style.text(weight=\"bold\"), locations=loc.body(columns=\"Range/Symbol\")\n", ").tab_style(\n", " style=style.text(color=\"blue\"), locations=loc.body(columns=[\"ZRatio\"])\n", ").fmt_percent(pl.selectors.ends_with(\"%\")).tab_options(table_margin_left=0)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Predictor AUC\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The predictor AUC is the univariate performance of this predictor against the outcome. This too can be derived from the positives and negatives and\n", "there is a convenient function in pdstools to calculate it directly from the positives and negatives.\n", "\n", "This function is implemented in cdh_utils: [cdh_utils.auc_from_bincounts()](https://pegasystems.github.io/pega-datascientist-tools/autoapi/pdstools/utils/cdh_utils/index.html#pdstools.utils.cdh_utils.auc_from_bincounts)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pos = binning_derived.get_column(\"Positives\")\n", "neg = binning_derived.get_column(\"Negatives\")\n", "probs = binning_derived.get_column(\"Propensity\")\n", "order = probs.arg_sort()\n", "FPR = pl.Series([0.0], dtype=pl.Float32).extend(neg[order].cum_sum() / neg[order].sum())\n", "TPR = pl.Series([0.0], dtype=pl.Float32).extend(pos[order].cum_sum() / pos[order].sum())\n", "if TPR[1] < 1 - FPR[1]:\n", " FPR, TPR = TPR, FPR\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pos = binning_derived.get_column(\"Positives\").to_numpy()\n", "neg = binning_derived.get_column(\"Negatives\").to_numpy()\n", "probs = binning_derived.get_column(\"Propensity\").to_numpy()\n", "order = np.argsort(probs)\n", "\n", "FPR = np.cumsum(neg[order]) / np.sum(neg[order])\n", "TPR = np.cumsum(pos[order]) / np.sum(pos[order])\n", "TPR = np.insert(TPR, 0, 0, axis=0)\n", "FPR = np.insert(FPR, 0, 0, axis=0)\n", "# Checking whether classifier labels are correct\n", "if TPR[1] < 1 - FPR[1]:\n", " temp = FPR\n", " FPR = TPR\n", " TPR = temp\n", "auc = cdh_utils.auc_from_bincounts(pos=pos, neg=neg, probs=probs)\n", "\n", "fig = px.line(\n", " x=[1 - x for x in FPR],\n", " y=TPR,\n", " labels=dict(x=\"Specificity\", y=\"Sensitivity\"),\n", " title=f\"AUC = {auc.round(3)}\",\n", " width=700,\n", " height=700,\n", " range_x=[1, 0],\n", " template=\"none\",\n", ")\n", "fig.add_shape(type=\"line\", line=dict(dash=\"dash\"), x0=1, x1=0, y0=0, y1=1)\n", "fig.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predictor Contributions\n", "\n", "### Naive Bayes and Log Odds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The basis for the Naive Bayes algorithm is Bayes' Theorem:\n", "\n", "$$p(C_k|x) = \\frac{p(x|C_k)*p(C_k)}{p(x)}$$\n", "\n", "with $C_k$ the outcome and $x$ the customer. Bayes' theorem turns the\n", "question \"what's the probability to accept this action given a customer\" around to \n", "\"what's the probability of this customer given an action\". With the independence\n", "assumption, and after applying a log odds transformation we get a log odds score \n", "that can be calculated efficiently and in a numerically stable manner:\n", "\n", "$$log\\ odds\\ score = \\sum_{p\\ \\in\\ active\\ predictors}log(p(x_p|Positive)) + log(p_{positive}) - \\sum_plog(p(x_p|Negative)) - log(p_{negative})$$\n", "note that the _prior_ can be written as:\n", "\n", "$$log(p_{positive}) - log(p_{negative}) = log(\\frac{TotalPositives}{Total})-log(\\frac{TotalNegatives}{Total}) = log(TotalPositives) - log(TotalNegatives)$$\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Predictor Contribution" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The contribution (_conditional log odds_) of an active predictor $p$ for bin $i$ with the number\n", "of positive and negative responses in $Positives_i$ and $Negatives_i$ is calculated as (note the \"Laplace smoothing\" to avoid log 0 issues):\n", "\n", "$$contribution_p = \\log(Positives_i+\\frac{1}{nBins}) - \\log(Negatives_i+\\frac{1}{nBins}) - \\log(1+\\sum_{i\\ = 1..nBins}{Positives_i}) + \\log(1+\\sum_i{Negatives_i})$$\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "N = binning_derived.shape[0]\n", "GT(\n", " binning_derived.with_columns(\n", " LogOdds=(pl.col(\"Positives %\") / pl.col(\"Negatives %\")).log().round(5),\n", " ModifiedLogOdds=(\n", " ((positives + 1 / N).log() - (positives.sum() + 1).log())\n", " - ((negatives + 1 / N).log() - (negatives.sum() + 1).log())\n", " ).round(5),\n", " ).drop(\"Responses %\", \"Propensity\")\n", ").tab_style(\n", " style=style.text(weight=\"bold\"), locations=loc.body(columns=\"Range/Symbol\")\n", ").tab_style(\n", " style=style.text(color=\"blue\"),\n", " locations=loc.body(columns=[\"LogOdds\", \"ModifiedLogOdds\"]),\n", ").fmt_percent(pl.selectors.ends_with(\"%\")).tab_options(table_margin_left=0)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Propensity mapping" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Log odds contribution for all the predictors" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The final score is loosely referred to as \"the average contribution\" but\n", "in fact is a little more nuanced. The final score is calculated as:\n", "\n", "$$score = \\frac{\\log(1 + TotalPositives) – \\log(1 + TotalNegatives) + \\sum_p contribution_p}{1 + nActivePredictors}$$\n", "\n", "Here, $TotalPositives$ and $TotalNegatives$ are the total number of\n", "positive and negative responses to the model.\n", "\n", "Below an example. From all the active predictors of the model \n", "we pick a value (in the middle for numerics, first symbol\n", "for symbolics) and show the (modified) log odds.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "def middle_bin():\n", " return pl.col(\"BinIndex\") == (pl.max(\"BinIndex\") / 2).floor().cast(pl.UInt32)\n", "\n", "\n", "if not all(\n", " col in modelpredictors.columns for col in [\"BinLowerBound\", \"BinUpperBound\"]\n", "):\n", "\n", " def extract_numbers_in_contents(s: str, index):\n", " numbers = re.findall(r\"[-+]?\\d*\\.\\d+|\\d+\", s)\n", " try:\n", " number = float(numbers[index])\n", " except:\n", " number = 0\n", " return number\n", "\n", " modelpredictors = modelpredictors.with_columns(\n", " pl.col(\"Contents\").cast(pl.Utf8)\n", " ).with_columns(\n", " pl.when(pl.col(\"Type\") == \"numeric\")\n", " .then(\n", " pl.col(\"Contents\").map_elements(\n", " lambda col: extract_numbers_in_contents(col, 0)\n", " )\n", " )\n", " .otherwise(pl.lit(-9999))\n", " .alias(\"BinLowerBound\")\n", " .cast(pl.Float32),\n", " pl.when(pl.col(\"Type\") == \"numeric\")\n", " .then(\n", " pl.col(\"Contents\").map_elements(\n", " lambda col: extract_numbers_in_contents(col, 1)\n", " )\n", " )\n", " .otherwise(pl.lit(-9999))\n", " .alias(\"BinUpperBound\")\n", " .cast(pl.Float32),\n", " )\n", "\n", "\n", "def row_wise_log_odds(bin, positives, negatives):\n", " bin, N = bin.list.get(0) - 1, positives.list.len()\n", " pos, neg = positives.list.get(bin), negatives.list.get(bin)\n", " pos_sum, neg_sum = positives.list.sum(), negatives.list.sum()\n", " return (\n", " ((pos + (1 / N)).log() - (pos_sum + 1).log())\n", " - (((neg + (1 / N)).log()) - (neg_sum + 1).log())\n", " ).alias(\"Modified Log odds\")\n", "\n", "\n", "df = (\n", " modelpredictors.filter(pl.col(\"PredictorName\") != \"Classifier\")\n", " .group_by(\"PredictorName\")\n", " .agg(\n", " Value=pl.when(pl.col(\"Type\").first() == \"numeric\")\n", " .then(\n", " ((pl.col(\"BinLowerBound\") + pl.col(\"BinUpperBound\")) / 2).filter(\n", " middle_bin()\n", " )\n", " )\n", " .otherwise(\n", " pl.col(\"BinSymbol\").str.split(\",\").list.first().filter(middle_bin())\n", " ),\n", " Bin=pl.col(\"BinIndex\").filter(middle_bin()),\n", " Positives=pl.col(\"BinPositives\"),\n", " Negatives=pl.col(\"BinNegatives\"),\n", " )\n", " .with_columns(\n", " pl.col([\"Positives\", \"Negatives\"]).list.get(pl.col(\"Bin\").list.get(0) - 1),\n", " pl.col(\"Bin\", \"Value\").list.get(0),\n", " LogOdds=row_wise_log_odds(\n", " pl.col(\"Bin\"), pl.col(\"Positives\"), pl.col(\"Negatives\")\n", " ).round(4),\n", " )\n", " .sort(\"PredictorName\")\n", ")\n", "\n", "classifier = (\n", " modelpredictors.filter(pl.col(\"EntryType\") == \"Classifier\")\n", " .with_columns(\n", " Propensity=(BinPositives / (BinPositives / BinNegatives)),\n", " FinalPropensity=((0.5 + BinPositives) / (1 + BinPositives + BinNegatives)),\n", " ZRatio=cdh_utils.z_ratio(neg_col=BinNegatives, pos_col=BinPositives).round(4),\n", " Lift=(\n", " (BinPositives / (BinPositives + BinNegatives))\n", " / (sumPositives / (BinPositives + BinNegatives).sum())\n", " ),\n", " )\n", " .select(\n", " [\n", " pl.col(\"BinIndex\").alias(\"Index\"),\n", " pl.col(\"BinSymbol\").alias(\"Bin\"),\n", " BinPositives.alias(\"Positives\"),\n", " BinNegatives.alias(\"Negatives\"),\n", " (pl.cum_sum(\"BinResponseCount\") / pl.sum(\"BinResponseCount\")).alias(\n", " \"Cum. Total (%)\"\n", " ),\n", " (pl.col(\"BinPropensity\")).alias(\"Propensity (%)\"),\n", " (pl.col(\"FinalPropensity\")).alias(\"Final Propensity (%)\"),\n", " (pl.cum_sum(\"BinPositives\") / pl.sum(\"BinPositives\")).alias(\n", " \"Cum Positives (%)\"\n", " ),\n", " pl.col(\"ZRatio\"),\n", " pl.col(\"Lift\").alias(\"Lift(%)\"),\n", " pl.col(\"BinResponseCount\").alias(\"Responses\"),\n", " ]\n", " )\n", ")\n", "classifierLogOffset = log(1 + classifier[\"Positives\"].sum()) - log(\n", " 1 + classifier[\"Negatives\"].sum()\n", ")\n", "\n", "propensity_mapping = df.vstack(\n", " pl.DataFrame(\n", " dict(\n", " zip(\n", " df.columns,\n", " [\"Final Score\"]\n", " + [None] * 4\n", " + [(df[\"LogOdds\"].sum() + classifierLogOffset) / (len(df) + 1)],\n", " )\n", " ),\n", " schema=df.schema,\n", " )\n", ")\n", "\n", "GT(propensity_mapping).tab_style(\n", " style=style.text(weight=\"bold\"), locations=loc.body(columns=\"PredictorName\")\n", ").tab_style(\n", " style=style.text(color=\"blue\"), locations=loc.body(columns=[\"LogOdds\"])\n", ").tab_options(table_margin_left=0)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### From Log Odds Score to Final Propensity" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The log odds score is mapped to a propensity using a score distribution, which is built using the [PAV(A)](https://en.wikipedia.org/wiki/Isotonic_regression) (Pool Adjacent Violators Algorithm), a form of isotonic regression that ensures the mapping is monotonically increasing.\n", "\n", "For each bin (score interval) the final propensity is calculated from the observed positive and negative counts:\n", "\n", "$$Final\\ Propensity = \\frac{0.5 + BinPositives}{1 + BinPositives + BinNegatives}$$\n", "\n", "The addition of 0.5 is for Laplace smoothing. An empty model (no responses yet) returns a neutral propensity of 0.5, rather than 0 or undefined. This quickly converges to the observed success rate as more responses are received.\n", "\n", "The transformation is a calibrated, isotonic mapping from log odds to propensity, grounded in the actual observed response rates of the model.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "# TODO see if we can port the \"getActiveRanges\" code to python so to highlight the classifier rows that are \"active\"\n", "\n", "GT(classifier.drop(\"Responses\")).tab_style(\n", " style=style.text(weight=\"bold\"), locations=loc.body(columns=\"Index\")\n", ").tab_style(\n", " style=style.text(color=\"blue\"),\n", " locations=loc.body(columns=[\"Final Propensity (%)\"]),\n", ").fmt_percent(pl.selectors.ends_with(\"(%)\")).tab_options(table_margin_left=0)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The chart below visualizes the score distribution from the table above. The score bin that contains the calculated final score for the example customer is highlighted." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "score = propensity_mapping.filter(PredictorName=\"Final Score\")[\"LogOdds\"][0]\n", "score_bin = (\n", " modelpredictors.filter(pl.col(\"EntryType\") == \"Classifier\")\n", " .select(\n", " pl.col(\"BinSymbol\").filter(\n", " pl.lit(score).is_between(pl.col(\"BinLowerBound\"), pl.col(\"BinUpperBound\"))\n", " )\n", " )\n", " .item()\n", ")\n", "score_responses = modelpredictors.filter(\n", " (pl.col(\"EntryType\") == \"Classifier\") & (pl.col(\"BinSymbol\") == score_bin)\n", ")[\"BinResponseCount\"][0]\n", "\n", "score_bin_index = (\n", " modelpredictors.filter(pl.col(\"EntryType\") == \"Classifier\")[\"BinSymbol\"]\n", " .to_list()\n", " .index(score_bin)\n", ")\n", "\n", "score_propensity = classifier.row(score_bin_index, named=True)[\n", " \"Final Propensity (%)\"\n", "]\n", "\n", "final_propensity = (\n", " modelpredictors.filter(pl.col(\"EntryType\") == \"Classifier\")\n", " .with_columns(\n", " FinalPropensity=(\n", " (0.5 + BinPositives) / (1 + BinPositives + BinNegatives)\n", " ).round(5),\n", " )\n", " .select(\n", " pl.col(\"FinalPropensity\").filter(\n", " (pl.col(\"BinLowerBound\") < score) & (pl.col(\"BinUpperBound\") > score)\n", " )\n", " )[\"FinalPropensity\"][0]\n", ")\n", "fig = dm.plot.score_distribution(\n", " model_id=modelpredictors.get_column(\"ModelID\").unique().to_list()[0]\n", ").add_annotation(\n", " x=score_bin,\n", " y=score_propensity / 100,\n", " text=f\"Returned propensity: {score_propensity*100:.2f}%\",\n", " bgcolor=\"#FFFFFF\",\n", " bordercolor=\"#000000\",\n", " showarrow=False,\n", " yref=\"y2\",\n", " opacity=0.7,\n", ")\n", "bin_index = list(fig.data[0][\"x\"]).index(score_bin)\n", "fig.data[0][\"marker_color\"] = (\n", " [\"grey\"] * bin_index\n", " + [\"#1f77b4\"]\n", " + [\"grey\"] * (classifier.shape[0] - bin_index - 1)\n", ")\n", "fig" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Importance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Feature importance in Naive Bayes models represents **how strongly a predictor differentiates between positive and negative outcomes**. It measures the average magnitude of log odds contributions across the predictor's bins.\n", "\n", "A predictor with high importance has bins with strong log odds values (far from zero), indicating strong predictive power. A predictor with low importance has bins with weak log odds values (close to zero), indicating weak predictive power.\n", "\n", "### Formula\n", "\n", "For a predictor with bins $i = 1..n$:\n", "\n", "**Step 1: Log odds per bin** (with Laplace smoothing $\\frac{1}{n}$):\n", "\n", "$$\\text{LogOdds}_i = \\log\\left(Pos_i + \\frac{1}{n}\\right) - \\log\\left(\\sum Pos + 1\\right) - \\left[\\log\\left(Neg_i + \\frac{1}{n}\\right) - \\log\\left(\\sum Neg + 1\\right)\\right]$$\n", "\n", "**Step 2: Absolute log odds per bin**:\n", "\n", "$$|\\text{LogOdds}_i|$$\n", "\n", "**Step 3: Feature importance** (weighted average of absolute log odds):\n", "\n", "$$\\text{Importance} = \\frac{\\sum_i |\\text{LogOdds}_i| \\times Responses_i}{\\sum_i Responses_i}$$\n", "\n", "**Step 4: Optional scaling** (default):\n", "\n", "$$\\text{Scaled Importance} = \\frac{\\text{Importance}}{\\max(\\text{Importance})} \\times 100$$\n", "\n", "### Interpretation\n", "\n", "- **Higher values**: Predictor has strong log odds values (strong predictive power)\n", "- **Lower values**: Predictor has weak log odds values (weak predictive power)\n", "- **Zero**: All bins have zero log odds (predictor adds no information)\n", "\n", "The absolute log odds captures the magnitude of each bin's contribution to the model score, weighted by how many responses are in that bin. This matches the Pega platform implementation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate feature importance for all active predictors\n", "predictor_importance = (\n", " modelpredictors.filter(pl.col(\"PredictorName\") != \"Classifier\")\n", " .with_columns(\n", " # Calculate both unscaled and scaled importance\n", " cdh_utils.feature_importance(scaled=False).alias(\"Importance\"),\n", " cdh_utils.feature_importance(scaled=True).alias(\"ScaledImportance\"),\n", " )\n", " .group_by(\"PredictorName\", \"ModelID\")\n", " .agg(\n", " pl.first(\"Importance\"),\n", " pl.first(\"ScaledImportance\")\n", " )\n", " .sort(\"ScaledImportance\", descending=True)\n", ")\n", "\n", "GT(predictor_importance.drop(\"ModelID\")).tab_header(\"Feature Importance by Predictor\").fmt_number(\n", " [\"Importance\", \"ScaledImportance\"], decimals=4\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_input" ] }, "outputs": [], "source": [ "# Visualize feature importance\n", "fig = px.bar(\n", " predictor_importance,\n", " x=\"PredictorName\",\n", " y=\"ScaledImportance\",\n", " title=\"Predictor Feature Importance (Scaled 0-100)\",\n", " labels={\"ScaledImportance\": \"Importance\", \"PredictorName\": \"\"},\n", " template=\"pega\",\n", ")\n", "fig.update_layout(\n", " xaxis_tickangle=-45,\n", " height=500,\n", " margin=dict(b=180) # Increase bottom margin for longer rotated labels\n", ")\n", "fig.update_xaxes(tickfont=dict(size=10)) # Smaller font for x-axis labels\n", "fig.show()" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3" } }, "nbformat": 4, "nbformat_minor": 2 }