📦 Optional dependencies
This article uses features from the pdstools adm extra. Install it with your favorite package manager, e.g. pip install "pdstools[adm]".
Example ADM analysis¶
See this notebook for an introduction to the ADMDatamart class and an overview of the currently implemented features in the Python version of CDH Tools. If you have any suggestions for new features, please do not hesitate to raise an issue on GitHub, or even better: create a pull request yourself!
Reading the data¶
Reading the data is quite simple. All you need to do is give a directory location to the ADMDatamart class and it will automatically detect the latest files and import them. There is also a default function to import the CDH Sample data directly from the internet, as you can see below:
[2]:
from pdstools import ADMDatamart, datasets
import polars as pl
CDHSample = datasets.cdh_sample()
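To read your own datamart export instead, you can point the ADMDatamart class at the folder containing the dataset export files. A minimal sketch, assuming the from_ds_export constructor and a local download folder (both are assumptions here, not part of this example; check the ADMDatamart API reference for the exact signature in your pdstools version):

# Hypothetical sketch: pick up the latest model and predictor snapshot files
# from a local folder of dataset exports (constructor name and path are assumptions).
my_dm = ADMDatamart.from_ds_export(base_path="~/Downloads")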
Bubble Chart¶
Let's start with the bubble chart, which we can create simply by calling plot.bubble_chart on our main class.
[3]:
fig = CDHSample.plot.bubble_chart()
fig.show()
Looks like a healthy bubble plot, but sometimes it is useful to consider only certain models in the analysis. Note that the bubble chart automatically considers only the last snapshot by default, though this is a parameter.
To reduce the information, let’s only consider models with more than 500 responses within the CreditCards group.
[4]:
query = (pl.col("ResponseCount") > 500) & (pl.col("Group") == "CreditCards")
fig = CDHSample.plot.bubble_chart(query=query)
fig.show()
Alternatively, we could only look at the top n best performing models within our query. To do this, we need to supply a list of model IDs, which we can easily extract from the data as follows.
Note the alternative querying syntax here, which was the default in the previous version of CDH Tools: if you have a list of values to subset a column with, you can simply supply a dictionary of {'column name': list} to keep only the rows where that column's value is in the list.
[5]:
top30ids = (
    CDHSample.aggregates.last()
    .sort("Performance", descending=True)
    .select("ModelID")
    .head(30)
    .collect()
    .to_series()
    .to_list()
)
fig = CDHSample.plot.bubble_chart(query={"ModelID": top30ids})
fig.show()
The bubble chart gives some information about which models perform well, but that is not always informative: if we don't know in which channels, issues or groups the problems lie, we may not be looking in the right place. This is where the Treemap visualisation is quite handy.
[6]:
fig = CDHSample.plot.tree_map()
fig.show()
By default the Treemap shows the weighted performance, where the performance is weighted by the response count. The squares represent Model IDs: the larger a square, the more model IDs are within that combination of context keys. We can also color the Treemap by another variable, such as the SuccessRate:
[7]:
fig = CDHSample.plot.tree_map("SuccessRate")
fig.show()
Similar to the responses, the success rate over time can also be of interest. With plot.over_time, you can plot the success rate of different models as they develop over time.
[8]:
fig = CDHSample.plot.over_time("SuccessRate", by="ModelID", query=pl.col("Channel") == "Web")
fig.show()
over_time: 19 unique values for `ModelID` exceeds the cap of 10; keeping the top 10 by total ResponseCount and dropping 9.
And if the success rate over time is not of interest, there is also plot.proposition_success_rates, which by default considers the last state of the models and plots the histogram of their success rates.
[9]:
fig = CDHSample.plot.proposition_success_rates(query=pl.col("Channel") == "Web")
fig.show()
If we want to look at the distribution of responses and their propensities for a given model, we can subset that model and call plot.score_distribution. Note here we subset the model by its ID.
[10]:
fig = CDHSample.plot.score_distribution(model_id="08ca1302-9fc0-57bf-9031-d4179d400493")
fig.show()
Alternatively, we can also subset a model by its model name, and then further drill down by group/issue/channel/configuration. See the example below.
[11]:
figs = CDHSample.plot.multiple_score_distributions(
    query=(pl.col("Name") == "HomeOwners")
    & (pl.col("Group") == "Bundles")
    & (pl.col("Issue") == "Sales")
    & (pl.col("Channel") == "Web")
    & (pl.col("Configuration") == "OmniAdaptiveModel"),
    show_all=False,
)
for fig in figs:
    fig.show()
Similarly, we can also display the distribution of a single predictor and its binning. This function loops through each predictor of a model and generates the binning image for each one. For that reason we recommend subsetting the predictor names ahead of time; otherwise, depending on how many predictors the model has, a lot of images will be generated.
[12]:
figs = CDHSample.plot.multiple_predictor_binning(
    model_id="08ca1302-9fc0-57bf-9031-d4179d400493",
    query=(
        pl.col("PredictorName").is_in(
            [
                "Customer.Age",
                "Customer.AnnualIncome",
                "IH.Email.Outbound.Accepted.pxLastGroupID",
            ]
        )
    ),
    show_all=False,
)
for fig in figs:
    fig.show()
Alternatively, we can look at the performance of a predictor across multiple models. Again, we recommend subsetting the predictor names with a list to keep the plot legible.
[13]:
fig = CDHSample.plot.predictor_performance(
    query=pl.col("PredictorName").is_in(
        [
            "Customer.Age",
            "Customer.AnnualIncome",
            "IH.Email.Outbound.Accepted.pxLastGroupID",
        ]
    )
)
fig.show()
What the two previous visualisations could not represent very well is the performance of the predictors across different models. That is what the plot.predictor_performance_heatmap function does; again with subsetting of predictors as a recommended step.
[14]:
fig = CDHSample.plot.predictor_performance_heatmap(
    query=pl.col("PredictorName").is_in(
        [
            "Customer.Age",
            "Customer.AnnualIncome",
            "IH.Email.Outbound.Accepted.pxLastGroupID",
        ]
    )
)
fig.show()
External model scores as ADM predictors¶
When you integrate scores from third-party ML models or custom scoring pipelines as ADM predictors, you may want to analyse their contribution separately from native customer-profile or interaction-history predictors.
The apply_predictor_categorization method lets you supply your own mapping so that downstream plots automatically colour-code predictors by that label. The simplest form is a dictionary: keys become category labels and values are substrings (or regular expressions when use_regexp=True) matched against PredictorName.
Note – simulated data: The CDH sample dataset does not contain predictors with an external-score naming convention, so the example below re-tags the IH (interaction-history) predictors as External Model Score to illustrate the workflow. In a real deployment you would replace "IH" with whatever prefix or pattern your external-score predictors use, for example "ExtScore_" or "ModelOutput.", and add as many additional categories as needed.
[15]:
# Tag predictors whose names contain "IH" as External Model Scores.
# In production, replace "IH" with your actual naming pattern, e.g. "ExtScore_".
CDHSample.apply_predictor_categorization(
    categorization={"External Model Score": ["IH"]},
)
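As described above, the values can also be regular expressions. A minimal sketch of that variant, using the hypothetical "ExtScore_" and "ModelOutput." naming patterns from the note (these predictors do not exist in the sample data, so this is illustration only):

# Hypothetical: match external-score predictors by regular expression instead of substring.
# The patterns are placeholders; adapt them to your own naming convention.
CDHSample.apply_predictor_categorization(
    categorization={"External Model Score": [r"^ExtScore_", r"^ModelOutput\."]},
    use_regexp=True,
)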
Feature importance: external scores vs. native predictors¶
plot.predictor_performance renders a box-plot of predictor performance across models and colours each predictor by its PredictorCategory, so external model scores are immediately visually distinguished from native Customer or Primary predictors.
[16]:
fig = CDHSample.plot.predictor_performance(top_n=20)
fig.show()
Lift contribution by predictor category¶
plot.predictor_contribution shows what fraction of the overall lift each category contributes per ADM configuration. This lets you answer the question “how much of our model’s predictive power comes from external scores versus our own data?”
[17]:
fig = CDHSample.plot.predictor_contribution()
fig.show()
Interpreting the results¶
Predictor performance measures how well a single predictor separates positives from negatives within an individual model (AUC on a 0.5–1.0 scale). An external model score with performance ≥ 0.6 is already a meaningful signal; above 0.7 it is a strong one. Compare it to the native predictors in the same box-plot: if the external score sits at the top, it is the dominant driver of propensity for those models.
Lift contribution (the bar chart) normalises performance across all active predictors within a configuration and expresses each category’s share as a percentage. A large External Model Score bar means those third-party scores are carrying a disproportionate amount of the model’s lift — which is fine if the scores are stable and monitored, but worth flagging if data freshness or availability could be a risk.
Typical action items:
- High contribution, high performance: the external score is providing real value; ensure it is refreshed regularly and has a fallback strategy for missing values.
- High contribution, mediocre performance: the native predictors are weak; consider enriching the customer profile before reducing reliance on external scores.
- Low contribution: the external score is not being used by the models; check whether it is actually active and whether the feature engineering (binning, scaling) is appropriate.
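As a starting point for those checks, here is a minimal sketch, assuming the predictor-level data is exposed as CDHSample.predictor_data (a Polars LazyFrame) with PredictorName, EntryType, Performance and ResponseCount columns; the attribute and column names are assumptions, so adjust them to your pdstools version:

# Sketch: summarise external-score predictors by active/inactive status,
# mean univariate performance and total responses (names are assumptions).
external_scores = (
    CDHSample.predictor_data
    .filter(pl.col("PredictorName").str.contains("IH"))
    .group_by("PredictorName", "EntryType")
    .agg(
        pl.col("Performance").mean().alias("MeanPerformance"),
        pl.col("ResponseCount").sum().alias("TotalResponses"),
    )
    .sort("MeanPerformance", descending=True)
    .collect()
)
external_scores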