Example ADM analysis

This notebook introduces the ADMDatamart class and gives an overview of the currently implemented features in the Python version of CDH Tools. If you have any suggestions for new features, please do not hesitate to raise an issue on GitHub, or even better: create a pull request yourself!

This notebook builds upon the Getting Started guide.

Reading the data

Reading the data is quite simple: point the ADMDatamart class at a directory and it will automatically detect the latest export files and import them. There is also a built-in function to import the CDH Sample data directly from the internet, as you can see below:

[2]:
from pdstools import ADMDatamart, datasets
import polars as pl

CDHSample = datasets.cdh_sample()
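
If you have your own datamart export instead, you can point the class at the directory containing it. A minimal sketch, assuming the from_ds_export constructor; the path below is a placeholder for your own export folder:

[ ]:
# Sketch: read an ADM datamart export from disk. This scans the given
# directory for the latest model and predictor snapshot files and
# imports them. Replace the placeholder path with your own export folder.
my_dm = ADMDatamart.from_ds_export(base_path="path/to/your/export")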

Bubble Chart

Let's start with the bubble chart, which we can create simply by calling bubble_chart on the plot attribute of our main class.

[3]:
fig = CDHSample.plot.bubble_chart()
fig.show()
[Figure: Bubble Chart over all models (68 models); x-axis: Performance, y-axis: Success Rate]

This looks like a healthy bubble plot, but sometimes it is useful to consider only certain models in the analysis. Note that by default the bubble chart only considers the last snapshot, though this behaviour can be changed with a parameter.
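
For instance, to plot every snapshot rather than only the latest one, you can disable that default. A minimal sketch, assuming the last keyword argument (default True):

[ ]:
# Sketch: include all snapshots in the bubble chart instead of only the
# most recent one (assumes the `last` keyword argument, default True)
fig = CDHSample.plot.bubble_chart(last=False)
fig.show()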

To reduce the information, let’s only consider models with more than 500 responses within the CreditCards group.

[4]:
query = (pl.col("ResponseCount") > 500) & (pl.col("Group") == "CreditCards")
fig = CDHSample.plot.bubble_chart(query=query)
fig.show()
[Figure: Bubble Chart over all models (19 models); x-axis: Performance, y-axis: Success Rate]

Alternatively, we could look at only the top n best-performing models within our query. To do this, we need to supply a list of model IDs, which we can easily extract from the data as follows.

Note the alternative querying syntax used here, which was the default in the previous version of CDH Tools: to subset a column's values with a list, you can simply supply a dictionary of {'column name': list}, which keeps only the rows where that column's value appears in the list.

[5]:
top30ids = (
    CDHSample.aggregates.last()
    .sort("Performance", descending=True)
    .select("ModelID")
    .head(30)
    .collect()
    .to_series()
    .to_list()
)

fig = CDHSample.plot.bubble_chart(query={"ModelID": top30ids})
fig.show()
[Figure: Bubble Chart over all models (30 models); x-axis: Performance, y-axis: Success Rate]

The bubble chart gives some information about which models perform well, but that is not always informative: if we don't know in which channels, issues, or groups the problems lie, we may not be looking in the right place. This is where the Treemap visualisation comes in handy.

[6]:
fig = CDHSample.plot.tree_map()
fig.show()
[Figure: Treemap of weighted average Performance by Name, split by channel, direction, issue, and group]

By default the Treemap shows the performance weighted by response count. The squares represent model IDs: the larger a square, the more model IDs are within that combination of context keys. We can also color the Treemap by another variable, such as the SuccessRate:

[7]:
fig = CDHSample.plot.tree_map("SuccessRate")
fig.show()
[Figure: Treemap of weighted average Success Rate by Name, split by channel, direction, issue, and group]

Similar to the responses, the success rate over time can also be of interest. With over_time, you can plot how the success rate of different models develops over time.

[8]:
fig = CDHSample.plot.over_time("SuccessRate", by="ModelID", query=pl.col("Channel") == "Web")
fig.show()
[Figure: SuccessRate over time, per ModelID, over all models; x-axis: SnapshotTime, y-axis: SuccessRate]

And if the development over time is not of interest, there is also proposition_success_rates, which by default considers the last state of the models and plots the success rate of each proposition.

[9]:
fig = CDHSample.plot.proposition_success_rates(query=pl.col("Channel") == "Web")
fig.show()
[Figure: SuccessRate of each proposition over all models; x-axis: average SuccessRate, y-axis: Name]

If we want to look at the distribution of responses and their propensities for a given model, we can subset that model and call score_distribution. Note that here we subset the model by its ID.

[10]:
fig = CDHSample.plot.score_distribution(model_id="08ca1302-9fc0-57bf-9031-d4179d400493")
fig.show()
[Figure: Classifier score distribution for OmniAdaptiveModel/Web/Inbound/Sales/Bundles/HomeOwners/MISSING; bars: Responses per score range, line: Propensity]

Alternatively, we can also subset a model by its model name, and then further drill down by group/issue/channel/configuration. See the example below.

[11]:
figs = CDHSample.plot.multiple_score_distributions(
    query=(pl.col("Name") == "HomeOwners")
    & (pl.col("Group") == "Bundles")
    & (pl.col("Issue") == "Sales")
    & (pl.col("Channel") == "Web")
    & (pl.col("Configuration") == "OmniAdaptiveModel"),
    show_all=False,
)
for fig in figs:
    fig.show()
[Figure: Classifier score distribution for OmniAdaptiveModel/Web/Inbound/Sales/Bundles/HomeOwners/MISSING; bars: Responses per score range, line: Propensity]

Similarly, we can also display the distribution and binning of individual predictors. This function loops through each predictor of a model and generates a binning image for it. For that reason we recommend subsetting the predictor names ahead of time; otherwise, depending on how many predictors the model has, a lot of images will be generated.
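
If you are unsure which predictor names are available, you can list them first. A minimal sketch, assuming the predictor snapshot is exposed as a polars LazyFrame under the predictor_data attribute:

[ ]:
# Sketch: list the distinct predictor names in the datamart so you can
# pick a subset to plot (assumes the `predictor_data` LazyFrame attribute)
predictor_names = (
    CDHSample.predictor_data.select(pl.col("PredictorName").unique())
    .collect()
    .to_series()
    .sort()
    .to_list()
)
predictor_names[:10]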

[12]:
figs = CDHSample.plot.multiple_predictor_binning(
    model_id="08ca1302-9fc0-57bf-9031-d4179d400493",
    query=(
        pl.col("PredictorName").is_in(
            [
                "Customer.Age",
                "Customer.AnnualIncome",
                "IH.Email.Outbound.Accepted.pxLastGroupID",
            ]
        )
    ),
    show_all=False,
)
for fig in figs:
    fig.show()
[Figure: Predictor binning for Customer.AnnualIncome (OmniAdaptiveModel/Web/Inbound/Sales/Bundles/HomeOwners/MISSING); bars: Responses per bin, line: Propensity]
[Figure: Predictor binning for Customer.Age (same model); bars: Responses per bin, line: Propensity]
[Figure: Predictor binning for IH.Email.Outbound.Accepted.pxLastGroupID (same model); bars: Responses per bin, line: Propensity]

Alternatively, we can look at the performance of a predictor across multiple models. Again, we recommend subsetting the predictor names with a list to keep the plot legible.

[13]:
fig = CDHSample.plot.predictor_performance(
    query=pl.col("PredictorName").is_in(
        [
            "Customer.Age",
            "Customer.AnnualIncome",
            "IH.Email.Outbound.Accepted.pxLastGroupID",
        ]
    )
)
fig.show()
[Figure: Predictor Performance over all models; x-axis: Performance, y-axis: Predictor Name, colored by predictor category (Customer, IH)]

What the two previous visualisations could not represent very well is how predictor performance varies across different models. That is what the predictor_performance_heatmap function shows; again, subsetting the predictors is a recommended step.

[14]:
fig = CDHSample.plot.predictor_performance_heatmap(
    query=pl.col("PredictorName").is_in(
        [
            "Customer.Age",
            "Customer.AnnualIncome",
            "IH.Email.Outbound.Accepted.pxLastGroupID",
        ]
    )
)
fig.show()
[Figure: Top predictors over all models; heatmap of predictor performance per action for Customer.Age, Customer.AnnualIncome, and IH.Email.Outbound.Accepted.pxLastGroupID]