ADM Explained

Pega

2023-03-15

This notebook shows exactly how all the values in an ADM model report are calculated. It also shows how the propensity is calculated for a particular customer.

We use one of the shipped datamart exports for the example. This is a model very similar to one used in some of the ADM PowerPoint/Excel deep dive examples. You can change this notebook to apply to your own data.

[3]:
model_name = "AutoNew84Months"
predictor_name = "Customer.NetWealth"
channel = "Web"

For the example we pick one particular model over a channel. To explain the ADM model report, we use one of the active predictors as an example. Swap for any other predictor when using different data.

[4]:
# Imports assumed from earlier in the notebook
import polars as pl
from pdstools import datasets

dm = datasets.CDHSample(subset=False)

model = dm.combinedData.filter(
    (pl.col("Name") == model_name) & (pl.col("Channel") == channel)
)

modelpredictors = (
    dm.combinedData.join(
        model.select(pl.col("ModelID").unique()), on="ModelID", how="inner"
    )
    .filter(pl.col("EntryType") != "Inactive")
    .with_columns(Action=pl.concat_str(["Issue", "Group"], separator="/"),
                  PredictorName=pl.col("PredictorName").cast(pl.Utf8))
    .collect()
)

predictorbinning = modelpredictors.filter(
    pl.col("PredictorName") == predictor_name
).sort("BinIndex")

Model Overview

The selected model is shown below. Only the currently active predictors are used for the propensity calculation, so only those are shown.

[6]:
Values
Action Sales/AutoLoans
Channel Web
Name AutoNew84Months
Active Predictors [Classifier, Customer.Age, Customer.AnnualIncome, Customer.BusinessSegment, Customer.CLV, Customer.CLV_VALUE, Customer.CreditScore, Customer.Date_of_Birth, Customer.Gender, Customer.MaritalStatus, Customer.NetWealth, Customer.NoOfDependents, Customer.Prefix, Customer.RelationshipStartDate, Customer.RiskCode, Customer.WinScore, Customer.pyCountry, IH.Email.Outbound.Accepted.pxLastGroupID, IH.Email.Outbound.Accepted.pxLastOutcomeTime.DaysSince, IH.Email.Outbound.Accepted.pyHistoricalOutcomeCount, IH.Email.Outbound.Churned.pyHistoricalOutcomeCount, IH.Email.Outbound.Loyal.pxLastOutcomeTime.DaysSince, IH.Email.Outbound.Rejected.pyHistoricalOutcomeCount, IH.SMS.Outbound.Accepted.pxLastGroupID, IH.SMS.Outbound.Accepted.pyHistoricalOutcomeCount, IH.SMS.Outbound.Churned.pxLastOutcomeTime.DaysSince, IH.SMS.Outbound.Loyal.pxLastOutcomeTime.DaysSince, IH.SMS.Outbound.Loyal.pyHistoricalOutcomeCount, IH.SMS.Outbound.Rejected.pxLastGroupID, IH.SMS.Outbound.Rejected.pyHistoricalOutcomeCount, IH.Web.Inbound.Accepted.pxLastGroupID, IH.Web.Inbound.Accepted.pyHistoricalOutcomeCount, IH.Web.Inbound.Loyal.pxLastGroupID, IH.Web.Inbound.Loyal.pyHistoricalOutcomeCount, IH.Web.Inbound.Rejected.pxLastGroupID, IH.Web.Inbound.Rejected.pyHistoricalOutcomeCount, Param.ExtGroupCreditcards]
Model Performance (AUC) 77.4901

Binning of the selected Predictor

The Model Report in Prediction Studio for this model will have a predictor binning plot like below.

All numbers can be derived from just the number of positives and negatives in each bin that are stored in the ADM Data Mart. The next sections will show exactly how that is done.

Value
Predictor Name Customer.NetWealth
# Responses 1636
# Bins 8
Predictor Performance (AUC) 72.2077
[8]:
Responses (%) Positives Positives (%) Negatives Negatives (%) Propensity (%) ZRatio Lift
Range/Symbol
<11684.56 0.267 13.0 0.063 423.0 0.296 0.0298 -11.186877 0.236795
[11684.56, 13732.56> 0.123 24.0 0.117 178.0 0.124 0.1188 -0.332147 0.943574
[13732.56, 16845.52> 0.163 17.0 0.083 250.0 0.175 0.0637 -4.264671 0.505654
[16845.52, 19139.28> 0.141 51.0 0.248 179.0 0.125 0.2217 3.908162 1.760996
[19139.28, 20286.16> 0.055 7.0 0.034 83.0 0.058 0.0778 -1.711776 0.617692
[20286.16, 22743.76> 0.136 53.0 0.257 169.0 0.118 0.2387 4.397646 1.896003
[22743.76, 23890.64> 0.055 13.0 0.063 77.0 0.054 0.1444 0.515565 1.147141
>=23890.64 0.061 28.0 0.136 71.0 0.050 0.2828 3.512888 2.246151
Total 1.001 206.0 1.001 1430.0 1.000 1.1777 -5.161210 9.354007

Bin Statistics

Positive and Negative ratios

Internally, ADM only keeps track of the total counts of positive and negative responses in each bin. Everything else is derived from those numbers. The percentages and totals are trivially derived, and the propensity is just the number of positives divided by the total. The numbers calculated here match the numbers from the datamart table exactly.

[9]:
# Polars expressions over the raw datamart bin counts, reused below
BinPositives = pl.col("BinPositives")
BinNegatives = pl.col("BinNegatives")
sumPositives = pl.sum("BinPositives")
sumNegatives = pl.sum("BinNegatives")

binningDerived = predictorbinning.select(
    pl.col("BinSymbol").alias("Range/Symbol"),
    BinPositives.alias("Positives"),
    BinNegatives.alias("Negatives"),
    (((BinPositives + BinNegatives) / (sumPositives + sumNegatives)) * 100)
    .round(2)
    .alias("Responses %"),
    ((BinPositives / sumPositives) * 100).round(2).alias("Positives %"),
    ((BinNegatives / sumNegatives) * 100).round(2).alias("Negatives %"),
    (BinPositives / (BinPositives + BinNegatives)).round(4).alias("Propensity"),
)
binningDerived.to_pandas(use_pyarrow_extension_array=True).set_index("Range/Symbol").style.format(
    format_binning_derived
).set_properties(
    color="#0000FF", subset=["Responses %", "Positives %", "Negatives %", "Propensity"]
)
[9]:
  Positives Negatives Responses % Positives % Negatives % Propensity
Range/Symbol            
<11684.56 13 423 26.65 6.31 29.58 0.0298
[11684.56, 13732.56> 24 178 12.35 11.65 12.45 0.1188
[13732.56, 16845.52> 17 250 16.32 8.25 17.48 0.0637
[16845.52, 19139.28> 51 179 14.06 24.76 12.52 0.2217
[19139.28, 20286.16> 7 83 5.50 3.40 5.80 0.0778
[20286.16, 22743.76> 53 169 13.57 25.73 11.82 0.2387
[22743.76, 23890.64> 13 77 5.50 6.31 5.38 0.1444
>=23890.64 28 71 6.05 13.59 4.97 0.2828

Lift

Lift is the ratio of the propensity in a particular bin to the average propensity. A value of 1 means average; above 1 means a higher than average propensity, below 1 a lower one:

[10]:
Positives = pl.col("Positives")
Negatives = pl.col("Negatives")
sumPositives = pl.sum("Positives")
sumNegatives = pl.sum("Negatives")
binningDerived.select(
    "Range/Symbol",
    "Positives",
    "Negatives",
    (
        (Positives / (Positives + Negatives))
        / (sumPositives / (Positives + Negatives).sum())
    ).alias("Lift"),
).to_pandas().set_index("Range/Symbol").style.format(format_lift).set_properties(
    **{"color": "blue"}, subset=["Lift"]
)
[10]:
  Positives Negatives Lift
Range/Symbol      
<11684.56 13 423 0.2368
[11684.56, 13732.56> 24 178 0.9436
[13732.56, 16845.52> 17 250 0.5057
[16845.52, 19139.28> 51 179 1.7610
[19139.28, 20286.16> 7 83 0.6177
[20286.16, 22743.76> 53 169 1.8960
[22743.76, 23890.64> 13 77 1.1471
>=23890.64 28 71 2.2462

Z-Ratio

The Z-Ratio is also a measure of how the propensity in a bin differs from the average, but it takes the size of the bin into account and is thus statistically more relevant. It represents the number of standard deviations from the average, so it centers around 0. The wider the spread, the better the predictor is.

\[\frac{posFraction-negFraction}{\sqrt{\frac{posFraction*(1-posFraction)}{\sum positives}+\frac{negFraction*(1-negFraction)}{\sum negatives}}}\]

See the calculation below, which is also included in cdh_utils’ zRatio().

[11]:
def zRatio(
    posCol: pl.Expr = pl.col("BinPositives"), negCol: pl.Expr = pl.col("BinNegatives")
) -> pl.Expr:
    def getFracs(posCol=pl.col("BinPositives"), negCol=pl.col("BinNegatives")):
        return posCol / posCol.sum(), negCol / negCol.sum()

    def zRatioimpl(
        posFractionCol=pl.col("posFraction"),
        negFractionCol=pl.col("negFraction"),
        PositivesCol=pl.sum("BinPositives"),
        NegativesCol=pl.sum("BinNegatives"),
    ):
        return (
            (posFractionCol - negFractionCol)
            / (
                (posFractionCol * (1 - posFractionCol) / PositivesCol)
                + (negFractionCol * (1 - negFractionCol) / NegativesCol)
            ).sqrt()
        ).alias("ZRatio")

    return zRatioimpl(*getFracs(posCol, negCol), posCol.sum(), negCol.sum())


binningDerived.select(
    "Range/Symbol", "Positives", "Negatives", "Positives %", "Negatives %"
).with_columns(zRatio(Positives, Negatives)).to_pandas().set_index("Range/Symbol").style.format(
    format_z_ratio
).set_properties(
    **{"color": "blue"}, subset=["ZRatio"]
)
[11]:
  Positives Negatives Positives % Negatives % ZRatio
Range/Symbol          
<11684.56 13 423 6.31 29.58 -11.1869
[11684.56, 13732.56> 24 178 11.65 12.45 -0.3321
[13732.56, 16845.52> 17 250 8.25 17.48 -4.2647
[16845.52, 19139.28> 51 179 24.76 12.52 3.9082
[19139.28, 20286.16> 7 83 3.40 5.80 -1.7118
[20286.16, 22743.76> 53 169 25.73 11.82 4.3976
[22743.76, 23890.64> 13 77 6.31 5.38 0.5156
>=23890.64 28 71 13.59 4.97 3.5129

Predictor AUC

The predictor AUC is the univariate performance of this predictor against the outcome. This too can be derived from the positives and negatives in each bin; pdstools provides a convenient function, cdh_utils.auc_from_bincounts(), to calculate it directly from the bin counts.

[12]:
import numpy as np
import plotly.express as px
from pdstools.utils import cdh_utils  # import path in recent pdstools versions

pos = binningDerived.get_column("Positives").to_numpy()
neg = binningDerived.get_column("Negatives").to_numpy()
probs = binningDerived.get_column("Propensity").to_numpy()
order = np.argsort(probs)

FPR = np.cumsum(neg[order]) / np.sum(neg[order])
TPR = np.cumsum(pos[order]) / np.sum(pos[order])
TPR = np.insert(TPR, 0, 0, axis=0)
FPR = np.insert(FPR, 0, 0, axis=0)
# If the curve lies below the diagonal, swap the axes so the AUC >= 0.5
if TPR[1] < 1 - FPR[1]:
    FPR, TPR = TPR, FPR
auc = cdh_utils.auc_from_bincounts(pos=pos, neg=neg, probs=probs)

fig = px.line(
    x=[1-x for x in FPR], y=TPR,
    labels=dict(x='Specificity', y='Sensitivity'),
    title = f"AUC = {auc.round(3)}",
    width=700, height=700,
    range_x=[1,0],
    template='none'
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=1, x1=0, y0=0, y1=1
)
fig.show()

Naive Bayes and Log Odds

The basis for the Naive Bayes algorithm is Bayes’ Theorem:

\[p(C_k|x) = \frac{p(x|C_k)*p(C_k)}{p(x)}\]

with \(C_k\) the outcome and \(x\) the customer. Bayes’ theorem turns the question “what’s the probability to accept this action given a customer” around to “what’s the probability of this customer given an action”. With the independence assumption, and after applying a log odds transformation we get a log odds score that can be calculated efficiently and in a numerically stable manner:

\[log\ odds\ score = \sum_{p\ \in\ active\ predictors}log(p(x_p|Positive)) + log(p_{positive}) - \sum_plog(p(x_p|Negative)) - log(p_{negative})\]

Note that the prior can be written as:

\[log(p_{positive}) - log(p_{negative}) = log(\frac{TotalPositives}{Total})-log(\frac{TotalNegatives}{Total}) = log(TotalPositives) - log(TotalNegatives)\]
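
This identity is easy to verify numerically. A quick check, using the totals from the predictor binning above (206 positives, 1430 negatives):

```python
import math

# Both sides of the prior identity: the shared log(Total) terms cancel
TP, TN = 206, 1430
total = TP + TN
lhs = math.log(TP / total) - math.log(TN / total)
rhs = math.log(TP) - math.log(TN)
assert math.isclose(lhs, rhs)
```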

Predictor Contribution

The contribution (conditional log odds) of an active predictor \(p\) for bin \(i\), with the numbers of positive and negative responses in \(Positives_i\) and \(Negatives_i\), is calculated as (note the Laplace smoothing to avoid taking the log of 0):

\[contribution_p = \log(Positives_i+\frac{1}{nBins}) - \log(Negatives_i+\frac{1}{nBins}) - \log(1+\sum_{i=1}^{nBins}{Positives_i}) + \log(1+\sum_{i=1}^{nBins}{Negatives_i})\]
[13]:
N = binningDerived.shape[0]
binningDerived.with_columns(
    LogOdds=(pl.col("Positives %") / pl.col("Negatives %")).log(),
    ModifiedLogOdds=(
        ((Positives + 1 / N).log() - (Positives.sum() + 1).log())
        - ((Negatives + 1 / N).log() - (Negatives.sum() + 1).log())
    ),
).drop("Responses %", "Propensity").to_pandas().set_index("Range/Symbol").style.format(
    format_log_odds
).set_properties(
    **{"color": "blue"}, subset=["LogOdds", "ModifiedLogOdds"]
)
[13]:
  Positives Negatives Positives % Negatives % LogOdds ModifiedLogOdds
Range/Symbol            
<11684.56 13 423 6.31 29.580000 -1.544963 -1.5397
[11684.56, 13732.56> 24 178 11.65 12.450000 -0.066414 -0.0658
[13732.56, 16845.52> 17 250 8.25 17.480000 -0.750844 -0.7480
[16845.52, 19139.28> 51 179 24.76 12.520000 0.681902 0.6796
[19139.28, 20286.16> 7 83 3.40 5.800000 -0.534083 -0.5233
[20286.16, 22743.76> 53 169 25.73 11.820000 0.777865 0.7754
[22743.76, 23890.64> 13 77 6.31 5.380000 0.159447 0.1625
>=23890.64 28 71 13.59 4.970000 1.005915 1.0056

Propensity mapping

Log odds contribution for all the predictors

The final score is loosely referred to as “the average contribution”, but is in fact a little more nuanced; it is calculated as:

\[score = \frac{\log(1 + TotalPositives) - \log(1 + TotalNegatives) + \sum_p contribution_p}{1 + nActivePredictors}\]

Here, \(TotalPositives\) and \(TotalNegatives\) are the total number of positive and negative responses to the model.

Below is an example. From all the active predictors of the model we pick a value (the middle bin for numerics, the first symbol for symbolics) and show the (modified) log odds. The final score is calculated per the formula above, and this is the value that is mapped to a propensity by the classifier (which is constructed using the PAV(A) algorithm).

[14]:
  Value Bin Positives Negatives LogOdds
PredictorName          
Customer.Age 34.56 4.000000 9.000000 198.000000 -1.145923
Customer.AnnualIncome -24043.049 1.000000 74.000000 1166.000000 -0.819651
Customer.BusinessSegment middleSegmentPlus 1.000000 96.000000 970.000000 -0.376415
Customer.CLV NON-MISSING 1.000000 111.000000 570.000000 0.300922
Customer.CLV_VALUE 1345.52 4.000000 31.000000 297.000000 -0.322731
Customer.CreditScore 518.92 3.000000 33.000000 205.000000 0.110531
Customer.Date_of_Birth 18773.504 5.000000 28.000000 152.000000 0.244642
Customer.Gender U 1.000000 52.000000 481.000000 -0.285516
Customer.MaritalStatus No Resp+ 1.000000 67.000000 745.000000 -0.470766
Customer.NetWealth 17992.398 4.000000 51.000000 179.000000 0.679600
Customer.NoOfDependents 0.0 1.000000 111.000000 850.000000 -0.099690
Customer.Prefix Mrs. 1.000000 64.000000 552.000000 -0.216664
Customer.RelationshipStartDate 1426.4596 4.000000 16.000000 117.000000 -0.050204
Customer.RiskCode R4 1.000000 36.000000 329.000000 -0.270925
Customer.WinScore 66.600006 4.000000 39.000000 102.000000 0.973755
Customer.pyCountry USA 1.000000 99.000000 776.000000 -0.122691
IH.Email.Outbound.Accepted.pxLastGroupID HomeLoans 3.000000 25.000000 218.000000 -0.227166
IH.Email.Outbound.Accepted.pxLastOutcomeTime.DaysSince -55.88436 2.000000 145.000000 881.000000 0.130525
IH.Email.Outbound.Accepted.pyHistoricalOutcomeCount 1.5 2.000000 30.000000 351.000000 -0.520104
IH.Email.Outbound.Churned.pyHistoricalOutcomeCount None 1.000000 143.000000 898.000000 0.101486
IH.Email.Outbound.Loyal.pxLastOutcomeTime.DaysSince None 1.000000 129.000000 1071.000000 -0.178813
IH.Email.Outbound.Rejected.pyHistoricalOutcomeCount 83.16 3.000000 24.000000 218.000000 -0.267751
IH.SMS.Outbound.Accepted.pxLastGroupID Account 4.000000 45.000000 316.000000 -0.013291
IH.SMS.Outbound.Accepted.pyHistoricalOutcomeCount 9.02 4.000000 6.000000 96.000000 -0.821986
IH.SMS.Outbound.Churned.pxLastOutcomeTime.DaysSince -20.5537 2.000000 9.000000 27.000000 0.849243
IH.SMS.Outbound.Loyal.pxLastOutcomeTime.DaysSince None 1.000000 165.000000 1240.000000 -0.079718
IH.SMS.Outbound.Loyal.pyHistoricalOutcomeCount None 1.000000 165.000000 1240.000000 -0.079718
IH.SMS.Outbound.Rejected.pxLastGroupID Account 2.000000 47.000000 357.000000 -0.090492
IH.SMS.Outbound.Rejected.pyHistoricalOutcomeCount 102.72 4.000000 12.000000 117.000000 -0.335590
IH.Web.Inbound.Accepted.pxLastGroupID DepositAccounts 3.000000 53.000000 397.000000 -0.077902
IH.Web.Inbound.Accepted.pyHistoricalOutcomeCount 11.04 5.000000 25.000000 164.000000 0.055802
IH.Web.Inbound.Loyal.pxLastGroupID MISSING 1.000000 100.000000 857.000000 -0.211919
IH.Web.Inbound.Loyal.pyHistoricalOutcomeCount 4.52 3.000000 30.000000 212.000000 -0.017224
IH.Web.Inbound.Rejected.pxLastGroupID Account 2.000000 81.000000 546.000000 0.027864
IH.Web.Inbound.Rejected.pyHistoricalOutcomeCount 111.08 4.000000 35.000000 306.000000 -0.231670
Param.ExtGroupCreditcards NON-MISSING 1.000000 136.000000 721.000000 0.268402
Final Score None nan nan nan -0.149329
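
The Final Score row can be reproduced with the formula above. A minimal sketch, with the per-predictor contributions and the model totals (206 positives, 1430 negatives) transcribed from the tables above:

```python
import math

# Per-predictor log odds contributions, transcribed from the table above
contributions = [
    -1.145923, -0.819651, -0.376415, 0.300922, -0.322731, 0.110531,
    0.244642, -0.285516, -0.470766, 0.679600, -0.099690, -0.216664,
    -0.050204, -0.270925, 0.973755, -0.122691, -0.227166, 0.130525,
    -0.520104, 0.101486, -0.178813, -0.267751, -0.013291, -0.821986,
    0.849243, -0.079718, -0.079718, -0.090492, -0.335590, -0.077902,
    0.055802, -0.211919, -0.017224, 0.027864, -0.231670, 0.268402,
]
total_positives, total_negatives = 206, 1430  # model totals, as above

prior = math.log(1 + total_positives) - math.log(1 + total_negatives)
score = (prior + sum(contributions)) / (1 + len(contributions))
# score is approximately -0.149329, matching the Final Score row
```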

Classifier

The success rate is defined as \(\frac{positives}{positives+negatives}\) per bin.

The adjusted propensity that is returned is a small modification (Laplace smoothing) to this and calculated as \(\frac{0.5+positives}{1+positives+negatives}\) so empty models return a propensity of 0.5.

[15]:
  Bin Positives Negatives Cum. Total (%) Propensity (%) Adjusted Propensity (%) Cum Positives (%) ZRatio Lift(%)
Index                  
1 <-0.21 17 443 28.117361 3.695652 3.796095 8.252427 -9.994484 1.956662
2 [-0.21, -0.185> 8 133 36.735943 5.673759 5.985916 12.135922 -3.495416 3.003971
3 [-0.185, -0.175> 3 48 39.853302 5.882353 6.730770 13.592233 -1.977473 3.114411
4 [-0.175, -0.105> 28 370 64.180931 7.035176 7.142858 27.184465 -4.628075 3.724772
5 [-0.105, -0.095> 4 51 67.542793 7.272727 8.035714 29.126215 -1.505372 3.850544
6 [-0.095, -0.09> 2 19 68.826408 9.523809 11.363637 30.097088 -0.478811 5.042379
7 [-0.09, -0.065> 9 77 74.083130 10.465117 10.919540 34.466019 -0.657755 5.540754
8 [-0.065, -0.02> 30 154 85.330078 16.304348 16.486486 49.029125 1.464400 8.632335
9 [-0.02, 0.03> 37 65 91.564796 36.274509 36.407764 66.990295 4.913029 19.205532
10 [0.03, 0.06> 20 29 94.559906 40.816326 41.000000 76.699028 3.664015 21.610197
11 [0.06, 0.12> 30 33 98.410759 47.619049 47.656250 91.262138 4.922851 25.211897
12 [0.12, 0.125> 2 2 98.655258 50.000000 50.000000 92.233009 1.203876 26.472490
13 [0.125, 0.13> 4 2 99.022003 66.666672 64.285713 94.174759 1.864405 35.296654
14 [0.13, 0.995> 8 3 99.694382 72.727272 70.833328 98.058250 2.718192 38.505444
15 >=0.995 4 1 100.000000 80.000000 75.000000 100.000000 1.941841 42.355988
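
As a sanity check, the success rate and adjusted propensity of the first classifier bin follow directly from its 17 positives and 443 negatives:

```python
pos, neg = 17, 443  # first classifier bin in the table above

propensity = pos / (pos + neg)            # success rate
adjusted = (0.5 + pos) / (1 + pos + neg)  # with Laplace smoothing

print(round(propensity * 100, 6))  # 3.695652
print(round(adjusted * 100, 6))    # 3.796095
```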

Final Propensity

Below is the classifier mapping. The x-axis shows the binned scores (log odds values), the y-axis the propensity. Note that the returned propensities follow the slightly adjusted formula shown in the table above. The bin that contains the calculated final score is highlighted.
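
Looking up the final score in the classifier bins can be sketched as follows; the bin boundaries and adjusted propensities are transcribed from the classifier table above (lower bound inclusive, upper bound exclusive):

```python
import bisect

# Upper bin boundaries and adjusted propensities (%) from the classifier table
upper_bounds = [-0.21, -0.185, -0.175, -0.105, -0.095, -0.09, -0.065,
                -0.02, 0.03, 0.06, 0.12, 0.125, 0.13, 0.995, float("inf")]
adjusted_propensity = [3.796095, 5.985916, 6.730770, 7.142858, 8.035714,
                       11.363637, 10.919540, 16.486486, 36.407764, 41.000000,
                       47.656250, 50.000000, 64.285713, 70.833328, 75.000000]

score = -0.149329  # the final score calculated earlier
bin_index = bisect.bisect_right(upper_bounds, score)
print(bin_index + 1, adjusted_propensity[bin_index])  # bin 4: 7.142858
```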
