ADM Explained

Pega

2023-03-15

This notebook shows exactly how all the values in an ADM model report are calculated. It also shows how the propensity is calculated for a particular customer.

We use one of the shipped datamart exports for the example. This is a model very similar to one used in some of the ADM PowerPoint/Excel deep dive examples. You can change this notebook to apply to your own data.

[3]:
model_name = "AutoNew84Months"
predictor_name = "Customer.NetWealth"
channel = "Web"

For the example we pick one particular model over a channel. To explain the ADM model report, we use one of the active predictors as an example. Swap for any other predictor when using different data.

[4]:
# Imports assumed from earlier in the notebook
import polars as pl
from pdstools import datasets

dm = datasets.CDHSample(subset=False)

model = dm.combinedData.filter(
    (pl.col("Name") == model_name) & (pl.col("Channel") == channel)
)

modelpredictors = (
    dm.combinedData.join(
        model.select(pl.col("ModelID").unique()), on="ModelID", how="inner"
    )
    .filter(pl.col("EntryType") != "Inactive")
    .with_columns(Action=pl.concat_str(["Issue", "Group"], separator="/"),
                  PredictorName=pl.col("PredictorName").cast(pl.Utf8))
    .collect()
)

predictorbinning = modelpredictors.filter(
    pl.col("PredictorName") == predictor_name
).sort("BinIndex")

Model Overview

The selected model is shown below. Only the currently active predictors are used for the propensity calculation, so only those are shown.

[6]:
Values
Action Sales/AutoLoans
Channel Web
Name AutoNew84Months
Active Predictors [Classifier, Customer.Age, Customer.AnnualIncome, Customer.BusinessSegment, Customer.CLV, Customer.CLV_VALUE, Customer.CreditScore, Customer.Date_of_Birth, Customer.Gender, Customer.MaritalStatus, Customer.NetWealth, Customer.NoOfDependents, Customer.Prefix, Customer.RelationshipStartDate, Customer.RiskCode, Customer.WinScore, Customer.pyCountry, IH.Email.Outbound.Accepted.pxLastGroupID, IH.Email.Outbound.Accepted.pxLastOutcomeTime.DaysSince, IH.Email.Outbound.Accepted.pyHistoricalOutcomeCount, IH.Email.Outbound.Churned.pyHistoricalOutcomeCount, IH.Email.Outbound.Loyal.pxLastOutcomeTime.DaysSince, IH.Email.Outbound.Rejected.pyHistoricalOutcomeCount, IH.SMS.Outbound.Accepted.pxLastGroupID, IH.SMS.Outbound.Accepted.pyHistoricalOutcomeCount, IH.SMS.Outbound.Churned.pxLastOutcomeTime.DaysSince, IH.SMS.Outbound.Loyal.pxLastOutcomeTime.DaysSince, IH.SMS.Outbound.Loyal.pyHistoricalOutcomeCount, IH.SMS.Outbound.Rejected.pxLastGroupID, IH.SMS.Outbound.Rejected.pyHistoricalOutcomeCount, IH.Web.Inbound.Accepted.pxLastGroupID, IH.Web.Inbound.Accepted.pyHistoricalOutcomeCount, IH.Web.Inbound.Loyal.pxLastGroupID, IH.Web.Inbound.Loyal.pyHistoricalOutcomeCount, IH.Web.Inbound.Rejected.pxLastGroupID, IH.Web.Inbound.Rejected.pyHistoricalOutcomeCount, Param.ExtGroupCreditcards]
Model Performance (AUC) 77.4901

Binning of the selected Predictor

The Model Report in Prediction Studio for this model will have a predictor binning plot like below.

All numbers can be derived from just the number of positives and negatives in each bin that are stored in the ADM Data Mart. The next sections will show exactly how that is done.

Value
Predictor Name Customer.NetWealth
# Responses 1636
# Bins 8
Predictor Performance (AUC) 72.2077
[8]:
Responses (%) Positives Positives (%) Negatives Negatives (%) Propensity (%) ZRatio Lift
Range/Symbol
<11684.56 0.267 13.0 0.063 423.0 0.296 0.0298 -11.186877 0.236795
[11684.56, 13732.56> 0.123 24.0 0.117 178.0 0.124 0.1188 -0.332147 0.943574
[13732.56, 16845.52> 0.163 17.0 0.083 250.0 0.175 0.0637 -4.264671 0.505654
[16845.52, 19139.28> 0.141 51.0 0.248 179.0 0.125 0.2217 3.908162 1.760996
[19139.28, 20286.16> 0.055 7.0 0.034 83.0 0.058 0.0778 -1.711776 0.617692
[20286.16, 22743.76> 0.136 53.0 0.257 169.0 0.118 0.2387 4.397646 1.896003
[22743.76, 23890.64> 0.055 13.0 0.063 77.0 0.054 0.1444 0.515565 1.147141
>=23890.64 0.061 28.0 0.136 71.0 0.050 0.2828 3.512888 2.246151
Total 1.001 206.0 1.001 1430.0 1.000 1.1777 -5.161210 9.354007

Bin Statistics

Positive and Negative ratios

Internally, ADM only keeps track of the total counts of positive and negative responses in each bin. Everything else is derived from those numbers. The percentages and totals are trivially derived, and the propensity is just the number of positives divided by the total. The numbers calculated here match the numbers from the datamart table exactly.

[9]:
# Polars expressions over the raw datamart bin counts, reused below
BinPositives = pl.col("BinPositives")
BinNegatives = pl.col("BinNegatives")
sumPositives = pl.sum("BinPositives")
sumNegatives = pl.sum("BinNegatives")

binningDerived = predictorbinning.select(
    pl.col("BinSymbol").alias("Range/Symbol"),
    BinPositives.alias("Positives"),
    BinNegatives.alias("Negatives"),
    (((BinPositives + BinNegatives) / (sumPositives + sumNegatives)) * 100)
    .round(2)
    .alias("Responses %"),
    ((BinPositives / sumPositives) * 100).round(2).alias("Positives %"),
    ((BinNegatives / sumNegatives) * 100).round(2).alias("Negatives %"),
    (BinPositives / (BinPositives + BinNegatives)).round(4).alias("Propensity"),
)
binningDerived.to_pandas(use_pyarrow_extension_array=True).set_index("Range/Symbol").style.format(
    format_binning_derived
).set_properties(
    color="#0000FF", subset=["Responses %", "Positives %", "Negatives %", "Propensity"]
)
[9]:
  Positives Negatives Responses % Positives % Negatives % Propensity
Range/Symbol            
<11684.56 13 423 26.65 6.31 29.58 0.0298
[11684.56, 13732.56> 24 178 12.35 11.65 12.45 0.1188
[13732.56, 16845.52> 17 250 16.32 8.25 17.48 0.0637
[16845.52, 19139.28> 51 179 14.06 24.76 12.52 0.2217
[19139.28, 20286.16> 7 83 5.50 3.40 5.80 0.0778
[20286.16, 22743.76> 53 169 13.57 25.73 11.82 0.2387
[22743.76, 23890.64> 13 77 5.50 6.31 5.38 0.1444
>=23890.64 28 71 6.05 13.59 4.97 0.2828

Lift

Lift is the ratio of the propensity in a particular bin to the average propensity. A value of 1 means average; above 1 means a higher than average propensity, below 1 a lower one:

[10]:
Positives = pl.col("Positives")
Negatives = pl.col("Negatives")
sumPositives = pl.sum("Positives")
sumNegatives = pl.sum("Negatives")
binningDerived.select(
    "Range/Symbol",
    "Positives",
    "Negatives",
    (
        (Positives / (Positives + Negatives))
        / (sumPositives / (Positives + Negatives).sum())
    ).alias("Lift"),
).to_pandas().set_index("Range/Symbol").style.format(format_lift).set_properties(
    **{"color": "blue"}, subset=["Lift"]
)
[10]:
  Positives Negatives Lift
Range/Symbol      
<11684.56 13 423 0.2368
[11684.56, 13732.56> 24 178 0.9436
[13732.56, 16845.52> 17 250 0.5057
[16845.52, 19139.28> 51 179 1.7610
[19139.28, 20286.16> 7 83 0.6177
[20286.16, 22743.76> 53 169 1.8960
[22743.76, 23890.64> 13 77 1.1471
>=23890.64 28 71 2.2462

Z-Ratio

The Z-Ratio is also a measure of how the propensity in a bin differs from the average, but it takes the size of the bin into account and is thus statistically more relevant. It represents the number of standard deviations from the average, so it centers around 0. The wider the spread, the better the predictor is.

\[\frac{posFraction-negFraction}{\sqrt{\frac{posFraction*(1-posFraction)}{\sum positives}+\frac{negFraction*(1-negFraction)}{\sum negatives}}}\]

See the calculation below, which is also included in cdh_utils’ zRatio().

[11]:
def zRatio(
    posCol: pl.Expr = pl.col("BinPositives"), negCol: pl.Expr = pl.col("BinNegatives")
) -> pl.Expr:
    def getFracs(posCol=pl.col("BinPositives"), negCol=pl.col("BinNegatives")):
        return posCol / posCol.sum(), negCol / negCol.sum()

    def zRatioimpl(
        posFractionCol=pl.col("posFraction"),
        negFractionCol=pl.col("negFraction"),
        PositivesCol=pl.sum("BinPositives"),
        NegativesCol=pl.sum("BinNegatives"),
    ):
        return (
            (posFractionCol - negFractionCol)
            / (
                (posFractionCol * (1 - posFractionCol) / PositivesCol)
                + (negFractionCol * (1 - negFractionCol) / NegativesCol)
            ).sqrt()
        ).alias("ZRatio")

    return zRatioimpl(*getFracs(posCol, negCol), posCol.sum(), negCol.sum())


binningDerived.select(
    "Range/Symbol", "Positives", "Negatives", "Positives %", "Negatives %"
).with_columns(zRatio(Positives, Negatives)).to_pandas().set_index("Range/Symbol").style.format(
    format_z_ratio
).set_properties(
    **{"color": "blue"}, subset=["ZRatio"]
)
[11]:
  Positives Negatives Positives % Negatives % ZRatio
Range/Symbol          
<11684.56 13 423 6.31 29.58 -11.1869
[11684.56, 13732.56> 24 178 11.65 12.45 -0.3321
[13732.56, 16845.52> 17 250 8.25 17.48 -4.2647
[16845.52, 19139.28> 51 179 24.76 12.52 3.9082
[19139.28, 20286.16> 7 83 3.40 5.80 -1.7118
[20286.16, 22743.76> 53 169 25.73 11.82 4.3976
[22743.76, 23890.64> 13 77 6.31 5.38 0.5156
>=23890.64 28 71 13.59 4.97 3.5129

Predictor AUC

The predictor AUC is the univariate performance of this predictor against the outcome. This too can be derived from the positives and negatives in each bin; pdstools provides a convenient function, cdh_utils.auc_from_bincounts(), to calculate it directly from the bin counts.

[12]:
import numpy as np
import plotly.express as px
from pdstools.utils import cdh_utils  # import path in recent pdstools versions

pos = binningDerived.get_column("Positives").to_numpy()
neg = binningDerived.get_column("Negatives").to_numpy()
probs = binningDerived.get_column("Propensity").to_numpy()
order = np.argsort(probs)

FPR = np.cumsum(neg[order]) / np.sum(neg[order])
TPR = np.cumsum(pos[order]) / np.sum(pos[order])
TPR = np.insert(TPR, 0, 0, axis=0)
FPR = np.insert(FPR, 0, 0, axis=0)
# If the curve lies below the diagonal, swap the axes so the AUC >= 0.5
if TPR[1] < 1 - FPR[1]:
    FPR, TPR = TPR, FPR
auc = cdh_utils.auc_from_bincounts(pos=pos, neg=neg, probs=probs)

fig = px.line(
    x=[1-x for x in FPR], y=TPR,
    labels=dict(x='Specificity', y='Sensitivity'),
    title = f"AUC = {auc.round(3)}",
    width=700, height=700,
    range_x=[1,0],
    template='none'
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=1, x1=0, y0=0, y1=1
)
fig.show()

Naive Bayes and Log Odds

The basis for the Naive Bayes algorithm is Bayes’ Theorem:

\[p(C_k|x) = \frac{p(x|C_k)*p(C_k)}{p(x)}\]

with \(C_k\) the outcome and \(x\) the customer. Bayes’ theorem turns the question “what’s the probability to accept this action given a customer” around to “what’s the probability of this customer given an action”. With the independence assumption, and after applying a log odds transformation we get a log odds score that can be calculated efficiently and in a numerically stable manner:

\[log\ odds\ score = \sum_{p\ \in\ active\ predictors}log(p(x_p|Positive)) + log(p_{positive}) - \sum_plog(p(x_p|Negative)) - log(p_{negative})\]

Note that the prior can be written as:

\[log(p_{positive}) - log(p_{negative}) = log(\frac{TotalPositives}{Total})-log(\frac{TotalNegatives}{Total}) = log(TotalPositives) - log(TotalNegatives)\]
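
This identity is easy to verify numerically. A quick check, using the totals from the predictor binning above (206 positives, 1430 negatives):

```python
import math

# Both sides of the prior identity: the shared log(Total) terms cancel
TP, TN = 206, 1430
total = TP + TN
lhs = math.log(TP / total) - math.log(TN / total)
rhs = math.log(TP) - math.log(TN)
assert math.isclose(lhs, rhs)
```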

Predictor Contribution

The contribution (conditional log odds) of an active predictor \(p\) for bin \(i\), with the numbers of positive and negative responses in \(Positives_i\) and \(Negatives_i\), is calculated as (note the Laplace smoothing to avoid taking the log of 0):

\[contribution_p = \log(Positives_i+\frac{1}{nBins}) - \log(Negatives_i+\frac{1}{nBins}) - \log(1+\sum_{i=1}^{nBins}{Positives_i}) + \log(1+\sum_{i=1}^{nBins}{Negatives_i})\]
[13]:
N = binningDerived.shape[0]
binningDerived.with_columns(
    LogOdds=(pl.col("Positives %") / pl.col("Negatives %")).log(),
    ModifiedLogOdds=(
        ((Positives + 1 / N).log() - (Positives.sum() + 1).log())
        - ((Negatives + 1 / N).log() - (Negatives.sum() + 1).log())
    ),
).drop("Responses %", "Propensity").to_pandas().set_index("Range/Symbol").style.format(
    format_log_odds
).set_properties(
    **{"color": "blue"}, subset=["LogOdds", "ModifiedLogOdds"]
)
[13]:
  Positives Negatives Positives % Negatives % LogOdds ModifiedLogOdds
Range/Symbol            
<11684.56 13 423 6.31 29.580000 -1.544963 -1.5397
[11684.56, 13732.56> 24 178 11.65 12.450000 -0.066414 -0.0658
[13732.56, 16845.52> 17 250 8.25 17.480000 -0.750844 -0.7480
[16845.52, 19139.28> 51 179 24.76 12.520000 0.681902 0.6796
[19139.28, 20286.16> 7 83 3.40 5.800000 -0.534083 -0.5233
[20286.16, 22743.76> 53 169 25.73 11.820000 0.777865 0.7754
[22743.76, 23890.64> 13 77 6.31 5.380000 0.159447 0.1625
>=23890.64 28 71 13.59 4.970000 1.005915 1.0056

Propensity mapping

Log odds contribution for all the predictors

The final score is loosely referred to as “the average contribution”, but is in fact a little more nuanced; it is calculated as:

\[score = \frac{\log(1 + TotalPositives) - \log(1 + TotalNegatives) + \sum_p contribution_p}{1 + nActivePredictors}\]

Here, \(TotalPositives\) and \(TotalNegatives\) are the total number of positive and negative responses to the model.

Below is an example. From all the active predictors of the model we pick a value (the middle bin for numerics, the first symbol for symbolics) and show the (modified) log odds. The final score is calculated per the formula above, and this is the value that is mapped to a propensity by the classifier (which is constructed using the PAV(A) algorithm).

[14]:
  Value Bin Positives Negatives LogOdds
PredictorName          
Customer.Age 34.56 4.000000 9.000000 198.000000 -1.145923
Customer.AnnualIncome -24043.049 1.000000 74.000000 1166.000000 -0.819651
Customer.BusinessSegment middleSegmentPlus 1.000000 96.000000 970.000000 -0.376415
Customer.CLV NON-MISSING 1.000000 111.000000 570.000000 0.300922
Customer.CLV_VALUE 1345.52 4.000000 31.000000 297.000000 -0.322731
Customer.CreditScore 518.92 3.000000 33.000000 205.000000 0.110531
Customer.Date_of_Birth 18773.504 5.000000 28.000000 152.000000 0.244642
Customer.Gender U 1.000000 52.000000 481.000000 -0.285516
Customer.MaritalStatus No Resp+ 1.000000 67.000000 745.000000 -0.470766
Customer.NetWealth 17992.398 4.000000 51.000000 179.000000 0.679600
Customer.NoOfDependents 0.0 1.000000 111.000000 850.000000 -0.099690
Customer.Prefix Mrs. 1.000000 64.000000 552.000000 -0.216664
Customer.RelationshipStartDate 1426.4596 4.000000 16.000000 117.000000 -0.050204
Customer.RiskCode R4 1.000000 36.000000 329.000000 -0.270925
Customer.WinScore 66.600006 4.000000 39.000000 102.000000 0.973755
Customer.pyCountry USA 1.000000 99.000000 776.000000 -0.122691
IH.Email.Outbound.Accepted.pxLastGroupID HomeLoans 3.000000 25.000000 218.000000 -0.227166
IH.Email.Outbound.Accepted.pxLastOutcomeTime.DaysSince -55.88436 2.000000 145.000000 881.000000 0.130525
IH.Email.Outbound.Accepted.pyHistoricalOutcomeCount 1.5 2.000000 30.000000 351.000000 -0.520104
IH.Email.Outbound.Churned.pyHistoricalOutcomeCount None 1.000000 143.000000 898.000000 0.101486
IH.Email.Outbound.Loyal.pxLastOutcomeTime.DaysSince None 1.000000 129.000000 1071.000000 -0.178813
IH.Email.Outbound.Rejected.pyHistoricalOutcomeCount 83.16 3.000000 24.000000 218.000000 -0.267751
IH.SMS.Outbound.Accepted.pxLastGroupID Account 4.000000 45.000000 316.000000 -0.013291
IH.SMS.Outbound.Accepted.pyHistoricalOutcomeCount 9.02 4.000000 6.000000 96.000000 -0.821986
IH.SMS.Outbound.Churned.pxLastOutcomeTime.DaysSince -20.5537 2.000000 9.000000 27.000000 0.849243
IH.SMS.Outbound.Loyal.pxLastOutcomeTime.DaysSince None 1.000000 165.000000 1240.000000 -0.079718
IH.SMS.Outbound.Loyal.pyHistoricalOutcomeCount None 1.000000 165.000000 1240.000000 -0.079718
IH.SMS.Outbound.Rejected.pxLastGroupID Account 2.000000 47.000000 357.000000 -0.090492
IH.SMS.Outbound.Rejected.pyHistoricalOutcomeCount 102.72 4.000000 12.000000 117.000000 -0.335590
IH.Web.Inbound.Accepted.pxLastGroupID DepositAccounts 3.000000 53.000000 397.000000 -0.077902
IH.Web.Inbound.Accepted.pyHistoricalOutcomeCount 11.04 5.000000 25.000000 164.000000 0.055802
IH.Web.Inbound.Loyal.pxLastGroupID MISSING 1.000000 100.000000 857.000000 -0.211919
IH.Web.Inbound.Loyal.pyHistoricalOutcomeCount 4.52 3.000000 30.000000 212.000000 -0.017224
IH.Web.Inbound.Rejected.pxLastGroupID Account 2.000000 81.000000 546.000000 0.027864
IH.Web.Inbound.Rejected.pyHistoricalOutcomeCount 111.08 4.000000 35.000000 306.000000 -0.231670
Param.ExtGroupCreditcards NON-MISSING 1.000000 136.000000 721.000000 0.268402
Final Score None nan nan nan -0.149329
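
The Final Score row can be reproduced with the formula above. A minimal sketch, with the per-predictor contributions and the model totals (206 positives, 1430 negatives) transcribed from the tables above:

```python
import math

# Per-predictor log odds contributions, transcribed from the table above
contributions = [
    -1.145923, -0.819651, -0.376415, 0.300922, -0.322731, 0.110531,
    0.244642, -0.285516, -0.470766, 0.679600, -0.099690, -0.216664,
    -0.050204, -0.270925, 0.973755, -0.122691, -0.227166, 0.130525,
    -0.520104, 0.101486, -0.178813, -0.267751, -0.013291, -0.821986,
    0.849243, -0.079718, -0.079718, -0.090492, -0.335590, -0.077902,
    0.055802, -0.211919, -0.017224, 0.027864, -0.231670, 0.268402,
]
total_positives, total_negatives = 206, 1430  # model totals, as above

prior = math.log(1 + total_positives) - math.log(1 + total_negatives)
score = (prior + sum(contributions)) / (1 + len(contributions))
# score is approximately -0.149329, matching the Final Score row
```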

Classifier

The success rate is defined as \(\frac{positives}{positives+negatives}\) per bin.

The adjusted propensity that is returned is a small modification (Laplace smoothing) to this and calculated as \(\frac{0.5+positives}{1+positives+negatives}\) so empty models return a propensity of 0.5.

[15]:
  Bin Positives Negatives Cum. Total (%) Propensity (%) Adjusted Propensity (%) Cum Positives (%) ZRatio Lift(%)
Index                  
1 <-0.21 17 443 28.117361 3.695652 3.796095 8.252427 -9.994484 1.956662
2 [-0.21, -0.185> 8 133 36.735943 5.673759 5.985916 12.135922 -3.495416 3.003971
3 [-0.185, -0.175> 3 48 39.853302 5.882353 6.730770 13.592233 -1.977473 3.114411
4 [-0.175, -0.105> 28 370 64.180931 7.035176 7.142858 27.184465 -4.628075 3.724772
5 [-0.105, -0.095> 4 51 67.542793 7.272727 8.035714 29.126215 -1.505372 3.850544
6 [-0.095, -0.09> 2 19 68.826408 9.523809 11.363637 30.097088 -0.478811 5.042379
7 [-0.09, -0.065> 9 77 74.083130 10.465117 10.919540 34.466019 -0.657755 5.540754
8 [-0.065, -0.02> 30 154 85.330078 16.304348 16.486486 49.029125 1.464400 8.632335
9 [-0.02, 0.03> 37 65 91.564796 36.274509 36.407764 66.990295 4.913029 19.205532
10 [0.03, 0.06> 20 29 94.559906 40.816326 41.000000 76.699028 3.664015 21.610197
11 [0.06, 0.12> 30 33 98.410759 47.619049 47.656250 91.262138 4.922851 25.211897
12 [0.12, 0.125> 2 2 98.655258 50.000000 50.000000 92.233009 1.203876 26.472490
13 [0.125, 0.13> 4 2 99.022003 66.666672 64.285713 94.174759 1.864405 35.296654
14 [0.13, 0.995> 8 3 99.694382 72.727272 70.833328 98.058250 2.718192 38.505444
15 >=0.995 4 1 100.000000 80.000000 75.000000 100.000000 1.941841 42.355988
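
As a sanity check, the success rate and adjusted propensity of the first classifier bin follow directly from its 17 positives and 443 negatives:

```python
pos, neg = 17, 443  # first classifier bin in the table above

propensity = pos / (pos + neg)            # success rate
adjusted = (0.5 + pos) / (1 + pos + neg)  # with Laplace smoothing

print(round(propensity * 100, 6))  # 3.695652
print(round(adjusted * 100, 6))    # 3.796095
```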

Final Propensity

Below is the classifier mapping. The x-axis shows the binned scores (log odds values), the y-axis the propensity. Note that the returned propensities follow the slightly adjusted formula shown in the table above. The bin that contains the calculated final score is highlighted.
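
Looking up the final score in the classifier bins can be sketched as follows; the bin boundaries and adjusted propensities are transcribed from the classifier table above (lower bound inclusive, upper bound exclusive):

```python
import bisect

# Upper bin boundaries and adjusted propensities (%) from the classifier table
upper_bounds = [-0.21, -0.185, -0.175, -0.105, -0.095, -0.09, -0.065,
                -0.02, 0.03, 0.06, 0.12, 0.125, 0.13, 0.995, float("inf")]
adjusted_propensity = [3.796095, 5.985916, 6.730770, 7.142858, 8.035714,
                       11.363637, 10.919540, 16.486486, 36.407764, 41.000000,
                       47.656250, 50.000000, 64.285713, 70.833328, 75.000000]

score = -0.149329  # the final score calculated earlier
bin_index = bisect.bisect_right(upper_bounds, score)
print(bin_index + 1, adjusted_propensity[bin_index])  # bin 4: 7.142858
```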
