HDS (Historical Data Set) Analysis and Finding new Features¶
This notebook demonstrates how to work with a sample HDS dataset, explore its structure, and build a simple XGBoost model to show feature importance.
It then shows how to combine this data with a sample dataset from the Data Lake and analyze which features could be worth pulling into Pega to improve the models.
[1]:
import polars as pl
import polars.selectors as cs
import plotly.express as px
import requests, zipfile, io
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, roc_auc_score
from pdstools.utils import cdh_utils
Data Read¶
Read the HDS data. Depending on how the data was extracted and stored, you may need different reading methods using zipfile and/or Polars.
We’re also casting the data to the appropriate types and dropping some features that shouldn’t be used for models.
[2]:
if Path("../../data/hds.zip").exists():
    archive = zipfile.ZipFile("../../data/hds.zip", "r")
else:
    # Download the dataset if it does not exist locally
    archive = zipfile.ZipFile(
        io.BytesIO(
            requests.get(
                "https://github.com/pegasystems/pega-datascientist-tools/raw/master/data/hds.zip"
            ).content
        )
    )
hds_data = (
    pl.concat([pl.read_ndjson(archive.open(f)) for f in archive.namelist()])
    .rename({"Customer_C_CIFNBR": "Customer_ID"})
    .with_columns(
        cdh_utils.parse_pega_date_time_formats("Decision_DecisionTime"),
        cdh_utils.parse_pega_date_time_formats("Decision_OutcomeTime"),
        cs.ends_with("_DaysSince", "_pyHistoricalOutcomeCount").cast(pl.Float64),
        pl.col(
            [
                "Customer_NetWealth",
                "Customer_CreditScore",
                "Customer_CLV_VALUE",
                "Customer_RelationshipStartDate",
                "Customer_Date_of_Birth",
                "Customer_NoOfDependents",
            ]
        ).cast(pl.Float64),
        cs.starts_with("Param_ExtGroup").cast(pl.Float64),
    )
    .drop(["Customer_Gender", "Customer_Prefix"])
    .sort(["Customer_ID", "Decision_DecisionTime"])
)
hds_data.describe()
hds_data.describe()
[2]:
(Output truncated: a wide summary-statistics table with count, null_count, mean, std, min, 25%, 50%, 75% and max for each of the roughly 80 columns in the HDS dataset.)
Available Fields in the HDS dataset¶
The HDS data contains all the payload sent to the ADM models (over a period of time) plus the outcomes (Accepted/Declined/Clicked etc). There are a few categories of fields, which can be identified by their prefix:
“Customer” fields, representing the fields/predictors configured in ADM
“Context” fields, these are Channel/Direction/Issue/Group/Name, the usual “context identifiers” of the ADM models
“IH” fields, these are Pega-generated fields derived from Interaction History
Optional “Param” fields, also user defined fields/predictors, but configured in the strategies, rather than defined in the ADM model configuration
Meta information about the decisions is in the “Decision” and internal fields, which contain the time of the decision, the sample size, etc. These are not used in the models.
[3]:
hds_data_dictionary = (
    pl.DataFrame(
        {
            "Field": hds_data.schema.names(),
            "Numeric": [x.is_numeric() for x in hds_data.schema.dtypes()],
        }
    )
    .with_columns(
        Category=pl.when(pl.col("Field").str.contains("_", literal=True))
        .then(pl.col("Field").str.replace(r"([^_]+)_.*", "${1}"))
        .otherwise(pl.lit("Internal"))
    )
    .sort("Category")
)
hds_data_dictionary.to_pandas().style.hide()
[3]:
Field | Numeric | Category |
---|---|---|
Context_Group | False | Context |
Context_Channel | False | Context |
Context_Name | False | Context |
Context_Direction | False | Context |
Context_Issue | False | Context |
Customer_AnnualIncome | False | Customer |
Customer_IsCustomerActive | False | Customer |
Customer_ID | False | Customer |
Customer_ReviewDate | False | Customer |
Customer_CreditScore | True | Customer |
Customer_City | False | Customer |
Customer_pyRegion | False | Customer |
Customer_WinScore | False | Customer |
Customer_Deceased | False | Customer |
Customer_BalanceTransaction | False | Customer |
Customer_TotalLiabilities | False | Customer |
Customer_LastReviewedDate | False | Customer |
Customer_NextReviewDate | False | Customer |
Customer_pyCountry | False | Customer |
Customer_Age | False | Customer |
Customer_NaturalDisaster | False | Customer |
Customer_Incarceration | False | Customer |
Customer_MaritalStatus | False | Customer |
Customer_RiskCode | False | Customer |
Customer_RelationshipStartDate | True | Customer |
Customer_CLV | False | Customer |
Customer_BusinessSegment | False | Customer |
Customer_TotalAssets | False | Customer |
Customer_HealthMatter | False | Customer |
Customer_NoOfDependents | True | Customer |
Customer_MilitaryService | False | Customer |
Customer_CLV_VALUE | True | Customer |
Customer_RiskScore | False | Customer |
Customer_Date_of_Birth | True | Customer |
Customer_ResidentialStatus | False | Customer |
Customer_NetWealth | True | Customer |
Customer_OrganizationLabel | False | Customer |
Customer_InCollections | False | Customer |
Customer_Bankruptcy | False | Customer |
Customer_State | False | Customer |
Decision_DecisionTime | False | Decision |
Decision_Rank | False | Decision |
Decision_InteractionID | False | Decision |
Decision_OutcomeTime | False | Decision |
Decision_Outcome | False | Decision |
Decision_SubjectID | False | Decision |
IH_SMS_Outbound_Rejected_pxLastOutcomeTime_DaysSince | True | IH |
IH_Web_Inbound_Churned_pyHistoricalOutcomeCount | True | IH |
IH_Web_Inbound_Loyal_pxLastGroupID | False | IH |
IH_Email_Outbound_Accepted_pyHistoricalOutcomeCount | True | IH |
IH_Email_Outbound_Accepted_pxLastOutcomeTime_DaysSince | True | IH |
IH_SMS_Outbound_Churned_pxLastGroupID | False | IH |
IH_SMS_Outbound_Accepted_pxLastOutcomeTime_DaysSince | True | IH |
IH_Email_Outbound_Rejected_pxLastOutcomeTime_DaysSince | True | IH |
IH_SMS_Outbound_Rejected_pxLastGroupID | False | IH |
IH_Web_Inbound_Rejected_pxLastOutcomeTime_DaysSince | True | IH |
IH_Web_Inbound_Churned_pxLastOutcomeTime_DaysSince | True | IH |
IH_Email_Outbound_Loyal_pxLastOutcomeTime_DaysSince | True | IH |
IH_Web_Inbound_Rejected_pxLastGroupID | False | IH |
IH_SMS_Outbound_Rejected_pyHistoricalOutcomeCount | True | IH |
IH_SMS_Outbound_Accepted_pyHistoricalOutcomeCount | True | IH |
IH_Web_Inbound_Accepted_pxLastOutcomeTime_DaysSince | True | IH |
IH_Email_Outbound_Churned_pyHistoricalOutcomeCount | True | IH |
IH_SMS_Outbound_Loyal_pxLastOutcomeTime_DaysSince | True | IH |
IH_Web_Inbound_Loyal_pxLastOutcomeTime_DaysSince | True | IH |
IH_SMS_Outbound_Churned_pxLastOutcomeTime_DaysSince | True | IH |
IH_Web_Inbound_Loyal_pyHistoricalOutcomeCount | True | IH |
IH_Web_Inbound_Rejected_pyHistoricalOutcomeCount | True | IH |
IH_Email_Outbound_Rejected_pxLastGroupID | False | IH |
IH_Web_Inbound_Accepted_pxLastGroupID | False | IH |
IH_Web_Inbound_Accepted_pyHistoricalOutcomeCount | True | IH |
IH_Email_Outbound_Churned_pxLastOutcomeTime_DaysSince | True | IH |
IH_Email_Outbound_Accepted_pxLastGroupID | False | IH |
IH_SMS_Outbound_Loyal_pxLastGroupID | False | IH |
IH_Email_Outbound_Rejected_pyHistoricalOutcomeCount | True | IH |
IH_SMS_Outbound_Churned_pyHistoricalOutcomeCount | True | IH |
IH_Email_Outbound_Loyal_pyHistoricalOutcomeCount | True | IH |
IH_SMS_Outbound_Accepted_pxLastGroupID | False | IH |
IH_Email_Outbound_Loyal_pxLastGroupID | False | IH |
IH_Email_Outbound_Churned_pxLastGroupID | False | IH |
IH_Web_Inbound_Churned_pxLastGroupID | False | IH |
IH_SMS_Outbound_Loyal_pyHistoricalOutcomeCount | True | IH |
id | False | Internal |
negativeSampling | False | Internal |
positiveSampling | False | Internal |
dataCenter | False | Internal |
rulesetVersion | False | Internal |
rulesetName | False | Internal |
Param_ExtGroupCreditcards | True | Param |
Param_ExtGroupAutoloans | True | Param |
Param_ExtGroupWealthoffers | True | Param |
[4]:
category_counts = (
    hds_data_dictionary
    .group_by("Category", "Numeric")
    .agg(Count=pl.len())
    .sort("Category")
)
fig = px.bar(
    category_counts,
    y="Category",
    x="Count",
    title="Number of Fields by Category",
    color="Numeric",
    text="Count",
    orientation="h",
)
fig.update_layout(
    yaxis_title="Field Category",
    xaxis_title="Number of Fields",
)
fig.show()
(Interactive Plotly figure: “Number of Fields by Category”; not rendered in this export.)
Create XGBoost model¶
From this HDS data we create a simple XGBoost model over all channels and all actions, not split up per model context the way ADM does. This could be changed, but the goal here is to get a list of features ranked by importance, not to create a model that outperforms ADM.
First, there is some data preparation: label encoding of the categorical features and of the target.
[5]:
def data_prep(data, data_dictionary):
    categorical_fields = (
        data_dictionary.filter(~pl.col("Numeric"))
        .filter((pl.col("Category") != "Internal") & (pl.col("Category") != "Decision"))
        .select("Field")
        .to_series()
        .to_list()
    )
    numerical_fields = (
        data_dictionary.filter(pl.col("Numeric"))
        .filter((pl.col("Category") != "Internal") & (pl.col("Category") != "Decision"))
        .select("Field")
        .to_series()
        .to_list()
    )
    print(f"Categorical fields: {categorical_fields}")
    print(f"Numerical fields: {numerical_fields}")

    # Simple label encoding for categorical features
    for column in categorical_fields:
        if column != "Decision_Outcome":
            # Handle missing values
            data = data.with_columns(pl.col(column).fill_null("missing").alias(column))
            # Create a simple label encoder
            le = LabelEncoder()
            encoded = le.fit_transform(data[column].to_list())
            data = data.with_columns(pl.Series(name=column + "_encoded", values=encoded))

    # Encode target variable
    le_target = LabelEncoder()
    encoded_target = le_target.fit_transform(data["Decision_Outcome"].to_list())
    data = data.with_columns(pl.Series(name="target", values=encoded_target))

    # Show target encoding
    target_mapping = dict(zip(le_target.classes_, range(len(le_target.classes_))))
    print(f"\nTarget encoding: {target_mapping}")

    # Select features and target
    feature_cols = [
        col for col in data.columns if col.endswith("_encoded")
    ] + numerical_fields
    X = data[feature_cols]
    y = data["target"]

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(f"\nTraining set size: {X_train.shape[0]} samples")
    print(f"Test set size: {X_test.shape[0]} samples")

    return X_train, X_test, y_train, y_test, le_target, feature_cols

X_train, X_test, y_train, y_test, target_encoder, feature_cols = data_prep(hds_data, hds_data_dictionary)
Categorical fields: ['Context_Group', 'Context_Channel', 'Context_Name', 'Context_Direction', 'Context_Issue', 'Customer_AnnualIncome', 'Customer_IsCustomerActive', 'Customer_ID', 'Customer_ReviewDate', 'Customer_City', 'Customer_pyRegion', 'Customer_WinScore', 'Customer_Deceased', 'Customer_BalanceTransaction', 'Customer_TotalLiabilities', 'Customer_LastReviewedDate', 'Customer_NextReviewDate', 'Customer_pyCountry', 'Customer_Age', 'Customer_NaturalDisaster', 'Customer_Incarceration', 'Customer_MaritalStatus', 'Customer_RiskCode', 'Customer_CLV', 'Customer_BusinessSegment', 'Customer_TotalAssets', 'Customer_HealthMatter', 'Customer_MilitaryService', 'Customer_RiskScore', 'Customer_ResidentialStatus', 'Customer_OrganizationLabel', 'Customer_InCollections', 'Customer_Bankruptcy', 'Customer_State', 'IH_Web_Inbound_Loyal_pxLastGroupID', 'IH_SMS_Outbound_Churned_pxLastGroupID', 'IH_SMS_Outbound_Rejected_pxLastGroupID', 'IH_Web_Inbound_Rejected_pxLastGroupID', 'IH_Email_Outbound_Rejected_pxLastGroupID', 'IH_Web_Inbound_Accepted_pxLastGroupID', 'IH_Email_Outbound_Accepted_pxLastGroupID', 'IH_SMS_Outbound_Loyal_pxLastGroupID', 'IH_SMS_Outbound_Accepted_pxLastGroupID', 'IH_Email_Outbound_Loyal_pxLastGroupID', 'IH_Email_Outbound_Churned_pxLastGroupID', 'IH_Web_Inbound_Churned_pxLastGroupID']
Numerical fields: ['Customer_CreditScore', 'Customer_RelationshipStartDate', 'Customer_NoOfDependents', 'Customer_CLV_VALUE', 'Customer_Date_of_Birth', 'Customer_NetWealth', 'IH_SMS_Outbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Churned_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Accepted_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Rejected_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Accepted_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Churned_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Loyal_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Rejected_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Accepted_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Rejected_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Churned_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Loyal_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Loyal_pyHistoricalOutcomeCount', 'Param_ExtGroupCreditcards', 'Param_ExtGroupAutoloans', 'Param_ExtGroupWealthoffers']
Target encoding: {'Accepted': 0, 'Rejected': 1}
Training set size: 9649 samples
Test set size: 2413 samples
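As a minimal illustration of what the LabelEncoder above does (a plain-Python sketch, not the sklearn implementation itself): categories are sorted and then mapped to integer codes, which is why the target encoding comes out as {'Accepted': 0, 'Rejected': 1}.

```python
def label_encode(values):
    # Mimics sklearn's LabelEncoder: sort the distinct classes,
    # then map them to integer codes 0..n-1.
    classes = sorted(set(values))
    mapping = {c: i for i, c in enumerate(classes)}
    return [mapping[v] for v in values], mapping

codes, mapping = label_encode(["Web", "Email", "SMS", "Web"])
print(mapping)  # {'Email': 0, 'SMS': 1, 'Web': 2}
print(codes)    # [2, 0, 1, 2]
```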
[6]:
from xgboost import XGBClassifier

def create_classifier(X_train, X_test, y_train, y_test, target_encoder):
    # Create and train the model
    xgb_model = XGBClassifier(random_state=42)
    xgb_model.fit(X_train, y_train)

    # Make predictions and evaluate
    y_pred = xgb_model.predict(X_test)
    print(f"Model AUC: {round(roc_auc_score(y_test, y_pred), 5)}")

    # Classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=target_encoder.classes_))

    return xgb_model

classifier = create_classifier(X_train, X_test, y_train, y_test, target_encoder)
Model AUC: 0.54154
Classification Report:
precision recall f1-score support
Accepted 0.24 0.11 0.15 179
Rejected 0.93 0.97 0.95 2234
accuracy 0.91 2413
macro avg 0.58 0.54 0.55 2413
weighted avg 0.88 0.91 0.89 2413
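A side note on the AUC above: it is computed from the hard class predictions. AUC is normally computed from predicted probabilities (e.g. `xgb_model.predict_proba(X_test)[:, 1]`), which preserves the ranking information; thresholding collapses the scores into two levels and can only lose discrimination. A small pure-Python sketch with made-up scores illustrates the difference:

```python
def auc(y_true, scores):
    # Mann-Whitney formulation of AUC: the probability that a randomly
    # chosen positive scores higher than a randomly chosen negative
    # (ties count as half).
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
probs = [0.2, 0.6, 0.4, 0.9]                   # hypothetical predicted probabilities
hard = [1 if p >= 0.5 else 0 for p in probs]   # thresholded to class labels

print(auc(y_true, probs))  # 0.75 -- ranking detail preserved
print(auc(y_true, hard))   # 0.5  -- scores collapsed to two levels
```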
[7]:
def plot_feature_imp(classifier, data, feature_cols):
    importances = classifier.feature_importances_

    # Create a DataFrame for feature importances
    feature_importance_df = (
        pl.DataFrame({"Feature": feature_cols, "Importance": importances.tolist()})
        .with_columns(
            Feature=pl.when(pl.col("Feature").str.ends_with("_encoded"))
            .then(pl.col("Feature").str.replace(r"_encoded$", ""))
            .otherwise(pl.col("Feature"))
        )
        .with_columns(
            Category=pl.when(pl.col("Feature").str.contains("_", literal=True))
            .then(pl.col("Feature").str.replace(r"([^_]+)_.*", "${1}"))
            .otherwise(pl.lit("Internal"))
        )
        .sort("Importance", descending=True)
    )

    # Get top N features by importance
    top_features_df = feature_importance_df.head(50)

    # Get the ordered list of feature names
    feature_order = top_features_df["Feature"].to_list()

    # Correlation test of the top features: collect groups of highly
    # correlated columns
    features_encoded_name = [
        f"{feat}_encoded" if f"{feat}_encoded" in feature_cols else feat
        for feat in feature_order
    ]
    similar_features = []
    n = len(features_encoded_name)
    for i in range(n):
        for j in range(i + 1, n):
            col1 = features_encoded_name[i]
            col2 = features_encoded_name[j]
            correlation = data.select(pl.corr(col1, col2)).item()
            if abs(correlation) >= 0.95:
                found = False
                # Use a separate index so we don't shadow the outer loop variable
                for k, tup in enumerate(similar_features):
                    if col1 in tup:
                        similar_features[k] = tup + (col2,)
                        found = True
                        break
                    elif col2 in tup:
                        similar_features[k] = tup + (col1,)
                        found = True
                        break
                if not found:
                    similar_features.append((col1, col2))

    # Create the group labels; strip the "_encoded" suffix so the keys
    # match the plotted Feature names
    group_mapping = {}
    for i, group in enumerate(similar_features, 1):
        for feature in group:
            group_mapping[feature.removesuffix("_encoded")] = f"Group {i}"
    top_features_df = top_features_df.with_columns(
        pl.col("Feature")
        .map_elements(lambda f: group_mapping.get(f, ""), return_dtype=pl.Utf8)
        .alias("GroupLabel")
    )

    # Plot feature importances
    fig = px.bar(
        top_features_df,
        x="Importance",
        y="Feature",
        orientation="h",
        title="Feature Importance",
        template="plotly_white",
        color="Category",
        text="GroupLabel",
        color_discrete_map={
            "Context": "orange",
            "IH": "green",
            "Customer": "blue",
            "Param": "lightblue",
            "DataLake": "red",
            "Internal": "gray",
            "Decision": "purple",
        },
    )
    fig.update_layout(
        xaxis_title="Importance",
        yaxis_title="Feature",
        yaxis=dict(
            categoryorder="array",
            categoryarray=feature_order,
            autorange="reversed",
            dtick=1,
        ),
    )
    fig.update_traces(
        textposition="outside",
        textfont=dict(color="black", size=12),
    )
    return fig

plot_feature_imp(classifier, X_train, feature_cols).update_layout(height=800).show()
(Interactive Plotly figure: “Feature Importance”; not rendered in this export.)
Strongly correlated features are annotated with a group label (e.g. “Group 1”). Colors distinguish the sources of the features. In this demo data set you can see that Group and Channel are very important features, as expected.
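The grouping is based on pairwise Pearson correlation with a 0.95 cutoff. As a minimal standalone sketch of that check, with made-up feature values:

```python
def pearson(xs, ys):
    # Pearson correlation coefficient: covariance normalized by the
    # product of the standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

good = [0.91, 0.12, 0.88, 0.05, 0.95]
correlated = [0.82, 0.10, 0.79, 0.06, 0.86]  # near-linear copy of `good`
print(pearson(good, correlated) >= 0.95)  # True: these would share a group label
```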
Finding new Features from the Data Lake¶
Now, suppose you have external data from your data lake that you want to consider adding to Pega to improve the performance of your models.
If you have such data, you can merge it with the HDS data and run the model again to see how these features fare against what ADM already uses.
Such data is typically time-stamped, so we need to be careful to only pull in data from before the decisions were made.
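Conceptually, for each decision we want the most recent snapshot at or before the decision time; this is what Polars’ join_asof does with its default “backward” strategy, used below. A small plain-Python sketch with hypothetical timestamps:

```python
from bisect import bisect_right

# Hypothetical snapshots for one customer: (timestamp, value), sorted by time.
snapshots = [(1, "a"), (5, "b"), (9, "c")]
times = [t for t, _ in snapshots]

def latest_before(decision_time):
    # Emulates a backward as-of join: return the most recent snapshot at
    # or before the decision time, or None if no snapshot precedes it.
    i = bisect_right(times, decision_time)
    return snapshots[i - 1][1] if i else None

print(latest_before(6))  # "b" -- the snapshot at t=9 would leak future data
print(latest_before(0))  # None -- no snapshot exists yet
```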
Create (fake) External Data¶
We first create an example of external data. Features there are captured over time, so each record carries a customer ID, a snapshot timestamp, and the feature values.
This code (and the resulting data) is just an example; you can use any data you want, we only highlight the structure.
[8]:
import random

random.seed(101)
datalake_fake_data = (
    hds_data.with_columns(
        DataLake_BadFeature=pl.Series([random.random() for _ in range(hds_data.height)]),
        DataLake_GoodFeature=(pl.col("Decision_Outcome") == "Accepted") * 0.9
        + pl.Series([random.random() for _ in range(hds_data.height)]) * 0.1,
        DataLake_GoodFeatureCorrelated=(pl.col("Decision_Outcome") == "Accepted") * 0.8
        + pl.Series([random.random() for _ in range(hds_data.height)]) * 0.1,
    )
    .select(
        [
            pl.col("Customer_ID"),
            pl.col("Decision_DecisionTime").dt.truncate("1d").alias("SnapshotTime"),
            pl.col("DataLake_BadFeature"),
            pl.col("DataLake_GoodFeature"),
            pl.col("DataLake_GoodFeatureCorrelated"),
        ]
    )
    .group_by(["Customer_ID", "SnapshotTime"])
    .agg(cs.all().mean())
    .sort(["Customer_ID", "SnapshotTime"])
)
datalake_fake_data.head()
[8]:
Customer_ID | SnapshotTime | DataLake_BadFeature | DataLake_GoodFeature | DataLake_GoodFeatureCorrelated |
---|---|---|---|---|
str | datetime[ns] | f64 | f64 | f64 |
"Customer-1253" | 2021-05-24 00:00:00 | 0.451482 | 0.168608 | 0.149609 |
"Customer-1254" | 2021-05-24 00:00:00 | 0.409358 | 0.170598 | 0.150844 |
"Customer-1255" | 2021-05-24 00:00:00 | 0.476272 | 0.084989 | 0.089284 |
"Customer-1256" | 2021-05-24 00:00:00 | 0.538506 | 0.054512 | 0.048855 |
"Customer-1257" | 2021-05-24 00:00:00 | 0.526969 | 0.061005 | 0.04727 |
Joining that data with the HDS data is straightforward: we match on customer ID and timestamp. To avoid leakage, for each decision we only join in the latest data-lake snapshot for that customer taken before the decision time in the HDS dataset.
Polars provides a convenient way to do this with the join_asof function.
[9]:
augmented_data = hds_data.join_asof(
    datalake_fake_data,
    left_on="Decision_DecisionTime",
    right_on="SnapshotTime",
    by="Customer_ID",
    check_sortedness=False,
)
augmented_data_dictionary = pl.concat(
    [
        hds_data_dictionary,
        pl.DataFrame(
            {
                "Field": [
                    "DataLake_BadFeature",
                    "DataLake_GoodFeature",
                    "DataLake_GoodFeatureCorrelated",
                ],
                "Numeric": [True, True, True],
                "Category": ["DataLake", "DataLake", "DataLake"],
            }
        ),
    ]
)
[10]:
X_train, X_test, y_train, y_test, target_encoder, feature_cols = data_prep(augmented_data, augmented_data_dictionary)
classifier = create_classifier(X_train, X_test, y_train, y_test, target_encoder)
plot_feature_imp(classifier, X_train, feature_cols).update_layout(height=800).show()
Categorical fields: ['Context_Group', 'Context_Channel', 'Context_Name', 'Context_Direction', 'Context_Issue', 'Customer_AnnualIncome', 'Customer_IsCustomerActive', 'Customer_ID', 'Customer_ReviewDate', 'Customer_City', 'Customer_pyRegion', 'Customer_WinScore', 'Customer_Deceased', 'Customer_BalanceTransaction', 'Customer_TotalLiabilities', 'Customer_LastReviewedDate', 'Customer_NextReviewDate', 'Customer_pyCountry', 'Customer_Age', 'Customer_NaturalDisaster', 'Customer_Incarceration', 'Customer_MaritalStatus', 'Customer_RiskCode', 'Customer_CLV', 'Customer_BusinessSegment', 'Customer_TotalAssets', 'Customer_HealthMatter', 'Customer_MilitaryService', 'Customer_RiskScore', 'Customer_ResidentialStatus', 'Customer_OrganizationLabel', 'Customer_InCollections', 'Customer_Bankruptcy', 'Customer_State', 'IH_Web_Inbound_Loyal_pxLastGroupID', 'IH_SMS_Outbound_Churned_pxLastGroupID', 'IH_SMS_Outbound_Rejected_pxLastGroupID', 'IH_Web_Inbound_Rejected_pxLastGroupID', 'IH_Email_Outbound_Rejected_pxLastGroupID', 'IH_Web_Inbound_Accepted_pxLastGroupID', 'IH_Email_Outbound_Accepted_pxLastGroupID', 'IH_SMS_Outbound_Loyal_pxLastGroupID', 'IH_SMS_Outbound_Accepted_pxLastGroupID', 'IH_Email_Outbound_Loyal_pxLastGroupID', 'IH_Email_Outbound_Churned_pxLastGroupID', 'IH_Web_Inbound_Churned_pxLastGroupID']
Numerical fields: ['Customer_CreditScore', 'Customer_RelationshipStartDate', 'Customer_NoOfDependents', 'Customer_CLV_VALUE', 'Customer_Date_of_Birth', 'Customer_NetWealth', 'IH_SMS_Outbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Churned_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Accepted_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Rejected_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Accepted_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Churned_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Loyal_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Rejected_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Accepted_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Rejected_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Churned_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Loyal_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Loyal_pyHistoricalOutcomeCount', 'Param_ExtGroupCreditcards', 'Param_ExtGroupAutoloans', 'Param_ExtGroupWealthoffers', 'DataLake_BadFeature', 'DataLake_GoodFeature', 'DataLake_GoodFeatureCorrelated']
Target encoding: {'Accepted': 0, 'Rejected': 1}
Training set size: 9649 samples
Test set size: 2413 samples
Model AUC: 0.55617
Classification Report:
precision recall f1-score support
Accepted 0.24 0.15 0.18 179
Rejected 0.93 0.96 0.95 2234
accuracy 0.90 2413
macro avg 0.59 0.56 0.57 2413
weighted avg 0.88 0.90 0.89 2413
(Interactive Plotly figure: “Feature Importance”, now including the DataLake features; not rendered in this export.)
Conclusions¶
The resulting feature importance shows the new “GoodFeature” and its correlated variant at the top of the feature list, and the “BadFeature” at the bottom. New features with high importance are the ones to consider for inclusion in Pega.
We also run a correlation check, so very similar features are labeled with their group index. In this example you would want to include only the best feature of “Group 1”; the other one won’t add much value since it is so highly correlated.