HDS (Historical Data Set) Analysis and Finding New Features

This notebook demonstrates how to work with a sample HDS dataset, explore its structure, and build a simple XGBoost model to show feature importance.

It then shows how to combine this data with a sample dataset from the Data Lake and analyze which features would be worth pulling into Pega to improve the models.

[1]:
import polars as pl
import polars.selectors as cs
import plotly.express as px
import requests, zipfile, io
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, roc_auc_score
from pdstools.utils import cdh_utils

Data Read

Read the HDS data. Depending on how the data was extracted and stored, you may need different reading methods using zipfile and/or Polars.
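For example, if the export had instead been unpacked into a folder of newline-delimited JSON files, a Polars glob scan works just as well. A minimal sketch; the folder path is hypothetical:

# A minimal sketch, assuming the HDS export was unpacked into a folder
# of newline-delimited JSON files; adjust the (hypothetical) path as needed.
import polars as pl

hds_data_alt = pl.scan_ndjson("../../data/hds/*.json").collect()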

We’re also casting the data to the appropriate types and dropping some features that shouldn’t be used for models.

[2]:
if Path("../../data/hds.zip").exists():
    archive = zipfile.ZipFile("../../data/hds.zip", "r")
else:
    # Download the dataset if it does not exist locally
    archive = zipfile.ZipFile(io.BytesIO(requests.get('https://github.com/pegasystems/pega-datascientist-tools/raw/master/data/hds.zip').content))
hds_data = (
    pl.concat([pl.read_ndjson(archive.open(f)) for f in archive.namelist()])
    .rename({"Customer_C_CIFNBR": "Customer_ID"})
    .with_columns(
        cdh_utils.parse_pega_date_time_formats("Decision_DecisionTime"),
        cdh_utils.parse_pega_date_time_formats("Decision_OutcomeTime"),
        cs.ends_with("_DaysSince", "_pyHistoricalOutcomeCount").cast(pl.Float64),
        pl.col(
            [
                "Customer_NetWealth",
                "Customer_CreditScore",
                "Customer_CLV_VALUE",
                "Customer_RelationshipStartDate",
                "Customer_Date_of_Birth",
                "Customer_NoOfDependents"
            ]
        ).cast(pl.Float64),
        cs.starts_with("Param_ExtGroup").cast(pl.Float64),
    )
    .drop(["Customer_Gender", "Customer_Prefix"])
    .sort(["Customer_ID", "Decision_DecisionTime"])
)
hds_data.describe()
[2]:
shape: (9, 92)
[Summary statistics (count, null_count, mean, std, min, 25%, 50%, 75%, max) for all 92 columns of the 12,062-row dataset; decision times range from 2021-05-21 to 2021-05-31. The full table is too wide to render here.]

Available Fields in the HDS dataset

The HDS data contains all the payload sent to the ADM models (over a period of time) plus the outcomes (Accepted/Rejected/Clicked, etc.). There are a few categories of fields that can be identified by their prefix:

  • “Customer” fields, representing the fields/predictors configured in ADM

  • “Context” fields, these are Channel/Direction/Issue/Group/Name, the usual “context identifiers” of the ADM models

  • “IH” fields, these are Pega-generated fields derived from Interaction History

  • Optional “Param” fields, also user-defined fields/predictors, but configured in the strategies rather than defined in the ADM model configuration

Meta information about the decisions is in the “Decision” fields and in internal fields, containing info such as the time of the decision, the sample sizes, etc. These are not used in the models.
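Because of this prefix convention, slicing the dataset by category is easy with the Polars selectors imported above. A minimal sketch:

# Select, for example, all Interaction History derived fields,
# or all context identifiers:
ih_fields = hds_data.select(cs.starts_with("IH_"))
context_fields = hds_data.select(cs.starts_with("Context_"))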

[3]:
hds_data_dictionary = (
    pl.DataFrame(
        {
            "Field": hds_data.schema.names(),
            "Numeric": [x.is_numeric() for x in hds_data.schema.dtypes()],
        }
    )
    .with_columns(
        Category=pl.when(pl.col("Field").str.contains("_", literal=True))
        .then(pl.col("Field").str.replace(r"([^_]+)_.*", "${1}"))
        .otherwise(pl.lit("Internal"))
    )
    .sort("Category")
)
hds_data_dictionary.to_pandas().style.hide()

[3]:
Field Numeric Category
Context_Group False Context
Context_Channel False Context
Context_Name False Context
Context_Direction False Context
Context_Issue False Context
Customer_AnnualIncome False Customer
Customer_IsCustomerActive False Customer
Customer_ID False Customer
Customer_ReviewDate False Customer
Customer_CreditScore True Customer
Customer_City False Customer
Customer_pyRegion False Customer
Customer_WinScore False Customer
Customer_Deceased False Customer
Customer_BalanceTransaction False Customer
Customer_TotalLiabilities False Customer
Customer_LastReviewedDate False Customer
Customer_NextReviewDate False Customer
Customer_pyCountry False Customer
Customer_Age False Customer
Customer_NaturalDisaster False Customer
Customer_Incarceration False Customer
Customer_MaritalStatus False Customer
Customer_RiskCode False Customer
Customer_RelationshipStartDate True Customer
Customer_CLV False Customer
Customer_BusinessSegment False Customer
Customer_TotalAssets False Customer
Customer_HealthMatter False Customer
Customer_NoOfDependents True Customer
Customer_MilitaryService False Customer
Customer_CLV_VALUE True Customer
Customer_RiskScore False Customer
Customer_Date_of_Birth True Customer
Customer_ResidentialStatus False Customer
Customer_NetWealth True Customer
Customer_OrganizationLabel False Customer
Customer_InCollections False Customer
Customer_Bankruptcy False Customer
Customer_State False Customer
Decision_DecisionTime False Decision
Decision_Rank False Decision
Decision_InteractionID False Decision
Decision_OutcomeTime False Decision
Decision_Outcome False Decision
Decision_SubjectID False Decision
IH_SMS_Outbound_Rejected_pxLastOutcomeTime_DaysSince True IH
IH_Web_Inbound_Churned_pyHistoricalOutcomeCount True IH
IH_Web_Inbound_Loyal_pxLastGroupID False IH
IH_Email_Outbound_Accepted_pyHistoricalOutcomeCount True IH
IH_Email_Outbound_Accepted_pxLastOutcomeTime_DaysSince True IH
IH_SMS_Outbound_Churned_pxLastGroupID False IH
IH_SMS_Outbound_Accepted_pxLastOutcomeTime_DaysSince True IH
IH_Email_Outbound_Rejected_pxLastOutcomeTime_DaysSince True IH
IH_SMS_Outbound_Rejected_pxLastGroupID False IH
IH_Web_Inbound_Rejected_pxLastOutcomeTime_DaysSince True IH
IH_Web_Inbound_Churned_pxLastOutcomeTime_DaysSince True IH
IH_Email_Outbound_Loyal_pxLastOutcomeTime_DaysSince True IH
IH_Web_Inbound_Rejected_pxLastGroupID False IH
IH_SMS_Outbound_Rejected_pyHistoricalOutcomeCount True IH
IH_SMS_Outbound_Accepted_pyHistoricalOutcomeCount True IH
IH_Web_Inbound_Accepted_pxLastOutcomeTime_DaysSince True IH
IH_Email_Outbound_Churned_pyHistoricalOutcomeCount True IH
IH_SMS_Outbound_Loyal_pxLastOutcomeTime_DaysSince True IH
IH_Web_Inbound_Loyal_pxLastOutcomeTime_DaysSince True IH
IH_SMS_Outbound_Churned_pxLastOutcomeTime_DaysSince True IH
IH_Web_Inbound_Loyal_pyHistoricalOutcomeCount True IH
IH_Web_Inbound_Rejected_pyHistoricalOutcomeCount True IH
IH_Email_Outbound_Rejected_pxLastGroupID False IH
IH_Web_Inbound_Accepted_pxLastGroupID False IH
IH_Web_Inbound_Accepted_pyHistoricalOutcomeCount True IH
IH_Email_Outbound_Churned_pxLastOutcomeTime_DaysSince True IH
IH_Email_Outbound_Accepted_pxLastGroupID False IH
IH_SMS_Outbound_Loyal_pxLastGroupID False IH
IH_Email_Outbound_Rejected_pyHistoricalOutcomeCount True IH
IH_SMS_Outbound_Churned_pyHistoricalOutcomeCount True IH
IH_Email_Outbound_Loyal_pyHistoricalOutcomeCount True IH
IH_SMS_Outbound_Accepted_pxLastGroupID False IH
IH_Email_Outbound_Loyal_pxLastGroupID False IH
IH_Email_Outbound_Churned_pxLastGroupID False IH
IH_Web_Inbound_Churned_pxLastGroupID False IH
IH_SMS_Outbound_Loyal_pyHistoricalOutcomeCount True IH
id False Internal
negativeSampling False Internal
positiveSampling False Internal
dataCenter False Internal
rulesetVersion False Internal
rulesetName False Internal
Param_ExtGroupCreditcards True Param
Param_ExtGroupAutoloans True Param
Param_ExtGroupWealthoffers True Param
[4]:
category_counts = (
    hds_data_dictionary
    .group_by("Category", "Numeric")
    .agg(Count=pl.len())
    .sort("Category")
)
fig = px.bar(
    category_counts,
    y="Category",
    x="Count",
    title="Number of Fields by Category",
    color="Numeric",
    text="Count",
    orientation="h",
)

fig.update_layout(
    yaxis_title="Field Category",
    xaxis_title="Number of Fields",
)

fig.show()

[Interactive Plotly bar chart: Number of Fields by Category]

Create XGBoost model

From this HDS data we create a simple XGBoost model, built over all channels and all actions rather than split up the way ADM does. This could be changed, but the goal here is to get a list of the features ranked by importance, not to create a model that is better than ADM.

First, there is some data prep: label-encoding the categorical features and the target variable.

[5]:
def data_prep(data, data_dictionary):
    categorical_fields = (
        data_dictionary.filter(~pl.col("Numeric"))
        .filter((pl.col("Category") != "Internal") & (pl.col("Category") != "Decision"))
        .select("Field")
        .to_series()
        .to_list()
    )
    numerical_fields = (
        data_dictionary.filter(pl.col("Numeric"))
        .filter((pl.col("Category") != "Internal") & (pl.col("Category") != "Decision"))
        .select("Field")
        .to_series()
        .to_list()
    )

    print(f"Categorical fields: {categorical_fields}")
    print(f"Numerical fields: {numerical_fields}")

    # Simple encoding for categorical features
    for column in categorical_fields:
        if column != "Decision_Outcome":
            # Handle missing values
            data = data.with_columns(
                pl.col(column).fill_null("missing").alias(column)
            )

            # Create a simple label encoder
            le = LabelEncoder()
            encoded = le.fit_transform(data[column].to_list())
            data = data.with_columns(
                pl.Series(name=column + "_encoded", values=encoded)
            )

    # Encode target variable
    le_target = LabelEncoder()
    encoded_target = le_target.fit_transform(data["Decision_Outcome"].to_list())
    data = data.with_columns(
        pl.Series(name="target", values=encoded_target)
    )

    # Show target encoding
    target_mapping = dict(zip(le_target.classes_, range(len(le_target.classes_))))
    print(f"\nTarget encoding: {target_mapping}")

    # Select features and target
    feature_cols = [
        col for col in data.columns if col.endswith("_encoded")
    ] + numerical_fields

    X = data[feature_cols]
    y = data["target"]

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    print(f"\nTraining set size: {X_train.shape[0]} samples")
    print(f"Test set size: {X_test.shape[0]} samples")

    return X_train, X_test, y_train, y_test, le_target, feature_cols


X_train, X_test, y_train, y_test, target_encoder, feature_cols = data_prep(hds_data, hds_data_dictionary)
Categorical fields: ['Context_Group', 'Context_Channel', 'Context_Name', 'Context_Direction', 'Context_Issue', 'Customer_AnnualIncome', 'Customer_IsCustomerActive', 'Customer_ID', 'Customer_ReviewDate', 'Customer_City', 'Customer_pyRegion', 'Customer_WinScore', 'Customer_Deceased', 'Customer_BalanceTransaction', 'Customer_TotalLiabilities', 'Customer_LastReviewedDate', 'Customer_NextReviewDate', 'Customer_pyCountry', 'Customer_Age', 'Customer_NaturalDisaster', 'Customer_Incarceration', 'Customer_MaritalStatus', 'Customer_RiskCode', 'Customer_CLV', 'Customer_BusinessSegment', 'Customer_TotalAssets', 'Customer_HealthMatter', 'Customer_MilitaryService', 'Customer_RiskScore', 'Customer_ResidentialStatus', 'Customer_OrganizationLabel', 'Customer_InCollections', 'Customer_Bankruptcy', 'Customer_State', 'IH_Web_Inbound_Loyal_pxLastGroupID', 'IH_SMS_Outbound_Churned_pxLastGroupID', 'IH_SMS_Outbound_Rejected_pxLastGroupID', 'IH_Web_Inbound_Rejected_pxLastGroupID', 'IH_Email_Outbound_Rejected_pxLastGroupID', 'IH_Web_Inbound_Accepted_pxLastGroupID', 'IH_Email_Outbound_Accepted_pxLastGroupID', 'IH_SMS_Outbound_Loyal_pxLastGroupID', 'IH_SMS_Outbound_Accepted_pxLastGroupID', 'IH_Email_Outbound_Loyal_pxLastGroupID', 'IH_Email_Outbound_Churned_pxLastGroupID', 'IH_Web_Inbound_Churned_pxLastGroupID']
Numerical fields: ['Customer_CreditScore', 'Customer_RelationshipStartDate', 'Customer_NoOfDependents', 'Customer_CLV_VALUE', 'Customer_Date_of_Birth', 'Customer_NetWealth', 'IH_SMS_Outbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Churned_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Accepted_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Rejected_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Accepted_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Churned_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Loyal_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Rejected_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Accepted_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Rejected_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Churned_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Loyal_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Loyal_pyHistoricalOutcomeCount', 'Param_ExtGroupCreditcards', 'Param_ExtGroupAutoloans', 'Param_ExtGroupWealthoffers']

Target encoding: {'Accepted': 0, 'Rejected': 1}

Training set size: 9649 samples
Test set size: 2413 samples
[6]:
from xgboost import XGBClassifier
def create_classifier(X_train, X_test, y_train, y_test, target_encoder):
    # Create and train the model
    xgb_model = XGBClassifier(random_state=42)
    xgb_model.fit(X_train, y_train)

    # Make predictions and evaluate
    y_pred = xgb_model.predict(X_test)
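    # Note: AUC is computed here on hard class predictions; using
    # xgb_model.predict_proba(X_test)[:, 1] instead would give the usual
    # threshold-independent AUC (see the sketch after this cell).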
    print(f"Model AUC: {round(roc_auc_score(y_test, y_pred), 5)}")

    # Classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=target_encoder.classes_))

    return xgb_model

classifier = create_classifier(X_train, X_test, y_train, y_test, target_encoder)
Model AUC: 0.54154

Classification Report:
              precision    recall  f1-score   support

    Accepted       0.24      0.11      0.15       179
    Rejected       0.93      0.97      0.95      2234

    accuracy                           0.91      2413
   macro avg       0.58      0.54      0.55      2413
weighted avg       0.88      0.91      0.89      2413
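The AUC above is based on hard 0/1 predictions. A minimal sketch of the more common threshold-independent AUC, computed from predicted probabilities and reusing the classifier and test set from above:

# Probability of the positive class ("Rejected" = 1 after label encoding)
y_score = classifier.predict_proba(X_test)[:, 1]
print(f"Probability-based AUC: {round(roc_auc_score(y_test, y_score), 5)}")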

[7]:
def plot_feature_imp(classifier, data, feature_cols):
    importances = classifier.feature_importances_

    # Create a DataFrame for feature importances
    feature_importance_df = (
        pl.DataFrame({"Feature": feature_cols, "Importance": importances.tolist()})
        .with_columns(
            Feature=pl.when(pl.col("Feature").str.ends_with("_encoded"))
            .then(pl.col("Feature").str.replace(r"_encoded$", ""))
            .otherwise(pl.col("Feature"))
        )
        .with_columns(
            Category=pl.when(pl.col("Feature").str.contains("_", literal=True))
            .then(pl.col("Feature").str.replace(r"([^_]+)_.*", "${1}"))
            .otherwise(pl.lit("Internal"))
        )
        .sort("Importance", descending=True)
    )

    # Get top N features by importance
    top_features_df = feature_importance_df.head(50)

    # Get the ordered list of feature names
    feature_order = top_features_df["Feature"].to_list()

    # Correlation test of new features
    features_encoded_name = [
        f"{feat}_encoded" if f"{feat}_encoded" in feature_cols else feat
        for feat in feature_order
    ]

    similar_features = []
    n = len(features_encoded_name)

    for i in range(n):
        for j in range(i + 1, n):
            col1 = features_encoded_name[i]
            col2 = features_encoded_name[j]

            correlation = data.select(pl.corr(col1, col2)).item()
            if abs(correlation) >= 0.95:
                found = False

                # Merge into an existing correlation group; use a separate
                # index variable to avoid shadowing the outer loop's `i`
                for k, tup in enumerate(similar_features):
                    if col1 in tup:
                        similar_features[k] = tup + (col2,)
                        found = True
                        break
                    elif col2 in tup:
                        similar_features[k] = tup + (col1,)
                        found = True
                        break

                if not found:
                    similar_features.append((col1, col2))


    # Creating the group label
    group_mapping = {}
    for i, group in enumerate(similar_features, 1):
        for feature in group:
            group_mapping[feature] = f"Group {i}"

    top_features_df = top_features_df.with_columns(
        pl.col("Feature").map_elements(lambda f: group_mapping.get(f, ""), return_dtype=pl.Utf8).alias("GroupLabel")
    )

    # Plot feature importances
    fig = px.bar(
        top_features_df,
        x="Importance",
        y="Feature",
        orientation="h",
        title="Feature Importance",
        template="plotly_white",
        color="Category",
        text="GroupLabel",
        color_discrete_map={
            "Context": "orange",
            "IH": "green",
            "Customer": "blue",
            "Param": "lightblue",
            "DataLake": "red",
            "Internal": "gray",
            "Decision": "purple",
        },
    )

    fig.update_layout(
        xaxis_title="Importance",
        yaxis_title="Feature",
        yaxis=dict(
            categoryorder="array",
            categoryarray=feature_order,
            autorange="reversed",
            dtick=1,
        ),
    )

    fig.update_traces(
        textposition="outside",
        textfont=dict(color="black", size=12),
    )
    return fig

plot_feature_imp(classifier, X_train, feature_cols).update_layout(height=800).show()

[Interactive Plotly bar chart: Feature Importance]

Features that are strongly correlated are shown with an indication of the group (e.g. “Group 1”). Colors differentiate the sources of the features. In this demo data set, you see that Group and Channel are very important features, as expected.
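To verify such a grouping directly, the pairwise correlation is easy to compute. A minimal sketch; the two column names are hypothetical examples, substitute the features the plot actually grouped together:

# Hypothetical example pair; replace with the grouped feature columns.
corr = X_train.select(pl.corr("Customer_NetWealth", "Customer_CreditScore")).item()
print(f"Correlation: {corr:.3f}")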

Finding new Features from the Data Lake

Now, suppose you have external data from your data lake that you want to consider adding to Pega to improve the performance of your models.

If you have such data, you can merge it with the HDS data and run the model again to see how these features fare against what ADM already uses.

Such data is typically time-stamped, so we need to be careful to only pull in data from before the decisions were made.

Create (fake) External Data

We first create an example of external data. Features there are captured over time, so every row carries a customer ID, a snapshot timestamp, and the feature values.

This code (and the resulting data) is just an example. You can use any data you want; we merely highlight the expected structure.

[8]:
import random

random.seed(101)
datalake_fake_data = hds_data.with_columns(
    DataLake_BadFeature=pl.Series([random.random() for _ in range(hds_data.height)]),
    DataLake_GoodFeature=(pl.col("Decision_Outcome") == "Accepted") * 0.9
    + pl.Series([random.random() for _ in range(hds_data.height)]) * 0.1,
    DataLake_GoodFeatureCorrelated=(pl.col("Decision_Outcome") == "Accepted") * 0.8
    + pl.Series([random.random() for _ in range(hds_data.height)]) * 0.1
).select(
    [
        pl.col("Customer_ID"),
        pl.col("Decision_DecisionTime").dt.truncate("1d").alias("SnapshotTime"),
        pl.col("DataLake_BadFeature"),
        pl.col("DataLake_GoodFeature"),
        pl.col("DataLake_GoodFeatureCorrelated")
    ]
).group_by(
    ["Customer_ID", "SnapshotTime"]
).agg(
    cs.all().mean()
).sort(["Customer_ID", "SnapshotTime"])

datalake_fake_data.head()
[8]:
shape: (5, 5)
Customer_ID      SnapshotTime         DataLake_BadFeature  DataLake_GoodFeature  DataLake_GoodFeatureCorrelated
str              datetime[ns]         f64                  f64                   f64
"Customer-1253"  2021-05-24 00:00:00  0.451482             0.168608              0.149609
"Customer-1254"  2021-05-24 00:00:00  0.409358             0.170598              0.150844
"Customer-1255"  2021-05-24 00:00:00  0.476272             0.084989              0.089284
"Customer-1256"  2021-05-24 00:00:00  0.538506             0.054512              0.048855
"Customer-1257"  2021-05-24 00:00:00  0.526969             0.061005              0.04727

Joining that data with the HDS data is straightforward: we match by customer ID and timestamp. To avoid leakage, however, we must only join in, for each customer, the latest data lake snapshot taken before the decision time in the HDS data.

Polars provides a convenient way to do this with the join_asof function.

[9]:
augmented_data = hds_data.join_asof(
    datalake_fake_data,
    left_on="Decision_DecisionTime",
    right_on="SnapshotTime",
    by="Customer_ID",
    check_sortedness=False,
)
augmented_data_dictionary = pl.concat(
    [
        hds_data_dictionary,
        pl.DataFrame(
            {
                "Field": ["DataLake_BadFeature", "DataLake_GoodFeature", "DataLake_GoodFeatureCorrelated"],
                "Numeric": [True, True, True],
                "Category": ["DataLake", "DataLake", "DataLake"],
            }
        ),
    ]
)
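Note that join_asof defaults to the “backward” strategy, so each decision is matched only with the latest snapshot at or before its decision time; decisions with no earlier snapshot come back as null. A quick sanity check (a sketch, reusing the frame built above):

# Count decisions for which no data lake snapshot was available:
print(
    augmented_data.select(
        pl.col("DataLake_GoodFeature").null_count().alias("decisions_without_snapshot")
    )
)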
[10]:
X_train, X_test, y_train, y_test, target_encoder, feature_cols = data_prep(augmented_data, augmented_data_dictionary)
classifier = create_classifier(X_train, X_test, y_train, y_test, target_encoder)

plot_feature_imp(classifier, X_train, feature_cols).update_layout(height=800).show()
Categorical fields: ['Context_Group', 'Context_Channel', 'Context_Name', 'Context_Direction', 'Context_Issue', 'Customer_AnnualIncome', 'Customer_IsCustomerActive', 'Customer_ID', 'Customer_ReviewDate', 'Customer_City', 'Customer_pyRegion', 'Customer_WinScore', 'Customer_Deceased', 'Customer_BalanceTransaction', 'Customer_TotalLiabilities', 'Customer_LastReviewedDate', 'Customer_NextReviewDate', 'Customer_pyCountry', 'Customer_Age', 'Customer_NaturalDisaster', 'Customer_Incarceration', 'Customer_MaritalStatus', 'Customer_RiskCode', 'Customer_CLV', 'Customer_BusinessSegment', 'Customer_TotalAssets', 'Customer_HealthMatter', 'Customer_MilitaryService', 'Customer_RiskScore', 'Customer_ResidentialStatus', 'Customer_OrganizationLabel', 'Customer_InCollections', 'Customer_Bankruptcy', 'Customer_State', 'IH_Web_Inbound_Loyal_pxLastGroupID', 'IH_SMS_Outbound_Churned_pxLastGroupID', 'IH_SMS_Outbound_Rejected_pxLastGroupID', 'IH_Web_Inbound_Rejected_pxLastGroupID', 'IH_Email_Outbound_Rejected_pxLastGroupID', 'IH_Web_Inbound_Accepted_pxLastGroupID', 'IH_Email_Outbound_Accepted_pxLastGroupID', 'IH_SMS_Outbound_Loyal_pxLastGroupID', 'IH_SMS_Outbound_Accepted_pxLastGroupID', 'IH_Email_Outbound_Loyal_pxLastGroupID', 'IH_Email_Outbound_Churned_pxLastGroupID', 'IH_Web_Inbound_Churned_pxLastGroupID']
Numerical fields: ['Customer_CreditScore', 'Customer_RelationshipStartDate', 'Customer_NoOfDependents', 'Customer_CLV_VALUE', 'Customer_Date_of_Birth', 'Customer_NetWealth', 'IH_SMS_Outbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Churned_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Accepted_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Rejected_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Rejected_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Accepted_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Accepted_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Churned_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Loyal_pxLastOutcomeTime_DaysSince', 'IH_SMS_Outbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Web_Inbound_Loyal_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Rejected_pyHistoricalOutcomeCount', 'IH_Web_Inbound_Accepted_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Churned_pxLastOutcomeTime_DaysSince', 'IH_Email_Outbound_Rejected_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Churned_pyHistoricalOutcomeCount', 'IH_Email_Outbound_Loyal_pyHistoricalOutcomeCount', 'IH_SMS_Outbound_Loyal_pyHistoricalOutcomeCount', 'Param_ExtGroupCreditcards', 'Param_ExtGroupAutoloans', 'Param_ExtGroupWealthoffers', 'DataLake_BadFeature', 'DataLake_GoodFeature', 'DataLake_GoodFeatureCorrelated']

Target encoding: {'Accepted': 0, 'Rejected': 1}

Training set size: 9649 samples
Test set size: 2413 samples
Model AUC: 0.55617

Classification Report:
              precision    recall  f1-score   support

    Accepted       0.24      0.15      0.18       179
    Rejected       0.93      0.96      0.95      2234

    accuracy                           0.90      2413
   macro avg       0.59      0.56      0.57      2413
weighted avg       0.88      0.90      0.89      2413

[Interactive Plotly bar chart: Feature Importance, now including the DataLake features]

Conclusions

The resulting feature importance shows how the new “GoodFeature” and its correlated variant are at the top of the list of features, while the “BadFeature” is at the bottom. You should consider the new features with high feature importance for inclusion in Pega.

We also do a correlation check, so very similar features are labeled with their group index. In this example, you would want to include only the best feature of “Group 1”; the other one won’t add much value since it is so highly correlated.
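A minimal sketch of that pruning step, assuming a frame with Feature, Importance and GroupLabel columns like the top_features_df built inside plot_feature_imp above:

# Keep only the most important feature of each correlation group;
# features without a group label (empty string) are always kept.
pruned = (
    top_features_df.sort("Importance", descending=True)
    .filter((pl.col("GroupLabel") == "") | pl.col("GroupLabel").is_first_distinct())
)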