Explainability Extract Analysis¶
Pega
2024-06-11
Welcome to the Explainability Extract Demo. This notebook guides you through the analysis of the Explainability Extract v1 dataset using the DecisionAnalyzer class of the pdstools library. At this point the dataset focuses on the “Arbitration” stage, but we intend to widen the data scope in subsequent iterations. This dataset can be extracted from Infinity 24.1 onwards.
We developed this notebook with a dual purpose. Firstly, we aim to familiarize you with the various functions and visualizations available in the DecisionAnalyzer class. You’ll learn how to aggregate and visualize data in ways that are meaningful and insightful for your specific use cases.
Secondly, we hope this notebook will inspire you. The analysis and visualizations demonstrated here are only the tip of the iceberg. We encourage you to think creatively and explore other ways this data can be utilized. Consider this notebook a springboard for your analysis journey.
Each data point represents a decision made in real-time, providing a snapshot of the arbitration process. By examining this data, you have the opportunity to delve into the intricacies of these decisions, gaining a deeper understanding of the decision-making process.
As you navigate through this notebook, remember that it is interactive. This means you can not only run each code cell to see the results but also tweak the code and experiment as you go along.
[2]:
from pdstools.decision_analyzer.decision_data import DecisionAnalyzer
from pdstools import read_ds_export
import polars as pl
Read the data and create a DecisionAnalyzer instance¶
[3]:
df = read_ds_export(
filename="sample_explainability_extract.parquet",
path="https://raw.githubusercontent.com/pegasystems/pega-datascientist-tools/master/data",
)
decision_data = DecisionAnalyzer(df)
Overview¶
The get_overview_stats property of DecisionAnalyzer shows general statistics of the data.
[4]:
decision_data.get_overview_stats
[4]:
{'Actions': 152,
'Channels': 1,
'Duration': datetime.timedelta(days=7),
'StartDate': datetime.date(2024, 6, 12),
'Customers': 100,
'Decisions': 100,
'avgOffersAtArbitration': 77,
'avgOffersAtEligibility': 0}
Let’s take a look at one decision. The height of the dataframe shows how many actions are available at the Arbitration stage for this customer interaction. The pxRank column shows the rank of each action in arbitration.
[5]:
# Pick the first interaction ID in the data
selected_interaction_id = (
decision_data.unfiltered_raw_decision_data.select("pxInteractionID")
.first()
.collect()
.row(0)[0]
)
print(f"{selected_interaction_id=}")
decision_data.unfiltered_raw_decision_data.filter(
pl.col("pxInteractionID") == selected_interaction_id
).sort("pxRank").collect()
selected_interaction_id='-8570391784720840265'
[5]:
pySubjectID | pxInteractionID | pyIssue | pyGroup | pyName | pyChannel | pyDirection | Value | ContextWeight | Weight | pyModelPropensity | pyPropensity | FinalPropensity | Priority | pxRank | ModelControlGroup | pxDecisionTime | day | pxEngagementStage |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | str | str | str | str | str | f64 | f64 | f64 | f64 | f64 | f64 | f32 | i32 | str | datetime[μs] | date | cat |
"pySubjectID_1999" | "-8570391784720840265" | "Retention" | "Mortgage" | "pyName_476" | "Mobile" | "Inbound" | 0.58 | 1.168327 | 1.280707 | 0.5 | 0.5 | 0.487957 | 0.423471 | 1 | "Test" | 2024-06-14 18:21:40.202300 | 2024-06-14 | "Arbitration" |
"pySubjectID_1999" | "-8570391784720840265" | "Retention" | "Mortgage" | "pyName_61" | "Mobile" | "Inbound" | 0.58 | 0.612144 | 1.592721 | 0.5 | 0.5 | 0.491959 | 0.278196 | 2 | "Test" | 2024-06-14 18:21:40.202300 | 2024-06-14 | "Arbitration" |
"pySubjectID_1999" | "-8570391784720840265" | "Retention" | "Proactive_1" | "pyName_359" | "Mobile" | "Inbound" | 0.61 | 0.560515 | 0.96787 | 0.5 | 0.5 | 0.551238 | 0.18242 | 3 | "Test" | 2024-06-14 18:21:40.202300 | 2024-06-14 | "Arbitration" |
"pySubjectID_1999" | "-8570391784720840265" | "Retention" | "RetailBank_1" | "pyName_211" | "Mobile" | "Inbound" | 0.38 | 0.88016 | 1.033746 | 0.5 | 0.5 | 0.523599 | 0.181033 | 4 | "Test" | 2024-06-14 18:21:40.202300 | 2024-06-14 | "Arbitration" |
"pySubjectID_1999" | "-8570391784720840265" | "Retention" | "Proactive_1" | "pyName_124" | "Mobile" | "Inbound" | 0.48 | 0.471498 | 0.762064 | 0.5 | 0.5 | 0.498886 | 0.086043 | 5 | "Test" | 2024-06-14 18:21:40.202300 | 2024-06-14 | "Arbitration" |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
"pySubjectID_1999" | "-8570391784720840265" | "Retention" | "Loans" | "pyName_128" | "Mobile" | "Inbound" | 0.4 | 0.508282 | 0.959186 | 0.001186 | 0.001186 | 0.001189 | 0.000232 | 83 | "Test" | 2024-06-14 18:21:40.202300 | 2024-06-14 | "Arbitration" |
"pySubjectID_1999" | "-8570391784720840265" | "Retention" | "Proactive" | "pyName_394" | "Mobile" | "Inbound" | 0.3 | 0.355954 | 1.019246 | 0.000715 | 0.000715 | 0.000726 | 0.000079 | 84 | "Test" | 2024-06-14 18:21:40.202300 | 2024-06-14 | "Arbitration" |
"pySubjectID_1999" | "-8570391784720840265" | "Retention" | "Proactive" | "pyName_197" | "Mobile" | "Inbound" | 0.3 | 0.073162 | 1.368115 | 0.001919 | 0.001919 | 0.00195 | 0.000059 | 85 | "Test" | 2024-06-14 18:21:40.202300 | 2024-06-14 | "Arbitration" |
"pySubjectID_1999" | "-8570391784720840265" | "Retention" | "RetailBank_1" | "pyName_379" | "Mobile" | "Inbound" | 0.42 | 0.161325 | 0.268833 | 0.001589 | 0.001589 | 0.001556 | 0.000028 | 86 | "Test" | 2024-06-14 18:21:40.202300 | 2024-06-14 | "Arbitration" |
"pySubjectID_1999" | "-8570391784720840265" | "Acquisition" | "Proactive" | "pyName_464" | "Mobile" | "Inbound" | 0.3 | 0.320456 | 1.130449 | 0.000143 | 0.000143 | 0.000146 | 0.000016 | 87 | "Test" | 2024-06-14 18:21:40.202300 | 2024-06-14 | "Arbitration" |
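As an aside, the Priority column in the table above appears to be the product of the four arbitration factors shown in the same rows: FinalPropensity × ContextWeight × Value × Weight (levers). This is an observation from the numbers, not a documented formula; checking it against the rank-1 row:

```python
# Multiply the four factor columns of the rank-1 row shown above;
# the result should match its Priority value of ~0.423471.
priority = 0.487957 * 1.168327 * 0.58 * 1.280707
print(round(priority, 6))
```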
Action Distribution¶
Shows the overall distribution of actions at the Arbitration stage. This makes it easy to spot groups of actions that rarely survive until Arbitration.
[6]:
stage = "Arbitration"
scope_options = ["pyIssue", "pyGroup", "pyName"]
distribution_data = decision_data.getDistributionData(stage, scope_options)
fig = decision_data.plot.distribution_as_treemap(
df=distribution_data, stage=stage, scope_options=scope_options
)
fig.show()
Global Sensitivity¶
- X-Axis (Decisions): the number of decisions affected by the exclusion of each factor.
- Y-Axis (Prioritization Factor): the components of the arbitration formula.
- Bars: each bar represents the percentage of decisions affected by the absence of the corresponding factor.
By identifying the most impactful factors, stakeholders can make strategic adjustments to enhance decision-making accuracy.
It highlights which factors need more attention or refinement. For instance, if “Levers” were to show a significant percentage, it would indicate a need for closer examination and potential improvement.
[7]:
decision_data.plot.sensitivity(win_rank=1)
/home/runner/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/utils.py:41: PerformanceWarning:
Determining the column names of a LazyFrame requires resolving its schema, which is a potentially expensive operation. Use `LazyFrame.collect_schema().names()` to get the column names without this warning.
Wins and Losses in Arbitration¶
Displays the distribution of wins and losses for different “Issues” in the arbitration stage. You can change the level to “Group” or “Action”. Based on the win_rank, actions are classified as either winning or losing.
How to Interpret the Visual:¶
- Dominant Issues: the length of the bars helps identify which issues have the highest and lowest win and loss percentages. For example, if “Retention” has a longer bar in the Wins section, it indicates a higher percentage of winning actions for that issue.
- Comparative Analysis: by comparing the bars, you can quickly see which issues are performing better in arbitration and which are underperforming.
- Resource Allocation: understanding which issues have higher loss percentages allows resources to be reallocated to improve strategies in those areas.
- Decision-Making: provides a clear visual representation of how decisions are distributed across issues, aiding data-driven decisions for future actions.
[8]:
decision_data.plot.global_winloss_distribution(level="pyIssue", win_rank=1)
Personalization Analysis¶
The Personalization Analysis chart helps us understand the availability of options to present to customers during the arbitration stage. If only a limited number of actions survive to this stage, it becomes challenging to personalize offers effectively. Some customers may have only one action available, limiting the ability of our machine learning algorithm to arbitrate effectively.
In the chart:
- The x-axis represents the number of actions available per customer.
- The left y-axis shows the number of decisions; the right y-axis shows the propensity percentage.
- The bars (Optionality) indicate the number of decisions where customers had a specific number of actions available. For instance, a high bar at “2” means many customers had exactly two actions available in arbitration.
- The line (Propensity) represents the average propensity of the top-ranking actions within each bin.
This analysis helps in understanding the distribution of available actions. We expect the average propensity to increase as the number of available actions increases. If many customers have few or no actions available, this should be investigated.
[9]:
decision_data.plot.propensity_vs_optionality(stage="Arbitration")
Win/Loss Analysis¶
Win Analysis¶
Let’s select an action to determine how often it wins and identify which actions it defeats in arbitration.
[10]:
win_rank = 1
# Take the second most frequently winning (rank-1) action as the example
selected_action = (
decision_data.unfiltered_raw_decision_data.filter(pl.col("pxRank") == 1)
.group_by("pyName")
.len()
.sort("len", descending=True)
.collect()
.get_column("pyName")
.to_list()[1]
)
filter_statement = pl.col("pyName") == selected_action
interactions_where_comparison_group_wins = (
decision_data.get_winning_or_losing_interactions(
win_rank=win_rank,
group_filter=filter_statement,
win=True,
)
)
print(
f"selected action '{selected_action}' wins(Rank{win_rank}) in {interactions_where_comparison_group_wins.collect().height} interactions."
)
selected action 'pyName_61' wins(Rank1) in 30 interactions.
The graph below shows the actions that compete in arbitration alongside the selected action, and how many times they lose to it. It highlights which actions are surpassed by the chosen action.
[11]:
# Losing actions in interactions where the selected action wins.
groupby_cols = ["pyIssue", "pyGroup", "pyName"]
winning_from = decision_data.winning_from(
interactions=interactions_where_comparison_group_wins,
win_rank=win_rank,
groupby_cols=groupby_cols,
top_k=20,
)
decision_data.plot.distribution_as_treemap(
df=winning_from, stage="Arbitration", scope_options=groupby_cols
)
Loss Analysis¶
Let’s analyze which actions come out on top when the selected action fails.
[12]:
interactions_where_comparison_group_loses = (
decision_data.get_winning_or_losing_interactions(
win_rank=win_rank,
group_filter=filter_statement,
win=False,
)
)
print(
f"selected action '{selected_action}' loses in {interactions_where_comparison_group_loses.collect().height} interactions."
)
# Winning actions in interactions where the selected action loses.
losing_to = decision_data.winning_from(
interactions=interactions_where_comparison_group_loses,
win_rank=win_rank,
groupby_cols=groupby_cols,
top_k=20,
)
decision_data.plot.distribution_as_treemap(
df=losing_to, stage="Arbitration", scope_options=groupby_cols
)
selected action 'pyName_61' loses in 58 interactions.
What are the Prioritization Factors that make these actions win or lose?¶
The analysis below shows the change in the number of times an action wins when each factor is individually removed from the prioritization calculation. Unlike the Global Sensitivity Analysis above, this chart can show negative numbers. A negative value means that the selected action would win more often if that component were removed from the arbitration process. Therefore, a component with a negative value is contributing to the action’s loss.
[13]:
decision_data.plot.sensitivity(
limit_xaxis_range=False, reference_group=pl.col("pyName") == selected_action
)
Why are the actions winning?¶
Here we show the distribution of the various arbitration factors for the comparison group versus the other actions that reach arbitration in the same interactions.
[14]:
fig, warning_message = decision_data.plot.prio_factor_boxplots(
reference=pl.col("pyName") == selected_action,
sample_size=10000,
)
if warning_message:
print(warning_message)
else:
fig.show()
Rank Distribution of Comparison Group¶
Showing the distribution of the prioritization rank of the selected actions. If the typical rank is far from 1, the selected actions are rarely winning.
[15]:
decision_data.plot.rank_boxplot(
reference=pl.col("pyName") == selected_action,
)