Explainability Extract Analysis¶
Pega
2024-06-11
Welcome to the Explainability Extract Demo. This notebook guides you through analyzing the Explainability Extract v1 dataset using the DecisionAnalyzer class of the pdstools library. At this point, the dataset covers only the “Arbitration” stage, but we intend to widen the data scope in subsequent iterations. This dataset can be extracted from Infinity 24.1 and earlier versions.
We developed this notebook with a dual purpose. Firstly, we aim to familiarize you with the various functions and visualizations available in the DecisionAnalyzer class. You’ll learn how to aggregate and visualize data in ways that are meaningful and insightful for your specific use cases.
Secondly, we hope this notebook will inspire you. The analysis and visualizations demonstrated here are only the tip of the iceberg. We encourage you to think creatively and explore other ways this data can be utilized. Consider this notebook a springboard for your analysis journey.
Each data point represents a decision made in real time, providing a snapshot of the arbitration process. By examining this data, you can delve into the intricacies of these decisions and gain a deeper understanding of the decision-making process.
As you navigate through this notebook, remember that it is interactive. This means you can not only run each code cell to see the results but also tweak the code and experiment as you go along.
[2]:
from pdstools.decision_analyzer.data_read_utils import read_data
from pdstools.decision_analyzer.decision_data import DecisionAnalyzer
from pdstools import read_ds_export
import polars as pl
Read the Data and Create a DecisionAnalyzer Instance¶
[3]:
df = read_ds_export(
    filename="sample_explainability_extract.parquet",
    path="https://raw.githubusercontent.com/pegasystems/pega-datascientist-tools/master/data",
)
decision_data = DecisionAnalyzer(df)
[4]:
decision_data.decision_data.collect()
[4]:
pySubjectID | pxInteractionID | pxDecisionTime | pyIssue | pyGroup | pyName | pyChannel | pyDirection | Value | Context Weight | Levers | pyModelPropensity | pyPropensity | Propensity | Priority | ModelControlGroup | day | pxRank | StageGroup | StageOrder |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | datetime[μs] | str | str | str | str | str | f64 | f64 | f64 | f64 | f64 | f64 | f32 | str | date | u32 | cat | i32 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "CreditCard_1" | "pyName_232" | "Mobile" | "Inbound" | 0.37 | 0.716067 | 1.019114 | 0.033673 | 0.033673 | 0.033687 | 0.009096 | "Test" | 2024-06-14 | 10 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "Billing_1" | "pyName_117" | "Mobile" | "Inbound" | 0.11 | 1.320972 | 1.602437 | 0.011496 | 0.011496 | 0.011559 | 0.002691 | "Test" | 2024-06-14 | 47 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "ActivateUse" | "Proactive_1" | "pyName_362" | "Mobile" | "Inbound" | 0.46 | 0.464855 | 0.921524 | 0.008793 | 0.008793 | 0.009013 | 0.001776 | "Test" | 2024-06-14 | 59 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "Proactive_1" | "pyName_180" | "Mobile" | "Inbound" | 0.3 | 0.397316 | 0.525677 | 0.03096 | 0.03096 | 0.03152 | 0.001975 | "Test" | 2024-06-14 | 56 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "Proactive_1" | "pyName_543" | "Mobile" | "Inbound" | 0.46 | 0.7622 | 1.397362 | 0.00435 | 0.00435 | 0.004484 | 0.002197 | "Test" | 2024-06-14 | 54 | "Arbitration" | 1 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
"pySubjectID_466" | "-8570391784720831070" | 2024-06-17 18:21:40.202300 | "Engagement" | "Proactive_1" | "pyName_242" | "Mobile" | "Inbound" | 0.3 | 0.100824 | 1.059814 | 0.009813 | 0.009813 | 0.009984 | 0.00032 | "Test" | 2024-06-17 | 66 | "Arbitration" | 1 |
"pySubjectID_466" | "-8570391784720831070" | 2024-06-17 18:21:40.202300 | "ActivateUse" | "Mortgage" | "pyName_137" | "Mobile" | "Inbound" | 0.58 | 1.363915 | 0.879706 | 0.002173 | 0.002173 | 0.002199 | 0.00153 | "Test" | 2024-06-17 | 50 | "Arbitration" | 1 |
"pySubjectID_466" | "-8570391784720831070" | 2024-06-17 18:21:40.202300 | "Retention" | "Mortgage" | "pyName_61" | "Mobile" | "Inbound" | 0.58 | 1.247556 | 0.788446 | 0.5 | 0.5 | 0.501949 | 0.286365 | "Test" | 2024-06-17 | 2 | "Arbitration" | 1 |
"pySubjectID_466" | "-8570391784720831070" | 2024-06-17 18:21:40.202300 | "Retention" | "Proactive" | "pyName_188" | "Mobile" | "Inbound" | 0.58 | 1.583168 | 1.221581 | 0.008399 | 0.008399 | 0.008722 | 0.009784 | "Test" | 2024-06-17 | 14 | "Arbitration" | 1 |
"pySubjectID_466" | "-8570391784720831070" | 2024-06-17 18:21:40.202300 | "Retention" | "Loans_1" | "pyName_194" | "Mobile" | "Inbound" | 0.81 | 0.770343 | 0.778408 | 0.030092 | 0.030092 | 0.029864 | 0.014505 | "Test" | 2024-06-17 | 8 | "Arbitration" | 1 |
Overview¶
The get_overview_stats property of DecisionAnalyzer shows general statistics of the data.
[5]:
decision_data.get_overview_stats
---------------------------------------------------------------------------
ColumnNotFoundError Traceback (most recent call last)
Cell In[5], line 1
----> 1 decision_data.get_overview_stats
File ~/.local/share/uv/python/cpython-3.11.12-linux-x86_64-gnu/lib/python3.11/functools.py:1001, in cached_property.__get__(self, instance, owner)
999 val = cache.get(self.attrname, _NOT_FOUND)
1000 if val is _NOT_FOUND:
-> 1001 val = self.func(instance)
1002 try:
1003 cache[self.attrname] = val
File ~/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/decision_data.py:831, in DecisionAnalyzer.get_overview_stats(self)
826 @cached_property
827 def get_overview_stats(self):
828 """Creates an overview from sampled data"""
830 nOffersPerStage = (
--> 831 self.get_optionality_data(self.sample)
832 .group_by(self.level)
833 .agg(pl.col("nOffers").mean().round().cast(pl.Int16))
834 .collect()
835 )
837 def _offer_counts(stage):
838 return (
839 (
840 nOffersPerStage.filter(pl.col(self.level) == stage)
(...)
845 else 0
846 )
File ~/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/decision_data.py:519, in DecisionAnalyzer.get_optionality_data(self, df)
503 expr = [
504 pl.len().alias("nOffers"),
505 pl.col("Propensity")
(...)
508 .alias("bestPropensity"),
509 ]
510 per_offer_count_and_stage = (
511 self.aggregate_remaining_per_stage(
512 df=df,
(...)
517 .agg(Interactions=pl.len(), AverageBestPropensity=pl.mean("bestPropensity"))
518 )
--> 519 schema = per_offer_count_and_stage.collect_schema()
520 zero_actions = (
521 per_offer_count_and_stage.group_by("StageGroup")
522 .agg(interaction_count=pl.sum("Interactions"))
(...)
530 .drop("interaction_count")
531 )
532 optionality_data = pl.concat(
533 [
534 per_offer_count_and_stage,
535 zero_actions.select(per_offer_count_and_stage.collect_schema().names()),
536 ]
537 ).sort("nOffers", descending=True)
File ~/work/pega-datascientist-tools/pega-datascientist-tools/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:2446, in LazyFrame.collect_schema(self)
2416 def collect_schema(self) -> Schema:
2417 """
2418 Resolve the schema of this LazyFrame.
2419
(...)
2444 3
2445 """
-> 2446 return Schema(self._ldf.collect_schema(), check_dtypes=False)
ColumnNotFoundError: unable to find column "pxRecordType"; valid columns: ["pySubjectID", "pxInteractionID", "pxDecisionTime", "pyIssue", "pyGroup", "pyName", "pyChannel", "pyDirection", "Value", "Context Weight", "Levers", "pyModelPropensity", "pyPropensity", "Propensity", "Priority", "ModelControlGroup", "day", "pxRank", "StageGroup", "StageOrder"]
Resolved plan until failure:
---> FAILED HERE RESOLVING 'group_by' <---
DF ["pySubjectID", "pxInteractionID", "pxDecisionTime", "pyIssue", ...]; PROJECT */20 COLUMNS
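The traceback above ends in a ColumnNotFoundError for pxRecordType, a column that is not among the 20 valid columns listed in the message. A minimal sketch of a pre-flight check in plain Python follows. The helper name missing_columns and the contents of the required list are illustrative assumptions, not part of pdstools; only pxRecordType is taken from the traceback.

```python
# Columns reported as valid in the error message above.
available = [
    "pySubjectID", "pxInteractionID", "pxDecisionTime", "pyIssue", "pyGroup",
    "pyName", "pyChannel", "pyDirection", "Value", "Context Weight", "Levers",
    "pyModelPropensity", "pyPropensity", "Propensity", "Priority",
    "ModelControlGroup", "day", "pxRank", "StageGroup", "StageOrder",
]


def missing_columns(available, required):
    """Return the required column names that are not available, sorted."""
    return sorted(set(required) - set(available))


# Hypothetical requirement list; only pxRecordType comes from the traceback.
required = ["pxInteractionID", "StageGroup", "StageOrder", "pxRecordType"]
missing = missing_columns(available, required)
print(missing)
```

Running a check like this before calling an aggregation makes it easier to see whether a failure comes from the extract itself or from the analysis code.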
Let’s take a look at a single decision. The height of the dataframe shows how many actions are available at the Arbitration stage for this customer interaction. The pxRank column shows the rank of each action in arbitration.
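As a rough illustration of how pxRank relates to Priority, the sketch below ranks three actions of the sample interaction, with values copied from the three top-ranked rows in this notebook's output. It assumes that rank 1 goes to the highest Priority and that Priority is the product of Value, Context Weight, Levers and Propensity. Both assumptions are consistent with the sample rows but are not taken from the pdstools source.

```python
# (pyName, Value, Context Weight, Levers, Propensity) from the sample output.
actions = [
    ("pyName_359", 0.61, 0.560515, 0.96787, 0.551238),
    ("pyName_476", 0.58, 1.168327, 1.280707, 0.487957),
    ("pyName_61", 0.58, 0.612144, 1.592721, 0.491959),
]


def priority(value, context_weight, levers, propensity):
    """Assumed arbitration formula: the product of the four factors."""
    return value * context_weight * levers * propensity


# Sort by descending priority; the position in the sorted list is the rank.
scored = sorted(
    ((name, priority(v, cw, lv, p)) for name, v, cw, lv, p in actions),
    key=lambda item: item[1],
    reverse=True,
)
ranks = {name: rank for rank, (name, _) in enumerate(scored, start=1)}

for name, prio in scored:
    print(f"{name}: priority={prio:.6f} rank={ranks[name]}")
```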
[6]:
selected_interaction_id = (
    decision_data.unfiltered_raw_decision_data.select("pxInteractionID")
    .first()
    .collect()
    .row(0)[0]
)
print(f"{selected_interaction_id=}")
decision_data.unfiltered_raw_decision_data.filter(
    pl.col("pxInteractionID") == selected_interaction_id
).sort("pxRank").collect()
selected_interaction_id='-8570391784720840265'
[6]:
pySubjectID | pxInteractionID | pxDecisionTime | pyIssue | pyGroup | pyName | pyChannel | pyDirection | Value | Context Weight | Levers | pyModelPropensity | pyPropensity | Propensity | Priority | ModelControlGroup | day | pxRank | StageGroup | StageOrder |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
str | str | datetime[μs] | str | str | str | str | str | f64 | f64 | f64 | f64 | f64 | f64 | f32 | str | date | u32 | cat | i32 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "Mortgage" | "pyName_476" | "Mobile" | "Inbound" | 0.58 | 1.168327 | 1.280707 | 0.5 | 0.5 | 0.487957 | 0.423471 | "Test" | 2024-06-14 | 1 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "Mortgage" | "pyName_61" | "Mobile" | "Inbound" | 0.58 | 0.612144 | 1.592721 | 0.5 | 0.5 | 0.491959 | 0.278196 | "Test" | 2024-06-14 | 2 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "Proactive_1" | "pyName_359" | "Mobile" | "Inbound" | 0.61 | 0.560515 | 0.96787 | 0.5 | 0.5 | 0.551238 | 0.18242 | "Test" | 2024-06-14 | 3 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "RetailBank_1" | "pyName_211" | "Mobile" | "Inbound" | 0.38 | 0.88016 | 1.033746 | 0.5 | 0.5 | 0.523599 | 0.181033 | "Test" | 2024-06-14 | 4 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "Proactive_1" | "pyName_124" | "Mobile" | "Inbound" | 0.48 | 0.471498 | 0.762064 | 0.5 | 0.5 | 0.498886 | 0.086043 | "Test" | 2024-06-14 | 5 | "Arbitration" | 1 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "Loans" | "pyName_128" | "Mobile" | "Inbound" | 0.4 | 0.508282 | 0.959186 | 0.001186 | 0.001186 | 0.001189 | 0.000232 | "Test" | 2024-06-14 | 83 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "Proactive" | "pyName_394" | "Mobile" | "Inbound" | 0.3 | 0.355954 | 1.019246 | 0.000715 | 0.000715 | 0.000726 | 0.000079 | "Test" | 2024-06-14 | 84 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "Proactive" | "pyName_197" | "Mobile" | "Inbound" | 0.3 | 0.073162 | 1.368115 | 0.001919 | 0.001919 | 0.00195 | 0.000059 | "Test" | 2024-06-14 | 85 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Retention" | "RetailBank_1" | "pyName_379" | "Mobile" | "Inbound" | 0.42 | 0.161325 | 0.268833 | 0.001589 | 0.001589 | 0.001556 | 0.000028 | "Test" | 2024-06-14 | 86 | "Arbitration" | 1 |
"pySubjectID_1999" | "-8570391784720840265" | 2024-06-14 18:21:40.202300 | "Acquisition" | "Proactive" | "pyName_464" | "Mobile" | "Inbound" | 0.3 | 0.320456 | 1.130449 | 0.000143 | 0.000143 | 0.000146 | 0.000016 | "Test" | 2024-06-14 | 87 | "Arbitration" | 1 |
Action Distribution¶
Shows the overall distribution of actions at the Arbitration stage. Use it to detect whether a group of actions rarely survives until Arbitration.
[7]:
stage = "Arbitration"
scope_options = ["pyIssue", "pyGroup", "pyName"]
distribution_data = decision_data.getDistributionData(stage, scope_options)
fig = decision_data.plot.distribution_as_treemap(
    df=distribution_data, stage=stage, scope_options=scope_options
)
fig.show()
---------------------------------------------------------------------------
ColumnNotFoundError Traceback (most recent call last)
Cell In[7], line 3
1 stage = "Arbitration"
2 scope_options = ["pyIssue", "pyGroup", "pyName"]
----> 3 distribution_data = decision_data.getDistributionData(stage, scope_options)
4 fig = decision_data.plot.distribution_as_treemap(
5 df=distribution_data, stage=stage, scope_options=scope_options
6 )
7 fig.show()
File ~/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/decision_data.py:347, in DecisionAnalyzer.getDistributionData(self, stage, grouping_levels, trend, additional_filters)
339 def getDistributionData(
340 self,
341 stage: str,
(...)
344 additional_filters: Optional[Union[pl.Expr, List[pl.Expr]]] = None,
345 ) -> pl.LazyFrame:
346 distribution_data = (
--> 347 apply_filter(self.getPreaggregatedRemainingView, additional_filters)
348 .filter(pl.col(self.level) == stage)
349 .group_by(["day"] + [grouping_levels] if trend else grouping_levels)
350 .agg(pl.sum("Decisions"))
351 .sort("Decisions", descending=True)
352 .filter(pl.col("Decisions") > 0)
353 )
355 return distribution_data
File ~/.local/share/uv/python/cpython-3.11.12-linux-x86_64-gnu/lib/python3.11/functools.py:1001, in cached_property.__get__(self, instance, owner)
999 val = cache.get(self.attrname, _NOT_FOUND)
1000 if val is _NOT_FOUND:
-> 1001 val = self.func(instance)
1002 try:
1003 cache[self.attrname] = val
File ~/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/decision_data.py:219, in DecisionAnalyzer.getPreaggregatedRemainingView(self)
210 @cached_property
211 def getPreaggregatedRemainingView(self):
212 """Pre-aggregates the full dataset over customers and interactions providing a view of remaining offers.
213
214 This pre-aggregation builds on the filter view and aggregates over
215 the stages remaining.
216 """
217 self.preaggregated_decision_data_remainingview = (
218 self.aggregate_remaining_per_stage(
--> 219 self.getPreaggregatedFilterView,
220 self.preaggregation_columns,
221 [
222 pl.sum("Decisions"),
223 pl.min("pxDecisionTime_min"),
224 pl.max("pxDecisionTime_max"),
225 pl.min("Value_min"),
226 pl.max("Value_max"),
227 # pl.col("Propensity").sample(
228 # n=num_samples, with_replacement=True, shuffle=True
229 # ), # a list sample values - for distribution plots
230 pl.first("Propensity"),
231 pl.min("Propensity_min"),
232 pl.max("Propensity_max"),
233 # pl.col("Priority")
234 # .sample(n=num_samples, with_replacement=True, shuffle=True)
235 # .alias("Priority"),
236 pl.first("Priority"),
237 pl.min("Priority_min"),
238 pl.max("Priority_max"),
239 ]
240 + [pl.sum(f"Win_at_rank{i}") for i in range(1, self.max_win_rank + 1)],
241 )
242 .collect()
243 .lazy()
244 )
245 return self.preaggregated_decision_data_remainingview
File ~/.local/share/uv/python/cpython-3.11.12-linux-x86_64-gnu/lib/python3.11/functools.py:1001, in cached_property.__get__(self, instance, owner)
999 val = cache.get(self.attrname, _NOT_FOUND)
1000 if val is _NOT_FOUND:
-> 1001 val = self.func(instance)
1002 try:
1003 cache[self.attrname] = val
File ~/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/decision_data.py:205, in DecisionAnalyzer.getPreaggregatedFilterView(self)
182 stats_cols = ["pxDecisionTime", "Value", "Propensity", "Priority"]
183 exprs = [
184 pl.col("pxInteractionID")
185 .where(pl.col("pxRank") <= i)
(...)
195 pl.count().alias("Decisions"),
196 ]
198 self.preaggregated_decision_data_filterview = (
199 self.decision_data.group_by(
200 self.preaggregation_columns.union(
201 {self.level, "StageOrder", "pxRecordType"}
202 )
203 )
204 .agg(exprs)
--> 205 .collect()
206 .lazy()
207 )
208 return self.preaggregated_decision_data_filterview
File ~/work/pega-datascientist-tools/pega-datascientist-tools/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py:93, in deprecate_streaming_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
89 kwargs["engine"] = "in-memory"
91 del kwargs["streaming"]
---> 93 return function(*args, **kwargs)
File ~/work/pega-datascientist-tools/pega-datascientist-tools/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:2224, in LazyFrame.collect(self, type_coercion, _type_check, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, engine, background, _check_order, _eager, **_kwargs)
2222 # Only for testing purposes
2223 callback = _kwargs.get("post_opt_callback", callback)
-> 2224 return wrap_df(ldf.collect(engine, callback))
ColumnNotFoundError: pxRecordType
Resolved plan until failure:
---> FAILED HERE RESOLVING 'sink' <---
FILTER col("pyName").is_not_null()
FROM
WITH_COLUMNS:
[col("pyIssue").strict_cast(String), col("pyGroup").strict_cast(String), col("pyChannel").strict_cast(String), col("pyDirection").strict_cast(String)]
WITH_COLUMNS:
[col("StageGroup").strict_cast(Categorical(None, Physical))]
WITH_COLUMNS:
["Arbitration".alias("StageGroup"), dyn int: 1.alias("StageOrder")]
WITH_COLUMNS:
[col("Priority").rank().over([col("pxInteractionID")]).alias("pxRank")]
WITH_COLUMNS:
[col("pxDecisionTime").dt.date().alias("day")]
SELECT [col("pySubjectID"), col("pxInteractionID"), col("pxDecisionTime"), col("pyIssue"), col("pyGroup"), col("pyName"), col("pyChannel"), col("pyDirection"), col("Value"), col("Context Weight"), col("Levers"), col("pyModelPropensity"), col("pyPropensity"), col("Propensity"), col("Priority"), col("ModelControlGroup")]
FROM
RENAME
WITH_COLUMNS:
[col("ModelControlGroup").strict_cast(String)]
WITH_COLUMNS:
[col("Priority").strict_cast(Float32)]
WITH_COLUMNS:
[col("pyChannel").strict_cast(Categorical(None, Physical))]
WITH_COLUMNS:
[col("pyGroup").strict_cast(Categorical(None, Physical))]
WITH_COLUMNS:
[col("pyIssue").strict_cast(Categorical(None, Physical))]
DF ["pySubjectID", "pxInteractionID", "pyIssue", "pyGroup", ...]; PROJECT */17 COLUMNS
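The distribution behind the Action Distribution treemap is, in essence, a count of decisions per scope level (Issue, Group, Action). A minimal sketch in plain Python, using the Issue / Group / Name values of the first five rows shown in the data preview earlier. The helper name distribution is hypothetical, and this is not how pdstools computes the view internally.

```python
from collections import Counter

# (pyIssue, pyGroup, pyName) of the first five rows of the data preview.
rows = [
    ("Retention", "CreditCard_1", "pyName_232"),
    ("Retention", "Billing_1", "pyName_117"),
    ("ActivateUse", "Proactive_1", "pyName_362"),
    ("Retention", "Proactive_1", "pyName_180"),
    ("Retention", "Proactive_1", "pyName_543"),
]


def distribution(rows, depth):
    """Count decisions per scope: depth 1 = issue, 2 = issue/group, 3 = action."""
    return Counter(row[:depth] for row in rows)


by_issue = distribution(rows, 1)
by_group = distribution(rows, 2)
print(by_issue.most_common())
print(by_group.most_common())
```

A scope with a very small count relative to its siblings is the kind of group that rarely survives until Arbitration.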
Global Sensitivity¶
X-Axis (Decisions): Represents the number of decisions affected by the exclusion of each factor.
Y-Axis (Prioritization Factor): Lists the Arbitration formula components.
Bars: Each bar represents the percentage of decisions affected by the absence of the corresponding factor.
By identifying the most impactful factors, stakeholders can make strategic adjustments to enhance decision-making accuracy.
The chart also highlights which factors need more attention or refinement. For instance, if “Levers” were to show a significant percentage, it would indicate a need for closer examination and potential improvement.
[8]:
decision_data.plot.sensitivity(win_rank=1)
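To make the idea of "decisions affected by the exclusion of a factor" concrete, the sketch below recomputes the winner of one interaction with each factor left out in turn and counts how often the rank-1 action changes. It uses the five top-ranked actions of the sample interaction shown earlier (a subset of its actions), and it assumes that Priority is the product of the four factors and that excluding a factor means dropping it from the product. The function names are hypothetical; pdstools may compute this differently.

```python
from math import prod

FACTORS = ("Propensity", "Value", "Context Weight", "Levers")

# Five top-ranked actions of interaction -8570391784720840265 (a subset).
interactions = {
    "-8570391784720840265": [
        {"pyName": "pyName_476", "Value": 0.58, "Context Weight": 1.168327,
         "Levers": 1.280707, "Propensity": 0.487957},
        {"pyName": "pyName_61", "Value": 0.58, "Context Weight": 0.612144,
         "Levers": 1.592721, "Propensity": 0.491959},
        {"pyName": "pyName_359", "Value": 0.61, "Context Weight": 0.560515,
         "Levers": 0.96787, "Propensity": 0.551238},
        {"pyName": "pyName_211", "Value": 0.38, "Context Weight": 0.88016,
         "Levers": 1.033746, "Propensity": 0.523599},
        {"pyName": "pyName_124", "Value": 0.48, "Context Weight": 0.471498,
         "Levers": 0.762064, "Propensity": 0.498886},
    ],
}


def winner(actions, skip=None):
    """Name of the action with the highest priority, optionally ignoring a factor."""
    def score(action):
        return prod(action[f] for f in FACTORS if f != skip)
    return max(actions, key=score)["pyName"]


def global_sensitivity(interactions):
    """Per factor, count interactions whose rank-1 action changes without it."""
    return {
        factor: sum(
            winner(actions, skip=factor) != winner(actions)
            for actions in interactions.values()
        )
        for factor in FACTORS
    }


sensitivity = global_sensitivity(interactions)
print(sensitivity)
```

In this small subset only Context Weight changes the outcome: without it, pyName_61 would overtake pyName_476.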
Wins and Losses in Arbitration¶
Displays the distribution of wins and losses for different “Issues” in the arbitration stage. You can change the level to “Group” or “Action”. Based on win_rank, actions are classified as either winning or losing.
How to Interpret the Visual:¶
Dominant Issues: The length of the bars helps identify which issues have the highest and lowest win and loss percentages. For example, if “Retention” has a longer bar in the Wins section, it indicates a higher percentage of winning actions for that issue.
Comparative Analysis: By comparing the bars, you can quickly see which issues are performing better in terms of winning in arbitration and which are underperforming.
Resource Allocation: By understanding which issues have higher loss percentages, resources can be reallocated to improve strategies in those areas.
Decision-Making: Provides a clear visual representation of how decisions are distributed across different issues, aiding in making data-driven decisions for future actions.
[9]:
decision_data.plot.global_winloss_distribution(level="pyIssue", win_rank=1)
---------------------------------------------------------------------------
ColumnNotFoundError Traceback (most recent call last)
Cell In[9], line 1
----> 1 decision_data.plot.global_winloss_distribution(level="pyIssue", win_rank=1)
File ~/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/plots.py:122, in Plot.global_winloss_distribution(self, level, win_rank, return_df)
120 def global_winloss_distribution(self, level, win_rank, return_df=False):
121 # level, cat = getScope(level)
--> 122 df = self._decision_data.get_win_loss_distribution_data(level, win_rank)
123 if return_df:
124 return df
File ~/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/decision_data.py:473, in DecisionAnalyzer.get_win_loss_distribution_data(self, level, win_rank)
470 def get_win_loss_distribution_data(self, level, win_rank):
471 win_col = f"Win_at_rank{win_rank}"
472 group_level_win_losses = (
--> 473 self.getPreaggregatedRemainingView.filter(
474 pl.col(self.level) == "Arbitration"
475 )
476 .group_by(level)
477 .agg(Wins=pl.sum(win_col), Decisions=pl.sum("Decisions"))
478 .with_columns(Losses=pl.col("Decisions") - pl.col("Wins"))
479 .with_columns(
480 Wins=pl.col("Wins") / pl.sum("Wins"),
481 Losses=pl.col("Losses") / pl.sum("Losses"),
482 )
483 )
485 group_level_win_losses = group_level_win_losses.melt(
486 id_vars=level,
487 value_vars=["Wins", "Losses"],
488 variable_name="Status",
489 value_name="Percentage",
490 ).sort(["Status", "Percentage"], descending=True)
492 return group_level_win_losses
File ~/.local/share/uv/python/cpython-3.11.12-linux-x86_64-gnu/lib/python3.11/functools.py:1001, in cached_property.__get__(self, instance, owner)
999 val = cache.get(self.attrname, _NOT_FOUND)
1000 if val is _NOT_FOUND:
-> 1001 val = self.func(instance)
1002 try:
1003 cache[self.attrname] = val
File ~/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/decision_data.py:219, in DecisionAnalyzer.getPreaggregatedRemainingView(self)
210 @cached_property
211 def getPreaggregatedRemainingView(self):
212 """Pre-aggregates the full dataset over customers and interactions providing a view of remaining offers.
213
214 This pre-aggregation builds on the filter view and aggregates over
215 the stages remaining.
216 """
217 self.preaggregated_decision_data_remainingview = (
218 self.aggregate_remaining_per_stage(
--> 219 self.getPreaggregatedFilterView,
220 self.preaggregation_columns,
221 [
222 pl.sum("Decisions"),
223 pl.min("pxDecisionTime_min"),
224 pl.max("pxDecisionTime_max"),
225 pl.min("Value_min"),
226 pl.max("Value_max"),
227 # pl.col("Propensity").sample(
228 # n=num_samples, with_replacement=True, shuffle=True
229 # ), # a list sample values - for distribution plots
230 pl.first("Propensity"),
231 pl.min("Propensity_min"),
232 pl.max("Propensity_max"),
233 # pl.col("Priority")
234 # .sample(n=num_samples, with_replacement=True, shuffle=True)
235 # .alias("Priority"),
236 pl.first("Priority"),
237 pl.min("Priority_min"),
238 pl.max("Priority_max"),
239 ]
240 + [pl.sum(f"Win_at_rank{i}") for i in range(1, self.max_win_rank + 1)],
241 )
242 .collect()
243 .lazy()
244 )
245 return self.preaggregated_decision_data_remainingview
File ~/.local/share/uv/python/cpython-3.11.12-linux-x86_64-gnu/lib/python3.11/functools.py:1001, in cached_property.__get__(self, instance, owner)
999 val = cache.get(self.attrname, _NOT_FOUND)
1000 if val is _NOT_FOUND:
-> 1001 val = self.func(instance)
1002 try:
1003 cache[self.attrname] = val
File ~/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/decision_data.py:205, in DecisionAnalyzer.getPreaggregatedFilterView(self)
182 stats_cols = ["pxDecisionTime", "Value", "Propensity", "Priority"]
183 exprs = [
184 pl.col("pxInteractionID")
185 .where(pl.col("pxRank") <= i)
(...)
195 pl.count().alias("Decisions"),
196 ]
198 self.preaggregated_decision_data_filterview = (
199 self.decision_data.group_by(
200 self.preaggregation_columns.union(
201 {self.level, "StageOrder", "pxRecordType"}
202 )
203 )
204 .agg(exprs)
--> 205 .collect()
206 .lazy()
207 )
208 return self.preaggregated_decision_data_filterview
File ~/work/pega-datascientist-tools/pega-datascientist-tools/.venv/lib/python3.11/site-packages/polars/_utils/deprecation.py:93, in deprecate_streaming_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
89 kwargs["engine"] = "in-memory"
91 del kwargs["streaming"]
---> 93 return function(*args, **kwargs)
File ~/work/pega-datascientist-tools/pega-datascientist-tools/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:2224, in LazyFrame.collect(self, type_coercion, _type_check, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, collapse_joins, no_optimization, engine, background, _check_order, _eager, **_kwargs)
2222 # Only for testing purposes
2223 callback = _kwargs.get("post_opt_callback", callback)
-> 2224 return wrap_df(ldf.collect(engine, callback))
ColumnNotFoundError: pxRecordType
Resolved plan until failure:
---> FAILED HERE RESOLVING 'sink' <---
FILTER col("pyName").is_not_null()
FROM
WITH_COLUMNS:
[col("pyIssue").strict_cast(String), col("pyGroup").strict_cast(String), col("pyChannel").strict_cast(String), col("pyDirection").strict_cast(String)]
WITH_COLUMNS:
[col("StageGroup").strict_cast(Categorical(None, Physical))]
WITH_COLUMNS:
["Arbitration".alias("StageGroup"), dyn int: 1.alias("StageOrder")]
WITH_COLUMNS:
[col("Priority").rank().over([col("pxInteractionID")]).alias("pxRank")]
WITH_COLUMNS:
[col("pxDecisionTime").dt.date().alias("day")]
SELECT [col("pySubjectID"), col("pxInteractionID"), col("pxDecisionTime"), col("pyIssue"), col("pyGroup"), col("pyName"), col("pyChannel"), col("pyDirection"), col("Value"), col("Context Weight"), col("Levers"), col("pyModelPropensity"), col("pyPropensity"), col("Propensity"), col("Priority"), col("ModelControlGroup")]
FROM
RENAME
WITH_COLUMNS:
[col("ModelControlGroup").strict_cast(String)]
WITH_COLUMNS:
[col("Priority").strict_cast(Float32)]
WITH_COLUMNS:
[col("pyChannel").strict_cast(Categorical(None, Physical))]
WITH_COLUMNS:
[col("pyGroup").strict_cast(Categorical(None, Physical))]
WITH_COLUMNS:
[col("pyIssue").strict_cast(Categorical(None, Physical))]
DF ["pySubjectID", "pxInteractionID", "pyIssue", "pyGroup", ...]; PROJECT */17 COLUMNS
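The win/loss distribution can be sketched in plain Python as follows. The logic mirrors the aggregation visible in the traceback above (wins per level, losses as decisions minus wins, each normalized by its total), but the records and the helper name win_loss_shares are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical (pxInteractionID, pyIssue, pxRank) records.
records = [
    ("I1", "Retention", 1), ("I1", "Retention", 2), ("I1", "ActivateUse", 3),
    ("I2", "Retention", 1), ("I2", "Engagement", 2),
    ("I3", "Engagement", 1), ("I3", "Retention", 2), ("I3", "ActivateUse", 3),
]


def win_loss_shares(records, win_rank=1):
    """Share of all wins and of all losses that falls to each issue."""
    wins, losses = Counter(), Counter()
    for _, issue, rank in records:
        (wins if rank <= win_rank else losses)[issue] += 1
    total_wins, total_losses = sum(wins.values()), sum(losses.values())
    issues = sorted(set(wins) | set(losses))
    return {
        issue: {
            "Wins": wins[issue] / total_wins if total_wins else 0.0,
            "Losses": losses[issue] / total_losses if total_losses else 0.0,
        }
        for issue in issues
    }


shares = win_loss_shares(records)
for issue, share in shares.items():
    print(issue, share)
```

Raising win_rank widens the definition of a win, so an issue that never reaches rank 1 can still show a win share at win_rank=2 or higher.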
Personalization Analysis¶
The Personalization Analysis chart helps us understand the availability of options to present to customers during the arbitration stage. If only a limited number of actions survive to this stage, it becomes challenging to personalize offers effectively. Some customers may have only one action available, limiting the ability of our machine learning algorithm to arbitrate effectively.
In the chart:
The x-axis represents the number of actions available per customer. The left y-axis shows the number of decisions. The right y-axis shows the propensity percentage.
The bars (Optionality) indicate the number of decisions where customers had a specific number of actions available. For instance, a high bar at “2” means many customers had exactly two actions available in arbitration. The line (Propensity) represents the average propensity of the top-ranking actions within each bin.
This analysis helps in understanding the distribution of available actions. We expect the average propensity to increase as the number of available actions increases. If many customers have few or no actions available, this should be investigated.
[10]:
decision_data.plot.propensity_vs_optionality(stage="Arbitration")
---------------------------------------------------------------------------
ColumnNotFoundError Traceback (most recent call last)
Cell In[10], line 1
----> 1 decision_data.plot.propensity_vs_optionality(stage="Arbitration")
File ~/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/plots.py:151, in Plot.propensity_vs_optionality(self, stage, df, return_df)
148 if df is None:
149 df = self._decision_data.sample
150 plotData = (
--> 151 self._decision_data.get_optionality_data(df)
152 .filter(pl.col(self._decision_data.level) == stage)
153 .collect()
154 )
155 if return_df:
156 return plotData
File ~/work/pega-datascientist-tools/pega-datascientist-tools/python/pdstools/decision_analyzer/decision_data.py:519, in DecisionAnalyzer.get_optionality_data(self, df)
503 expr = [
504 pl.len().alias("nOffers"),
505 pl.col("Propensity")
(...)
508 .alias("bestPropensity"),
509 ]
510 per_offer_count_and_stage = (
511 self.aggregate_remaining_per_stage(
512 df=df,
(...)
517 .agg(Interactions=pl.len(), AverageBestPropensity=pl.mean("bestPropensity"))
518 )
--> 519 schema = per_offer_count_and_stage.collect_schema()
520 zero_actions = (
521 per_offer_count_and_stage.group_by("StageGroup")
522 .agg(interaction_count=pl.sum("Interactions"))
(...)
530 .drop("interaction_count")
531 )
532 optionality_data = pl.concat(
533 [
534 per_offer_count_and_stage,
535 zero_actions.select(per_offer_count_and_stage.collect_schema().names()),
536 ]
537 ).sort("nOffers", descending=True)
File ~/work/pega-datascientist-tools/pega-datascientist-tools/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:2446, in LazyFrame.collect_schema(self)
2416 def collect_schema(self) -> Schema:
2417 """
2418 Resolve the schema of this LazyFrame.
2419
(...)
2444 3
2445 """
-> 2446 return Schema(self._ldf.collect_schema(), check_dtypes=False)
ColumnNotFoundError: unable to find column "pxRecordType"; valid columns: ["pySubjectID", "pxInteractionID", "pxDecisionTime", "pyIssue", "pyGroup", "pyName", "pyChannel", "pyDirection", "Value", "Context Weight", "Levers", "pyModelPropensity", "pyPropensity", "Propensity", "Priority", "ModelControlGroup", "day", "pxRank", "StageGroup", "StageOrder"]
Resolved plan until failure:
---> FAILED HERE RESOLVING 'group_by' <---
DF ["pySubjectID", "pxInteractionID", "pxDecisionTime", "pyIssue", ...]; PROJECT */20 COLUMNS
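The optionality view described in this section can be approximated with a small grouping: for every interaction, count the available actions and take the best propensity, then average the best propensities per action count. The sketch below uses hypothetical propensities and takes the highest propensity as a stand-in for the top-ranking action. It does not handle interactions with zero actions, and the helper name optionality is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical propensities of the actions available in each interaction.
interactions = {
    "I1": [0.50, 0.03, 0.01],
    "I2": [0.02],
    "I3": [0.10, 0.04],
    "I4": [0.05],
    "I5": [0.30, 0.20, 0.01],
}


def optionality(interactions):
    """Map number of actions -> (interaction count, mean best propensity)."""
    best_by_count = defaultdict(list)
    for propensities in interactions.values():
        best_by_count[len(propensities)].append(max(propensities))
    return {
        n_actions: (len(best), mean(best))
        for n_actions, best in sorted(best_by_count.items())
    }


result = optionality(interactions)
for n_actions, (count, avg_best) in result.items():
    print(f"{n_actions} actions: {count} interactions, "
          f"average best propensity {avg_best:.3f}")
```

The counts correspond to the bars of the chart and the averages to the line.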
Win/Loss Analysis¶
Win Analysis¶
Let’s select an action to determine how often it wins and identify which actions it defeats in arbitration.
[11]:
win_rank = 1
selected_action = (
    decision_data.unfiltered_raw_decision_data.filter(pl.col("pxRank") == 1)
    .group_by("pyName")
    .len()
    .sort("len", descending=True)
    .collect()
    .get_column("pyName")
    .to_list()[1]
)
filter_statement = pl.col("pyName") == selected_action
interactions_where_comparison_group_wins = (
    decision_data.get_winning_or_losing_interactions(
        win_rank=win_rank,
        group_filter=filter_statement,
        win=True,
    )
)
print(
    f"selected action '{selected_action}' wins(Rank{win_rank}) in {interactions_where_comparison_group_wins.collect().height} interactions."
)
selected action 'pyName_61' wins(Rank1) in 30 interactions.
The graph below shows the actions that compete with the selected action in arbitration and how many times they lose to it. It highlights which actions are being surpassed by the selected action.
[12]:
# Losing actions in interactions where the selected action wins.
groupby_cols = ["pyIssue", "pyGroup", "pyName"]
winning_from = decision_data.winning_from(
    interactions=interactions_where_comparison_group_wins,
    win_rank=win_rank,
    groupby_cols=groupby_cols,
    top_k=20,
)
decision_data.plot.distribution_as_treemap(
    df=winning_from, stage="Arbitration", scope_options=groupby_cols
)
Loss Analysis¶
Let’s analyze which actions come out on top when the selected action loses.
[13]:
interactions_where_comparison_group_loses = (
    decision_data.get_winning_or_losing_interactions(
        win_rank=win_rank,
        group_filter=filter_statement,
        win=False,
    )
)
print(
    f"selected action '{selected_action}' loses in {interactions_where_comparison_group_loses.collect().height} interactions."
)
# Winning actions in interactions where the selected action loses.
losing_to = decision_data.winning_from(
    interactions=interactions_where_comparison_group_loses,
    win_rank=win_rank,
    groupby_cols=groupby_cols,
    top_k=20,
)
decision_data.plot.distribution_as_treemap(
    df=losing_to, stage="Arbitration", scope_options=groupby_cols
)
selected action 'pyName_61' loses in 58 interactions.
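The win and loss analyses above both reduce to the same bookkeeping: split each interaction at win_rank and count the actions on the other side of the split from the selected action. A minimal sketch with hypothetical interactions follows. The helper names beaten_by_selected and winners_over_selected are illustrative and are not pdstools functions.

```python
from collections import Counter

# Hypothetical interactions: action names in rank order (first = rank 1).
ranked = {
    "I1": ["A", "B", "C"],
    "I2": ["B", "A"],
    "I3": ["A", "C"],
    "I4": ["C", "A", "B"],
}


def beaten_by_selected(ranked, selected, win_rank=1):
    """Actions ranked below win_rank in interactions the selected action wins."""
    beaten = Counter()
    for actions in ranked.values():
        if selected in actions[:win_rank]:
            beaten.update(actions[win_rank:])
    return beaten


def winners_over_selected(ranked, selected, win_rank=1):
    """Actions within win_rank in interactions where the selected action loses."""
    winners = Counter()
    for actions in ranked.values():
        if selected in actions[win_rank:]:
            winners.update(actions[:win_rank])
    return winners


beaten = beaten_by_selected(ranked, "A")
winners = winners_over_selected(ranked, "A")
print(beaten.most_common())
print(winners.most_common())
```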
What are the Prioritization Factors that make these actions win or lose?¶
The analysis below shows the change in the number of times an action wins when each factor is individually removed from the prioritization calculation. Unlike the Global Sensitivity Analysis above, this chart can show negative numbers. A negative value means that the selected action would win more often if that component were removed from the arbitration process. Therefore, a component with a negative value is contributing to the action’s loss.
[14]:
decision_data.plot.sensitivity(
    limit_xaxis_range=False, reference_group=pl.col("pyName") == selected_action
)
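The sign convention described above can be illustrated with a small calculation: the value for a factor is the number of wins with the full formula minus the number of wins with that factor left out. The sketch assumes that Priority is the product of the four factors, uses five actions of the sample interaction shown earlier (a subset), and relies on hypothetical helper names. It is not the pdstools implementation.

```python
from math import prod

FACTORS = ("Value", "Context Weight", "Levers", "Propensity")

# (pyName, Value, Context Weight, Levers, Propensity) for one interaction.
interactions = [
    [
        ("pyName_476", 0.58, 1.168327, 1.280707, 0.487957),
        ("pyName_61", 0.58, 0.612144, 1.592721, 0.491959),
        ("pyName_359", 0.61, 0.560515, 0.96787, 0.551238),
        ("pyName_211", 0.38, 0.88016, 1.033746, 0.523599),
        ("pyName_124", 0.48, 0.471498, 0.762064, 0.498886),
    ],
]


def score(action, skip=None):
    """Product of the factors of one action, optionally leaving one out."""
    factors = dict(zip(FACTORS, action[1:]))
    return prod(v for name, v in factors.items() if name != skip)


def wins(interactions, selected, skip=None):
    """Number of interactions in which the selected action ranks first."""
    return sum(
        max(actions, key=lambda a: score(a, skip))[0] == selected
        for actions in interactions
    )


def local_sensitivity(interactions, selected):
    """Wins with the full formula minus wins with one factor left out."""
    baseline = wins(interactions, selected)
    return {f: baseline - wins(interactions, selected, skip=f) for f in FACTORS}


print(local_sensitivity(interactions, "pyName_61"))
print(local_sensitivity(interactions, "pyName_476"))
```

For pyName_61 the Context Weight value is negative, meaning that this factor costs the action a win in this subset. For pyName_476 the same factor is positive.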
Why Are the Actions Winning?¶
Here we show the distribution of the various arbitration factors of the comparison group vs the other actions that make it to arbitration for the same interactions.
[15]:
fig, warning_message = decision_data.plot.prio_factor_boxplots(
    reference=pl.col("pyName") == selected_action,
    sample_size=10000,
)
if warning_message:
    print(warning_message)
else:
    fig.show()
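A boxplot comparison of this kind summarizes each factor separately for the reference action and for its competitors. As a simplified stand-in, the sketch below compares only the medians, using five actions of the sample interaction shown earlier. The helper name factor_medians is hypothetical, and with a single reference row the reference median is simply that row's value.

```python
from statistics import median

FACTORS = ("Value", "Context Weight", "Levers", "Propensity")

# (pyName, Value, Context Weight, Levers, Propensity) for one interaction.
actions = [
    ("pyName_476", 0.58, 1.168327, 1.280707, 0.487957),
    ("pyName_61", 0.58, 0.612144, 1.592721, 0.491959),
    ("pyName_359", 0.61, 0.560515, 0.96787, 0.551238),
    ("pyName_211", 0.38, 0.88016, 1.033746, 0.523599),
    ("pyName_124", 0.48, 0.471498, 0.762064, 0.498886),
]


def factor_medians(actions, reference):
    """Per factor: (median for the reference action, median for all others)."""
    ref = [a for a in actions if a[0] == reference]
    rest = [a for a in actions if a[0] != reference]
    return {
        factor: (median(a[i] for a in ref), median(a[i] for a in rest))
        for i, factor in enumerate(FACTORS, start=1)
    }


medians = factor_medians(actions, "pyName_61")
for factor, (ref_value, other_value) in medians.items():
    print(f"{factor}: reference={ref_value:.4f} others={other_value:.4f}")
```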
Rank Distribution of Comparison Group¶
Shows the distribution of the prioritization rank of the selected actions. If the rank is poor (a large rank number), the selected actions do not win often.
[16]:
decision_data.plot.rank_boxplot(
    reference=pl.col("pyName") == selected_action,
)
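The numbers behind a rank boxplot can be summarized in a few lines. The sketch below uses hypothetical pxRank values for one selected action across seven interactions. The helper name rank_summary is an assumption, and rank 1 is taken to be the winning position.

```python
from statistics import median

# Hypothetical pxRank values of one selected action across seven interactions.
ranks = [1, 2, 2, 5, 8, 1, 3]


def rank_summary(ranks, win_rank=1):
    """Best, median and worst rank, plus the share of interactions won."""
    return {
        "best": min(ranks),
        "median": median(ranks),
        "worst": max(ranks),
        "win_share": sum(r <= win_rank for r in ranks) / len(ranks),
    }


summary = rank_summary(ranks)
print(summary)
```

A median close to 1 indicates an action that usually wins, while a large median points to an action that is regularly outranked.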