Explainability Extract Analysis

Pega

2024-06-11

Welcome to the Explainability Extract Demo. This notebook guides you through the analysis of the Explainability Extract v1 dataset using the DecisionAnalyzer class of the pdstools library. At this point the dataset focuses specifically on the “Arbitration” stage, but we intend to widen the data scope in subsequent iterations. This dataset can be extracted from Infinity 24.1 and preceding versions.

We developed this notebook with a dual purpose. Firstly, we aim to familiarize you with the various functions and visualizations available in the DecisionAnalyzer class. You’ll learn how to aggregate and visualize data in ways that are meaningful and insightful for your specific use cases.

Secondly, we hope this notebook will inspire you. The analysis and visualizations demonstrated here are only the tip of the iceberg. We encourage you to think creatively and explore other ways this data can be utilized. Consider this notebook a springboard for your analysis journey.

Each data point represents a decision made in real-time, providing a snapshot of the arbitration process. By examining this data, you have the opportunity to delve into the intricacies of these decisions, gaining a deeper understanding of the decision-making process.

As you navigate through this notebook, remember that it is interactive. This means you can not only run each code cell to see the results but also tweak the code and experiment as you go along.

[2]:
from pdstools.decision_analyzer.data_read_utils import read_data
from pdstools.decision_analyzer.decision_data import DecisionAnalyzer
from pdstools import read_ds_export
import polars as pl

Read the data and create a DecisionAnalyzer instance

[3]:
df = read_ds_export(
    filename="sample_explainability_extract.parquet",
    path="https://raw.githubusercontent.com/pegasystems/pega-datascientist-tools/master/data",
)
decision_data = DecisionAnalyzer(df)
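If you want to run the same analysis on your own data, a similar construction should work. The sketch below is a minimal example in which the file name is a placeholder for your own Explainability Extract export:

import polars as pl
from pdstools.decision_analyzer.decision_data import DecisionAnalyzer

# Hypothetical file name; point this at your own Explainability Extract export.
my_df = pl.scan_parquet("my_explainability_extract.parquet")
my_decision_data = DecisionAnalyzer(my_df)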
[4]:
decision_data.decision_data.collect()
[4]:
shape: (7_297, 20)
Columns (dtype): pySubjectID (str), pxInteractionID (str), pxDecisionTime (datetime[μs]), pyIssue (str), pyGroup (str), pyName (str), pyChannel (str), pyDirection (str), Value (f64), Context Weight (f64), Levers (f64), pyModelPropensity (f64), pyPropensity (f64), Propensity (f64), Priority (f32), ModelControlGroup (str), day (date), pxRank (u32), StageGroup (cat), StageOrder (i32)
(sample rows omitted)

Overview

The get_overview_stats property of the DecisionAnalyzer class shows general statistics of the data.

[5]:
decision_data.get_overview_stats
ColumnNotFoundError: unable to find column "pxRecordType" (full traceback omitted)
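Similar overview figures can also be computed directly from the decision data with plain polars expressions. This is a minimal sketch, not a pdstools API:

# Sketch: basic overview figures computed directly with polars.
overview = decision_data.decision_data.select(
    pl.col("pxInteractionID").n_unique().alias("Interactions"),
    pl.col("pyName").n_unique().alias("Unique actions"),
    pl.col("pxDecisionTime").min().alias("First decision"),
    pl.col("pxDecisionTime").max().alias("Last decision"),
).collect()
print(overview)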

Let's take a look at a single decision. The height of the dataframe shows how many actions are available at the Arbitration stage for this customer interaction. The pxRank column shows each action's rank in the arbitration.

[6]:
selected_interaction_id = (
    decision_data.unfiltered_raw_decision_data.select("pxInteractionID")
    .first()
    .collect()
    .row(0)[0]
)
print(f"{selected_interaction_id=}")
decision_data.unfiltered_raw_decision_data.filter(
    pl.col("pxInteractionID") == selected_interaction_id
).sort("pxRank").collect()
selected_interaction_id='-8570391784720840265'
[6]:
shape: (87, 20)
Columns (dtype): pySubjectID (str), pxInteractionID (str), pxDecisionTime (datetime[μs]), pyIssue (str), pyGroup (str), pyName (str), pyChannel (str), pyDirection (str), Value (f64), Context Weight (f64), Levers (f64), pyModelPropensity (f64), pyPropensity (f64), Propensity (f64), Priority (f32), ModelControlGroup (str), day (date), pxRank (u32), StageGroup (cat), StageOrder (i32)
(sample rows omitted)

Action Distribution

Shows the overall distribution of actions at the Arbitration stage. This makes it easy to detect whether a group of actions rarely survives until Arbitration.

[7]:
stage = "Arbitration"
scope_options = ["pyIssue", "pyGroup", "pyName"]
distribution_data = decision_data.getDistributionData(stage, scope_options)
fig = decision_data.plot.distribution_as_treemap(
    df=distribution_data, stage=stage, scope_options=scope_options
)
fig.show()
ColumnNotFoundError: pxRecordType (full traceback omitted)
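A simple tabular version of this distribution can be computed directly from the raw data. A minimal sketch:

# Sketch: number of decisions per issue and group at the Arbitration stage.
(
    decision_data.unfiltered_raw_decision_data.filter(
        pl.col("StageGroup") == "Arbitration"
    )
    .group_by(["pyIssue", "pyGroup"])
    .agg(pl.len().alias("Decisions"))
    .sort("Decisions", descending=True)
    .collect()
)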

Global Sensitivity

The Global Sensitivity chart helps us understand how the four arbitration factors (propensity, value, levers, and context weights) together affect the decision-making process.
Sensitivity refers to the impact on our top actions if one of these factors is omitted. The percentages indicate the potential change in our final decisions due to the absence of each factor.
  • X-Axis (Decisions): Represents the number of decisions affected by the exclusion of each factor.
  • Y-Axis (Prioritization Factor): Lists the Arbitration formula components.

  • Bars: Each bar represents the percentage of decisions affected by the absence of the corresponding factor.

  • By identifying the most impactful factors, stakeholders can make strategic adjustments to enhance decision-making accuracy.

  • It highlights which factors need more attention or refinement. For instance, if “Levers” were to show a significant percentage, it would indicate a need for closer examination and potential improvement.

[8]:
decision_data.plot.sensitivity(win_rank=1)
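The cell above uses the built-in plot. To illustrate the underlying idea, the sketch below recomputes the winning action per interaction with one factor left out and counts how many interactions change their top action. This is not the pdstools implementation, and it assumes the priority is simply the product Propensity × Value × Context Weight × Levers:

# Sketch of the sensitivity idea (not the pdstools implementation).
# Assumption: priority is the product of the four arbitration factors.
factors = ["Propensity", "Value", "Context Weight", "Levers"]
raw = decision_data.unfiltered_raw_decision_data


def top_action(priority_expr: pl.Expr) -> pl.LazyFrame:
    # Winning action per interaction under the given priority expression.
    return raw.group_by("pxInteractionID").agg(
        pl.col("pyName").sort_by(priority_expr, descending=True).first().alias("winner")
    )


baseline = top_action(
    pl.col("Propensity") * pl.col("Value") * pl.col("Context Weight") * pl.col("Levers")
)
for omitted in factors:
    kept = [pl.col(f) for f in factors if f != omitted]
    alternative = top_action(kept[0] * kept[1] * kept[2])
    changed = (
        baseline.join(alternative, on="pxInteractionID", suffix="_alt")
        .filter(pl.col("winner") != pl.col("winner_alt"))
        .collect()
        .height
    )
    print(f"Omitting {omitted}: the top action changes in {changed} interactions")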

Wins and Losses in Arbitration

Displays the distribution of wins and losses for the different “Issues” in the arbitration stage. You can change the level to “Group” or “Action”. Based on the win_rank, actions are classified as either winning or losing.

X-Axis (Percentage): Represents the percentage of actions that are either wins or losses.
Y-Axis (Status): Differentiates between wins and losses.
Color Legend (Issue): Each color represents a different issue category, such as “Retention,” “Service,” “Growth,” etc.

How to Interpret the Visual:

  • Dominant Issues: The length of the bars helps identify which issues have the highest and lowest win and loss percentages. For example, if “Retention” has a longer bar in the Wins section, it indicates a higher percentage of winning actions for that issue.

  • Comparative Analysis: By comparing the bars, you can quickly see which issues are performing better in terms of winning in arbitration and which are underperforming.

  • Resource Allocation: By understanding which issues have higher loss percentages, resources can be reallocated to improve strategies in those areas.

  • Decision-Making: Provides a clear visual representation of how decisions are distributed across different issues, aiding in making data-driven decisions for future actions.

[9]:
decision_data.plot.global_winloss_distribution(level="pyIssue", win_rank=1)
ColumnNotFoundError: pxRecordType (full traceback omitted)
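The same win/loss split can be approximated directly from pxRank, counting an action as a win when its rank is within the chosen win_rank. A minimal sketch:

# Sketch: wins and losses per issue, using pxRank <= win_rank as the win criterion.
win_rank = 1
(
    decision_data.unfiltered_raw_decision_data.with_columns(
        (pl.col("pxRank") <= win_rank).alias("is_win")
    )
    .group_by("pyIssue")
    .agg(Wins=pl.col("is_win").sum(), Decisions=pl.len())
    .with_columns(Losses=pl.col("Decisions") - pl.col("Wins"))
    .sort("Wins", descending=True)
    .collect()
)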

Personalization Analysis

The Personalization Analysis chart helps us understand the availability of options to present to customers during the arbitration stage. If only a limited number of actions survive to this stage, it becomes challenging to personalize offers effectively. Some customers may have only one action available, limiting the ability of our machine learning algorithm to arbitrate effectively.

In the chart:

The x-axis represents the number of actions available per customer. The left y-axis shows the number of decisions. The right y-axis shows the propensity percentage.

The bars (Optionality) indicate the number of decisions where customers had a specific number of actions available. For instance, a high bar at “2” means many customers had exactly two actions available in arbitration. The line (Propensity) represents the average propensity of the top-ranking actions within each bin.

This analysis helps in understanding the distribution of available actions. We expect the average propensity to increase as the number of available actions increases. If many customers have few or no actions available, this should be investigated.

[10]:
decision_data.plot.propensity_vs_optionality(stage="Arbitration")
ColumnNotFoundError: unable to find column "pxRecordType" (full traceback omitted)
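The underlying optionality view can also be built directly: per interaction, count the available actions and take the best propensity, then summarize per optionality bin. A minimal sketch, not the pdstools implementation:

# Sketch: optionality (number of offers) vs. average best propensity.
(
    decision_data.unfiltered_raw_decision_data.group_by("pxInteractionID")
    .agg(nOffers=pl.len(), bestPropensity=pl.col("Propensity").max())
    .group_by("nOffers")
    .agg(
        Interactions=pl.len(),
        AverageBestPropensity=pl.col("bestPropensity").mean(),
    )
    .sort("nOffers")
    .collect()
)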

Win/Loss Analysis

Win Analysis

Let’s select an action to determine how often it wins and identify which actions it defeats in arbitration.

[11]:
win_rank = 1
selected_action = (
    decision_data.unfiltered_raw_decision_data.filter(pl.col("pxRank") == 1)
    .group_by("pyName")
    .len()
    .sort("len", descending=True)
    .collect()
    .get_column("pyName")
    .to_list()[1]
)
filter_statement = pl.col("pyName") == selected_action

interactions_where_comparison_group_wins = (
    decision_data.get_winning_or_losing_interactions(
        win_rank=win_rank,
        group_filter=filter_statement,
        win=True,
    )
)

print(
    f"selected action '{selected_action}' wins(Rank{win_rank}) in {interactions_where_comparison_group_wins.collect().height} interactions."
)
selected action 'pyName_61' wins(Rank1) in 30 interactions.
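Under the rank-based definition of a win, a similar count can be obtained directly from the raw data; the exact number may differ slightly from the library's definition. A minimal sketch:

# Sketch: interactions in which the selected action reaches rank <= win_rank.
n_wins = (
    decision_data.unfiltered_raw_decision_data.filter(
        filter_statement & (pl.col("pxRank") <= win_rank)
    )
    .select(pl.col("pxInteractionID").n_unique())
    .collect()
    .item()
)
print(f"'{selected_action}' is at rank <= {win_rank} in {n_wins} interactions")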

The graph below shows the competing actions that go into arbitration together with the selected action, and how many times they lose. It highlights which actions are surpassed by the chosen action.

[12]:
# Losing actions in interactions where the selected action wins.
groupby_cols = ["pyIssue", "pyGroup", "pyName"]
winning_from = decision_data.winning_from(
    interactions=interactions_where_comparison_group_wins,
    win_rank=win_rank,
    groupby_cols=groupby_cols,
    top_k=20,
)

decision_data.plot.distribution_as_treemap(
    df=winning_from, stage="Arbitration", scope_options=groupby_cols
)

Loss Analysis

Let’s analyze which actions come out on top when the selected action fails.

[13]:
interactions_where_comparison_group_loses = (
    decision_data.get_winning_or_losing_interactions(
        win_rank=win_rank,
        group_filter=filter_statement,
        win=False,
    )
)

print(
    f"selected action '{selected_action}' loses in {interactions_where_comparison_group_loses.collect().height} interactions."
)
# Winning actions in interactions where the selected action loses.
losing_to = decision_data.winning_from(
    interactions=interactions_where_comparison_group_loses,
    win_rank=win_rank,
    groupby_cols=groupby_cols,
    top_k=20,
)

decision_data.plot.distribution_as_treemap(
    df=losing_to, stage="Arbitration", scope_options=groupby_cols
)
selected action 'pyName_61' loses in 58 interactions.

What are the Prioritization Factors that make these actions win or lose?

The analysis below shows the change in the number of times an action wins when each factor is individually removed from the prioritization calculation. Unlike the Global Sensitivity Analysis above, this chart can show negative numbers. A negative value means that the selected action would win more often if that component were removed from the arbitration process. Therefore, a component with a negative value is contributing to the action’s loss.

[14]:
decision_data.plot.sensitivity(
    limit_xaxis_range=False, reference_group=pl.col("pyName") == selected_action
)

Why are the actions winning?

Here we show the distribution of the various arbitration factors for the comparison group versus the other actions that make it to arbitration in the same interactions.

[15]:
fig, warning_message = decision_data.plot.prio_factor_boxplots(
    reference=pl.col("pyName") == selected_action,
    sample_size=10000,
)
if warning_message:
    print(warning_message)
else:
    fig.show()
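For a quick numeric companion to the boxplots, the median of each arbitration factor can be compared between the selected action and all other actions in the sample. A minimal sketch:

# Sketch: median arbitration factors, selected action vs. all other actions.
(
    decision_data.unfiltered_raw_decision_data.with_columns(
        group=pl.when(pl.col("pyName") == selected_action)
        .then(pl.lit("selected"))
        .otherwise(pl.lit("other"))
    )
    .group_by("group")
    .agg(pl.col(["Propensity", "Value", "Context Weight", "Levers"]).median())
    .collect()
)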

Rank Distribution of Comparison Group

This shows the distribution of the prioritization rank of the selected actions. If the selected actions rank poorly (far from rank 1), they are not (often) winning.

[16]:
decision_data.plot.rank_boxplot(
    reference=pl.col("pyName") == selected_action,
)
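The same rank distribution can also be summarized numerically. A minimal sketch:

# Sketch: summary of the selected action's pxRank distribution.
(
    decision_data.unfiltered_raw_decision_data.filter(
        pl.col("pyName") == selected_action
    )
    .select(
        best_rank=pl.col("pxRank").min(),
        median_rank=pl.col("pxRank").median(),
        worst_rank=pl.col("pxRank").max(),
    )
    .collect()
)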