pdstools.ih

Classes

IH

Package Contents

class IH(data: polars.LazyFrame)
Parameters:

data (polars.LazyFrame)

data: polars.LazyFrame
positive_outcome_labels: Dict[str, List[str]]
aggregates
plot
negative_outcome_labels
classmethod from_ds_export(ih_filename: os.PathLike | str, query: pdstools.utils.types.QUERY | None = None)

Create an IH instance from a Pega Dataset Export file.

Parameters:
  • ih_filename (Union[os.PathLike, str]) – The full path to the dataset files

  • query (Optional[QUERY], optional) – An optional query used to filter the data, by default None

Returns:

The properly initialized IH object

Return type:

IH

classmethod from_s3()

Not implemented yet. Please let us know if you would like this functionality!

classmethod from_mock_data(days=90, n=100000)

Initialize an IH instance with sample data

Parameters:
  • days – Number of days, defaults to 90

  • n – Number of interaction data records, defaults to 100k

Returns:

The properly initialized IH object

Return type:

IH

get_sequences(positive_outcome_label: str, level: str, outcome_column: str, customerid_column: str) → tuple[list[tuple[str, Ellipsis]], list[tuple[int, Ellipsis]], list[collections.defaultdict[tuple[str], int]], list[collections.defaultdict[tuple[str, Ellipsis], int]]]

Generates customer sequences, outcome labels, and the counts needed for PMI (Pointwise Mutual Information) calculations.

This function processes customer interaction data to produce:

  1. Action sequences per customer.

  2. Corresponding binary outcome sequences (1 for positive outcome, 0 otherwise).

  3. Counts of bigrams and ≥3-grams that end with a positive outcome.

  4. Counts of all possible bigrams within that corpus.

Parameters:
  • positive_outcome_label (str) – The outcome label that marks the final event in a sequence.

  • level (str) – Column name that contains the action (offer / treatment).

  • outcome_column (str) – Column name that contains the outcome label.

  • customerid_column (str) – Column name that identifies a unique customer / subject.

Returns:

  • customer_sequences (list[tuple[str, …]]) – Sequences of actions per customer.

  • customer_outcomes (list[tuple[int, …]]) – Binary outcomes (0 or 1) for each customer action sequence.

  • count_actions (list[defaultdict[tuple[str], int]]) – Action frequency counts.

    • Index 0 = count of first element in all bigrams

    • Index 1 = count of second element in all bigrams

  • count_sequences (list[defaultdict[tuple[str, …], int]]) – Sequence frequency counts.

    • Index 0 = bigrams (all)

    • Index 1 = ≥3-grams that end with positive outcome

    • Index 2 = bigrams that end with positive outcome

    • Index 3 = unique ngrams per customer

Return type:

tuple[list[tuple[str, Ellipsis]], list[tuple[int, Ellipsis]], list[collections.defaultdict[tuple[str], int]], list[collections.defaultdict[tuple[str, Ellipsis], int]]]
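To make the shape of these return values concrete, here is a minimal standard-library sketch that derives comparable structures from a handful of toy rows. It is an illustration only, not the pdstools implementation: the rows, the labels "Impression" and "Conversion", and the name positive_bigrams are all hypothetical, and the ≥3-gram and per-customer unique counts are left out for brevity.

```python
from collections import defaultdict

# Hypothetical toy rows: (customer id, action, outcome label), already in time order.
rows = [
    ("c1", "A", "Impression"),
    ("c1", "B", "Impression"),
    ("c1", "C", "Conversion"),
    ("c2", "A", "Impression"),
    ("c2", "B", "Conversion"),
]
positive_outcome_label = "Conversion"  # assumed label for this example

# Group actions and binary outcomes per customer.
actions, outcomes = defaultdict(list), defaultdict(list)
for customer, action, outcome in rows:
    actions[customer].append(action)
    outcomes[customer].append(1 if outcome == positive_outcome_label else 0)

customer_sequences = [tuple(seq) for seq in actions.values()]
customer_outcomes = [tuple(out) for out in outcomes.values()]

# Count every adjacent pair (bigram) and the actions seen in first/second position.
count_actions = [defaultdict(int), defaultdict(int)]
all_bigrams = defaultdict(int)
positive_bigrams = defaultdict(int)
for seq, out in zip(customer_sequences, customer_outcomes):
    for i in range(len(seq) - 1):
        bigram = seq[i : i + 2]
        all_bigrams[bigram] += 1
        count_actions[0][(bigram[0],)] += 1
        count_actions[1][(bigram[1],)] += 1
        if out[i + 1] == 1:  # the bigram ends on a positive outcome
            positive_bigrams[bigram] += 1
```

In this sketch all_bigrams plays the role of index 0 of count_sequences and positive_bigrams the role of index 2; the keys of count_actions are 1-tuples, matching the tuple[str] key type in the signature.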

static calculate_pmi(count_actions: list[collections.defaultdict[tuple[str], int]], count_sequences: list[collections.defaultdict[tuple[str, Ellipsis], int]]) → tuple[dict[tuple[str, str], float], dict[tuple[str, Ellipsis], float]]

Computes PMI scores for n-grams (n ≥ 2) in customer action sequences. Returns an unsorted dictionary mapping sequences to their PMI values, providing insights into significant action associations.

Bigram values are their PMI scores. Values for longer n-grams are computed by averaging the PMI of their constituent bigrams. Higher values indicate more informative or surprising paths.

Parameters:
  • count_actions (list[defaultdict[tuple[str], int]]) – Action frequency counts.

    • Index 0 = count of first element in all bigrams

    • Index 1 = count of second element in all bigrams

  • count_sequences (list[defaultdict[tuple[str, …], int]]) – Sequence frequency counts.

    • Index 0 = bigrams (all)

    • Index 1 = ≥3-grams that end with positive outcome

    • Index 2 = bigrams that end with positive outcome

    • Index 3 = unique ngrams per customer

Returns:

ngrams_pmi – Dictionary containing PMI information for bigrams and n-grams. For bigrams, the value is a float representing the PMI value. For higher-order n-grams, the value is a dictionary with:

  • ‘average_pmi’: The average PMI value.

  • ‘links’: A dictionary mapping each constituent bigram to its PMI value.

Return type:

dict[tuple[str, …], float | dict[str, float | dict[tuple[str, str], float]]]
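As an illustration of the calculation described above, the following sketch computes bigram PMI from raw counts and averages it over the constituent bigrams of a longer n-gram. It uses the textbook definition PMI(a, b) = log2(P(a, b) / (P(a) · P(b))); the log base and the way probabilities are estimated from the counts are assumptions here and may differ from what pdstools does. The helper names bigram_pmi and ngram_pmi and the example counts are hypothetical.

```python
import math


def bigram_pmi(bigram_counts, first_counts, second_counts):
    """PMI per bigram, estimating probabilities as count / total bigrams (assumed)."""
    total = sum(bigram_counts.values())
    pmi = {}
    for (a, b), n_ab in bigram_counts.items():
        p_ab = n_ab / total
        p_a = first_counts[(a,)] / total    # a seen as first element of a bigram
        p_b = second_counts[(b,)] / total   # b seen as second element of a bigram
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


def ngram_pmi(ngram, pmi):
    """Average the PMI of the adjacent bigrams that make up a longer n-gram."""
    links = {ngram[i : i + 2]: pmi[ngram[i : i + 2]] for i in range(len(ngram) - 1)}
    return {"average_pmi": sum(links.values()) / len(links), "links": links}


# Hypothetical counts, shaped like count_sequences[0] and count_actions[0] / [1].
bigram_counts = {("A", "B"): 2, ("B", "C"): 1, ("A", "C"): 1}
first_counts = {("A",): 3, ("B",): 1}
second_counts = {("B",): 2, ("C",): 2}

pmi = bigram_pmi(bigram_counts, first_counts, second_counts)
trigram = ngram_pmi(("A", "B", "C"), pmi)
```

A positive value such as the one for ("B", "C") means the pair occurs more often than its parts would suggest if they were independent; a negative value such as the one for ("A", "C") means it occurs less often.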

static pmi_overview(ngrams_pmi: Dict[str, Dict[str, Dict[str, float] | float]], count_sequences: list[collections.defaultdict[tuple[str, Ellipsis], int]], customer_sequences: list[tuple[str, Ellipsis]], customer_outcomes: list[tuple[int, Ellipsis]]) → polars.DataFrame

Analyzes customer sequences to identify patterns linked to positive outcomes. Returns a sorted Polars DataFrame of significant n-grams.

Parameters:
  • ngrams_pmi (dict[tuple[str, ...], float | dict[str, float | dict[tuple[str, str], float]]]) –

    Dictionary containing PMI information for bigrams and n-grams. For bigrams, the value is a float representing the PMI value. For higher-order n-grams, the value is a dictionary with:

    • ‘average_pmi’: The average PMI value.

    • ‘links’: A dictionary mapping each constituent bigram to its PMI value.

  • count_sequences (list[defaultdict[tuple[str, ...], int]]) – Sequence frequency counts. Index 1 = ≥3-grams ending in positive outcome. Index 2 = bigrams ending in positive outcome.

  • customer_sequences (list[tuple[str, ...]]) – Sequences of actions per customer.

  • customer_outcomes (list[tuple[int, ...]]) – Binary outcomes (0 or 1) for each customer action sequence.

Returns:

DataFrame containing:

  • ‘Sequence’: the action sequence

  • ‘Length’: number of actions

  • ‘Avg PMI’: average PMI value

  • ‘Frequency’: number of times the sequence appears

  • ‘Unique freq’: number of unique customers who had this sequence ending in a positive outcome

  • ‘Score’: Avg PMI × log(Frequency), sorted descending

Return type:

polars.DataFrame
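The ‘Score’ column can be illustrated with a small standard-library sketch that applies Avg PMI × log(Frequency) and sorts in descending order. The base of the logarithm is not stated above, so the natural log is assumed here; the rows and the helper name score are hypothetical, and plain dictionaries stand in for the Polars DataFrame.

```python
import math

# Hypothetical rows: (sequence, average PMI, frequency).
rows = [
    (("A", "B"), 0.4, 20),
    (("B", "C"), 1.0, 5),
    (("A", "B", "C"), 0.7, 1),
]


def score(avg_pmi, frequency):
    """Avg PMI x log(Frequency); natural log is an assumption of this sketch."""
    return avg_pmi * math.log(frequency)


# Build one record per sequence and rank by Score, highest first.
ranked = sorted(
    (
        {
            "Sequence": seq,
            "Length": len(seq),
            "Avg PMI": avg_pmi,
            "Frequency": freq,
            "Score": score(avg_pmi, freq),
        }
        for seq, avg_pmi, freq in rows
    ),
    key=lambda record: record["Score"],
    reverse=True,
)
```

Weighting by log(Frequency) favors sequences that are both surprising and reasonably common. A sequence seen only once scores 0 regardless of its PMI, because log(1) = 0.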