pdstools.ih.IH

Classes

IH

Module Contents

class IH(data: polars.LazyFrame)

Parameters:

data (polars.LazyFrame)

data: polars.LazyFrame

positive_outcome_labels: Dict[str, List[str]]

aggregates

plot

negative_outcome_labels

classmethod from_ds_export(ih_filename: os.PathLike | str, query: pdstools.utils.types.QUERY | None = None)

Create an IH instance from a Pega Dataset Export file.

Parameters:

ih_filename (Union[os.PathLike, str]) – The full path to the dataset files
query (Optional[QUERY], optional) – An optional argument to filter out selected data, by default None

Returns:

The properly initialized IH object

Return type:

IH

classmethod from_s3()

Not implemented yet. Please let us know if you would like this functionality!

classmethod from_mock_data(days=90, n=100000)

Initialize an IH instance with sample data.

Parameters:

days – Number of days, defaults to 90
n – Number of interaction data records, defaults to 100,000

Returns:

The properly initialized IH object

Return type:

IH

get_sequences(positive_outcome_label: str, level: str, outcome_column: str, customerid_column: str) → tuple[list[tuple[str, Ellipsis]], list[tuple[int, Ellipsis]], list[collections.defaultdict[tuple[str], int]], list[collections.defaultdict[tuple[str, Ellipsis], int]]]

Generates customer sequences, outcome labels, and the counts needed for PMI (Pointwise Mutual Information) calculations.

This function processes customer interaction data to produce:

1. Action sequences per customer.
2. Corresponding binary outcome sequences (1 for positive outcome, 0 otherwise).
3. Counts of bigrams and ≥3-grams that end with a positive outcome.
4. Counts of all possible bigrams within that corpus.

Parameters:

positive_outcome_label (str) – The outcome label that marks the final event in a sequence.
level (str) – Column name that contains the action (offer / treatment).
outcome_column (str) – Column name that contains the outcome label.
customerid_column (str) – Column name that identifies a unique customer / subject.

Returns:

customer_sequences (list[tuple[str, …]]) – Sequences of actions per customer.
customer_outcomes (list[tuple[int, …]]) – Binary outcomes (0 or 1) for each customer action sequence.
count_actions (list[defaultdict[tuple[str], int]]) – Action frequency counts.
- Index 0 = count of first element in all bigrams
- Index 1 = count of second element in all bigrams
count_sequences (list[defaultdict[tuple[str, …], int]]) – Sequence frequency counts.
- Index 0 = bigrams (all)
- Index 1 = ≥3-grams that end with positive outcome
- Index 2 = bigrams that end with positive outcome
- Index 3 = unique ngrams per customer

Return type:

tuple[list[tuple[str, Ellipsis]], list[tuple[int, Ellipsis]], list[collections.defaultdict[tuple[str], int]], list[collections.defaultdict[tuple[str, Ellipsis], int]]]
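
To make the shapes of these return values concrete, here is a minimal standard-library sketch that builds per-customer sequences, binary outcomes and bigram counts from a handful of toy records. It is not the pdstools implementation: the `rows` data, the action names and the helper variables are all hypothetical, and only the bigram counts are illustrated (the ≥3-gram and unique-per-customer counts are left out).

```python
from collections import defaultdict

# Hypothetical toy records: (customer id, action, outcome), already in time order.
rows = [
    ("c1", "OfferA", "Impression"),
    ("c1", "OfferB", "Impression"),
    ("c1", "OfferC", "Conversion"),
    ("c2", "OfferA", "Impression"),
    ("c2", "OfferC", "Conversion"),
]
positive_outcome_label = "Conversion"

# Group actions and binary outcome flags per customer.
actions = defaultdict(list)
outcomes = defaultdict(list)
for customer, action, outcome in rows:
    actions[customer].append(action)
    outcomes[customer].append(1 if outcome == positive_outcome_label else 0)

customer_sequences = [tuple(seq) for seq in actions.values()]
customer_outcomes = [tuple(flags) for flags in outcomes.values()]

# Count every adjacent pair (bigram), and how often each action
# occurs as the first or the second element of a bigram.
bigrams_all = defaultdict(int)
first_counts = defaultdict(int)
second_counts = defaultdict(int)
for seq in customer_sequences:
    for a, b in zip(seq, seq[1:]):
        bigrams_all[(a, b)] += 1
        first_counts[(a,)] += 1
        second_counts[(b,)] += 1

# Count bigrams whose second action received the positive outcome.
bigrams_positive = defaultdict(int)
for seq, flags in zip(customer_sequences, customer_outcomes):
    for i in range(1, len(seq)):
        if flags[i] == 1:
            bigrams_positive[(seq[i - 1], seq[i])] += 1

# Same nesting as documented above: two action-count dicts keyed by 1-tuples.
count_actions = [first_counts, second_counts]
```

Keying the action counts by 1-tuples mirrors the documented `defaultdict[tuple[str], int]` type, so the toy structures can be compared directly with what `get_sequences` returns.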

static calculate_pmi(count_actions: list[collections.defaultdict[tuple[str], int]], count_sequences: list[collections.defaultdict[tuple[str, Ellipsis], int]]) → tuple[dict[tuple[str, str], float], dict[tuple[str, Ellipsis], float]]

Computes PMI scores for n-grams (n ≥ 2) in customer action sequences. Returns an unsorted dictionary mapping sequences to their PMI values, which indicate how strongly actions are associated.

Bigram values are computed directly as PMI. Values for longer n-grams are the average PMI of their constituent bigrams. Higher values indicate more informative or surprising paths.

Parameters:

count_actions (list[defaultdict[tuple[str], int]]) – Action frequency counts.
- Index 0 = count of first element in all bigrams
- Index 1 = count of second element in all bigrams
count_sequences (list[defaultdict[tuple[str, …], int]]) – Sequence frequency counts.
- Index 0 = bigrams (all)
- Index 1 = ≥3-grams that end with positive outcome
- Index 2 = bigrams that end with positive outcome
- Index 3 = unique ngrams per customer

Returns:

ngrams_pmi – Dictionary containing PMI information for bigrams and n-grams. For bigrams, the value is a float representing the PMI value. For higher-order n-grams, the value is a dictionary with:

- ’average_pmi’: The average PMI value.
- ’links’: A dictionary mapping each constituent bigram to its PMI value.

Return type:

dict[tuple[str, …], float | dict[str, float | dict[tuple[str, str], float]]]
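
The two rules described above (PMI for bigrams, averaged bigram PMI for longer n-grams) can be sketched with the standard library. This uses the textbook definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))); the base-2 logarithm, the normalisation by the total number of bigrams, and the function names `bigram_pmi` and `ngram_average_pmi` are assumptions made for illustration and may differ from what pdstools does internally.

```python
import math

def bigram_pmi(bigram_counts, first_counts, second_counts):
    """Textbook PMI per bigram; probabilities are relative to the total bigram count."""
    total = sum(bigram_counts.values())
    pmi = {}
    for (a, b), n_ab in bigram_counts.items():
        p_ab = n_ab / total
        p_a = first_counts[(a,)] / total   # a as first element of a bigram
        p_b = second_counts[(b,)] / total  # b as second element of a bigram
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi

def ngram_average_pmi(ngram, pmi):
    """Average the PMI of the adjacent pairs that make up a longer n-gram."""
    pairs = list(zip(ngram, ngram[1:]))
    average = sum(pmi[pair] for pair in pairs) / len(pairs)
    return {"average_pmi": average, "links": {pair: pmi[pair] for pair in pairs}}

# Hypothetical counts, shaped like the documented inputs.
bigram_counts = {("OfferA", "OfferB"): 2, ("OfferB", "OfferC"): 1, ("OfferA", "OfferC"): 1}
first_counts = {("OfferA",): 3, ("OfferB",): 1}
second_counts = {("OfferB",): 2, ("OfferC",): 2}

pmi = bigram_pmi(bigram_counts, first_counts, second_counts)
abc = ngram_average_pmi(("OfferA", "OfferB", "OfferC"), pmi)
```

With these toy counts, the pair OfferB → OfferC gets a PMI of 1 (it occurs twice as often as independence would predict), while OfferA → OfferC comes out negative because it occurs less often than expected.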

static pmi_overview(ngrams_pmi: Dict[str, Dict[str, Dict[str, float] | float]], count_sequences: list[collections.defaultdict[tuple[str, Ellipsis], int]], customer_sequences: list[tuple[str, Ellipsis]], customer_outcomes: list[tuple[int, Ellipsis]]) → polars.DataFrame

Analyzes customer sequences to identify patterns linked to positive outcomes. Returns a sorted Polars DataFrame of significant n-grams.

Parameters:

ngrams_pmi (dict[tuple[str, ...], float | dict[str, float | dict[tuple[str, str], float]]]) – Dictionary containing PMI information for bigrams and n-grams. For bigrams, the value is a float representing the PMI value. For higher-order n-grams, the value is a dictionary with:
- ’average_pmi’: The average PMI value.
- ’links’: A dictionary mapping each constituent bigram to its PMI value.
count_sequences (list[defaultdict[tuple[str, ...], int]]) – Sequence frequency counts.
- Index 1 = ≥3-grams ending in positive outcome
- Index 2 = bigrams ending in positive outcome
customer_sequences (list[tuple[str, ...]]) – Sequences of actions per customer.
customer_outcomes (list[tuple[int, ...]]) – Binary outcomes (0 or 1) for each customer action sequence.

Returns:

DataFrame containing:

- ‘Sequence’: the action sequence
- ‘Length’: number of actions
- ‘Avg PMI’: average PMI value
- ‘Frequency’: number of times the sequence appears
- ‘Unique freq’: number of unique customers who had this sequence ending in a positive outcome
- ‘Score’: Avg PMI x log(Frequency)

The DataFrame is sorted by ‘Score’ in descending order.

Return type:

pl.DataFrame
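
As a rough illustration of the ‘Score’ column, the sketch below ranks a few candidate sequences by Avg PMI x log(Frequency) using plain Python dictionaries instead of a Polars DataFrame. The candidate values are made up, the natural logarithm is an assumption (the description above does not state the base), and `score` is a hypothetical helper, not part of pdstools.

```python
import math

# Hypothetical candidates: (sequence, average PMI, frequency).
candidates = [
    (("OfferA", "OfferC"), -0.58, 40),
    (("OfferB", "OfferC"), 1.00, 5),
    (("OfferA", "OfferB", "OfferC"), 0.70, 12),
]

def score(avg_pmi, frequency):
    # Assumption: natural logarithm; the documented formula only says "log".
    return avg_pmi * math.log(frequency)

# Build one record per sequence and sort by score, highest first.
overview = sorted(
    (
        {
            "Sequence": seq,
            "Length": len(seq),
            "Avg PMI": avg_pmi,
            "Frequency": freq,
            "Score": score(avg_pmi, freq),
        }
        for seq, avg_pmi, freq in candidates
    ),
    key=lambda row: row["Score"],
    reverse=True,
)

for row in overview:
    print(row["Sequence"], round(row["Score"], 3))
```

Weighting by the logarithm of the frequency lets common sequences rank higher without letting raw counts dominate: the frequent but negatively associated pair ends up last, while the three-step sequence overtakes the rarer bigram despite its lower average PMI.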