pdstools.ih
Submodules
Classes
Package Contents
- class IH(data: polars.LazyFrame)
- Parameters:
data (polars.LazyFrame)
- data: polars.LazyFrame
- aggregates
- plot
- negative_outcome_labels
- classmethod from_ds_export(ih_filename: os.PathLike | str, query: pdstools.utils.types.QUERY | None = None)
Create an IH instance from a Pega Dataset Export file.
- Parameters:
ih_filename (Union[os.PathLike, str]) – The full path to the dataset files
query (Optional[QUERY], optional) – An optional argument to filter out selected data, by default None
- Returns:
The properly initialized IH object
- Return type:
IH
- classmethod from_s3()
Not implemented yet. Please let us know if you would like this functionality!
- classmethod from_mock_data(days=90, n=100000)
Initialize an IH instance with sample data.
- Parameters:
days – Number of days, by default 90
n – Number of interaction data records, by default 100,000
- Returns:
The properly initialized IH object
- Return type:
IH
- get_sequences(positive_outcome_label: str, level: str, outcome_column: str, customerid_column: str) → tuple[list[tuple[str, Ellipsis]], list[tuple[int, Ellipsis]], list[collections.defaultdict[tuple[str], int]], list[collections.defaultdict[tuple[str, Ellipsis], int]]]
Generates the customer sequences, outcome labels, and counts needed for PMI (Pointwise Mutual Information) calculations.
This function processes customer interaction data to produce:
1. Action sequences per customer.
2. Corresponding binary outcome sequences (1 for positive outcome, 0 otherwise).
3. Counts of bigrams and ≥3-grams that end with a positive outcome.
4. Counts of all possible bigrams within that corpus.
- Parameters:
positive_outcome_label (str) – The outcome label that marks the final event in a sequence.
level (str) – Column name that contains the action (offer / treatment).
outcome_column (str) – Column name that contains the outcome label.
customerid_column (str) – Column name that identifies a unique customer / subject.
- Returns:
customer_sequences (list[tuple[str, …]]) – Sequences of actions per customer.
customer_outcomes (list[tuple[int, …]]) – Binary outcomes (0 or 1) for each customer action sequence.
count_actions (list[defaultdict[tuple[str], int]]) – Action frequency counts.
Index 0 = count of first element in all bigrams
Index 1 = count of second element in all bigrams
count_sequences (list[defaultdict[tuple[str, …], int]]) – Sequence frequency counts.
Index 0 = bigrams (all)
Index 1 = ≥3-grams that end with positive outcome
Index 2 = bigrams that end with positive outcome
Index 3 = unique ngrams per customer
- Return type:
tuple[list[tuple[str, Ellipsis]], list[tuple[int, Ellipsis]], list[collections.defaultdict[tuple[str], int]], list[collections.defaultdict[tuple[str, Ellipsis], int]]]
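The shape of these return values can be illustrated with a minimal, standard-library sketch. This is not the pdstools implementation: the helper `build_sequences`, the column names, and the sample rows are all hypothetical, the rows are assumed to be already ordered in time per customer, and only the all-bigram counts (index 0 of `count_sequences`) are reproduced.

```python
from collections import defaultdict


def build_sequences(rows, positive_outcome_label, level, outcome_column, customerid_column):
    """Hypothetical helper: group time-ordered rows per customer and count bigrams."""
    actions_by_customer = defaultdict(list)
    outcomes_by_customer = defaultdict(list)
    for row in rows:
        cid = row[customerid_column]
        actions_by_customer[cid].append(row[level])
        # 1 when the row carries the positive outcome label, 0 otherwise
        outcomes_by_customer[cid].append(int(row[outcome_column] == positive_outcome_label))

    customer_sequences = [tuple(a) for a in actions_by_customer.values()]
    customer_outcomes = [tuple(o) for o in outcomes_by_customer.values()]

    first_counts = defaultdict(int)   # how often an action opens a bigram
    second_counts = defaultdict(int)  # how often an action closes a bigram
    bigram_counts = defaultdict(int)  # all adjacent action pairs
    for seq in customer_sequences:
        for first, second in zip(seq, seq[1:]):
            bigram_counts[(first, second)] += 1
            first_counts[(first,)] += 1
            second_counts[(second,)] += 1

    return customer_sequences, customer_outcomes, [first_counts, second_counts], [bigram_counts]


# Made-up interaction rows, for illustration only
rows = [
    {"CustomerID": "c1", "Action": "Email", "Outcome": "Impression"},
    {"CustomerID": "c1", "Action": "Banner", "Outcome": "Clicked"},
    {"CustomerID": "c2", "Action": "Email", "Outcome": "Impression"},
    {"CustomerID": "c2", "Action": "Banner", "Outcome": "Impression"},
    {"CustomerID": "c2", "Action": "Call", "Outcome": "Clicked"},
]
sequences, outcomes, count_actions, count_sequences = build_sequences(
    rows, "Clicked", "Action", "Outcome", "CustomerID"
)
```

Keying the single-action counts by one-element tuples mirrors the documented `defaultdict[tuple[str], int]` type, so the sketch's outputs have the same shape as the first entries described above.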
- static calculate_pmi(count_actions: list[collections.defaultdict[tuple[str], int]], count_sequences: list[collections.defaultdict[tuple[str, Ellipsis], int]]) → tuple[dict[tuple[str, str], float], dict[tuple[str, Ellipsis], float]]
Computes PMI scores for n-grams (n ≥ 2) in customer action sequences. Returns an unsorted dictionary mapping sequences to their PMI values, providing insights into significant action associations.
Bigram values are their PMI. Values for longer n-grams are computed by averaging the PMI of their constituent bigrams. Higher values indicate more informative or surprising paths.
- Parameters:
count_actions (list[defaultdict[tuple[str], int]]) – Action frequency counts.
Index 0 = count of first element in all bigrams
Index 1 = count of second element in all bigrams
count_sequences (list[defaultdict[tuple[str, …], int]]) – Sequence frequency counts.
Index 0 = bigrams (all)
Index 1 = ≥3-grams that end with positive outcome
Index 2 = bigrams that end with positive outcome
Index 3 = unique ngrams per customer
- Returns:
ngrams_pmi – Dictionary containing PMI information for bigrams and n-grams. For bigrams, the value is a float representing the PMI value. For higher-order n-grams, the value is a dictionary with:
‘average_pmi’: The average PMI value.
‘links’: A dictionary mapping each constituent bigram to its PMI value.
- Return type:
dict[tuple[str, …], float | dict[str, float | dict[tuple[str, str], float]]]
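As a rough illustration of the calculation described above, the sketch below computes PMI for bigrams from their counts and then averages the constituent bigram PMIs for a longer n-gram. It is a sketch under stated assumptions rather than the library's code: the helper names and counts are made up, probabilities are estimated relative to the total number of bigrams, and the logarithm base (2) is an assumption.

```python
import math


def bigram_pmi(bigram_counts, first_counts, second_counts):
    """Hypothetical helper: PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) )."""
    total = sum(bigram_counts.values())  # corpus size measured in bigrams (assumption)
    pmi = {}
    for (first, second), count in bigram_counts.items():
        p_pair = count / total
        p_first = first_counts[(first,)] / total
        p_second = second_counts[(second,)] / total
        pmi[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return pmi


def ngram_average_pmi(ngram, pmi):
    """Hypothetical helper: average the PMI of the adjacent pairs in an n-gram."""
    pairs = list(zip(ngram, ngram[1:]))
    average = sum(pmi[pair] for pair in pairs) / len(pairs)
    return {"average_pmi": average, "links": {pair: pmi[pair] for pair in pairs}}


# Made-up counts, for illustration only
bigram_counts = {("A", "B"): 2, ("B", "C"): 1, ("A", "C"): 1}
first_counts = {("A",): 3, ("B",): 1}
second_counts = {("B",): 2, ("C",): 2}

pmi = bigram_pmi(bigram_counts, first_counts, second_counts)
trigram = ngram_average_pmi(("A", "B", "C"), pmi)
```

In this toy corpus `("B", "C")` scores highest because `B` is rarely the first element of a bigram, so seeing `C` follow it is more surprising than the individual frequencies would suggest.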
- static pmi_overview(ngrams_pmi: Dict[str, Dict[str, Dict[str, float] | float]], count_sequences: list[collections.defaultdict[tuple[str, Ellipsis], int]], customer_sequences: list[tuple[str, Ellipsis]], customer_outcomes: list[tuple[int, Ellipsis]]) → polars.DataFrame
Analyzes customer sequences to identify patterns linked to positive outcomes. Returns a sorted Polars DataFrame of significant n-grams.
- Parameters:
ngrams_pmi (dict[tuple[str, ...], float | dict[str, float | dict[tuple[str, str], float]]]) –
Dictionary containing PMI information for bigrams and n-grams. For bigrams, the value is a float representing the PMI value. For higher-order n-grams, the value is a dictionary with:
‘average_pmi’: The average PMI value.
‘links’: A dictionary mapping each constituent bigram to its PMI value.
count_sequences (list[defaultdict[tuple[str, ...], int]]) – Sequence frequency counts.
Index 1 = ≥3-grams ending in positive outcome.
Index 2 = bigrams ending in positive outcome.
customer_sequences (list[tuple[str, ...]]) – Sequences of actions per customer.
customer_outcomes (list[tuple[int, ...]]) – Binary outcomes (0 or 1) for each customer action sequence.
- Returns:
DataFrame containing:
- ‘Sequence’: the action sequence
- ‘Length’: number of actions
- ‘Avg PMI’: average PMI value
- ‘Frequency’: number of times the sequence appears
- ‘Unique freq’: number of unique customers who had this sequence ending in a positive outcome
- ‘Score’: Avg PMI × log(Frequency); rows are sorted by this column in descending order
- Return type:
polars.DataFrame
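The ‘Score’ column can be sketched without Polars as follows. This is only an illustration of the Avg PMI × log(Frequency) ranking: the helper `score_ngrams` and the input values are hypothetical, the natural logarithm is an assumption, and the ‘Unique freq’ column is omitted.

```python
import math


def score_ngrams(avg_pmi, frequency):
    """Hypothetical helper: build one record per n-gram and rank by Score."""
    rows = [
        {
            "Sequence": seq,
            "Length": len(seq),
            "Avg PMI": value,
            "Frequency": frequency[seq],
            # Natural log is an assumption; the documentation only says log(Frequency)
            "Score": value * math.log(frequency[seq]),
        }
        for seq, value in avg_pmi.items()
    ]
    return sorted(rows, key=lambda row: row["Score"], reverse=True)


# Made-up PMI values and frequencies, for illustration only
avg_pmi = {("A", "B"): 0.4, ("B", "C"): 1.0, ("A", "B", "C"): 0.7}
frequency = {("A", "B"): 20, ("B", "C"): 1, ("A", "B", "C"): 5}

ranked = score_ngrams(avg_pmi, frequency)
```

Weighting by the logarithm of the frequency means a sequence seen only once scores zero however high its PMI, which keeps rare but surprising paths from dominating the ranking.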