pdstools.utils.cdh_utils._polars

Polars expression and frame helpers (queries, sampling, schema, overlap).

Attributes

POLARS_DURATION_PATTERN

Functions

is_valid_polars_duration(→ bool)

Validate Polars duration syntax.

_apply_query(…)

_combine_queries(→ pdstools.utils.types.QUERY)

_polars_capitalize(→ pdstools.utils.cdh_utils._common.F)

_extract_keys(→ pdstools.utils.cdh_utils._common.F)

Extracts keys out of the pyName column

weighted_average_polars(→ polars.Expr)

weighted_performance_polars(→ polars.Expr)

Polars function to return a weighted performance

overlap_matrix(→ polars.DataFrame)

Calculate the overlap of a list element with all other list elements returning a full matrix.

overlap_lists_polars(→ polars.Series)

Calculate the average overlap ratio of each list element with all other list elements into a single Series.

lazy_sample(→ pdstools.utils.cdh_utils._common.F)

_apply_schema_types(→ pdstools.utils.cdh_utils._common.F)

Convert the data types of columns in a DataFrame to the desired types.

Module Contents

POLARS_DURATION_PATTERN
is_valid_polars_duration(value: str, max_length: int = 30) bool

Validate Polars duration syntax.

Checks if a string is a valid Polars duration (e.g., “1d”, “1w”, “1mo”, “1h30m”). Used to validate user input before passing to Polars methods like dt.truncate() or group_by_dynamic().

Parameters:
  • value (str) – The duration string to validate.

  • max_length (int, default 30) – Maximum allowed string length (prevents excessive input).

Returns:

True if the string is a valid Polars duration, False otherwise.

Return type:

bool

Examples

>>> is_valid_polars_duration("1d")
True
>>> is_valid_polars_duration("1w")
True
>>> is_valid_polars_duration("1h30m")
True
>>> is_valid_polars_duration("invalid")
False
>>> is_valid_polars_duration("")
False
_apply_query(df: polars.LazyFrame, query: pdstools.utils.types.QUERY | None = None, allow_empty: bool = False) polars.LazyFrame
_apply_query(df: polars.DataFrame, query: pdstools.utils.types.QUERY | None = None, allow_empty: bool = False) polars.DataFrame
_combine_queries(existing_query: pdstools.utils.types.QUERY, new_query: polars.Expr) pdstools.utils.types.QUERY
Parameters:
  • existing_query (pdstools.utils.types.QUERY)

  • new_query (polars.Expr)

Return type:

pdstools.utils.types.QUERY

_polars_capitalize(df: pdstools.utils.cdh_utils._common.F, extra_endwords: collections.abc.Iterable[str] | None = None) pdstools.utils.cdh_utils._common.F
Parameters:
  • df (pdstools.utils.cdh_utils._common.F)

  • extra_endwords (collections.abc.Iterable[str] | None)

Return type:

pdstools.utils.cdh_utils._common.F

_extract_keys(df: pdstools.utils.cdh_utils._common.F, key: str = 'Name', capitalize: bool = True) pdstools.utils.cdh_utils._common.F

Extracts keys out of the pyName column

This is not a lazy operation, as we don't know the possible keys in advance. For that reason, we select only the key column, extract the keys from that, and collect the resulting dataframe. This dataframe is then joined back to the original dataframe.

This is relatively efficient, but we still need the whole pyName column in memory, so it won't work completely lazily from e.g. S3. That is why it only works in eager mode.

The data in the column from which the JSON is extracted is normalized slightly by removing non-space, non-printable characters (not just ASCII). This may be relatively expensive.

JSON extraction happens only on the unique values, which saves a lot of time when there are multiple snapshots of the same models. It also processes only rows for which the key column appears to be valid JSON; it will break if you "trick" it with malformed JSON.

Column values for columns that are also encoded in the key column will be overwritten with values from the key column, but only for rows that are JSON. In previous versions, all values were overwritten, resulting in many nulls.

Parameters:
  • df (pl.DataFrame | pl.LazyFrame) – The dataframe to extract the keys from

  • key (str) – The column with embedded JSON

  • capitalize (bool) – If True (default) normalizes the names of the embedded columns otherwise keeps the names as-is.

Return type:

pdstools.utils.cdh_utils._common.F

weighted_average_polars(vals: str | polars.Expr, weights: str | polars.Expr) polars.Expr
Parameters:
  • vals (str | polars.Expr)

  • weights (str | polars.Expr)

Return type:

polars.Expr

weighted_performance_polars(vals: str | polars.Expr = 'Performance', weights: str | polars.Expr = 'ResponseCount') polars.Expr

Polars function to return a weighted performance

Parameters:
  • vals (str | polars.Expr)

  • weights (str | polars.Expr)

Return type:

polars.Expr

overlap_matrix(df: polars.DataFrame, list_col: str, by: str, show_fraction: bool = True) polars.DataFrame

Calculate the overlap of a list element with all other list elements returning a full matrix.

For each list in the specified column, this function calculates the overlap ratio (intersection size divided by the original list size) with every other list in the column, including itself. The result is a matrix where each row represents the overlap ratios for one list with all others.

Parameters:
  • df (pl.DataFrame) – The Polars DataFrame containing the list column and grouping column.

  • list_col (str) – The name of the column containing the lists. Each element in this column should be a list.

  • by (str) – The name of the column to use for grouping and labeling the rows in the result matrix.

  • show_fraction (bool)

Returns:

A DataFrame where:

  • Each row represents the overlap ratios for one list with all others

  • Each column (except the last) represents the overlap ratio with a specific list

  • Column names are formatted as "Overlap_{list_col_name}_{by}"

  • The last column contains the original values from the 'by' column

Return type:

pl.DataFrame

Examples

>>> import polars as pl
>>> df = pl.DataFrame({
...     "Channel": ["Mobile", "Web", "Email"],
...     "Actions": [
...         [1, 2, 3],
...         [2, 3, 4, 6],
...         [3, 5, 7, 8]
...     ]
... })
>>> overlap_matrix(df, "Actions", "Channel")
shape: (3, 4)
┌────────────────────────┬─────────────────────┬───────────────────────┬─────────┐
│ Overlap_Actions_Mobile │ Overlap_Actions_Web │ Overlap_Actions_Email │ Channel │
│ ---                    │ ---                 │ ---                   │ ---     │
│ f64                    │ f64                 │ f64                   │ str     │
╞════════════════════════╪═════════════════════╪═══════════════════════╪═════════╡
│ 1.0                    │ 0.6666667           │ 0.3333333             │ Mobile  │
│ 0.5                    │ 1.0                 │ 0.25                  │ Web     │
│ 0.25                   │ 0.25                │ 1.0                   │ Email   │
└────────────────────────┴─────────────────────┴───────────────────────┴─────────┘
overlap_lists_polars(col: polars.Series) polars.Series

Calculate the average overlap ratio of each list element with all other list elements into a single Series.

For each list in the input Series, this function calculates the average overlap (intersection) with all other lists, normalized by the size of the original list. The overlap ratio represents how much each list has in common with all other lists on average.

Parameters:

col (pl.Series) – A Polars Series where each element is a list. The function will calculate the overlap between each list and all other lists in the Series.

Returns:

A Polars Series of float values representing the average overlap ratio for each list. Each value is calculated as: (sum of intersection sizes with all other lists) / (number of other lists) / (size of original list)

Return type:

pl.Series

Examples

>>> import polars as pl
>>> data = pl.Series([
...     [1, 2, 3],
...     [2, 3, 4, 6],
...     [3, 5, 7, 8]
... ])
>>> overlap_lists_polars(data)
shape: (3,)
Series: '' [f64]
[
    0.5
    0.375
    0.25
]
>>> df = pl.DataFrame({"Channel" : ["Mobile", "Web", "Email"], "Actions" : pl.Series([
...     [1, 2, 3],
...     [2, 3, 4, 6],
...     [3, 5, 7, 8]
... ])})
>>> df.with_columns(pl.col("Actions").map_batches(overlap_lists_polars))
shape: (3, 2)
┌─────────┬─────────┐
│ Channel │ Actions │
│ ---     │ ---     │
│ str     │ f64     │
╞═════════╪═════════╡
│ Mobile  │ 0.5     │
│ Web     │ 0.375   │
│ Email   │ 0.25    │
└─────────┴─────────┘
lazy_sample(df: pdstools.utils.cdh_utils._common.F, n_rows: int, with_replacement: bool = True) pdstools.utils.cdh_utils._common.F
Parameters:
  • df (pdstools.utils.cdh_utils._common.F)

  • n_rows (int)

  • with_replacement (bool)

Return type:

pdstools.utils.cdh_utils._common.F

_apply_schema_types(df: pdstools.utils.cdh_utils._common.F, definition, **timestamp_opts) pdstools.utils.cdh_utils._common.F

This function converts the data types of columns in a DataFrame to the desired types. The desired types are defined in a PegaDefaultTables class.

Parameters:
  • df (pl.LazyFrame) – The DataFrame whose columns’ data types need to be converted.

  • definition (PegaDefaultTables) – A PegaDefaultTables object that contains the desired data types for the columns.

  • timestamp_opts (str) – Additional arguments for timestamp parsing.

Returns:

A list with polars expressions for casting data types.

Return type:

list